By Isabella Calabretta, Courtney Dalton, Richard Griscom, Marta Kołczyńska, Kristina Pahor de Maiti, Ruben Ros
Parliamentary debate transcripts hold a lot of information regarding the dynamics inside the parliament. Legislation is debated on parliamentary benches, resulting in rich discussions on various societal events and developments. By connecting the transcripts with political metadata, such as party affiliation, we are able to analyze how the members of parliament react to discussions pertaining to different kinds of events.
A recent event that had a massive impact across the world is the COVID-19 epidemic. In our work, we use transcripts from four parliaments (Italian, Polish, Slovene, and British) to analyze the differences between debates before and during the epidemic. As a data source, we use the ParlaMint data set (Erjavec et al. 2021), which contains parliamentary transcripts, metadata and linguistic annotations of the transcripts, such as named entities and lemmas.
Our research questions focused on the identification of differences and similarities in parliamentary debates on the COVID pandemic across Italy, Poland, Slovenia and the UK. To this end, we first analysed the country-specific data, and then compared the results across countries. We also mapped the COVID-related debates in time and compared them to the epidemiological situation in each country.
For each country, we asked:
- How do speeches on COVID differ from regular debates?
- Which topics arise in COVID debates? Which topics are shared between the countries and which are country-specific?
- Do the debates highlight any major shifts in topics or priorities over time?
- What is the frequency of COVID-related debates over time, and is there any connection between debates and COVID cases reported?
For identifying characteristic keywords and collocations we used NoSketchEngine, a free tool for exploring corpora. For each country’s parliamentary corpus, we selected only speeches by regular Members of Parliament and of those we created:
- the COVID subcorpus, which includes speeches since November 2019 (which was determined by the authors of the corpora as the beginning of the COVID period), and
- the reference subcorpus, with speeches from regular MPs before November 2019.
Using the “Word list” function, we created a list of the top 50 keywords for each language (lower case lemmas) that distinguish the COVID subcorpus from the reference subcorpus. These keywords are determined by calculating the keyness score, which compares the relative frequency of a word in the two corpora. The keywords are those whose relative frequency in one corpus is much higher than in the other.
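The keyword step can be illustrated with a minimal sketch. It assumes the "simple maths" keyness formula commonly associated with Sketch Engine tools: smoothed frequency per million in the focus corpus divided by the same quantity in the reference corpus. The function names, the smoothing value of 1 and the toy tokens are illustrative assumptions, not the settings we used in NoSketchEngine.

```python
from collections import Counter


def keyness_scores(focus_tokens, ref_tokens, smoothing=1.0):
    """Score every word of the focus corpus against the reference corpus.

    Assumed formula ("simple maths"): (fpm_focus + n) / (fpm_ref + n),
    where fpm is the frequency per million tokens and n is a smoothing constant.
    """
    if not focus_tokens or not ref_tokens:
        raise ValueError("both corpora must contain at least one token")
    focus_counts = Counter(focus_tokens)
    ref_counts = Counter(ref_tokens)
    scores = {}
    for word, count in focus_counts.items():
        fpm_focus = count / len(focus_tokens) * 1_000_000
        fpm_ref = ref_counts[word] / len(ref_tokens) * 1_000_000
        scores[word] = (fpm_focus + smoothing) / (fpm_ref + smoothing)
    return scores


def top_keywords(focus_tokens, ref_tokens, n=50, smoothing=1.0):
    """Return the n words with the highest keyness score."""
    scores = keyness_scores(focus_tokens, ref_tokens, smoothing)
    return sorted(scores, key=scores.get, reverse=True)[:n]
```

With lemmatised token lists for the COVID and reference subcorpora, `top_keywords(covid_lemmas, reference_lemmas)` would return a list comparable in spirit to the top 50 described above. A larger smoothing value shifts the ranking towards more frequent words.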
To create lists of collocations, we used the Collocations functionality in NoSketchEngine, choosing lower case lemmas for the output. The results are by default sorted by the logDice score, which measures how strongly the two words of a collocation are associated.
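For readers unfamiliar with logDice, the sketch below uses the standard published formula, 14 + log2(2·f_xy / (f_x + f_y)), where f_xy is the co-occurrence count and f_x, f_y are the frequencies of the two words. The function name and the example counts are illustrative. This shows the measure itself, not NoSketchEngine's internal code.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """logDice association score for a word pair.

    f_xy: number of co-occurrences of the two words
    f_x, f_y: corpus frequencies of each word on its own
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0:
        raise ValueError("all frequencies must be positive")
    return 14 + math.log2(2 * f_xy / (f_x + f_y))
```

Two words that only ever occur together reach the theoretical maximum of 14. Each halving of the Dice coefficient lowers the score by one point, which makes scores comparable across corpora of different sizes.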
Collocation measurement can be used to go beyond the association score between two individual words. By taking the top collocates of top collocates, a network of terms can be established. Such a network visualises the relations between words and their centrality in the resulting network. Creating networks for specific time periods brings us closer to the narrative within the debates that mention specific terms. We created collocation networks based on the ‘seed term’ virus to investigate the monthly change.
The script takes the seed term and calculates its five strongest collocates. We increased the window from the default of 5 to 15 because this results in fewer syntagmatic and more paradigmatic relations. In other words, the relation between the seed term and the top collocates is more about meaning than syntax once we increase the window size. After selecting the top collocates, the top five collocates of each of those are calculated. This is then done again, resulting in three ‘layers’. The links between each term and its collocates then form the basis of the network. The number of times a word recurs as a source or target of these links forms the basis for measuring its centrality, visualised as node size in the network.
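The layered expansion can be sketched as follows. This is a simplified reconstruction under stated assumptions, not the script we used: `top_collocates` is a hypothetical callable that returns a ranked list of collocates for a word (for instance from a collocation query with a window of 15), and a term is expanded only the first time it is reached.

```python
from collections import Counter


def build_collocation_network(seed, top_collocates, layers=3, k=5):
    """Expand a seed term into a list of (source, target) edges.

    top_collocates: hypothetical callable mapping a word to its collocates,
    ordered from strongest to weakest association.
    """
    edges = []
    seen = {seed}
    frontier = [seed]
    for _ in range(layers):
        next_frontier = []
        for source in frontier:
            for target in top_collocates(source)[:k]:
                edges.append((source, target))
                if target not in seen:  # assumption: expand each term once
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return edges


def node_centrality(edges):
    """Count how often each word appears as a source or a target of an edge."""
    counts = Counter()
    for source, target in edges:
        counts[source] += 1
        counts[target] += 1
    return counts
```

The counts returned by `node_centrality` can be mapped directly to node sizes in whichever graph-drawing tool is used.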
For mapping keywords across countries, we manually selected the top 20 COVID-related keywords from each parliament and translated them into English. For each word, we checked the collocations to ensure it was indeed COVID-related. Then we used an English fastText word embedding model (Grave et al. 2018) to retrieve word vectors. Finally, we projected the vectors with t-SNE (van der Maaten and Hinton 2008), which optimizes a 2D projection so that similar embeddings are placed close together. With this we wanted to see which words occur in all parliaments and which are country-specific.
Finally, we plotted timelines of word frequencies (with ggplot2) using relative occurrences (number of mentions to number of all the words in a given speech). We added a curve reporting the number of COVID cases per country to observe the relation between COVID debates and the epidemiological situation.
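The counting behind these timelines can be sketched in a few lines. The plots themselves were made with ggplot2, so this sketch covers only the relative-frequency step. Aggregating by calendar month, the `(date, lemmas)` input format and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def relative_frequency(tokens, keywords):
    """Share of tokens in one speech that are pandemic-related keywords."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


def monthly_relative_frequency(speeches, keywords):
    """Aggregate keyword share per month.

    speeches: iterable of (date, tokens) pairs, with date as 'YYYY-MM-DD'
    and tokens as a list of lower-case lemmas (assumed input format).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for date, tokens in speeches:
        month = date[:7]  # 'YYYY-MM'
        hits[month] += sum(1 for token in tokens if token in keywords)
        totals[month] += len(tokens)
    return {m: hits[m] / totals[m] for m in sorted(totals) if totals[m]}
```

Splitting the input by government and opposition speakers before calling `monthly_relative_frequency` gives one series per group. Each series can then be plotted next to the case counts.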
Our team included researchers from a variety of backgrounds and we attempted to leverage each person’s strength and experience, as well as their language proficiency. We divided the tasks so that most researchers were associated with a single parliamentary corpus, but everyone contributed in unique ways to the research coordination and analysis. Marta Kołczyńska prepared frequency visualizations, Ruben Ros created collocation networks and experimented with time series and topic modelling, Courtney Dalton and Isabella Calabretta provided political and social context to the results, Kristina Pahor de Maiti helped with linguistic techniques, and Richard Griscom helped with visualizations and technical coordination.
Table 1 shows an overview of general statistics for full corpora. National corpora were sampled into two subcorpora, the COVID subcorpus and the Reference subcorpus. Both contained only the speeches from regular MPs, thus excluding the parliamentary speakers (chairpersons) and guests. The COVID subcorpus contained the speeches from the period related to the COVID spread, namely from November 2019 to late 2020 or early 2021.
We provide the country-specific details, such as the political background and division into subcorpora, in the following subsections.
| Dataset | Num. words | Num. speakers | Time span |
Italy has a bicameral parliament, with a total of 945 elected members plus a small number of unelected parlamentari. It is composed of the Chamber of Deputies (630 nationally elected deputati) and the Senate of the Republic (315 regionally elected senatori).
The ParlaMint-IT 2.0 corpus (Italian parliament) in NoSketchEngine spans March 2013 to November 2020, with a total of 26,571,966 words. The COVID subcorpus has 2,569,669 words from 3,707 speeches. The opposition leads in word count, but only by a small margin. The Lega-Salvini Premier-Partito Sardo d’Azione group is the most prolific, with 578,228 words.
The COVID subcorpus lists 10 parties and groups (a group is an alliance between a parliamentary party and a regional one): Lega-Salvini Premier-Partito Sardo d’Azione (group, right wing), Movimento 5 Stelle, Forza Italia Berlusconi Presidente-UDC (group, centre-right wing), Misto [mixed], Partito Democratico (centre-left wing), Fratelli d’Italia (right wing), Italia Viva – P.S.I. (centre), Forza Italia-Berlusconi Presidente (centre-right wing), Per le Autonomie (centre-right wing).
Poland has a bicameral parliament with the Sejm as the lower house (460 MPs) and the Senate as the upper house (100 senators). The most recent term of the parliament (9th of the Sejm and 10th of the Senate) started on 12 November 2019 after the elections held on 13 October 2019. The ruling party is Law and Justice (Prawo i Sprawiedliwość, PiS), which has a majority of 235 seats in the Sejm. The main opposition party, the Civic Coalition (Koalicja Obywatelska, KO), has 134 seats. In the Senate, PiS has 49 seats and KO has 43.
The Polish parliamentary corpus, ParlaMint-PL 2.0 (Polish parliament), covers the period from 16 November 2015 (the start of the 8th term of the Sejm and 9th term of the Senate) until 14 August 2020. The entire corpus includes over 33 million words in almost 100 thousand speeches by 943 speakers, 703 of whom are MPs. The COVID subcorpus includes 20,887 speeches and around 2 million words spoken by 477 MPs. The Reference subcorpus includes 88,280 speeches and around 16 million words.
In both the COVID and Reference subcorpora, the main opposition party spoke more than the ruling party, at least in terms of the number of words. In the COVID subcorpus, MPs of the main opposition party, KO, spoke much more than MPs of the ruling party, PiS (843,869 words excluding punctuation vs. 569,570 words, respectively). In the Reference subcorpus, KO’s MPs also spoke more than PiS MPs, but by a smaller margin (3,929,799 vs. 3,790,700 words).
The Slovene parliament is (incompletely) bicameral and consists of the National Assembly (lower house) and the National Council (upper house). The National Assembly has 90 members, 2 of whom represent the Hungarian- and Italian-speaking ethnic minorities in Slovenia. Currently, the 8th National Assembly is in session and is composed of 9 political parties, a group of independent members and 2 representatives of the ethnic minorities.
The ParlaMint-SI 2.0 corpus contains 19,933,836 words from 353 speakers. The COVID subcorpus covers 8 months (November 2019 to July 2020), while the Reference subcorpus covers 5.2 years (August 2014 to October 2019). The pre-COVID subcorpus consists of approx. 7 million words, while the COVID subcorpus is roughly 5 times smaller, with over 1.5 million tokens. On average, the speeches from the COVID and pre-COVID periods were of similar length (approx. 585 tokens per speech), but in the COVID period MPs delivered almost twice as many speeches per month as in the pre-COVID period.
The British parliament, officially the Parliament of the United Kingdom, is a bicameral legislature. The upper house is the House of Lords, while the lower house is the House of Commons. At present, there are 1,441 seats across both chambers, with 791 members in the House of Lords (including 26 Lords Spiritual, bishops who have no party affiliation) and 650 members in the House of Commons. Currently, the government is formed by the Conservative Party, which holds 33.9% of seats in the House of Lords and 56.2% of seats in the House of Commons. The main opposition consists of the Labour Party, which holds 22.9% of seats in the House of Lords and 30.5% of seats in the House of Commons.
In total, the ParlaMint-GB 2.0 corpus contains about 17.4 million words and 112,017 speeches. Of these, about 8.7 million words and 52,060 speeches are from the COVID subcorpus, which runs from November 2019 to January 2021, and another 8.7 million words and 59,957 speeches are from the reference subcorpus, which runs from January 2019 to November 2019. The majority Conservative party produced the greatest number of speeches in both the COVID and reference subcorpora.
Keywords and collocations
Given the force with which the pandemic swept through the countries, it is not surprising that the datasets exhibit high similarity when looking at the top 20 COVID-related keywords with respect to the pre-COVID period for each country. The Figure below shows the semantic clusters (labeled manually) based on the keywords. Broadly speaking, we can distinguish two different concerns: the pandemic itself and its consequences (section on the left), and reaction to the pandemic and adoption of mitigation measures (section on the right).
To analyse the characteristics of datasets further, we took a closer look at the top 50 keywords – 20 of which are represented in Table 2. We observed that, for all countries, the majority of keywords are COVID-related, while the others indicate other prominent subjects that were discussed in the parliament during this period (legislation related to defense and justice, infrastructure, voting system, foreign affairs, etc.).
Yellow = found in 2 countries. Blue = found in 3 countries. Green = found in all 4 countries.
Several keywords, mainly in relation to the pandemic, are shared among most of the countries (e.g. pandemic, covid (including covid-19), coronavirus, and virus). Not surprisingly, there is also a strong overlap among the keywords pertaining to mitigation measures, for example quarantine and mask (for most of the countries), and keywords such as lockdown, distancing, epidemic, ventilator and voucher (overlap for two countries).
Similar patterns emerge from the most prominent collocations of these keywords, as can be seen in the table below. On the one hand, all parliaments focus on the pandemic as a phenomenon and its consequences (e.g. outbreak, crisis, death, cause, time, infection, emergency, global, consequences, because, impact …), and on the other hand, they focus on the measures to be taken (e.g., mitigation, end, against, response, preparedness, handle, recovery, fund, reform, stability, guideline, reopen …).
Green = found in all 4 countries. Blue = found in 3 countries. Yellow = found in 2 countries.
It is somewhat revealing that several keywords and collocations (economy, liquidity, recession, economic, crisis, fund, voucher …) indicate that the economy, rather than some other policy area, was the main concern of all the parliaments under investigation. Furthermore, among the four countries, mentions of EU financial support appear only on the lists for the Italian and Slovene parliaments. Given the strong engagement of Brussels in the management of the COVID pandemic, this lack of EU-related mentions could, on a superficial reading, reflect the state of the relationship between the EU and each of the four countries.
Another common observation is that this kind of cross-country analysis should necessarily include close reading, since similar concepts often take different verbal forms. In our analysis, this is best seen in the following examples: the uncontrollable spread of a virus is in certain countries more often referred to as epidemic (even when it is clear that the situation became global; e.g. in Slovenia), while in other countries (e.g. in the UK) pandemic is used more often. Similarly, the virus itself is referred to in different ways, which makes the analysis more difficult: while in Slovenia the term corona is very frequent, it is less so in Italy (probably because this particular word is polysemous in Italian). Lastly, corona (legislative) act and corona package may be specific to the Slovene data set, but the concept is not: the same batch approach to legislation aimed at alleviating the negative consequences of the pandemic is referred to as Cura Italia in the Italian data set and as shield in the Polish data set. In addition, words such as wave, limiting, fight, control … which are, in Slovene and Polish, semantically related to the concepts of water/natural disaster and war, indicate the predominant way of conceptualizing the COVID situation in these two parliaments.
Finally, another common aspect across datasets is that certain measures sparked high polarization. In the Slovene data set we can, for example, find that prominent collocates of (tracking) application, vaccination, and quarantine consist of antonyms, such as obligatory:voluntary, control:freedom, scientific:thinking/believing. Similarly, the collocations from the Italian data set show some strong language revealing mutual accusation over the lack of wearing masks (negationists) and polarized opinions with regard to the use of the application (importance, failure). The same is true for the Polish data set, where pandemic collocates with, for example, fight, on the one hand, and alleged on the other hand; and where the emergency legislation is marked as anti-crisis by one side, and as leaky and so-called by the other side.
There are also some interesting country-specific keywords and collocations. The Italian data set is distinctive in its frequent mentions of chemical elements and health-related problems (e.g. frequent headaches), as well as a very specific measure represented by the keyword scooter (scooters were given to people so they could more easily reach vaccination centers). The Slovene keyword and collocation list stands out because of many proper names and words such as respirator, which all refer to the scandals related to the purchase of medical equipment. Then, there are strictly procedural words that appear on the Polish and British lists (e.g. unmute, applause), but this might simply be due to different transcription conventions in these parliaments in comparison to Italy and Slovenia. Finally, the top collocations on the UK list include disproportionate, unequal, black, ethnic, and minority, suggesting that discussions on the impact of COVID-19 often addressed disproportionate outcomes among Black people and other racial or ethnic minorities.
Collocation networks offer an insight into the relations between key terms in parliamentary debates. We used them to acquire a bird’s-eye view of the semantics of the speeches that use the seed term “virus” in the first months of the pandemic. Strikingly, the networks reveal commonalities between countries, which are structured around several overarching themes. The first theme that stands out in several languages is crisis response. The British network shows relations between, for example, nhs and testing. At this time, there was also gratitude and concern for those employed in the NHS, as seen in the collocates staff, worker, and nurse. In the Slovene network, the narrative of adopting measures [ukrepi] to restrict the spread [širjenje] of the virus in order to “secure life” shows a similar theme of crisis response. Besides the ad hoc measures that were being discussed in the parliaments, the networks also demonstrate the presence of a more forensic language, pertaining to the questions that surrounded the virus in the early months of 2020. The Italian network especially reflects this theme, with terms such as animal [animale], influenza [influenza], pathology [patologia] and bat [pipistrello]. Related questions on the mortality of the new virus also appear in the Italian, Polish and British networks.
The common themes that characterize the collocates across countries also feature distinct temporal features. The immediate crisis response rhetoric is mainly restricted to March 2020. In subsequent months, the networks show an intensification of this rhetoric in multiple ways. Words related to crisis and epidemic, for example, appear in April and May in the British network. Virus also becomes associated with spike and threat, reflecting the continuous increase in COVID-19 cases throughout April. In the Slovene network, epidemic [epidemija] is first mentioned in May’s collocation network. This is again strongly related to measures covering activities, touch, and speech, but the network also includes reason and extremist, which suggests the polarization of the public opinion. In Italy and Poland, there are also signs of the military language that started to appear in Europe, by means of words such as threat [minaccia] and weapon [arma]. Another common trait of the networks for April and May is the renewed focus on legislation. After the crisis management of the first weeks in March, discussions on specific measures such as lockdowns and quarantine (re)appear in all four countries. Early discussions about vaccines also make their appearance in these months. In the United Kingdom, vaccine collocates with summit and world. The former refers to the Gavi Global Vaccine Summit, hosted in the UK in June 2020. Both terms suggest a narrative in which the UK begins to turn its gaze outward to search for vaccine solutions in collaboration with other countries. Vaccine also collocates with treatment, perhaps because both treatments and vaccines offer a way out of the pandemic by helping to ameliorate the devastating effects of COVID-19 on a population.
In the UK in June, virus was associated with lockdown, restriction, wave, transmission, and control. At first glance, this seems perplexing, since June was a month of declining transmission in the UK. However, concordances suggest that these words often occurred in the context of debates over lifting or easing lockdowns and restrictions. It may seem counterintuitive that these words would appear at the end rather than at the beginning of a lockdown. However, recall that the initial lockdowns and restrictions were imposed in haste as an emergency measure. In contrast, easing the restrictions was a lengthier process that was accompanied by more discussion.
The networks also show words and clusters that relate to country-specific debates. In Italy, for example, June features terms such as antibody [anticorpo] and plasma [plasma]. In this month, Italy was in fact testing plasma therapy, and on 14 June, for the Giornata Mondiale del Donatore [World Donor Day], donors who had antibodies for SARS-CoV-2 donated plasma to be serologically tested. Another example is the theme of disinformation that features in the British network for May 2020.
Timeline of keywords
The frequencies of pandemia and epidemia in the Italian subcorpus show that the word pandemic was used less often during the first and second waves by both opposition and coalition. Epidemic has a local implication and was broadly used during the first wave, mostly by the opposition; over time, as the outlook became global, usage shifted from epidemic to pandemic.
In the Polish subcorpus, a comparison of the frequencies of the two most characteristic pandemic-related keywords, pandemia and epidemia, between MPs from the government (dark blue bars in the plots below) and the opposition (light blue) makes it clear that these words were used more often by opposition MPs throughout the period of the pandemic. The biggest government-opposition gaps occurred in the early months of the pandemic, between March and May, and narrowed in the later months.
In the Slovene subcorpus, we observed the activity of the opposition and coalition based on the two most frequent pandemic-related words (epidemic and pandemic). It appears that the coalition (light blue) was very active at the height of the first wave. However, given that it is actually the opposition (dark blue) that shows higher frequencies in March and July, we can assume that they were more active in drawing attention to epidemic-related issues (before the first as well as before the second wave), and that the government’s response was delayed.
In the UK, the relative frequencies of the terms pandemic and epidemic among MPs of opposition parties and those of the majority Conservative party indicate that opposition MPs tended to use pandemic more frequently than Conservative MPs until late 2020, whereas Conservative MPs tended to use epidemic more frequently than opposition MPs throughout 2020.
Comparison of timelines with COVID events
Of the four countries we analysed, the first coronavirus cases were found in Italy and in the United Kingdom, on 31 January 2020 (according to data from the Johns Hopkins COVID-19 Data Repository; Dong, Du, and Gardner 2020). In the British parliament, the first mentions of “coronavirus” on 22 January preceded first infections, and mentions of pandemic-related words increased before an uptick in infection numbers. In Italy the first debates of the coronavirus in the parliament coincided with the first surge of infections.
The first diagnosed cases of coronavirus infection in Poland and Slovenia were over a month later – on 4 and 5 March, respectively. In both countries mentions of pandemic-related keywords in the parliaments were over a week earlier than the first diagnosed cases. After the first wave of COVID cases, the number of mentions of pandemic-related words declined in all countries.
All four countries saw a second wave of COVID infections in the fall of 2020, but these increases in COVID cases were not always accompanied by proportional increases in the mentions of pandemic-related words in parliamentary debates. The share of pandemic-related words increased around the time of the second wave in Italy and Poland, but there was no clear increase in Slovenia and the UK. The reactions to the second wave are hard to compare across countries because of the differences in the coverage by parliamentary datasets.
Although parliamentary data is a rich source for textual analysis, it also comes with characteristic challenges. First, certain issues, though significant on a national scale, may not be discussed in parliament. For example, at the beginning of the pandemic, many emergency restrictions may have been enacted without going through the legislative process, e.g. via executive orders. A proper analysis of these data, then, requires knowing the scope of parliamentary duties in each country. Additionally, parliaments generally go through periods of recess in which members do not meet and no discussion takes place. Because there are no data during those periods, there can be no analysis; whatever issues may have been of country-wide import during those days or weeks are not reflected in our interpretations. Finally, we know that transcriptions of parliamentary proceedings do not always perfectly match what was really said, as transcribers may omit noises of hesitation or otherwise edit the speech of MPs, often unconsciously (cf. Fišer and Pahor de Maiti 2020). Given the focus of our research, we do not believe that such changes would invalidate our analysis, but it is an issue to be aware of.
An initial research question about strategies to fight COVID could not be examined extensively. While an interdisciplinary approach helped us capture findings across languages and cultures and interpret them with different kinds of expertise, it also required tackling the corpora from different angles. Identifying strategy-related keywords for all corpora did not work in the end because of the rich variety of terms and the diverse approaches across countries. An exploratory method in fact provided more insights than selecting a predefined set of keywords in advance.
Plan for future research
A possible direction for further research would be to check whether party affiliation can be predicted from the keywords. Based on this, we could compare whether it is easier to predict party affiliation before or during the COVID period and see whether COVID united the speakers or divided them. We have experimented with this already, but the preliminary results have shown that models often learn to “cheat”. Concretely, they learn that a speaker who mentions a party is very likely to be from that party themselves. The reason is that speakers often talk about what their political party thinks about a certain problem rather than what they as individuals think about it. For example, a member of SMC might say “The members of SMC feel that this issue is very important”, rather than “I feel this issue is very important”. Reworking the modeling problem so that this shortcut is not available would require additional language-specific keyword filtering, which we did not have time to perform.
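One simple form such filtering could take is to mask party names before training a classifier. The sketch below is hypothetical and was not part of our experiments. The function name, the `PARTY` placeholder and the exact-match strategy are assumptions, and exact matching will miss inflected forms of party names in Polish or Slovene, which is exactly why language-specific work would be needed.

```python
import re


def mask_party_mentions(text, party_names, placeholder="PARTY"):
    """Replace literal mentions of party names with a neutral placeholder.

    party_names: list of names or abbreviations, supplied per country.
    """
    if not party_names:
        return text
    # Match longer names first so that multi-word names win over their parts.
    names = sorted(party_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)
```

Applied to the example above, `mask_party_mentions("The members of SMC feel that this issue is very important", ["SMC"])` yields a sentence in which the party name no longer reveals the speaker's affiliation.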
Besides classifying parties, we could also explore party- and speaker-specific words in detail with word enrichment, which exposes statistically significant words in subcorpora. Preliminary results confirmed that this is an interesting research direction, which could be fruitfully explored in several dimensions. For example, one could compare a speaker’s or party’s word frequencies before and during COVID. Alternatively, one could compare word frequencies for one party versus all other parties during COVID.
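As an illustration of what word enrichment could compute, the sketch below assumes a one-sided hypergeometric test on document counts. This is a common choice for enrichment analysis but an assumption here, since the statistic is not specified above. The function and parameter names are illustrative.

```python
from math import comb


def enrichment_p_value(total_docs, docs_with_word, subset_size, subset_with_word):
    """Probability of seeing at least `subset_with_word` documents containing
    the word in a random subset of `subset_size` documents, given that
    `docs_with_word` of all `total_docs` documents contain it.
    """
    upper = min(docs_with_word, subset_size)
    favourable = sum(
        comb(docs_with_word, i) * comb(total_docs - docs_with_word, subset_size - i)
        for i in range(subset_with_word, upper + 1)
    )
    return favourable / comb(total_docs, subset_size)
```

A small p-value for a word in one party's speeches, measured against all speeches from the COVID period, would flag that word as characteristic of the party. In practice, a correction for multiple comparisons would also be needed when many words are tested at once.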
Dong, E., Du, H., & Gardner, L. (2020). An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases, 20(5), 533–534. https://doi.org/10.1016/s1473-3099(20)30120-1
Erjavec, T., et al. (2021). Multilingual comparable corpora of parliamentary debates ParlaMint 2.0. Slovenian language resource repository CLARIN.SI. http://hdl.handle.net/11356/1388
Fišer, D., & Pahor de Maiti, K. (2020). Voices of the Parliament. Modern Languages Open, (1). http://doi.org/10.3828/mlo.v0i0.295
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning word vectors for 157 languages. arXiv. https://arxiv.org/abs/1802.06893
Jänicke, S., Franzini, G., Cheema, M. F., & Scheuermann, G. (2015). On close and distant reading in digital humanities: A survey and future challenges. In EuroVis (STARs) (pp. 83-103).
van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9. https://jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf