Parliamentary Corpora

Parliamentary corpora are a multidisciplinary language resource that can be approached from many research perspectives, including political science, sociology, history, psychology, and applied approaches to linguistics such as critical discourse analysis. The wide availability of parliamentary proceedings in digitized form, together with the right of access to public information in EU countries, has motivated a number of national and international initiatives to compile, process and analyse parliamentary corpora.

The CLARIN ERIC infrastructure offers access to 35 parliamentary corpora, covering almost all of the languages spoken in countries that are either members or observers of CLARIN. In the vast majority of cases, the corpora can be downloaded directly from the national repositories or queried through easy-to-use online search environments. They are also richly tagged and mostly available under public licences.

Below we first provide overviews of the corpora that are already part of the CLARIN infrastructure and then list those that have not yet been integrated.

Note that in 2020 a project was launched focussing on parliamentary debates on the COVID-19 outbreak and the policy measures in response to it under the name of ParlaMint. More details can be found below and on the project page for ParlaMint.

For comments, changes to the existing content or inclusion of new corpora, send us an email at resource-families [at] clarin.eu.

Parliamentary Corpora in the CLARIN Infrastructure

Corpus	Language	Description	Availability
Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 3.0 Size: 1.1 billion words Annotation: tokenised, MSD-tagged (Universal Dependencies), syntactically parsed (Universal Dependencies), named entities Licence: CC BY 4.0	Bosnian, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, French, Galician, German, Hungarian, Icelandic, Italian, Latvian, Modern Greek (1453-), Norwegian, Polish, Portuguese, Russian, Serbian, Slovenian, Spanish, Swedish, Turkish, Ukrainian	This corpus comprises the linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0, machine-translated into English, with the translation itself linguistically annotated. Except for the translation into English, small changes in the metadata and the absence of the British parliament corpus, the corpora included in this entry are in all respects identical to the source language corpora, i.e. the entry comprises the same 26 European parliamentary corpora, together amounting to over 1.1 billion words. The translation into English was done with EasyNMT with OPUS-MT models. Machine translation was done on the sentence level, and includes both speeches and transcriber notes, including headings. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation, was done with Stanza, using the English language model. For NER the conll03 model with 4 NE classes was used. The corpus is available for download from the CLARIN.SI repository and for browsing through the concordancers noSketchEngine and KonText.	Concordancer (noSketchEngine) (KonText) Download
Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 Size: 7.5 million utterances, 1.1 billion words Annotation: tokenised, MSD-tagged (Universal Dependencies), syntactically parsed (Universal Dependencies), named entities Licence: CC BY 4.0	Bosnian, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, French, Galician, German, Hungarian, Icelandic, Italian, Latvian, Modern Greek (1453-), Norwegian, Polish, Portuguese, Russian, Serbian, Slovenian, Spanish, Swedish, Turkish, Ukrainian	ParlaMint is a multilingual set of comparable corpora containing parliamentary debates mostly starting at the end of 2015 and extending to mid 2022, with each corpus being between 9 and 125 million words in size. The sessions in the corpora are marked as belonging to the COVID-19 period (after October 2019), the pre-Covid period or the period after 24 February 2022. The corpora have extensive meta-data about the speakers (name, gender, party affiliation, MP status), are structured into time-stamped terms, sessions and meetings, with each speech being marked by its speaker and their role (chair, regular speaker). The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpus is available for download from the CLARIN.SI repository and through the concordancer noSketch Engine. Note that the version of the corpus without linguistic mark-up is available for download under a separate CLARIN.SI entry.	Concordancer Download
The sentiment corpus of parliamentary debates ParlaSent-BCS v1.0 Size: 2600 sentences Annotation: sentiment analysis Licence: CC BY-SA 4.0	Bosnian, Croatian, Serbian	This corpus consists of mid-length sentences from the Bosnian, Croatian, and Serbian parliamentary proceedings that are annotated with a 6-level sentiment schema. The date of the speech and the speaker name are given as well. If the speaker is an MP, information on party, gender and year of birth is also available. The corpus is available for download from the CLARIN.SI repository.	Download
Croatian parliamentary corpus ParlaMeter-hr9 1.0 Size: 14.1 million tokens Annotation: tokenised, MSD-tagged, lemmatised, named entities Licence: CC-BY	Croatian	The corpus contains minutes of the National Assembly of the Republic of Croatia and currently covers its VIth mandate from 15 November 2016 to 21 November 2018. The corpus contains speaker metadata (gender, age, education, party affiliation). The corpus is available for download from the CLARIN.SI repository and through the concordancers KonText and noSketchEngine, as well as through a dedicated webpage.	Concordancer Download
Parliamentary corpus of first Yugoslavia (1919-1939) yu1Parl 1.0 Size: 34,542 utterances; 578,958 sentences; 13,271,885 words; 15,403 pages Annotation: tokenised, MSD-tagged, lemmatised Licence: CC BY 4.0	Croatian, Serbian, Slovenian	This historical parliamentary corpus contains meeting proceedings of the National Representation of the Kingdom of Yugoslavia from 1919 to 1939. The corpus comprises 714 sessions. The source data (scanned images of printed Stenographic Minutes) come from the History of Slovenia - SIstory portal. The images were OCR processed and the results saved as pdf, docx and txt. The documents are multilingual, in Serbo-Croatian and Slovenian, depending on the speaker. Serbo-Croatian is typeset in the Cyrillic (Serbian) or in the Latin (Croatian) alphabet. The documents were automatically processed and the following data extracted: titles, agenda, attending, start and end of the session, speakers, and comments. Lingua was used for language detection on the sentence level. Roughly 59% of sentences are in Serbian (Cyrillic script), 38% in Croatian (Latin script) and 3% in Slovenian. Some sentences in German and French were also detected. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using CLASSLA for Serbian, Croatian and Slovenian. Words in Serbian (Cyrillic script) have lemmas in Latin script. The corpus is available for download from the CLARIN.SI repository as well as for online browsing through the noSketch Engine and KonText concordancers.	Concordancer (noSketch) Concordancer (KonText) Download
Czech Parliamentary Meetings Size: 88 hours, 0.5 million tokens Annotation: error correction of transcriptions, division into speech sections with speaker information Licence: CC-BY	Czech	The corpus contains recordings of the parliamentary sessions as well as corresponding transcriptions. The corpus is available for download from LINDAT and through the concordancer KonText.	Concordancer Download
Large Corpus of Czech Parliament Plenary Hearings Size: 444 hours Licence: CC BY 4.0	Czech	This corpus contains audio recordings of Czech parliamentary sessions along with the corresponding transcriptions. The whole corpus has been segmented into short audio snippets, making it suitable for both training and evaluation of automatic speech recognition (ASR) systems. The corpus is available for download from the LINDAT repository.	Download
The Danish Parliament Corpus 2009 - 2017, v2 Size: 40.6 million words Annotation: no linguistic annotation Licence: CC-BY	Danish	The corpus contains Danish parliamentary debates from 2009 to 2017. The corpus is available for download from the DK-CLARIN repository.	Download
Hansard corpus Size: 1.6 billion tokens Annotation: tokenised, PoS-tagged, lemmatised, semantic tagging	English	The corpus contains British parliamentary debates from 1803 to 2005. It is semantically tagged with the USAS semantic tagger and the Historical Thesaurus Semantic Tagger (HTST). The corpus is available through a dedicated concordancer. For the relevant publication, see Rayson et al. (2015)	Concordancer
Parliamentary Debates on Europe at the House of Commons (1998-2015) Size: 190,000 tokens Annotation: contextual and speaker metadata Licence: CC-BY	English	The corpus contains British parliamentary debates from 1998 to 2015. The contextual metadata in the corpus concern the dates of the council meetings, the description of the main topic(s) of the European council meeting, the place where the European Council meeting took place; they also correspond to information about the government and the legislative session. The speaker metadata correspond to name, gender, occupation, parliamentary group, political orientation and the opposition and majority division. The corpus is available for download from Ortolang. For the relevant publication, see Truan and Romary (2021)	Download
EPIC-UdS Size: 350,000 tokens, 20,000 sentences Annotation: tokenised, PoS-tagged, syntactically parsed, speech phenomena Licence: CC BY-NC-SA 4.0	English, German, Spanish	This is a parallel and comparable corpus of speeches held in the European Parliament; the corpus follows the European Parliament Interpreting Corpora tradition of the EPIC and EPICG corpora. It contains original speeches from 2008 to 2013 by English, German, and Spanish native speakers and their interpretation (English to and from German; Spanish to English). All transcripts in the corpus are based on videos of the European Parliament Proceedings published by the European Parliament. Annotation includes typical characteristics of spoken language such as false starts, hesitations and truncated words. To obtain better results for source-target alignment as well as sentence parsing the transcripts were segmented using a main clause approach: compound sentences were segmented separately. For the second version of the corpus, the transcripts were processed clause by clause with the spaCy tools; the data is encoded in CoNLL-U and provides universal PoS tags, fine-grained language-specific PoS tags as well as Universal Dependency syntactic relations. All data was enriched with relevant metadata such as source language, name of original speaker, speech timing, mode of delivery and delivery rate. The corpus is available for download from CLARIN-D (Saarland University B-centre). For the relevant publication, see Przybyl et al. (2022)	Download
Transcripts of Riigikogu (Estonian Parliament) Size: 13 million tokens Annotation: tokenised Licence: CLARIN_ACA	Estonian	The corpus contains Estonian parliamentary debates from 1995 to 2001. The corpus is available for download from a dedicated webpage and through a concordancer on the same webpage.	Concordancer Download
Aalto Finnish Parliament ASR Corpus 2008-2020 Size: 119.3 million words, 3,130 hours of recordings Licence: CLARIN PUB	Finnish	This corpus, which consists of both audio recordings and transcriptions, is extracted from the Finnish parliamentary plenary session transcripts and videos by the Aalto Speech Recognition group. The original session transcripts and videos are available on the websites of the Parliament of Finland (see here and here). The corpus is split into three parts: the 2015–2020 set, the 2008–2016 set, and the development and test sets. The corpus is available for download from the Language Bank of Finland.	Download
Plenary Sessions of the Parliament of Finland Size: 22.4 million tokens Annotation: tokenised, MSD-tagged, lemmatised, syntactically parsed Licence: CC-BY	Finnish	The corpus contains Finnish parliamentary debates from 2008 to 2016. The corpus is available through the concordancer Korp.	Concordancer
Parliamentary Debates on Europe at the Assemblée nationale (2002-2012) Size: 137,000 tokens Annotation: contextual and speaker metadata Licence: CC-BY	French	The corpus contains French parliamentary debates from 2002 to 2012. The contextual metadata in the corpus concern the dates of the council meetings, the description of the main topic(s) of the European council meeting, the place where the European Council meeting took place; they also correspond to information about the government and the legislative session. The speaker metadata correspond to name, gender, occupation, parliamentary group, political orientation and the opposition and majority division. The corpus is available for download from Ortolang. For the relevant publication, see Truan and Romary (2021)	Download
German Political Speeches Corpus Size: 15,240 speeches, 27 million texts Licence: CC BY-SA 4.0	German	The corpus contains speeches by 200 important political figures for the period between 1982 and 2020. A large part of the corpus contains speeches by the holders of the four highest German state offices: the Federal President, the Federal Chancellor, the President of the Bundestag and Foreign Ministers with terms of office between 1982 and 2020. The corpus is available for online browsing through the DWDS platform, and a subset of 6,685 speeches up to 2019, encoded in XML, can be downloaded. For the relevant publication, see Barbaresi (2018)	Concordancer Download
Parliamentary Debates on Europe at the Bundestag (1998-2015) Size: 417,000 tokens Annotation: contextual and speaker metadata Licence: CC-BY	German	The corpus contains German parliamentary debates from 1998 to 2015. The contextual metadata in the corpus concern the dates of the council meetings, the description of the main topic(s) of the European council meeting, the place where the European Council meeting took place; they also correspond to information about the government and the legislative session. The speaker metadata correspond to name, gender, occupation, parliamentary group, political orientation and the opposition and majority division. The corpus is available for download from Ortolang. For the relevant publication, see Truan and Romary (2021)	Download
ParlAT beta Size: 75.2 million tokens Annotation: tokenised, linked data (e.g., speaker information)	German (Austrian)	This corpus contains Austrian parliamentary proceedings from 1996 to 2017. Currently in development, ParlAT is planned to be a monitor corpus with new material added over time. For the relevant publication, see Wissik and Pirker (2018)
Carniolan Provincial Assembly corpus Kranjska 1.0 Size: 10.9 million words Annotation: tokenised, MSD-tagged, lemmatised Licence: CC-BY 4.0	German, Slovenian	The corpus contains meeting proceedings of 694 sessions of the Carniolan Provincial Assembly from 1861 to 1913. The source data (scanned and OCR processed pdf documents) originally come from The Digital Library of Slovenia dLib.si and History of Slovenia - SIstory portals. The documents are bilingual, in Slovenian and German, depending on the speaker. German was first typeset in the Gothic script and later on in Latin. The documents were automatically processed and the following data extracted: titles, agenda, attending, start and end of the session, speakers, and comments. Language was detected on the sentence level with Lingua; roughly 58% of sentences are in Slovenian and 42% in German. Linguistic annotation (tokenisation, MSD tagging and lemmatisation) was added using Trankit for Slovenian and German. The documents are in the Parla-CLARIN compliant XML format, with each session in one file. For the relevant publication, see Marolt et al. (2023)	Download
Hellenic Parliament Minutes (1989-1994, 1997-2018) Size: 181 million words Licence: CC-BY-NC	Greek	The corpus contains Greek parliamentary debates for two periods: 1989-1994 and 1997-2018. The corpus is available for download from the CLARIN:el repository.	Download
Speeches of Politicians in the Greek Parliament Size: 258,036 words Licence: CC-BY-NC	Greek	This corpus contains speeches delivered by 5 members of parliament: Dimitris Anagnostakis, Nikos Tsoukalis, Paros Koukoulopoulos, Niki Founta, and Panayiotis Kammenos. The corpus is available for download from the CLARIN:el repository.	Download
European Parliament Proceedings Parallel Corpus 1996-2011, parallel corpus Greek-English Size: 31.9 million words (English), 1.2 million sentences (Greek) Annotation: sentence aligned Licence: CC ZERO	Greek-English	This corpus is a bilingual Greek-English subset of the Europarl parallel corpus. The corpus is available for download from the CLARIN:el repository.	Download
The Icelandic Parliamentary Corpus Size: 238 million tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CC-BY 4.0	Icelandic	This corpus contains debates in the Icelandic parliament (Alþingi) from 1911 to 2017. The corpus is available for download from CLARIN-IS (as a part of the Icelandic Gigaword Corpus) and for search through the concordancer Korp. For the relevant publication, see Steingrímsson et al. (2018)	Concordancer Download
Corpus of the Saeima (the Parliament of Latvia) Size: 21 million words Annotation: tokenised, MSD-tagged, lemmatised, syntactically parsed, named entities	Latvian	This corpus contains parliamentary debates from seven parliamentary terms (5th–12th Saeima) covering years 1993–2017. The available metadata for each utterance includes the date and type of the parliamentary session and speakers’ names and affiliations. The corpus is available for online browsing through the noSketch Engine (CLARIN-LV) concordancer. For the relevant publication, see Darģis et al. (2018)	Concordancer
Lithuanian Parliament Corpus for Authorship Attribution Size: 23.9 million tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CLARIN PUB	Lithuanian	The corpus contains Lithuanian parliamentary debates from 1990 to 2013. It is annotated with Lemuoklis (morphological analyzer for lemmatization) and MaltParser (generation of dependency tags). The corpus is available for download from the repository of CLARIN-LT.	Download
Norwegian Parliamentary Speech Corpus Size: 140 hours; 65,000 sentences; 1.2 million words Licence: CC-ZERO	Norwegian	This corpus consists of audio recordings of meetings in Stortinget (the Norwegian parliament), and corresponding orthographic transcriptions in either Norwegian Bokmål or Norwegian Nynorsk, as well as various metadata about the speakers. The official proceedings from the meetings are also included in the corpus for reference. Transcription was first done automatically; subsequently, the output of the automatic process was manually checked and corrected by trained linguists and philologists. Finally, all transcriptions were proofread to ensure consistency and accuracy. The audio files in the corpus contain the speech of entire days of plenary meetings from 2017 and 2018 (or, if a meeting lasts more than six hours, the first six hours of a day). The corpus is available for download from the Norwegian Language Bank. For the relevant publication, see Solberg and Ortiz (2022)	Download
Proceedings of Norwegian Parliamentary Debates Size: 29 million tokens Annotation: tokenised, sentence segmentation, speaker metadata (name, party, time, type of utterance) Licence: NLOD	Norwegian	The corpus contains Norwegian parliamentary debates from 2008 to 2015. The corpus is available through the concordancer Corpuscle.	Concordancer
Talk of Norway Size: 63.8 million tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: NLOD	Norwegian	The corpus contains Norwegian parliamentary debates from 1998 to 2016. The corpus is available for download from the CLARINO repository. For the relevant publication, see Lapponi et al. (2018)	Download
Polish Parliamentary Corpus Size: 300 million tokens Annotation: tokenised, MSD-tagged, named entities, etc.	Polish	The corpus contains Polish parliamentary debates from 1991 to 2017. It is annotated with Morfeusz SGJP (morphological analyser), Pantera (disambiguating tagger), Spejd (shallow parser), Nerf (named entity recognizer). The corpus is available for download from a dedicated webpage and through the concordancer NKJP. For the relevant publication, see Ogrodniczuk (2018)	Concordancer Download
PTPARL Corpus Size: 1 million tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CLARIN RES	Portuguese	The corpus contains Portuguese parliamentary debates from 1970 to 2008. It is annotated with LX-Tokenizer, LX-Tagger, MBT, MBLEM (lemmatisation). The corpus is available for download from the CLARIN PORTUGAL repository. For the relevant publication, see Généreux et al. (2012)	Download
Slovenian parliamentary corpus ParlaMeter-sl 1.0 Size: 41 million tokens Annotation: tokenised, MSD-tagged, lemmatised, named entities Licence: CC-BY	Slovenian	The corpus contains minutes of the National Assembly of the Republic of Slovenia and currently covers the VIIth mandate from 1 August 2014 to 22 June 2018. The corpus contains speaker metadata (gender, age, education, party affiliation). The corpus is available for download from the CLARIN.SI repository and through the concordancers KonText and noSketchEngine, as well as through a dedicated webpage. For the relevant publication, see Ljubešić et al. (2018)	Concordancer Download
Slovenian parliamentary corpus siParl 3.0 Size: 213 million words Annotation: tokenised, PoS-tagged, lemmatised Licence: CC-BY	Slovenian	The corpus contains Slovenian parliamentary debates from 1990 to 2022. It differs from the SlovParl 2.0 corpus (listed below) in that it contains only basic meta-data about the speakers, a typology of sessions and structural and editorial annotations. The corpus is available for download from the CLARIN.SI repository and through the concordancers KonText and noSketchEngine.	Concordancer Download
Slovenian parliamentary corpus SlovParl 2.0 Size: 3.2 million tokens Annotation: tokenised, PoS-tagged, lemmatised Licence: CC-BY	Slovenian	The SlovParl corpus contains minutes of the Assembly of the Republic of Slovenia for the legislative period 1990-1992, i.e. it covers the period before, during, and after Slovenia became an independent country in 1991. The corpus comprises 232 sessions, 58,813 speeches and 10.8 million words. The corpus contains extensive meta-data about the speakers, a typology of sessions etc. and structural and editorial annotations. For the relevant publication, see Pančur and Šorn (2016)	Concordancer Download
Riksdag’s Open Data Size: 1.25 billion tokens Annotation: tokenised, lemmatised Licence: CC-BY	Swedish	The corpus contains Swedish parliamentary debates from 1971 to 2016. It is annotated with Sparv. The corpus is available for download from Språkbanken (all entries with "Riksdag's Open Data" in the subtitle) and through the concordancer Korp. For the relevant publication, see Borin et al. (2016)	Concordancer Download
Europarl: European Parliament Proceedings Parallel Corpus 1996-2011 Size: 33.7 million tokens Annotation: sentence-aligned Licence: CC0	21 languages	This corpus contains parliamentary debates from the European Parliament from 1996 to 2011. The corpus is available for download from a dedicated webpage.	Download
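
Several of the corpora listed above carry Universal Dependencies annotation, and the EPIC-UdS entry states that its data is encoded in CoNLL-U. As a rough illustration of what working with such a download might look like, the following is a minimal sketch in Python that reads CoNLL-U text and counts universal PoS tags. The sample sentences, the `FIELDS` tuple and the `read_conllu` helper are assumptions made for this illustration; they are not taken from any of the listed corpora or from their tooling.

```python
from collections import Counter

# The ten standard CoNLL-U columns, lower-cased for use as dict keys.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hypothetical two-sentence sample; not drawn from any corpus in the table.
SAMPLE = (
    "# sent_id = 1\n"
    "# text = We support the motion.\n"
    "1\tWe\twe\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tsupport\tsupport\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\tthe\tthe\tDET\tDT\t_\t4\tdet\t_\t_\n"
    "4\tmotion\tmotion\tNOUN\tNN\t_\t2\tobj\t_\t_\n"
    "5\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "# text = The motion passed.\n"
    "1\tThe\tthe\tDET\tDT\t_\t2\tdet\t_\t_\n"
    "2\tmotion\tmotion\tNOUN\tNN\t_\t3\tnsubj\t_\t_\n"
    "3\tpassed\tpass\tVERB\tVBD\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t.\t_\t3\tpunct\t_\t_\n"
)


def read_conllu(text):
    """Yield each sentence as a list of token dicts keyed by FIELDS."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as sent_id or text
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip malformed lines rather than guess their layout
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


sentences = list(read_conllu(SAMPLE))
upos_counts = Counter(tok["upos"] for sent in sentences for tok in sent)
print(len(sentences), "sentences,", sum(upos_counts.values()), "tokens")
```

To apply the same sketch to a downloaded file, the file contents would be read with UTF-8 encoding and passed to `read_conllu` in place of `SAMPLE`. Whether a given corpus is distributed as CoNLL-U, TEI XML or another format should be checked in its repository entry.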
