Datasets | Centre for Computational Linguistics, Psycholinguistics, and Sociolinguistics

From author profiling to classification of emotion

Various corpus resources have been developed at CLiPS. Upon request, many of these are available to a wider audience. The resources are listed below.

MFAQ

Description

MFAQ is a multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.

Languages

We collected around 6M pairs of questions and answers in 21 different languages.

Creator(s)

CLiPS Research Center, University of Antwerp, Belgium; Maxime De Bruyn, Walter Daelemans

Citation

If you use this dataset in your research, make sure to cite the following paper:

Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann, and Walter Daelemans. 2021. MFAQ: a Multilingual FAQ Dataset. In Proceedings of the 3rd Workshop on Machine Reading for Question Answering, pages 1–13, Punta Cana, Dominican Republic. Association for Computational Linguistics.

URL

https://huggingface.co/datasets/clips/mfaq

Acknowledgement

This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

MQA

Description

MQA is a multilingual corpus of questions and answers parsed from the Common Crawl. Questions are divided into two types: Frequently Asked Questions (FAQ) and Community Question Answering (CQA).

Languages

We collected around 234M pairs of questions and answers in 39 languages.

Creator(s)

CLiPS Research Center, University of Antwerp, Belgium; Maxime De Bruyn, Walter Daelemans

Citation

If you use this dataset in your research, make sure to cite the following paper:

URL

https://huggingface.co/datasets/clips/mqa

Acknowledgement

This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

VaccinChatNL

Description

VaccinChatNL is a Flemish Dutch FAQ dataset on the topic of COVID-19 vaccinations in Flanders. It consists of 12,833 user questions divided over 181 answer labels, thus providing large groups of semantically equivalent paraphrases (a many-to-one mapping of user questions to answer labels). VaccinChatNL is the first Dutch many-to-one FAQ dataset of this size.
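As a minimal sketch of what a many-to-one mapping looks like in practice, the snippet below groups user questions by their shared answer label. The example questions, the label names and the `group_by_label` helper are hypothetical illustrations; they are not taken from VaccinChatNL and do not reflect its actual file format.

```python
from collections import defaultdict

# Hypothetical (user question, answer label) pairs, invented for illustration.
pairs = [
    ("Is het vaccin veilig?", "faq_safety"),
    ("Kan ik het vaccin vertrouwen?", "faq_safety"),
    ("Waar kan ik me laten vaccineren?", "faq_location"),
]


def group_by_label(pairs):
    """Collect all user questions that map to the same answer label."""
    groups = defaultdict(list)
    for question, label in pairs:
        groups[label].append(question)
    return dict(groups)


groups = group_by_label(pairs)
for label, questions in groups.items():
    print(label, len(questions))
```

Each group produced this way is a set of paraphrases that should all receive the same answer, which is what makes such data usable for both classification and paraphrase research.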

Languages

Dutch (Flemish)

Creator(s)

CLiPS Research Center, University of Antwerp, Belgium; Jeska Buhmann, Walter Daelemans

Citation

If you use this dataset in your research, make sure to cite the following paper:

Jeska Buhmann, Maxime De Bruyn, Ehsan Lotfi, and Walter Daelemans. 2022. Domain- and Task-Adaptation for VaccinChatNL, a Dutch COVID-19 FAQ Answering Corpus and Classification Model. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3539–3549, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

URL

https://huggingface.co/datasets/clips/VaccinChatNL

Acknowledgement

This research received funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” programme.

TwiSty Corpus

Description

TwiSty is a corpus developed for research in author profiling. It contains personality (MBTI) and gender annotations for a total of 18,168 authors spanning six languages. We distribute the Twitter IDs of these authors as well as the IDs of their tweets that were available at the time of corpus development. The tweets have undergone language identification and are sorted into a Confirmed category (tweets identified as belonging to the language in which the author is situated) and an Other category.

Languages

Spanish, Portuguese, French, Dutch, Italian, German

Creator(s)

Ben Verhoeven (1), Walter Daelemans (1), Barbara Plank (2)

(1) CLiPS Research Center, University of Antwerp, Belgium
(2) University of Groningen, The Netherlands

Citation

If you use this dataset in your research, make sure to cite the following paper:

Verhoeven, B., Daelemans, W., & Plank, B. (2016) TwiSty: a multilingual Twitter Stylometry corpus for gender and personality profiling. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.

ISLRN

883-383-734-892-8

License

DOI

https://doi.org/10.5281/zenodo.4638948

Acknowledgement

This research is supported by a doctoral grant from the FWO Research Council - Flanders for the first author. We thank Guy De Pauw and Tom De Smedt for technical support. Part of this research was carried out in the framework of the AMiCA (IWT SBO-project 120007) project, funded by the Flemish government agency for Innovation by Science and Technology (IWT).

CLiPS Stylometry Investigation (CSI) Corpus

Description

The CSI corpus is a yearly expanded corpus of student texts in two genres: essays and reviews. The corpus is intended primarily for stylometric research, but other applications are possible. Extensive metadata is available, both on the author (gender, age, sexual orientation, region of origin, personality profile) and on the document (timestamp, genre, veracity, sentiment, grade). The current version of the corpus was assembled in February 2016. Previous versions of the corpus are available from the authors via e-mail request.

Language

Dutch

Creator(s)

CLiPS Research Center, University of Antwerp; Ben Verhoeven, Walter Daelemans

Citation

If you use this dataset in your research, make sure to cite the following paper:

Verhoeven, Ben & Daelemans, Walter. (2014) CLiPS Stylometry Investigation (CSI) corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland.

ISLRN

094-431-805-015-9

License

DOI

https://doi.org/10.5281/zenodo.4639616

Acknowledgement

We would like to express our gratitude to Katrien Verreyken, Shanti Verellen, Sarah Van Hoof, Dominiek Sandra and Reinhild Vandekerckhove (University of Antwerp) for their help in collecting all the data. This corpus was first constructed within the framework of the AMiCA project, funded by the Flemish Agency for Innovation through Science and Technology (IWT), but its further development is supported by a PhD grant of FWO - Research Foundation - Flanders for the first author.

AuCoPro Semantics

Description

The AuCoPro-Semantics dataset supports the automatic semantic analysis of compounds. It contains semantically annotated noun-noun compounds (NN) from Dutch and Afrikaans, split into two annotation rounds per language. The semantic annotation followed guidelines based on those of Ó Séaghdha (2008). Another part of the dataset contains other nominal compounds (XN) in Dutch that were annotated using a newly developed annotation scheme.

Languages

Dutch & Afrikaans

Creator(s)

Ben Verhoeven (1), Gerhard B. van Huyssteen (2) & Walter Daelemans (1)

(1) CLiPS Research Center, University of Antwerp
(2) Centre for Text Technology (CTexT), North-West University, South Africa

Citation

If you use this dataset in your research, make sure to cite one of the following two papers:

Verhoeven, B., Daelemans, W., & Van Huyssteen, GB. (2012). Classification of Noun-Noun Compound Semantics in Dutch and Afrikaans. In: Proceedings of the Twenty-Third Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). Pretoria, South Africa. 29-30 November. pp. 121-125. ISBN: 978-0-620-54601-0.

Verhoeven, B., & van Huyssteen, G. B. (2013). More Than Only Noun-Noun Compounds: Towards an Annotation Scheme for the Semantic Modelling of Other Noun Compound Types. In: Proceedings of the 9th Joint ISO - ACL SIGSEM Workshop on Interoperable Semantic Annotation. Potsdam, Germany.

License

DOI

https://doi.org/10.5281/zenodo.4643727

Acknowledgement

This dataset was created within the 'Automatic Compound Processing (AuCoPro)' project that was funded by the Dutch Language Union (Nederlandse Taalunie), the Department of Arts and Culture (DAC) of South Africa and the National Research Foundation (NRF) of South Africa.

deLearyous

Description

The deLearyous dataset is a Dutch (Flemish) dataset for emotion classification following the framework of Leary's Rose, also known as the Interpersonal Circumplex. The dataset contains 11 conversations that were annotated at the sentence level with their position on Leary's Rose, expressed in terms of its two defining dimensions: "dominance" and "affinity". In addition to discrete class labels (8 octants and a "neutral" class), the dataset also contains fine-grained annotations, with continuous values along the two defining axes.
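To illustrate how a continuous position on the two axes can relate to a discrete class, the sketch below maps a (dominance, affinity) point to either "neutral" or one of eight octants. Everything specific here is an assumption made for illustration: the `leary_class` function, the octant numbering (0 to 7, counter-clockwise from the positive affinity axis), the value range and the `neutral_radius` cut-off are not taken from the deLearyous annotation guidelines.

```python
import math


def leary_class(dominance, affinity, neutral_radius=0.1):
    """Map a point on the two axes to "neutral" or an octant index 0-7.

    Assumptions (not from the dataset): affinity is the horizontal axis,
    dominance the vertical axis, octant 0 starts at the positive affinity
    axis, and indices increase counter-clockwise.
    """
    # Points close to the origin are treated as neutral.
    if math.hypot(dominance, affinity) < neutral_radius:
        return "neutral"
    # Angle in [0, 2*pi), measured from the positive affinity axis.
    angle = math.atan2(dominance, affinity) % (2 * math.pi)
    # Each octant spans pi/4 radians; the final modulo guards against
    # floating-point rounding at the 2*pi boundary.
    return int(angle // (math.pi / 4)) % 8


print(leary_class(0.5, 1.0))
print(leary_class(0.0, 0.0))
```

A radial cut-off for the neutral class is only one possible design; the actual class boundaries in the dataset may have been assigned directly by annotators rather than derived from the continuous values.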

Language

Dutch

Creator(s)

CLiPS Research Center, University of Antwerp: Frederik Vaassen, Walter Daelemans; e-Media Lab, Groep T: Jeroen Wauters, Frederik Van Broeckhoven, Maarten Van Overveldt, Koen Eneman

Citation

If you use this dataset in your research, make sure to cite one of the following papers:

Vaassen, Frederik & Daelemans, Walter (2011). Automatic Emotion Classification for Interpersonal Communication. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011). Portland, Oregon, USA.

Vaassen, Frederik, Wauters, Jeroen, Van Broeckhoven, Frederik, Van Overveldt, Maarten, Eneman, Koen, & Daelemans, Walter (2012). deLearyous: Training Interpersonal Communication Skills Using Unconstrained Text Input. In: Patrick Felicia (Ed.), Proceedings of ECGBL 2012, The 6th European Conference on Games Based Learning. Cork, Ireland.

License

DOI

https://doi.org/10.5281/zenodo.4643731

Acknowledgement

This dataset was created in the context of the IWT-TETRA project deLearyous (2010-2012).

Personae Corpus

Description

The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level. We make available the original texts, a syntactically annotated version of the texts, and the metadata.

Language

Dutch

Creator(s)

CLiPS Research Center, University of Antwerp; Kim Luyckx, Walter Daelemans

Citation

If you use this dataset in your research, make sure to cite the following paper:

Luyckx, Kim & Daelemans, Walter (2008). Personae, a Corpus for Author and Personality Prediction from Text. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco.

License

DOI

https://doi.org/10.5281/zenodo.4643756

Acknowledgement

The construction of the corpus was made possible by a grant from the Flemish Research Foundation (FWO) for the 'Computational Techniques for Stylometry for Dutch' project.

The Dutch Audio Description Corpus

Description

The Dutch Audio Description corpus is the first corpus of its kind and includes the transcribed texts of 39 audio-described Dutch films and TV series, totalling 154,570 words and 3,074 minutes of video. This Dutch AD corpus was used to extract quantitative data on the language of AD, namely frequency counts of parts of speech, words, lemmas and collocations, and other relevant text statistics such as reading speed, word and sentence length, text readability and type-token ratios (a statistical measure reflecting lexical variety). The data registered here include the corpus files (XML files) of the transcribed audio descriptions, the multimodal concordancer developed for the project and the raw data extracted from the corpus as part of the PhD project during which this corpus was developed.
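For readers unfamiliar with the type-token ratio mentioned above, the following minimal sketch computes it as the number of distinct word forms divided by the total number of words. The `type_token_ratio` function, the whitespace tokenisation and the example sentence are assumptions for illustration; they do not reproduce the procedure used in the project.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens (types) divided by all tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented example sentence, split on whitespace for simplicity.
tokens = "de man loopt naar de deur".split()
ttr = type_token_ratio(tokens)  # 5 distinct words out of 6 tokens
print(round(ttr, 3))
```

Because the ratio tends to drop as texts get longer, comparisons are usually made between samples of similar length.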

Language

Dutch

Creator(s)

Nina Reviers, Aline Remael, Reinhild Vandekerckhove

Citation

If you use this dataset in your research, make sure to cite the following paper:

Reviers, Nina; Remael, Aline; Vandekerckhove, Reinhild. (2017). Dutch Audio Description Corpus [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1035175

DOI

https://doi.org/10.5281/zenodo.1035175

Acknowledgement

This dataset was created as a collaboration between Artesis University College, Antwerp and the University of Antwerp.

PAN19 Authorship Analysis: Cross-Domain Authorship Attribution

Description

Authorship attribution is an important problem in information retrieval and computational linguistics, but also in applied areas such as law and journalism, where knowing the author of a document (such as a ransom note) may, for example, enable law enforcement to save lives. This edition of PAN focuses on cross-domain attribution in fanfiction, a task that can be more accurately described as cross-fandom attribution in fanfiction. In more detail, all documents of unknown authorship are fanfics of the same fandom (the target fandom), while the documents of known authorship by the candidate authors are fanfics of several fandoms (other than the target fandom). In contrast to the PAN-2018 edition of this task, we focus on open-set attribution conditions: the true author of a text in the target domain is not necessarily included in the list of candidate authors.
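The practical consequence of the open-set condition is that a system must be able to decline to pick any candidate. A minimal sketch of that idea, assuming some upstream model has already produced a similarity score per candidate author, is shown below. The `attribute` function, the score values, the threshold and the "unknown" label are all hypothetical; this is not the PAN19 baseline or evaluation protocol.

```python
def attribute(scores, threshold=0.5, unknown="unknown"):
    """Return the best-scoring candidate, or `unknown` if no score is high enough.

    `scores` maps candidate author names to similarity scores produced by
    some attribution model (assumed to exist; not shown here).
    """
    if not scores:
        return unknown
    best = max(scores, key=scores.get)
    # Open-set decision: reject the best candidate when even its score
    # falls below the threshold.
    return best if scores[best] >= threshold else unknown


print(attribute({"author_a": 0.9, "author_b": 0.2}))
print(attribute({"author_a": 0.3, "author_b": 0.2}))
```

In a closed-set setting the threshold check would be dropped and the best-scoring candidate always returned.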

Languages

English, French, Italian and Spanish

Creator(s)

Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Walter Daelemans, Martin Potthast, and Benno Stein

Citation

Kestemont, Mike, Stamatatos, Efstathios, Manjavacas, Enrique, Daelemans, Walter, Potthast, Martin, & Stein, Benno. (2019). PAN19 Authorship Analysis: Cross-Domain Authorship Attribution [Data set]. CLEF 2019 Labs and Workshops, Notebook Papers. Switzerland: Zenodo. http://doi.org/10.5281/zenodo.3530313

DOI

https://doi.org/10.5281/zenodo.3530313