Datasets

Various corpus resources have been developed at CLiPS. Many of these are available to a wider audience upon request. The resources are listed below.

 

TwiSty Corpus

Description

TwiSty is a corpus developed for research in author profiling. It contains personality (MBTI) and gender annotations for a total of 18,168 authors spanning six languages. We distribute the Twitter IDs of these authors as well as the IDs of their tweets that were available at the time of corpus development. The tweets have undergone language identification and are divided into two categories: Confirmed (identified as written in the language the author is assigned to) and Other.
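MBTI types are four-letter codes, one letter per dichotomy, so profiling experiments often treat them as four separate binary labels. The following is a minimal sketch of that decomposition. It assumes the annotation is available as a plain four-letter string; the function name `split_mbti` and the key names are illustrative and are not taken from the corpus files.

```python
# Each MBTI type is a four-letter code, one letter per dichotomy.
MBTI_AXES = (("I", "E"), ("N", "S"), ("T", "F"), ("J", "P"))


def split_mbti(label):
    """Split a four-letter MBTI type into its four dichotomies.

    Hypothetical helper: the label format of the actual corpus files
    is an assumption here, not something this page specifies.
    """
    label = label.strip().upper()
    if len(label) != 4:
        raise ValueError(f"expected a four-letter MBTI type, got {label!r}")
    traits = {}
    for letter, (first, second) in zip(label, MBTI_AXES):
        if letter not in (first, second):
            raise ValueError(f"unexpected letter {letter!r} in {label!r}")
        traits[f"{first}/{second}"] = letter
    return traits


print(split_mbti("INTJ"))
```

Each of the four resulting values can then serve as the target of its own binary classifier.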

Languages

Spanish, Portuguese, French, Dutch, Italian, German

Creator(s) 

Ben Verhoeven (1), Walter Daelemans (1), Barbara Plank (2)

    (1) CLiPS Research Center, University of Antwerp, Belgium
    (2) University of Groningen, The Netherlands

Citation

If you use this dataset in your research, make sure to cite the following paper: 

Verhoeven, B., Daelemans, W., & Plank, B. (2016) TwiSty: a multilingual Twitter Stylometry corpus for gender and personality profiling. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). Portorož, Slovenia.

ISLRN

883-383-734-892-8

License 

DOI

https://doi.org/10.5281/zenodo.4638948

Download

Zip (380 MB)

Acknowledgement  

This research is supported by a doctoral grant from the FWO Research Council - Flanders for the first author. We thank Guy De Pauw and Tom De Smedt for technical support. Part of this research was carried out in the framework of the AMiCA (IWT SBO-project 120007) project, funded by the Flemish government agency for Innovation by Science and Technology (IWT).

CLiPS Stylometry Investigation (CSI) Corpus

Description

The CSI corpus is a corpus of student texts in two genres, essays and reviews, that is expanded yearly. It is intended primarily for stylometric research, but other applications are possible. Extensive metadata is available, both on the author (gender, age, sexual orientation, region of origin, personality profile) and on the document (timestamp, genre, veracity, sentiment, grade). The current version of the corpus was assembled in February 2016. Previous versions of the corpus are available from the authors via e-mail request.

Language

Dutch

Creator(s)

CLiPS Research Center, University of Antwerp; Ben Verhoeven, Walter Daelemans

Citation

If you use this dataset in your research, make sure to cite the following paper:

Verhoeven, Ben & Daelemans, Walter (2014). CLiPS Stylometry Investigation (CSI) corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). Reykjavik, Iceland.

ISLRN

094-431-805-015-9

License 

DOI

https://doi.org/10.5281/zenodo.4639616

Download

Zip (3 MB)

Acknowledgement

We would like to express our gratitude to Katrien Verreyken, Shanti Verellen, Sarah Van Hoof, Dominiek Sandra and Reinhild Vandekerckhove (University of Antwerp) for their help in collecting all the data. This corpus was first constructed within the framework of the AMiCA project, funded by the Flemish Agency for Innovation through Science and Technology (IWT), but its further development is supported by a PhD grant of FWO - Research Foundation - Flanders for the first author.

AuCoPro Semantics

Description

The AuCoPro-Semantics dataset supports the automatic semantic analysis of compounds. It contains semantically annotated noun-noun compounds (NN) from Dutch and Afrikaans, split into two annotation rounds per language. The semantic annotation followed guidelines based on those of Ó Séaghdha (2008). Another part of the dataset contains other nominal compounds (XN) in Dutch that were annotated using a newly developed annotation scheme.

Languages

Dutch & Afrikaans

Creator(s)

Ben Verhoeven (1), Gerhard B. van Huyssteen (2) & Walter Daelemans (1)

    (1) CLiPS Research Center, University of Antwerp
    (2) Centre for Text Technology (CTexT), North-West University, South Africa

Citation

If you use this dataset in your research, make sure to cite one of the following two papers:

Verhoeven, B., Daelemans, W., & van Huyssteen, G. B. (2012). Classification of Noun-Noun Compound Semantics in Dutch and Afrikaans. In: Proceedings of the Twenty-Third Annual Symposium of the Pattern Recognition Association of South Africa (PRASA). Pretoria, South Africa. 29-30 November. pp. 121-125. ISBN: 978-0-620-54601-0.

Verhoeven, B., & van Huyssteen, G. B. (2013). More Than Only Noun-Noun Compounds: Towards an Annotation Scheme for the Semantic Modelling of Other Noun Compound Types. In: Proceedings of the 9th Joint ISO - ACL SIGSEM Workshop on Interoperable Semantic Annotation. Potsdam, Germany.

License 

DOI

https://doi.org/10.5281/zenodo.4643727

Download

Zip

Acknowledgement 

This dataset was created within the 'Automatic Compound Processing (AuCoPro)' project that was funded by the Dutch Language Union (Nederlandse Taalunie), the Department of Arts and Culture (DAC) of South Africa and the National Research Foundation (NRF) of South Africa.

deLearyous

Description

The deLearyous dataset is a Dutch (Flemish) dataset for emotion classification following the framework of Leary's Rose, also known as the Interpersonal Circumplex. The dataset contains 11 conversations that were annotated at the sentence level with their position on Leary's Rose, expressed in terms of its two defining dimensions: "dominance" and "affinity". In addition to discrete class labels (8 octants and a "neutral" class), the dataset contains fine-grained annotations with continuous values along the two defining axes.
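To illustrate how a continuous (dominance, affinity) pair relates to a discrete octant, here is a minimal sketch. It assumes both axes are centred on zero, numbers the octants 0 to 7 counter-clockwise starting from the positive affinity axis, and treats points close to the origin as neutral. The function name `leary_octant`, the numbering, and the `neutral_radius` threshold are illustrative assumptions, not the dataset's own labelling scheme.

```python
import math


def leary_octant(dominance, affinity, neutral_radius=0.1):
    """Map a point on the two axes to an octant index (0-7) or None.

    Assumptions (not taken from the dataset): axes centred on 0,
    octant 0 starts at the positive affinity axis, indices increase
    counter-clockwise, and points within neutral_radius of the origin
    are treated as neutral.
    """
    if math.hypot(dominance, affinity) < neutral_radius:
        return None  # neutral class
    # Angle measured from the affinity axis, normalised to [0, 2*pi).
    angle = math.atan2(dominance, affinity) % (2 * math.pi)
    # Each octant spans 45 degrees; the final modulo guards against
    # floating-point rounding at the 2*pi boundary.
    return int(angle // (math.pi / 4)) % 8


print(leary_octant(1.0, 0.5))
print(leary_octant(0.0, 0.0))
```

The actual octant labels used in the annotation should be taken from the dataset documentation rather than from this numbering.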

Language

Dutch

Creator(s)

CLiPS Research Center, University of Antwerp: Frederik Vaassen, Walter Daelemans; e-Media Lab, Groep T: Jeroen Wauters, Frederik Van Broeckhoven, Maarten Van Overveldt, Koen Eneman

Citation

If you use this dataset in your research, make sure to cite one of the following papers:

Vaassen, Frederik & Daelemans, Walter (2011). Automatic Emotion Classification for Interpersonal Communication. In: Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011). Portland, Oregon, USA.

Vaassen, Frederik, Wauters, Jeroen, Van Broeckhoven, Frederik, Van Overveldt, Maarten, Eneman, Koen, & Daelemans, Walter (2012). deLearyous: Training Interpersonal Communication Skills Using Unconstrained Text Input. In: Patrick Felicia (Ed.), Proceedings of ECGBL 2012, The 6th European Conference on Games Based Learning. Cork, Ireland.

License 

DOI

https://doi.org/10.5281/zenodo.4643731

Download

Zip

Acknowledgement

This dataset was created in the context of the IWT-TETRA project deLearyous (2010-2012).

Personae Corpus

Description

The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level. We make available the original texts, a syntactically annotated version of the texts, and the metadata.

Language

Dutch

Creator(s)

CLiPS Research Center, University of Antwerp; Kim Luyckx, Walter Daelemans

Citation

If you use this dataset in your research, make sure to cite the following paper:

Luyckx, Kim & Daelemans, Walter (2008). Personae, a Corpus for Author and Personality Prediction from Text. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008). Marrakech, Morocco.

License 

DOI

https://doi.org/10.5281/zenodo.4643756

Download

Zip

Acknowledgement

The construction of the corpus was made possible by a grant from the Flemish Research Foundation (FWO) for the 'Computational Techniques for Stylometry for Dutch' project.