Machine learning approaches to modeling language use of individuals and demographic groups

Date: 28 November 2016

Venue: UAntwerpen - Stadscampus - Hof van Liere, Frederik de Tassiszaal - Prinsstraat 13 - 2000 Antwerpen

Time: 3:00 PM - 6:00 PM

Organization / co-organization: Faculty of Arts

PhD candidate: Janneke van de Loo

Principal investigators: Dr Guy De Pauw and Prof Walter Daelemans

Short description: PhD defence Janneke van de Loo - Faculty of Arts


Applications in societally relevant natural language processing

The thesis describes research on the automatic induction of models of language use in two natural language processing applications with a high societal impact. In both applications, the purpose of the induced model is to connect language data to the information that has to be extracted from it. We show that even when the data contains a lot of noise, such as grammatical and spelling errors, promising and useful results can be achieved with supervised and weakly supervised machine learning based on low-level features.

The first application, developed in the project ALADIN, is a self-learning vocal interface that allows physically impaired users to control their environment through spoken commands. The vocal interface automatically adapts to each individual user by learning the user's pronunciation, vocabulary and grammar. The system is trained with a limited set of example commands and associated controls provided by the user. The research described in the thesis addresses the induction of the grammar, which models the compositionality of the commands and establishes a connection between the commands and their meanings. We describe a concept tagging system based on hierarchical hidden Markov models, which we extended with parameter sharing mechanisms and a retraining phase, thereby reducing the amount of training material needed.

The second application, developed in the project AMiCA, is an author profiling module that automatically estimates people’s age and gender based on the posts they write on social networking websites. This information can be used to aid the detection of harmful online conduct such as grooming by pedophiles, who often provide deceptive profiles. In our age and gender prediction experiments, we used token and character n-gram features extracted from the texts to learn classification models. Binary age classification, i.e. predicting whether an author is older or younger than a specific age boundary, was carried out with a range of different age boundaries. The results show that classification scores rise as the age boundary increases. Furthermore, we show that performance levels suitable for practical use can be achieved for the classification of minors versus adults, thereby providing a useful component in a cybersecurity monitoring tool.
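
As an illustration of the feature extraction and labelling steps described above, the following minimal Python sketch counts token and character n-grams in a post and assigns a binary age label for a given boundary. It is not the implementation used in the thesis: the function names, the lowercasing, the whitespace tokenisation, the default n-gram sizes and the choice to count an author exactly at the boundary as "older" are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count all overlapping character n-grams in the text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def token_ngrams(text, n):
    """Count all overlapping token n-grams, using lowercased whitespace tokens."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenisation
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def extract_features(text, char_n=3, token_n=1):
    """Merge character and token n-gram counts into one feature dictionary.

    The prefixes "c:" and "t:" keep the two feature types apart.
    """
    features = Counter()
    for gram, count in char_ngrams(text.lower(), char_n).items():
        features["c:" + gram] = count
    for gram, count in token_ngrams(text, token_n).items():
        features["t:" + gram] = count
    return features


def binary_age_label(age, boundary):
    """Label an author relative to an age boundary.

    Assumption: an author whose age equals the boundary counts as "older".
    """
    return "older" if age >= boundary else "younger"
```

Repeating the labelling step with several values of `boundary` yields one binary classification task per age boundary, and the resulting feature dictionaries can be passed to any standard supervised classifier.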
