Past Colloquia

Here you find a list of all the research colloquia held at CLiPS so far.

16 October 2019 - Mattia di Gangi

 

  • Date 
    Wednesday 16th of October 2019, 15:00 - 16:00

  • Location
    Annexe, Building R - Stadscampus, Rodestraat 14, Antwerpen 2000.

  • Title
    Challenges and Developments in Direct Speech-to-Text Translation

  • Abstract
    Direct speech-to-text translation is a recent research area that aims to develop models able to translate speech in one language directly into text in another language. Unlike traditional approaches, which concatenate (at least) an automatic speech recognition (ASR) model and a machine translation (MT) model, the translation has to be obtained without an intermediate discrete representation (e.g. a transcription). Some motivations that guide research in this area are: overcoming error propagation in the model cascade, reducing inference latency, and translating endangered languages, most of which do not have a formal writing system. Despite the ambitious goal, the results to date are far from satisfactory and remain worse than those of cascade systems. The reasons are essentially threefold: i) scarce availability of data; ii) low data effectiveness of such models when using training data from ASR or MT; and iii) technical difficulties in using state-of-the-art sequence-to-sequence models. In this talk I will introduce the task and describe our work to overcome the problems of data scarcity and of the available neural architectures. In particular, I will introduce MuST-C, our recently proposed Multilingual Speech Translation Corpus, and the S-Transformer neural architecture, both of which contributed to a significant increase in translation quality in only one year. The talk will end with some research questions that can guide researchers interested in entering this exciting new field. (A minimal code sketch contrasting the cascade and direct setups follows this entry.)

  • Bio
    Mattia is a third-year Ph.D. student at the University of Trento, Italy. In Trento, he is pursuing his research in the Machine Translation (MT) group at Fondazione Bruno Kessler (FBK), where he studied many aspects of MT before arriving at his current research topic, direct speech-to-text translation. His research experience also includes an internship in 2016 at the CNR (National Research Council) in Palermo, Italy, and one in 2019 at Amazon AI in East Palo Alto, California. He received his M.Sc. in Computer Science in the context of a double-degree programme between the University of Palermo, Italy, and the University Paris-Est Marne-la-Vallée, France.
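To make the cascade/direct contrast from the abstract concrete, here is a minimal Python sketch; all three model functions are hypothetical stand-ins for illustration, not the systems discussed in the talk:

    # Sketch contrasting the two setups; the "models" are placeholder callables.

    def asr(audio):        # hypothetical ASR model: audio -> source-language text
        return "hallo wereld"

    def mt(text):          # hypothetical MT model: source text -> target text
        return "hello world"

    def direct_st(audio):  # hypothetical direct model: audio -> target text,
        return "hello world"  # with no intermediate transcription

    audio = b"..."  # placeholder for a waveform

    # Cascade: ASR errors propagate into MT, and two decoding passes add latency.
    cascade_output = mt(asr(audio))

    # Direct: a single model, no discrete intermediate representation.
    direct_output = direct_st(audio)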

28 May 2019 - Roeland van Hout [CANCELLED]

 

  • Date 
    Tuesday 28th of May 2019, 16:00 - 17:00

  • Location
    Annexe, Building R - Stadscampus, Rodestraat 14, Antwerpen 2000.

  • Title
    Big data suggest strong constraints of linguistic similarity on adult language learning

  • Abstract
    When adults learn new languages, their speech often remains noticeably non-native even after years of exposure. These non-native varieties can have far-reaching socio-economic consequences for learners. Many factors have been found to contribute to a learner's proficiency in the new language. Here we examine a factor that is outside the learner's control: the linguistic similarity between the learner's native language (L1) and the new language (Ln). We analyze the speaking proficiencies of about 50,000 Ln learners of Dutch with 62 diverse L1s. We find that language background accounts for a large proportion of the variance in proficiency (~17%) and that almost 80% of this effect can be explained by combining measures of phonological, morphological, and lexical similarity between the L1 and the Ln (a toy version of such a regression is sketched after this entry). These results highlight the constraints that a learner's native language imposes on language learning, and inform theories of L1-to-Ln transfer during Ln learning and use. As predicted by some proposals, we also find that L1-Ln phonological similarity is better captured when subcategorical properties (phonological features) are considered in the calculation of phonological similarities.

  • Bio
    Roeland van Hout is a variationist linguist and a sociolinguist with a strong focus on statistical data processing. His research covers a wide range of topics, the most prominent ones being dialect variation in the Dutch language area, linguistic perceptions and attitudes with respect to varieties of Dutch, and the acquisition of Dutch as a second or third language.
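As a toy illustration of the kind of analysis sketched in the abstract, the snippet below regresses a synthetic proficiency score on three hypothetical L1-Ln similarity measures; the data, coefficients, and resulting R^2 are invented and carry no empirical weight:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    n = 1000
    # Hypothetical per-learner similarity scores between the L1 and Dutch (Ln).
    phon = rng.uniform(0, 1, n)    # phonological similarity
    morph = rng.uniform(0, 1, n)   # morphological similarity
    lex = rng.uniform(0, 1, n)     # lexical similarity
    proficiency = 2.0 * phon + 1.0 * morph + 1.5 * lex + rng.normal(0, 1, n)

    X = np.column_stack([phon, morph, lex])
    model = LinearRegression().fit(X, proficiency)
    print("share of variance explained (R^2):", model.score(X, proficiency))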

16 May 2019 - Padraic Monaghan


  • Date 
    Thursday 16th of May 2019, 11:00 - 12.00

  • Location
    Annexe, Building R - Stadscampus, Rodestraat 14, Antwerpen 2000.

  • Title
    Multiple cues and contingency of use in early language learning 

  • Abstract
    The child's language environment is notoriously noisy in terms of constraining how words relate to meanings and specifying syntactic relations between categories in utterances. In this talk I will present corpus, computational, and behavioural studies of how multiple cues may conspire to constrain word-to-meaning mappings in language learning. Once the whole communicative environment is considered, multiple information sources that highlight word meanings and syntactic relations emerge. I will show that individually these information sources are noisy, but that together they are able to constrain language learning, and that caregivers appear to use multiple cues adaptively to minimise children's uncertainty about word meanings (a toy illustration of such cue combination follows this entry).

  • Bio
    Padraic has a joint appointment as Professor of English Linguistics at the University of Amsterdam, and Professor of Cognition in the Department of Psychology at Lancaster University. His research focuses on language acquisition and language change, at the interface between linguistics, psychology and computer science. He has investigated the different profiles of language acquisition for monolingual and bilingual speakers, the importance of early language skills on reading development, and the cognitive effect of different reading interventions on reading processes. He is a member of the ESRC International Centre for Language and Communicative Development at Lancaster University, and a Research Associate at the Max Planck Institute for Psycholinguistics.
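The following toy sketch illustrates the cue-combination idea from the abstract: three individually noisy cues, multiplied together in a naive-Bayes-style combination, single out the correct referent. All numbers are invented:

    import numpy as np

    referents = ["dog", "ball", "cup"]
    # Invented P(referent | cue) for three noisy cues,
    # e.g. prosody, gesture, and co-occurrence.
    cue_posteriors = np.array([
        [0.50, 0.30, 0.20],   # cue 1 alone: barely informative
        [0.40, 0.35, 0.25],   # cue 2 alone
        [0.45, 0.30, 0.25],   # cue 3 alone
    ])
    combined = cue_posteriors.prod(axis=0)
    combined /= combined.sum()  # renormalise
    for referent, p in zip(referents, combined):
        print(f"{referent}: {p:.2f}")
    # The correct referent ("dog") now dominates (~0.67),
    # although no single cue was decisive.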

7 May 2019 - Sylvia Jaki

Date

Tuesday 7 May 2019, 14:30 - 15:30

Location

C.104 – Stadscampus, Antwerpen 2000. Enter through building B (Prinsstraat 13) or building D (Grote Kauwenberg 18).

Title

Case Studies of Hate Speech on Social Media: Analysis and Automatic Detection

Abstract

Initially praised for their ability to connect people worldwide, social media have also turned out to be drivers of polarisation and radicalisation. Polarisation can be observed on a daily basis, given the striking amount of offensive and discriminatory communicative acts towards individuals or groups, so-called hate speech. Due to the detrimental effects of hate speech on individuals and online discourse in general, some ground-breaking regulatory measures have recently been taken against hate speech and fake news, such as the German NetzDG. In this context, machine learning, more specifically automatic detection, comes into play, as automatic detection, and potentially also removal, is often discussed as an effective strategy to fight hate speech and fake news online.

The first part of this talk will be dedicated to the qualitative and quantitative analysis of communicative acts ranging from uncivil to toxic, in order to provide a general overview of the problem. The second part will focus on automatic hate speech detection – on its potential as well as on the pitfalls and challenges that come with it. The talk will be based on three case studies: German right-wing hate speech on Twitter (Jaki & De Smedt 2018), misogynist hate speech in the forum Incels.me (Jaki, De Smedt et al. 2018), and comments on the official Facebook pages of the main political parties and their leading candidates before the 2017 German federal elections.

Bio

Sylvia Jaki graduated from the University of Munich, Germany, in 2008. She holds a PhD in Linguistics from the International Doctoral Programme LIPP (now Graduate School Language & Literature Munich, Class of Language), with a dissertation focussing on phraseological modifications in newspaper headlines. Since 2013, Sylvia Jaki has taught and pursued new research directions at the Department of Translation & Specialised Communication at the University of Hildesheim, where she is now in charge of the Master’s Medientext und Medienübersetzung (Media Text and Media Translation). Her main areas of interest include media language, audiovisual translation, science communication to non-expert audiences, verbal humour, and, most recently, hate speech.

25 October 2018 - Stephan Raaijmakers

Date: Thursday 25 October 2018, 14:30

 

Location: Annexe, Building R - Stadscampus, Rodestraat 14, Antwerpen 2000

 

SCHEDULE:

 

14:30: 'Memory-based learning revisited', talk by Stephan Raaijmakers

15:30: Closing

 

See below for more detailed information about the topic and the speaker. The planned duration of the talk is 60 minutes, with 10 minutes for questions and discussion at the end.

 

------------------------------------------------------------------------------------------

 

TITLE

Memory-based learning revisited

 

ABSTRACT

Some 20 years after the major successes of memory-based learning (MBL) for language learning emerged, it is time to re-assess its position amidst the flurry of deep learning models. We will report whether the benefits of MBL for modeling exceptional data still hold compared to current deep learning approaches. In addition, we will highlight a potential new use of MBL: as a method for explaining deep learning through latent space analysis of models.
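As a reminder of what "memory-based" means here: MBL stores all training examples and classifies new items by their nearest stored neighbours, so exceptional cases are retained rather than smoothed away. Below is a minimal sketch with scikit-learn's k-NN standing in for a TiMBL-style learner; the task and features are invented:

    from sklearn.neighbors import KNeighborsClassifier

    # Invented toy task: predict a suffix class from two numerically
    # encoded features of a word's ending.
    X = [[0, 1], [0, 2], [1, 1], [1, 2], [2, 0]]
    y = ["-je", "-je", "-tje", "-tje", "-etje"]  # last example is an exception

    mbl = KNeighborsClassifier(n_neighbors=1).fit(X, y)  # "memory" = the data
    print(mbl.predict([[2, 0]]))  # the exceptional case is recalled exactly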

 

BIO

Stephan Raaijmakers is a senior scientist at the Data Science Department of TNO. He obtained a PhD in machine learning-based text mining from Tilburg University in 2009. He is currently the technical coordinator of two machine learning-intensive European research projects in the security domain, and specializes in deep learning-based approaches to NLP. Within TNO, he is the scientific lead of a research group on explainable AI. He is currently writing a monograph on Deep Learning for NLP (due in 2019), and is an active reviewer and PC member of conferences like EMNLP and ICWSM.

26 June 2018 - Malvina Nissim & Paolo Rosso

This colloquium consists of two talks:

The blessing and the curse of lexical information in author profiling (Malvina Nissim)

Automatic Misogyny Identification (AMI) in Twitter (Paolo Rosso)

Below you can find more detailed information about the presentations and the speakers.

 

========================================================================

 

Title

The blessing and the curse of lexical information in author profiling

Abstract

State-of-the-art performance on author profiling (e.g., gender and age) for English in social media is around 80%. How realistic is this figure? Although it is known that function words are good predictors for this task, most of the best-performing systems are trained on word and character n-grams, thus heavily leveraging lexical information too. And it works. But because of this, shifts of topic or domain seriously affect performance. If we create conditions that reduce the influence of lexical information, we might be able to identify features that are indeed more relevant for profiling. I will describe experiments in which we get rid of lexical information while still preserving some linguistic structure, and see how well we can do with what's left. We stretch this to see whether, free of lexical information, we can perform cross-language profiling, and we also compare our models to human performance in quite an interesting way!
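One simple way to "get rid of lexical information while preserving some linguistic structure" is to keep function words and mask content words. The sketch below only illustrates the idea under that assumption; it is not the procedure used in the experiments:

    # Minimal delexicalisation sketch: keep function words, mask the rest.
    FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "but",
                      "is", "was", "to", "that", "it", "i", "you", "not"}

    def delexicalise(tokens):
        return [t if t.lower() in FUNCTION_WORDS else "X" for t in tokens]

    tweet = "I was walking the dog in the park and it started to rain"
    print(" ".join(delexicalise(tweet.split())))
    # -> I was X the X in the X and it X to X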

Bio

Malvina Nissim is Associate Professor in Language Technology at the University of Groningen. She has extensive experience in sentiment analysis and author identification and profiling, as well as in modelling the interplay of lexical semantics and pragmatics, especially regarding the computational treatment of figurative language and modality. She is the author of 100+ publications in international venues, is a member of the main associations in the field, and annually reviews for the major conferences and journals. She has recently co-chaired the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017), and she is the general chair of the Seventh Joint Conference on Lexical and Computational Semantics (*SEM 2018). She is also active in the field of resource and system evaluation, as both organizer and participant of shared tasks, and is interested in the philosophy behind them. She graduated in Linguistics from the University of Pisa and obtained her PhD in Linguistics from the University of Pavia. Before joining the University of Groningen, she was a tenured researcher at the University of Bologna (2006-2014), and a post-doc at the Institute for Cognitive Science and Technology of the National Research Council in Rome (2006) and at the University of Edinburgh (2001-2005). In 2017, she was elected as the 2016 University of Groningen Lecturer of the Year.

 

------------------------------------------------------------------------------------------------------------------

 

Title

Automatic Misogyny Identification (AMI) in Twitter

Abstract

Hate speech may take different forms in online social media. Most of the investigations in the literature are focused on detecting abusive language in discussions about religion, immigration and sexual orientation. In this talk, I will address the problem of automatic misogyny identification (AMI) in Twitter, which entails identifying cases of aggressiveness and hate speech towards women. Moreover, I will also present the results obtained at two shared tasks on exactly this topic: AMI-IberEval-2018 (https://amiibereval2018.wordpress.com/) on Spanish and English Twitter data, and AMI-EvalIta-2018 (https://amievalita2018.wordpress.com/) on English and Italian Twitter data.

Bio

Paolo Rosso is an Associate Professor at the Technical University of Valencia. He obtained a PhD in Computer Science from Trinity College Dublin in 1999. His work centers on Natural Language Processing and Information Retrieval in social media, and he has extensive experience in the field of author profiling and opinion mining. He has actively contributed to the field of controlled system evaluation and shared tasks as Deputy Steering Committee Chair for the Conference of the CLEF Initiative and by chairing the PAN shared tasks for the past nine editions (2009-2018). He was recently part of the Organizing Committee of the conference of the European Chapter of the Association for Computational Linguistics (EACL 2017) and has taken up roles as editor and reviewer for many notable conferences and journals in computational linguistics.

11 April 2018 - Yaakov HaCohen-Kerner

Title

Stance classification of tweets using Skip Char Ngrams

Abstract

In this research, we focus on automatic supervised stance classification of tweets. Given test datasets of tweets on five different topics, we try to classify the stance of the tweet authors as either in FAVOR of the target, AGAINST it, or NONE. We apply eight variants of seven supervised machine learning methods and three filtering methods using the WEKA platform. The macro-average results obtained by our algorithm are significantly better than the best macro-average results achieved in SemEval 2016 Task 6-A, for all five released datasets. In contrast to the competitors in SemEval 2016 Task 6-A, who did not use any skip char ngrams but rather thousands of ngrams and hundreds of word embedding features, our algorithm uses a few tens of features, mainly character-based, most of which are skip char ngram features.
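For readers unfamiliar with the feature type, here is a minimal sketch of skip character n-gram extraction; this is an illustrative implementation of the general idea, not the one used in the work presented:

    from itertools import combinations

    def skip_char_ngrams(text, n=3, max_skip=1):
        """Character n-grams allowing up to max_skip skipped positions."""
        ngrams = set()
        for window in range(n, n + max_skip + 1):
            for start in range(len(text) - window + 1):
                chunk = text[start:start + window]
                # keep the window's first and last characters, choose the rest
                for idx in combinations(range(window), n):
                    if idx[0] == 0 and idx[-1] == window - 1:
                        ngrams.add("".join(chunk[i] for i in idx))
        return ngrams

    # With max_skip=1, "against" yields contiguous trigrams like "gai"
    # plus skip trigrams like "aas" (one character jumped over).
    print(sorted(skip_char_ngrams("against", n=3, max_skip=1)))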

Bio

Yaakov HaCohen-Kerner is an associate professor in computer science at the Jerusalem College of Technology (JCT) - Machon Lev, Jerusalem, Israel. He graduated in statistics and computer science, and holds a master's degree and a Ph.D. in computer science; all three degrees are from Bar-Ilan University, Ramat-Gan, Israel. His doctoral project was awarded prizes by both the Information Processing Association of Israel and the "Ben-Gurion Fund for the Encouragement of Research". He developed the "Judge's Apprentice", a case-based decision-support system for judges (at bench trials) for enhancing uniformity in sentencing. His research interests are: Torah and science, artificial intelligence, case-based processing, intelligent text processing, and game playing.

16 October 2017 - Current trends in Psycholinguistics

This colloquium consists of three talks:

Trial-by-trial discrimination learning in the lexical decision task (Harald Baayen)

The impact of word prevalence on word processing efficiency: The English data (Marc Brysbaert)

The ISLA project: How bilinguals and L2 learners process idiomatic expressions (Ton Dijkstra)

Below you can find more detailed information about the presentations and the speakers.

 

========================================================================

 

Title

Trial-by-trial discrimination learning in the lexical decision task

Abstract

In my presentation, I will present the results of a computational modeling study addressing trial-by-trial learning in the lexical decision task. The phenomenon of speaker accommodation shows that the fine details of how we speak can change rapidly when we interact with others. Do such changes also take place when experimental subjects interact with a computer that presents them with words and non-words? To address this question, I made use of the NDL implementation of error-driven learning. In one experiment, a network was trained on a corpus and then, without further learning, applied to the sequence of lexical decision trials. In a second experiment, a network was again trained on a corpus, but now, for each of the trials, the predictions of the network were obtained first, after which the weights of the network were adjusted, thus allowing the model to keep learning as the experiment unfolded. Statistical modeling showed that measures obtained from the second network outperformed the corresponding measures extracted from the first network. The measures from the second network also outperformed classical measures such as frequency and neighborhood density. These results show that the mental lexicon, rather than being a static repository of fixed representations, is a dynamic system that is continuously optimized as speakers interact with their environment.
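The error-driven update at the heart of NDL is the Rescorla-Wagner learning rule. The sketch below shows the trial-by-trial regime described above, where the prediction is read out before the weights are adjusted; dimensions, learning rate, and trials are invented for illustration:

    import numpy as np

    def rw_trial(W, cues, outcome, lr=0.01):
        """One learning event: active cues (row indices) predict one outcome."""
        prediction = W[cues].sum(axis=0)       # activation before learning
        target = np.zeros(W.shape[1])
        target[outcome] = 1.0
        W[cues] += lr * (target - prediction)  # error-driven weight change
        return prediction

    W = np.zeros((5, 3))  # 5 cues x 3 outcomes, all weights start at zero
    for cues, outcome in [([0, 1], 0), ([1, 2], 1), ([0, 1], 0)]:
        print(rw_trial(W, cues, outcome))  # predictions shift as learning unfolds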

Bio

Harald Baayen currently holds the Professorship of Quantitative Linguistics at the University of Tübingen. After a Bachelor's degree in Theology and a Master's degree in General Linguistics obtained from the Vrije Universiteit Amsterdam, he obtained his PhD in General Linguistics from the same university. Later, he moved to the Max Planck Institute for Psycholinguistics in Nijmegen, where he became professor of Quantitative Linguistics. He moved to the University of Alberta, Edmonton, in 2007, where he held the professorship of Quantitative Linguistics until 2011, when he finally moved to Tübingen. In 2011 he received the Alexander von Humboldt research award from the Humboldt Foundation and, since 2012, he has been an elected member of the Academia Europaea.
Prof. Baayen is a member of the editorial boards of several prominent journals in linguistics and psycholinguistics. He contributed to the development of resources that are used worldwide, like the CELEX database. He introduced new statistical techniques that have become the state of the art in psycholinguistics, such as mixed-effects models and generalised additive mixed models. Moreover, he developed computational models that are contributing to our understanding of language processing and acquisition, such as Naive Discriminative Learning. He is the author of one of the most widely used books on quantitative analysis of linguistic data using R, and of dozens of papers in the leading journals of linguistics, psycholinguistics and cognitive science.

 

------------------------------------------------------------------------------------------------------------------

 

Title

The impact of word prevalence on word processing efficiency: The English data

Abstract

Brysbaert et al. (2016) showed that word prevalence, defined as the percentage of people who know a word, is a good predictor of word processing times in Dutch. Word prevalence explains 7% of the variance in lexical decision times after the effects of word frequency, age of acquisition, word length, and similarity to other words are taken into account. In this talk, I will present both the Dutch and the English word prevalence norms and discuss their contribution to accounting for the variance in lexical decision and naming times in the lexicon projects. I will also discuss the reasons why word prevalence is an important variable to take into account in future studies.
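A minimal sketch of the measure itself: prevalence is the proportion of respondents who report knowing a word, often probit-transformed before being used as a predictor. The response counts below are invented:

    from statistics import NormalDist

    asked = 1000
    known = {"cat": 1000, "prevalence": 740, "funambulist": 120}  # invented
    for word, n in known.items():
        p = min(max(n / asked, 0.001), 0.999)   # clamp away from 0 and 1
        probit = NormalDist().inv_cdf(p)        # probit-transformed prevalence
        print(f"{word}: raw={p:.3f}, probit={probit:.2f}")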

Bio

Marc Brysbaert is a full professor at the Department of Experimental Psychology of Ghent University. He got his master's degree and his PhD in psychology at the Katholieke Universiteit Leuven. He stayed at KU Leuven as a post-doc before moving to Ghent University for the first time, where he stayed until 2001. He then worked at Royal Holloway, University of London, and finally moved back to Ghent in 2008, where he still conducts his research.
During his career, Prof. Brysbaert has received several prestigious awards. He is also a member of many prestigious professional bodies, including the Experimental Psychology Society, the Psychonomic Society, and the Association for Psychological Science. He has been editor of the Quarterly Journal of Experimental Psychology and associate editor of many leading journals in the fields of experimental psychology, psycholinguistics and cognitive science. He has authored countless papers in the leading journals of the field, and his pioneering work on megastudies in psycholinguistics has paved the way for the use of big data in psycholinguistics.

 

------------------------------------------------------------------------------------------------------------------

 

Title

The ISLA project: How bilinguals and L2 learners process idiomatic expressions 

Abstract

One summer day, the Netherlands was suffering very hot weather with an increasing risk of thunderstorms. This inspired the Dutch newspaper De Volkskrant to run the headline 'Onweersbuien zetten hittegolf op de tocht' ('Thunderstorms jeopardize heat wave'). While the Dutch headline is clear to native speakers, it poses serious difficulties to L2 learners of Dutch who are not familiar with the expression "op de tocht zetten" ('to put in a draught' = to jeopardize), because its meaning cannot be inferred from its component words (although context may help). Second language (L2) learners have even more difficulty in actively using such expressions in language production. Because of its opacity and multifaceted complexity, idiomatic language is much harder for L2 learners to acquire and master than individual words.
In the Idiomatic Second Language (ISLA) project, we investigate how learners acquire, comprehend, and use such formulaic language in their L2. In the talk, I will present an up-to-date overview of the project and zoom in on several RT and EEG studies on the bilingual processing and representation of idiomatic expressions. By analyzing the contribution of various dimensions (task, context, frequency, transparency, ...) to the results, these empirical studies hook together as constituent pieces of the larger puzzle of bilingual sentence processing.

Bio

Ton Dijkstra is full professor in Psycholinguistics and Multilingualism at the Faculty of Arts of Radboud University and at the Faculty of Social Sciences of the Donders Institute in Nijmegen. He got his Bachelor's degree in psychology at Tilburg University and his Master's degree in the same subject in Nijmegen, where he also completed his PhD. He continued his research in Nijmegen, becoming associate professor in 2001 and full professor in 2007. Over the years, his teaching has covered a broad range of topics, including experimental psychology, computational psycholinguistics, cognitive science, research design, language acquisition, and multilingualism.
Prof. Dijkstra is best known for his work on computational modeling of bilingual word recognition, and in particular for the Bilingual Interactive Activation (BIA) model and its extension, BIA+. His research focuses on word translation in bilinguals, combining behavioral experiments, computational modeling, and imaging studies, and on second language acquisition. His work has appeared in over a hundred journal articles published in some of the most prestigious venues in linguistics, psycholinguistics, and cognitive science.

 

========================================================================

30 November 2016 - Polina Panicheva

Title

Distributional semantics in identification and applications of lexical anomalies

Abstract

Lexical anomalies are viewed as choices of lexical items in context that violate certain contextual restrictions. The notion can be effectively applied to measuring syntagmatic association strength in compositions, to analyzing lexical errors by native speakers and L2 learners, and to figurative language constructions, especially conceptual metaphor. State-of-the-art semantic models, i.e. word embeddings, provide strong and simple techniques for semantic analysis of syntagmatic (collocations) and paradigmatic (synonyms, semantic regularities) relations. The talk covers recent advances in applying word embeddings to association strength measurement, to the detection and interpretation of lexical errors in Russian, and to metaphorical expressions in English.

Bio

Polina Panicheva is a doctoral student at the Department of Mathematical Linguistics and a research engineer at the Department of General Psychology, Saint Petersburg State University, Russia. Additionally, she received an MSc in Information Technology from ITT Dublin, Ireland in 2011. Her research interests include computational approaches to subjective language and stylometry, distributional semantics and lexical anomalies in Russian. She is working on a project on the linguistic markers of aggression, stress, health and well-being in social networks, and her PhD project is concerned with distributional approaches to lexical anomalies.

12 April 2016 - Annie Zaenen

Title

Factivity: semantics or pragmatics

Abstract

Linguistic literature traditionally assumes that factivity, illustrated in (1), is a characteristic of particular lexical items together with their constructional environment.

(1) (a) John knows that the earth is round.

(b) John doesn’t know that the earth is round. 

In (1) (a) and (b) the speaker is committed to the view that the earth is round. It is assumed that this commitment is signaled by the use of 'know that S'.

Predicates like ‘believe that’ do not show factive behavior:

(2) (a) John believes that the earth is round.

(b) John doesn’t believe that the earth is round.

In (1) the sentential complement is assumed to be presupposed whereas in (2) it is asserted. 

In this talk, I will look at examples of several lexical items and their complements that cast doubt on the hypothesis that the factive behavior in (1) is only due to the linguistic form of the sentence. I will mainly concentrate on the behavior of evaluative adjectives in the constructions in (3) and (4):

(3) John was stupid to leave early.

(4) It was stupid of John to leave early. 

and show with corpus and experimental data that in constructions like (4) the content of the infinitival clause influences the judgment. For instance, sentences like (5) are judged not to be factive by a substantial minority of speakers (it is understood as implying that John did not waste money):

(5) John was not stupid to waste money. 

I will discuss a couple of potential explanations for this without coming to a firm conclusion. But whatever the explanation, it is important for NLP applications drawing inferences about speaker commitments not to assume that the behavior of factive items is context independent. 

Bio

Annie Zaenen received a Ph.D. at Harvard in 1980 with a dissertation on extraction rules in Icelandic. With Joan Maling, she focused the attention of the syntax community on phenomena such as Icelandic quirky case and its consequences for the universality of Chomsky's that-trace filter. She has contributed to the theory of Lexical Functional Grammar (LFG) with the development of notions such as long-distance dependencies, functional uncertainty and the difference between subsumption and equality. She also developed a morphological analyser for French that, after some revisions, became an Inxight product. After a stint as an area manager at the Xerox European Research Center near Grenoble, France, in the 1990s, she moved back to the Bay Area, doing research at PARC since 2001. She retired from PARC in 2011 and continues working at CSLI and teaching Linguistics at Stanford. She is the editor of an online CSLI journal, LiLT (Linguistic Issues in Language Technology).

24 February 2016 - Ewan Dunbar

Title

The first year of life and the first years of unsupervised speech recognition: How we are using big corpora to understand infant language development

Abstract

This talk is a briefing on the state of the art in modelling early development of speech perception and lexical acquisition using big speech corpora without annotations, which is a problem that has now brought engineers and computational psycholinguists together under the banner of "unsupervised speech recognition." I'll summarize what we think we know today about how infants start to learn the sounds and words of their native language, and what that tells us about building a reasonable computational model, and I'll briefly sketch out the recent history of joint applied/cognitive research on unsupervised ASR and infant speech development. Then I'll zoom in on some of the best results from the 2015 ZeroSpeech unsupervised ASR challenge at Interspeech, and, in particular, a model in which we learn proto-words using spoken term discovery in order to bootstrap the learning of proto-phonemes. Then I'll briefly talk about some new research in which we evaluate what dimensions/features are coded in speech representations, which we hope will allow us to better tie empirical psycholinguistics together with computational modelling.

Bio

Ewan Dunbar is currently a postdoctoral fellow at the Laboratoire de Sciences Cognitives et Psycholinguistique, a highly interdisciplinary lab that involves the École des Hautes Études en Sciences Sociales (EHESS), the Centre National de la Recherche Scientifique (CNRS) and the École Normale Supérieure (ENS), and is hosted at the Département d'Études Cognitives of the ENS in Paris. He started off studying Linguistics and Computing at the University of Toronto and then got an MA in Linguistics from the same university, with a thesis on the acquisition of morphophonology. In 2008, he moved to the University of Maryland, where he did a PhD in Linguistics on statistical knowledge and learning in phonology, under the supervision of William Idsardi and Naomi Feldman. His interest in language has always proceeded alongside his interest in computational modeling, and his research efforts have found a home in the Synthetic Language Learner Project, which brought together researchers with diverse backgrounds to try to implement a computational model of early language acquisition and test its predictions with behavioural experiments and brain imaging techniques. You can find more information on his personal webpage, http://ewan.website

29 January 2016 - Gertjan Van Noord

Title

Improving Automatically Parsed Dutch Treebanks

Abstract

In this presentation, we will describe our efforts (some in vain) to improve the available automatically parsed Lassy Large treebank. We describe some aspects of the Alpino parser and some recent attempts at improving it. Alpino is a hybrid system in which a hand-written grammar and a large dictionary are combined with a statistical disambiguation component. The disambiguation component uses co-occurrence information extracted from large treebanks for improved disambiguation accuracy. We describe a recent experiment in adding word embedding features to the disambiguation component. We further zoom in on the part-of-speech annotation layer of the existing Lassy Large treebanks, suggesting that the part-of-speech labels, originally provided by a separate POS tagger, are of questionable quality. We analyse some of the reasons for this, describe our efforts to provide part-of-speech labels as a side-effect of parsing, and provide some initial experimental results indicating a huge potential improvement in POS-tagging accuracy when using the parser as a tagger.

Bio

Gertjan van Noord is professor of Language Technologies at the Rijksuniversiteit Groningen (RUG), where he has been working since 1999. He obtained his M.A. in General Linguistics at the University of Utrecht with a major in Computational Linguistics, a subject he further pursued in a PhD at the same university, focusing on reversibility in language processing. During his PhD he also spent one year at Saarland University in Saarbrücken, working on bidirectional linguistic deduction. In 1990, he was one of the initiators of the CLIN meetings, which he helped shape and which have been promoting the study of computational linguistics in the Low Countries for the past 26 years. He has supervised many PhD students and post-docs who now work in top universities worldwide. Among the many conferences and workshops he has chaired and organized, in 2006 he chaired the conference of the European Chapter of the Association for Computational Linguistics (EACL), and in 2009 he was elected to the Executive Board of the Association for Computational Linguistics (ACL), becoming Vice-President Elect in 2012 and President in 2014.

27 January 2016 - Efstathios Stamatatos

Title

Authorship Verification: A Fundamental Task in Authorship Attribution

Abstract

Authorship attribution is a task of increasing importance in computer science and it is associated with a wide range of applications, from literary research to forensic examinations. The most common framework for testing attribution algorithms is a text classification problem: given known sample documents from a small, finite set of candidate authors, which if any wrote a questioned document of unknown authorship? It has been commented, however, that this may be an unreasonably easy task. A more demanding problem is author verification where given a set of documents by a single author and a questioned document, the problem is to determine if the questioned document was written by that particular author or not. This may more accurately reflect real life in the experiences of professional forensic linguists, who are often called upon to answer this kind of question. Authorship verification is a fundamental task in authorship attribution since any given problem can be decomposed into a series of verification cases. In this talk we will focus on the recent efforts of the PAN evaluation campaigns to establish a common and challenging evaluation framework for this task. The main stylometric methods suitable for this task together with basic verification models will be reviewed and available resources will be presented. Open research questions will be discussed and the relationship of authorship verification to other relevant tasks will be highlighted.
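As a baseline illustration of a verification model (my own toy example, not one of the methods reviewed in the talk), the sketch below represents documents as character trigram tf-idf vectors and accepts the questioned document if its mean similarity to the known documents exceeds a threshold; the documents and the threshold are placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    known = ["first known document by the author ...",
             "second known document by the author ..."]
    questioned = "the questioned document of unknown authorship ..."

    vec = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
    X = vec.fit_transform(known + [questioned])
    sim = cosine_similarity(X[-1], X[:-1]).mean()  # similarity to known docs
    print("same author" if sim > 0.4 else "different author", round(sim, 2))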

Bio

Dr. Efstathios Stamatatos received the diploma degree in electrical engineering (1994) and the doctoral degree in electrical and computer engineering (2000), both from the University of Patras, Greece. In the past, he has worked at the Polytechnic University of Madrid (1998) as a visiting researcher, at the Austrian Research Institute for Artificial Intelligence (2001-2002) as a post-doc researcher, and at the Technological Educational Institute of the Ionian Islands (2003-2004) as an assistant professor. Since 2004 he has been a member of the faculty of the Department of Information and Communication Systems Engineering, University of the Aegean (currently as an associate professor). His research interests include text mining, natural language processing, information retrieval, and machine learning. He is the director of the Artificial Intelligence Lab at the University of the Aegean, and has co-organized several international evaluation campaigns on plagiarism detection, authorship attribution and social software misuse. Home page: http://www.icsd.aegean.gr/lecturers/stamatatos/

16 September 2015 - Tony Veale

Title

Telling Stories By Putting Narrative Substance on Image Schemas

Abstract

What is a hero without a quest? And what is a quest that does not transform its hero in profound ways? The scholar Joseph Campbell has argued that our most steadfast myths persist because they each instantiate, in their own way, a profoundly affecting narrative structure that Campbell calls the monomyth. Campbell sees the monomyth as a productive schema for the generation of heroic stories that, at their root, follow the image-schematic pattern of a journey either literally or figuratively. Many ancient tales subconsciously instantiate this journey schema, while many modern stories – such as George Lucas’s Star Wars – are consciously written so as to employ Campbell’s monomyth schema as a narrative deep-structure. So Campbell’s monomyth (and, indeed, the folkloristic morphology of Vladimir Propp) can be subsumed under a more abstract, yet ultimately experientially-grounded, mental structure called the Source-Path-Goal (SPG) schema. Cognitive linguists argue that any purposeful action along a path – from going to the shops to undertaking a quest – activates an instance of the SPG schema in the mind. But the SPG is just one of the pervasive image schemas in human thought that shape our understanding of human experiences and the stories we tell about them. Other spatially-grounded schemas, such as vertical movement, connection/disconnection and containment, are freighted with narrative potential. In this talk I explore computational ways of converting this potential into working story-telling code.

Bio

Tony Veale is a senior lecturer in the department of Computer Science at University College Dublin (UCD), Ireland. He has been a researcher in the areas of Computational Linguistics, Cognitive Science, Cognitive Linguistics and Artificial Intelligence since 1988, both in industry and in academia. He obtained a B.Sc. (Hons) in Computer Science from University College Cork (UCC) in 1988, and an M.Sc. in Computer Science in 1990, before joining Hitachi Dublin Laboratory in 1990. He received his Ph.D. in Computer Science from Trinity College, Dublin in 1996. He has divided his career between academia and industry. In the latter, he has developed text-understanding and machine translation systems (in particular, the translation of English into American Sign Language, ASL), as well as natural-language-processing tools, and patented web-based question-answering technology. He was, from 2002–2007, the academic coordinator for UCD's unique international degree programme in Software Engineering, which UCD delivers in Shanghai at Fudan university; he continues to deliver courses on this degree. He is the author of Exploding The Creativity Myth: The Computational Foundations of Linguistic Creativity (Bloomsbury Academic, 2012) and a founder member of the international Association for Computational Creativity (ACC). He is the coordinator of PROSECCO - PROmoting the Scientific Exploration of Computational Creativity (http://prosecco-network.eu), a 3-year coordination action involving 7 universities from 6 countries, whose goal is to foster research about computational creativity, a discipline exploring the capabilities of computers to perform tasks that would be considered creative by unbiased human observers.

16 September 2015 - Pablo Gervás

Title

In Search of Appropriate Abstractions for the Computational Synthesis of Narrative

Abstract

The synthesis of narrative, whether to capture the essence of an existing narrative or to generate a draft for a new one, requires an appropriate vocabulary of representational elements. Because narrative is presented to us in many forms – text, speech, pictures, film – the shared essence of this narrative must be something that can be abstracted from all these different forms, and which we aim to represent conceptually if we want to manage it computationally. There have been a number of attempts to formulate representations of this type in literary studies, but usually not very rigorous in their computational approach – mostly because they were only intended as descriptive or explanatory formalisations rather than generative ones. My talk will focus on recent work at UCM, in the context of the WHIM project, to review some of these existing attempts and to distill from them a set of abstractions suitable to represent the essence of narratives in a way that can work both to represent existing narratives and to generate drafts for new ones. The approaches reviewed range from Propp's Morphology of the Folk Tale and Polti's Thirty Six Dramatic Situations to Booker's Seven Basic Plots. The abstractions that have been found most relevant concern an elementary unit of plot, much in the vein of Propp's character functions, a set of narrative roles played by characters, and a set of long-range dependencies between elements in a plot. Based on abstractions such as these, a generative procedure has been proposed that, when provided with the necessary knowledge resources tailored to a particular domain, allows construction of acceptable narratives.

BIO

Pablo Gervás works as associate professor (profesor titular de universidad) at the Departamento de Ingeniería del Software e Inteligencia Artificial, Facultad de Informática, Universidad Complutense de Madrid. He is the director of the NIL research group (http://nil.fdi.ucm.es) and also of the Instituto de Tecnología del Conocimiento (http://www.itc.ucm.es). In recent years, Gervás has taken part in the organization of several scientific meetings on topics related to creativity. He was founding member of the Computational Creativity Working Group (WG4) of initiative COST 282: “Knowledge Discovery in Science and Technology,” funded by the European Commission. His research is on computational creativity, processing natural language input, generating natural language output, building resources for related tasks, and generating stories. In the area of creative text generation, he has done work on automatically generating metaphors, formal poetry, fairy tales, and short films. He is now involved in the following projects related to computational creativity: "WHIM: The What If Machine" (http://www.whim-project.eu/) and "ConCreTe: Concept Creation Technologies" (http://conceptcreationtechnology.eu/), and he is also involved in the PROSECCO initiative.

10 October 2014 - Rieks op den Akker

Title

Studies in interpersonal stance taking in interrogative interviews: towards building virtual suspect characters

Abstract

In the Dutch COMMIT project Interaction for Natural Access, the research group Human Media Interaction of the University of Twente studies the conversational genre of police interrogation, with the eventual aim of creating virtual humans that can play the role of a suspect character in an educational game for police officers. For analysing interpersonal stance taking in these interviews, Leary's Rose is used (as in the deLearyous project). In this talk, the findings of this project will be presented. Questions that will be addressed are:
I) Is Leary's Rose a valuable theoretical framework for understanding stance taking in police interviews?
II) What is the relation between turn-taking and stance taking in interrogative interviews?
III) Is it possible to identify non-verbal behaviors and expressions that distinguish perceived stances?

BIO

Rieks op den Akker is assistant professor at the Human Media Interaction group of the Department of Electrical Engineering, Mathematics and Computer Science of the University of Twente in Enschede, the Netherlands. He studied mathematics and computer science and graduated in theoretical computer science. He lectures in mathematics, artificial intelligence, speech and language processing, and conversational agents and dialogue systems. His research currently focuses on virtual characters and smart coaching systems.

10 October 2014 - Carlo Strapparava

Title

Emotions, Humour, and Persuasion: computational explorations of creative language

Abstract

Dealing with creative language, and in particular with affective, persuasive and even humorous language, has often been considered outside the scope of computational linguistics. Nonetheless, it is possible to exploit current NLP techniques to start exploring it. We briefly review some computational experiments on these typical creative genres.

Bio

Carlo Strapparava is a senior researcher at FBK-irst in the Human Language Technologies Unit. His research activity covers artificial intelligence, natural language processing, cognitive science, user models, adaptive hypermedia, lexical knowledge bases, word-sense disambiguation, affective computing and computational humor. In June 2011, he was awarded a Google Research Award on Natural Language Processing, specifically on the computational treatment of affective and creative language.

24 January 2014 - Andrea Ravignani

Title

Brains hate randomness: Patterning skills for music and language in humans and other animals

Abstract

Human beings are excellent at perceiving and producing sensory structures. In particular, cognitive abilities for patterning seem crucial in language and music processing. The comparative approach, testing a range of animal species, can help unveil the evolutionary history of such patterning abilities. Here, I present experimental data and ongoing modeling work in humans and other primates. I compare monkeys' and humans' skills in processing sensory dependencies in auditory stimuli, a crucial feature of human cognition. As pattern production and perception abilities have been shown to differ in humans, the same divide could exist in other species. I present ongoing work using "electronic drums" I developed specifically for apes, which will allow chimpanzees to spontaneously produce non-vocal acoustic patterns. To reconstruct ancestral states of human temporal patterning skills, I present ongoing work using agent-based models of acoustic communication. I conclude by outlining the research I will do during my visit, namely exploring structural similarities between linguistic and musical rhythm.

19 November 2013 - Harald Baayen

Title

Naive discrimination learning as a framework for modeling aspects of language processing

Abstract

Naive discrimination learning is an approach to language processing that is inspired by information theory (Shannon, 1948) on the one hand, and learning theory in psychology (Wagner & Rescorla, 1972) on the other. Instead of understanding grammar in terms of a formal calculus with an alphabet of symbols and rules for combining elementary symbols into well-formed strings, we think of the grammar of a language as comprising a code and overt signals. The signal (speech, writing, gesture, whistling, ...) is not necessarily decompositional in the item-and-arrangement sense. Instead, cues distributed over the signal are allowed to be jointly predictive (thereby questioning the hypothesis of the dual articulation of language). The code that language users share, albeit approximately due to variation in life experience, allows them to encode experience into the signal, or decode experience from the signal. Crucially, we see the signal as targeting a reduction in uncertainty about the encoded experience. Experiences, in all their richness, are much richer in information (in bits) than what could be encoded in a short speech signal (in bits). For instance, the words "pride and prejudice" will bring to mind a certain book and the movies of that book, which are much richer than less than a second of speech (or a few centimeters of writing) can ever encode.

This general approach to language will be illustrated with three examples. The first example addresses the myth that our cognitive faculties decline as we age. I will show that the psychological tests supposedly documenting cognitive decline all make use of language materials in a way that is uninformed about the consequences of prolonged experience with language as we grow older. The second example addresses the question of morphological processing in reading. Naive discrimination learning predicts that cues that are unique to a complex word are most predictive for that word. Data from eye-tracking experiments indicate this prediction is correct. The final example focuses on syntax and semantics. Various algorithms have been developed for estimating the semantic similarity of words using word co-occurrence across documents (LSA) or within windows of text (HAL, HiDEx). I will present ongoing research in our group which indicates that a discrimination learning approach in which words are cues for words makes predictions about word-to-word semantic similarity that are very similar to those of HiDEx. This suggests that naive discrimination theory may provide a straightforward rationale for why semantic vector space models work, based on simple, locally restricted learning events.

14 May 2013 - Tim Van de Cruys

Title

The computation of word meaning

Abstract

In the course of the last two decades, significant progress has been made with regard to the automatic extraction of word meaning from large-scale text corpora using unsupervised machine learning methods. The most successful models of word meaning are based on distributional similarity, calculating the meaning of words according to the contexts in which those words appear. The first part of this tutorial provides a general overview of the algorithms and notions of context used to calculate semantic similarity. We will look in some detail at dimensionality reduction, an unsupervised machine learning technique that is able to reduce a large number of contexts to a limited number of meaningful dimensions. In the second part of this tutorial, participants will gain some hands-on experience with the computation of semantic similarity. Participants will have the chance to construct a number of distributional models and perform dimensionality reduction calculations using a designated Python library for semantic similarity.
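In the spirit of the hands-on part, here is a self-contained sketch that builds a word-context co-occurrence matrix from a toy corpus, reduces it with truncated SVD, and compares words by cosine similarity; numpy and scikit-learn stand in for the tutorial's designated library:

    import numpy as np
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = ["the cat chased the mouse", "the dog chased the cat",
              "the mouse ate the cheese", "the dog ate the bone"]
    vocab = sorted({w for s in corpus for w in s.split()})
    index = {w: i for i, w in enumerate(vocab)}

    # Count co-occurrences in a symmetric one-word window.
    M = np.zeros((len(vocab), len(vocab)))
    for s in corpus:
        words = s.split()
        for i, w in enumerate(words):
            for j in (i - 1, i + 1):
                if 0 <= j < len(words):
                    M[index[w], index[words[j]]] += 1

    # Dimensionality reduction: many contexts -> a few meaningful dimensions.
    reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(M)
    sim = cosine_similarity(reduced)
    print("cat ~ dog:", round(sim[index["cat"], index["dog"]], 2))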

8 March 2013 - Koenraad De Smedt

Title

INESS and CLARINO: some infrastructure activities in Norway

Abstract

This talk (with demo) will give an overview of some activities related to research infrastructures for linguistics and language studies. The INESS project (Infrastructure for the Exploration of Syntax and Semantics) is aimed at creating an innovative eScience environment based on treebanking. It is not just a repository for treebanks, with tools for searching and filtering, but also supports the interactive building of new, dynamic treebanks, including parallel treebanks, all via a web interface. The recently started CLARINO project (CLARIN Norway) is not only wider in the scope of its resources but also involves a larger national and international cooperation. It focuses on corpus exploration, termbank access, linguistic analysis portals and interactive dynamic presentation of philological editions.

31 January 2013 - Tony Veale

Title

The Creative Web: Computational Creativity as a Web-Service

Abstract

As a sub-field of AI, Computational Creativity (CC) does not distinguish itself through distinct algorithms or representations, but through its goals and its philosophy. The primary goal of CC is to imbue computers with the kind of self-evaluating and self-filtering generative capabilities that are deemed "creative" when observed in humans. The driving philosophy of CC as a field frowns on "pastiche" -- the reverse-engineered exploration of a sweet-spot of outputs in the distinctive style of a particular artist or creator -- and on "mere" generation -- the formulaic, script-based generation of well-formed outputs that are not subsequently evaluated or critiqued, and which are never rejected as uninteresting by the system itself. CC aims to develop generative software that can appreciate its own outputs, and even be occasionally surprised by these outputs. 

In this talk I explore how the web can be used as a force-magnifier for both theoretical and engineering progress in the field of computational creativity. I propose that emerging CC technologies be integrated and pooled to provide an architecture of creative web services that provide important CC processes and services in an on-demand fashion. In this vision of a Creative Web, web services will provide creativity on tap to third-party software applications; these services will include ideation services (such as metaphor invention), composition services (such as conceptual blending) and framing services (such as poetry generation, joke generation, emotionally-grounded explanations and analyses, etc.). Specifically, I will describe some existing services that have been designed in UCD and KAIST to instantiate this vision. These web services include the interpretation and generation of affective metaphors, the analysis of conceptual blends in both propositional and emotional terms, and the rendering of metaphors and blends as novel poems that display some small measure of insight and imagination.

15 October 2012 - Remi van Trijp

Title

Linguistic Assessment Criteria for Explaining Language Change: A Case Study on Syncretism in German Definite Articles

Abstract

The German definite article paradigm, which is notorious for its case syncretism, is widely considered to be the accidental by-product of diachronic changes. In this presentation, I argue instead that the evolution of the paradigm has been motivated by the needs and constraints of language usage. This hypothesis is supported by experiments that compare the current paradigm to its Old High German ancestor (OHG; 900-1100 AD) in terms of linguistic assessment criteria such as cue reliability, processing efficiency and ease of articulation. Such a comparison has been made possible by 'bringing back alive' the OHG system through a computational reconstruction in the form of a processing model. The experiments demonstrate that syncretism has made the New High German (NHG) system more efficient for processing, pronunciation and perception than its historical predecessor, without, however, harming the language's strength at disambiguating utterances.
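Of the assessment criteria named above, cue reliability is the easiest to illustrate: how reliably a form signals a grammatical category, estimated from usage counts. The counts below are invented, not reconstructed OHG or NHG data:

    # Invented counts of which case the article form "den" marks in a sample.
    counts = {("den", "accusative"): 90, ("den", "dative"): 10}
    total = sum(counts.values())
    for (form, case), n in counts.items():
        print(f"P({case} | {form}) = {n / total:.2f}")  # cue reliability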

14 September 2012 - Menno van Zaanen

Title

Formal and Empirical Language Modelling

Abstract

In the context of computational linguistics, there are several tasks that are typically tackled with the help of language models. These models provide a rough description of the language under consideration; however, they are not good enough to fully describe the language. Taking the limitations of language models into account, they are often used to filter the output of other components by removing sentences that are clearly incorrect. A completely different field of research, grammatical inference, deals with finding good representations of languages, often in the shape of grammars, given a set of example sentences. This field is mainly interested in formal classes of grammars and aims to show which classes of grammars can be learnt efficiently under certain conditions. In this talk, I will describe language models and their use in computational linguistics, as well as results in grammatical inference. Finally, I will show some work that tries to bridge the two fields.
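The filtering use of language models mentioned above can be illustrated with a tiny add-alpha-smoothed bigram model that ranks candidate sentences so a downstream component can drop the implausible ones; the corpus and candidates are toy examples:

    from collections import Counter
    from math import log

    corpus = "the cat sat on the mat . the dog sat on the rug .".split()
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)

    def score(sentence, alpha=0.1):
        toks = sentence.split()
        V = len(unigrams)
        return sum(log((bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * V))
                   for a, b in zip(toks, toks[1:]))

    candidates = ["the cat sat on the mat", "mat the on sat cat the"]
    for c in sorted(candidates, key=score, reverse=True):
        print(round(score(c), 1), c)  # the scrambled sentence scores far lower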

28 June 2012 - Gerhard B. van Huyssteen

Title

Voice user interface design for multilingual emerging markets: lessons from research and development projects in South Africa

Abstract

Multilingual emerging markets hold many opportunities for the application of spoken language technologies, such as automatic speech recognition (ASR) or text-to-speech (TTS) technologies in interactive voice response (IVR) systems. However, designing such systems requires an in-depth understanding of the business drivers and salient design decisions pertaining to these markets. South Africa is a prototypical example of a developing-world nation, where numerous communities face many barriers to information access, including infrastructure, distance, literacy and language. IVRs and other voice user interface services could play an important role in addressing these barriers and bridging the information gap, as mobile phones are by far the most widespread form of ICT in developing-world regions. Hence, South Africa poses interesting challenges to, and opportunities in, the multilingual IVR market. Nonetheless, there are very few companies specialising in multilingual IVR design in South Africa, and very little local research in this domain is available. The VUI designer is therefore left in the dark and often has to make design decisions based on intuition. In our current research programme we attempt to address these needs and gaps from two angles, viz. a business analysis angle and a voice user interface design angle. On the one hand we are trying to understand the business drivers in multilingual emerging markets, and on the other hand how these drivers influence design choices. This presentation will introduce these two angles with support from our research and development work, including an investigation into 34 selected South African IVRs, as well as research and development projects for a government department and a veterinary company. We find that very few IVRs have a multilingual offering, and that only a handful have some form of speech input (which is only available in English). Persona and gender choice are low on the design priorities for South African IVRs, and cost is the major driver for multilingual IVRs, overshadowing the many positive business drivers in support of them.

Bio

Prof. Van Huyssteen is professor in Afrikaans morphology and language technology at North-West University (Potchefstroom, South Africa). He is also the owner of a small start-up company, Trifonius, which specialises in voice user interface design, as well as in technologies for Afrikaans and other South African languages. In collaboration with numerous partners, he has been closely involved in a variety of voice user interface projects, including projects for government, South Africa’s largest veterinary company, a large banking group in South Africa, and an international telecommunications company.

5 March 2012 - Alexis Palmer

Title

Evaluating automation strategies for documenting endangered languages

Abstract

Languages are dying at the rate of two each month. It is estimated that by the end of this century half of the approximately 6000 extant spoken languages will cease to be transmitted effectively from one generation of speakers to the next. Those working to document and preserve endangered languages face an immense amount of work with strong time pressure, small budgets, and limited human resources. In this talk I describe joint work with Jason Baldridge, Katrin Erk, and Taesun Moon investigating the effectiveness of various methods from machine learning and computational linguistics in cutting the cost of linguistic annotation for language documentation. Using data from the Mayan language Uspanteko, we assess the potential of active learning and semi-automated annotation through a series of timed annotation experiments that consider annotation expertise, example selection methods, and suggestions from a machine classifier.
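
As a rough illustration of the example selection methods mentioned above, the sketch below runs a generic uncertainty-sampling loop on synthetic data; it is not the actual setup, classifier, or data of the Uspanteko experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # stand-in for gold annotations

# Seed set with both classes represented; the rest is the unlabelled pool.
labelled = [int(i) for i in np.where(y == 0)[0][:5]] + \
           [int(i) for i in np.where(y == 1)[0][:5]]
pool = [i for i in range(200) if i not in labelled]

for _ in range(5):                             # five simulated annotation rounds
    clf = LogisticRegression().fit(X[labelled], y[labelled])
    probs = clf.predict_proba(X[pool])[:, 1]
    pick = pool[int(np.argmin(np.abs(probs - 0.5)))]  # most uncertain example
    labelled.append(pick)                      # the "annotator" supplies y[pick]
    pool.remove(pick)

print(f"labelled set grew to {len(labelled)} examples")
```

The point of the timed experiments is precisely to test whether rounds like these save annotator time in practice, compared with annotating examples in arbitrary order.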

Bio

Alexis Palmer is a postdoctoral researcher at the MMCI Cluster of Excellence, Department of Computational Linguistics and Phonetics of Saarland University.

19 January 2012 - Marco Baroni

Title

You shall know a word by the (visual) company it keeps: Towards a multimodal distributional semantics

Abstract

Many computational models of lexical semantics rely on the distributional hypothesis, that is, the idea that the meaning of a word can be approximated by the set of linguistic contexts in which the word occurs. In practice, this contextual distribution is encoded in a vector recording the word's co-occurrence frequencies with a set of collocates in a large text corpus. On closer inspection, the distributional hypothesis is actually making two separate claims: 1) that meaning is approximated by context, and 2) that we can limit ourselves to linguistic contexts. The latter restriction has probably been adopted by computational linguists more out of necessity than out of theoretical belief: it is easy to extract the linguistic contexts in which a word occurs from corpora, whereas, until recently, it was not clear how other kinds of contextual information could be harvested on a large scale. But this has changed: thanks to the Web, we now have access to huge amounts of multimodal documents where words co-occur with images (tagged Flickr pictures, illustrated news stories, YouTube videos...). And thanks to progress in computer vision, we can represent images in terms of automatically extracted discrete features that can in turn be treated as visual collocates of the words associated with the images, enriching the vector-based representation of words with visual information. In this talk, I will briefly introduce the relevant techniques from computer vision, and report the results of ongoing experiments in our lab in which we combine text- and image-derived collocates to derive distributional vectors that paint a richer picture of word meaning.
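
The following toy sketch illustrates the general recipe, not the lab's actual pipeline: textual co-occurrence vectors are concatenated with (here, invented) visual feature vectors, and word similarity is measured with the cosine. In practice the visual dimensions would come from automatically extracted image features.

```python
import numpy as np
from collections import Counter

def text_vector(word, corpus, collocates, window=2):
    """Co-occurrence counts of `word` with a fixed set of collocates."""
    counts = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == word:
                for c in sent[max(0, i - window):i + window + 1]:
                    if c in collocates and c != word:
                        counts[c] += 1
    return np.array([counts[c] for c in collocates], dtype=float)

def unit(v):
    """Normalise to unit length so both modalities contribute comparably."""
    n = np.linalg.norm(v)
    return v / n if n else v

corpus = [["the", "dog", "barks"], ["the", "dog", "runs"], ["the", "cat", "runs"]]
collocates = ["the", "barks", "runs"]

# Toy "visual" vectors standing in for counts of visual features in images
# tagged with each word (invented numbers, for illustration only).
visual = {"dog": np.array([5.0, 1.0]), "cat": np.array([4.0, 2.0])}

multimodal = {w: np.concatenate([unit(text_vector(w, corpus, collocates)),
                                 unit(visual[w])])
              for w in ["dog", "cat"]}
sim = float(np.dot(unit(multimodal["dog"]), unit(multimodal["cat"])))
print(f"cosine(dog, cat) = {sim:.2f}")
```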

Bio

Marco Baroni is a tenured researcher in the CLIC group of CIMeC, the Center for Mind/Brain Sciences of the University of Trento. He is also a member of the DISCoF Department. His research areas are computational linguistics and cognitive science. In 2011, he was awarded an ERC Starting Grant for the 5-year COMPOSES project on compositionality in distributional semantics, which is now his main focus of research.

16 September 2011 - Alberto Barrón Cedeño

Title

Automatic Detection of Plagiarism: Cut and paste, paraphrases and translation 

Abstract

Nowadays text can be easily found, manipulated, combined, translated, and re-used. As a result, plagiarism, the unacknowledged re-use of text, occurs on a scale previously unseen. In this talk an overview of models for automatic plagiarism detection is offered, including standard frameworks for their evaluation. Special attention is paid to models focused on translated plagiarism. The problem of detecting cut-and-paste plagiarism seems to be solved by state-of-the-art models. Nevertheless, paraphrased and, in particular, translated plagiarism are still far from being considered solved. Finally, some avenues for future research are proposed.
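
As a minimal illustration of why cut-and-paste plagiarism is the easy case (one simple approach, not necessarily the models discussed in the talk), the sketch below measures character n-gram containment between a suspicious text and a source; paraphrased or translated re-use would evade exactly this kind of surface match.

```python
def ngrams(text, n=5):
    """Set of character n-grams: a crude fingerprint of the text."""
    t = " ".join(text.lower().split())
    return {t[i:i + n] for i in range(len(t) - n + 1)}

def containment(suspicious, source, n=5):
    """Fraction of the suspicious text's n-grams also found in the source."""
    s, src = ngrams(suspicious, n), ngrams(source, n)
    return len(s & src) / len(s) if s else 0.0

source = "Plagiarism is the unacknowledged re-use of text."
suspect = "As noted, plagiarism is the unacknowledged re-use of text."
print(f"containment = {containment(suspect, source):.2f}")  # close to 1.0
```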

Bio

Alberto Barrón Cedeño is a PhD student in the Natural Language Engineering Lab (Universidad Politécnica de Valencia) under the supervision of Paolo Rosso. His main research interests are information extraction, plagiarism detection, and text similarity analysis.

16 September 2011 - Paolo Rosso

Title

Figurative language: The specific case of irony detection

Abstract

Figurative language is one of the most arduous topics facing natural language processing. Unlike literal language, it takes advantage of linguistic devices, such as metaphor, analogy, ambiguity, irony, sarcasm, satire and so on, in order to project more complex meanings which, most of the time, represent a real challenge, not only for computers but for humans as well. This is the case for humour and irony. Each device exploits different linguistic strategies to produce an effect (e.g., ambiguity and alliteration in the case of humour; similes in the case of irony). Sometimes the strategies are similar (e.g., the use of satirical or sarcastic utterances to express a negative attitude). These devices presuppose cognitive capabilities to abstract and meta-represent meanings beyond the "physical" words. At this communicative layer, communication is more than sharing a common code: it requires the capability to infer information beyond syntax or semantics. That is, figurative language implies information that is not grammatically expressed; if this information is not unveiled, the real meaning is not recovered and the figurative effect is lost. This kind of information poses a great challenge because it points to social and cognitive layers that are quite difficult to represent computationally. However, despite the difficulties that figurative language poses, approaches to automatically process figurative devices, such as humour, irony or sarcasm, seem largely encouraging.

Within this framework, this talk will aim at showing how two specific domains of figurative language - humour and irony - may be automatically handled by considering linguistic devices, such as ambiguity and incongruity, and meta-linguistic devices, such as emotional scenarios and polarity (in irony a polarity-negation phenomenon occurs). We especially focus on discussing how underlying knowledge, which relies on shallow and deep linguistic layers, may represent relevant information for automatically identifying figurative uses of language. In particular, and contrary to most research dealing with figurative language, we aim at identifying figurative usage in social media. This means that we do not focus on analysing prototypical jokes or literary examples of irony; rather, we try to find patterns in texts whose intrinsic characteristics are quite different from the ones described in the specialised literature - for instance, a joke which exploits phonetic devices to produce a funny effect, or a tweet in which the humour is self-contained in the situation. In this scenario, we suggest a set of features which work together as a system: no single feature is particularly humorous or ironic, but together they provide a useful linguistic inventory for detecting humour and irony at the textual level.
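
A toy feature extractor in this spirit is sketched below: no single cue is decisive, but together the features could feed a classifier. The cue inventory and word lists are invented for illustration and are not the actual features of this work.

```python
POSITIVE = {"great", "wonderful", "love", "brilliant"}
NEGATIVE = {"rain", "broken", "late", "queue"}

def irony_features(text):
    """Surface cues that, jointly, hint at an ironic reading."""
    tokens = [t.strip('.,!?"\'') for t in text.lower().split()]
    return {
        "exclamations": text.count("!"),
        "ellipsis": text.count("..."),
        "scare_quotes": text.count('"') // 2,
        "pos_words": sum(t in POSITIVE for t in tokens),
        "neg_words": sum(t in NEGATIVE for t in tokens),
        # Positive wording in a negative situation: a crude proxy for the
        # polarity-negation phenomenon mentioned above.
        "polarity_clash": int(any(t in POSITIVE for t in tokens)
                              and any(t in NEGATIVE for t in tokens)),
    }

print(irony_features('Oh great, more rain... just "wonderful"!'))
```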

Bio

Paolo Rosso is an associate professor at the Natural Language Engineering Lab of the Universidad Politécnica de Valencia. He presents joint work with Antonio Reyes, a PhD student in his lab working on irony detection. Among Paolo Rosso's main research interests are plagiarism detection, irony detection, sentiment analysis, and automatic humour recognition.

9 December 2010 - Janneke Van De Loo

Title

Automatic Detection of Syntactic Errors in an ASR-based CALL System

Abstract

We present a new method, called SynPOS: syntactic analysis using POS tags. SynPOS is applied to a corpus of spoken human-machine interactions. The results show that learners of Dutch often make syntactic errors, that there are many different types of syntactic errors, and that their frequencies vary considerably. This information can subsequently be used to select errors and develop exercises for CALL systems.
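
The abstract does not spell out SynPOS's internals, so the sketch below only illustrates the general idea of flagging errors by matching patterns over POS-tag sequences; the tag set and the two error patterns are hypothetical.

```python
import re

# Hypothetical error patterns over a POS-tag sequence for learner Dutch,
# e.g. a determiner directly followed by a finite verb ("de loopt").
ERROR_PATTERNS = {
    "det+finite-verb": re.compile(r"\bDET VRB\b"),
    "double-determiner": re.compile(r"\bDET DET\b"),
}

def flag_errors(pos_tags):
    """Return the names of error patterns matched by a POS-tag sequence."""
    seq = " ".join(pos_tags)
    return [name for name, pat in ERROR_PATTERNS.items() if pat.search(seq)]

print(flag_errors(["DET", "VRB", "NOUN"]))   # ['det+finite-verb']
print(flag_errors(["DET", "NOUN", "VRB"]))   # []
```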

6 December 2010 - Efstathios Stamatatos

Title

Classification Methods in Modern Authorship Analysis

Abstract

During the last decade, text categorization has developed substantially, providing effective methods able to deal with thousands of documents and multiple categories. Beyond topic, style can also be used as a discriminating factor. Authorship analysis deals with the personal style of the authors of electronic documents. Typical authorship analysis tasks include authorship attribution (a text of unknown authorship is assigned to one candidate author, given a set of candidate authors for whom text samples of undisputed authorship are available), authorship verification (deciding whether a given text was written by a certain author or not), and author profiling (extracting information about the age, education, sex, etc. of the author of a given text). According to the properties of a particular application, a specific text categorization setting has to be defined: single-label vs. multi-label classification, hierarchical vs. flat classification, closed-set vs. open-set classification. In this presentation, we examine the main classification paradigms for style-based text categorization, focusing on how they treat the writing style: cumulatively for each class or individually for each document. In more detail, we distinguish two main approaches: (i) the profile-based paradigm, which extracts only one representation vector per author, and (ii) the instance-based paradigm, which extracts one representation vector per document. Several state-of-the-art methods are examined and a detailed comparison is provided based on factors such as the type of representation they can handle, the required computational cost, and the ability to handle short texts and imbalanced training data, suggesting their suitability for certain authorship analysis problems.
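
The contrast between the two paradigms can be made concrete with a small sketch (illustrative data and feature choice, not a method from the talk): the profile-based approach collapses each author's texts into one vector, while the instance-based approach keeps one vector per document.

```python
from collections import Counter

def char_ngram_vector(text, n=3):
    """Character trigram counts, a common style marker."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

docs = {"authorA": ["the cat sat on the mat", "the cat ran"],
        "authorB": ["colourless green ideas", "green ideas sleep"]}

# Profile-based: one cumulative representation per author.
profiles = {a: char_ngram_vector(" ".join(texts))
            for a, texts in docs.items()}

# Instance-based: one representation per document, label kept alongside.
instances = [(a, char_ngram_vector(t))
             for a, texts in docs.items() for t in texts]

print(len(profiles), "profiles;", len(instances), "instances")
```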

6 December 2010 - Harald Baayen

Title

Exploring the potential of naive discriminative learning for the analysis of (psycho)linguistic data

Abstract

In 1972, Rescorla and Wagner formulated recurrence equations for human and animal learning that have proved to be surprisingly fruitful in psychology. Danks (2003) introduced a technical innovation that makes it possible to estimate very efficiently the state of the learning system when it is in equilibrium. In my presentation, I will present two examples demonstrating that Rescorla-Wagner-Danks discriminative learning has much to offer for linguistic and psycholinguistic modelling as well as data analysis. First, I will introduce a computational model predicting lexical decision latencies for visual comprehension based on naive discriminative learning. The model is very sparse in free parameters, yet explains a wide range of empirical findings, including whole-word and phrasal frequency effects, without having to posit separate representations for complex words or phrases. In other words, the model combines excellent predictions with extreme representational parsimony. Second, I will discuss examples where naive discriminative learning appears to outperform logistic mixed models fitted to the same data. Furthermore, naive discriminative learning provides the researcher with sufficient detail to pinpoint a potential weakness of the mixed-effects regression modelling approach. For the data sets examined thus far in this line of research, it seems that naive discriminative learning has the potential to be developed into a statistical tool complementing other classifiers such as logistic and polytomous mixed-effects models, random forests, and nearest-neighbour based methods.
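
For reference, a minimal sketch of the Rescorla-Wagner updates themselves (Danks's equilibrium shortcut is not shown; this version simply iterates the recurrence until the weights settle). The toy trials and parameters are invented for illustration.

```python
import numpy as np

def rescorla_wagner(trials, n_cues, alpha=0.1, beta=1.0, lam=1.0, epochs=50):
    """Iterate the Rescorla-Wagner update: for every cue present on a trial,
    dV = alpha * beta * (outcome - summed activation of the present cues)."""
    V = np.zeros(n_cues)
    for _ in range(epochs):
        for cues, outcome in trials:
            activation = V[cues].sum()
            V[cues] += alpha * beta * ((lam if outcome else 0.0) - activation)
    return V

# Toy data: cue 0 reliably predicts the outcome, cue 1 only sometimes.
trials = [([0], True), ([0, 1], True), ([1], False)]
print(rescorla_wagner(trials, n_cues=2).round(2))  # cue 0 ends up near 1, cue 1 near 0
```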

27 May 2010 - Pierre Isabelle

Title

The Evolving Research Landscape in Statistical Machine Translation

Abstract

After briefly introducing the basic concepts of statistical machine translation, we will present a non-technical overview of recent and current research in that area. Our survey will cover a broad spectrum of themes, including the following: I) System evaluation; II) Acquisition of training data; III) Phrase-based models: training and "decoding"; IV) Word order and syntax; V) System adaptation; VI) Multilingual coverage; VII) System combination techniques; VIII) SMT-based tools for translators. We will also discuss the extent of recent technological progress in translation accuracy and system performance.
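
As background for theme III: phrase-based systems typically rank translation hypotheses with a log-linear combination of feature functions. The sketch below shows that scoring scheme with made-up features and weights (illustrative only, not a system from the talk).

```python
import math

# score(e|f) = sum_i lambda_i * h_i(e, f): the log-linear model at the
# heart of phrase-based SMT. Features and weights here are invented.
def score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

weights = {"log_p_translation": 1.0, "log_p_lm": 0.6, "length_penalty": -0.2}
candidates = {
    "hypothesis 1": {"log_p_translation": math.log(0.4),
                     "log_p_lm": math.log(0.2), "length_penalty": 5},
    "hypothesis 2": {"log_p_translation": math.log(0.3),
                     "log_p_lm": math.log(0.5), "length_penalty": 6},
}
best = max(candidates, key=lambda h: score(candidates[h], weights))
print(best)
```

In a real decoder this score guides the search over exponentially many hypotheses rather than a rerank of two fixed candidates.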

Bio

Pierre Isabelle is Group Leader for Interactive Language Technologies at the National Research Council of Canada (NRC-CNRC), Institute for Information Technology.

1 March 2010 - Yoshimasa Tsuruoka

Title

Scalable Natural Language Processing and Biomedical Text Mining

Abstract

This talk will cover our recent research efforts for building efficient and scalable text mining applications. I will start by giving a brief demonstration of our text mining system for discovering previously unknown relations between biomedical concepts such as genes and diseases from a large collection of documents. The second topic of the talk will be an efficient online machine learning algorithm that allows us to create compact probabilistic models for the various types of learning problems that we face in building text mining and natural language processing components. Finally, I will talk about some task-specific models and efficient algorithms for different types of natural language analysis, including part-of-speech tagging and syntactic parsing.
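
The talk's specific algorithm is not described in the abstract; as a generic illustration of how online learning can yield compact models, the sketch below trains a logistic regression one example at a time and shrinks weights with a clipped L1 penalty, driving many of them to exactly zero.

```python
import numpy as np

def sgd_l1_logistic(data, n_features, eta=0.1, l1=0.01, epochs=5):
    """Online logistic regression with a clipped L1 penalty: after each
    gradient step, weights are pulled toward zero and clipped at zero,
    which yields a sparse (hence compact) model."""
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in data:                      # one example at a time
            p = 1.0 / (1.0 + np.exp(-w @ x))
            w += eta * (y - p) * x             # stochastic gradient step
            w = np.sign(w) * np.maximum(np.abs(w) - eta * l1, 0.0)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = (X[:, 0] > 0).astype(float)                # only feature 0 matters
data = list(zip(X, y))
w = sgd_l1_logistic(data, 20)
print(f"{int(np.sum(w != 0))} of 20 weights non-zero")
```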

Bio

Yoshimasa Tsuruoka is an associate professor at the School of Information Science of the Japan Advanced Institute of Science and Technology (JAIST).

2 February 2010 - Gerhard van Huyssteen

Title

Some Thoughts on Rule-Based Conversion between Dutch and Afrikaans

Abstract

We describe the development and performance of a rule-based Dutch-to-Afrikaans convertor, with the aim of transforming Dutch text so that it looks more like Afrikaans text. Our convertor is a first step towards an Afrikaans-to-Dutch convertor, which could be used to fast-track the development of resources for a closely related, under-resourced language on the basis of a well-resourced one. The rules we use are based on systematic orthographic, morphosyntactic and lexical differences between the two languages, which we describe in detail. We take a modular approach to the design of the system, in order to facilitate easy adaptations and changes during experimentation - specifically for linguists and language students, who might not have a good command of Perl. We report on various evaluation results, and discuss various strategies for further optimisation and future work.
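
A minimal sketch of the rule-based approach: ordered rewrite rules over Dutch orthography (shown in Python rather than the Perl mentioned above). The three rules below are well-known Dutch-Afrikaans correspondences chosen purely for illustration; they are not the convertor's actual rule set, and a real system needs many more rules plus exception handling.

```python
import re

RULES = [
    (re.compile(r"ij"), "y"),       # Dutch "mij"   -> Afrikaans "my"
    (re.compile(r"z"), "s"),        # Dutch "zee"   -> Afrikaans "see"
    (re.compile(r"tie\b"), "sie"),  # Dutch "natie" -> Afrikaans "nasie"
]

def convert(text):
    """Apply the rewrite rules in order; rule order can matter."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(convert("zij kijken naar de zee"))  # -> "sy kyken naar de see"
```

Even this naive pass makes the output noticeably more Afrikaans-like, which is exactly the property the convertor exploits when porting resources between the two languages.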

Bio

Gerhard van Huyssteen is Research Group Leader of the HLT research group at the Meraka Institute, CSIR, South Africa. The work presented during the colloquium is joint work with Suléne Pilon, Centre for Text Technology, North-West University.