Research team

Centre for Computational Linguistics and Psycholinguistics (CLiPS)

Expertise

Development of systems for natural language processing / Computer models for language acquisition and language processing / Text analytics / Computational stylometry / Corpus acquisition, annotation and exploitation

Intelligent document management through automatic topic discovery. 01/08/2021 - 31/07/2025

Abstract

In this project, we will develop technology for the automatic extraction of topics from documents (topic modeling) to enhance Textgain's intelligent document management software product (Ocelot). For the company, this will lead to attracting new customers and diversification of income streams. For research, this project will investigate and develop innovative techniques and methods for topic modeling using zero-shot learning and contextual embeddings.

Advancing the Open Humanities Service Infrastructure (CLARIAH-VL). 01/02/2021 - 31/01/2025

Abstract

CLARIAH-VL constitutes the Flemish contribution to the European DARIAH (Digital Humanities) and CLARIN (Computational Linguistics) research infrastructures (ERICs). Building on the work of these landmark ERICs, CLARIAH-VL will join the efforts of their respective Flemish consortia towards further development and valorisation of high-quality, modular, user-friendly tools, resources, and services by and for humanities researchers. The infrastructure brings together 22 research teams representing a range of disciplines from the universities of Ghent, Antwerp, Leuven and Brussels and the Dutch Language Institute. CLARIAH-VL will continue catering to the highly diverse and multilingual composition of digital humanities data inherent in European long-term history, culture, environment and society. To facilitate and (semi-)automate as many aspects of the workflows of humanities researchers as possible, each service component of the infrastructure will need to take full advantage of the most recent advances in the fields of machine learning, linked data and semantic technologies, especially with regard to digital text and image analysis.

European Language Equality (ELE). 01/01/2021 - 30/06/2022

Abstract

Twenty-four official languages and more than 60 regional and minority languages constitute the fabric of the EU's linguistic landscape. However, language barriers still hamper communication and the free flow of information across the EU. Multilingualism is a key cultural cornerstone of Europe and signifies what it means to be and to feel European. Many studies and resolutions, as noted in the recent EP resolution "Language equality in the digital age", have found a striking imbalance in terms of support through language technologies and issue a call to action. This project answers this call and lays the foundations for a strategic agenda and roadmap for making digital language equality a reality in Europe by 2030. The primary goal of ELE is to prepare the European Language Equality Programme, in the form of a strategic research, innovation and implementation agenda and a roadmap for achieving full digital language equality in Europe by 2030. This programme will be prepared jointly with the whole European Language Technology, Computational Linguistics and language-centric AI community, as well as with representatives of relevant initiatives and associations, language communities and RML groups. The consortium includes all relevant scientific and industrial stakeholders from all Member States and Associated Countries and engages them in the process. The whole community is included in the project through external consultation sessions. The project plan is fully optimised towards this key goal of preparing the strategic agenda and roadmap and of involving the whole European LT community. Ensuring appropriate technology support for all European languages will create jobs, growth and opportunities in the digital single market. Equally crucial, overcoming language barriers in the digital environment is essential for an inclusive society and for providing unity in diversity for many years to come. The ELE project provides a roadmap and framework to achieve this.

Research Centre on Representatives and their Communication (RCRC). 01/01/2020 - 31/12/2025

Abstract

In a context of increasing popular dissatisfaction with political representation across Western democracies, PREPINTACT examines the beliefs, attitudes and behavior of three types of individual intermediary actors (politicians, interest group leaders and journalists) in tandem with the parallel beliefs, attitudes and behavior of ordinary citizens. It argues that in order to get a better grip on how representation works, we need to focus on individual intermediaries. We examine the up- and downstream flows of information that form the core of representation and that connect society with the government system. PREPINTACT has a special interest in political inequality and hypothesizes that disadvantaged societal groups are less adequately represented. Within that general framework, the consortium launches a number of specific, comparative research projects using a range of methods combining social science (experiments, surveys, interviews…) with computational linguistics approaches. The concrete projects look into the accuracy of intermediaries' perception of public opinion, the social bias in their personal networks, the selective communication to their voters/members/audience, the role of social media in reinforcing their attitudes, how they represent within their organizations (parties, media organizations...), etc. Taken together, these projects constitute an unprecedented, in-depth analysis of how individual intermediaries make representative democracy work (or not).

Dialect Syntax Revisited. 01/01/2020 - 31/12/2024

Abstract

The scientific research network Re-Examining Dialect Syntax (REEDS-network) brings together linguistic researchers from Flanders, Europe and the US from different empirical and theoretical backgrounds and with complementary expertise, in an attempt to arrive at a deeper, more rounded and better grounded understanding of dialect syntax in particular and language variation in general.

European Language Grid. 12/12/2019 - 30/06/2022

Abstract

ELG will strengthen the commercial European Language Technology landscape by establishing a pan-European marketplace. CLiPS is the National Competence Centre (NCC) for Belgium. ELG has set up 32 NCCs to establish a strong European network. They will act as regional bridges to the project. The NCCs will support ELG in collecting regional information about companies, research centres, resources, services and projects. They will organise regional ELG workshops, promote ELG in their area and establish bridges to funding agencies.

Reputation and Structural Reforms of Public Organizations: Explaining Temporal Dynamics. 01/11/2019 - 31/10/2022

Abstract

This proposal studies the temporal dynamics between the reputation of public organizations and the structural reforms they experience. Public organizations perform important services in society. When performance of these organizations is perceived to be problematic, political and administrative actors are often urged to initiate structural reforms (e.g., reshuffling tasks between organizations, merging or changing the legal status of organizations). Therefore, reforms have symbolic value as signals to society that problems concerning public sector performance are being perceived and acted upon. However, no studies have examined on a large sample how the perceived performance of public organizations (i.e., their reputation) affects these organizations' chances of being reformed. Neither do we know how structural reforms in turn impact organizations' future reputations. This proposal addresses these gaps. The dynamics between reputation and reforms through time are studied on a diverse set of 60 Flemish public organizations. Specific attention goes to the moderating role of reputation management strategies of these organizations and to several organizational and environmental conditions. The proposal will benefit from the most recent developments in machine learning techniques to automatically collect data on the multiple facets of the reputation and reputation management of public organizations. Advanced statistical models allow us to analyze the complex relations through time.

Accommodation and non-accommodation in adolescents' informal online writing: Social determiners and linguistic effects. 01/10/2019 - 19/01/2023

Abstract

The proposed study will analyze how teenagers adapt their informal online writing to their conversation partner, and by which social and contextual factors this process of accommodation is influenced. Since linguistic accommodation remains largely un(der)explored for social media writing, the project fills a gap. It will investigate the impact of multiple aspects of adolescents' socio-demographic profile and their interaction on a wide range of linguistic and pragmatic features. We will examine whether divergent patterns of linguistic adjustment can be observed for teenagers with distinct socio-demographic profiles, and which language features appear to be most or least affected. A major distinction will be made between analyses of robust intergroup accommodation and in-depth diachronic analyses of accommodation between particular individuals. This unique design might lead to challenging sociolinguistic findings with respect to the profile of (non-)accommodators. While it will primarily increase our understanding of the social, linguistic and pragmatic parameters that govern accommodative language behavior, it may in the end also open up a unique perspective on language change. Moreover, on a more general, theoretical level, this project aims to accurately delimit the concept of accommodation, in order to answer the fundamental question of whether we can unambiguously discriminate between true accommodation and other instances of linguistic adaptation.

Flanders AI. 01/07/2019 - 31/12/2021

Abstract

The Flemish AI research program aims to stimulate strategic basic research focusing on AI at the different Flemish universities and knowledge institutes. This research must be applicable and relevant for the Flemish industry. Concretely, the program addresses four grand challenges:
1. Help to make complex decisions: focusses on complex decision-making despite the potential presence of incorrect or missing information in the datasets.
2. Extract and process information at the edge: focusses on the use of AI systems at the edge instead of in the cloud through the integration of software and hardware and the development of algorithms that require less power and other resources.
3. Interact autonomously with other decision-making entities: focusses on the collaboration between different autonomous AI systems.
4. Communicate and collaborate seamlessly with humans: focusses on the natural interaction between humans and AI systems and the development of AI systems that can understand complex environments and can apply human-like reasoning.

The linguistic landscape of hate speech on social media. 01/01/2019 - 31/12/2022

Abstract

Hate speech online is a widespread social phenomenon that frequently receives a lot of media attention. We are interested in the language that is being used to express hate in social media, specifically hate against migrants and LGBT people. After gathering enough examples from public Facebook pages, we will develop methods to automatically analyze the language in these texts. The analysis will be on different levels. Some simple forms of analysis include counting words, looking at spelling mistakes, and investigating grammatical aspects. In the more complex analysis we will examine the use of metaphors, the context of the hate speech and how the hate speech can be implicit in the text, rather than overtly present. Apart from the linguistic description of this phenomenon, we strive to build systems that can automatically recognize hate speech in social media text. The project is in cooperation with research groups in Slovenia and targets Dutch, Slovene, and English.

Solving Combinatorial and Probabilistic Problems in Natural Language. 01/01/2018 - 31/12/2021

Abstract

This project aims to develop a fully automated approach to solving exercises about combinatorics and probability that can be found in introductory textbooks on discrete mathematics. The ability to solve such problems is an important cognitive and intellectual skill, as it is evaluated as part of academic admission tests such as SAT, GMAT and GRE. The combinatorics and probability questions will be formulated in natural language and the task will be to automatically answer these questions. We shall develop a two-step approach for tackling this task. In the first step, a question formulated in natural language will be analysed and transformed into a high-level model specified in a declarative language. In the second step, the high-level model will be solved using the inference mechanisms of the declarative modeling language. The language and its solvers will be based on principles of probabilistic programming, an increasingly popular programming paradigm. While the immediate goal is to solve textbook exercises, the long-term goal is to contribute to the automation of probabilistic and combinatorial problem solving and to enable the modeling and programming of such problems in natural language, two goals that are highly relevant to cognitive computing and artificial intelligence.
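
To illustrate the second step of the two-step approach described above, the sketch below solves a small textbook-style exercise from a hand-written declarative model. It is only an illustration under stated assumptions: the `model` dictionary format and the `solve` function are hypothetical and are not the project's declarative language or solver, and the first step (deriving the model from the natural-language question) is not shown.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical declarative model (not the project's actual language) for:
# "Two cards are drawn without replacement from a standard 52-card deck.
#  What is the probability that both are aces?"
model = {
    # Every card as a (rank, suit) pair; rank 1 stands for the ace.
    "universe": [(rank, suit) for rank in range(1, 14) for suit in "SHDC"],
    # Number of cards drawn without replacement, order ignored.
    "draw": 2,
    # The event of interest, expressed as a predicate over one outcome.
    "event": lambda hand: all(rank == 1 for rank, _ in hand),
}


def solve(model):
    """Exact inference by enumerating all equally likely outcomes."""
    outcomes = list(combinations(model["universe"], model["draw"]))
    favourable = sum(1 for hand in outcomes if model["event"](hand))
    return Fraction(favourable, len(outcomes))


probability = solve(model)
print(probability)  # 1/221
```

Brute-force enumeration is used here only because the example is tiny; it says nothing about the inference mechanisms the project itself develops.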

Intelligent Neural Systems as InteGrated Heritage Tools (INSIGHT). 15/12/2016 - 31/07/2022

Abstract

The INSIGHT project aims to advance the application of automated algorithms from the field of Artificial Intelligence to support cultural heritage institutions in their effort to keep up with their ongoing annotation initiatives for their expanding digital collections. We will focus on recent advances in Machine Learning, where the application of neural networks (Deep Learning) has recently led to significant breakthroughs, for instance, in the fields of Natural Language Processing and Computer Vision. We will determine how state-of-the-art algorithms can be used to (semi-)automatically catalogue and describe digital objects, especially those for which no, little or incomplete metadata is available. The project focuses on making the digital collections of two federal museum clusters in Brussels ready to be exported to Europeana, i.e. the Royal Museums of Fine Arts of Belgium and the Royal Museums of Art and History.

'I represent the people, and my opponent does not!' The effects of representative claims on citizens' feeling of being (un)represented. 01/11/2020 - 31/10/2021

Abstract

Many studies examine popular resentment with politics. It seems that citizens have the feeling that they are not properly being represented by the politicians and parties they elect. One should ask, where do these feelings come from? In this project, I argue that we might find part of the explanation in politicians' communication, particularly, in the representative claims they make. Politicians claim to represent others every day (e.g. I represent women) and claim that other politicians do not (e.g. He does not represent the people). Being mentioned as 'the represented' might make some people feel well-represented, while others might feel ignored (or relatively deprived) and therefore feel unrepresented. Yet, at present, very little to no research has defined what it means to feel (un)represented, let alone measured the concept. Similarly, little empirical research has been done on politicians' representative claims. Consequently, we know close to nothing about the possible effects of these claims on the extent to which people feel (un)represented by politicians and their parties. This project aims to tackle these gaps in the literature in three steps. First, by operationalizing and measuring 'feeling (un)represented'. Second, by measuring and analyzing the various claims Flemish politicians make. Lastly, the first two studies will serve as necessary input to experimentally test the effects of representative claims on citizens' feeling of being (un)represented.

Artificial intelligence for creative language use. 01/12/2019 - 30/11/2021

Abstract

Recent progress in Natural Language Processing (NLP) has resulted in reliable pattern matching techniques (mostly based on deep neural networks) for many NLP tasks (text to speech, speech to text, text generation, text translation, multimodality, text analysis, …). The creative use of language (e.g. in advertising slogans, song texts, humor, irony, metaphor, …) has remained out of reach of current approaches. We will investigate how the improved state of the art in 'literal' language processing can push the design of creative language processing systems. Valorisation roadmap: The research addresses two types of users and applications: (i) professional writers who will be able to use tools to generate ideas and concepts (puns, jokes, titles, short texts with metaphors) and (ii) language enthusiasts who will be provided with tools that can boost their output by producing examples and ideas. Approach: 1. Development of proofs of concept of domain-dependent creative writing. 2. Design of applications in copy-writing. 3. Design of applications in entertainment writing.

Big Data of the Past for the Future of Europe (Time Machine). 01/03/2019 - 29/02/2020

Abstract

Europe urgently needs to restore and intensify its engagement with its past. Time Machine will give Europe the technology to strengthen its identity against globalisation, populism and increased social exclusion, by turning its history and cultural heritage into a living resource for co-creating its future. The Large Scale Research Initiative (LSRI) will develop a large-scale digitisation and computing infrastructure mapping millennia of European historical and geographical evolution, transforming kilometres of archives, large collections from museums and libraries, and geohistorical datasets into a distributed digital information system. To succeed, a series of fundamental breakthroughs are targeted in Artificial Intelligence and ICT, making Europe the leader in the extraction and analysis of Big Data of the Past. Time Machine will drive Social Sciences and Humanities toward larger problems, allowing new interpretative models to be built on a superior scale. It will bring a new era of open access to sources, where past and on-going research are open science. This constant flux of knowledge will have a profound effect on education, encouraging reflection on long trends and sharpening critical thinking, and will act as an economic motor for new professions, services and products, impacting key sectors of the European economy, including ICT, creative industries and tourism, the development of Smart Cities and land use. The CSA will develop a full LSRI proposal around the Time Machine vision. Detailed roadmaps will be prepared, organised around science and technology, operational principles and infrastructure, exploitation avenues and framework conditions.
A dissemination programme aims to further strengthen the rapidly growing ecosystem, currently counting 95 research institutions, the most prestigious European cultural heritage associations, large enterprises and innovative SMEs, influential business and civil society associations, and international and national institutional bodies.

CLARIAH-VL: Open Humanities Service Infrastructure. 01/02/2019 - 31/01/2021

Abstract

CLARIAH-VL: Open Humanities Service Infrastructure is the Flemish contribution to the European DARIAH and CLARIN infrastructures. It brings together and extends the portfolio of services enabling digital scholarship in the Arts and Humanities offered by the DARIAH-VL Virtual Research Environment Service Infrastructure (VRE-SI; Hercules & FWO 2015-2018) with the digital tools and language data that are offered through CLARIN-DLU/Flanders. The consortium which includes the network of Digital Humanities Research Centres at the universities of Antwerp, Brussels, Ghent and Leuven has been extended with the Dutch Language Institute (INT) – the CLARIN-ERIC certified B-Centre for Flanders. CLARIAH-VL will implement a modular research infrastructure embedding high-quality, user-friendly tools and resources into the workflows of humanities researchers in the five focus areas of linguistics; literature; socio-economic history; media studies; ancient history and archaeology. CLARIAH-VL aims to provide sustainable services, while fostering experimental development and innovation. Offering an open infrastructure which facilitates public humanities is a guiding principle for CLARIAH-VL. It will ensure the accessibility and relevance of the humanities to the general public, specific (heritage) community groups and policy makers. It will make it technically possible to share knowledge, including sharing and co-creating knowledge with non-specialist users, such as facilitating citizen science and crowdsourcing projects. Furthermore, by implementing international best practices in FAIR (Findability, Accessibility, Interoperability and Reusability) Research Data Management (RDM), CLARIAH-VL will pave the way to Flemish participation in the European Open Science Cloud.

The role of semantics in modeling the bilingual mental lexicon. 01/10/2018 - 18/06/2020

Abstract

Bilinguals, people who simultaneously know and use two or more languages, are an interesting source of clues for discovering the internal make-up of our language system. Specifically, it is interesting how bilinguals are able to reliably access the right words in the right language without making mistakes, even though languages contain significant amounts of overlap in terms of semantics, orthography and phonology. In computational psycholinguistics, we model phenomena such as word retrieval via computer models. Despite the fact that we do not have access to the actual word store embedded in our mind, modeling can provide us with clues as to how it is organized, more particularly, by constructing models that can simulate key findings in psycholinguistic experiments. Having said that, current models for bilingual word reading can account for most of the facts, but largely underspecify a crucial component of our day-to-day word retrieval: meaning. Moreover, and related to this shortcoming, most models of word access have only modeled words in isolation. In reality, however, words are always embedded in sentences and larger linguistic and non-linguistic contexts, which also influence the way we access our words. By creating models of sentence processing, we can make sure that meaning has a more central role in our models, and thereby give new explanations for several phenomena in bilingual word processing.

Sabbatical Leave Project, 2018-2019. 01/10/2018 - 30/09/2019

Abstract

Two sub-projects are addressed: in stylometry, methodological issues are addressed, especially related to personality prediction from text: feature optimization, data acquisition and quality, model selection, and especially explanation of trained machine learning models. In machine learning for natural language, approaches are investigated on how to combine knowledge and reasoning with the currently predominant deep learning "black boxes".

CATCH 2020: Computer-Assisted Transcription of Complex Handwriting. 01/05/2018 - 30/04/2021

Abstract

CATCH 2020 aims to provide a working infrastructure for the computer-assisted transcription of complex handwritten documents. It will do so by building on the existing Transkribus platform for Handwritten Text Recognition (HTR), which allows us to process handwritten textual documents in a way that is similar to how OCR processes printed textual documents. Rather than producing flat transcripts of digital facsimile images, however, CATCH 2020 will produce structured texts, providing tools to add textual and linguistic dimensions to the transcription by combining the state of the art in textual scholarship with the state of the art in computational linguistics.

Timemachine. 01/10/2017 - 30/09/2020

Abstract

What if you could travel through time as easily as we travel through space? With the Time Machine consortium, we work towards a FET Flagship project to build a large-scale simulator capable of mapping more than 2000 years of European history. This big data of the past, a common resource for the future, will trigger pioneering and momentous cultural, economic and social shifts. Understanding the past is undoubtedly a prerequisite for understanding present-day societal challenges and contributes to more inclusive, innovative and reflective societies. Researchers from all over the world are joining forces within the Time Machine FET Flagship project to reinvigorate the past through one of the most ambitious projects ever on European culture and identity. The fundamental idea of this project is based on Europe's truly unique asset: its long history, its multilingualism and interculturalism.

Optimization of the adaptability of clinical information extraction systems: deep learning and use of feedback propagation techniques. 01/09/2017 - 31/08/2021

Abstract

Large amounts of unstructured medical data (for example clinical notes) are available today, which offers opportunities for optimization of healthcare quality and patient security. Although Natural Language Processing technology already offers great tools and solutions to automate the processing of medical documents, the performance of this technology often decreases when the extraction context changes (medical specialty, hospital, physician's writing style). This project will study the possibility of a scalable NLP engine able to adapt to such new contexts. To reach this goal, we will explore and combine approaches based on deep neural networks, the human-in-the-loop paradigm and persistent learning. The project is a collaboration with LynxCare Clinical Informatics, a medical IT company focusing on promoting access to medical information and reducing administrative costs in hospitals.

How political news affects and is affected by citizens in the social media age. Theoretical challenges and empirical opportunities. 01/01/2017 - 31/12/2020

Abstract

In a democracy, citizens need knowledge about politics. The mass media are traditionally considered as key actors in providing this necessary information. Ample studies on agenda-setting and framing have shown time and again that the news media have a profound influence on what people know, and how they think about politics. The question is to what extent it is possible to maintain many of these classic insights in the digital era. The increasing importance of the Internet and in particular social media as a means of communication and information has likely changed how people learn about what is going on in the world, and about politics more specifically. For instance, the agenda-setting and framing role of the media is challenged, because social media use puts the underlying causal mechanism, from mass media to the public, into question. More and more journalists are influenced by discussions on blogs, Facebook, Twitter and other platforms. In addition, politicians have more digital opportunities to directly influence the public while bypassing the traditional media. In short, we aim to study how people consume and engage with political news and how they are affected by it, but also how journalists and politicians are, in turn, influenced by people's engagement with the news. Digital media not only challenge some of the established theoretical insights but simultaneously also offer new opportunities to study how information spreads and how the public deals with it. Today, it is possible to map all online news and all citizens' digital reactions to it (comments, likes, tweets). This makes it possible to study agenda-setting processes much more accurately by looking at how people interact with news. Framing, as well, can now be studied much more precisely, drawing on much larger samples of citizens and media messages.
In addition, analyzing digital text and expressed opinion in social media allows demographic and attitudinal profiling of citizens that could strongly increase our knowledge of the individual moderators of agenda-setting and framing effects. To make sense of this unprecedented source of written language and digital behaviour, we opt for a multidisciplinary collaboration between computational linguistics, data mining and social sciences. The appropriateness of social scientific theories of agenda-setting and framing will be put to the test in a digital context by means of big data analyses. Computational linguistics techniques will be used to automatically analyze the topics addressed in social media text, the opinions expressed about these topics, and the profiles of the social media users expressing these opinions. The possibilities of digital text analysis, however, go beyond testing classic media effects theories such as agenda-setting and framing. Our ambition is to use the new data opportunities to develop new theoretical insights by discovering underlying patterns in an inductive fashion. By applying data mining techniques to the data of users' digital behavior and searching for underlying patterns, we may obtain insights into which events, persons and topics ordinary citizens 'like' and want to 'share'. Concretely, we aim to study one planned major political event, the 2019 Belgian election campaign, and one non-planned or unexpected event in the course of 2018. We expect that the information flows in both types of events are structurally different. For each event we plan a survey and a large quantitative data collection covering about four weeks, with content drawn from all major online news websites, and the social media platforms Twitter and Facebook.

The role of semantics in modeling the bilingual mental lexicon. 01/10/2016 - 30/09/2018

Abstract

Bilinguals, people who simultaneously know and use two or more languages, are an interesting source of clues for discovering the internal make-up of our language system. Specifically, it is interesting how bilinguals are able to reliably access the right words in the right language without making mistakes, even though languages contain significant amounts of overlap in terms of semantics, orthography and phonology. In computational psycholinguistics, we model phenomena such as word retrieval via computer models. Despite the fact that we do not have access to the actual word store embedded in our mind, modeling can provide us with clues as to how it is organized, more particularly, by constructing models that can simulate key findings in psycholinguistic experiments. Having said that, current models for bilingual word reading can account for most of the facts, but largely underspecify a crucial component of our day-to-day word retrieval: meaning. Moreover, and related to this shortcoming, most models of word access have only modeled words in isolation. In reality, however, words are always embedded in sentences and larger linguistic and non-linguistic contexts, which also influence the way we access our words. By creating models of sentence processing, we can make sure that meaning has a more central role in our models, and thereby give new explanations for several phenomena in bilingual word processing.

Researcher(s)

Research team(s)

Deep linguistic features for computational stylometry. 01/10/2016 - 30/09/2018

Abstract

The goal of stylometry is to understand and model how variations in writing style are related to (properties of) the author of a text. This research provides insight into how psychological and sociological properties of the author such as age, gender, region, personality, and others, are reflected in his or her idiolect. Such models can also be used to predict these author properties on the basis of text analysis. Applications range from literary studies to forensic science.

Researcher(s)

Research team(s)

ACCUMULATE: Acquiring crucial medical information using language technology. 01/01/2016 - 30/06/2020

Abstract

The ACCUMULATE project will automatically recognise crucial information in the free text of clinical reports written in English and Dutch by designing, developing and evaluating advanced language technology (LT) for deep semantic processing of the texts that are often morpho-syntactically not well-formed. An additional focus is on easy portability of the technology across domains and languages and on the use of visualisation techniques.

Researcher(s)

Research team(s)

Periodization in Literary History: A Computational Model of the History of Dutch Literature. 01/10/2015 - 30/11/2015

Abstract

In literary history, scholars commonly divide the temporal series of events which they are discussing into periods (e.g. Romanticism). This process is called periodization and it is considered an important task of historical literary scholarship. In spite of its present-day relevance, periodization remains a surprisingly controversial process: some of the most influential models in literary history are considered a 19th-century inheritance whose validity is often questioned today. The objective of this project is to build a computational model of the history of Dutch-language literature in the Low Countries (13th-20th century). This diachronic model will use techniques from computational text analysis ("Distant Reading") to track changes in the stylistic and thematic characteristics of texts. Importantly, this will be a bottom-up model: it will be created in a data-driven manner, instead of setting out from existing (potentially preconceived) hypotheses. This model will be carefully interpreted and compared to the state of the art in traditional literary scholarship. This will allow us to verify and better understand the validity of established periodization models of Dutch literary history. This project will greatly contribute to the ongoing international debate about the integration of traditional, "close reading" methods in literary studies and new, computational methods for "distant reading".

Researcher(s)

Research team(s)

Digital Humanities Flanders. 01/01/2015 - 31/12/2019

Abstract

This is a fundamental research project financed by the Research Foundation – Flanders (FWO). The project was subsidized after selection by the FWO-expert panel. Its aim is to initiate cooperation between research groups.

Researcher(s)

Research team(s)

The interaction of gender and social class in Flemish online teenage talk. 01/01/2015 - 31/12/2018

Abstract

Social class differences in teenage speech remain largely unexplored, while gender has been focused on in quite a lot of sociolinguistic research on adolescent peer group language. The interest in gender differences has also pervaded the research on informal computer-mediated communication (CMC) and more specifically on the online writing practices of adolescents in chat or texting media, but then again, the link with social class is generally absent. Yet some studies (though not on CMC) suggest that gender differences manifest themselves in different ways in different social class groups. The present research is a first attempt to fill this gap, by focusing on the interaction between social class and gender in Flemish chat language produced by adolescents with a low versus a high level of education.

Researcher(s)

Research team(s)

Deep linguistic features for computational stylometry. 01/10/2014 - 30/09/2016

Abstract

The goal of stylometry is to understand and model how variations in writing style are related to (properties of) the author of a text. This research provides insight into how psychological and sociological properties of the author such as age, gender, region, personality, and others, are reflected in his or her idiolect. Such models can also be used to predict these author properties on the basis of text analysis. Applications range from literary studies to forensic science.

Researcher(s)

Research team(s)

A publicly available Economic Uncertainty Index for all G8 countries using text mining techniques. 01/10/2014 - 30/09/2015

Abstract

In this project we focus on the question: how can we measure economic policy uncertainty (EPU)? We recently proposed an EPU index for Belgium, by mining online news articles from all major newspapers. Given the promising results, we aim to apply this text mining-based methodology to other countries as well (G8 countries), and create a publicly available website where the index is automatically updated for all countries on a weekly basis.
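As a rough illustration of how a news-based uncertainty index can be computed, the sketch below uses a simple keyword approach: the share of articles in a period that mention an economy term, a policy term and an uncertainty term together. This is an assumption for illustration only and may differ from the methodology used in the project; the term lists and the function names `mentions_epu` and `epu_share` are hypothetical.

```python
# Hypothetical term lists; a real index would use curated, language-specific lists.
ECONOMY = {"economy", "economic"}
POLICY = {"government", "tax", "regulation", "budget"}
UNCERTAINTY = {"uncertain", "uncertainty"}

def mentions_epu(article):
    """True if the article contains at least one term from each of the three lists."""
    tokens = set(article.lower().split())
    return bool(tokens & ECONOMY) and bool(tokens & POLICY) and bool(tokens & UNCERTAINTY)

def epu_share(articles):
    """Fraction of articles in a period (e.g. one week) flagged by mentions_epu."""
    if not articles:
        return 0.0
    return sum(mentions_epu(a) for a in articles) / len(articles)
```

Calling `epu_share` once per week on that week's articles would give a weekly series; normalisation across newspapers and countries is left out of this sketch.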

Researcher(s)

Research team(s)

Text analytics web services for profiling and opinion mining. 01/02/2014 - 31/01/2015

Abstract

Our aim is to implement commercial web services for automatic opinion detection and author profiling (age, gender, personality, education, dialect) in text. In this project we will develop the core technology: data mining and annotation, machine learning and setting up the server. In a follow-up project we will then launch a spin-off company. This kind of language technology is useful for a wide range of big data applications, and does not yet exist for Dutch, and only in part for English.

Researcher(s)

Research team(s)

Data fusion and structured input and output Machine Learning techniques for automated clinical coding. 01/01/2014 - 31/12/2017

Abstract

This project will improve the state of the art in automated clinical coding by analyzing heterogeneous data sources and defining them in a semantic structure and by developing novel data fusion and machine learning techniques for structured input and output.

Researcher(s)

Research team(s)

Bootstrapping operations in language acquisition: a computational psycholinguistic approach. 01/01/2014 - 31/12/2017

Abstract

The acquisition of abstract linguistic categories is investigated. Computational models of bootstrapping operations are constructed in order to investigate how knowledge from one domain can be instrumental in acquiring knowledge of another domain. In our simulations the language addressed to very young children is used in an attempt to elucidate how grammatical categories and grammatical gender are acquired given a combination of distributional, phonological and morphological bootstrapping.

Researcher(s)

Research team(s)

Evaluation of tools within the SUCCEED project. 25/10/2013 - 24/10/2014

Abstract

This project represents a formal service agreement between UA and the University of Alicante. UA provides the University of Alicante with the research results mentioned in the title of the project under the conditions stipulated in this contract.

Researcher(s)

Research team(s)

Automatic Monitoring for Cyberspace Applications (AMiCA). 01/01/2013 - 31/12/2016

Abstract

This project represents a research agreement between UA and IWT. UA provides IWT with the research results mentioned in the title of the project under the conditions stipulated in this contract.

Researcher(s)

Research team(s)

Project website

Digital Archive of Belgian Neo-Avant-garde Periodicals (DABNAP). 01/01/2013 - 31/12/2014

Abstract

Post-war artists' periodicals are a prime example of the neo-avant-garde DIY ethos, and simultaneously constitute a crucial source of information about this movement. This project aims to digitize a substantial and representative corpus of Belgian neo-avant-garde periodicals. Subsequently, innovative language processing tools will be applied in order to extract and visualize the network of artists who were behind the periodicals.

Researcher(s)

Research team(s)

Project website

A medieval Stylome? Exploring the Universal Stylome Hypothesis in medieval prose. 01/10/2012 - 30/09/2015

Abstract

In this project I will further explore the applicability of the Stylome Hypothesis in medieval literature: 1. I will apply computational stylometry to medieval prose. Because so many (anonymous) medieval prose texts survive, stylometric techniques for authorship attribution in prose are highly relevant. The proposed case study targets religious prose (13th/14th century) from Brabant. 2. Throughout medieval Europe, a lot of Latin literature was produced. I propose to extend my research to Latin, via the original case study of the Flemish monks (11th century) who were attracted by English nobility to write Latin biographies.

Researcher(s)

Research team(s)

Automatic Compound Processing. 01/07/2012 - 31/12/2013

Abstract

This project is a multidisciplinary investigation into the sharing of knowledge and resources between closely related languages, with a focus on the automatic processing of compounds. We will explore the possibility of creating new knowledge about closely related languages and of efficiently developing additional, more advanced resources for (a) compound segmentation and (b) the semantic analysis of compounds.

Researcher(s)

Research team(s)

Abstract rules or statistical learning? The impact of lexical and sublexical homophony in spelling and reading homophonous verb forms. 01/01/2012 - 31/12/2015

Abstract

Homophone intrusions in the spelling of regularly inflected Dutch verb forms are used to address a central question in psycholinguistics – and cognitive science in general: do people rely on symbolic mental rules or on a knowledge base that captures the co-occurrence probabilities in the learning domain (statistical learning)? Earlier findings in our research group indicated an effect of homophone dominance in the pattern of intrusion errors when spelling homophonic verb forms: such errors occur more often when the target is the lower-frequency homophone and the intruder the higher-frequency homophone. This is compatible with a statistical learning view but cannot reject a rule-based account enriched with a frequency-sensitive mechanism. To disentangle the two accounts we will compare error patterns in the lexical and sublexical domains. An effect of homophone dominance at the sublexical level cannot be explained by a rule model. Errors in the lexical and sublexical domains will be studied in spelling and reading tasks. Finally, we will attempt to simulate the experimental patterns with two types of computational models: a symbolic model, using morphemes and rules, and a memory-based model, storing whole word forms only and using a similarity metric that can 'discover' patterns in its memory store. Together, the experimental and simulation data should enable us to formulate an answer to the question about mental rules.

Researcher(s)

Research team(s)

'Authorship', composition and textual interconnectedness of three 16th-century mystical texts. Die evangelische peerle, Vanden tempel onser sielen, and the Arnhem mystical sermons. A stylometric approach. 01/01/2011 - 31/12/2014

Abstract

The aim of this project is to merge traditional methods of literary analysis with those of stylometry to provide tools to translate nuanced perceptions into verifiable observations. The study of the particular case of Arnhem mystical texts has a double objective: (1) to gain a deeper understanding of the textual culture and the interconnectedness of original texts of the Arnhem mystics; (2) to further the effectiveness, applicability and acceptability in literary studies of the combined method of close reading and computational techniques.

Researcher(s)

Research team(s)

Analyzing the impact of news and market reports on Belgian stock prices through text mining. 01/01/2011 - 31/12/2013

Abstract

In this project we will investigate how general and stock market specific news items can be analysed with advanced text mining techniques to automatically predict the effect on Belgian stock prices. Insights will be obtained into which news providers and which combinations of words have the largest effect. The developed system will be evaluated as a trading tool, as well as a decision support system for investors.

Researcher(s)

Research team(s)

The end rhyme in Middle Dutch epic literature (ca. 1200-1500): development and relationship to authorship and genres. 01/10/2010 - 30/09/2012

Abstract

Nearly all of Middle Dutch narrative literature (ca. 1200-1500) was written in rhyming couplets, which is why rhyme words are extremely suitable for the comparative study of Middle Dutch epic texts and authors. My research specifically focuses on three aspects: (a) the evolution of rhyme in the vernacular epic poetry of the medieval Low Countries; (b) the usefulness of rhyme words for authorship verification and attribution; (c) the correlation between rhyme word vocabulary and epic subgenres. My methodology is mainly borrowed from literary stylistics, computational stylometry and computational language technology. As such, this project envisages a quantitative study into the stylistic creativity of Middle Dutch epic poets.

Researcher(s)

Research team(s)

AMICA - Automatic monitoring for cyberspace applications. 01/10/2010 - 30/09/2011

Abstract

This project represents a research agreement between UA and IWT. UA provides IWT with the research results mentioned in the title of the project under the conditions stipulated in this contract.

Researcher(s)

Research team(s)

Statistical Relational Learning of Natural Language. 01/01/2010 - 31/12/2013

Abstract

This project investigates how techniques of statistical relational learning can be used for natural language processing. The focus will be on challenging natural language processing tasks, such as semantic role labeling, where syntactic and semantic dependencies, structured and unstructured data, local and global models, and probabilistic and logical information must be combined with one another. As regards statistical relational learning, the emphasis will lie on probabilistic extensions of the programming language Prolog. The project aims not only at obtaining improved natural language processing techniques but also at better algorithms and systems for statistical relational learning.

Researcher(s)

Research team(s)

A Safer Internet: (Semi)automatically Recognizing Internet Paedophilia in Multilingual Online Social Networks. 01/01/2010 - 31/12/2013

Abstract

In this project we propose, on the one hand, a methodology to (semi)automate the manual control of peer-to-peer networks and, on the other hand, a methodology for the automatic extraction and analysis of stylistic characteristics (associated with personality, age group and deceptive language usage), which we want to apply to both individual internet paedophiles and groups of paedophiles in chat rooms.

Researcher(s)

Research team(s)

Project website

Interpersonal communication training of natural language interaction with autonomous virtual characters (deLearyous). 01/01/2010 - 31/12/2012

Abstract

The goal of the deLearyous project is to develop an interactive serious 3D-game for training interpersonal communication skills in a professional context, e.g., employer-employee or customer-seller relations. The game allows trainees to interact with virtual autonomous characters who react in a realistic and expressive way to the input of the trainee. In this way, the trainee can exercise different behavioural patterns and roles in a safe virtual environment. The role of CLiPS in the project is to develop algorithms and methods for emotion analysis of text, topic detection in text, and dialogue management.

Researcher(s)

Research team(s)

Project TST Tools for Dutch as Web services in a Workflow (TTNWW). 01/01/2010 - 30/09/2012

Abstract

This project represents a formal research agreement between UA and the Flemish Public Service. UA provides the Flemish Public Service with the research results mentioned in the title of the project under the conditions stipulated in this contract.

Researcher(s)

Research team(s)

A web service for stylometry and readability research for the Dutch language (STYLENE). 01/01/2010 - 31/12/2011

Abstract

The goal of this project is to implement a robust, modular system for stylometry and readability research on the basis of existing techniques for automatic text analysis and machine learning, and the development of a web service that allows researchers in the humanities and social sciences to analyze texts with this system. In this way, the project will make available to researchers recent advances in research on the computational modeling of style and readability.

Researcher(s)

Research team(s)

The computational learnability of morphologically complex languages. 01/10/2009 - 30/09/2012

Abstract

Goals of the project: Traditional spell checkers make use of an extensive word list. If a word does not occur in this list, it is marked as a spelling error. More recent systems (e.g. Németh 2009) approach the problem of spell checking for agglutinating languages from a different angle: a word is considered a spelling error if it cannot be generated by an underlying morphological model of the language. In this project, we investigate how such a spell checker can be used as a tool in the automatic induction of a morphotactic system for Swahili.
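The contrast between the two kinds of spell checker can be sketched in a few lines. The example below is a toy illustration, not the system studied in the project: the word list, the three-slot verb template (subject prefix, tense marker, stem) and the function names are assumptions chosen for brevity, and real Swahili morphotactics has many more slots.

```python
from itertools import product

# Toy word list for a traditional list-based checker (illustrative only).
WORD_LIST = {"ninasoma", "alipika"}

# Toy morphotactic model: subject prefix + tense marker + verb stem.
SUBJECT_PREFIXES = ["ni", "u", "a"]
TENSE_MARKERS = ["na", "li", "ta"]
STEMS = ["soma", "pika"]

def in_word_list(word):
    """List-based check: accept only words that were seen before."""
    return word in WORD_LIST

def generated_by_model(word):
    """Model-based check: accept any word the morphological model can generate."""
    return any(
        word == prefix + tense + stem
        for prefix, tense, stem in product(SUBJECT_PREFIXES, TENSE_MARKERS, STEMS)
    )
```

Here `utasoma` is missing from the word list but is accepted by the model-based check, whereas a string with the morphemes in the wrong order, such as `somanani`, is rejected by both.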

Researcher(s)

Research team(s)

Towards a synthesis of knowledge based and data-based methods in computer linguistics. 01/10/2009 - 30/09/2010

Abstract

Hybrid systems for natural language processing that combine deep analysis, based on linguistic insight, with inductive data-oriented methods, can provide a significant improvement of the accuracy and applicability of computational linguistics. There are, however, many different ways in which this kind of hybridisation can be achieved. In this project, I will look at cognitive science as an inspiration source for new hybrid approaches. This work will build on earlier work on memory-based language processing as a cognitively relevant model.

Researcher(s)

Research team(s)

Machine learning for data mining and its applications. 01/01/2009 - 31/12/2013

Abstract

The research community aims at strengthening and coordinating the Flemish research about machine learning for datamining in general, and important applications such as bio-informatics and textmining in particular. Flemish participants: Computational Modeling Lab (VUB), CNTS (UA), ESAT-SISTA (KU Leuven), DTAI (KU Leuven), ISLab (UA).

Researcher(s)

Research team(s)

Artificial Creativity in visual communication and arts: an algorithm for inventive and evolving development of concepts and visualization of data. 01/10/2008 - 30/09/2012

Abstract

Using common techniques in Artificial Intelligence a software algorithm is developed that summarizes, interprets and processes textual content (or data sets). In an attempt to simulate human creativity the key concepts in this content are interrelated and recombined into creative and innovative graphical solutions and visualizations. The visual output evolves as the source data changes and expands.

Researcher(s)

Research team(s)

The end rhyme in Middle Dutch epic literature (ca. 1200-1500): development and relationship to authorship and genres. 01/10/2008 - 30/09/2010

Abstract

Nearly all of Middle Dutch narrative literature (ca. 1200-1500) was written in rhyming couplets, which is why rhyme words are extremely suitable for the comparative study of Middle Dutch epic texts and authors. My research specifically focuses on three aspects: (a) the evolution of rhyme in the vernacular epic poetry of the medieval Low Countries; (b) the usefulness of rhyme words for authorship verification and attribution; (c) the correlation between rhyme word vocabulary and epic subgenres. My methodology is mainly borrowed from literary stylistics, computational stylometry and computational language technology. As such, this project envisages a quantitative study into the stylistic creativity of Middle Dutch epic poets.

Researcher(s)

Research team(s)

FlaReNet: Fostering Language resources Network. 01/09/2008 - 01/09/2011

Abstract

International cooperation and re-creation of a community are the most important drivers for a coherent evolution of the Language Resource (LR) area in the next years. FlaReNet will be a European forum to facilitate interaction among LR stakeholders. Its structure considers that LRs present various dimensions and must be approached from many perspectives: technical, but also organisational, economic, legal, political. The Network also addresses multicultural and multilingual aspects, essential when facing access and use of digital content in today's Europe.

Researcher(s)

Research team(s)

NEON: subtitling in Dutch. 01/06/2008 - 31/05/2009

Abstract

In this project, CNTS develops a system for automatic subtitling on the basis of the output of speech recognition. Such a system allows the simplification and shortening of sentences when needed without making them ungrammatical and without losing their essential meaning. As a methodology, a combination of rule-based and statistical techniques was chosen. In the project, we cooperate among others with the Belgian and Dutch television and with the speech recognition research group of the University of Leuven.

Researcher(s)

Research team(s)

Text Mining on heterogeneous knowledge bases. An application to optimised discovery of disease relevant genetic variants. 01/07/2007 - 30/06/2011

Abstract

The project proposes a methodology for text mining with heterogeneous information sources and its application to molecular genetics/genomics and knowledge management. State of the art text analysis and graph-based data mining techniques will be extended to make the methodology possible, and the methodology will be applied in a biomedical application (ranking of candidate disease-causing genes) and a knowledge management application (person profiling from www information).

Researcher(s)

Research team(s)

Project website

Computational Techniques for Stylometry for Dutch. 01/01/2007 - 31/12/2010

Abstract

In this project we investigate a methodology for the automatic extraction and analysis of style that we want to apply to both individual authors (authorship attribution, both fiction and non-fiction) and groups of authors (extraction of stylistic characteristics associated with gender and age). This methodology covers several aspects: (1) Automatic linguistic analysis of documents by means of available text analysis tools on the level of morphological structure, part of speech, global syntactic structures and semantic roles (subject, object, temporal, location) for the construction of potentially relevant stylistic characteristics. (2) Unsupervised and supervised learning techniques for selecting characteristics with high information value and constructing a model of authorial style. (3) Evaluation of these models by (a) comparison with stylistic analyses in linguistics and literary science and (b) empirical testing of the predictive power of the models.
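A heavily simplified sketch of the feature-extraction and attribution steps is given below. It is an assumption for illustration, not the project's pipeline: it replaces the linguistic analysis with relative frequencies of a few Dutch function words and replaces the learning step with a nearest-profile rule based on Manhattan distance. The feature list and the names `style_vector` and `attribute` are hypothetical.

```python
from collections import Counter

# Hypothetical feature set: a handful of Dutch function words.
FUNCTION_WORDS = ["de", "het", "een", "en", "van", "in"]

def style_vector(text):
    """Relative frequency of each function word in a whitespace-tokenised text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]

def attribute(text, profiles):
    """Return the author whose profile vector is closest (Manhattan distance)."""
    vector = style_vector(text)

    def distance(author):
        return sum(abs(a - b) for a, b in zip(vector, profiles[author]))

    return min(profiles, key=distance)
```

In use, `profiles` would map each candidate author to the `style_vector` of their known writings, and the predictive power of the model would be checked on held-out texts of known authorship.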

Researcher(s)

Research team(s)

Gravital: parsing and problem-solving in natural language as an engine for generating visual communication and art. 01/01/2007 - 31/12/2008

Abstract

The project addresses the application of parsing of natural language and problem solving as tools for the generation of visual communication and art. Within the context of the NodeBox application, we will adapt the MBSP shallow parser to the domain of design and visual communication and help integrate it into the NodeBox application.

Researcher(s)

Research team(s)

Linguistic description of resource-scarce languages using machine learning techniques. 01/10/2006 - 30/09/2009

Abstract

Linguistically annotated corpora are an important tool in the development of Natural Language Processing (NLP) applications. For commercially interesting languages, these corpora can be used to induce accurate and robust NLP tools to process new data. If no such corpora exist, which is by definition the case for resource-scarce languages, the traditional data driven algorithms are largely useless. This project investigates the automated linguistic description of minority languages on the basis of alternative classification techniques. The algorithms researched in this project avoid the need for annotated data in the target language by automatically inducing a classification, either on the basis of free text (technique: "unsupervised learning") or by using existing annotated corpora in another language (technique: "knowledge transfer"). The methodology proposed in this project allows for a hitherto largely unexplored systematic comparison and evaluation of these techniques.

Researcher(s)

Research team(s)

DAESO - Detecting and exploiting semantic overlap. 01/06/2006 - 31/05/2009

Abstract

The well-known fact that similar information can be expressed in many different ways is one of the major challenges in building robust NLP applications. It is commonly assumed that such applications can be improved with knowledge of how natural language expressions relate to each other, for instance in terms of paraphrases (same semantic content, different wording) or entailments (one expression implied by the other). DAESO investigates the detection of semantic overlap between Dutch sentences and the exploitation of this knowledge in a range of NLP applications. For this purpose, tools will be developed for the automatic alignment and classification of semantic relations (between words, phrases and sentences) for Dutch, as well as for a Dutch text-to-text generation application which fuses related sentences into a single grammatical sentence, which may be a generalization, a specification or a reformulation of the input sentences. To facilitate development and testing of these tools, an annotated monolingual Dutch parallel/comparable corpus of 1M words will be developed, consisting of pairs of texts that express comparable information. The utility of the resources and tools will be demonstrated in the context of three applications: (1) question-answering systems (improved recall, more complete answers), (2) information extraction (improved recall), and (3) summarization (beyond extraction: sentence compression, sentence fusion, anaphora resolution).
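As a point of reference for what detecting semantic overlap between sentences involves, the sketch below computes a very simple lexical baseline: the Jaccard overlap between the word sets of two sentences. This is an assumption for illustration and not the alignment and classification tools developed in DAESO; it cannot recognise paraphrases that use different words. The function name `token_overlap` is hypothetical.

```python
def token_overlap(sentence_a, sentence_b):
    """Jaccard overlap between the lower-cased word sets of two sentences."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

A pair such as "de kat slaapt" and "de hond blaft" shares only one of five distinct words, which shows why word-level overlap alone is a weak signal and why alignment of words, phrases and sentences is needed.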

Researcher(s)

Research team(s)

Project website

Computational Linguistics and Language and Speech Technology. 01/01/2006 - 31/12/2010

Abstract

CLIF is the Flemish organization for computational linguistics, language technology and speech technology. The goal of the association is to stimulate research cooperation among the groups and the development of tools and resources that individual participating groups could not develop on their own.

Researcher(s)

Research team(s)

Coreference Resolution for Extracting Answers. (STEVIN - COREA) 01/05/2005 - 31/10/2007

Abstract

Coreference resolution is a key ingredient for the automatic interpretation of text. It has been studied mainly from a linguistic perspective, with an emphasis on establishing potential antecedents for pronouns. Practical applications, such as Information Extraction (IE), summarization and Question Answering (QA), require accurate identification of coreference relations between noun phrases in general. Computational systems for assigning such relations automatically require the availability of a sufficient amount of annotated data for training and testing. For Dutch, annotated data is scarce and coreference resolution systems are lacking. In COREA, a robust system for assigning such relations automatically will be developed, and we will investigate the effect of making coreference relations explicit on the accuracy of systems for IE and QA.

Researcher(s)

Research team(s)

Semi-supervised learning of Information Extraction. 01/10/2004 - 31/12/2005

Abstract

Information Extraction (IE) is concerned with extracting relevant data from a collection of structured or semi-structured documents. Current systems are trained using annotated corpora that are expensive and difficult to obtain in real-life applications. Therefore in this project we want to focus on the development of IE systems using semi-supervised learning, a technique that makes use of a large collection of un-annotated and easily-available data.

Researcher(s)

Research team(s)

Situational Factors in Producing Inflected wordforms: a Psycholinguistic and Computational Approach. 01/01/2004 - 31/12/2007

Abstract

The production of inflected word forms like plurals or past tenses is traditionally assumed to be a process that relies primarily on morphological, phonological and syntactical characteristics of the base form. Although descriptive grammars also mention metalinguistic factors in this context, they receive no attention in recent influential models of language production such as Steven Pinker's 1999 Words and Rules theory. However, in a recent experiment, we demonstrated that Dutch speakers do rely on metalinguistic information when producing plurals for Dutch pseudowords. Not only do these results undermine Pinker's assumption that Dutch has two default plurals that are applied solely on the basis of phonological information, but they also question whether models that have a rule-based component are essentially capable of capturing metalinguistic information.

Researcher(s)

Research team(s)

Supercomputing cluster. 01/01/2004 - 31/12/2006

Abstract

Researcher(s)

Research team(s)

    PASCAL - Pattern Analysis, Statistical Modelling and Computational Learning. 01/12/2003 - 29/02/2008

    Abstract

    Pattern Analysis, Statistical Modelling and Computational Learning. (PASCAL) The objective of this FP6 network of excellence is to build a Europe-wide Distributed Institute which will pioneer principled methods of pattern analysis, statistical modelling and computational learning as core enabling technologies for multimodal interfaces that are capable of natural and seamless interaction with and among human users. The role of CNTS in the network is the application of machine learning techniques to problems in natural language processing.

    Researcher(s)

    Research team(s)

    The use of very large textcorpora in the automatic discovery of structure in natural language. 01/10/2003 - 28/02/2005

    Abstract

    Large repositories of language samples exist today. Some examples are the text on the internet, and texts and dictionaries in many languages. However, these corpora are not always used when examining language hypotheses, or fundamental language questions. This gap is in the process of being filled, and this research hopes to be part of this development. The general aim is to arrive at a better use of existing language technologies in order to test specific hypotheses about the structure and function of language and about language change and typology.

    Researcher(s)

    Research team(s)

    Biological Text Mining (BioMinT). 01/01/2003 - 31/03/2006

    Abstract

    The goal of the BioMinT project is to develop a generic text mining tool that (1) interprets diverse types of query, (2) retrieves relevant documents from the biological literature, (3) extracts the required information, and (4) outputs the result as a database slot filler or as a structured report. The consortium consists of biologists (University of Manchester, Swiss Institute of Bioinformatics) and data/text mining groups (CNTS Antwerp, PharmaDM, Austrian Research Institute for AI, University of Geneva AI Lab).

    Researcher(s)

    Research team(s)

    Semi-supervised learning of Information Extraction. 01/01/2003 - 30/09/2004

    Abstract

    Information Extraction (IE) is concerned with extracting relevant data from a collection of structured or semi-structured documents. Current systems are trained using annotated corpora that are expensive and difficult to obtain in real-life applications. Therefore, in this project we want to focus on the development of IE systems using semi-supervised learning, a technique that makes use of a large collection of unannotated and easily available data.

    Researcher(s)

    Research team(s)

    FLaVoR : Flexible Large Vocabulary Recognition : Incorporating linguistic knowledge sources through a modular recogniser architecture. 01/10/2002 - 30/09/2006

    Abstract

    In this project we investigate whether the 'all-in-one' strategy currently used in speech recognizers, in which task-specific, syntactic, and lexical knowledge are fused into a single model based on simple formalisms, can be replaced by a modular architecture in which, apart from acoustic-phonetic and intonational features, generic and domain-specific linguistic information sources can also be used.

    Researcher(s)

    Research team(s)

    Functions of audiovisual prosody. 01/10/2002 - 30/09/2005

    Abstract

    This research proposal is concerned with a functional approach to verbal and visual prosody in spoken conversations. The problem addressed in the project is the combined use of specific auditory cues (such as intonation, tempo, voice quality and pausing) and specific visual cues (such as facial expressions and specific body gestures) for marking different dialogue phenomena. First, we will explore how audiovisual prosody can be exploited to highlight the information status of words. Then, we will investigate how it can be used to signal whether or not the process of information exchange in a dialogue is going well. Next, we will explore how it can support the turn-taking mechanism in spontaneous interactions. Finally, we will see to what extent audiovisual prosody may reflect speakers' emotions and attitudes. The results of these different substudies will be integrated in one coherent, functional model of audiovisual prosody. All the questions will be tackled from the point of view of both the speaker and the listener, and from a crosslinguistic perspective. Insight into functional aspects of audiovisual prosody is relevant from both a theoretical and an applied perspective. First, it is remarkable to observe that this important communicative device is still largely unexplored. Knowledge about how audiovisual prosody works may yield new insights into how people mark important words, deixis, turn-taking, discourse structure, etc., and more generally into how languages can differ in the way they signal linguistic and paralinguistic phenomena. Second, there is an increasing interest in computer interfaces that rely on what is termed 'embodied conversational agents', i.e., specific software components that appear to users as animated characters. To make these agents 'believable' and 'communicative', it is important to know in full detail how specific auditory and visual parameters contribute to speech communication.

    Researcher(s)

    Research team(s)

    Techniques for the incorporation of linguistic knowledge in machine learning of language. 01/10/2002 - 30/09/2003

    Abstract

    Two fundamental problems in computational linguistics are the cost of annotating text (introducing linguistic information) and the difficulty of collecting enough data. I want to tackle these problems simultaneously in a theoretical and experimental study, in which I will study the effects of (i) feature selection and construction methods: techniques which enable us to determine which linguistic sources are important when approaching linguistic tasks, and (ii) methods such as active learning, expectation-maximization, co-training, and bootstrapping: these methods make the annotation of corpora faster or redundant.

    Researcher(s)

    Research team(s)

      Multilingual subtitling of multimedia content (MUSA). 01/09/2002 - 28/02/2005

      Abstract

      MUSA aims at the creation of a multimodal multilingual system that converts audio streams into text transcriptions, translates the transcriptions into other languages and then generates subtitles from these translated transcriptions. MUSA will operate in English, French and Greek. A state-of-the-art Speech Recognition system will be enhanced and improved to meet the project settings. An innovative Machine Translation scenario will be designed that combines a Machine Translation engine with a Translation Memory and a Term Substitution module. The Antwerp group will be involved in sentence condensing for subtitle generation, performed by an automatic analysis of the linguistic structure of the sentence.

      Researcher(s)

      Research team(s)

      Machine learning for data mining and its applications. 01/01/2002 - 31/12/2006

      Abstract

      The research community aims at strengthening and coordinating Flemish research on machine learning for data mining in general, and on important applications such as bioinformatics and text mining in particular. Flemish participants: Computational Modeling Lab (VUB), CNTS (UA), ESAT-SISTA (KU Leuven), DTAI (KU Leuven), ADReM (UA).

      Researcher(s)

      Research team(s)

      Semaduct : combining deductive and inductive techniques for lexical semantics. 01/01/2002 - 31/12/2005

      Abstract

      The goal of the project is to confront and integrate deductive and inductive approaches to computational linguistics in the area of lexical semantics. Subprojects include the combination of supervised and unsupervised machine learning methods for semantic knowledge acquisition and disambiguation, the incorporation of linguistic semantic knowledge in inductive approaches, and the refinement of existing semantic tag sets with machine learning techniques.

      Researcher(s)

      Research team(s)

      OntoBasis: Extraction of ontologies from text. 01/01/2002 - 31/12/2005

      Abstract

      The main goal of CNTS for this project is the application and adaptation of shallow parsing technology for (i) extraction of lexons (ontological relations) from unstructured and semi-structured sources, (ii) evaluation of ontologies, and (iii) adaptation of ontologies (e.g. WordNet) to specific domains. A secondary goal is to investigate the use of ontologies to improve text analysis using shallow parsing.

      Researcher(s)

      Research team(s)

      Text Analysis and Machine Learning for Prosody. 01/01/2001 - 31/12/2004

      Abstract

      The aim of the project is to perform empirical investigations to determine whether adequate prosody can be generated on the basis of two methods that have recently shown success in other language processing domains: (a) robust analysis of text by analyses and metrics from information retrieval and information extraction, and (b) advanced machine learning systems and meta learners.

      Researcher(s)

      Research team(s)

      Language acquisition by children with cochlear implants: A longitudinal investigation. 01/01/2001 - 31/12/2004

      Abstract

      In this project we study auditory development and speech and language acquisition in congenitally deaf children who received a cochlear implant (CI) during their second year of life. Our aim is to systematically investigate the effect of the CI on different aspects of language and speech development:

      - the effect of a CI on the auditory level;
      - the effect of a CI on the articulatory level (speech);
      - the effect of a CI on language acquisition and communicative development.

      In essence, we want to investigate how access to auditory information evolves and what impact that access to spoken language has on the child's own spontaneous speech and language. The scientific aims of the research proposal are (i) descriptive and (ii) fundamental.

      (i) Descriptive: a longitudinal description of auditory development and of speech, language and communicative development after a CI. On the basis of this description we will be able to answer the following questions: Does language acquisition after a CI proceed in a qualitatively and/or quantitatively similar fashion to that in normal-hearing babies? What is the level of spoken language development in CI babies, as compared to normal-hearing babies? Is there a qualitative and/or quantitative difference in auditory development and in speech and language development between babies, depending on the age at which they receive a CI?

      (ii) Fundamental psycholinguistic aims:

      - study of the perception of segmental and suprasegmental characteristics of speech in relation to its production;
      - study of phonological development on the segmental and suprasegmental level, focussing on the evolution of truncation patterns;
      - study of lexical and morphosyntactic acquisition, focussing on the evolution of 'function words' or closed-class words with respect to open-class words, an opposition related to perceptual salience;
      - study of communicative development, focussing on (1) the use and place of speech versus (conventional) signs, (2) the use of interactional means (attention seeking/fixing/...), (3) the magnitude and use of types of interaction turns by child and adult conversation partner.

      Researcher(s)

      Research team(s)

      Action b/c of the action plan for Dutch in language and speech technology. 01/01/2001 - 31/12/2001

      Abstract

      Making an inventory of the available resources (software components and databases) for industrial development of Dutch language technology, and formulating advice about the relative priority of investments in the development of these resources.

      Researcher(s)

      Research team(s)

        Atranos: automatic transcription and normalisation of speech. 01/10/2000 - 30/09/2004

        Abstract

        The project aims at contributing to the development of better products for the automatic verbatim transcription of speech, and for the conversion of these transcriptions to a form that is better adapted to the needs of the end-user. One application which will be studied as a case study is the generation of subtitles for the benefit of hearing-impaired people. CNTS will investigate learning techniques for the transcription of out-of-vocabulary items, and statistical techniques for aligning and predicting subtitle text from transcriptions.

        Researcher(s)

        Research team(s)

        Topic spotting and tracking in newspaper text. 01/07/2000 - 31/07/2003

        Abstract

        The aim of this project is (i) to automatically find important new topical "concepts" in unrestricted text, and (ii) to track the evolution of the connotation and definition of such concepts through time in newspaper text and on the WWW. From a scientific point of view, this project investigates the usefulness, for this task, of combining statistical and information-theoretic techniques used in information retrieval and statistical natural language processing with language engineering components such as shallow parsers.

        Researcher(s)

        Research team(s)

          Scientific Research Community for Computational Linguistics and Language and Speech Technology. 01/01/2000 - 31/12/2004

          Abstract

          The goal of this scientific research community (CLIF, Computational Linguistics in Flanders) is to bring together the academic research expertise on language and speech technology for Dutch present in Flanders. CLIF will promote and facilitate fundamental, multidisciplinary, and application-oriented research in this area and provide advice to users of language and speech technology.

          Researcher(s)

          Research team(s)

          Neural networks and genetic algorithms for language and speech technology. 01/04/1999 - 31/12/2000

          Abstract

          Basic research into the applicability of neural networks and genetic algorithms in language and speech technology, in the context of implementation on evolutionary hardware. Integration of these techniques with existing statistical and machine learning methods for automatic disambiguation in natural language analysis. The project is in cooperation with Flanders Language Valley (FLV).

          Researcher(s)

          Research team(s)

            Extending Computational Grammars by Learning. 01/05/1998 - 30/04/2001

            Abstract

            The focus of the network Learning Computational Grammars (LCG) is the investigation of ways to improve computational grammars by applying machine learning techniques to current best practice in Computational Grammar. LCG seeks improvement through the application of a range of machine learning techniques, including both symbolic and statistical techniques. The scientific goal is to provide a characterization of the algorithms capable of learning (important fragments of) language. This responds to a challenge of theoretical linguistics (how is language acquisition possible?) and may have practical application in natural language processing (NLP). In this network (in which UIA cooperates with Groningen, Tuebingen, SRI Cambridge, University College Dublin, Suissetra Geneva and Xerox Research Centre Grenoble), UIA focusses on memory-based learning techniques.

            Researcher(s)

            Research team(s)

              Computational psycholinguistics : natural and artificial language acquisition and processing. 01/01/1998 - 31/12/2003

              Abstract

              The issue of abstract representations in the domains of language acquisition and adult language processing is addressed in this project. Is it possible to learn a subdomain of language without prior linguistic knowledge in this domain? Can one achieve the final learning stage (adult performance) without developing abstract representations? A new methodology will be used to study these questions. The research will explicitly combine the techniques that are used in three separate disciplines: language acquisition research, psycholinguistics, and artificial intelligence. Whereas the former two take the real language learner/user as their object of study, the latter one studies the artificial language learner/user. Thus far artificial learning models have always been used to simulate effects observed in actual language use. Whereas simulation reveals the computational power of the learning system and suggests interesting hypotheses on the real language learner/user, it does not falsify hypotheses generated in, for instance, psycholinguistic work. In our research we want to use artificial language learners/users in a radically different way. Apart from having them simulate effects from real language use, we want to isolate factors that affect the model's behaviour and then study the effects of these same factors in psycholinguistic experiments and in language acquisition data. In case of a different outcome, the effects observed in real language users can then be used to adapt the architecture of the artificial learning model and see whether its performance can eventually be matched to that of the language user. This method of relating the results from acquisition and psycholinguistic research to computational work and vice versa is essentially a heuristic for discovering properties of the representational architecture for language in the real language learner/user.
              This basic issue, and the methodology to study it, will be approached in two linguistic domains: phonology and inflectional morphology. In phonology, the linguistic representation of stress patterns, phonotactic restrictions, and syllable structure will be studied. In morphology, irregularity effects in past tense formation in Dutch will be used to study the issue of the single-route versus dual-route architecture (i.e., rules for regular forms, a lexicon for the irregular ones). A study of the factors causing interference errors in the spelling of (highly regular) past tense forms in Dutch (regular forms affecting other regulars) will shed light on the issue.

              Researcher(s)

              Research team(s)

              Contextual Interpretation of Natural Language Using Abductive Reasoning and Inductive Knowledge. 01/01/1997 - 31/12/2000

              Abstract

              There are two fundamental linguistic problems in modelling the interpretation of a context: (1) making connections that are not explicitly mentioned in the text, such as co-reference and temporal relationships, and (2) the contextual disambiguation of ambiguous words or constructions. In this project we will focus on the representation and interpretation of temporal expressions in Dutch. We use the representation language of Discourse Representation Theory as a basis. The goal is to use data mining techniques to formulate disambiguation rules. We will need new data modelling techniques, as well as new inference methodologies, for this purpose. The project's aim is to research the possibilities of using abduction for the interpretation of the context of temporal expressions, as well as the use of inductive reasoning for the extraction of disambiguation rules.

              Researcher(s)

              Research team(s)

                01/01/1997 - 31/12/1997

                Abstract

                Researcher(s)

                Research team(s)

                  A data-driven model of language acquisition: Computational and psycholinguistic investigations. 01/01/1996 - 31/12/2000

                  Abstract

                  The aim of the project is the development of a computational psycholinguistic model of morphosyntactic aspects of language acquisition. This includes a psycholinguistic investigation of the acquisition of morphosyntax, and more specifically the acquisition of the morphological and distributional reflexes of the feature 'finite' in Dutch. A computer model of these linguistic phenomena will be implemented in which the principles of similarity-based reasoning will be represented.

                  Researcher(s)

                  Research team(s)

                    Electronic archive for language technology of Dutch. 01/10/1995 - 31/12/1996

                    Abstract

                    The aim of this project is to install, develop, enrich, maintain and make available an electronic server for software, data collections, knowledge bases and corpora related to language technological research (esp. focused on Dutch). This server is crucial for the development of language technology for Dutch because it enhances the reusability of research results, prevents the unproductive repetition or overlap of research efforts, and can serve as a didactic source of information for students of language technology and computational linguistics.

                    Researcher(s)

                    Research team(s)

                      Memory-based acquisition and processing of morphological and syntactic knowledge for language technological applications. 01/07/1995 - 30/06/1996

                      Abstract

                      The project aims at developing a computational model of morphophonological and syntactic knowledge acquisition and processing using the principles of memory-based reasoning. The induction of linguistic knowledge is meant to be domain and language independent.

                      Researcher(s)

                      Research team(s)

                        Computational Linguistics and Language technology. 01/01/1995 - 31/12/1999

                        Abstract

                        The proposed research community has as its goal to promote the integration and strengthening of the Flemish expertise in the field of the automatic processing of language (from both the theoretical and the applied perspective). The Flemish participants represent complementary contributions to fundamental and applied research in natural language processing. The other participants have been selected for their international status, their existing cooperation with the Flemish partners, and their complementarity to the Flemish partners. Over the next five years, the cooperation will try to achieve the following goals: (1) Integration, coordination and international embedding of Flemish research in the field. (2) Collection of resources for Flemish language technology, within the framework of European standardization efforts.

                        Researcher(s)

                        Research team(s)

                          FONILEX: a pronunciation lexicon of Dutch. 01/01/1995 - 30/06/1997

                          Abstract

                          The aim of the project is the compilation of a pronunciation lexicon of a representative number of Dutch word forms and their Flemish pronunciation. The output of the project, a database, is meant to be used in speech research, including phoneme-based speech recognition, text-to-speech synthesis, and integrated speech and language processing.

                          Researcher(s)

                          Research team(s)

                            Machine learning of pragmatic knowledge : theoretical considerations and an implementation. 01/10/1994 - 30/09/1996

                            Abstract

                            This project concerns the automatic acquisition of knowledge (machine learning). The specific type of knowledge involved is Natural Language, more specifically the pragmatic aspects of natural language. The aim is to develop a theoretical framework for the construction of learning models for pragmatic knowledge.

                            Researcher(s)

                            Research team(s)

                              The acquisition of linguistic knowledge : cognitive and language technological aspects. 01/01/1994 - 31/12/1997

                              Abstract

                              In the present research project we aim at studying the process of language acquisition by adopting a data-driven approach, and at conducting experiments with artificial learning algorithms in which cue-based competitive learning can readily be implemented.

                              Researcher(s)

                              Research team(s)