Mining Patterns in Dirty Data for Detecting and Correcting Inconsistencies

Date: 10 October 2018

Venue: Campus Middelheim, A.143 - Middelheimlaan 1 - 2020 Antwerpen (route: UAntwerpen, Campus Middelheim)

Time: 4:00 PM

Organization / co-organization: Department of Mathematics and Computer Science

PhD candidate: Joeri Rammelaere

Principal investigators: Floris Geerts & Bart Goethals

Short description: PhD defence Joeri Rammelaere - Faculty of Science, Department of Mathematics and Computer Science


Data is being generated, scraped, and integrated at never-before-seen rates. At the same time, quality control over this data has not kept pace. Much of the data is obtained from unreliable sources, such as faulty sensors, software based on heuristics, and overworked humans. Consequently, these massive amounts of data are becoming increasingly dirty.

This phenomenon is problematic for any sizeable organization, and is estimated to cost the US economy alone anywhere between millions and trillions of dollars each year. Apart from companies suffering financial losses, dirty data also impacts areas such as data analysis, knowledge discovery from databases, and machine learning. These applications typically rely on large amounts of data, and because that data is often dirty, they can produce wrong conclusions, faulty models, or false patterns.

In this dissertation we focus on erroneous data, in the form of value combinations that violate certain logical rules. These rules are called quality rules, and typically specify what clean data should look like. Throughout the dissertation, we have investigated the problem of discovering such quality rules from different angles. The main challenge here is that, in a typical scenario, the correct rules are unknown. We have tackled this problem by involving a user in the discovery process, and by employing a suitable interestingness measure for discovering violations of quality rules. The techniques used are mostly rooted in the area of pattern mining, a subfield of knowledge discovery from databases, aimed at discovering interesting associations between objects or events.
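
To make the notion of a quality rule concrete, the following is a minimal sketch of checking one simple kind of rule, a functional dependency, against a small table. It is an illustration only and not the method developed in the dissertation: the rule "zip determines city", the sample records, and the helper name `find_fd_violations` are all assumptions introduced here.

```python
from collections import defaultdict


def find_fd_violations(rows, lhs, rhs):
    """Return the lhs values whose rows disagree on rhs.

    A functional dependency lhs -> rhs says that rows with the same
    lhs value must have the same rhs value. Any lhs value mapped to
    more than one rhs value is a violation of that rule.
    """
    values_by_key = defaultdict(set)
    for row in rows:
        values_by_key[row[lhs]].add(row[rhs])
    return {
        key: sorted(values)
        for key, values in values_by_key.items()
        if len(values) > 1
    }


# Hypothetical sample data: the two spellings for zip "2020" conflict.
records = [
    {"zip": "2020", "city": "Antwerpen"},
    {"zip": "2020", "city": "Antwerp"},
    {"zip": "9000", "city": "Gent"},
    {"zip": "9000", "city": "Gent"},
]

violations = find_fd_violations(records, "zip", "city")
print(violations)
```

Here the rule is given up front. In the setting the abstract describes, the correct rules are unknown, so a check like this can only serve as a building block for scoring candidate rules, not as the discovery process itself.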