Query approximation techniques for data analytics
5 September 2016
UAntwerpen, Campus Middelheim, A.143 - Middelheimlaan 1 - 2020 Antwerpen (route: UAntwerpen, Campus Middelheim
PhD defence Reuben Ndindi - Faculty of Science, Department of Mathematics and Computer Science
Technological developments of the last decade make generating and storing large amounts of data cheaper and easier. Research centers, governments, businesses, and even individuals are in possession of huge amounts of data. Such large amounts of data, however, provide new challenges. Indeed, such data must be efficiently analysed in order to make decisions or forming conclusions.
This analysis process falls under the umbrella term of data analytics and consists of exploring the data through traditional relational queries to the underlying database, and secondly the application of data mining techniques to discover knowledge from these data.
Large amount of data, however, requires very efficient methods to evaluate queries. Indeed, even performing simple queries that require a single scan of the data can take hours or even days. Although recent advances in hardware and the emergence of new parallel computing paradigms (such as MapReduce, Spark) make it possible - to a certain extent - to evaluate queries efficiently in a parallel and distributed fashion, it does not solve all problems. In recent years, query approximation techniques have become popular to give users a quick insight in the data.
Most query approximation techniques are, however, especially designed for so-called aggregate queries. Nevertheless, non-aggregate queries are abundant in practice. As part of this thesis, we propose a general technique that allows us to approximate such non-aggregate queries. In particular, we show that we can adapt existing database systems, approximate queries in a very precise way and are able to quantify the errors that are made by these approximations.
As mentioned earlier, data mining techniques are also an essential part of data analytics. In particular, this includes clustering techniques that allocate objects into groups (clusters) on the basis of certain quantitative measures. A commonly used method of clustering is correlation clustering. The aim is to group similar objects together, and place dissimiar objects in separate groups. Correlation clustering methods, however, cluster data in an automatic way, without involving a user. For large amounts of data it is desirable that clustering methods allow user input in order to control the quality of clustering. As part of this thesis, we consider correlation clustering from this interactive perspective.