Cohesive Pattern Mining in Sequential and Spatial Data

Date: 2 September 2015

Venue: UAntwerp, Campus Middelheim, G.010 - Middelheimlaan 1 - 2020 Antwerp

Time: 3:00 PM

Organization / co-organization: Department of Mathematics and Computer Science

PhD candidate: Cheng Zhou

Principal investigator: Bart Goethals

Short description: PhD defence Cheng Zhou - Department of Mathematics and Computer Science, Faculty of Science



Abstract

The goal of pattern mining is to search for patterns in the data that can help explain its underlying structure. To be practically useful, the discovered patterns should be interesting and easy to understand. In practice, we are confronted with a variety of potential applications and different types of data, such as sequence data, protein structures, geographic data and data streams, where the spatial or temporal information in the data is important. Therefore, when looking for interesting patterns in such data, we take into account how close to each other the items of a pattern occur in space or time.

In this thesis, we study the problem of interesting pattern mining for three applications, namely sequence classification, structure analysis and event stream prediction, by defining a different interestingness measure for each.

We address the problem of sequence classification using interesting patterns found in a dataset of labelled sequences. The interestingness of a pattern in a given class of sequences is measured by combining the cohesion and the support of the pattern. We use the discovered patterns to generate confident classification rules, and present two different ways of building a classifier. Furthermore, we test a variety of machine learning algorithms for sequence classification, using different kinds of patterns as features to represent each sequence as a feature vector.
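
As an illustration of how cohesion and support might be combined into a single score, the following is a minimal sketch. The abstract does not give the formulas, so the definitions used here are assumptions: the cohesion of an itemset in a sequence is taken as the itemset size divided by the length of the shortest window containing all of its items, support is the fraction of sequences containing the itemset, and the two are multiplied. The function names `shortest_window` and `interestingness` are hypothetical.

```python
from typing import List, Optional, Sequence, Set


def shortest_window(sequence: Sequence[str], itemset: Set[str]) -> Optional[int]:
    """Length of the shortest contiguous window of `sequence` that contains
    every item of `itemset`, or None if some item never occurs."""
    if not itemset:
        return None
    last_seen = {}  # item -> most recent position at which it occurred
    best = None
    for pos, item in enumerate(sequence):
        if item not in itemset:
            continue
        last_seen[item] = pos
        if len(last_seen) == len(itemset):
            # Shortest window ending at `pos` starts at the oldest "last seen".
            length = pos - min(last_seen.values()) + 1
            if best is None or length < best:
                best = length
    return best


def interestingness(sequences: List[Sequence[str]], itemset: Set[str]) -> float:
    """Assumed measure: support * average cohesion over covering sequences."""
    windows = [w for w in (shortest_window(s, itemset) for s in sequences)
               if w is not None]
    if not windows:
        return 0.0
    support = len(windows) / len(sequences)
    # Cohesion is 1.0 when the items are adjacent and shrinks as they spread out.
    cohesion = sum(len(itemset) / w for w in windows) / len(windows)
    return support * cohesion
```

Under these assumptions, an itemset that occurs in many sequences of a class but with its items scattered far apart scores lower than one whose items occur equally often but close together.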

We present new interestingness measures to identify spatially cohesive itemsets in one or more multidimensional spatial structures. We demonstrate the usefulness of the method by applying it to find interesting patterns of amino acids that lie in spatial proximity within a set of proteins, based on their atomic coordinates in the protein molecular structure. Experiments on geographical data of a city demonstrate the efficiency and intuitiveness of the algorithms.

We present a prediction model for streaming data, which consists of a frequent sequential pattern miner and an event predictor. For the pattern miner, we propose a new method to dynamically determine the optimal error bound by maximising the memory usage in order to achieve higher accuracy. Then, we use the discovered patterns to predict future events. Within this context, the interesting patterns should be those patterns whose prediction position falls within the prediction span.
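
To make the role of an error bound in stream mining concrete, here is a minimal sketch of error-bounded counting over a stream. The abstract does not name the underlying algorithm, so this example swaps in the well-known Lossy Counting scheme and counts single events rather than sequential patterns; the function name `lossy_count` and the parameter `epsilon` are assumptions made for illustration. A smaller `epsilon` keeps more entries in memory and undercounts less, which is the memory/accuracy trade-off that a dynamically chosen error bound would exploit.

```python
import math
from typing import Dict, Hashable, Iterable, Tuple


def lossy_count(stream: Iterable[Hashable],
                epsilon: float) -> Tuple[Dict[Hashable, Tuple[int, int]], int]:
    """Lossy Counting sketch (not the thesis method).

    Returns (entries, n), where entries maps each retained event to
    (observed count, maximum possible undercount) and n is the stream length.
    Each observed count is below the true count by at most epsilon * n.
    """
    width = math.ceil(1 / epsilon)  # number of events per bucket
    entries: Dict[Hashable, Tuple[int, int]] = {}
    n = 0
    for event in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if event in entries:
            count, delta = entries[event]
            entries[event] = (count + 1, delta)
        else:
            # The event may have been pruned in earlier buckets.
            entries[event] = (1, bucket - 1)
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent enough.
            entries = {e: (c, d) for e, (c, d) in entries.items()
                       if c + d > bucket}
    return entries, n
```

With a fixed memory budget, one could in principle lower `epsilon` until the number of retained entries approaches that budget; how the thesis actually determines its error bound is not described in the abstract.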