Classification within network data with a bipartite structure
2 December 2016
University of Antwerp - Stadscampus - Promotiezaal Grauwzusters - Lange Sint-Annastraat 7 - 2000 Antwerp
Prof D. Martens
PhD defence Marija Stankova - Faculty of Applied Economics
Many relational, behavioural and transactional datasets in data mining take the form of bipartite graphs (bigraphs), such as data about users rating movies or people visiting locations. Although some work exists on such bigraphs, no general network-oriented methodology has yet been proposed for node classification. Prior literature has generally treated classification in this type of data from a classical perspective, as classification with massive and sparse feature data. In this dissertation we instead propose alternative network-based formulations for classification in bipartite data via projection.
This projection approach transforms the bigraph into a weighted unigraph that preserves information about the underlying bigraph and allows practitioners to draw on the wealth of unigraph techniques already available. The resulting frameworks open up the design space for experimenting with existing or new methods at the different stages, and for creating new techniques by mixing and matching those choices. Furthermore, we validate our designs with two real-world applications.
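To make the projection idea concrete, here is a minimal sketch in Python. It is not the dissertation's implementation: the function names `project_bigraph` and `wvrn_scores`, the toy user-movie matrix, and the choice of weighting each projected link by the number of shared bottom nodes are all assumptions made for illustration. The classifier shown is a simple weighted-vote relational neighbour rule, one of many unigraph techniques that could be plugged in at this stage.

```python
import numpy as np

def project_bigraph(B):
    """Project a bipartite adjacency matrix B (top nodes x bottom nodes)
    onto the top nodes. W[i, j] is the number of bottom nodes that top
    nodes i and j share (an assumed, very simple weighting scheme)."""
    B = np.asarray(B, dtype=float)
    W = B @ B.T
    np.fill_diagonal(W, 0.0)  # a node is not its own neighbour
    return W

def wvrn_scores(W, labels):
    """Weighted-vote relational neighbour scores on the projected graph.
    labels holds 1.0 / 0.0 for known nodes and np.nan for unknown ones.
    Each score is the weighted share of positive labels among a node's
    labelled neighbours, or np.nan if it has no labelled neighbour."""
    labels = np.asarray(labels, dtype=float)
    known = ~np.isnan(labels)
    positives = np.where(known, labels, 0.0)
    num = W @ positives
    den = W @ known.astype(float)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den > 0, num / den, np.nan)

# Hypothetical toy data: 4 users (rows) rating 4 movies (columns).
B = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])
W = project_bigraph(B)

# Users 0 and 3 are labelled; users 1 and 2 are to be classified.
labels = np.array([1.0, np.nan, np.nan, 0.0])
scores = wvrn_scores(W, labels)
```

In this toy example user 1 shares two movies with the positively labelled user 0, so its score leans positive, while user 2 is linked only to the negatively labelled user 3 among the labelled nodes. Other weighting schemes or relational classifiers could be swapped in without changing the overall structure.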
In collaboration with the Belgian government, we use these network-based formulations to help detect companies that fraudulently reside outside Belgium for tax benefits. This is, to our knowledge, the first published data-mining-based approach to detecting corporate residence fraud. We also worked with Lenddo, an NYC-based micro-lending company, to assess the creditworthiness of its loan applicants using alternative social network data.
Such an automated credit scoring process is innovative and has potentially large implications for the wider use of microfinance and for the economic growth of developing countries. Lastly, we discuss classification in bigraph data from the standard perspective, with high-dimensional features, and elaborate on how much it helps to account for nonlinearities in the data through higher-order interaction features.
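As a rough illustration of what higher-order interaction features look like in the standard feature-based view, the sketch below appends all pairwise (second-order) products to a binary feature matrix. The function name `add_pairwise_interactions` and the toy matrix are hypothetical, and the dissertation may construct or select such features differently. For binary columns, each product is 1 only when both original features are present.

```python
from itertools import combinations
import numpy as np

def add_pairwise_interactions(X):
    """Append the product of every pair of columns to X.
    Returns the extended matrix and the list of column pairs used."""
    X = np.asarray(X)
    pairs = list(combinations(range(X.shape[1]), 2))
    if not pairs:  # fewer than two columns: nothing to add
        return X, pairs
    inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
    return np.hstack([X, inter]), pairs

# Hypothetical toy data: 3 instances, 3 binary features.
X = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
])
X2, pairs = add_pairwise_interactions(X)
```

With d original features this adds d(d-1)/2 columns, which grows quickly, so on massive and sparse data some form of feature selection or sparse storage would be needed in practice.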