Data mining for tax fraud detection

Date: 14 February 2019

Venue: University of Antwerp, Grauwzusters Cloister - Stadscampus, Lange Sint-Annastraat 7 - 2000 Antwerp

Time: 4:00 PM

PhD candidate: Jellis Vanhoeyveld

Principal investigators: Prof. Dr. Ir. David Martens, Prof. Dr. Bruno Peeters

Short description: PhD defence of Jellis Vanhoeyveld - Faculty of Business and Economics - Department of Engineering Management


Because tax fraud is high-impact, far-reaching and diverse, governments invest in advanced detection methodologies. Data mining techniques offer an important opportunity: they can automatically distinguish fraudulent patterns from legitimate ones, so that tax administrations can focus their limited resources on the cases most likely to be fraudulent.

First, we focus on the imbalanced data distribution problem for behavioural data: we develop a variety of tailored imbalanced learning solutions and examine them across imbalanced behavioural datasets from different application areas. Tax fraud detection is itself characterized by imbalance: compliant cases far outnumber fraudsters, which leads to suboptimal performance for many supervised classification techniques. Furthermore, behavioural data that capture the fine-grained (inter)actions of persons and/or organizations are widely available (e.g. invoicing data in the VAT domain, people sitting on the boards of directors of organizations, shareholder ownership, etc.), yet they are mostly ignored in tax fraud detection studies. Next, we apply these insights in a case study of customs fraud. Our main findings suggest that fine-grained behavioural and high-cardinality data (e.g. consignee, declarant, type of commodity, etc.) are very predictive and can even outperform the traditional data used in the literature, which contain many more variables. Imbalanced learning solutions improve the predictive performance of the models further.
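To make the imbalance problem concrete, the sketch below shows random oversampling, a generic baseline that duplicates minority-class (fraud) records until both classes are equally large. It is not claimed to be one of the tailored solutions developed in the thesis. The function name `random_oversample`, the entity labels and the toy records are all hypothetical; each record is assumed to be the set of entities (consignee, declarant, commodity) appearing in one customs declaration.

```python
import random
from collections import Counter


def random_oversample(records, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority records until the classes are balanced.

    Assumes a binary problem: every label other than `minority_label`
    counts as the majority class.
    """
    rng = random.Random(seed)
    minority = [r for r, y in zip(records, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority examples to oversample")
    majority_count = sum(1 for y in labels if y != minority_label)
    extra = max(0, majority_count - len(minority))
    # Sample with replacement from the minority class.
    resampled = [rng.choice(minority) for _ in range(extra)]
    return list(records) + resampled, list(labels) + [minority_label] * extra


# Hypothetical toy data: 5 compliant declarations (0) and 1 fraudulent one (1).
records = [
    {"consignee:A", "declarant:X", "commodity:textiles"},
    {"consignee:B", "declarant:X", "commodity:textiles"},
    {"consignee:A", "declarant:Y", "commodity:toys"},
    {"consignee:C", "declarant:Y", "commodity:toys"},
    {"consignee:B", "declarant:Z", "commodity:steel"},
    {"consignee:D", "declarant:W", "commodity:electronics"},
]
labels = [0, 0, 0, 0, 0, 1]

balanced_records, balanced_labels = random_oversample(records, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only; resampling before a train/test split would leak duplicated fraud cases into the evaluation data.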

Unsupervised anomaly detection constitutes another class of data mining methodologies. It detects entities (fraud cases) whose behaviour differs significantly from normal behaviour (compliant cases), based solely on their feature representation. Because the number of tax declarations can be large (e.g. several million), many anomaly detection techniques are too computationally expensive to apply in such settings. We develop efficient data compression techniques as a pre-processing step, so that the detectors are presented with far fewer (though still relevant) cases. In a case study of VAT fraud, such scalable anomaly detectors are applied to the feature representations of all companies belonging to the same sector. Based on domain knowledge, we derive presumed fraud indicators (features) that take the form of tax ratios computed from the VAT declaration form and client listings.
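The compress-then-detect idea can be illustrated with a minimal sketch. It assumes a single hypothetical tax ratio (deductible VAT divided by output VAT), compresses the companies of one sector onto a grid of cells with counts, and scores each cell with a robust z-score (distance from the weighted median, divided by the weighted median absolute deviation). The ratio, the cell width `STEP`, the company names and the scoring rule are illustrative assumptions, not the compression techniques or detectors developed in the thesis.

```python
from collections import Counter

STEP = 0.05  # hypothetical cell width of the compression grid


def to_cell(ratio, step=STEP):
    """Map a ratio to the integer index of its grid cell."""
    return round(ratio / step)


def weighted_median(counts):
    """Median of the underlying values, given a {value: count} mapping."""
    total = sum(counts.values())
    running = 0
    for value in sorted(counts):
        running += counts[value]
        if 2 * running >= total:
            return value


def cell_scores(cells):
    """Robust z-score per cell, computed on the compressed data only."""
    med = weighted_median(cells)
    deviations = Counter()
    for cell, count in cells.items():
        deviations[abs(cell - med)] += count
    # Fall back to one cell width if more than half the mass sits in one cell.
    mad = weighted_median(deviations) or 1
    return {cell: abs(cell - med) / mad for cell in cells}


# Hypothetical declarations for one sector: (company, deductible VAT, output VAT).
declarations = [
    ("c1", 40.0, 100.0),
    ("c2", 45.0, 100.0),
    ("c3", 50.0, 100.0),
    ("c4", 50.0, 100.0),
    ("c5", 55.0, 100.0),
    ("c6", 60.0, 100.0),
    ("c7", 95.0, 100.0),
]

ratios = {name: deductible / output for name, deductible, output in declarations}
cells = Counter(to_cell(r) for r in ratios.values())  # compressed representation
scores = cell_scores(cells)
company_scores = {name: scores[to_cell(r)] for name, r in ratios.items()}

most_anomalous = max(company_scores, key=company_scores.get)
print(most_anomalous, company_scores[most_anomalous])
```

The detector only ever iterates over occupied cells, so its cost depends on the number of distinct cells rather than on the number of declarations; with millions of declarations and a coarse grid, that is the source of the speed-up.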