ALPA is a technique that converts any black-box model into a comprehensible white-box model. The generated models explain the classifications made by the black-box model and can also improve upon traditional rule induction techniques.

  • 21/10/2013: Added a tutorial and a new revision.
  • 15/10/2012: WEKA regression rule extraction module is now available!


AntMiner+

AntMiner+ is a classification technique which is based on the principles of Ant Colony Optimization. The goal is to infer comprehensible rule-based classification models from a data set.

The AntMiner+ implementation is based on the description in Martens et al. (2007). The rule evaluation function was modified; see Minnaert et al. (2012) for details. Please reference the website as well as these two papers. Results of your experiments with AntMiner+ will also be added to the website on request. Installation and running instructions are detailed in the README.txt file.


Big Bayes

Big Bayes is a special naive Bayes variant based on the Bernoulli event model, tailored for very large, highly sparse datasets. It was first introduced in this paper and can handle datasets with millions of instances and attributes. To access the software, please fill in the terms of agreement form below.
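To illustrate why a Bernoulli event model can be made cheap on sparse data, here is a minimal Python sketch. It is not the Big Bayes code: the function names, the Laplace smoothing parameter `alpha`, and the representation of an instance as a set of active feature indices are all assumptions made for illustration. The idea shown is the standard one of precomputing, per class, the log-probability of an all-zero instance and then correcting it only for the features that are active.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(rows, labels, n_features, alpha=1.0):
    """Fit a Bernoulli naive Bayes model on sparse binary data.

    rows: list of sets holding the indices of the active (non-zero) features.
    labels: list of class labels, one per row.
    alpha: Laplace smoothing constant (assumed > 0 so no probability is 0 or 1).
    """
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: defaultdict(int))
    for active, y in zip(rows, labels):
        class_count[y] += 1
        for j in active:
            feat_count[y][j] += 1

    n = len(rows)
    model = {}
    for c, n_c in class_count.items():
        denom = n_c + 2.0 * alpha
        # Probability of a feature never observed together with class c.
        p_unseen = alpha / denom
        delta_unseen = math.log(p_unseen) - math.log(1.0 - p_unseen)
        # Log-probability of the all-zero instance: prior plus log(1 - p) for
        # every feature. Unseen features all share the same p.
        base = math.log(n_c / n)
        base += (n_features - len(feat_count[c])) * math.log(1.0 - p_unseen)
        delta = {}
        for j, k in feat_count[c].items():
            p = (k + alpha) / denom
            base += math.log(1.0 - p)
            # Correction applied when feature j is active in an instance.
            delta[j] = math.log(p) - math.log(1.0 - p)
        model[c] = (base, delta, delta_unseen)
    return model


def predict_log_scores(model, active):
    """Unnormalised log-posterior per class; cost depends only on len(active)."""
    return {
        c: base + sum(delta.get(j, delta_unseen) for j in active)
        for c, (base, delta, delta_unseen) in model.items()
    }


def predict(model, active):
    scores = predict_log_scores(model, active)
    return max(scores, key=scores.get)
```

With this layout, scoring an instance touches only its active features, so prediction time scales with the number of non-zero entries rather than with the total number of attributes.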


CFD (Customs Fraud Detection)

This file accompanies the paper "Customs Fraud Detection – Assessing the value of behavioral and high-cardinality data under the imbalanced learning issue" and provides the EasyEnsemble implementations with an SVM base learner. It also reports the AUC and lift results (at arbitrarily chosen capacity values) for each of the data sources under consideration (traditional, high-cardinality, behavioral).
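For readers unfamiliar with lift at a given capacity, the following Python sketch shows one common definition: the fraction of positives among the top-scored share of instances, divided by the overall positive rate. The function name, the rounding of the cut-off, and the tie handling are assumptions for illustration and are not taken from the accompanying file.

```python
def lift_at_capacity(scores, labels, capacity):
    """Lift among the top `capacity` fraction of instances ranked by score.

    scores: higher means more suspicious.
    labels: 1 for positive (e.g. fraud), 0 for negative.
    capacity: fraction in (0, 1] of instances that can be inspected.
    Ties in score are broken by original order (sorted() is stable).
    """
    if not 0 < capacity <= 1:
        raise ValueError("capacity must be in (0, 1]")
    n = len(scores)
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("lift is undefined without positive instances")
    # Number of instances that fit within the inspection capacity.
    k = max(1, int(round(capacity * n)))
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top_positives = sum(y for _, y in ranked[:k])
    return (top_positives / k) / base_rate
```

A lift of 1 means the ranking does no better than random inspection at that capacity; a lift of 5 means inspected cases contain five times as many positives as a random sample would.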


Data for Software Fault Prediction

Android data sets used for software fault prediction and extracted within the scope of the paper: "Comprehensible Software Fault and Effort Prediction: a Data Mining Approach".


EDC

Document classification has widespread applications, such as classifying web pages for advertising, emails for legal discovery, and blog entries for sentiment analysis. Previous approaches to gaining insight into black-box models do not deal well with high-dimensional data. With EDC, we define a new sort of explanation, tailored to the business needs of document classification and able to cope with the associated technical constraints.


Faster ROC-AUC (Matlab)

Calculates the area under the ROC curve (AUC) for a binary classification problem. The main advantages of using this function over perfcurve are:


  • Speed: On a benchmark of 20 million instances, this function ran more than 100 times faster than perfcurve (Matlab Statistics Toolbox).
  • Independence: Works without the Statistics Toolbox installed.
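One well-known way to compute the AUC without constructing the ROC curve is the rank-sum (Mann-Whitney U) formulation, which needs only a single sort. The Python sketch below illustrates that formulation; whether the Matlab package uses it is an assumption, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC via the Mann-Whitney U statistic, using midranks for tied scores.

    scores: classifier outputs, higher means more likely positive.
    labels: truthy for positive instances, falsy for negative ones.
    """
    n = len(scores)
    n_pos = sum(1 for y in labels if y)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    order = sorted(range(n), key=lambda i: scores[i])  # ascending by score
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block order[i..j] of instances sharing the same score.
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        # Positions i..j correspond to 1-based ranks i+1..j+1; use their mean.
        midrank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            if labels[order[k]]:
                rank_sum_pos += midrank
        i = j + 1

    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)
```

The result equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, with ties counted as one half.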

The package can be downloaded from

ICBD (Imbalanced Classification for Behaviour Data)

This toolbox provides Matlab implementations, results and datasets accompanying the paper “Imbalanced classification in sparse and large behaviour datasets”. Behaviour data reflect fine-grained behaviours of individuals or organisations and are characterized by sparseness and very large dimensions. Traditional studies of the imbalanced learning issue operate on low-dimensional, dense datasets, whose structure and properties differ from those of behaviour data. Imbalanced behaviour data occur naturally across a wide range of applications, including online advertising, fraud detection, churn prediction, default prediction, and predictive policing.


SW-transformation

The SW-transformation is a fast classifier for binary node classification in bipartite graphs (Stankova et al., 2015). Bipartite graphs (or bigraphs) have two types of nodes, with edges existing only between nodes of different types. The SW-transformation combines the weighted-vote Relational Neighbor (wvRN) classifier with an aggregation function that sums the weights of the top nodes. For each test instance, the transformation considers only the weights of the neighboring top nodes, each multiplied by the number of positively labeled training instances in that node's column (the positive neighbors of the node). The SW-transformation yields very fast run times and scales easily to big data sets of millions of nodes (Stankova et al., 2015).
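The scoring step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, the default top-node weight (the inverse of the node's training degree) is chosen only for the example, and any normalisation of the final score is omitted.

```python
from collections import defaultdict


def fit_sw(train_adj, train_labels, weight_fn=lambda degree: 1.0 / degree):
    """Precompute one coefficient per top node.

    train_adj: list of sets; train_adj[i] holds the top nodes linked to
               training instance (bottom node) i.
    train_labels: 1 for positive instances, 0 for negative ones.
    weight_fn: maps a top node's training degree to its weight
               (illustrative assumption, any positive function works here).
    """
    degree = defaultdict(int)     # training instances linked to each top node
    positives = defaultdict(int)  # positively labeled ones among them
    for tops, y in zip(train_adj, train_labels):
        for k in tops:
            degree[k] += 1
            if y == 1:
                positives[k] += 1
    # Coefficient = top-node weight times its number of positive neighbors.
    return {k: weight_fn(degree[k]) * positives[k]
            for k in degree if positives[k] > 0}


def score_sw(coefficients, tops):
    """Score a test instance by summing coefficients of its top nodes."""
    return sum(coefficients.get(k, 0.0) for k in tops)
```

Because the coefficients are computed once per top node, scoring a test instance costs only as much as the number of its neighbors, which is what makes this kind of formulation attractive for graphs with millions of nodes.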

Available on GitHub