Below are the implementations of my research projects and papers up till 2013.

You will find the implementations of any newer work on my group's website.

phpBibLib is a PHP library for easily parsing and displaying entries from bibtex files, including the possibility of using citations in a webpage and displaying the corresponding references.

Measuring the difference between data mining results is an important open problem in exploratory data mining. We discuss an information-theoretic approach for measuring how much information is shared between results, and give a proof of concept for binary data.

Comparing Apples and Oranges – Measuring Differences between Exploratory Data Mining Results. Data Mining and Knowledge Discovery vol.25(2), pp 173-207, Springer, 2012.

A good first impression of a dataset is paramount to how we proceed with our analysis. We discuss mining high-quality, high-level descriptive summaries for binary and categorical data. Our approach builds summaries by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering.

Summarizing Categorical Data by Clustering Attributes. Data Mining and Knowledge Discovery vol.26(1), pp 130-173, Springer, 2013.

CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 198-206, SIAM, 2013.

CompreX discovers anomalies in data using pattern-based compression. Informally, it finds a collection of dictionaries that describe the *norm* of a database succinctly, and subsequently flags points dissimilar to the norm – those with high compression cost – as anomalies.

Fast and Reliable Anomaly Detection in Categoric Data. In: Proceedings of ACM Conference on Information and Knowledge Management (CIKM), pp 415-424, ACM, 2012.
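
The idea of flagging points with a high compression cost can be sketched in a few lines. The example below is a deliberately simplified stand-in, not the CompreX implementation: instead of a collection of multi-attribute pattern dictionaries, it assumes one dictionary per attribute and gives each value a Shannon code length of -log2 of its relative frequency. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def fit_code_lengths(rows):
    """Per attribute, assign each value a code length of -log2(relative frequency)."""
    n = len(rows)
    columns = list(zip(*rows))
    return [{value: -log2(count / n) for value, count in Counter(col).items()}
            for col in columns]

def compression_cost(row, code_lengths):
    """Total bits needed to encode one row; a higher cost means more anomalous."""
    return sum(lengths[value] for value, lengths in zip(row, code_lengths))

# Toy categorical data (hypothetical): most rows are ("red", "small").
rows = [("red", "small")] * 7 + [("blue", "small")] * 2 + [("blue", "large")]
code_lengths = fit_code_lengths(rows)
costs = [compression_cost(row, code_lengths) for row in rows]
most_anomalous = rows[costs.index(max(costs))]
```

Rows built from frequent values get short codes, while the single `("blue", "large")` row combines two rare values and therefore receives the highest cost.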

Given a snapshot of a large graph, in which an infection has been spreading for some time, can we identify those nodes from which the infection started to spread? In other words, can we reliably tell who the culprits are? With NetSleuth, we answer this question affirmatively for the Susceptible-Infected virus propagation model.

Spotting Culprits in Epidemics: How many and Which ones? In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 11-20, IEEE, 2012.
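
To make the setting concrete, the sketch below simulates the forward process only: a discrete-time Susceptible-Infected spread that produces the kind of snapshot NetSleuth starts from. It does not implement NetSleuth or any culprit identification. The function name, the infection probability `beta`, and the toy graph are assumptions for illustration.

```python
import random

def simulate_si(graph, seeds, beta, steps, rng):
    """Discrete-time Susceptible-Infected spread: in each step, every infected
    node infects each susceptible neighbour independently with probability beta.
    Infected nodes never recover."""
    infected = set(seeds)
    for _ in range(steps):
        newly = {nbr for node in infected for nbr in graph[node]
                 if nbr not in infected and rng.random() < beta}
        infected |= newly
    return infected

# Toy path graph 0 - 1 - 2 - 3 - 4 as an adjacency dict (hypothetical data).
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = simulate_si(graph, seeds={2}, beta=0.5, steps=3, rng=random.Random(0))
```

Given only `snapshot` and `graph`, the culprit problem asks which seed set most plausibly produced the observed infection.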

Suppose we are given a large graph in which, by some external process, a handful of nodes are marked. What can we say about these nodes? Are they close together in the graph? Or, if segregated, how many groups do they form? We approach this problem by trying to find *simple connection pathways* between sets of marked nodes, using MDL to identify the optimal result. We propose the efficient dot2dot algorithm for approximating this goal.

Mining Connection Pathways for Marked Nodes in Large Graphs. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 37-45, SIAM, 2013.

Detecting Bicliques in GF[q]. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 509-524, Springer, 2013.

The Krimp algorithm mines sets of itemsets by the MDL principle, defining the best set of patterns as the set that compresses the data best. The resulting *code tables* are orders of magnitude smaller than the number of (closed) frequent itemsets. They are highly characteristic of the data, and obtain high accuracy on many data mining tasks.

Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery vol.23(1), pp 169-214, Springer, 2011.
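
As a rough illustration of why a good set of patterns compresses the data, the sketch below covers each transaction greedily with itemsets from a small, hand-picked code table and charges every used itemset a Shannon code length based on how often it is used. This is a minimal sketch and not the Krimp implementation: the function names, the toy database, and the fixed table order are assumptions, and it leaves out the cost of the code table itself as well as the search over candidates.

```python
from collections import Counter
from math import log2

def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping itemsets, in table order."""
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    # With every singleton present in the table, nothing should remain uncovered.
    assert not remaining, "code table must contain every singleton item"
    return used

def encoded_size(database, code_table):
    """Bits needed to encode the database given the code table."""
    covers = [cover(t, code_table) for t in database]
    usage = Counter(itemset for c in covers for itemset in c)
    total = sum(usage.values())
    # Each used itemset costs -log2(usage / total usage) bits per occurrence.
    return sum(-log2(usage[itemset] / total) for c in covers for itemset in c)

# Toy transaction database (hypothetical data).
database = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("c")]
singletons = [frozenset("a"), frozenset("b"), frozenset("c")]
with_patterns = [frozenset("abc"), frozenset("ab")] + singletons

bits_singletons = encoded_size(database, singletons)
bits_patterns = encoded_size(database, with_patterns)
```

On this toy data the table that includes the itemsets `abc` and `ab` needs fewer bits than the singleton-only table, which is the signal an MDL-based selection would use to prefer it.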

Boolean Matrix Factorization has many desirable properties, such as high interpretability and natural sparsity. However, no method for selecting the correct model order has been available. We propose to use the Minimum Description Length principle, and show that besides solving the problem, this well-founded approach has numerous benefits, e.g., it is automatic, does not require a likelihood function, and, as experiments show, is highly accurate.

mdl4bmf: Minimal Description Length for Boolean Matrix Factorization. Transactions on Knowledge Discovery from Data vol.8(4), pp 1-30, ACM, 2014.
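
For readers unfamiliar with Boolean Matrix Factorization, the sketch below shows the Boolean matrix product and the reconstruction error for two candidate model orders (numbers of factors). It only illustrates the trade-off that model order selection has to resolve; it is not the mdl4bmf method and computes no description lengths. The function names and the toy matrices are assumptions.

```python
def boolean_product(B, C):
    """Boolean matrix product: entry (i, j) is 1 if some k has B[i][k] = C[k][j] = 1."""
    return [[int(any(B[i][k] and C[k][j] for k in range(len(C))))
             for j in range(len(C[0]))]
            for i in range(len(B))]

def reconstruction_error(A, B, C):
    """Number of cells where the Boolean product of B and C differs from A."""
    approx = boolean_product(B, C)
    return sum(a != p for row_a, row_p in zip(A, approx) for a, p in zip(row_a, row_p))

# Toy binary matrix (hypothetical data) and two factorizations of it.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
B = [[1, 0],
     [1, 1],
     [0, 1]]
C = [[1, 1, 0],
     [0, 1, 1]]
error_two_factors = reconstruction_error(A, B, C)
error_one_factor = reconstruction_error(A, [[1], [1], [1]], [[1, 1, 1]])
```

Two factors reconstruct `A` exactly, while a single all-ones factor leaves two cells wrong. More factors always allow a lower error but cost more to describe, and MDL is one principled way to balance the two.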

**Winner of the ACM SIGKDD 2011 Best Student Paper Award**, with Michael Mampaey & Nikolaj Tatti

mtv is a well-founded approach for summarizing data with itemsets: using a probabilistic maximum entropy model, we iteratively find the itemset that provides the most new information, and update our model accordingly. We can either mine top-k patterns, or identify the best summarisation by MDL or BIC.

Summarizing Data Succinctly with the Most Informative Itemsets. Transactions on Knowledge Discovery from Data vol.6(4), pp 1-44, ACM, 2012.

We formalise how to probabilistically model real-valued data by the Maximum Entropy principle, where we allow statistics on arbitrary sets of cells as background knowledge in terms of means and variances, or histograms.

Maximum Entropy Modelling for Assessing Results on Real-Valued Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 350-359, IEEE, 2011.

We aim to find itemsets that characterise the data well. To this end, we construct decision trees by which we can pack the data succinctly, and from which we can subsequently identify the most important itemsets. The Pack algorithm can either filter a candidate collection or mine its models directly from data.

Finding Good Itemsets by Packing Data. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 588-597, IEEE, 2008.

Slim mines high-quality Krimp code tables directly from data, as opposed to filtering a candidate collection. By doing so, Slim obtains smaller code tables that provide better compression ratios, while also improving classification accuracy and runtime, and reducing memory complexity by orders of magnitude.

Slim: Directly Mining Descriptive Patterns. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 236-247, SIAM, 2012.

We consider mining informative serial episodes (subsequences allowing for gaps) from event sequence data. We formalize the problem by the Minimum Description Length principle, and give algorithms for selecting good pattern sets from candidate collections as well as for parameter-free mining of such models directly from data.

The Long and the Short of It: Summarising Event Sequences with Serial Episodes. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp 462-470, ACM, 2012.

Stijl mines descriptions of ordered binary data. We model data hierarchically with noisy tiles: rectangles with a significantly different density than their parent tile. To identify good trees, we employ the Minimum Description Length principle, and give an algorithm for mining *optimal* sub-tiles in just O(*nm* min(*n*, *m*)) time.

Discovering Descriptive Tile Trees by Fast Mining of Optimal Geometric Subtiles. In: Proceedings of European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 9-24, Springer, 2012.

Unraveling Tobacco BY-2 Protein Complexes with BN PAGE/LC-MS/MS and Clustering Methods. Journal of Proteomics vol.74(8), pp 1201-1217, Elsevier, 2011.