Abstract. The Krimp algorithm is our answer to the pattern explosion in data mining: the best set of patterns is the set that compresses the data best. Using the Minimum Description Length (MDL) principle, Krimp achieves reductions of up to 7 orders of magnitude in the number of frequent itemsets. The selected patterns are highly characteristic of the data, as indicated by good compression ratios and high classification accuracies.
Krimp was first published as Siebes et al. (2006), although not yet under that name. Since then, we have extended the Krimp foundation to data mining tasks such as characterising differences between databases, generating data, completing missing data, detecting changes in streams, identifying the components of a database, and more. For more details, see the publication list below.
Public release: source code and binaries. Our implementation of Krimp is freely available for research purposes; we provide both the C++ source code and binaries for Windows (x86 and x64) and Unix (x64 only; tested under Ubuntu, Fedora, OSX). In addition to the pattern selection algorithm, it contains the Krimp classifier and the StreamKrimp algorithm. For your convenience, the package includes some example UCI datasets taken from the LUCS-KDD data library. Please refer to the documentation in the package for installation/compilation details and usage hints.
Slim, beyond Krimp. Recently, we introduced the Slim algorithm for mining good code tables directly from data. Instead of first mining a (large) candidate collection, ordering it, and then greedily filtering it, Slim iteratively generates the candidates that are most likely to improve the current code table. You can find more information here.
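To make the "order the candidates, then greedily filter them" step concrete, here is a minimal Python sketch of a Krimp-style filter. It is a simplified illustration, not the released C++ implementation: the function names (`cover`, `total_size`, `krimp_like`) are hypothetical, pruning is omitted, and the cover order, candidate order, and encoded-size computation are assumptions based on the published description of Krimp. Transactions and candidates are assumed to be given as sets of items.

```python
from math import log2


def support(itemset, db):
    """Number of transactions that contain the itemset."""
    return sum(1 for t in db if itemset <= t)


def cover(transaction, code_table):
    """Greedily cover a transaction with non-overlapping code table itemsets.

    The code table must already be sorted in cover order.
    """
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:
            used.append(itemset)
            remaining -= itemset
    return used


def total_size(db, code_table, standard_len):
    """Encoded size in bits of the database plus the code table (assumed form)."""
    usage = {s: 0 for s in code_table}
    for t in db:
        for s in cover(t, code_table):
            usage[s] += 1
    total = sum(usage.values())
    size = 0.0
    for s, u in usage.items():
        if u == 0:
            continue  # unused itemsets get no code and cost nothing
        code_len = -log2(u / total)
        size += u * code_len                                # data given code table
        size += code_len + sum(standard_len[i] for i in s)  # code table itself
    return size


def krimp_like(db, candidates):
    """Accept each candidate, in order, only if it shrinks the total encoded size."""
    items = sorted({i for t in db for i in t})
    n_occurrences = sum(len(t) for t in db)
    # Code lengths of single items when only singletons are used.
    standard_len = {
        i: -log2(support(frozenset([i]), db) / n_occurrences) for i in items
    }

    def cover_order(s):
        # Assumed cover order: longer first, then higher support, then lexicographic.
        return (-len(s), -support(s, db), sorted(s))

    def candidate_order(s):
        # Assumed candidate order: higher support first, then longer, then lexicographic.
        return (-support(s, db), -len(s), sorted(s))

    code_table = sorted((frozenset([i]) for i in items), key=cover_order)
    best = total_size(db, code_table, standard_len)
    for cand in sorted(map(frozenset, candidates), key=candidate_order):
        trial = sorted(code_table + [cand], key=cover_order)
        size = total_size(db, trial, standard_len)
        if size < best:  # keep the candidate only if compression improves
            code_table, best = trial, size
    return code_table, best
```

For example, on a toy database in which the items `a`, `b`, `c` always occur together, calling `krimp_like(db, [{"a", "b", "c"}, {"a", "b"}])` would keep `{a, b, c}` in the code table and reject `{a, b}`, since the latter is never used once the larger itemset covers those transactions. Slim differs in that it generates candidates on the fly from the current code table rather than filtering a pre-mined collection.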
Implementation
Related Publications
Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery vol.23(1), pp 169-214, Springer, 2011. (IF 2.950)
The Odd One Out: Identifying and Characterising Anomalies. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 804-815, SIAM, 2011.
Identifying the Components. Data Mining and Knowledge Discovery vol.19(2), pp 176-193, Springer, 2009. (IF 2.950)
Preserving Privacy through Data Generation. In: Proceedings of the IEEE International Conference on Data Mining (ICDM), pp 685-690, IEEE, 2007.
Characterising the Difference. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp 765-774, ACM, 2007.
Compression Picks the Item Sets that Matter. In: Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pp 585-592, Springer, 2006.
Item Sets That Compress. In: Proceedings of the SIAM International Conference on Data Mining (SDM), pp 393-404, SIAM, 2006.