Data Mining Template Library (DMTL)


Data Mining Template Library (Release 1.0)
New to release 1.0:

DMTL is an open-source, high-performance, generic data mining toolkit written in C++. It provides a collection of generic algorithms and data structures for mining increasingly complex and informative pattern types, such as itemsets, sequences, trees and graphs. DMTL takes a generic approach to data mining in which all aspects of mining are controlled via a set of properties: the kind of pattern to be mined, the mining approach to use, and the data types and formats to mine over are all specified as a list of properties. This makes it straightforward to customize the toolkit for various applications.
The project file contains the source code, test examples, extensive documentation, and related papers.
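
As a rough illustration of the property-driven design described above, the following C++ sketch shows how a list of compile-time properties could select a mining algorithm. It is not DMTL's actual interface: every name in it (itemset_tag, vertical_db, mining_properties, miner_for, select_miner) is a hypothetical stand-in chosen for this example, and the "miners" only report their names instead of mining anything.

```cpp
#include <cassert>
#include <string>

// Hypothetical property tags; these are NOT DMTL's real identifiers.
struct itemset_tag {};
struct sequence_tag {};

// Hypothetical storage-format properties, each carrying a printable label.
struct vertical_db   { static const char* label() { return "vertical"; } };
struct horizontal_db { static const char* label() { return "horizontal"; } };

// A compile-time property list: each aspect of mining is one type parameter.
template <typename PatternKind, typename Storage>
struct mining_properties {
    using pattern_kind = PatternKind;
    using storage = Storage;
};

// Primary template is left undefined on purpose, so asking for an
// unsupported pattern kind is a compile-time error.
template <typename PatternKind>
struct miner_for;

template <>
struct miner_for<itemset_tag> {
    static std::string name() { return "itemset miner"; }
};

template <>
struct miner_for<sequence_tag> {
    static std::string name() { return "sequence miner"; }
};

// Generic entry point: the algorithm is chosen from the properties alone.
template <typename Props>
std::string select_miner() {
    return miner_for<typename Props::pattern_kind>::name() + " over " +
           Props::storage::label() + " data";
}

// Usage: changing a property changes the selected algorithm at compile time.
inline void example_usage() {
    using itemset_props = mining_properties<itemset_tag, vertical_db>;
    using sequence_props = mining_properties<sequence_tag, horizontal_db>;
    assert(select_miner<itemset_props>() == "itemset miner over vertical data");
    assert(select_miner<sequence_props>() == "sequence miner over horizontal data");
}
```

In a design of this kind, supporting a new pattern type amounts to adding a tag and one specialization, while the generic entry point stays unchanged.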

Background
Frequent Pattern Mining (FPM) is a data mining paradigm for extracting informative patterns from massive datasets. Researchers have developed numerous novel algorithms to extract these patterns. Unfortunately, the focus has primarily been on a small set of popular patterns (itemsets, sequences, trees and graphs), and no framework for integrating the FPM process has been attempted. In this paper we introduce DMTL, which fuses theoretical concepts from formal concept analysis and generic programming. It provides a framework that allows mining a large spectrum of patterns. We express each pattern in terms of its relational properties. Describing patterns by their properties results in a pattern concept hierarchy, which is implemented using principles from generic programming. We describe our design considerations and the subsequent implementation, and highlight some of the challenges we faced in terms of language features. Apart from using the library in its entirety, we believe that some of its components, such as isomorphism checking, can be used independently. These components could enrich the functionality provided by popular libraries such as the Boost Graph Library.
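
To make the idea of a pattern concept hierarchy more concrete, here is a minimal C++ sketch in which each pattern concept is described by a set of relational properties, and one concept refines another when it carries all of the other's properties. The particular properties and the concepts built from them are assumptions made for this example; they are not taken from the DMTL sources or papers.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative relational properties encoded as bit flags (assumed set).
constexpr std::uint32_t kDirected     = 1u << 0;
constexpr std::uint32_t kAcyclic      = 1u << 1;
constexpr std::uint32_t kInDegreeLe1  = 1u << 2;  // at most one incoming edge per vertex
constexpr std::uint32_t kOutDegreeLe1 = 1u << 3;  // at most one outgoing edge per vertex

// A pattern concept is identified by the properties it guarantees.
template <std::uint32_t Props>
struct pattern_concept {
    static constexpr std::uint32_t properties = Props;
};

// Hypothetical concepts, from most general to most specific.
using graph_concept    = pattern_concept<0>;
using dag_concept      = pattern_concept<kDirected | kAcyclic>;
using tree_concept     = pattern_concept<kDirected | kAcyclic | kInDegreeLe1>;
using sequence_concept = pattern_concept<kDirected | kAcyclic | kInDegreeLe1 | kOutDegreeLe1>;

// Specific refines General when it has every property General has.
template <typename Specific, typename General>
constexpr bool refines() {
    return (Specific::properties & General::properties) == General::properties;
}

// The hierarchy is checked at compile time.
static_assert(refines<sequence_concept, tree_concept>(), "a sequence is a restricted tree");
static_assert(refines<tree_concept, dag_concept>(), "a tree is a restricted DAG");
static_assert(!refines<dag_concept, tree_concept>(), "a DAG is not necessarily a tree");
```

Under this encoding, an algorithm written for a general concept can accept any concept that refines it, which is the kind of reuse a concept hierarchy is meant to enable.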

The major contributions of our work are as follows:
  1. DMTL offers algorithms for different pattern mining tasks in a unified platform. To the best of our knowledge, this is the first effort of this kind in data mining.
  2. DMTL offers flexible interfaces to each of the algorithms and to each of their sub-tasks, so end users can easily use it as a library component in their own software.
  3. DMTL is extensible; new patterns can be mined with minimal effort from the end user. Users just need to define some template parameters to ensure that the library selects the proper mining algorithm for that pattern. Some additional specialized code may be required for efficiency reasons.
  4. DMTL adopts the generic software development approach using C++ templates. Because of limitations in the programming language, it is still difficult for programmers to design generic software, and few books describe the implementation of a generic library. We believe that the design of DMTL could serve as an example for other generic library developers to follow.
  5. Apart from its ultimate purpose of discovering frequent patterns, our library provides several stand-alone utilities for various patterns. These primarily include isomorphism checking for different patterns. We believe that these features can complement those provided by the Boost Graph Library (BGL).
  6. While implementing DMTL, we faced numerous challenges, mostly related to programming language support for generic software development. Most of these issues have already been identified by several researchers, but our work stands as another practical example of those limitations.
  7. DMTL uses several template techniques that we think could be useful to other generic software developers.
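
As an example of what a stand-alone isomorphism check (contribution 5) can look like for one pattern type, the sketch below compares rooted, unordered, labeled trees by reducing each to a canonical string. This is a standard textbook technique shown for illustration only; it is not DMTL's implementation, and the names LabeledTree, make_tree, canonical_form and isomorphic are hypothetical. It assumes single-character labels that are not parentheses and that vertex 0 is the root.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// A rooted labeled tree; vertex 0 is the root (assumption of this sketch).
struct LabeledTree {
    std::vector<char> labels;                        // labels[v] is the label of vertex v
    std::vector<std::vector<std::size_t>> children;  // children[v] lists the children of v
};

// Builds a tree from one label per vertex and a parent index per vertex.
// parent[0] is ignored because vertex 0 is the root.
LabeledTree make_tree(const std::string& labels, const std::vector<std::size_t>& parent) {
    LabeledTree t;
    t.labels.assign(labels.begin(), labels.end());
    t.children.resize(labels.size());
    for (std::size_t v = 1; v < labels.size(); ++v) {
        t.children[parent[v]].push_back(v);
    }
    return t;
}

// Canonical form of the subtree rooted at v: "(" + label + the children's
// canonical forms in sorted order + ")". Sorting removes child-order differences.
std::string canonical_form(const LabeledTree& t, std::size_t v = 0) {
    std::vector<std::string> parts;
    for (std::size_t child : t.children[v]) {
        parts.push_back(canonical_form(t, child));
    }
    std::sort(parts.begin(), parts.end());
    std::string out(1, '(');
    out += t.labels[v];
    for (const std::string& part : parts) {
        out += part;
    }
    out += ')';
    return out;
}

// Two rooted unordered labeled trees are isomorphic exactly when their
// canonical forms are equal.
bool isomorphic(const LabeledTree& a, const LabeledTree& b) {
    return canonical_form(a) == canonical_form(b);
}
```

For instance, a tree built with make_tree("abcd", {0, 0, 0, 1}) and one built with make_tree("acbd", {0, 0, 0, 2}) list the root's children in different orders but compare as isomorphic, whereas moving the vertex labeled d under the vertex labeled c yields a tree that does not.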

Project Members
Mohammed Zaki -- PI
Mohammad Al Hasan, Vineet Chaoji, Saeed Salem -- PhD Students
Computer Science Department, Rensselaer Polytechnic Institute

Acknowledgements
The DMTL project is funded in part by the Information Technology Innovation Center -- Knowledge Discovery and Dissemination program; the National Science Foundation -- Information and Data Management program (IIS-0092978) and Next Generation Software program (EIA-0103708); and the Department of Energy, Office of Science (DE-FG02-02ER25538).

Publications
  1. Mohammed J. Zaki, Nagender Parimi, Nilanjana De, Feng Gao, Benjarath Phoophakdee, Joe Urban, Vineet Chaoji, Mohammad Al Hasan, Saeed Salem, Towards Generic Pattern Mining, International Conference on Formal Concept Analysis (Invited Paper), Lens, France, February 2005 (also LNCS 3403, Springer-Verlag, and RPI CS Dept Technical Report 04-01, Jan 2004).
  2. Mohammad Al Hasan, Vineet Chaoji, Saeed Salem, Nagender Parimi, and Mohammed Zaki, DMTL: A Generic Data Mining Template Library, in Workshop on Library-Centric Software Design (LCSD'05), held with the Object-Oriented Programming, Systems, Languages and Applications (OOPSLA'05) conference, San Diego, California, October 2005.


Page maintained by Vineet Chaoji. Last modified 2006-07-27.