Data Mining Template Library (Release 1.0)
New to release 1.0:
- A persistent file manager that allows the database to scale beyond main memory limits.
- Sequence mining now correctly handles both induced and embedded sequences, following a bug fix.
- Support for multiset mining.
- A new property for the tokenizer, plus various minor changes that make the library more robust.
DMTL, written in C++, is an open-source, high-performance, generic
data mining toolkit. It provides a collection of generic algorithms
and data structures for mining increasingly complex and informative
pattern types such as itemsets, sequences, trees and graphs.
DMTL utilizes a generic data mining approach, where all aspects of
mining are controlled via a set of properties. The kind of pattern to
be mined, the kind of mining approach to use, and the kind of data
types and formats to mine over are all specified as a list of
properties. This provides tremendous flexibility to customize the
toolkit for various applications.
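As a rough illustration of the property-driven approach, the C++17 sketch below encodes the pattern kind, the mining approach and the item type as template parameters. All names in it (`MiningConfig`, `itemset_tag`, `vertical_tag` and so on) are hypothetical and chosen for this example only; they are not DMTL's actual identifiers.

```cpp
#include <type_traits>

// Hypothetical property tags (illustrative names, not DMTL identifiers).
struct itemset_tag {};     // kind of pattern
struct sequence_tag {};
struct vertical_tag {};    // kind of mining approach
struct horizontal_tag {};

// A mining configuration is a list of property types. Nothing here mines
// data; the point is that every choice is visible to the compiler.
template <typename PatternKind, typename Approach, typename Item>
struct MiningConfig {
    using pattern_kind = PatternKind;
    using approach     = Approach;
    using item_type    = Item;
};

// Two configurations that differ only in their property lists.
using ItemsetsOverInts   = MiningConfig<itemset_tag, vertical_tag, int>;
using SequencesOverChars = MiningConfig<sequence_tag, horizontal_tag, char>;

// Generic code can query a configuration at compile time.
template <typename Config>
constexpr bool mines_sequences() {
    return std::is_same_v<typename Config::pattern_kind, sequence_tag>;
}

static_assert(!mines_sequences<ItemsetsOverInts>());
static_assert(mines_sequences<SequencesOverChars>());
```

Switching from itemsets to sequences in this sketch means changing one template argument, which is the kind of flexibility the property list is meant to provide.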
The project file contains the source code, test examples, extensive
documentation as well as related papers.
Background
Frequent Pattern Mining (FPM) is a data mining paradigm to extract
informative patterns from
massive datasets. Researchers have developed numerous novel algorithms
to extract these patterns. Unfortunately, the focus primarily has been
on a small set of popular patterns (itemsets, sequences, trees and
graphs) and no framework for integrating the FPM process has been
attempted. In this paper we introduce DMTL, which fuses theoretical concepts from formal concept analysis
and generic programming. It provides a framework that allows mining a
large spectrum of patterns. We express each pattern in terms of its
relational properties. Describing patterns based on their properties
results in a pattern concept hierarchy. This hierarchical model is
implemented using principles from generic programming. In this paper,
we describe our design considerations and the subsequent
implementation. Some of the challenges faced in terms of language
features have also been highlighted. Apart from using the library in
its entirety, we believe that some of its components, such as
isomorphism checking, can be used independently. These components could
enrich the existing functionality of popular libraries such as the
Boost Graph Library.
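One way to picture a pattern concept hierarchy built from relational properties is the C++17 sketch below. Each pattern is described by a list of property types, and one pattern refines another when it carries every property of the other. The property names and the `PropertyList`, `contains` and `refines` helpers are assumptions made for this illustration, not DMTL's definitions.

```cpp
#include <type_traits>

// Hypothetical relational properties of a pattern (illustrative only).
struct directed {};
struct acyclic {};
struct indegree_lte_one {};
struct outdegree_lte_one {};

// A pattern concept is described by the list of properties it satisfies.
template <typename... Props>
struct PropertyList {};

// contains<P, List>: true when property P appears in List.
template <typename P, typename List>
struct contains;

template <typename P, typename... Props>
struct contains<P, PropertyList<Props...>>
    : std::bool_constant<(std::is_same_v<P, Props> || ...)> {};

// refines<Sub, Super>: true when Sub has every property listed in Super,
// i.e. Sub sits at or below Super in the hierarchy.
template <typename Sub, typename Super>
struct refines;

template <typename Sub, typename... SuperProps>
struct refines<Sub, PropertyList<SuperProps...>>
    : std::bool_constant<(contains<SuperProps, Sub>::value && ...)> {};

// Example descriptions, from the most general to the most constrained.
using DirectedGraph = PropertyList<directed>;
using Dag           = PropertyList<directed, acyclic>;
using Tree          = PropertyList<directed, acyclic, indegree_lte_one>;
using Sequence      = PropertyList<directed, acyclic, indegree_lte_one,
                                   outdegree_lte_one>;

static_assert(refines<Sequence, Tree>::value);   // a sequence is a tree
static_assert(refines<Tree, Dag>::value);        // a tree is a DAG
static_assert(!refines<Tree, Sequence>::value);  // but not the reverse
```

In this sketch the hierarchy is never written down explicitly; it follows from set inclusion between property lists, so adding a new pattern only requires listing its properties.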
The major contributions of our work are as follows:
- DMTL offers algorithms for different pattern mining tasks
in a unified platform. To the best of our knowledge this is the first
effort of this kind in data mining.
- DMTL offers flexible interfaces to each of the algorithms,
including each of its sub-tasks so that it is very simple for end users
to use it as a library component in their software development.
- DMTL is extensible; new patterns can be mined with very
minimal effort from the end user. Users just need to define some
template parameters to ensure that the library selects the proper
mining algorithm to mine that pattern successfully. Some additional
specialized code may be required for efficiency reasons.
- DMTL adopts the generic software development approach using
C++ templates. Because of limitations imposed by the programming
language, it is still very difficult for programmers to design generic
software. Few books are available that describe an implementation of a
generic library. We believe that the design of DMTL could be an example
for other generic library developers to follow.
- Apart from its ultimate purpose of discovering frequent
patterns, our library provides several stand-alone utilities for
various patterns. These primarily include isomorphism checking
functionality for different patterns. We believe that these features
can complement those provided in the Boost Graph Library.
- While implementing DMTL, we faced numerous challenges,
mostly related to programming language support for generic software
development. Most of these issues have already been identified by
several researchers, but our work stands as another practical example
of those limitations.
- DMTL uses several template techniques that we think could be
useful to other generic software developers.
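The extensibility point in the list above (template parameters select the mining algorithm, with optional specialized code for efficiency) can be sketched with ordinary template specialization. The C++17 fragment below is a minimal illustration under that assumption; `MinerFor`, `Strategy`, `select_strategy` and the tag names are hypothetical and do not come from the DMTL sources.

```cpp
// Hypothetical strategies a library could choose between.
enum class Strategy { GenericLevelwise, ItemsetIntersection };

// Hypothetical pattern tags: one built in, one defined by an end user.
struct itemset_tag {};
struct my_new_pattern_tag {};

// Primary template: any pattern without dedicated code falls back to a
// generic strategy, so a new pattern works without extra effort.
template <typename PatternTag>
struct MinerFor {
    static constexpr Strategy strategy = Strategy::GenericLevelwise;
};

// Specialization: a pattern with a faster dedicated algorithm opts in.
template <>
struct MinerFor<itemset_tag> {
    static constexpr Strategy strategy = Strategy::ItemsetIntersection;
};

// The choice is resolved at compile time from the template argument.
template <typename PatternTag>
constexpr Strategy select_strategy() {
    return MinerFor<PatternTag>::strategy;
}

static_assert(select_strategy<itemset_tag>() == Strategy::ItemsetIntersection);
static_assert(select_strategy<my_new_pattern_tag>() == Strategy::GenericLevelwise);
```

A user who later needs better performance for `my_new_pattern_tag` would add a specialization of `MinerFor` for it; code that calls `select_strategy` does not change.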
Project Members
Mohammed Zaki -- PI
Mohammad Al Hasan, Vineet Chaoji, Saeed Salem -- PhD Students
Computer Science Department, Rensselaer Polytechnic Institute
Acknowledgements
The DMTL project is funded in part by Information Technology Innovation
Center -- Knowledge Discovery and Dissemination program; National
Science Foundation -- Information and Data Management program
(IIS-0092978), and Next Generation Software program (EIA-0103708); and
Department of Energy, Office of Science (DE-FG02-02ER25538).
Publications
- Mohammed J. Zaki, Nagender Parimi, Nilanjana De, Feng Gao,
Benjarath Phoophakdee, Joe Urban, Vineet Chaoji, Mohammad Al Hasan,
Saeed Salem, Towards Generic Pattern Mining, International Conference
on Formal Concept Analysis (Invited Paper), Lens, France, February 2005
(Also LNCS 3403, Springer-Verlag, and RPI CS Dept Technical Report
04-01, Jan 2004).
- Mohammad Hasan, Vineet Chaoji, Saeed Salem, Nagender
Parimi, and Mohammed Zaki, DMTL: A Generic Data Mining Template
Library, in Workshop on Library-Centric Software Design (LCSD'05), with
Object-Oriented Programming, Systems, Languages and Applications
(OOPSLA'05) conference, San Diego, California, October 2005.