DataCrystal: A Metapattern-Based Automated Discovery Loop for Integrated
Data Mining
Information Sciences Institute and Computer Science Department
University of Southern California
Contact Information
Wei-Min Shen
USC Information Sciences Institute
4676 Admiralty Way
Marina del Rey, CA 90292
Phone: (310) 822-1511
Fax: (310) 822-0751
Email: shen@isi.edu
WWW PAGE
http://www.isi.edu/dcrystal
and http://www.isi.edu/~shen
Keywords
Data Mining, Knowledge discovery from databases, Machine Learning,
Automatic Modeling of Legacy Databases, Meta-Pattern Guided Discovery,
Integrated Data Mining Systems.
Project Award Information
- Award Number: 9529615
- Duration: the 2nd year of a 3-year effort
- Title: A Metapattern-Based Automated Discovery Loop for Integrated Data
  Mining
Project Summary
This research is developing a metapattern-based discovery loop for integrated
data mining. Metapatterns (also known as metaqueries) are second-order,
declarative expressions that specify the types of patterns to be
discovered and assist humans in focusing on more fruitful search directions.
The discovery loop is a search engine that integrates deduction, induction,
and external guidance from humans, as well as internal guidance of inter-component
dependencies. Given a completely new database, the system first generates
an initial set of the most general metapatterns based on the meta-information
of the database, and then executes these metapatterns against the database
to discover actual patterns. Based on the results, new metapatterns are
dynamically generated by adding more constraints to the more plausible
metapatterns. In this iterative process, human discoverers can analyze, create,
select, and execute metapatterns, or instruct the system to pursue metapatterns
on its own. This ability not only makes the process of data mining
more efficient and productive (more expert users can use the system
as inspiration for better metapatterns, and less expert users can learn
how to perform data mining in a particular domain by observation), but
also provides a new method for unsupervised learning of probabilistic,
relation-based patterns.
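
As an illustration of how a single metapattern is executed against data, the sketch below instantiates a transitivity metapattern of the assumed form P(X,Y) & Q(Y,Z) -> R(X,Z) over a toy set of binary relations and keeps the instantiations whose confidence passes a threshold. The relation names, the data, the helper functions, and the confidence measure are all hypothetical choices made for this example; it is not the DataCrystal implementation.

```python
from itertools import product

# Hypothetical toy database: each relation is a set of (x, y) tuples.
relations = {
    "parent": {("ann", "bob"), ("bob", "cat"), ("bob", "dan")},
    "grandparent": {("ann", "cat"), ("ann", "dan")},
    "sibling": {("cat", "dan"), ("dan", "cat")},
}

def join(p, q):
    """Compose two binary relations: {(x, z) | (x, y) in p and (y, z) in q}."""
    by_first = {}
    for y, z in q:
        by_first.setdefault(y, set()).add(z)
    return {(x, z) for x, y in p for z in by_first.get(y, ())}

def instantiate_transitivity(relations, min_confidence=0.8):
    """Try every (P, Q, R) binding of the assumed metapattern
    P(X,Y) & Q(Y,Z) -> R(X,Z) and keep those the data supports."""
    patterns = []
    for p, q, r in product(relations, repeat=3):
        body = join(relations[p], relations[q])
        if not body:
            continue  # the left-hand side never fires; nothing to measure
        # Confidence: fraction of derived (x, z) pairs that appear in R.
        confidence = len(body & relations[r]) / len(body)
        if confidence >= min_confidence:
            patterns.append((p, q, r, confidence))
    return patterns

patterns = instantiate_transitivity(relations)
for p, q, r, confidence in patterns:
    print(f"{p}(X,Y) & {q}(Y,Z) -> {r}(X,Z)  confidence={confidence:.2f}")
```

In a discovery loop of the kind described above, the surviving instantiations would be the candidates for further constraint, while low-confidence bindings would be dropped or deferred.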
Goals, Objectives, and Targeted Activities
From June 1, 1997 through May 31, 1998, we have been focusing on applying
the metapattern technology to automatically analyze legacy relational
databases and discover useful metadata -- characterizations of the database's
intended semantics. The main problem faced when trying to incorporate
legacy databases into modern knowledge-based systems is the difficulty
of obtaining such metadata. The problem is particularly acute when the
data is noisy and access to human domain experts is limited -- both frequent
occurrences. The AutoModel tool developed under the methodology of
DataCrystal uses advanced data mining techniques to analyze database contents.
By comparing data and attribute names in different tables, in a multi-step
process, it identifies potential key and foreign key attributes and ultimately
proposes an entity-relationship (ER) model for the data. Expert knowledge
about the database and the domain can be used to control the process.
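
The key and foreign key identification step can be illustrated with a minimal sketch. It assumes a simple heuristic: a column is a candidate key if its values are unique, and a column in another table is a candidate foreign key if most of its values occur in that key, with a tolerance threshold to absorb noisy rows. The tables, function names, and thresholds are hypothetical, and the actual AutoModel procedure may differ.

```python
# Hypothetical toy tables: table name -> list of row dicts.
tables = {
    "depot": [
        {"depot_id": 1, "city": "Reno"},
        {"depot_id": 2, "city": "Tulsa"},
        {"depot_id": 3, "city": "Reno"},
    ],
    "shipment": [
        {"ship_id": 10, "depot_id": 1},
        {"ship_id": 11, "depot_id": 1},
        {"ship_id": 12, "depot_id": 3},
        {"ship_id": 13, "depot_id": 9},  # noisy row: no such depot
    ],
}

def column(tables, table, name):
    """Return all values of one column as a list."""
    return [row[name] for row in tables[table]]

def candidate_keys(tables, table, min_uniqueness=1.0):
    """Columns whose distinct-value ratio reaches the threshold."""
    keys = []
    for name in tables[table][0].keys():
        values = column(tables, table, name)
        if len(set(values)) / len(values) >= min_uniqueness:
            keys.append(name)
    return keys

def candidate_foreign_keys(tables, min_inclusion=0.7):
    """Find (child, column, parent, key, inclusion, same_name) tuples where
    at least min_inclusion of the child column's values occur in the key."""
    found = []
    for parent in tables:
        for key in candidate_keys(tables, parent):
            key_values = set(column(tables, parent, key))
            for child in tables:
                if child == parent:
                    continue
                for name in tables[child][0].keys():
                    values = column(tables, child, name)
                    inclusion = sum(v in key_values for v in values) / len(values)
                    if inclusion >= min_inclusion:
                        # Matching attribute names are reported as extra evidence.
                        found.append((child, name, parent, key, inclusion, name == key))
    return found

fks = candidate_foreign_keys(tables)
print(candidate_keys(tables, "depot"))
print(fks)
```

Lowering `min_inclusion` makes the search more tolerant of noisy data at the cost of more spurious candidates, which is one place where expert knowledge about the domain could steer the process.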
Our goal for the next year is to complete the design and implementation
of generators for other types of metapatterns beyond transitivity,
and to incorporate all the results we have obtained so far into the control
panel so that users can examine, select, and execute metapatterns interactively.
We will also evaluate our system on several large legacy databases that
we have access to, mainly logistics databases for military applications.
Indication of Success
The specific component of DataCrystal constructed this year has been tested
on several logistics databases of sizes ranging up to 100 tables with up
to 60 columns each, and up to 1.5 million rows. The results generated are
comparable to those obtained by purely human analysis. However, AutoModel
speeds up this previously labor-intensive process by an order of magnitude
or more, reducing it from weeks to hours.
This result has been well received by the research community for military
applications, and we were asked to submit a short summary for the Significant
Event Report for the Secretary of Defense.
So far, we have accomplished most of the items in our statement of work
for the second year of this award. There are two exceptions: instead of
using chemistry databases, we now focus on logistics databases, and, because
of the application's needs, we have not yet addressed the generation of
non-transitivity metapatterns.
Project Impact
The DataCrystal project has supported two PhD students
in the data mining area, and also enabled a four-day short course on data
mining at the UCLA Extension Program. The course was a success. Of the
25 attendees, 4 came from Brazil and 1 from Sweden; the others came from
US organizations such as Allstate, Fair Isaac, Los Alamos National
Laboratory, IBM, Digital, JPL, and OOCL. Two of them are
CEOs of consulting companies. The course had three instructors, all well
established in data mining. Rakesh Agrawal and Jiawei Han have chaired past
KDD conferences and are leading researchers in databases. Attendees
stayed until the last minute and responded enthusiastically.
The DataCrystal technology has also been licensed to GKIS (http://www.gkis.com),
a Houston-based company for enterprise modeling, information integration,
and technology education. They plan to extend DataCrystal and use it in
applications related to the United Nations information technology group. (See
the news report at DataCrystal's home page: http://www.isi.edu/dcrystal)
Project References
W.M. Shen and B. Leng
A Metapattern-Based Automated Discovery Loop for Integrated Data Mining
IEEE Transactions on Knowledge and Data Engineering, 8(6):898-910, 1996.
W.M. Shen and B. Leng
Metapattern Generation for Integrated Data Mining
The 2nd International Conference on KDD, Portland, Oregon, 1996.
Kero, B., L. Russell, S. Tsur, and W.M. Shen
An Overview of Data Mining Technologies
The KDD Workshop in the 4th International Conference on Deductive and
Object-Oriented Databases,
Singapore, 1995.
Leng, B. and W.M. Shen
A Metapattern-Based Automated Discovery Loop
The KDD Workshop in the 4th International Conference on Deductive and
Object-Oriented Databases,
Singapore, 1995.
Shen, W.M., B. Leng, and A. Chatterjee
Applying the Metaquery Framework to Time Sequence Analysis
Technical Report, USC-ISI-95-117, 1995.
Shen, W.M., K. Ong, B. Mitbander, C. Zaniolo
Metaqueries for Data Mining
Advances in Knowledge Discovery and Data Mining, MIT Press, 1995.
Area Background
Data mining is the process of discovering valuable knowledge from very large
data sets. Research in the area spans databases, statistics, machine learning,
and adaptive control, and its applications include scientific research,
system design, business management, and many related fields.
Area References
A book: Advances in Knowledge Discovery and Data Mining, MIT Press, 1995.
Edited by U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy.
Potential Related Projects
Projects or collaborations within the IDM:
Padhraic Smyth: Probabilistic Knowledge Discovery and Data Mining: An
Integrated Approach at the Interface of Computer Science and Statistics
Jeffrey D. Ullman: Research Into DataWarehousing and Decision
Support
Lawrence B. Holder: Scalable Knowledge Discovery from Large Structural
Databases
Dennis Shasha and Jason T. L. Wang: Pattern Discovery in Combinatorial
Databases