NSF Grant IRI-9615272 Progress Report
Scalable Knowledge Discovery from Large Structural Databases
Lawrence B. Holder and Diane J. Cook, Principal Investigators
University of Texas at Arlington
Department of Computer Science and Engineering
Contact Information
Contact Name: Lawrence B. Holder
Department of Computer Science and Engineering
University of Texas at Arlington
Box 19015, Arlington, TX 76019-0015
Phone: (817) 272-2596
Fax: (817) 272-3784
Email:
holder@cse.uta.edu
Web Page
http://www-cse.uta.edu/~holder/research/subdue.html
Keywords
data mining, knowledge discovery, minimum description length principle,
inexact graph match, parallel algorithms, scalability
Project Award Information
Award Number: IRI-9615272
Duration: three years
Award Year: year one
Project Name: Scalable Knowledge Discovery from Large Structural
Databases
Project Summary
The main objective of this project is to improve the scalability and
effectiveness of knowledge discovery systems in order to handle large,
structural databases. The specific objectives are:
- Integration of the structural discovery system SUBDUE with
one or more attribute-value based discovery systems.
- Development of parallel and distributed versions of SUBDUE
and the integrated discovery system.
- Application of the integrated discovery system to large scientific
databases, such as the Brookhaven protein database and NASA LandSat
imagery, and evaluation of results by domain experts.
- Dissemination of source code for serial, parallel and distributed
versions of SUBDUE and the integrated discovery system, and
publication of the results of this research.
Goals, Objectives, and Targeted Activities
In the first year of this project we intend to achieve the following goals:
- Test combination of SUBDUE approach with one other specific approach
to knowledge discovery; in particular, NASA's AutoClass clustering system.
- Complete parallel and distributed versions of the SUBDUE system.
Apply these implementations to large natural and artificial databases and
perform thorough scalability tests. Disseminate the parallel and distributed
source code, and publish the results in related journals
and conference proceedings.
- Establish contacts with scientists that provide databases and indicate
success of results in scientific domains.
- Make all necessary modifications to SUBDUE system to operate on
at least one large scientific database.
In the next year we plan to focus on three main directions. First, we will
complete the packaging of distributed SUBDUE and will the enhancements to
SUBDUE that provide the necessary functionality to
analyze the PDB databases, and will release a new version of the source code.
Second, we will continue to investigate applications for SUBDUE to
discovery in molecular biology databases and pursue our contacts at the UT
Southwestern Medical Center in Dallas to provide expert guidance and
feedback on SUBDUE's analysis of these databases. Third,
we will investigate new methods of incorporating concept discovery methods
from other systems into SUBDUE including supervised concept learning
from positive and negative examples.
Indication of Success
Progress has been made along all of the above objectives. First,
further experiments have been performed on the integration of SUBDUE
with NASA's AUTOCLASS clustering system. In this case artificial
databases were generated to test the ability of SUBDUE to improve the
results of AUTOCLASS. We first use SUBDUE to find patterns in
the structural component of the databases. These patterns are used to
compress the non-structural component of the data, i.e., compressing
several structurally-related records into one larger record. Results
confirm than this coupling approach to the integration of structural and
non-structural discovery is feasible and that discovered concepts are more
interesting to scientists with the combined approach than with either approach
used in isolation. A paper describing the results of this project will appear
in the proceedings of FLAIRS 98.
Development of the parallel and distributed versions of SUBDUE is
complete, and several results have been generated. Three different
versions have been implemented. First, functionally-parallel FP-SUBDUE
distributes the processing of a single database by giving each
processor a different set of patterns to consider. FP-SUBDUE is able to
discover better substructures due to the effective increase in the search beam
width, but communication costs and the constraint that each processor have
the entire database limit FP-SUBDUE's effectiveness. Second,
dynamic-partitioning DP-SUBDUE dynamically distributes both the
processing and the database according to the portion of the search space
being explored by a processor. However, DP-SUBDUE requires even more
communication and possibly the entire database on one processor in the
worst case; therefore, performance degrades. The third implementation,
static-partitioning SP-SUBDUE, proved the most effective by
partitioning the database initially, and then sending each partition to a
different processor. Communication between processors is needed only to
share locally interesting patterns in order to determine their global
quality. SP-SUBDUE proves to be a highly scalable system. One of
our databases contains 2 million vertices and 5 million edges, yet
SP-SUBDUE is able to process the database in less than three hours. The
minimal amount of communication and synchronization that is required make
SP-SUBDUE ideal for distributed environments. These results were the
thesis topic of Dr. Cook's masters student Gehad Galal (May 1997) and have
been published in one journal article, one book chapter, and three
peer-reviewed conference papers.
Databases from both NASA LandSat imagery and the Brookhaven Protein Data
Bank (PDB) have been pursued for evaluating SUBDUE's ability to
discover novel and interesting structural patterns. Dr. Holder's masters
student Ramakrishna Kintada (May 1997) investigated the use of standard
edge detection and object recognition techniques to transform images into
an object-based structural representation. However, the high degree of
noise in this process made automatic processing of the images too
difficult. We are considering other image representations. The PDB
database is much more easily transformed into SUBDUE's own graphical
representation, and we plan to emphasize this database in our future
evaluations of SUBDUE. Two computer science graduate students,
Shaobing Su and Ron Maglothin, both with backgrounds in molecular biology,
are investigating alternative representations for the PDB data, alternative
analysis approaches described in the molecular biology literature, and the
types of patterns SUBDUE is likely to find that could benefit the
domain scientists. We have established contacts with scientists at UT's
Southwestern Medical Center who will provide feedback on SUBDUE's
discovered concepts. Preliminary results of applying SUBDUE to
protein tertiary structure are promising.
Attempts to use SUBDUE for the analysis of the above databases have
revealed several weaknesses in SUBDUE that we are now addressing.
Dr. Holder's masters student Nataliya Bocharova (December 1997) is leading
the efforts to improve SUBDUE. Her improvements have given SUBDUE better facilities for including domain-specific knowledge
describing how to match non-identical graph labels and background patterns
from which to begin the search. Our acquisition of a DEC Alpha 500MHz
machine with 512MB of memory using NSF equipment support and matching funds
from UT Arlington has helped tremendously in these efforts.
We hope to release a new version of SUBDUE soon incorporating the
above enhancements, facilities for distributed processing and a graphical
user interface. Dr. Holder's masters student Prasad Parthasarthy (May
1997) coupled SUBDUE to the daVinci graph visualization tool to
enable visualization of SUBDUE's input and output.
Project Impact and Output
This project has supported several graduate students and their research
results have been included in Dr. Holder's current course on Machine
Learning. Masters students Shaobing Su and Nataliya Bocharova are directly
funded under this project. Ms. Bocharova finished her Master's thesis in
December 1997 and is returning for a PhD.
Shaobing Su plans to pursue a PhD after her masters in May 1998. Other
masters students indirectly supported under this project include Gehad
Galal, Ramakrishna Kintada and Prasad Parthasarthy, all of whom graduated
in May 1997 and have taken jobs in industry.
Results from the parallel and distributed implementations of SUBDUE were
integrated into Dr. Cook's advanced ``Parallel AI Algorithms'' course that she
designed and taught in the spring of 1997.
Results from this research have also been included in Dr. Holder's Machine
Learning course taught in Fall 1997. Specifically, one lecture was devoted
to a description of the SUBDUE and recent enhancements. Another
lecture was devoted to discovery in molecular biology databases. These
topics have attracted a new PhD student, Ron Maglothin, who will be funded
under this project beginning in January 1998.
Results from this project were influential in receiving two other awards.
Dr. Cook with other departmental faculty PIs received NSF funds to purchase
a distributed network of Pentium PCs and DEC Alphas to further investigate
high-speed data mining algorithms. In addition, Drs. Cook and Holder received
a grant from the Texas board of research to continue this project and apply
the results to industrial databases.
Publications
The following publications have resulted from efforts supported either
partially or completely by this project.
- 1.
- G. Galal, D. J. Cook and L. B. Holder, ``Improving the Scalability of
KDD Systems.'' Submitted to Data Mining and Knowledge Discovery.
- 2.
- G. Galal, D. J. Cook and L. B. Holder, ``Exploiting Parallelism in a
Scientific Discovery System to Improve Scalability'', to appear in
Journal of the American Society for Information Science.
- 3.
- D. J. Cook, G. Galal, and L. B. Holder, ``Improving Scalability of
Scientific Discovery Systems by Exploiting Parallelism.'' To appear in
Pattern Discovery in Biological Data: Tools, Techniques and Applications,
Oxford University Press.
- 4.
- L. B. Holder and D. J. Cook, ``Coupling Two Complimentary Knowledge
Discovery Systems.'' to appear in Proceedings of the Eleventh Annual
Florida AI Research Symposium, 1998.
- 5.
- D. J. Cook, G. Galal and L. B. Holder, ``Exploiting Parallelism in
Knowledge Discovery Systems to Improve Scalability'',
Proceedings of the Thirty-First Hawaii International Conference on
System Sciences, 1998.
- 6.
- S. Djoko, D. J. Cook, and L. B. Holder, ``An Empirical Study of
Domain Knowledge and Its Benefits to Substructure Discovery.'' In
IEEE Transactions on Knowledge and Data Engineering, Volume 9, Number 4,
pages 575-586, 1997.
- 7.
- G. Galal, D. J. Cook and L. B. Holder, ``Improving Scalability in a
Knowledge Discovery System by Exploiting Parallelism.'' In the Proceedings
of the Third International Conference on Knowledge Discovery and Data Mining,
pages 171-174, 1997.
- 8.
- G. Galal and D. J. Cook, ``Exploiting Parallelism in a Scientific
Discovery System to Improve Scalability.'' In the Proceedings of the
Tenth Annual Florida AI Research Symposium, 1997.
Area Background
With the increasing amount and complexity of today's data, there is an urgent
need to accelerate discovery of knowledge in large databases.
In response to this need, numerous approaches have been
developed for discovering concepts in databases using a linear,
attribute-value representation.
These approaches address issues of
data relevance, missing data, noise, and utilization of domain knowledge.
However, much of the data that is collected is structural in nature, or is
composed of parts and relations between the parts.
Hence, there exists a need to develop scalable tools
to analyze and discover concepts in structural databases.
Many reported discovery tools are also computationally expensive and
cannot scale easily to large databases, especially those containing structural
information.
Recently, we introduced a method for discovering substructures in structural
databases using the minimum description length (MDL) principle.
The system is called SUBDUE, and it discovers substructures that
compress the original
data and represent structural concepts in the data. Once a substructure is
discovered, the substructure is used to simplify the data by replacing instances
of the substructure with a pointer to the newly discovered substructure. The
discovered substructures allow abstraction
over detailed structures in the original data. Iteration of the substructure
discovery and replacement process constructs a hierarchical description of
the structural data in terms of the discovered substructures. This hierarchy
provides varying levels of interpretation that can be accessed based on
the specific goals of the data analysis. The system is made scalable through
use of parallel and distributed algorithms and resources.
Although the MDL principle is useful for discovering substructures that
maximize compression of the data, scientists often employ knowledge or
assumptions of a specific domain to guide the discovery process.
A domain-independent
discovery method is valuable in that the discovery of
unexpected substructures is not blocked. However, the discovered substructures
might not be useful to the user. On the other hand, using domain-specific
knowledge can assist the discovery process by focusing search and can also help
make the discovered substructures more meaningful to the user.
Hence, in order to trade off between domain-independent and domain-dependent
discovery methods, we incorporate domain knowledge into the SUBDUE system,
and combine both the domain-independent and domain-dependent methods to guide
the search toward more appropriate substructures.
Area References
- P. Cheeseman and J. Stutz, ``Bayesian Classification (AutoClass): Theory
and Results'' in Advances in Knowledge Discovery and Data Mining
(U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds.),
MIT Press, 1996.
- D. Conklin, S. Fortier, J. Glasgow, and F. Allen,
``Discovery of Spatial Concepts in Crystallographic Databases'', in
Proceedings of the ML92 Workshop on Machine Discovery, pages 111-116,
1992.
- D.J. Cook and L.B. Holder, ``Substructure Discovery Using Minimum
Description Length and Background Knowledge'', in Journal of Artificial
Intelligence Research, volume 1, pages 231-255, 1994.
- D.J. Cook, L.B. Holder, and S. Djoko, ``Scalable Discovery of Informative
Structural Concepts Using Domain Knowledge'', in IEEE Expert, volume 10,
pages 59-68, 1996.
- U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth,
``From Data Mining to Knowledge Discovery: An Overview'', in
Advances in Knowledge Discovery and Data Mining (U. M. Fayyad,
G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds.), MIT Press, 1996.
- D. Fisher, ``Knowledge Acquisition Via Incremental Conceptual
Clustering'', in Machine Learning, volume 2, pages 139-172, 1987.
- R. Levinson, ``A Self-Organizing Retrieval System for Graphs'',
AAAI, pages 203-306, 1984.
- J. Rissanen, ``Stochastic Complexity in Statistical Inquiry'',
world Scientific Publishing Company, 1989.
- J. Segen, ``Graph Clustering and Model Learning by Data Compression'', in
Proceedings of the Machine Learning Conference, pages 93-101 1990.
- K. Thompson and P. Langley, ``Concept Formation in Structured Domains'',
in Concept Formation: Knowledge and Experience in Unsupervised Learning
(D.H. Fisher and M. Pazzani, eds.), 1991.