NSF Grant IRI-9615272 Progress Report

Scalable Knowledge Discovery from Large Structural Databases

Lawrence B. Holder and Diane J. Cook, Principal Investigators
University of Texas at Arlington
Department of Computer Science and Engineering

Contact Information

Contact Name: Lawrence B. Holder
Department of Computer Science and Engineering
University of Texas at Arlington
Box 19015, Arlington, TX 76019-0015
Phone: (817) 272-2596
Fax: (817) 272-3784
Email: holder@cse.uta.edu

Web Page

http://www-cse.uta.edu/~holder/research/subdue.html

Keywords

data mining, knowledge discovery, minimum description length principle, inexact graph match, parallel algorithms, scalability

Project Award Information

Award Number: IRI-9615272
Duration: three years
Award Year: year one
Project Name: Scalable Knowledge Discovery from Large Structural Databases

Project Summary

The main objective of this project is to improve the scalability and effectiveness of knowledge discovery systems in order to handle large, structural databases. The specific objectives are:

Goals, Objectives, and Targeted Activities

In the first year of this project we intend to achieve the following goals:


Indication of Success

Progress has been made along all of the above objectives. First, further experiments have been performed on the integration of SUBDUE with NASA's AUTOCLASS clustering system. In this case artificial databases were generated to test the ability of SUBDUE to improve the results of AUTOCLASS. We first use SUBDUE to find patterns in the structural component of the databases. These patterns are used to compress the non-structural component of the data, i.e., compressing several structurally-related records into one larger record. Results confirm than this coupling approach to the integration of structural and non-structural discovery is feasible and that discovered concepts are more interesting to scientists with the combined approach than with either approach used in isolation. A paper describing the results of this project will appear in the proceedings of FLAIRS 98.

Development of the parallel and distributed versions of SUBDUE is complete, and several results have been generated. Three different versions have been implemented. First, functionally-parallel FP-SUBDUE distributes the processing of a single database by giving each processor a different set of patterns to consider. FP-SUBDUE is able to discover better substructures due to the effective increase in the search beam width, but communication costs and the constraint that each processor have the entire database limit FP-SUBDUE's effectiveness. Second, dynamic-partitioning DP-SUBDUE dynamically distributes both the processing and the database according to the portion of the search space being explored by a processor. However, DP-SUBDUE requires even more communication and possibly the entire database on one processor in the worst case; therefore, performance degrades. The third implementation, static-partitioning SP-SUBDUE, proved the most effective by partitioning the database initially, and then sending each partition to a different processor. Communication between processors is needed only to share locally interesting patterns in order to determine their global quality. SP-SUBDUE proves to be a highly scalable system. One of our databases contains 2 million vertices and 5 million edges, yet SP-SUBDUE is able to process the database in less than three hours. The minimal amount of communication and synchronization that is required make SP-SUBDUE ideal for distributed environments. These results were the thesis topic of Dr. Cook's masters student Gehad Galal (May 1997) and have been published in one journal article, one book chapter, and three peer-reviewed conference papers.

Databases from both NASA LandSat imagery and the Brookhaven Protein Data Bank (PDB) have been pursued for evaluating SUBDUE's ability to discover novel and interesting structural patterns. Dr. Holder's masters student Ramakrishna Kintada (May 1997) investigated the use of standard edge detection and object recognition techniques to transform images into an object-based structural representation. However, the high degree of noise in this process made automatic processing of the images too difficult. We are considering other image representations. The PDB database is much more easily transformed into SUBDUE's own graphical representation, and we plan to emphasize this database in our future evaluations of SUBDUE. Two computer science graduate students, Shaobing Su and Ron Maglothin, both with backgrounds in molecular biology, are investigating alternative representations for the PDB data, alternative analysis approaches described in the molecular biology literature, and the types of patterns SUBDUE is likely to find that could benefit the domain scientists. We have established contacts with scientists at UT's Southwestern Medical Center who will provide feedback on SUBDUE's discovered concepts. Preliminary results of applying SUBDUE to protein tertiary structure are promising.

Attempts to use SUBDUE for the analysis of the above databases have revealed several weaknesses in SUBDUE that we are now addressing. Dr. Holder's masters student Nataliya Bocharova (December 1997) is leading the efforts to improve SUBDUE. Her improvements have given SUBDUE better facilities for including domain-specific knowledge describing how to match non-identical graph labels and background patterns from which to begin the search. Our acquisition of a DEC Alpha 500MHz machine with 512MB of memory using NSF equipment support and matching funds from UT Arlington has helped tremendously in these efforts.

We hope to release a new version of SUBDUE soon incorporating the above enhancements, facilities for distributed processing and a graphical user interface. Dr. Holder's masters student Prasad Parthasarthy (May 1997) coupled SUBDUE to the daVinci graph visualization tool to enable visualization of SUBDUE's input and output.


Project Impact and Output

This project has supported several graduate students and their research results have been included in Dr. Holder's current course on Machine Learning. Masters students Shaobing Su and Nataliya Bocharova are directly funded under this project. Ms. Bocharova finished her Master's thesis in December 1997 and is returning for a PhD. Shaobing Su plans to pursue a PhD after her masters in May 1998. Other masters students indirectly supported under this project include Gehad Galal, Ramakrishna Kintada and Prasad Parthasarthy, all of whom graduated in May 1997 and have taken jobs in industry.

Results from the parallel and distributed implementations of SUBDUE were integrated into Dr. Cook's advanced ``Parallel AI Algorithms'' course that she designed and taught in the spring of 1997. Results from this research have also been included in Dr. Holder's Machine Learning course taught in Fall 1997. Specifically, one lecture was devoted to a description of the SUBDUE and recent enhancements. Another lecture was devoted to discovery in molecular biology databases. These topics have attracted a new PhD student, Ron Maglothin, who will be funded under this project beginning in January 1998.

Results from this project were influential in receiving two other awards. Dr. Cook with other departmental faculty PIs received NSF funds to purchase a distributed network of Pentium PCs and DEC Alphas to further investigate high-speed data mining algorithms. In addition, Drs. Cook and Holder received a grant from the Texas board of research to continue this project and apply the results to industrial databases.

Publications

The following publications have resulted from efforts supported either partially or completely by this project.

1.
G. Galal, D. J. Cook and L. B. Holder, ``Improving the Scalability of KDD Systems.'' Submitted to Data Mining and Knowledge Discovery.

2.
G. Galal, D. J. Cook and L. B. Holder, ``Exploiting Parallelism in a Scientific Discovery System to Improve Scalability'', to appear in Journal of the American Society for Information Science.

3.
D. J. Cook, G. Galal, and L. B. Holder, ``Improving Scalability of Scientific Discovery Systems by Exploiting Parallelism.'' To appear in Pattern Discovery in Biological Data: Tools, Techniques and Applications, Oxford University Press.

4.
L. B. Holder and D. J. Cook, ``Coupling Two Complimentary Knowledge Discovery Systems.'' to appear in Proceedings of the Eleventh Annual Florida AI Research Symposium, 1998.

5.
D. J. Cook, G. Galal and L. B. Holder, ``Exploiting Parallelism in Knowledge Discovery Systems to Improve Scalability'', Proceedings of the Thirty-First Hawaii International Conference on System Sciences, 1998.

6.
S. Djoko, D. J. Cook, and L. B. Holder, ``An Empirical Study of Domain Knowledge and Its Benefits to Substructure Discovery.'' In IEEE Transactions on Knowledge and Data Engineering, Volume 9, Number 4, pages 575-586, 1997.

7.
G. Galal, D. J. Cook and L. B. Holder, ``Improving Scalability in a Knowledge Discovery System by Exploiting Parallelism.'' In the Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, pages 171-174, 1997.

8.
G. Galal and D. J. Cook, ``Exploiting Parallelism in a Scientific Discovery System to Improve Scalability.'' In the Proceedings of the Tenth Annual Florida AI Research Symposium, 1997.


Area Background

With the increasing amount and complexity of today's data, there is an urgent need to accelerate discovery of knowledge in large databases. In response to this need, numerous approaches have been developed for discovering concepts in databases using a linear, attribute-value representation. These approaches address issues of data relevance, missing data, noise, and utilization of domain knowledge. However, much of the data that is collected is structural in nature, or is composed of parts and relations between the parts. Hence, there exists a need to develop scalable tools to analyze and discover concepts in structural databases. Many reported discovery tools are also computationally expensive and cannot scale easily to large databases, especially those containing structural information.

Recently, we introduced a method for discovering substructures in structural databases using the minimum description length (MDL) principle. The system is called SUBDUE, and it discovers substructures that compress the original data and represent structural concepts in the data. Once a substructure is discovered, the substructure is used to simplify the data by replacing instances of the substructure with a pointer to the newly discovered substructure. The discovered substructures allow abstraction over detailed structures in the original data. Iteration of the substructure discovery and replacement process constructs a hierarchical description of the structural data in terms of the discovered substructures. This hierarchy provides varying levels of interpretation that can be accessed based on the specific goals of the data analysis. The system is made scalable through use of parallel and distributed algorithms and resources.

Although the MDL principle is useful for discovering substructures that maximize compression of the data, scientists often employ knowledge or assumptions of a specific domain to guide the discovery process. A domain-independent discovery method is valuable in that the discovery of unexpected substructures is not blocked. However, the discovered substructures might not be useful to the user. On the other hand, using domain-specific knowledge can assist the discovery process by focusing search and can also help make the discovered substructures more meaningful to the user. Hence, in order to trade off between domain-independent and domain-dependent discovery methods, we incorporate domain knowledge into the SUBDUE system, and combine both the domain-independent and domain-dependent methods to guide the search toward more appropriate substructures.


Area References