NSF Recent Report





NYI: Large Scale Information Processing

Ophir Frieder

Department of Computer Science

Florida Institute of Technology

on leave from:

Department of Computer Science

George Mason University

Contact Information

Department of Computer Science

Florida Institute of Technology

Melbourne, FL 32901

Phone: (407) 674-8856

Fax: (407) 674-8192

Email: ophir@ee.fit.edu

Home: http://www.cs.fit.edu/~ophir

Project Home Page

http://www.cs.fit.edu/~ophir/infret.html

Keywords

Information retrieval, text databases, parallel processing, relevance feedback,

structured and unstructured data integration, gene sequencing, multi-sequence alignment

Project Award Information

Duration: September 1993 - August 1999

Current year: 5

Title: NYI: Large Scale Information Processing

Project Summary

Data are widely available electronically and in a diversity of formats. Processing these data efficiently and extracting information from them requires distributed, efficient, portable, high-performance information processing engines. Depending on the type of data and the application demands, such engines include gene sequencing engines, information filtering or retrieval systems, and database engines. To support the distributed nature of this processing, reliable, verified, and efficient communication protocols must be developed. It is within this context that my group has focused, and continues to focus, its research efforts.

Topics we focus on include:

  1. Parallel information retrieval systems
  2. Integration of structured and unstructured data
  3. Relevance feedback paradigms
  4. High performance computational genetics
  5. Visualization of text data
  6. Efficient large scale parallel computation
  7. Reliable communication protocols


Goals, Objectives, and Targeted Activities

During the 1997-1998 research year, we will continue our work on the design of reliable, scalable information systems. We will continue our involvement with NCR personnel. My students who are extending the functionality of our parallel relational information retrieval system will interact with the NCR staff roughly every other day. We hope to demonstrate the impact of domain analysis and of term and phrase extraction on retrieval accuracy.

We will continue our involvement with the NIST TREC activities; results from TREC-6 were particularly encouraging. We hope to improve further in TREC-7.

We will continue to interact with a major commercial database vendor. It is anticipated that we will incorporate our information retrieval as a relational database application technology into their commercial offerings.

Two additional doctoral students will complete their research prior to the completion of this project.

Indication of Success

Our ongoing project focuses on the design, development, and evaluation of large scale information systems. The likelihood of technology transfer to the general user community is improved by close interaction with either an industrial partner that will commercialize the technology or a scientific agency that maintains the critical data repository. Such a scientific agency, if available, typically directs and supports the tools that are commonly used by the scientific community to process the data. Therefore, throughout all of our efforts, we interacted with our industrial partner when developing our information retrieval technology and with scientists and upper management at the National Institutes of Health when developing our high performance gene sequencing methods.

Our greatest focus centered on structured and text database integration. Traditionally, database applications stored predominantly structured data; today, however, the focus has shifted towards the integration of multiple data types. Jointly with an industrial partner, we integrated structured and text data using strictly the relational model. By developing a set of query templates, we devised an approach that integrates both structured data and text using standard, unchanged SQL. In contrast to traditional information retrieval systems, the described approach provides for portability across platforms and for the opportunity to exploit parallelism without additional development costs. The approach was implemented on both serial and parallel platforms. Using the benchmark TREC data and query sets, we evaluated the performance of our system. Evaluation consisted of accuracy assessments and runtime and scalability measures. The results demonstrated both high accuracy and near linear speedups using a 24 node system. Portability was demonstrated by the successful execution on multiple platforms using database software from multiple vendors. Currently our system is in commercial use by an information systems vendor and is under consideration for adoption by another.
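
To illustrate the general idea of ranking text with unchanged SQL, the following is a minimal sketch and not the project's actual query templates, which this report does not reproduce. It assumes a hypothetical schema (tables doc, posting, and query), uses SQLite purely as a stand-in relational engine, and scores documents with a plain term-frequency dot product; all of these choices are assumptions made for illustration.

```python
import sqlite3


def term_counts(text):
    """Count lowercase, whitespace-separated terms in a string."""
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    return counts


def build_index(docs):
    """Load {doc_id: (year, text)} into relational tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE doc (doc_id INTEGER PRIMARY KEY, year INTEGER)")
    conn.execute("CREATE TABLE posting (doc_id INTEGER, term TEXT, tf INTEGER)")
    conn.execute("CREATE TABLE query (term TEXT, tf INTEGER)")
    for doc_id, (year, text) in docs.items():
        conn.execute("INSERT INTO doc VALUES (?, ?)", (doc_id, year))
        conn.executemany(
            "INSERT INTO posting VALUES (?, ?, ?)",
            [(doc_id, t, c) for t, c in term_counts(text).items()],
        )
    return conn


# One fixed query "template": the text ranking (join on terms) and the
# structured condition (year filter) are expressed in the same standard SQL.
RANK_SQL = """
SELECT p.doc_id, SUM(p.tf * q.tf) AS score
FROM posting p
JOIN query q ON p.term = q.term
JOIN doc d ON d.doc_id = p.doc_id
WHERE d.year >= ?
GROUP BY p.doc_id
ORDER BY score DESC, p.doc_id ASC
"""


def search(conn, query_text, min_year):
    """Rank documents published in or after min_year against the query text."""
    conn.execute("DELETE FROM query")
    conn.executemany(
        "INSERT INTO query VALUES (?, ?)", list(term_counts(query_text).items())
    )
    return conn.execute(RANK_SQL, (min_year,)).fetchall()
```

Because the ranking is an ordinary join and aggregate, any engine that runs standard SQL can execute it, which is the portability argument made above; a parallel database engine could likewise parallelize the join without changes to the query.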

In the Human Genome area, we developed parallel computational methods both for retrieving similar sequences and for aligning multiple sequences from large genetic and protein databases. In sequence searching, we combined the merits of prior differing load balancing approaches (static and dynamic) to develop a static partitioning scheme whose performance compared favorably against the prior state of the art. For multiple sequence alignment, using a modified simulated annealing algorithm, we developed the first scalable, iterative multiple sequence alignment algorithm. This algorithm was based on the sequential Berger-Munson algorithm. Both parallel approaches are now in operation at the National Institutes of Health.
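
As a rough illustration of what static partitioning for sequence searching can look like, the sketch below assigns database sequences to workers before the search starts, using a generic greedy longest-first heuristic. This is a textbook technique chosen for illustration and is not the project's scheme, whose details this report does not give. The function names and the assumption that search cost is proportional to sequence length are hypothetical.

```python
import heapq


def partition_by_length(seq_lengths, n_workers):
    """Statically assign sequences to workers, longest first.

    seq_lengths maps a sequence id to its length, used here as an assumed
    proxy for search cost. Each sequence goes to the worker whose total
    assigned length is currently smallest.
    """
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker index)
    heapq.heapify(heap)
    parts = [[] for _ in range(n_workers)]
    ordered = sorted(seq_lengths.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq_id, length in ordered:
        load, worker = heapq.heappop(heap)
        parts[worker].append(seq_id)
        heapq.heappush(heap, (load + length, worker))
    return parts


def worker_loads(parts, seq_lengths):
    """Total assigned sequence length per worker."""
    return [sum(seq_lengths[s] for s in part) for part in parts]
```

The partition is computed once, so no scheduling traffic is needed during the search; the trade-off is that balance depends on how well sequence length predicts the actual cost of each comparison.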

Finally, to sustain high communications reliability as required by large scale information systems, we developed fault tolerant protocols for intranet server systems. These protocols are deployed nationwide by one of the "Top-Three" telecommunications companies.

Per our proposal, we have designed, developed, and evaluated several information processing systems. We have transferred our technology both to commercial and research organizations. We have also developed a fault tolerant communication architecture to support reliable information retrieval services. This technology has also been adopted and deployed by the commercial sector.


Project Impact and Outcome

* Human Resources

W. Addison Woods, Data Organization, Scheduling, and Presentation in Parallel Information Retrieval Architectures, Ph.D. Thesis, George Mason University, May 1995. Current Position: LTC US ARMY, Pentagon.

Tieng K. Yap, High Performance Computing in Genetics, Ph.D. Thesis, George Mason University, May 1995. Current Position: Senior Computer Scientist, NIH.

David Grossman, Integrating Structured Data and Text: A Relational Approach, Ph.D. Thesis, George Mason University, December 1995. Current Position: Program Manager, Office for Research & Development.

Anthony Ruocco, Parallel Clustering and Classification of Monolithic and Non-Monolithic Document Bases, Ph.D. Thesis, George Mason University, May 1996. Current Position: LTC US ARMY, Assistant Professor of Electrical Engineering and Computer Science, West Point Military Academy.

Sorin G. Nastea, Parallel Solutions for Sparse Matrix Computations, Ph.D. Thesis, George Mason University, December 1996. Current Position: Assistant Professor of Computer Science, Tuskegee University.

Carol Lundquist, Relational Information Retrieval: Using Relevance Feedback and Parallelism to Improve Accuracy and Performance, Ph.D. Thesis, George Mason University, December 1997. Current Position: Senior Engineer, Lockheed Martin.

Brian Willard, Large Scale Information Retrieval Systems: Resolving Memory Leaks in Non-Cooperative Server Applications, Ph.D. Thesis, Florida Institute of Technology, May 1998. Current Position: Technical Specialist (Level 120), Northrop-Grumman.

* Education and curriculum development at all levels.

The PI teaches graduate and undergraduate courses whose content is influenced by the research conducted in this project.

* Industry -- collaborations, transfer of technology, patents.

Over the years, we have developed several information processing systems. Some of these systems are presently in daily use at the National Institutes of Health, NCR, and Harris Corporation. We are in discussion with a major commercial database vendor regarding the deployment of our integrated information retrieval as a database application technology as part of their text offerings.

Efficient multicast and fault-tolerant protocols were also designed, verified, and developed. Some of the technology is currently deployed by a "Top-Three" telecommunications company in at least 27 major cities nationwide.

Papers were authored describing both the various information systems and the protocols. Two books and two issued US patents also focus on these same efforts. Two additional US patent applications are currently under evaluation.

Project References

B. Kjell, W. A. Woods, and O. Frieder, "Discrimination of Authorship Using Visualization," Information Processing and Management, Pergamon Press, 30(1), pp. 141-150, January 1994.

T. Yap, O. Frieder, R. Martino, High Performance Computational Methods for Biological Sequence Analysis, Kluwer Academic Publishers, ISBN 0-7923-9724-X, 1996.

D. A. Grossman, O. Frieder, D. O. Holmes, and D. C. Roberts, "Integrating Structured Data and Text: A Relational Approach," Journal of the American Society for Information Science, 48(2), February 1997.

O. Frieder and H. Siegelmann, "Document Allocation in Multiprocessor Information Retrieval Systems," IEEE Transactions on Knowledge and Data Engineering, 9(4), July/August 1997.

A. Ruocco and O. Frieder, "Clustering and Classification of Large Document Bases in a Parallel Environment," Journal of the American Society for Information Science, 48(10), October 1997.

S. Nastea, O. Frieder, and T. El-Ghazawi, "Load Balanced Sparse Matrix-Vector Multiplication on Parallel Computers," Journal of Parallel and Distributed Computing, 46(2), November 1997.

C. Lundquist, D. Grossman, and O. Frieder, "Improving Relevance Feedback in the Vector-Space Model," ACM Sixth Conference on Information and Knowledge Management, Las Vegas, Nevada, November 1997.

T. Yap, O. Frieder, and R. Martino, "Parallel Computation in Biological Sequence Analysis," IEEE Transactions on Parallel and Distributed Systems, (to appear).


Area Background

Information retrieval is the selection of documents that are potentially relevant to a user's information need. Given the vast volume of data stored in modern information retrieval systems, understanding the data requires visual aids, and searching the document database requires vast computational resources. As part of our study, we developed both visual data understanding tools and related parallel information systems. Our systems were customized for the needs of a diversity of domains, including traditional text databases and biological gene sequences.

Area References

D. Grossman and O. Frieder, Ad Hoc Information Retrieval: Algorithms and Heuristics, Kluwer Academic Publishers, (to appear in early 1998).

K. Sparck Jones and P. Willett (Editors), Readings in Information Retrieval, Morgan Kaufmann Publishers, Inc., ISBN: 1-55860-454-5, 1997.

T. Yap, O. Frieder, R. Martino, High Performance Computational Methods for Biological Sequence Analysis, Kluwer Academic Publishers, ISBN 0-7923-9724-X, 1996.

G. Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer Academic Publishers, ISBN: 0-7923-9926-9, 1997.
