NYI: Large Scale Information Processing
Ophir Frieder
Department of Computer Science
Florida Institute of Technology
on leave from:
Department of Computer Science
George Mason University
Contact Information
Department of Computer Science
Florida Institute of Technology
Melbourne, FL 32901
Phone: (407) 674-8856
Fax: (407) 674-8192
Email: ophir@ee.fit.edu
Home: http://www.cs.fit.edu/~ophir
Project Home Page
http://www.cs.fit.edu/~ophir/infret.html
Keywords
Information retrieval, text databases, parallel processing, relevance feedback,
structured and unstructured data integration, gene sequencing,
multi-sequence alignment
Project Award Information
Duration: September 1993 - August 1999
Current year: 5
Title: NYI: Large Scale Information Processing
Project Summary
Data are electronically available throughout and in a diversity
of formats. To efficiently process and extract information from
these data requires distributed, efficient, portable, high-performance
information processing engines. Depending on the type of data
to be processed and the application demands, examples of such
processing engines are: gene sequencing engines, information filtering
or retrieval systems, database engines, etc. To support the distributed
nature involved, reliable, verified, and efficient communication
protocols must be developed. It is within this context that my
group has focused, and continues to focus, its research efforts.
Topics we focus on include:
Goals, Objectives, and Targeted Activities
During the 1997-1998 research year, we will continue our work
on the design of reliable, scalable, information systems. We
will continue our involvement with NCR personnel. My students
who are extending the functionality of our parallel relational
information retrieval system will interact on a roughly every-other-day
basis with the NCR staff. We hope to demonstrate the impact of
domain analysis and term and phrase extraction on retrieval accuracy.
We will continue our involvement with the NIST TREC activities;
results from TREC-6 were particularly encouraging. We hope to
improve further in TREC-7.
We will continue to interact with a major commercial database
vendor. It is anticipated that we will incorporate our information
retrieval as a relational database application technology into
their commercial offerings.
Two additional doctoral students will complete their research
prior to the completion of this project.
Indication of Success
Our ongoing project focuses on the design, development, and evaluation
of large scale information systems. The likelihood of technology
transfer to the general user community is improved by close interaction
with either an industrial partner that will commercialize the
technology or a scientific agency that maintains the critical
data repository. Such a scientific agency, if available, typically
directs and supports the tools that are commonly used by the scientific
community to process the data. Therefore, throughout all of our
efforts, we worked continuously with our industrial partner when
developing our information retrieval technology, and with scientists
and upper management at the National Institutes of Health when
developing our high performance gene sequencing methods.
Our primary focus was structured and text database
integration. Traditionally, database applications stored predominantly
structured data; today, however, the focus has shifted towards
the integration of multiple data types. Jointly with an industrial
partner, we integrated structured and text data using strictly
the relational model. Using a set of query templates, we
developed an approach that integrates both structured data and
text using standard, unchanged SQL. In contrast to traditional
information retrieval systems, the described approach provides
for portability across platforms and for the opportunity to exploit
parallelism without additional development costs. The approach
was implemented on both serial and parallel platforms. Using
the benchmark TREC data and query sets, we evaluated the performance
of our system. Evaluation consisted of accuracy assessments and
runtime and scalability measures. The results demonstrated both
high accuracy and near linear speedups using a 24 node system.
Portability was demonstrated by the successful execution on multiple
platforms using database software from multiple vendors. Currently
our system is in commercial use by an information systems vendor
and is under consideration for adoption by another.
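
To make the idea of retrieval through standard SQL concrete, the following is a minimal sketch, not the project's actual system. It assumes an inverted index stored in ordinary tables and a single ranking query that joins it with structured attributes. The table names (docs, postings, terms, query_terms), the tf-idf weighting, and the year filter are all illustrative assumptions; SQLite stands in for the commercial database engines.

```python
import math
import sqlite3
from collections import Counter


def build_index(conn, docs):
    """Load documents into relational tables.

    docs maps doc_id -> (year, text). All table and column names here are
    hypothetical, chosen only for this illustration.
    """
    conn.executescript("""
        CREATE TABLE docs        (doc_id INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE postings    (doc_id INTEGER, term TEXT, tf INTEGER);
        CREATE TABLE terms       (term TEXT PRIMARY KEY, idf REAL);
        CREATE TABLE query_terms (term TEXT, tf INTEGER);
    """)
    doc_freq = Counter()
    for doc_id, (year, text) in docs.items():
        conn.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, year))
        counts = Counter(text.lower().split())
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?, ?)",
            [(doc_id, term, tf) for term, tf in counts.items()],
        )
        doc_freq.update(counts.keys())
    n = len(docs)
    conn.executemany(
        "INSERT INTO terms VALUES (?, ?)",
        [(term, math.log(n / df)) for term, df in doc_freq.items()],
    )


# One fixed query template: a tf-idf inner product between query and
# documents, combined with a structured predicate on the year column.
RANK_SQL = """
SELECT d.doc_id, SUM(q.tf * t.idf * p.tf * t.idf) AS score
FROM query_terms q
JOIN terms t    ON t.term = q.term
JOIN postings p ON p.term = q.term
JOIN docs d     ON d.doc_id = p.doc_id
WHERE d.year >= ?
GROUP BY d.doc_id
ORDER BY score DESC, d.doc_id
"""


def search(conn, text, min_year):
    """Rank documents from min_year onward against a free-text query."""
    conn.execute("DELETE FROM query_terms")
    counts = Counter(text.lower().split())
    conn.executemany("INSERT INTO query_terms VALUES (?, ?)", counts.items())
    return conn.execute(RANK_SQL, (min_year,)).fetchall()
```

Because the ranking is expressed as joins and an aggregate, any parallelism or portability offered by the underlying database engine applies to the retrieval step without changes to the query itself.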
In the Human Genome area, we developed parallel computational
methods both for retrieving similar sequences and for aligning
multiple sequences from large genetic and protein databases.
In sequence searching, we combined the merits of prior differing
load balancing approaches (static and dynamic) to develop a static
partitioning scheme whose performance favorably compared against
the prior state of the art. In terms of multiple sequence alignment,
using a modified simulated annealing algorithm, we developed the
first scalable, iterative multiple sequence alignment algorithm.
This algorithm was based on the sequential Berger-Munson algorithm.
Both parallel approaches are now in operation at the National
Institutes of Health.
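
As a rough illustration of static load balancing for a sequence database search, the sketch below assigns sequences to processors before the search starts, longest sequence first, always to the least-loaded processor. This is a generic greedy heuristic and is not claimed to be the partitioning scheme developed in the project. The function name and the assumption that search cost is proportional to sequence length are hypothetical.

```python
import heapq


def partition_by_length(lengths, workers):
    """Statically assign sequence indices to workers.

    lengths[i] is the length of sequence i, used here as an assumed proxy
    for its search cost. Returns one list of sequence indices per worker.
    """
    # Min-heap of (current load, worker id): the least-loaded worker is on top.
    heap = [(0, w) for w in range(workers)]
    heapq.heapify(heap)
    parts = [[] for _ in range(workers)]
    # Place the longest sequences first so that short ones even out the loads.
    for i in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        load, w = heapq.heappop(heap)
        parts[w].append(i)
        heapq.heappush(heap, (load + lengths[i], w))
    return parts
```

The assignment is computed once, so no run-time coordination between processors is needed; the trade-off is that it relies on the cost estimate being reasonably accurate.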
Finally, to sustain high communications reliability as required
by large scale information systems, we developed fault tolerant
protocols for intranet server systems. These protocols are deployed
nationwide by one of the "Top-Three" telecommunications
companies.
Per our proposal, we have designed, developed, and evaluated several
information processing systems. We have transferred our technology
to both commercial and research organizations. We have also developed
a fault tolerant communication architecture to support reliable
information retrieval services. This technology has also been
adopted and deployed by the commercial sector.
Project Impact and Outcome
* Human Resources
W. Addison Woods, Data Organization, Scheduling, and
Presentation in Parallel Information Retrieval Architectures,
Ph.D. Thesis, George Mason University, May 1995. Current
Position: LTC US ARMY, Pentagon.
Tieng K. Yap, High Performance Computing in Genetics,
Ph.D. Thesis, George Mason University, May 1995. Current
Position: Senior Computer Scientist, NIH.
David Grossman, Integrating Structured Data and Text:
A Relational Approach, Ph.D. Thesis, George Mason University,
December 1995. Current Position: Program Manager, Office
for Research & Development.
Anthony Ruocco, Parallel Clustering and Classification
of Monolithic and Non-Monolithic Document Bases, Ph.D. Thesis,
George Mason University, May 1996. Current Position:
LTC US ARMY, Assistant Professor of Electrical Engineering and
Computer Science, West Point Military Academy.
Sorin G. Nastea, Parallel Solutions for Sparse Matrix
Computations, Ph.D. Thesis, George Mason University, December
1996. Current Position: Assistant Professor of Computer
Science, Tuskegee University.
Carol Lundquist, Relational Information Retrieval:
Using Relevance Feedback and Parallelism to Improve Accuracy and
Performance, Ph.D. Thesis, George Mason University, December
1997. Current Position: Senior Engineer, Lockheed Martin.
Brian Willard, Large Scale Information Retrieval Systems:
Resolving Memory Leaks in Non-Cooperative Server Applications,
Ph.D. Thesis, Florida Institute of Technology, May 1998. Current
Position: Technical Specialist (Level 120), Northrop-Grumman.
* Education and curriculum development at all levels.
The PI teaches graduate and undergraduate courses whose content
is influenced by the research conducted in this project.
* Industry -- collaborations, transfer of technology, patents.
Over the years, we have developed several information processing
systems. Some of these systems are presently in daily use
at the National Institutes of Health, NCR, and at Harris Corporation.
We are in discussion with a major commercial database vendor
regarding the deployment of our integrated information retrieval
as a database application technology as part of their text offerings.
Efficient multicast and fault-tolerant protocols were also designed,
verified, and developed. Some of the technology is currently
deployed by a "Top-Three" telecommunications company
in at least 27 major cities nationwide.
Papers were authored describing both the various information systems
and the protocols. Two books and two issued US patents also focus
on these same efforts. Two additional US patent applications
are currently under evaluation.
Project References
B. Kjell, W. A. Woods, and O. Frieder, "Discrimination of
Authorship Using Visualization," Information Processing
and Management, Pergamon Press, 30(1), pp. 141-150, January
1994.
T. Yap, O. Frieder, R. Martino, High Performance Computational
Methods for Biological Sequence Analysis, Kluwer Academic
Publishers, ISBN 0-7923-9724-X, 1996.
D. A. Grossman, O. Frieder, D. O. Holmes, and D. C. Roberts, "Integrating
Structured Data and Text: A Relational Approach," Journal
of the American Society of Information Science, 48(2), February
1997.
O. Frieder and H. Siegelmann, "Document Allocation in Multiprocessor
Information Retrieval Systems," IEEE Transactions on Knowledge
and Data Engineering, 9(4), July/August 1997.
A. Ruocco and O. Frieder, "Clustering and Classification
of Large Document Bases in a Parallel Environment," Journal
of the American Society of Information Science, 48(10), October
1997.
S. Nastea, O. Frieder, and T. El-Ghazawi, "Load Balanced
Sparse Matrix-Vector Multiplication on Parallel Computers,"
Journal of Parallel and Distributed Computing, 46(2), November
1997.
C. Lundquist, D. Grossman, and O. Frieder, "Improving Relevance
Feedback in the Vector-Space Model," ACM Sixth Conference
on Information and Knowledge Management, Las Vegas, Nevada,
November 1997.
T. Yap, O. Frieder, and R. Martino, "Parallel Computation
in Biological Sequence Analysis," IEEE Transactions on
Parallel and Distributed Systems, (to appear).
Area Background
Information retrieval is the selection of documents that are potentially
relevant to a user's information need. Given the vast volume
of data stored in modern information retrieval systems, understanding
the data requires the use of visual aids and searching the document
database requires the use of vast computational resources. As
part of our study, we developed both visual data understanding
tools and related parallel information systems. Our systems
were customized for the needs of a diversity of domains including
traditional text databases and biological gene sequences.
Area References
D. Grossman and O. Frieder, Ad Hoc Information Retrieval: Algorithms
and Heuristics, Kluwer Academic Publishers, (to appear in early
1998).
K. Sparck Jones and P. Willett (Editors), Readings in Information
Retrieval, Morgan Kaufmann Publishers, Inc., ISBN: 1-55860-454-5,
1997.
T. Yap, O. Frieder, R. Martino, High Performance Computational Methods for Biological Sequence Analysis, Kluwer Academic Publishers,
ISBN 0-7923-9724-X, 1996.
G. Kowalski, Information Retrieval Systems: Theory and Implementation,
Kluwer Academic Publishers, ISBN: 0-7923-9926-9, 1997.
Information Retrieval
Computational Biology
Other