Computational Proxies for Genomic Databases

Judith Bayard Cushing*
Elizabeth Kutter**

*Computer Science
**Molecular Biology
The Evergreen State College
Olympia, WA 98505
 
 

Contact Information

Judith Bayard Cushing
LAB I
The Evergreen State College
Olympia, WA 98505
Phone: (360) 866-6000 x-6652
Fax: (360) 866-6794
Email: judyc@evergreen.edu

WWW PAGE

www.evergreen.edu/scidb
http://hunkworks.mbt.washington.edu/PROJECTS/gto.html

Keywords

Scientific Databases
Object-oriented Databases
Program Interoperability
Conceptual Modeling
Extensible Systems
Data Archiving and Data Mining

Molecular Biology
Computational Chemistry

Project Award Information

Project Summary

Scientific applications and databases rarely interoperate easily. That is, scientific researchers who use computers expend significant time and effort writing special procedures to use their program with someone else's data, or their data with someone else's programs. These problems are exacerbated in modern computing environments, which consist of multiple computers of possibly different types. We are using object technology (databases and structures) to address problems of program and data interoperability, and the overall goal of the project is to specify, design, implement and support an infrastructure for conducting computational science experiments.

Our base infrastructure consists of an object-oriented (database) system with two parts: a domain database that holds the inputs and outputs of experiments run using a variety of applications, and a subsystem dubbed the "computational proxy" that models application programs and processes. We believe that our results will be applicable to other scientific domains, and are extending the work to the ecological and earth sciences.

For the domain of computational chemistry, we extended an existing object database system to facilitate the invocation, monitoring, and output capture of a variety of independently developed programs (aka legacy applications).

For molecular biology, we are addressing what our collaborators specified as their most immediate need -- integrating and reasoning about the complex results of computational applications. We work to provide adequate information (metadata) about computational results so that those results are more readily reproducible, and to track operations on these results such that those operations can be repeated for other sets of results or modified. A major open issue involves declarative representations of computations such that the metadata can be captured, program inputs generated and program outputs parsed semi-automatically.
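The idea of recording metadata about a computation so that it can be reproduced, or repeated on other data, can be illustrated with a minimal sketch. This is not the project's implementation (the prototypes were written in Smalltalk and Java); the names `RunRecord`, `run_and_record`, `repeat_on`, and the stand-in program `toy_match` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RunRecord:
    """Metadata describing one invocation of a computational program."""
    program: str                # hypothetical program name
    parameters: Dict[str, str]  # settings needed to reproduce the run
    inputs: List[str]           # input sequences given to the program
    outputs: List[str]          # result items the program produced


def run_and_record(program: str,
                   func: Callable[..., List[str]],
                   inputs: List[str],
                   **parameters: str) -> RunRecord:
    """Invoke func and capture enough metadata to repeat the run later."""
    outputs = func(inputs, **parameters)
    return RunRecord(program, dict(parameters), list(inputs), list(outputs))


def repeat_on(record: RunRecord,
              func: Callable[..., List[str]],
              new_inputs: List[str]) -> RunRecord:
    """Repeat a recorded operation, with the same parameters, on new inputs."""
    return run_and_record(record.program, func, new_inputs, **record.parameters)


def toy_match(sequences: List[str], motif: str) -> List[str]:
    """Stand-in for a legacy application: keep sequences containing motif."""
    return [s for s in sequences if motif in s]


# Record one run, then repeat the same operation on a different input set.
first = run_and_record("toy_match", toy_match, ["GATTACA", "CCGG"], motif="TTA")
second = repeat_on(first, toy_match, ["ATTAG", "GGCC", "TTTA"])
```

Because the parameters are stored with the result, the second run needs no information beyond the first record and the new inputs. A declarative description of each program, the open issue noted above, would be what allows such records to be produced semi-automatically rather than by hand-written wrappers.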

Goals, Objectives, and Targeted Activities

Molecular biology applications, like those of other scientific domains, need to store and view large amounts of specialized quantitative information. With the advent of high-speed sequencing technology and considerable funding to map the genomes of key biological organisms, public databases such as GenBank, PDB, EMBL, JIPID, and SwissProt now make millions of genetic sequences available to molecular biologists. In addition, industry and university laboratories maintain considerable private sequence databases for preliminary or proprietary research. The need for common interfaces and query languages to exploit these heterogeneous databases is well documented, and several such systems now exist or are under development. Our work on database and program interoperability in this domain has shown, however, that providing a single interface to the databases is but the first step towards making these databases fully useful to researchers. Molecular biologists also need to be able to send their own sequences, as well as sequences retrieved from public databases, through computational applications, and to manage the results of those computations.

Unlike researchers in most other scientific domains, molecular biologists often split the output from computational biology programs into many relatively small data items so that they can reason about them independently. For example, a researcher might run dozens or even hundreds of sequence comparison searches in the course of a scientific study, each of which generates many separate items. The researcher might thus be faced with thousands of items to organize, compare, reason about and keep track of.

Our current research aims to develop conceptual structures and implement prototypes that allow experimentation with integrating, analyzing and tracking inputs and results from numerous computational biology programs. Our current prototype helps researchers organize result items from sequence comparisons into clusters that can be marked, named, annotated, and manipulated. An alpha version is implemented in Smalltalk. A separate prototype allows researchers to apply simple operations to result elements, as well as to create new elements, in order to compare results across different programs; it is implemented in Java.
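As a rough illustration of what organizing result items into clusters might look like, the following sketch groups sequence comparison hits into clusters that can be named, marked, annotated, and merged. It is written in Python for brevity and does not reproduce the Smalltalk or Java prototypes; the classes `ResultItem` and `Cluster`, the function `cluster_by_hit`, and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResultItem:
    """One item split out of a sequence comparison result."""
    query: str    # sequence that was searched with
    hit: str      # database entry that matched
    score: float  # similarity score reported by the program


@dataclass
class Cluster:
    """A named group of result items that a researcher can mark and annotate."""
    name: str
    items: List[ResultItem] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)
    marked: bool = False

    def annotate(self, note: str) -> None:
        self.notes.append(note)

    def merge(self, other: "Cluster", name: str) -> "Cluster":
        """Combine two clusters into a new one, keeping items and notes."""
        return Cluster(name, self.items + other.items, self.notes + other.notes)


def cluster_by_hit(items: List[ResultItem], min_score: float) -> Dict[str, Cluster]:
    """Group items that score at least min_score by the entry they matched."""
    clusters: Dict[str, Cluster] = {}
    for item in items:
        if item.score >= min_score:
            clusters.setdefault(item.hit, Cluster(name=item.hit)).items.append(item)
    return clusters


# Made-up result items from three searches.
results = [
    ResultItem("q1", "entryA", 92.0),
    ResultItem("q2", "entryA", 75.5),
    ResultItem("q2", "entryB", 61.0),
    ResultItem("q3", "entryB", 12.0),  # below threshold, left unclustered
]
clusters = cluster_by_hit(results, min_score=50.0)
clusters["entryA"].marked = True
clusters["entryA"].annotate("candidate homologs; check by hand")
combined = clusters["entryA"].merge(clusters["entryB"], name="A-or-B")
```

Grouping by matched entry is only one possible criterion; the point of the sketch is that the cluster, rather than the individual item, becomes the unit the researcher names and reasons about.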

Working closely with molecular biologists at the University of Washington Biotechnology Center and Zymogenetics, we have developed a conceptual data model and prototype, and are deploying our prototypes in sequencing laboratories to gain direct experience with scientists. The research aspects of this project are "winding down". Our primary goals for the coming year are to evaluate the effectiveness of the current prototype, make the software and operational changes needed to bring the technology into the hands of practising scientists, and consider what scientific work can now be done with such tools that was not possible earlier. We will also evaluate our current work in terms of what computer science research is needed to make such systems even more effective.

To those ends, we will deploy the prototype into two laboratories for use by scientists not involved in its development, and make the software changes needed to turn it into more than a research prototype. We are establishing usability criteria for determining to what extent the results of our research are actually producing better science. We are writing user's guides and establishing procedures for reporting the status of system modifications to users. We are also designing a scripting language for reasoning about results, adding a database for persistence and concurrency, and generalizing the work to determine its applicability to other disciplines.

Note that operational aspects of the system (technology transfer) are being funded by the Washington Technology Center.

Indication of Success

Two indications of success are the award of a Washington Technology Center grant for technology transfer of the research, which covers the operational aspects of the project, and the establishment by one of our collaborators of a private venture in drug discovery. In the past year we have published a paper, written a book chapter, and presented our work at the ACM SIGMOD data mining workshop.

More specifically, we have seen: (1) a significant development effort for an operational system at the Pacific Northwest National Lab on the promise of our work as a proof of concept; (2) the deployment of our molecular biology system into two laboratories; (3) synergism between our work and that of the Cold Spring Harbor Molecular Biology Workbench Team (Genome Topographer), which has since received venture capital and become a startup company. In addition to exchanging ideas, the two groups have exchanged personnel, and the startup has hired our graduates.

Project Impact

Project impact has been in several areas, including human resources, education, and technology transfer. The human resource aspect is particularly important, since the training of individuals with bioinformatics expertise has been noted as a national priority.

Project References

J. B. Cushing, "Computational Proxies: An Object-based Infrastructure for Experiment Management", Ph.D. Thesis, Department of Computer Science and Engineering, Oregon Graduate Institute of Science & Technology, Portland, OR, 1995.

J. B. Cushing, D. Maier et al, "Computational Proxies: Modeling Scientific Applications in Object Databases", Seventh International Working Conference on Statistical and Scientific Database Management (SSDBM), 1994.

D. Maier, J. B. Cushing, D. Hansen et al, "Object Data Models for Shared Molecular Structures", STP 1214, American Society for Testing and Materials (ASTM), 1994.

J. B. Cushing, D. Hansen, D. Maier and C. Pu, "Connecting Scientific Programs and Data Using Object Databases", Bulletin of the Technical Committee on Data Engineering, 16:1, 1993.

M. Rao, "Computational Proxies for Computational Chemistry: A Proof of Concept", Master's Thesis, Department of Computer Science and Engineering, Oregon Graduate Institute of Science & Technology,

D. Abel, "The PCL: An Implementation of the Computational Chemistry Output Language", Master's Thesis, Portland State University, 1996.

D. Maier and J. B. Cushing, "Treating Programs as Objects: The Computational Proxy Experience", Invited paper, Conference on Deductive and Object-Oriented Database (DOOD'93), LNCS 760, Springer-Verlag, pp 1-12, 1993.

D. Maier, J. B. Cushing, T. Keller and T. Marr, "Proxies in Practice: Object Architectures for Distributed Computational Workbenches", Journal of the Brazilian Computer Society, 3:2, 1996, pp 16-29.

J. B. Cushing, T. Hunkapiller, E. Kutter, J. Laird, Dave Yee and Frank Zucker, "Beyond Interoperability -- Tracking and Managing the Results of Computational Applications", International Conference on Scientific and Statistical Databases, D. Hansen (ed), IEEE Press, 1997.

A revised version of the above paper will be published in D. Shasha and J. Tsong-Li Wang's "Pattern Discovery in Molecular Biology Data", 1998.

J. B. Cushing, T. Hunkapiller, E. Kutter, J. Laird, Emir Pasalic, Dave Yee and Frank Zucker, "Data Mining the Genome Databases", a poster for the Workshop on Data Mining, SIGMOD 1997.

N. Nadkarni, G. Parker, D. Ford, J. Cushing and C. Stallman, "The Canopy Research Network: A Model for Information Exchange among Forest Canopy Scientists", Northwest Science, 1995.

J. B. Cushing, N. Nadkarni, D. Maier et al, "Database Support for Forest Canopy Researchers -- Metadata as a byproduct of the research process", The Conference on Scientific and Technical Data Exchange and Integration, J. Rumble, P. Uhlir and B. Wright (eds), U.S. National Committee for CODATA, National Research Council, 1997.

Area Background

Since we started our scientific database work in 1991, much progress has been made in improving the physical interoperability of programs and databases. Technological advances such as Microsoft's Object Linking and Embedding (OLE), AppleEvents, ToolTalk, and CORBA-compliant implementations such as ParcPlace's Distributed Smalltalk are beginning to make interoperability technically feasible.

With respect to scientific databases and applications, mainstream work in distributed, or mediation, architectures, such as that of Wiederhold and his collaborators, is highly relevant, as it provides a common understanding of the problem and a language and context for discussing alternative solutions. To make this technology work for the scientist, agreement as to the meaning of the data being exported and imported is required, and there has been considerable interest of late in using advanced data models for integrating scientific data and letting applications interoperate with one or more models through a uniform interface. Rieche and Dittrich, and Kemp et al, present two examples of this approach for molecular biology data, using object-oriented and functional models, respectively. Object-oriented technology has also been used by others to provide more intelligent interfaces to scientific applications, such as the SCENE system of Peskin and Walther and the experiment management system of Sparr and colleagues. Probably the most similar work to our own, in that it uses an object-oriented database and focuses on experiment management, is the ZOO system of Ioannidis, Livny and others. While they have concentrated on computational experiments, they have more support for laboratory and observational experiments than we do.

Within the mainstream of database research, semantic and object-oriented data modeling are relevant because complex scientific objects need to be represented independently of any physical schema. Research on personal databases and laboratory notebooks is relevant because of the interplay between an individual scientist's private databases and laboratory-wide or public databases.

As Peter M.D. Gray and his colleagues point out, "Molecular biology provides one of the most challenging application domains for database research". It also provides one of the most productive, since all major genetic science publications now require researchers to publish their sequences electronically and the Human Genome Project set aside a considerable percentage of funding for bioinformatics research. Molecular biologists are now using private and public databases, perhaps because these data are (at least on the surface) easily represented as ASCII character strings. Between 1985 and 1989 the number of nucleotides in centralized DNA databases increased 7-fold, from 3 to 21 million. Automated sequencing methods promise to increase this database 10-fold every five years; the 1994 goal was to sequence 160 million base pairs per year (up from 16 million in 1989). The actual accumulation exceeded this prediction -- over 160 million bases were added to GenBank in 1995. Some researchers contend that data is currently accumulating at near-exponential rates. The Human Genome Project and the Brookhaven Protein Database are perhaps the most ambitious of these public database projects, but there are a number of other genome-related databases.
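To put the growth figures above on a common footing, the following sketch converts each "N-fold over M years" statement into an equivalent yearly growth factor. It assumes steady compound growth between the quoted endpoints, which is a simplification introduced here and not a claim made by the sources of those figures; the function names are hypothetical.

```python
import math


def annual_factor(fold: float, years: float) -> float:
    """Yearly growth factor equivalent to a fold increase over a span of years."""
    return fold ** (1.0 / years)


def doubling_time(fold: float, years: float) -> float:
    """Years needed to double, under the same steady-growth assumption."""
    return years * math.log(2.0) / math.log(fold)


# 3 million to 21 million nucleotides between 1985 and 1989 (7-fold in 4 years).
observed = annual_factor(21 / 3, 1989 - 1985)

# The 10-fold-every-five-years projection (16 to 160 million, 1989 to 1994).
projected = annual_factor(160 / 16, 1994 - 1989)

print(f"1985-1989: x{observed:.2f} per year")
print(f"1989-1994 goal: x{projected:.2f} per year")
print(f"doubling time at the goal rate: {doubling_time(10, 5):.2f} years")
```

Under this assumption both figures correspond to growth of roughly 60 percent per year, so the 10-fold projection is close to a continuation of the 1985 to 1989 rate.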

As a result: (1) much data is available electronically to researchers, who very much want to use it, and (2) much research and development has been done to provide common file formats, databases, interfaces to databases, and computational applications. Thus computer scientists interested in scientific data and program interoperability have a wide range of publicly available data and programs to work with.

Research and development in molecular bioinformatics can be categorized as follows:

Area References

J. French, A. Jones and John Pfaltz, "A Summary of the NSF Scientific Database Workshop", Data Engineering (Special Issue on SSDBMS), 13:3, September 1990, IEEE Computer Society, pp. 55-61.

Proceedings of the International Working Conferences on Statistical and Scientific Database Management (SSDBM), IEEE Press, 1994, 1996, 1997, etc.

R. Jain, "Workshop Report on NSF Workshop on Visual Information Management Systems", Computer Science and Engineering Division, University of Michigan, 1993.

"Finding the Forest in the Trees", Committee for a pilot study on database interfaces, National Academy Press, Washington, DC, 1995.

"Building Domain-Specific Environments for Computational Science: A Case Study in Seismic Tomography", J.E. Cuny et al, The International Journal of Supercomputer Applications and High Performance Computing, 11:3, Fall 1997, pp 179-196.

Potential Related Projects

This project would also benefit from collaboration with researchers in data mining, human-computer interaction, domain-specific languages, and visual programming. For our new project, we will be consulting recent literature on schema migration and integration, spatial data structures, and interactive on-line queries.