*Computer Science
**Molecular Biology
The Evergreen State College
Olympia, WA 98505
Molecular Biology
Computational Chemistry
Our base infrastructure consists of an object-oriented database system with two parts: a domain database that holds the inputs and outputs of experiments run with a variety of applications, and a subsystem, dubbed the "computational proxy", that models application programs and processes. We believe that our results will be applicable to other scientific domains, and we are extending the work to the ecological and earth sciences.
For the domain of computational chemistry, we extended an existing object database system to facilitate the invocation, monitoring, and output capture of a variety of independently developed programs (aka legacy applications).
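The proxy idea can be illustrated with a minimal sketch. This is not the project's implementation (which extended an object database); the class name `ComputationalProxy`, its fields, and the status values are all hypothetical, and a one-line Python script stands in for a legacy application. The sketch only shows one object covering the three duties named above: invocation, monitoring, and output capture.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComputationalProxy:
    """Hypothetical stand-in object for one run of an external program."""
    command: List[str]                  # how to invoke the legacy application
    input_text: str = ""                # input passed to the program on stdin
    status: str = "created"             # created -> running -> completed / failed
    output_text: Optional[str] = None   # captured standard output
    return_code: Optional[int] = None   # exit code reported by the program

    def run(self) -> "ComputationalProxy":
        """Invoke the program, record its status, and capture its output."""
        self.status = "running"
        result = subprocess.run(
            self.command,
            input=self.input_text,
            capture_output=True,
            text=True,
        )
        self.return_code = result.returncode
        self.output_text = result.stdout
        self.status = "completed" if result.returncode == 0 else "failed"
        return self


# Assumption: a tiny Python one-liner plays the role of the legacy program.
ECHO_PROGRAM = [
    sys.executable,
    "-c",
    "import sys; print('ECHO ' + sys.stdin.read().strip())",
]

if __name__ == "__main__":
    proxy = ComputationalProxy(command=ECHO_PROGRAM, input_text="H2O geometry").run()
    print(proxy.status, proxy.output_text)
```

In a database setting the proxy object itself would be stored, so that the run's inputs, status and outputs remain queryable alongside the domain data.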
For molecular biology, we are addressing what our collaborators specified as their most immediate need -- integrating and reasoning about the complex results of computational applications. We work to provide adequate information (metadata) about computational results so that those results are more readily reproducible, and to track operations on these results such that those operations can be repeated for other sets of results or modified. A major open issue involves declarative representations of computations such that the metadata can be captured, program inputs generated and program outputs parsed semi-automatically.
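One way to picture a declarative representation of a computation is as a small data description from which inputs are generated and outputs parsed. The sketch below is an illustration under stated assumptions, not the project's design: the program name `seqsearch`, the input keywords, and the output line formats are invented for the example.

```python
import re
from string import Template

# Hypothetical declarative description of one program: how to build its input
# and which fields to pull out of its output. Nothing here is program code.
SPEC = {
    "program": "seqsearch",
    "input_template": "QUERY $query\nDATABASE $database\nCUTOFF $cutoff\n",
    "output_fields": {
        "hits": (r"^HITS\s+(\d+)$", int),
        "best_score": (r"^BEST\s+([0-9.]+)$", float),
    },
}


def generate_input(spec, params):
    """Fill the input template from a parameter dictionary."""
    return Template(spec["input_template"]).substitute(params)


def parse_output(spec, text):
    """Extract each declared field from the program's output text."""
    parsed = {}
    for name, (pattern, convert) in spec["output_fields"].items():
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match:
            parsed[name] = convert(match.group(1))
    return parsed


def describe_run(spec, params, output_text):
    """Bundle metadata about a run: program, parameters used, parsed results."""
    return {
        "program": spec["program"],
        "parameters": dict(params),
        "results": parse_output(spec, output_text),
    }


if __name__ == "__main__":
    params = {"query": "ACGT", "database": "demo", "cutoff": "0.01"}
    print(generate_input(SPEC, params))
    print(describe_run(SPEC, params, "HITS 3\nBEST 87.5\n"))
```

Because the parameters are recorded with the results, the same run can be repeated on another data set or with modified settings by changing only the parameter dictionary.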
Unlike researchers in most scientific domains, molecular biologists often split the output of computational biology programs into many relatively small data items so that they can reason about each item independently. For example, a researcher might run dozens or even hundreds of sequence comparison searches in the course of a scientific study, each of which generates many separate items. The researcher might thus face thousands of items to organize, compare, reason about and keep track of.
Our current research aims to develop conceptual structures and implement prototypes that allow experimentation with integrating, analyzing and tracking inputs and results from numerous computational biology programs. Our current prototype helps researchers organize result items from sequence comparisons into clusters that can be marked, named, annotated, and manipulated. An alpha version is implemented in Smalltalk. A separate prototype allows researchers to apply simple operations to result elements, as well as to create new elements, in order to compare results across different programs; it is implemented in Java.
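The clustering idea can be sketched as a small data structure. The names below (`ResultItem`, `Cluster`, `cluster_by_subject`) and the choice of grouping by matched sequence are assumptions made for illustration; the Smalltalk and Java prototypes are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ResultItem:
    """One item split out of a sequence comparison result (hypothetical fields)."""
    search_id: str     # which search produced the item
    subject_id: str    # the database sequence that was matched
    score: float       # similarity score reported by the program


@dataclass
class Cluster:
    """A named group of result items that can be marked and annotated."""
    name: str
    items: List[ResultItem] = field(default_factory=list)
    annotations: List[str] = field(default_factory=list)
    marked: bool = False

    def annotate(self, note: str) -> None:
        self.annotations.append(note)

    def select(self, name: str, keep: Callable[[ResultItem], bool]) -> "Cluster":
        """Create a new cluster holding only the items that satisfy `keep`."""
        return Cluster(name=name, items=[i for i in self.items if keep(i)])


def cluster_by_subject(items: List[ResultItem]) -> Dict[str, Cluster]:
    """Group items from many searches by the sequence they matched."""
    clusters: Dict[str, Cluster] = {}
    for item in items:
        if item.subject_id not in clusters:
            clusters[item.subject_id] = Cluster(name=item.subject_id)
        clusters[item.subject_id].items.append(item)
    return clusters


if __name__ == "__main__":
    items = [
        ResultItem("search-1", "P1", 90.0),
        ResultItem("search-2", "P1", 40.0),
        ResultItem("search-2", "P2", 75.0),
    ]
    clusters = cluster_by_subject(items)
    clusters["P1"].marked = True
    clusters["P1"].annotate("matched by two independent searches")
    strong = clusters["P1"].select("P1-strong", lambda i: i.score >= 50)
    print(sorted(clusters), len(strong.items))
```

Keeping each cluster as its own object is what lets a researcher name, mark and annotate it, and derive new clusters from it, without touching the underlying result items.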
Working closely with molecular biologists at the University of Washington Biotechnology Center and Zymogenetics, we have developed a conceptual data model and prototype, and are deploying our prototypes in sequencing laboratories to gain direct experience with scientists. The research aspects of this project are "winding down". Our primary goals for the coming year are to evaluate the effectiveness of the current prototype, make the software and operational changes needed to bring the technology into the hands of practising scientists, and consider what scientific work can now be done with such tools that was not possible earlier. We will evaluate our current work in terms of what computer science research is needed to make such systems even more effective.
To those ends, we will deploy the prototype into two laboratories for use by scientists not involved in its development, and make the software changes needed to turn it into more than a research prototype. We are establishing usability criteria for determining to what extent the results of our research are actually producing better science. We are writing user's guides and establishing procedures for reporting the status of system modifications to users. We are also designing a scripting language for reasoning about results, adding a database for persistence and concurrency, and generalizing the work to determine its applicability to other disciplines.
Note that operational aspects of the system (technology transfer) are being funded by the Washington Technology Center.
More specifically, we have seen: (1) a significant development effort for an operational system at the Pacific Northwest National Lab on the promise of our work as a proof of concept; (2) the deployment of our molecular biology system into two laboratories; and (3) synergism between our work and that of the Cold Spring Harbor Molecular Biology Workbench Team (Genome Topographer), which has since received venture capital and become a startup company. In addition to the exchange of ideas, there has been an exchange of personnel, including their hiring of our graduates.
J. B. Cushing, D. Maier et al, "Computational Proxies: Modeling Scientific Applications in Object Databases", Seventh International Working Conference on Statistical and Scientific Database Management (SSDBM), 1994.
D. Maier, J. B. Cushing, D. Hansen et al, "Object Data Models for Shared Molecular Structures", STP 1214, American Society for Testing and Materials (ASTM), 1994.
J. B. Cushing, D. Hansen, D. Maier and C. Pu, "Connecting Scientific Programs and Data Using Object Databases", Bulletin of the Technical Committee on Data Engineering, 16:1, 1993.
M. Rao, "Computational Proxies for Computational Chemistry: A Proof of Concept", Master's Thesis, Department of Computer Science and Engineering, Oregon Graduate Institute of Science & Technology,
D. Abel, "The PCL: An Implementation of the Computational Chemistry Output Language", Master's Thesis, Portland State University, 1996.
D. Maier and J. B. Cushing, "Treating Programs as Objects: The Computational Proxy Experience", Invited paper, Conference on Deductive and Object-Oriented Database (DOOD'93), LNCS 760, Springer-Verlag, pp 1-12, 1993.
D. Maier, J. B. Cushing, T. Keller and T. Marr, "Proxies in Practice: Object Architectures for Distributed Computational Workbenches", Journal of the Brazilian Computer Society, 3:2, 1996, pp 16-29.
J. B. Cushing, T. Hunkapiller, E. Kutter, J. Laird, Dave Yee and Frank Zucker, "Beyond Interoperability -- Tracking and Managing the Results of Computational Applications", International Conference on Scientific and Statistical Databases, D. Hansen (ed), IEEE Press, 1997.
A revised version of the above paper will be published in D. Shasha and J. Tsong-Li Wang's "Pattern Discovery in Molecular Biology Data", 1998.
J. B. Cushing, T. Hunkapiller, E. Kutter, J. Laird, Emir Pasalic, Dave Yee and Frank Zucker, "Data Mining the Genome Databases", a poster for the Workshop on Data Mining, SIGMOD 1997.
N. Nadkarni, G. Parker, D. Ford, J. Cushing and C. Stallman, "The Canopy Research Network: A Model for Information Exchange among Forest Canopy Scientists", Northwest Science, 1995.
J. B. Cushing, N. Nadkarni, D. Maier et al, "Database Support for Forest Canopy Researchers -- Metadata as a byproduct of the research process", The Conference on Scientific and Technical Data Exchange and Integration, J. Rumble, P. Uhlir and B. Wright (eds), U.S. National Committee for CODATA, National Research Council, 1997.
With respect to scientific databases and applications, mainstream work in distributed, or mediation, architectures, such as that of Wiederhold and his collaborators, is highly relevant, as it provides a common understanding of the problem and a language and context for discussing the alternative solutions. To make this technology work for the scientist, agreement as to the meaning of the data being exported and imported is required, and there has been considerable interest of late in using advanced data models for integrating scientific data and letting applications interoperate with one or more models through a uniform interface. Rieche and Dittrich and Kemp et al present two examples of this approach for molecular biology data, using object-oriented and functional models, respectively. Object-oriented technology has also been used by others to provide more intelligent interfaces to scientific applications, such as the SCENE system of Peskin and Walther and the experiment management system of Sparr and colleagues. Probably the most similar work to our own, in that it uses an object-oriented database and focuses on experiment management, is the ZOO system of Ioannidis, Livny and others. While they have concentrated on computational experiments, they have more support for laboratory and observational experiments than we do.
Within the mainstream of database research, semantic and object-oriented data modeling are important because complex scientific objects need to be represented independently of any physical schema. Research on personal databases and laboratory notebooks is relevant because of the interplay between an individual scientist's private databases and laboratory-wide or public databases.
As Peter M.D. Gray and his colleagues point out, "Molecular biology provides one of the most challenging application domains for database research". It also provides one of the most productive, since all major genetic science publications now require researchers to publish their sequences electronically and the human genome project set aside a considerable percentage of funding for bioinformatics research. Molecular biologists are now using private and public databases, perhaps because these data are (at least on the surface) easily represented as ASCII character strings. Between 1985 and 1989 the number of nucleotides in centralized DNA databases increased 7-fold, from 3 to 21 million. Automated sequencing methods promise to increase this database 10-fold every five years; the 1994 goal was to sequence 160 million base pairs per year (up from 16 million in 1989). The actual accumulation exceeded this prediction -- over 160 million bases were added to GenBank in 1995. Some researchers contend that data is currently accumulating at near exponential rates. The Human Genome project and the Brookhaven Protein Database are perhaps the most ambitious of these public database projects, but there are a number of other genome-related databases.
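The growth figures quoted above can be put on a common footing by converting each to an equivalent annual growth factor. The short calculation below is only an arithmetic check on the quoted numbers, assuming steady compound growth over each period; the function names are illustrative.

```python
import math


def annual_growth_factor(fold_increase: float, years: float) -> float:
    """Equivalent constant yearly multiplier for a given fold increase."""
    return fold_increase ** (1.0 / years)


def doubling_time_years(fold_increase: float, years: float) -> float:
    """Time for the database to double at that constant growth rate."""
    return years * math.log(2) / math.log(fold_increase)


# Observed: 3 million to 21 million nucleotides between 1985 and 1989.
observed = annual_growth_factor(21 / 3, 1989 - 1985)

# Projected: a 10-fold increase every five years.
projected = annual_growth_factor(10, 5)

if __name__ == "__main__":
    print(f"observed yearly factor:  {observed:.3f}")
    print(f"projected yearly factor: {projected:.3f}")
    print(f"projected doubling time: {doubling_time_years(10, 5):.2f} years")
```

Under that assumption both figures correspond to roughly 60 percent growth per year (factors of about 1.63 and 1.58), that is, a doubling about every year and a half.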
As a result: (1) much data is available electronically to researchers, who very much want to use it, and (2) much research and development has been done to provide common file formats, databases, interfaces to databases, and computational applications. Thus computer scientists interested in scientific data and program operability have a wide range of publicly available data and programs to work with.
Research and development in molecular bioinformatics can be categorized as follows:
Proceedings of the International Working Conferences on Statistical and Scientific Database Management (SSDBM), IEEE Press, 1994, 1996, 1997, etc.
R. Jain, "Workshop Report on NSF Workshop on Visual Information Management Systems", Computer Science and Engineering Division, University of Michigan, 1993.
"Finding the Forest in the Trees", Committee for a pilot study on database interfaces, National Academy Press, Washington, DC, 1995.
"Building Domain-Specific Environments for Computational Science: A Case Study in Seismic Tomography", J.E. Cuny et al, The International Journal of Supercomputer Applications and High Performance Computing, 11:3, Fall 1997, pp 179-196.