CAREER: Generalized Search Technique for Indexing Complex Data


Joseph M. Hellerstein

EECS Computer Science Division
UC Berkeley

Contact Information

Joseph M. Hellerstein
EECS Computer Science Division
387 Soda Hall #1776
Berkeley, CA 94720-1776
Phone: (510) 643-4011
Fax: (510) 642-5615
Email: jmh@cs.berkeley.edu

WWW PAGE

http://gist.cs.berkeley.edu

Keywords

Indexability, Generalized Search Tree, GiST, Index, Abstract Data Type, Object.

Project Award Information

Project Summary

This project addresses two main challenges. The first is to produce a software development environment for easily generating indexing techniques for new complex data, e.g., maps, images, sound, video, sequences, documents, etc., with large database workloads. This is being done by (1) implementing a data structure called a Generalized Search Tree (GiST) in the context of an Object-Relational database management system, and (2) developing a graphical GiST visualization and debugging system. The second challenge is to undertake a rigorous mathematical investigation of how and when one can develop efficient indexes; this is referred to as a Theory of Indexability. Indexability theory is akin to complexity theory, which has guided software developers in understanding how and when one can develop efficient algorithms. The main distinction is that indexability focuses on space/time tradeoffs for indexes, and on the fundamental secondary-storage challenges involved in database indexing. Execution of this research agenda will produce a framework in which data domain experts (e.g., biologists, chemists, earth scientists, computer multimedia researchers, etc.) can (1) identify whether their workloads are indexable, and (2) if so, easily develop efficient indexes for their workloads, integrating these indexes with minimal effort into Object-Relational database management systems. An additional effort is to integrate the development environment into a curriculum for teaching both basic and advanced techniques in indexing and indexability. This curriculum includes the software itself, along with assignments being tested in undergraduate and graduate courses at Berkeley taught by the principal investigator.

Goals, Objectives, and Targeted Activities

In the next year we will integrate our GiST library with the SHORE object repository, and develop an Access Method Debugger called amdb that will provide graphical analysis and debugging of search trees. Part of the development of amdb involves analysis of the efficacy of a search tree; this in essence uses our indexability framework to compare the clustering in the existing tree with an optimal clustering (which may in turn perform poorly due to the workload's inherent indexability.) We intend to explore a variety of clustering algorithms to use as indicators of the indexability of the workload. Since clustering technology has not previously been applied in this framework, we intend to do an evaluation of alternative clustering techniques, and report on their relationship to indexability analysis.

Once an initial version of amdb is stable, we will begin to insert it into the curriculum of our undergraduate database course, to animate B+Tree algorithms, and allow students to interactively explore their behavior. In addition, we will use it in our graduate course to motivate the difficulty of indexing in multiple dimensions, and allow students to explore ideas for more efficient indexes over multidimensional data as well as abstract data types.

Finally, we will use amdb to begin applying GiST technology to a variety of applications. One target is the analysis of competing indexes for near-neighbor search. Another is an integration of GiST with the image analysis work being done in the UC Berkeley digital library project. A third possible target is to integrate GiST as an accelerator for genomics applications like BLAST.

Indication of Success

This project, while only in its first year, has already begun to have impact. A demonstration of amdb will be presented at ACM SIGMOD '98. Our initial work on indexability (Hellerstein, Papadimitriou and Koutsoupias, PODS '97) is driving other research groups: new independent results by Koutsoupias and Taylor will appear in ACM PODS '98. We have begun to discuss the integration of GiST technology into commercial database systems; a major vendor recently approached us about the possibility of this occurring in the short term.

In the longer term, we believe that indexability analysis in general, and amdb in particular, will provide a significant benefit to the database research community. Over the last number of years there has been an enormous volume of publications proposing new indexes. Most proposals differ in only minor ways, and can easily be implemented in the GiST. Indexability analysis gives researchers and developers tools to robustly evaluate both the difficulty of their task and the efficacy of their solution. We envision a significant clarification of the field as a result of our work, and we also look forward to this work driving the development of effective indexing schemes for currently unsupported workloads.

Project Impact

The amdb and indexability project has built bridges at Berkeley between the database group and the theory and digital libraries groups. Active collaboration is underway as a result.

Project References

Hellerstein, J. M., J. F. Naughton and A. Pfeffer. "Generalized Search Trees for Database Systems", Proc. 21st International Conference on Very Large Data Bases, Zurich, September, 1995.

Kornacker, M., C. Mohan, and J.M. Hellerstein. "Concurrency and Recovery in Generalized Search Trees." Proc. ACM-SIGMOD International Conference on Management of Data, Tucson, May, 1997.

Hellerstein, J. M., C. H. Papadimitriou, and E. Koutsoupias. "Towards an Analysis of Indexing Schemes." Proc. Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Tucson, May 1997.

Aoki, P. "Generalizing 'Search' in Generalized Search Trees".Proc. 14th Int'l Conf. on Data Engineering, Orlando, FL, Feb. 1998.

Koutsoupias, E. and D. S. Taylor. "Tight Bounds for 2-Dimensional Indexing Schemes." Proc. Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Seattle, June, 1998.

Area Background

Index structures are one of the core components of any information system -- they allow particular data items to be retrieved directly from large datasets, without scanning through all the data. The traditional index structure in database systems is the B+-tree, which works well for range and equality lookups on alphanumeric data. However, much of the data in use today is far more complex than traditional alphanumeric data: witness the "information explosion" on the world-wide web, which includes unusual data such as maps, images, sound, video, sequences (of DNA, stock prices, etc.), documents and so on. For most of this data, the traditional B+-tree approach is not sufficient for queries that are natural to the data types (e.g. "find all videos with explosions", or "find all drum solos".) As a result, there is usually support for only the most naive queries on such data, such as queries based on the textual names of the data objects.

We have developed a framework for solving this problem, by designing and implementing a data structure called a Generalized Search Tree (GiST). The GiST generalizes a large subset of the previously invented indexes into one manageable piece of software, by reducing search trees to their essential properties: a hierarchical decomposition of a dataset, with labels at each level of the hierarchy that describe the data below. Within this basic framework, it is possible to index any set of data, and support arbitrary queries over the data.

The GiST's flexibility provides great power, but it is currently difficult to know how to harness that power for indexing in new applications. In general, data must be carefully placed into the index, and the index nodes carefully labeled, in order to guarantee efficient processing of queries. Moreover, a set of queries may generate conflicting demands on the index, resulting in situations where some queries must suffer if others are to be satisfied quickly. While it is currently very easy to use the GiST software, it is often quite hard to make it work efficiently, and sometimes unclear if it can be made to work efficiently for a particular set of data and queries.

Our work seeks to provide tools -- both actual software tools as well as theoretical "tools" -- to help index developers control and tune indexes so that they are used effectively. The work will result in an index development environment, along with crisp guidelines to describe the possibilities and limitations of the environment. Both the software and the theory are critical: the software enables developers to achieve what is possible, and the theory warns them what is not possible. Both classes of results are missing in indexing research today, and are key to developing quick access for the new workloads emerging in multimedia, complex data analysis, and the world-wide web. In addition to developing these software tools, we are keen to prove their utility by applying them to the challenge of inventing new indexing techniques for emerging applications.

Area References

Comer, D. "The Ubiquitous B-Tree." Computing Surveys, 11(2):121-137, June 1979.

Faloutsos, C. Searching Multimedia Databases by Content, Kluwer Academic Publishers, Boston, 1996.

Finkel, R. A. and J. L. Bentley. Quad-Trees: A Data Structure For Retrieval On Composite Keys. ACTA Informatica, 4(1):1-9, 1974.

Guttman, A. R-Trees: A Dynamic Index Structure For Spatial Searching. In Proc. ACM-SIGMOD International Conference on Management of Data, pages 47-57, Boston, June 1984.

Knuth, D.E. Sorting and Searching, Volume 3 of The Art of Computer Programming. Addison-Wesley Publishing Co., 1973.

H. Samet. Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison-Wesley, Reading, Mass, 1990.

Stonebraker, M. "Inclusion of New Types in Relational Database Systems", Proceedings of the IEEE Fourth International Conference on Data Engineering, "Washington, D.C.",1986, Feb, 262-269.

Potential Related Projects