CAREER: Generalized Search Technique for Indexing
Complex Data
Joseph M. Hellerstein
EECS Computer Science Division
UC Berkeley
Contact Information
Joseph M. Hellerstein
EECS Computer Science Division
387 Soda Hall #1776
Berkeley, CA 94720-1776
Phone: (510) 643-4011
Fax: (510) 642-5615
Email: jmh@cs.berkeley.edu
WWW PAGE
http://gist.cs.berkeley.edu
Keywords
Indexability, Generalized Search Tree, GiST, Index,
Abstract Data Type, Object.
Project Award Information
-
Award Number: 9703972
-
Duration: 3 years
-
Current year:1st year
-
Title: CAREER: Generalized Search Technique for
Indexing Complex Data
Project Summary
This project addresses two main challenges. The
first is to produce a software development environment for easily generating
indexing techniques for new complex data, e.g., maps, images, sound, video,
sequences, documents, etc., with large database workloads. This is being
done by (1) implementing a data structure called a Generalized Search Tree
(GiST) in the context of an Object-Relational database management system,
and (2) developing a graphical GiST visualization and debugging system.
The second challenge is to undertake a rigorous mathematical investigation
of how and when one can develop efficient indexes; this is referred to
as a Theory of Indexability. Indexability theory is akin to complexity
theory, which has guided software developers in understanding how and when
one can develop efficient algorithms. The main distinction is that indexability
focuses on space/time tradeoffs for indexes, and on the fundamental secondary-storage
challenges involved in database indexing. Execution of this research agenda
will produce a framework in which data domain experts (e.g., biologists,
chemists, earth scientists, computer multimedia researchers, etc.) can
(1) identify whether their workloads are indexable, and (2) if so, easily
develop efficient indexes for their workloads, integrating these indexes
with minimal effort into Object-Relational database management systems.
An additional effort is to integrate the development environment into a
curriculum for teaching both basic and advanced techniques in indexing
and indexability. This curriculum includes the software itself, along with
assignments being tested in undergraduate and graduate courses at Berkeley
taught by the principal investigator.
Goals, Objectives, and Targeted Activities
In the next year we will integrate our GiST library
with the SHORE object repository, and develop an Access Method Debugger
called amdb that will provide graphical analysis and debugging of
search trees. Part of the development of amdb involves analysis of the
efficacy of a search tree; this in essence uses our indexability framework
to compare the clustering in the existing tree with an optimal clustering
(which may in turn perform poorly due to the workload's inherent indexability.)
We intend to explore a variety of clustering algorithms to use as indicators
of the indexability of the workload. Since clustering technology has not
previously been applied in this framework, we intend to do an evaluation
of alternative clustering techniques, and report on their relationship
to indexability analysis.
Once an initial version of amdb is stable, we
will begin to insert it into the curriculum of our undergraduate database
course, to animate B+Tree algorithms, and allow students to interactively
explore their behavior. In addition, we will use it in our graduate course
to motivate the difficulty of indexing in multiple dimensions, and allow
students to explore ideas for more efficient indexes over multidimensional
data as well as abstract data types.
Finally, we will use amdb to begin applying GiST
technology to a variety of applications. One target is the analysis of
competing indexes for near-neighbor search. Another is an integration of
GiST with the image analysis work being done in the UC Berkeley digital
library project. A third possible target is to integrate GiST as an accelerator
for genomics applications like BLAST.
Indication of Success
This project, while only in its first year, has
already begun to have impact. A demonstration of amdb will be presented
at ACM SIGMOD '98. Our initial work on indexability (Hellerstein, Papadimitriou
and Koutsoupias, PODS '97) is driving other research groups: new independent
results by Koutsoupias and Taylor will appear in ACM PODS '98. We have
begun to discuss the integration of GiST technology into commercial database
systems; a major vendor recently approached us about the possibility of
this occurring in the short term.
In the longer term, we believe that indexability
analysis in general, and amdb in particular, will provide a significant
benefit to the database research community. Over the last number of years
there has been an enormous volume of publications proposing new indexes.
Most proposals differ in only minor ways, and can easily be implemented
in the GiST. Indexability analysis gives researchers and developers tools
to robustly evaluate both the difficulty of their task and the efficacy
of their solution. We envision a significant clarification of the field
as a result of our work, and we also look forward to this work driving
the development of effective indexing schemes for currently unsupported
workloads.
Project Impact
-
This award will support two graduate students, who
will be doing research on indexability and amdb.
-
We expect amdb to be an effective teaching tool
for illustrating B-Tree behavior at the undergraduate level. At the graduate
level, we envision amdb being used both to illustrate the structure and
behavior of more advanced structures, to illustrate the challenges of indexing
research, and drive new research projects within our graduate course. Research
projects initiated in our graduate database course at Berkeley have produced
a number of major research results (including the current world-record
sorting machine, NOW-Sort), so we view this as a serious vehicle for advancing
the state of the art.
-
Since the PI is currently the only active teaching
member of the Berkeley faculty in the database area, this grant directly
facilitates the maintenance of this top-rated program.
-
We are actively exploring a collaboration with one
of the major database vendors, who have approached us with an interest
in technology transfer in the very short term.
The amdb and indexability project has built bridges
at Berkeley between the database group and the theory and digital libraries
groups. Active collaboration is underway as a result.
Project References
Hellerstein, J. M., J. F. Naughton and A. Pfeffer.
"Generalized Search Trees for Database Systems", Proc. 21st International
Conference on Very Large Data Bases, Zurich, September, 1995.
Kornacker, M., C. Mohan, and J.M. Hellerstein.
"Concurrency and Recovery in Generalized Search Trees." Proc. ACM-SIGMOD
International Conference on Management of Data, Tucson, May, 1997.
Hellerstein, J. M., C. H. Papadimitriou, and
E. Koutsoupias. "Towards an Analysis of Indexing Schemes." Proc. Sixteenth
ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems,
Tucson, May 1997.
Aoki, P. "Generalizing 'Search' in Generalized
Search Trees".Proc. 14th Int'l Conf. on Data Engineering,
Orlando, FL, Feb. 1998.
Koutsoupias, E. and D. S. Taylor. "Tight Bounds
for 2-Dimensional Indexing Schemes." Proc. Seventeenth ACM SIGACT-SIGMOD-SIGART
Symposium on Principles of Database Systems, Seattle, June, 1998.
Area Background
Index structures are one of the core components
of any information system -- they allow particular data items to be retrieved
directly from large datasets, without scanning through all the data. The
traditional index structure in database systems is the B+-tree, which works
well for range and equality lookups on alphanumeric data. However, much
of the data in use today is far more complex than traditional alphanumeric
data: witness the "information explosion" on the world-wide web, which
includes unusual data such as maps, images, sound, video, sequences (of
DNA, stock prices, etc.), documents and so on. For most of this data, the
traditional B+-tree approach is not sufficient for queries that are natural
to the data types (e.g. "find all videos with explosions", or "find all
drum solos".) As a result, there is usually support for only the most naive
queries on such data, such as queries based on the textual names of the
data objects.
We have developed a framework for solving this
problem, by designing and implementing a data structure called a Generalized
Search Tree (GiST). The GiST generalizes a large subset of the previously
invented indexes into one manageable piece of software, by reducing search
trees to their essential properties: a hierarchical decomposition of a
dataset, with labels at each level of the hierarchy that describe the data
below. Within this basic framework, it is possible to index any
set of data, and support
arbitrary queries over the data.
The GiST's flexibility provides great power,
but it is currently difficult to know how to harness that power for indexing
in new applications. In general, data must be carefully placed into the
index, and the index nodes carefully labeled, in order to guarantee efficient
processing of queries. Moreover, a set of queries may generate conflicting
demands on the index, resulting in situations where some queries must suffer
if others are to be satisfied quickly. While it is currently very easy
to use the GiST software, it is often quite hard to make it work efficiently,
and sometimes unclear if it can be made to work efficiently for
a particular set of data and queries.
Our work seeks to provide tools -- both actual
software tools as well as theoretical "tools" -- to help index developers
control and tune indexes so that they are used effectively. The work will
result in an index development environment, along with crisp guidelines
to describe the possibilities and limitations of the environment. Both
the software and the theory are critical: the software enables developers
to achieve what is possible, and the theory warns them what is not possible.
Both classes of results are missing in indexing research today, and are
key to developing quick access for the new workloads emerging in multimedia,
complex data analysis, and the world-wide web. In addition to developing
these software tools, we are keen to prove their utility by applying them
to the challenge of inventing new indexing techniques for emerging applications.
Area References
Comer, D. "The Ubiquitous B-Tree." Computing
Surveys, 11(2):121-137, June 1979.
Faloutsos, C. Searching Multimedia Databases
by Content, Kluwer Academic Publishers, Boston, 1996.
Finkel, R. A. and J. L. Bentley. Quad-Trees:
A Data Structure For Retrieval On Composite Keys. ACTA Informatica,
4(1):1-9, 1974.
Guttman, A. R-Trees: A Dynamic Index Structure
For Spatial Searching. In Proc. ACM-SIGMOD International Conference
on Management of Data, pages 47-57, Boston, June 1984.
Knuth, D.E. Sorting and Searching, Volume
3 of The Art of Computer Programming. Addison-Wesley Publishing
Co., 1973.
H. Samet. Applications of Spatial Data Structures:
Computer Graphics, Image Processing, and GIS. Addison-Wesley, Reading,
Mass, 1990.
Stonebraker, M. "Inclusion of New Types in Relational
Database Systems", Proceedings of the IEEE Fourth International Conference
on Data Engineering, "Washington, D.C.",1986, Feb, 262-269.
Potential Related Projects
-
We are working on the integration of our GiST library
with the SHORE storage manager. The PREDATOR DBMS (NSF award 9702149) is
built on top of SHORE. It would be very interesting to integrate GiST with
PREDATOR, since GiST's ability to index opaque types should mesh in interesting
ways with the optimization opportunities available in the Enhanced Abstract
Data Type model of PREDATOR.
-
The work of Christos Faloutsos (NSF award 9625428)
is among the most influential to date on the topic of indexing non-standard
data. Faloutsos' approach of mapping opaque data to multidimensional space
is somewhat different than our suggestion, though essentially orthogonal
since GiSTs can index multidimensional spaces generated by Faloutsos' techniques.
A closer analysis of linkages between these two projects would be quite
interesting. We have initiated such discussion in the past.
-
The work of Freeston and Smith on BANG files and
BV trees (NSF award 9619915) presents an apparent alternative "generic"
indexing strategy to GiST; it is not immediately apparent whether these
structures support indexes over opaque abstract data types, nor whether
they support an extensible set of queries. However our work on indexability
can certainly illuminate how well BANG files and BV trees work on their
intended workloads, testing whether they indeed approach the theoretical
optima of indexibility for interesting workloads.
-
Since indexing is at the heart of efficient data
retrieval, a wide variety of NSF funded activities could make use of our
framework, for example the many projects on content-based retrieval.