Geometric Techniques for Multidimensional Databases

Scott T. Leutenegger
Mario A. Lopez

Mathematics and Computer Science Department
University of Denver

Contact Information

Scott Leutenegger Mario Lopez
Math and CS Dept Math and CS Dept
University of Denver University of Denver
2360 S. Gaylord Street 2360 S. Gaylord Street
Denver, CO 80208-0189 Denver, CO 80208-0189
Phone: (303) 871-2821 Phone: (303) 871-3287
Fax : (303) 871-3010 Fax : (303) 871-3010
Email: leut@cs.du.edu Email: mlopez@cs.du.edu

Keywords

Multidimensional Data, Indexing, Geometric Techniques, R-trees, Spatial Data

Project Award Information

Project Summary

Storage and retrieval of multidimensional data is necessary for many scientific, geographic, engineering, and business applications. Recent advances in supercomputing and data acquisition necessitate being able to provide efficient retrieval for ever increasing data set sizes. In addition, recent advances in data warehousing and distributed data management provide a need for indexing of distributed high dimensional data. The database community has developed numerous disk-based multi-dimensional indexing schemes, but has ignored a body of relevant theoretical work that may significantly improve loading and retrieval performance. The field of computational geometry has focused on memory resident data. A number of computational geometry techniques are practical and yield efficient solutions to problems such as range searching and proximity queries. This project addresses both point data, i.e. objects denoted by a single point, and region data, i.e. arbitrary k-dimensional shapes. For large and distributed data sets both parallelism and data migration will be needed to provide acceptable load and query response times. The project will explore and exploit a number of computational geometry techniques in the development of a new generation of efficient techniques for efficient loading, retrieval, and update of large disk-resident multidimensional data for both single machine, parallel machine, and distributed architectures. The resulting indexing techniques will enable higher level database architectures to provide efficient access to larger datasets, thus enabling data base technology to keep up with demands placed upon it from science and business on a whole.

Goals, Objectives, and Targeted Activities

  1. Design and development of new packing algorithms based on a top-down variants of STR. Goal: Improve point query performance by a factor of 2-5 for highly skewed data sets.
  2. Development of practical (low order polynomial) optimal node splitting algorithm.
  3. Integration of optimal node splitting with buffer model for post-processing of R-trees. Goal: Improve point query performance by 20 - 50% at the expense of 10% or less additional disk space.
  4. Investigate feasibility of more sophisticated computational geometry heuristics (min/max spanning trees, Delaunay triangulation, Voronoi diagrams) for bulk loading of R-trees.
  5. Implementation of sequential gridfile code, debugging, testing. This code will enable comparisons with R-tree code.
  6. Begin design and implementation of parallel R-tree code. Debug and test. Begin study of declustering methods.
  7. Development of simulation package of parallel R-trees. Simulation will allow us to test larger hypothetical systems in addition to our small cluster of workstations.
  8. Begin design and study of dynamic migration of data indexed by parallel R-trees. Goal: Have simulation running and initial results indicating utility of approach.

Indication of Success

So far the project has made significant improvements to packing of static data sets. Given the increasing number of large static multidimensional data sets from fields such as earth science, weather modeling, and military applications, this work will be a significant aid to science on a whole. Furthermore, the PIs have developed a new methodology for comparison of Rtree methods. The buffer modeling methodology developed will enable researchers to quickly assess the quality of an R-tree without length search experiments. The work focuses attention to minimizing the number of disk I/O for a given buffer size rather than the sometimes misleading metric of number of nodes accessed. This methodology should have a significant impact on future academic research in the area.

Project Impact

Include a brief discussion on the impact of the project on

Project References

Area Background

Storage and retrieval of multidimensional data is necessary for many scientific, geographic, engineering, and business applications. Traditional secondary storage techniques such as B-trees and hash tables are not suitable for dealing with multidimensional data: by concentrating on one-dimension at a time, large quantities of extraneous data may be retrieved. A number of techniques designed specifically for multidimensional data have been proposed. The most successful techniques today are R-trees (Guttman 84) and Gridfiles (Nievergelt et al 84). For point data which of these two methods is best is not clear. For region data, as pointed out by Faloutsos, R-tree variants are more attractive because they: 1) do not partition data objects into pieces, 2) operate on the native address space instead of transforming the input objects to to higher dimensional objects, 3) appear more robust for higher dimensions (Faloutsos et al 94). Applications where the underlying data does not change are said to be static, and arise naturally in many areas, such as scientific databases, GIS, and VLSI CAD. In contrast, data sets that are subject to frequent updates are said to be dynamic. As an example of static data consider the field of computational fluid dynamics (CFD). Techniques for the numerical solution of CFD models produce large data sets consisting of triangles (2D) or tetrahedra (3D). Associated with each vertex are solution values such as pressure, lift, drag, etc. Scientists often want to sample or visualize subsets of the data. Thus, data contained in some sub-region must be retrieved. This type of query is called a region query. The special case of a region of zero area is called a point query. Another application area requiring large amounts of multidimensional data, and one for which the second PI has produced a number of results, is VLSI CAD. A chip is typically described by a large collection of polygons organized in layers. The designer often wishes to retrieve all circuit components in a given region of the chip (a region query). On the other hand, a via is essentially a wire (of zero area) connecting different layers. Determining the components connected by such a via requires the resolution of a point query. Similar problems arise in other areas as well. Reporting all employees between 30 and 35 years of age and earning more than $60,000 is another example of region query. Recent advances in data warehousing have made this type of query important to handle. Support in this area is especially challenging since the number of dimensions is on the order of tens to hundreds.

Area References