Geometric Techniques for Multidimensional Databases
Scott T. Leutenegger
Mario A. Lopez
Mathematics and Computer Science Department
University of Denver
Contact Information
| Scott Leutenegger | Mario Lopez |
| Math and CS Dept | Math and CS Dept |
| University of Denver | University of Denver |
| 2360 S. Gaylord Street | 2360 S. Gaylord Street |
| Denver, CO 80208-0189 | Denver, CO 80208-0189 |
| Phone: (303) 871-2821 | Phone: (303) 871-3287 |
| Fax : (303) 871-3010 | Fax : (303) 871-3010 |
| Email: leut@cs.du.edu | Email: mlopez@cs.du.edu |
Keywords
Multidimensional Data, Indexing, Geometric Techniques, R-trees, Spatial Data
Project Award Information
- Award Number: IRI-9610270
- Duration: 3 years total, 1st year
- Title: Geometric Techniques for Multidimensional Databases
Project Summary
Storage and retrieval of multidimensional data is
necessary for many scientific, geographic,
engineering, and business applications.
Recent advances in supercomputing and data acquisition necessitate
being able to provide efficient retrieval for ever increasing
data set sizes.
In addition, recent advances in data warehousing and distributed
data management provide a need for indexing of distributed high
dimensional data.
The database community has developed numerous disk-based multi-dimensional
indexing schemes, but has ignored a body of relevant theoretical
work that may significantly improve loading and retrieval performance.
The field of computational geometry has focused on memory resident data.
A number of computational geometry techniques are practical and yield efficient
solutions to problems such as range searching and proximity queries.
This project addresses both point data, i.e. objects denoted by a single point,
and region data, i.e. arbitrary k-dimensional shapes.
For large and distributed data sets both parallelism and data migration
will be needed to provide acceptable load and query response times.
The project will explore and exploit a number of computational
geometry techniques in the development of a new generation of efficient
techniques for efficient loading, retrieval, and update of
large disk-resident multidimensional data for both single machine,
parallel machine, and distributed architectures.
The resulting indexing techniques will enable higher level
database architectures to provide efficient access
to larger datasets, thus enabling data base technology to keep
up with demands placed upon it from science and business on a whole.
Goals, Objectives, and Targeted Activities
- Design and development of new packing algorithms based
on a top-down variants of STR.
Goal: Improve point query performance by a factor of 2-5 for highly
skewed data sets.
- Development of practical (low order polynomial) optimal node
splitting algorithm.
- Integration of optimal node splitting with buffer model
for post-processing of R-trees.
Goal: Improve point query performance by 20 - 50%
at the expense of 10% or less additional disk space.
- Investigate feasibility of more sophisticated computational
geometry heuristics (min/max spanning trees, Delaunay triangulation,
Voronoi diagrams) for bulk loading of R-trees.
- Implementation of sequential gridfile code, debugging, testing.
This code will enable comparisons with R-tree code.
- Begin design and implementation of parallel R-tree code.
Debug and test. Begin study of declustering methods.
- Development of simulation package of parallel R-trees.
Simulation will allow us to test larger hypothetical systems in
addition to our small cluster of workstations.
- Begin design and study of dynamic migration of data indexed
by parallel R-trees.
Goal: Have simulation running
and initial results indicating utility of approach.
Indication of Success
So far the project has made significant improvements to
packing of static data sets. Given the increasing
number of large static multidimensional data sets from
fields such as earth science, weather modeling, and military
applications, this work will be a significant aid to science on a whole.
Furthermore, the PIs have developed a new methodology
for comparison of Rtree methods. The buffer modeling methodology
developed will enable researchers to quickly assess the quality of
an R-tree without length search experiments. The work
focuses attention to minimizing the number of disk I/O for a given
buffer size rather than the sometimes misleading metric of number of nodes
accessed. This methodology should have a significant impact on
future academic research in the area.
Project Impact
Include a brief discussion on the impact of the project on
- Human Resources: currently funding two grad students, both women
- Education and curriculum development: The grant is providing
interesting material for independent research projects and
class projects in databases and computational geometry.
- Department Infrastructure: Significant impact. We are a small
department (6 CS faculty) with a small (new) PhD program. We have
produced 3 CS PhDs so far, this grant has greatly strengthened
our PhD program.
- Industry: Results being transfered into industry via a Colorado
Advanced Software Institute grant. Collaborating company: MetaComp Inc.
Project References
-
The Effect of Buffering on the Performance of R-Trees,
S.T. Leutenegger, M.A. Lopez,
Proc. of the 1998 International Conference on Data Engineering (ICDE 1998)
-
STR: A Simple and Efficient Algorithm for R-Tree Packing,
S.T. Leutenegger, M.A. Lopez, J.M. Edgington,
Proc. of the 1997 International Conference on Data Engineering (ICDE 1997)
-
On Optimal Node Splitting for R-trees,
Y.J. Garcia, M.A. Lopez, S.T. Leutenegger,
University of Denver Computer Science Technical Report #97-03,
http://www.cs.du.edu/users/leut/pubs.html, submitted for publication
Area Background
Storage and retrieval of multidimensional data is
necessary for many scientific, geographic,
engineering, and business applications.
Traditional secondary storage techniques
such as B-trees and hash tables are not suitable for dealing with
multidimensional data: by concentrating on one-dimension at a time,
large quantities of extraneous data may be retrieved.
A number of techniques designed specifically for multidimensional data
have been proposed. The most successful techniques today
are R-trees (Guttman 84) and Gridfiles (Nievergelt et al 84).
For point data which of these two methods is best is not clear.
For region data, as pointed out by Faloutsos, R-tree variants are more
attractive because they: 1) do not partition data objects
into pieces, 2) operate on the native address space instead of transforming
the input objects to to higher dimensional objects,
3) appear more robust for higher dimensions (Faloutsos et al 94).
Applications where the underlying data does not change
are said to be static, and arise naturally in
many areas, such as scientific databases, GIS, and VLSI CAD.
In contrast, data sets that are subject to frequent updates
are said to be dynamic.
As an example of static data consider the field of computational fluid
dynamics (CFD). Techniques for the numerical solution of CFD models
produce large data sets consisting
of triangles (2D) or tetrahedra (3D). Associated with each
vertex are solution values such as pressure, lift, drag, etc.
Scientists often want to sample or visualize subsets
of the data. Thus, data contained in some sub-region must be retrieved.
This type of query is called a region query. The special
case of a region of zero area is called a point query.
Another application area requiring large amounts of multidimensional
data, and one for which the second PI has produced a number of results,
is VLSI CAD. A chip is typically described
by a large collection of polygons organized in layers. The designer
often wishes to retrieve all circuit components in a given region
of the chip (a region query). On the other hand, a via is
essentially a wire (of zero area) connecting different layers.
Determining the components connected by such a via requires the
resolution of a point query.
Similar problems arise in other areas as well.
Reporting all employees between 30 and 35 years of age and
earning more than $60,000 is another example of region query.
Recent advances in data warehousing have made this type of query important
to handle. Support in this area is especially challenging since
the number of dimensions is on the order of tens to hundreds.
Area References
-
R-trees: a Dynamic Index Structure For Spatial Searching,
A. Guttman,
Proc. ACM SIGMOD, p. 47-57, 1984
-
The Grid File: An Adaptable, Symmetric Multikey File Structure,
J. Nievergelt, H. Hinterberger, K.C. Sevcik,
ACM Transactions on Database Systems, vol.\ 9, no.\ 1, March 1984, p.\ 38-71.
-
Parallel R-trees,
I. Kamel, C. Faloutsos,
Proc. ACM SIGMOD, p. 195-204, 1992.
-
Hilbert R-tree: An improved R-tree Using Fractals,
I. Kamel, C. Faloutsos,
Proc. International Conference on Very Large
Databases 1995 (VLDB-95).
-
Declustering Spatial Databases on a Multi-computer Architecture,
N. Koudas, C. Faloutsos, I. Kamel,
In Proc of Extended Database Technologies
conference (EDBT), 1996.
-
STR: A Simple and Efficient Algorithm for R-Tree Packing,
S.T. Leutenegger, M.A. Lopez, J.M. Edgington,
Proc. of the 1997 International Conference on Data Engineering (ICDE 1997)