Geoinformatics 2007 Conference (17–18 May 2007)

Paper No. 13
Presentation Time: 11:30 AM

A SYSTEM FOR FAST SPATIAL SEARCHES ON THE EARTH OR SKY USING THE HIERARCHICAL TRIANGULAR MESH


FEKETE, Gyorgy, Physics and Astronomy, The Johns Hopkins University, 3400 N. Charles Street, Baltimore, MD 21218, gfekete@pha.jhu.edu

Spatial searches represent the most frequent search patterns in both Astrophysics and the Earth Sciences. In both of these areas scientists use a multitude of different coordinate systems, describing the distribution of both point-like and extended objects over the sphere. Many cutting-edge science projects today involve multiple sources, and thus require a detailed fusion and federation of these data sets. Very often drop-outs (objects that are not detected in one of the datasets) are particularly interesting, since they could represent objects that are quite "small." Such searches require a detailed knowledge of what area of the globe has been covered by a particular instrument (Earth- or satellite-bound). Once the common area (the intersection of the respective footprints) has been determined, we need to find the relevant objects in this area and perform a spatial cross-match, finding all detections that may correspond to the same physical object. All this requires a rather sophisticated set of tools. The requirements are rather stringent: satellite orbits consist of long stripes over the sky, and the required accuracy is determined by the resolution of the best detectors. We need to track objects over hundreds of degrees with a resolution of a few milliarcseconds. We need a framework that can be used in several different contexts, such as stand-alone user applications, web services, and web applications, and that can also be embedded into existing commercial database systems. We have been working on this problem for over 12 years. In the mid-1990s we wrote a simple C-based package called the Hierarchical Triangular Mesh (HTM), which is still being used by several observatories and NASA centers. We have since substantially rewritten the library in C++. This package is currently in use at multiple institutions and projects world-wide.

We address two separate mathematical issues. The first concerns the correct abstract mathematical representation of complex spherical regions, and the implementation of various set operations (union, intersection and difference), morphological functions (dilation and erosion) and scalar functions (area). Beyond the optimal algorithms for these operations there are also serious challenges related to numerical precision (are these two planes exactly parallel?). The second is a discrete pixelization of the sphere. For fast spatial searches we use data structures suitable for fast search algorithms. We use the sphere quadtree to build a hierarchy of triangle-shaped pixels (trixels) on the globe, hence the name: Hierarchical Triangular Mesh (HTM).

There are many ways to represent shapes on the surface of the unit sphere. We chose a system with no singularities at either pole, employing three-dimensional Cartesian unit vectors to describe locations on an idealized (unit-radius) sphere. The most basic building block is the halfspace, a directed plane that splits 3-D space into two halves. It is parameterized by a normal vector n and a scalar c (the distance of the plane from the origin). When intersected with the unit sphere, a halfspace describes a spherical cap. Inclusion of a point inside such a cap is decided by a simple dot-product computation.
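As an illustration, here is a minimal C++ sketch of the cap test; the names (Vector3, Halfspace, contains) are ours for exposition and are not the library's actual API.

// Illustrative sketch only; not the published HTM library interface.
struct Vector3 { double x, y, z; };   // Cartesian unit vector on the sphere

double dot(const Vector3& a, const Vector3& b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

struct Halfspace {
    Vector3 n;   // unit normal of the directed plane
    double  c;   // distance of the plane from the origin, -1 <= c <= 1
    // A point p on the unit sphere lies inside the cap iff n . p >= c.
    bool contains(const Vector3& p) const { return dot(n, p) >= c; }
};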

A so-called "convex" is the intersection of several halfspaces. A rectangle, for example, is described by a convex of four halfspaces, the sides of the rectangle. A region, the most general shape, is simply a union of convexes.
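Containment then reduces to a conjunction over halfspaces within a convex and a disjunction over the convexes of a region. A sketch, continuing the illustrative types above:

#include <vector>

// Sketch only: a convex contains a point iff every halfspace does;
// a region contains a point iff any one of its convexes does.
struct Convex {
    std::vector<Halfspace> halfspaces;
    bool contains(const Vector3& p) const {
        for (const Halfspace& h : halfspaces)
            if (!h.contains(p)) return false;
        return true;
    }
};

struct Region {
    std::vector<Convex> convexes;
    bool contains(const Vector3& p) const {
        for (const Convex& cv : convexes)
            if (cv.contains(p)) return true;
        return false;
    }
};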

The normal form of a region is therefore a union of intersections. Very often the shapes are more conveniently represented by other well-known primitives, such as rectangles, spherical polygons, circles, or the convex hull of a point set, and we provide a full complement of shape functions that convert descriptions in these familiar terms into the normal form.
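For example, a small circle of angular radius r centered at (lon, lat) maps to a region consisting of a single one-halfspace convex: the normal points at the circle's center and the offset is cos(r). A hedged sketch of such a shape function, using the illustrative types above (circleRegion is our name, not the library's):

#include <cmath>

// A circle of angular radius r (radians) centered at (lon, lat) in radians
// becomes one halfspace: n = center direction, c = cos(r).
Region circleRegion(double lon, double lat, double r) {
    Vector3 center{ std::cos(lat)*std::cos(lon),
                    std::cos(lat)*std::sin(lon),
                    std::sin(lat) };
    Halfspace cap{ center, std::cos(r) };
    Convex cv;   cv.halfspaces.push_back(cap);
    Region reg;  reg.convexes.push_back(cv);
    return reg;
}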

As always, the devil is in the details. Floating-point computations are subject to rounding errors, and a number of mathematical issues in our project arise from these inherent uncertainties as well. When should two points or halfspaces be considered identical? When is a point exactly on an arc? Computational geometry libraries working in Euclidean space sometimes avoid these issues by using exact arithmetic on fractional numbers represented by integer numerators and denominators; however, this slower workaround is not an option for spherical geometry because of the normalization constraint, which requires taking a square root. We use IEEE-754 double-precision numbers and derive the limitations from the actual number representation. At the core of many formulas is the cross product, used to find perpendicular directions. Co-linear vectors have vanishing cross products, and numerically this can be tested in a robust way by comparing the largest coordinate to the double-precision tolerance. If it is too small, its square root cannot be taken to normalize the vector, and the indefinite perpendicular direction signals co-linearity. A similar problem is solving for the roots (intersection points) of two circles on the sphere: if the computation is inaccurate, the roots will not lie on the circles numerically. The sweet spot for double-precision numbers is a tolerance level of about 10^-8 radians, which corresponds to a few thousandths of an arcsecond. This is about a foot in size on the surface of the Earth.
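A sketch of such a robust co-linearity test, with an assumed tolerance constant of 10^-8 (the constant, names, and structure are illustrative, not the library's):

#include <algorithm>
#include <cmath>

const double gEpsilon = 1.0e-8;   // assumed tolerance, ~ a few milliarcseconds

Vector3 cross(const Vector3& a, const Vector3& b) {
    return Vector3{ a.y*b.z - a.z*b.y,
                    a.z*b.x - a.x*b.z,
                    a.x*b.y - a.y*b.x };
}

// Two unit vectors are treated as co-linear when their cross product is too
// small to be normalized reliably at the chosen tolerance.
bool colinear(const Vector3& a, const Vector3& b) {
    Vector3 c = cross(a, b);
    double largest = std::max({ std::fabs(c.x), std::fabs(c.y), std::fabs(c.z) });
    return largest < gEpsilon;
}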

There are two basic kinds of spatial constraints in a query: (1) "Is this object within some distance of a target?" and (2) "Is this object inside some region?"

Both kinds involve costly geometrical computation, so we devised a mechanism that eliminates objects from consideration quickly. Crude boxing techniques that select candidates by bounding Lat/Lon values work well only if the region of interest is shaped like a rectangle aligned with the Lat/Lon grid; they are less effective for other shapes, such as those that contain a pole, or narrow stripes at a high inclination to the primary great circles. The main idea is to implement a coarse filter whose job is to reject all objects that are certain to fail the spatial constraint, and to apply the fine filter only to the few false positives that remain. The goal is to make the coarse filter as selective as possible but also as fast as possible.

In a database that has a user-defined function called AngularDistance and a table of Objects with columns named ObjID, Lat and Lon, a simple query that selects objects within half a degree of the point Lon = 12, Lat = -20 is as follows:

select ObjID from Objects o where AngularDistance(12, -20, o.Lon, o.Lat) < 0.5

The costly AngularDistance function would be evaluated for every object in the table. If instead we implement the coarse filter as a function that produces a table which can be computed efficiently on the fly, the query is augmented with a join as follows:

select ObjID from Objects o join CircleCover(12, -20, 0.5) c on o.HtmID between c.lo and c.hi where AngularDistance(12, -20, o.Lon, o.Lat) < 0.5

The second example presumes the existence of an HtmID column and a CircleCover table-valued function that returns rows of low, high values constraining the possible HtmID values. Only those objects that pass the first constraint are given to AngularDistance for the precise calculation. The function CircleCover here produces a table that represents a circular region of interest with the given center and radius. However, our methods are capable of expressing arbitrary shapes with the full richness of the Region-Convex-Halfspace paradigm. The essence of the idea is that the inside of every trixel is turned into a range query over the HtmID values in the database. We can exploit the relational database engine's ability to use the HtmID as an index for very rapid operation.
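To sketch why a trixel maps to a contiguous range of HtmID values: each subdivision step appends two bits (one of four children) to the parent's ID, so all leaf-level descendants of a trixel share its ID as a prefix. An illustrative computation (trixelRange and leafDepth are our names and assumptions, not the library's API):

#include <cstdint>

// Sketch: the leaf-level HtmID range covered by a trixel at a coarser level.
// Assumes leafDepth >= trixelLevel; each extra level appends two bits.
void trixelRange(std::uint64_t trixelId, int trixelLevel, int leafDepth,
                 std::uint64_t& lo, std::uint64_t& hi) {
    int shift = 2 * (leafDepth - trixelLevel);   // factor of 4 per level
    lo = trixelId << shift;
    hi = ((trixelId + 1) << shift) - 1;
}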

With the advent of modern silicon-based detectors, the way observational science is done is changing rapidly. To take full advantage of the avalanche of data, we need scalable information systems that provide reliable access via interoperable interfaces and efficient search capabilities. For a low-cost scientific data analysis infrastructure, all of the above components are required at the same time. Our approach is to wed state-of-the-art database and internet technologies to novel algorithmic developments to enable fast and seamless access to collaborative science archives. The framework serves both the Earth and Space Science communities.