DISK-BASED GRIDDING FOR LARGE DATA SETS
Summary
As computer speed and memory increases the amount of data to be processed grows even faster, requiring more sophisticated algorithms to run on the new hardware. Terrain and subsurface formation modeling have experienced this data explosion, along with a demand for larger output models. In this paper I present a mathematically sophisticated, grid-based surface modeling algorithm that makes little demand on computer RAM. The algorithm is competitive with other gridding algorithms on small or medium size data sets. For large problems, problems with tens of millions of input data points and output grids with billions of nodes, this algorithm produces a solution while many others are just starting or have already crashed.
Introduction
The algorithm presented here, named Basin Gridding, is a disk-based, mathematical algorithm for gridding large data sets. The algorithm is an extension of an in-core, iterative, B-Spline algorithm. (Lee, Wolberg and Shin, 1997) My implementation follows the workflow of the original algorithm, and retains the accuracy of that algorithm. Like that algorithm, Basin Gridding implements a one-directional, multilevel solution, working with coarse grids during early iterations and increasingly finer resolution grids during later iterations. The difference is that Basin Gridding requires minimal RAM. Because of the conservative use of RAM, Basin Gridding increases: a) the size of data sets that can be modeled and; b) the size of the grids that can be produced. The price paid for conservative use of RAM is reliance on disk-based storage and the associated time penalty. CPU and wall clock times for this algorithm are shown in Table 1.
Control Points | Grid Nodes (millions) | CPU (minutes) | Wall Clock (minutes) |
89,000 | .3 | .2 | .3 |
250,000 | 4.6 | .4 | .5 |
250,000 | 18.5 | .5 | 1 |
5,700,000 | 8.3 | 5.7 | 11 |
5,700,000 | 133.0 | 17 | 78 |
30,000,000 | 380.0 | 50 | 205 |
145,000,000 | 4,876 | 170 | 600 |
Table 1. Algorithm CPU and wall clock times.
Previous Work
A working assumption for all 2D gridding is that the data is sampled from an elevation function F, mapping R2 -> R1 and that the derived model approximates F to some degree of accuracy over R2. This problem has been studied for decades. It is possibly impossible to describe all types of gridding algorithms. Reviews can be found in papers and books by Foley and Hagen (1994), Franke and Nielson (1991), and Jones, Hamilton and Johnson (1986). A recent, general purpose algorithm for solving large gridding problems using matrix formulations and iterative techniques has been presented by A tradeoff rarely addressed for gridding algorithms is dependence on RAM and dependence on hard disk memory. In related domains, such as computational geometry, these tradeoffs are considered. For example, sweep-line algorithms for finding intersections between members of a set of line segments are both memory efficient and a natural way to organize computations. Fortune's (1989) sweep-line based, 2D triangulation is an algorithm that could be used for surface modeling that would make relatively small demands on RAM. I do not know that is being used anywhere for that reason. For Basin Gridding, the criteria for algorithm selection was an ability to model large data sets and create large models in small amounts of RAM, without sacrificing mathematical properties such as smoothness, C2 continuity, and accuracy. I wanted, at least in theory, an algorithm that could solve any size gridding problem. Basin Gridding is the result. It is an extension of an in-core, hierarchical B-Spline gridding algorithm. By trading computational time for disk access, I created an algorithm with the good qualities of the in-core algorithm with reduced dependence on RAM. Hierarchical B-Spline Gridding In-memory Algorithm Creating a grid-based surface model with B-Splines requires the calculation of a lattice of control points over a grid at the same (delta x, delta y) resolution. A multi-grid approach is often followed, starting with a coarse 4 x 4 grid which is refined at the end of every iteration until the target grid row and column spacing are achieved. Lattice values are derived so that the evaluation of a local set of control lattice values to weight a set of 3rd order basis functions produces an estimate of surface values in each grid cell. At the final grid resolution this compact support avoids global influence from purely local data distributions or local surface roughness. In addition, B-Spline surfaces using 3rd order basis functions are C2 continuous, which promises high quality interpolation between regions with different data densities. This general hierarchical B-Spline workflow is shown in Figure 1. Each square block represents an intermediate grid at some row and column spacing in the multi-grid progression. Each refinement, represented by an oval, creates a new grid with twice as many rows and columns at ½ the previous row and columns spacing. The refinement process has the feature that - before updating - the lattice values of the new finer grid represent the same surface model as the grid that was refined. So refinement supplies a more localized way to change lattice weights to better fit the available data at small cost. As shown in Figure 1 updating for each data point involves changing local lattice weights stored at the nodes of a 4 row by 4 columns subset of the entire grid centered at the cell holding the data point. Each individual update makes the implicit surface fit that data point's z-value better. For sparse data relative to grid row and column spacing this algorithm interpolates control point z values exactly. However, even at the final, densest grid resolution there may be multiple control points within 2 grid increments of many grid nodes. So the final grid will seldom interpolate all the input data exactly. Instead it will pass a smooth, representative surface model through such dense collections of data. Figure 1. Algorithm general work flow showing embedded 4 x 4 Basin Gridding Algorithm The local, 4 x 4 updating of lattice values is the algorithm feature that allows it store most columns of the current grid model on disk during each iteration. For any one control point updating modifies nodes in only 4 adjacent grid columns. Grid columns to the left of these 4, and columns to their right, might as well be kept on disk. If the data is sorted on its x coordinate then lattice updates will flow across the grid from left to right. As each new control point is reached, the 4 grid columns of interest will be either the 4 currently in memory, or a set of 4 adjacent columns to their right. This local focus combined with sorting of the input data breaks the RAM constraint. The focus on a few columns is indicated by the narrow rectangles inserted in each grid block in Figure 2. These rectangles represent grid columns that are in memory at any one time. These processing rectangles move across the grid from left to right. Columns of the grid to the left of the window have been updated and can be written to disk. Columns to the right are still on disk, not yet updated, waiting to be read into memory. The moving window scheme discussed above is also applied to the refinement step. Refinement also proceeds from left to right. At any instant a few columns of the grid at the input resolution will be in memory while being used to generate higher density columns of the output grid. The output grid columns have twice as many nodes as the input columns. This is not a problem because only three input columns need to be read into memory to create two output columns. Again, the output columns are written to disk after creation, the refinement process moves to the right, and another input grid column is read into memory. Figure 2. Disk-based work flow. In fact, the description above simplifies algorithm data structures and mathematical operations. Because multiple data points may influence the updating of a single grid node, the original in-core algorithm performs summation of update information in two auxiliary grids, and only at the end of an iteration updates the grid lattice values. That means that Basin Gridding must keep 4 columns from 3 grids in memory at one time. The actual updating of the lattice values in a left-hand column of the lattice grid is computed from the left hand column of the auxiliary grids just as the examination of a new control point reveals that one or more new right side columns of the control lattice need to be read from disk, and left side lattice columns updated and written to disk. New columns for the auxiliary grids are created as needed and node values in those two columns start as zeros. Left hand columns of the auxiliary grids that are no longer needed are simply deleted from RAM. Basin Gridding incorporates other enhancements. The flow charts in Figures 3 and 4 show grid refinement as a distinct operation. That means a lattice grid read from and written to disk twice for every iteration level. I avoid this by interleaving grid refinement with imposing data on the grid. As grid columns leave the processing window - as the window moves right, the algorithm keeps two extra left side columns in memory, does refinement in RAM, and then write two new columns of the refined grid lattice. The result of all the enhancements discussed above is a gridding algorithm constrained only by free space on the computer's hard disks. Discussion/Conclusion An algorithm has been presented for creating rectangular grid-based surface models using disk storage to increase the size of the data sets that can be modeled and the size of the models that can be produced. The method applies multi-grid B-Spline gridding in a novel way such that internal storage space is required for only a few columns of the target grid and one x,y, z triple of data at any one instance. The general workflow and mathematical details for memory-based hierarchical B-Spline gridding are honored, so there is no loss of accuracy from this use of disk storage. References Cited Billings S. D., R. K. Beatson, and G.N. Newsam, 2002, Interpolation of geophysical data using continuous global surfaces, Geophysics, 67, 1810-1818. Franke, R., and G. M. Nielson, 1991, Scattered data interpolation of large sets of scattered data, International Journal of Numerical Methods in Engineering, 15, 1,691-1,704. Foley, T., and H. Hagen, 1994, Advances in scattered data interpolation: Surv. Math. Fortune, S., 1987, A sweepline algorithm for Voronoi diagrams, Algorithmica, 2, 153-174. Jones, T. A., D. E. Hamilton, and D. E. Johnson, 1986, Contouring geological surfaces with the computer, Van Nostrand Reinhold. Lee, S., G. Wolberg, G., S. Y. Shin, S. Y., 1997, Scattered data interpolation with multilevel b-splines, IEEE Transactions on Visualization and Computer Graphics, 3, 228244.