GSA 2020 Connects Online

Paper No. 158-3
Presentation Time: 5:55 PM

DIGITAL SOIL PARENT MATERIAL MAPPING: MODELED VERSUS MAPPED ACCURACY AND VARIABLE IMPORTANCE


BATEMAN MCDONALD, Jacob M., Lewis F. Rogers Institute for Environmental and Spatial Analysis, University of North Georgia, 3820 Mundy Mill Rd, Oakwood, GA 30566

This research used a single county's soil survey and the random forest classification algorithm to predict the distribution of soil parent material in the Southern Blue Ridge Mountains of western North Carolina. Three training set selection techniques (area-dependent, equal sample, and a hybrid approach) were used to randomly select points from the different soil parent material types (e.g., alluvium, colluvium, residuum). Each training point was attributed with a large number of land surface characteristics, including standard first- and second-order derivatives of a DEM (e.g., slope and curvature) and variables that describe landscape position relative to ridges, hillslopes, and bottomlands. To determine the variables that best describe each parent material type, an iterative variable reduction process was used to create a series of random forest models from a progressively reduced (decorrelated) predictor set. While the random forest modeling results suggest high producer and user accuracies, the mapped predictions were highly accurate only for the soil parent material classes that were best represented by their training set. The poor prediction of either minority or majority classes was due to overlap in variable space between some classes (e.g., residuum and old alluvium occupy similar landscape positions). The hybrid training set provided the best model and mapped accuracies, but the overlap in variable space continued to be a source of error in the mapped predictions. The results of this analysis call into question the use of initial variable importance measures to determine the absolute importance of variables in random forest models. Instead, this research suggests an iterative method in which correlated variables are successively discarded from the predictor set so that ‘true’ variable importance can be determined.
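
The iterative reduction of correlated predictors can be illustrated with a minimal sketch. This is not the author's implementation: the abstract does not give the correlation cutoff, the rule for choosing which variable of a correlated pair to discard, or the stopping criterion. The function names (`pearson`, `reduce_predictors`), the 0.8 threshold, the rule of dropping the less important variable, and the toy terrain values are all assumptions made for illustration. The `importance` callback stands in for refitting a random forest on the current predictor set and reading its variable importance.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def reduce_predictors(columns, importance, threshold=0.8):
    """Discard correlated predictors one at a time.

    columns    : dict mapping predictor name -> list of values at training points
    importance : callable(list of names) -> dict of name -> importance score;
                 a hypothetical stand-in for refitting a random forest
    threshold  : assumed |r| cutoff above which a pair counts as correlated
    """
    kept = list(columns)
    history = [list(kept)]  # predictor set used at each iteration
    while True:
        # Find the most strongly correlated pair still in the predictor set.
        pairs = [(abs(pearson(columns[a], columns[b])), a, b)
                 for a, b in combinations(kept, 2)]
        if not pairs:
            break
        r, a, b = max(pairs)
        if r <= threshold:
            break
        # Re-score the current set, then drop the weaker member of the pair.
        scores = importance(kept)
        kept.remove(a if scores[a] < scores[b] else b)
        history.append(list(kept))
    return kept, history


# Toy values (invented): "relief" closely tracks "slope"; "curvature" does not.
columns = {
    "slope": [1, 2, 3, 4, 5, 6],
    "relief": [2, 4, 6, 8, 10, 13],
    "curvature": [3, -1, 2, -2, 1, -3],
}
fixed_scores = {"slope": 0.5, "relief": 0.3, "curvature": 0.2}
kept, history = reduce_predictors(
    columns, lambda names: {n: fixed_scores[n] for n in names}
)
print(kept)
```

In practice the importance scores would change each time the model is refit on the smaller predictor set, which is why the callback is invoked inside the loop rather than once up front.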