GSA Connects 2022 meeting in Denver, Colorado

Paper No. 172-1
Presentation Time: 9:00 AM-1:00 PM

IMPERFECT DATA IN, IMPERFECT MODEL OUT: USING COMPETING MODELS TO DECIDE IF WE HAVE THE RIGHT DATA


MORDENSKY, Stanley, U.S. Geological Survey, 2130 SW 5th Ave, Portland, OR 97201, LIPOR, John, Electrical & Computer Engineering, Portland State University, Portland, OR 97201, DEANGELO, Jacob, U.S. Geological Survey, Geology, Minerals, Energy, and Geophysics Science Center, Moffett Field, CA 94025, BURNS, Erick, U.S. Geological Survey, Geology, Minerals, Energy, and Geophysics Science Center, Portland, OR 97201 and LINDSEY, Cary R., Geology, Minerals, Energy, and Geophysics Science Center, U.S. Geological Survey, 2130 SW 5th Ave., Portland, OR 97201

Previous geothermal resource assessments of the western U.S. used data-driven methods (i.e., weight-of-evidence and logistic regression) to estimate resource favorability, but these analyses relied on some practices that are not ideal for data science, namely expert decisions. Although expert decisions can add confidence to aspects of the modeling process by ensuring that seemingly reasonable models are employed, they also introduce human bias, a potential source of error that may affect model performance.

To facilitate comparison of methods, we use the same data as the 2008 geothermal resource assessment (e.g., heat flow, horizontal stress) to train models with modern machine learning algorithms (i.e., logistic regression, eXtreme Gradient Boosting, support vector machines, and multilayer perceptron neural networks), which minimize dependence upon expert decisions. Some of these algorithms are simple (e.g., logistic regression), whereas others are highly sophisticated (e.g., the neural network). Despite the contrast in complexity, the results from the simplest and the most complex algorithms are similar. In fact, the results of the most complex machine learning model (i.e., the neural network) appear to be more similar to those of the simplest machine learning algorithm (i.e., logistic regression) than to either of the models resulting from the expert decisions in the 2008 assessment, indicating that human bias moved the earlier estimates away from a machine-driven optimum.
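One way to make "more similar" concrete is to compare the favorability scores that each model assigns to the same map cells. The following is a minimal sketch only: the choice of Pearson correlation as the similarity measure, the function names, and the six scores per model are all assumptions made for illustration and are not taken from the assessment or from the authors' workflow.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def pairwise_similarity(predictions):
    """Correlation for every pair of models, keyed by (name_a, name_b)."""
    return {
        (a, b): pearson(predictions[a], predictions[b])
        for a, b in combinations(predictions, 2)
    }


# Hypothetical favorability scores for the same six map cells.
# These numbers are invented for illustration; they are not real results.
predictions = {
    "expert_2008": [0.9, 0.1, 0.5, 0.5, 0.2, 0.8],
    "logistic_regression": [0.2, 0.3, 0.5, 0.6, 0.8, 0.9],
    "neural_network": [0.25, 0.3, 0.45, 0.65, 0.75, 0.95],
}

similarity = pairwise_similarity(predictions)
for pair, r in similarity.items():
    print(pair, round(r, 3))
```

With real model outputs in place of the invented scores, a high correlation between the simplest and most complex models, paired with lower correlations to the expert-driven models, would correspond to the pattern described above. A rank-based measure could be substituted if only the ordering of cells matters.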

The similarity of the models produced across the spectrum of machine learning algorithms is a direct result of the simplicity (and perhaps inadequacy) of the feature data. The feature data used in the 2008 geothermal resource assessment were imperfect approximations of geological conditions (e.g., heat flow and stress were interpolated from measurements that are only sparsely available for some regions of the U.S.). These results demonstrate that the previous data contain no complex patterns that more sophisticated machine learning can mine, indicating a fundamental limitation of the data previously used to identify geothermal resource favorability. That is, the most important part of the machine learning workflow, the data, needs to be sufficient to make reliable predictions.