IMPERFECT DATA IN, IMPERFECT MODEL OUT: USING COMPETING MODELS TO DECIDE IF WE HAVE THE RIGHT DATA
To facilitate comparison of methods, we use the same data (e.g., heat flow, horizontal stress) as the 2008 geothermal resource assessment to train models with modern machine learning algorithms (i.e., logistic regression, eXtreme Gradient Boosting, support vector machines, and multilayer perceptron neural networks), which minimize dependence upon expert decisions. Some of these algorithms are simple (e.g., logistic regression), whereas others are highly sophisticated (e.g., the neural network). Despite this contrast in complexity, the results from the simplest and the most complex algorithms are similar. In fact, the results from the most complex machine learning model (i.e., the neural network) appear to be more similar to those from the simplest (i.e., logistic regression) than to either of the models resulting from expert decisions in the 2008 assessment, indicating that human bias shifted those estimates away from a machine-driven optimum.
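One way to make "more similar" concrete is to rank-correlate the favorability values that each model assigns to the same map cells. The sketch below is only an illustration of that idea: the choice of Spearman rank correlation, the function names, and the model labels are assumptions, and the six toy values per model are invented for demonstration and are not results from this study or from the 2008 assessment.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def pairwise_similarity(maps):
    """Rank-correlate every pair of favorability maps (dict: name -> values)."""
    return {(a, b): spearman(maps[a], maps[b]) for a, b in combinations(maps, 2)}


# Hypothetical favorability values for six map cells (toy data, NOT study results).
toy_maps = {
    "logistic_regression": [0.10, 0.20, 0.35, 0.60, 0.80, 0.90],
    "neural_network": [0.12, 0.18, 0.40, 0.55, 0.85, 0.88],
    "expert_2008": [0.30, 0.10, 0.70, 0.20, 0.90, 0.50],
}

if __name__ == "__main__":
    for (a, b), rho in pairwise_similarity(toy_maps).items():
        print(f"{a} vs {b}: rho = {rho:.3f}")
```

A rank-based measure is used here because favorability maps from different algorithms need not share a common scale; only the ordering of cells is compared.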
The similarity of the models produced across this spectrum of machine learning algorithms is a direct result of the simplicity (and perhaps inadequacy) of the feature data. The feature data used in the 2008 geothermal resource assessment were imperfect approximations of geological conditions (e.g., heat flow and stress were interpolated from measurements that are only sparsely available in some regions of the U.S.). These results demonstrate that the earlier data contain no complex patterns that more sophisticated machine learning could mine, indicating a fundamental limitation of the data previously used to identify geothermal resource favorability. That is, the most important part of the machine learning workflow, the data, must be sufficient to support reliable predictions.