GSA Connects 2021 in Portland, Oregon

Paper No. 170-9
Presentation Time: 3:50 PM

APPLYING DATA-DRIVEN MACHINE LEARNING TO GEOTHERMAL FAVORABILITY, WESTERN UNITED STATES


MORDENSKY, Stanley (1), LIPOR, John (2), DEANGELO, Jacob (3), BURNS, Erick R. (1) and LINDSEY, Cary (1); (1) U.S. Geological Survey, 2130 SW 5th Ave., Portland, OR 97201, (2) Electrical & Computer Engineering, Portland State University, Portland, OR 97201, (3) U.S. Geological Survey, MS989, 345 Middlefield Road, Menlo Park, CA 94025

We demonstrate that modern machine-learning methods and data-science strategies can reproduce the essential findings of past geothermal energy assessments, and potentially improve on them, while relying less on expert input. Two foundational machine-learning algorithms (logistic regression and XGBoost), implemented using unbiased data-analysis strategies, agree with previous studies that relied much more heavily on expert knowledge. The linear method, logistic regression, conforms well with the binned logistic regression and weights-of-evidence approaches used for the 2008 USGS conventional hydrothermal resource-favorability maps. The non-linear method, XGBoost, provides an alternative interpretation that broadly agrees and may provide increased granularity in favorability maps.

To provide a direct comparison, we create new favorability maps from the same input data used in the 2008 conventional hydrothermal resource-favorability study. The 2008 study relied on methods that required the input data to be binned, which in turn required exploring and selecting bin values and therefore human decisions (e.g., number of bins, bin limits). Our study presents probability maps for the western US created using modern, data-driven strategies (i.e., no expert choices in the algorithmic application) in an effort to remove human bias and minimize the considerable expert effort needed to create resource maps. During the analysis, two overarching challenges were identified: 1) the training data contain only positive examples (i.e., known hydrothermal systems) and unlabeled examples (comprising negative examples [i.e., no hydrothermal system present] and unidentified positive examples), and 2) the classes are extremely imbalanced (an estimated ratio of approximately 1:2600 positive to unlabeled examples). To address the first challenge, unsupervised clustering of features was used to identify groups of likely true negative examples; these likely true negatives and the known positives were then sampled proportionally for use with the supervised methods. To address the second challenge, a customized oversampling training strategy was selected to create a reliable classifier.
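
The abstract does not describe the customized oversampling strategy in detail, so the following is only a minimal sketch of plain random oversampling, swapped in as a generic stand-in for the authors' method. It assumes the likely true negatives have already been selected (for example, by the clustering step described above). The function name `oversample_positives`, the toy feature tuples, and the seeds are all hypothetical and chosen for illustration.

```python
import random


def oversample_positives(positives, negatives, seed=0):
    """Resample positives with replacement until both classes are the same size.

    Hypothetical helper for illustration; this is plain random oversampling,
    not the customized strategy referred to in the abstract.
    """
    if not positives:
        raise ValueError("at least one positive example is required")
    rng = random.Random(seed)
    # Number of extra positive draws needed to match the negative count.
    n_extra = max(0, len(negatives) - len(positives))
    resampled = list(positives) + [rng.choice(positives) for _ in range(n_extra)]
    features = resampled + list(negatives)
    labels = [1] * len(resampled) + [0] * len(negatives)
    return features, labels


# Toy data with a 1:2600 positive-to-negative ratio (made-up feature values).
positives = [(0.9, 0.8), (0.7, 0.95)]
toy_rng = random.Random(1)
negatives = [(toy_rng.random() * 0.5, toy_rng.random() * 0.5) for _ in range(5200)]

X, y = oversample_positives(positives, negatives)
print(sum(y), len(y) - sum(y))  # class counts after balancing
```

In this sketch the balanced set `X`, `y` would be passed to a supervised classifier such as logistic regression or XGBoost. Because the positives are duplicated rather than synthesized, any evaluation split should be made before oversampling so that copies of the same example do not appear in both training and test data.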