Paper No. 187-5
Presentation Time: 2:35 PM
ADAPTING SUPERVISED MACHINE LEARNING APPROACHES FOR HYDROTHERMAL RESOURCE ASSESSMENTS
Supervised machine learning imposes data requirements (intrinsic to the mathematical strategies employed) that natural resource data rarely satisfy (e.g., because of sample bias, small numbers of samples, and highly correlated input data), and this mismatch poses challenges when supervised machine learning is applied to natural resource assessments. Herein, we demonstrate how to address four problems: (1) positive-unlabeled classification (i.e., knowing only where some hydrothermal systems [positives] are located, with no locations authoritatively classified as lacking hydrothermal convection), by recognizing that most locations are negatives and that a statistically small number of true but unlabeled positives will be mislabeled as negatives during the analyses; (2) class imbalance (i.e., hydrothermal systems [positives] are inherently sparse compared with the total area that contains no hydrothermal systems [negatives]), by training and testing using the expected natural ratio of the classes; (3) few labeled examples (i.e., only dozens or hundreds of known hydrothermal systems exist for a region), by selecting appropriate supervised machine learning algorithms and using informative features; and (4) the influence of correlated but not causative features (e.g., regional trends in elevation when positives are common in only a few regions), by ensuring that the negative sites appropriately sample the range of feature values. We find that, once these intrinsic characteristics of machine learning strategies are properly addressed, data-driven approaches can accommodate the unique qualities of natural resource data and improve hydrothermal resource assessments by reducing potential bias from expert decisions.
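
The negative-sampling ideas above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the site schema, the `elevation_m` feature, the ratio of 20 negatives per positive, and the equal-width binning are all assumptions made for illustration. The sketch treats unlabeled sites as presumed negatives, draws them at an assumed natural class ratio, and spreads the draw across bins of one feature so that the negatives cover that feature's range.

```python
import random
from collections import defaultdict


def sample_presumed_negatives(unlabeled, n_positives, neg_per_pos,
                              feature, n_bins, seed=0):
    """Draw unlabeled sites to serve as presumed negatives.

    unlabeled   : list of dicts, one per candidate site (hypothetical schema)
    n_positives : number of known hydrothermal systems
    neg_per_pos : assumed natural ratio of negatives to positives
    feature     : key of the feature whose range should be covered
    n_bins      : number of equal-width bins spanning that feature's range
    """
    rng = random.Random(seed)
    # Match the assumed natural class ratio, capped by the sites available.
    n_target = min(n_positives * neg_per_pos, len(unlabeled))
    if n_target == 0:
        return []

    values = [site[feature] for site in unlabeled]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for a constant feature

    # Group sites into equal-width bins of the chosen feature.
    bins = defaultdict(list)
    for site in unlabeled:
        idx = min(int((site[feature] - lo) / width), n_bins - 1)
        bins[idx].append(site)
    for members in bins.values():
        rng.shuffle(members)

    # Round-robin over occupied bins so every part of the range contributes.
    chosen = []
    while len(chosen) < n_target:
        for idx in sorted(bins):
            if bins[idx] and len(chosen) < n_target:
                chosen.append(bins[idx].pop())
    return chosen


# Synthetic example: 1000 unlabeled sites, 10 known positives (made-up data).
sites = [{"id": i, "elevation_m": float(i)} for i in range(1000)]
negatives = sample_presumed_negatives(
    sites, n_positives=10, neg_per_pos=20, feature="elevation_m", n_bins=5
)
print(len(negatives))  # 200
```

Drawing equally from each bin is only one way to cover a feature's range; drawing in proportion to bin occupancy would preserve the regional distribution instead, and which choice is appropriate depends on the assessment.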