Paper No. 187-4
Presentation Time: 2:20 PM
IMPROVING DATA-DRIVEN RESOURCE ASSESSMENT BY ACCOUNTING FOR THE EXPECTED NATURAL DISTRIBUTION OF HYDROTHERMAL SYSTEMS
When utilizing machine learning (ML) to predict hydrothermal resource favorability, selecting positive training sites (i.e., locations with documented hydrothermal features) is straightforward, but selecting negative training sites (i.e., locations with no hydrothermal system) is a challenge because hidden systems exist (i.e., hidden positives) and there may be many types of negatives (e.g., no heat, no permeability, or both). Because approximately equal numbers of positives and negatives are ideal for most supervised machine learning strategies, the Nevada Machine Learning project (NVML) team selected 62 high-confidence negatives for use with the 83 known hydrothermal systems (positives), fitting an artificial neural network model to identify areas favorable for hydrothermal systems. Herein, we consider two alternative strategies for choosing negatives based on the assumption that hydrothermal systems are sparse: because most areas not previously identified as positive are likely to be negative, randomly sampling areas that are not associated with favorable structural settings better represents the range of different types of negatives. We demonstrate that the NVML strategy introduces bias compared with randomly selecting negatives from areas outside the ellipses defined by NVML as containing favorable structural settings. Specifically, using two ML algorithms (logistic regression and XGBoost), we compare three training strategies: 1) the NVML strategy with its chosen positives and negatives; 2) a sampling strategy that pairs the NVML positives with the same number of randomly selected negatives as in the NVML strategy; and 3) a sampling strategy that pairs the NVML positives with randomly selected negatives in numbers that reflect the expected class imbalance. Comparing the first two strategies shows the influence of how negatives are chosen, and comparing the last two demonstrates the importance of the assumption that hydrothermal systems are sparse. Using training strategy 3, XGBoost emerges as the top-performing algorithm. This work demonstrates that the expert selection of negatives may impart bias if the full range of negative types is not represented when training ML models.
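The sketch below illustrates, in a minimal and hypothetical way, how the three negative-selection strategies described above could be assembled and fed to the two algorithms. It is not the NVML team's code: the array names (X_pos, X_neg_expert, X_candidates), the candidate pool of cells outside the structural-setting ellipses, and the assumed imbalance ratio are all illustrative placeholders.

```python
# Hypothetical illustration of the three negative-selection strategies.
# Array names, the candidate pool, and the imbalance ratio are assumptions,
# not values from the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

rng = np.random.default_rng(0)

def make_training_set(strategy, X_pos, X_neg_expert, X_candidates, imbalance_ratio=10):
    """Build (X, y) for one of the three training strategies.

    X_pos          - features for the known hydrothermal systems (positives)
    X_neg_expert   - features for the expert-selected NVML negatives
    X_candidates   - features for cells outside the NVML structural-setting ellipses
    """
    if strategy == 1:    # NVML strategy: expert-selected positives and negatives
        X_neg = X_neg_expert
    elif strategy == 2:  # same number of negatives, but sampled randomly
        idx = rng.choice(len(X_candidates), size=len(X_neg_expert), replace=False)
        X_neg = X_candidates[idx]
    else:                # strategy 3: random negatives reflecting assumed sparsity
        idx = rng.choice(len(X_candidates), size=imbalance_ratio * len(X_pos), replace=False)
        X_neg = X_candidates[idx]
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    return X, y

def fit_models(X, y):
    # Class weighting keeps the imbalanced strategy from swamping the positives.
    neg_over_pos = (y == 0).sum() / max((y == 1).sum(), 1)
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
        "xgboost": XGBClassifier(scale_pos_weight=neg_over_pos, eval_metric="logloss"),
    }
    return {name: model.fit(X, y) for name, model in models.items()}
```

The class-weighting choices (class_weight="balanced", scale_pos_weight) are one plausible way to handle the deliberately imbalanced training set in strategy 3; the abstract does not specify how the authors treated class imbalance during model fitting.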