GSA Connects 2024 Meeting in Anaheim, California

Paper No. 62-2
Presentation Time: 1:50 PM

RANDOM FOREST PREDICTION OF GEOGENIC CONTAMINANTS IN THE MIDWESTERN CAMBRIAN-ORDOVICIAN AQUIFER


RAMEY-LARIVIERE, Juliet and GINDER-VOGEL, Matthew, Environmental Chemistry and Technology Program, Civil and Environmental Engineering Department, University of Wisconsin-Madison, 660 North Park Street, Madison, WI 53706

The Midwestern Cambrian-Ordovician Aquifer System (MCOAS) provides groundwater to almost 30 million people in seven states. Among principal aquifers in the United States, the MCOAS has consistently shown concentrations of geogenic contaminants approaching or above federal health standards. Regularly testing groundwater for all geogenic contaminants requires extensive time and resources. This study focuses on Minnesota, Illinois, and Wisconsin, where available data include pH, total dissolved solids, hardness, and common groundwater cations and anions; however, the sampling frequency for each constituent varies. Here we examine whether data generated from regulation-mandated sampling between 2000 and 2024 are sufficient to predict the concentrations of other constituents. Hierarchical cluster analysis (HCA) and principal component analysis (PCA) were used to determine whether existing groundwater samples showed discernible patterns that could support a predictive model. Preliminary clustering analysis shows the importance of geographic location, inter-sample proximity, and local hydrostratigraphy.
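
As a rough illustration of the clustering step, the sketch below runs a minimal agglomerative hierarchical clustering on a handful of made-up samples. The sample values, the choice of z-score standardization, Euclidean distance, and average linkage, and the function names are all assumptions for illustration; the abstract does not state which settings or software were used, and PCA is not shown.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so that pH and TDS, on very different scales, weigh equally."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def average_linkage(a, b, pts):
    """Mean Euclidean distance over all pairs drawn from clusters a and b."""
    return mean(math.dist(pts[i], pts[j]) for i in a for j in b)


def hca(rows, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    pts = standardize(rows)
    clusters = [[i] for i in range(len(pts))]
    while len(clusters) > n_clusters:
        _, p, q = min(
            (average_linkage(clusters[p], clusters[q], pts), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical samples: (pH, TDS in mg/L, hardness in mg/L as CaCO3).
samples = [
    (7.2, 350.0, 280.0),
    (7.3, 360.0, 290.0),
    (7.1, 340.0, 275.0),
    (7.9, 900.0, 520.0),
    (8.0, 950.0, 540.0),
    (7.8, 880.0, 510.0),
]
groups = hca(samples, n_clusters=2)
print(groups)  # lists of sample indices, one list per cluster
```

In this toy case the low-TDS and high-TDS samples separate into two groups; with real monitoring data, cluster membership would then be compared against well location and hydrostratigraphic unit.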

Previous studies have reported success with classification and regression decision trees for groundwater quality prediction. Random forest models aggregate many decision trees trained on bootstrapped samples, which increases prediction stability. Even with limited observations and features, random forest classification can predict binary outcomes (i.e., above or below the health standard limit) with moderate to high accuracy (~70-95%). Comparison of predicted versus actual concentration values from random forest regression yields significant p-values (<0.05) but low R-squared values (20-40%). This indicates that variables other than geogenic contaminant concentrations are important when predicting the concentration of a target constituent on a continuous scale. Regulatory groundwater monitoring appears to be sufficient for predicting the binary outcome of a target analyte with fairly high accuracy. Future work will include characterizing other variables that may contribute to the variance left unexplained by geogenic contaminant concentrations and testing other appropriate classification methods. Ideally, this will help local governments get a comprehensive view of water quality in the aquifer without greatly increasing sampling efforts.
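
To see how binary accuracy can be high while R-squared stays low, the sketch below scores one set of made-up predictions both ways. The concentrations, the limit of 10 (arbitrary units), and the function names are hypothetical and are not results from this study. R-squared is computed here as the coefficient of determination, 1 - SS_res/SS_tot, which is an assumption; the abstract does not say how its R-squared values were obtained.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot


def exceedance_accuracy(actual, predicted, limit):
    """Fraction of samples whose prediction falls on the correct side of the limit."""
    hits = sum((a > limit) == (p > limit) for a, p in zip(actual, predicted))
    return hits / len(actual)


# Made-up concentrations in arbitrary units; LIMIT is a hypothetical standard.
LIMIT = 10.0
actual = [2.0, 4.0, 6.0, 8.0, 12.0, 14.0, 16.0, 18.0]
predicted = [9.0, 3.0, 9.0, 11.0, 11.0, 19.0, 10.5, 13.0]

r2 = r_squared(actual, predicted)
acc = exceedance_accuracy(actual, predicted, LIMIT)
print(f"R-squared = {r2:.3f}, exceedance accuracy = {acc:.3f}")
```

Seven of the eight predictions land on the correct side of the limit even though several are far from the actual value, so the classification score is high while the continuous fit is poor.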