APPLYING DATA MINING TECHNIQUES TO CAPTURE OUTLIERS IN GROUNDWATER LEVEL DATA: A CASE STUDY OF EDWARDS AQUIFER, CENTRAL TEXAS
Preparing complex spatial-temporal data for predictive modeling requires techniques capable of identifying outliers. Data mining techniques that draw on large time-series datasets are especially useful for modeling complex systems such as the Edwards Aquifer: they identify distinctive patterns in time-series data and apply those findings to new data to simulate, predict, and forecast system behavior.
Preliminary analyses confirm that the Edwards Aquifer’s complex hydrogeology and groundwater pumping introduce outlier responses into the data. To better understand these outliers and their causes, this research applies hierarchical clustering, with the dynamic time warping (DTW) algorithm as the similarity measure, to daily groundwater levels from observation wells under various temporal scales and hydrologic conditions. The goal is to detect outliers in the dataset, which will then be assessed for use as future model input. The DTW algorithm measures the similarity of observation wells independent of time, thereby accounting for spatial dependency and spatial heterogeneity in the Edwards Aquifer. As a result, clustering with the time-invariant DTW algorithm produces more robust cluster solutions than time-variant similarity measures such as Euclidean distance. We believe the outcome of this study can reveal distinctive spatial-temporal patterns between groundwater wells, such as flow paths that change under different hydrologic conditions, which can introduce outliers in the data. Future research will use the clusters identified in this hierarchical cluster analysis as inputs to an artificial neural network to predict groundwater levels in the Edwards Aquifer.
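The clustering approach described above can be illustrated with a minimal sketch. It is not the study's implementation: the well names and water-level values are synthetic, the absolute-difference cost and the average-linkage rule are assumptions (the abstract does not specify either), and no warping-window constraint is applied. The sketch computes pairwise DTW distances, merges wells agglomeratively, and treats a well left in a singleton cluster as an outlier candidate.

```python
import math
from itertools import combinations


def dtw_distance(a, b):
    """Dynamic time warping distance with an absolute-difference local cost."""
    n, m = len(a), len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, or match
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def hierarchical_clusters(series, n_clusters):
    """Agglomerative clustering (average linkage, an assumed choice) on DTW distances."""
    names = list(series)
    dist = {
        frozenset(pair): dtw_distance(series[pair[0]], series[pair[1]])
        for pair in combinations(names, 2)
    }

    def linkage(c1, c2):
        # mean pairwise DTW distance between members of the two clusters
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Synthetic daily water levels for four hypothetical wells (illustrative only).
wells = {
    "well_A": [1, 2, 3, 4, 3, 2, 1, 1],
    "well_B": [1, 1, 2, 3, 4, 3, 2, 1],  # same response as well_A, one day later
    "well_C": [1, 2, 3, 4, 4, 3, 2, 1],  # same response with a longer peak
    "well_D": [4, 1, 4, 1, 4, 1, 4, 1],  # erratic response
}

dtw_ab = dtw_distance(wells["well_A"], wells["well_B"])
euclid_ab = math.dist(wells["well_A"], wells["well_B"])
clusters = hierarchical_clusters(wells, n_clusters=2)
outlier_candidates = [c[0] for c in clusters if len(c) == 1]

print("DTW(A, B):", dtw_ab, "Euclidean(A, B):", euclid_ab)
print("clusters:", clusters)
print("outlier candidates:", outlier_candidates)
```

In this toy data, the lagged well has a DTW distance of zero to its neighbor but a nonzero Euclidean distance, which is the kind of timing tolerance the abstract attributes to DTW. A real analysis would likely use an established library and a cut level chosen from the dendrogram rather than a fixed cluster count.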