GSA Connects 2022 meeting in Denver, Colorado

Paper No. 87-13
Presentation Time: 11:15 AM

NEAREST-NEIGHBOR MACHINE LEARNING FEATURE SELECTION FOR INTERPRETATION OF MICROBIAL MOLECULAR SIGNATURES FROM ISOTOPE RATIO MASS SPECTROMETRY DATA


CLOUGH, Lily (1), MCKINNEY, Brett A. (2), THEILING, Bethany P. (3), DA POIAN, Victoria (4), CHEN, Jingyi (1), MAJOR, Jonathan (5) and SEYLER, Lauren (6), (1)Department of Geosciences, University of Tulsa, 800 S. Tucker Drive, Tulsa, OK 74104, (2)Tandy School of Computer Science, University of Tulsa, 800 S Tucker Dr, Tulsa, OK 74104, (3)Planetary Environments Laboratory, NASA Goddard Space Flight Center, 8800 Greenbelt Rd, Greenbelt, MD 20771, (4)Microtel LLC, 7703 Belle Point Dr, Greenbelt, MD 20770; Planetary Environments Laboratory, NASA Goddard Space Flight Center, 8800 Greenbelt Rd, Greenbelt, MD 20771, (5)School of Geosciences, University of South Florida, 4202 E Fowler Ave, Tampa, FL 33620, (6)School of Natural Sciences and Mathematics, Stockton University, 101 Vera King Farris Dr, Galloway, NJ 08205

Mass spectrometry (MS) promises to be a powerful tool for detecting potential biosignatures during astrobiological missions to ocean worlds in our solar system. Accurate and generalizable machine learning methods could enhance the science return on investment by predicting seawater chemistry and by classifying isotopic signatures either as consistent with microbial life (biotic) or as a novelty (unclassified/unique). However, machine learning models are likely to be complex and to involve interactions between MS features, which makes biosignatures difficult to interpret. Feature selection methods provide biological and chemical context that helps interpret the mechanisms of machine learning models, but these methods must also be able to detect complex interactions.

Previously, we developed a machine learning feature selection algorithm called nearest-neighbor projected distance regression (NPDR), which can identify important model features that involve complex interactions and automatically reduce correlation and dimensionality in a high-dimensional variable space. The standard distance metrics used in NPDR (Manhattan and Euclidean) assume that the multivariate data are isotropic, an assumption that real data often violate because of differences in the covariance between variables. We therefore extend NPDR to include a random forest distance and other anisotropic distance metrics for computing nearest neighbors. We also augment the isotope-ratio MS data with time-series features from the raw MS signal to improve biotic classification.
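As a rough illustration of a random forest distance for neighbor finding, the sketch below uses one common definition: two samples are close when they fall in the same leaf in many trees. It assumes the leaf assignments are already available from a trained forest (for example, the array returned by scikit-learn's `RandomForestClassifier.apply`). The function names and the toy `leaves` table are hypothetical, and the abstract does not state that NPDR computes its random forest distance in exactly this way.

```python
from typing import List, Sequence


def proximity_distance(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """Return 1 minus the fraction of trees in which two samples share a leaf."""
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both samples need one leaf index per tree")
    shared = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return 1.0 - shared / len(leaves_a)


def nearest_neighbors(leaves: Sequence[Sequence[int]], k: int) -> List[List[int]]:
    """For each sample, return the indices of its k closest samples."""
    neighbors = []
    for i, row in enumerate(leaves):
        # Distance from sample i to every other sample (self excluded).
        dists = [
            (proximity_distance(row, other), j)
            for j, other in enumerate(leaves)
            if j != i
        ]
        dists.sort()  # ties are broken by the smaller sample index
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


# Hypothetical leaf assignments: 4 samples (rows) x 3 trees (columns).
leaves = [
    [0, 2, 1],
    [0, 2, 3],
    [4, 2, 3],
    [5, 6, 7],
]

print(nearest_neighbors(leaves, k=1))
```

Because the forest splits only on variables that help prediction, a distance of this kind weights variables unequally, which is one way to relax the isotropy assumption of Manhattan and Euclidean metrics.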

We test NPDR on our novel experimental MS data from ocean-world seawater analogs. We measure isotope fractionations of volatile CO2 that could also be detected in exospheres or plumes. Samples include baseline abiotic conditions spanning a range of possible seawater chemistries consistent with Europa and Enceladus, and biotic samples that include microbes in these seawaters. We use penalized NPDR with random forest proximity to identify interpretable microbial molecular signatures. We compare the selected features with those ranked by random forest importance, and we train a classifier that discriminates between biotic and abiotic samples with high accuracy. Machine learning models trained on these ocean-world analog MS data could assist in identifying biosignatures during future missions.
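To make the idea of neighbor-based feature scoring concrete, the sketch below computes a simplified Relief-style score: a feature scores high when it differs more between neighbors of different classes (biotic vs. abiotic) than between neighbors of the same class. This is a stand-in for illustration only; NPDR itself fits a regression on such neighbor differences, and the penalized variant is not shown. The function name, the toy data, and the neighbor lists are all hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def neighbor_difference_scores(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    neighbors: Sequence[Sequence[int]],
) -> List[float]:
    """Mean |difference| over different-class pairs minus same-class pairs."""
    n_features = len(X[0])
    miss_sum = [0.0] * n_features  # neighbor pairs with different labels
    hit_sum = [0.0] * n_features   # neighbor pairs with the same label
    n_miss = 0
    n_hit = 0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            same_class = y[i] == y[j]
            target = hit_sum if same_class else miss_sum
            for a in range(n_features):
                target[a] += abs(X[i][a] - X[j][a])
            if same_class:
                n_hit += 1
            else:
                n_miss += 1
    scores = []
    for a in range(n_features):
        miss_mean = miss_sum[a] / n_miss if n_miss else 0.0
        hit_mean = hit_sum[a] / n_hit if n_hit else 0.0
        scores.append(miss_mean - hit_mean)
    return scores


# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [1.0, 4.0], [0.9, 2.0]]
y = [0, 0, 1, 1]  # e.g. 0 = abiotic, 1 = biotic
# Hypothetical neighbor lists, e.g. from a random forest proximity.
neighbors = [[1, 2], [0, 3], [3, 0], [2, 1]]

scores = neighbor_difference_scores(X, y, neighbors)
print(scores)
```

In this toy case the class-separating feature receives a positive score and the uninformative one a negative score, which is the kind of ranking one could then compare against random forest importance.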