NEAREST-NEIGHBOR MACHINE LEARNING FEATURE SELECTION FOR INTERPRETATION OF MICROBIAL MOLECULAR SIGNATURES FROM ISOTOPE RATIO MASS SPECTROMETRY DATA
Previously, we developed nearest-neighbor projected distance regression (NPDR), a machine learning feature selection algorithm that identifies important model features involved in complex interactions while automatically reducing correlation and dimensionality in a high-dimensional variable space. The standard distance metrics used in NPDR (Manhattan and Euclidean) assume that the multivariate data are isotropic, an assumption that real data often violate because the covariance differs between variables. We therefore extend NPDR to compute nearest neighbors with a random forest distance and other anisotropic distance metrics. We also augment the isotope-ratio mass spectrometry (MS) data with time-series features from the raw MS signal to improve biotic classification.
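To illustrate why an isotropic metric can select misleading neighbors when variables covary, the following minimal sketch compares Euclidean and Mahalanobis nearest neighbors on a toy two-variable data set. The Mahalanobis distance is used here only as a familiar example of an anisotropic metric; it is an assumption for illustration and not necessarily one of the metrics used in this work. The data points and all function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Isotropic distance: every direction is weighted equally."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def covariance_2d(points):
    """Sample covariance (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(u, v, cov):
    """Anisotropic distance: differences are scaled by the inverse covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = u[0] - v[0], u[1] - v[1]
    # Quadratic form d^T S^{-1} d using the closed-form 2x2 inverse.
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)

def nearest(idx, points, dist):
    """Index of the nearest neighbor of points[idx] under the metric dist."""
    candidates = (j for j in range(len(points)) if j != idx)
    return min(candidates, key=lambda j: dist(points[idx], points[j]))

# Toy data (hypothetical): two strongly correlated variables, plus two
# points (indices 7 and 8) that lie off the main correlated direction.
points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0),
          (-1.0, -1.0), (-2.0, -2.0), (-3.0, -3.0),
          (0.5, -0.5), (-0.5, 0.5)]
cov = covariance_2d(points)

nn_euclid = nearest(0, points, euclidean)
nn_mahal = nearest(0, points, lambda u, v: mahalanobis_2d(u, v, cov))
print("Euclidean neighbor of point 0:  ", points[nn_euclid])
print("Mahalanobis neighbor of point 0:", points[nn_mahal])
```

In this toy example the Euclidean metric picks an off-diagonal point as the nearest neighbor of the origin, whereas the covariance-aware metric picks a point along the correlated direction, so the two metrics define different neighborhoods for the same sample.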
We test NPDR on our novel experimental MS data from ocean-world seawater analogs. We measure isotope fractionations of volatile CO2 of the kind that could be measured in exospheres or plumes. Samples include abiotic baselines spanning a range of possible seawater chemistries consistent with Europa and Enceladus, and biotic samples that contain microbes in these seawaters. We use penalized NPDR with random forest proximity to identify interpretable microbial molecular signatures. We compare the selected features with those ranked by random forest importance, and we train a classifier that discriminates between biotic and abiotic samples with high accuracy. ML models trained on these ocean-world analog MS data could assist in identifying biosignatures during future missions.
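As a rough sketch of how a random forest proximity can serve as a distance for nearest-neighbor selection, the code below takes the proximity of two samples to be the fraction of trees in which they fall in the same leaf, and the distance to be one minus that proximity. This is one common convention and an assumption here, not necessarily the exact definition used in this work. The leaf assignments are made-up illustrative values, not experimental data; in practice they could come from a fitted forest (for example, scikit-learn's `RandomForestClassifier.apply` returns the leaf index of each sample in each tree). The function names are hypothetical.

```python
def rf_proximity_distance(leaves):
    """Pairwise distance matrix from per-tree leaf assignments.

    leaves[i][t] is the id of the leaf that sample i reaches in tree t.
    Distance is assumed to be 1 - (fraction of trees with a shared leaf).
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            dist[i][j] = 1.0 - shared / n_trees
    return dist

def k_nearest(dist, i, k):
    """Indices of the k samples closest to sample i, excluding i itself."""
    others = [j for j in range(len(dist)) if j != i]
    return sorted(others, key=lambda j: dist[i][j])[:k]

# Hypothetical leaf ids for 4 samples in a 4-tree forest (illustration only).
leaves = [
    [1, 4, 2, 7],
    [1, 4, 2, 3],
    [1, 5, 6, 3],
    [8, 5, 6, 9],
]
dist = rf_proximity_distance(leaves)
neighbors_of_0 = k_nearest(dist, 0, 2)
print("distances from sample 0:", dist[0])
print("two nearest neighbors of sample 0:", neighbors_of_0)
```

Because the forest is trained on the outcome, samples that share leaves are similar along the variables the trees found useful for splitting, which is what makes this distance anisotropic.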