GSA Connects 2021 in Portland, Oregon

Paper No. 245-6
Presentation Time: 2:50 PM

IMPROVING MACHINE-READABILITY OF GEOCHEMICAL DATA AND SAMPLE METADATA IN EARTHCHEM/LEPR/TRACEDS/ASTROMAT AND SESAR


JI, Peng, PROFETA, Lucia and LEHNERT, Kerstin, Lamont-Doherty Earth Observatory, Columbia University, 61 Rte 9W, Palisades, NY 10964

Geochemical data have historically been difficult to compile and ingest into machine-readable formats suitable for machine learning (ML) applications. Multiple efforts are underway worldwide to improve data access and to clean and harmonize both legacy data and data from new studies.

EarthChem and related systems (LEPR/traceDs/AstroMat and SESAR) provide compilations of curated data and metadata for sample- and experiment-based data. With over 6 million chemical values in PetDB alone, and integrated access to databases such as GEOROC and SedDB through the EarthChem Portal, EarthChem provides an excellent foundational dataset for ML applications. While these data are currently available through user interfaces and web services, we are investigating software solutions that can provide a more streamlined pipeline for data extraction, transformation, and fusion.
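
As a rough illustration of what a transformation and fusion step could look like, the sketch below harmonizes records from two hypothetical sources into one feature table per sample. The field names (sample_id, analyte, value, unit), the unit conversion table, and the functions transform and fuse are assumptions made for this example only; they do not represent the EarthChem or SESAR schemas or the implementation proposed here.

```python
# Minimal sketch of a transform-and-fuse step.
# All field names, units, and values below are hypothetical.
from collections import defaultdict

# Conversion factors to a common unit (ppm by weight).
TO_PPM = {"wt%": 10_000.0, "ppm": 1.0, "ppb": 0.001}


def transform(record, source):
    """Map one source-specific record to a common long-format row."""
    return {
        "sample_id": record["sample_id"].strip().upper(),  # normalize identifiers
        "analyte": record["analyte"].strip(),
        "value_ppm": float(record["value"]) * TO_PPM[record["unit"]],
        "source": source,
    }


def fuse(rows):
    """Pivot long-format rows into one feature dict per sample."""
    table = defaultdict(dict)
    for row in rows:
        table[row["sample_id"]][row["analyte"]] = row["value_ppm"]
    return dict(table)


# Two hypothetical sources reporting the same sample with different conventions.
source_a = [{"sample_id": "abc-1 ", "analyte": "MgO", "value": "7.5", "unit": "wt%"}]
source_b = [{"sample_id": "ABC-1", "analyte": "Sr", "value": "120", "unit": "ppm"}]

rows = [transform(r, "A") for r in source_a] + [transform(r, "B") for r in source_b]
features = fuse(rows)
```

A real pipeline would also need to carry method, uncertainty, and provenance metadata, and to decide how to resolve conflicting values for the same sample and analyte; this sketch simply lets the later value overwrite the earlier one.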

The goal is to create a general processing model that can interact with new-generation data analysis methods and tools. This framework can be adapted to fit data stores from partner organizations. We will present our concept and proposed implementation and request feedback from the broader community to ensure that it serves data scientists who want to use our catalogues in their ML-based research.