IMPROVING MACHINE-READABILITY OF GEOCHEMICAL DATA AND SAMPLE METADATA IN EARTHCHEM/LEPR/TRACEDS/ASTROMAT AND SESAR
EarthChem and related systems (LEPR/traceDs/AstroMat and SESAR) provide curated compilations of sample- and experiment-based data and metadata. With over 6 million chemical values in PetDB alone, and integrated access to databases such as GEOROC and SedDB through the EarthChem Portal, EarthChem provides an excellent foundational dataset for machine learning (ML) applications. While these data are currently available through user interfaces and web services, we are investigating software solutions that can provide a more streamlined pipeline for data extraction, transformation, and fusion.
The goal is to create a general processing model that interacts with new-generation data analysis methods and tools and that can be adapted to data stores from partner organizations. We will present our concept and proposed implementation and request feedback from the broader community, so that the result meets the needs of data scientists looking to use our catalogues in their ML-based research.
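
To make the extraction, transformation, and fusion steps concrete, the following is a minimal Python sketch under stated assumptions, not the proposed implementation. It assumes measurements have already been extracted into long-format rows (one row per analyte) and that sample metadata is keyed by a shared sample identifier. All identifiers, field names, and values are hypothetical placeholders and do not reflect any actual EarthChem or SESAR schema or web service.

```python
from collections import defaultdict

# Hypothetical extracted rows: identifiers, field names, and values are
# invented for illustration and do not come from any real database.
CHEMISTRY_ROWS = [
    {"sample_id": "SAMPLE-A", "analyte": "SiO2", "value": "49.8", "unit": "wt%"},
    {"sample_id": "SAMPLE-A", "analyte": "Sr", "value": "125", "unit": "ppm"},
    {"sample_id": "SAMPLE-B", "analyte": "SiO2", "value": "71.2", "unit": "wt%"},
]

# Hypothetical sample metadata keyed by the same identifier.
SAMPLE_METADATA = {
    "SAMPLE-A": {"material": "basalt"},
    "SAMPLE-B": {"material": "rhyolite"},
}

# Multiplicative factors to express a concentration in weight percent
# (1 wt% = 10,000 ppm).
TO_WT_PERCENT = {"wt%": 1.0, "ppm": 1e-4}


def transform(rows):
    """Parse string values and express every concentration in wt%."""
    for row in rows:
        factor = TO_WT_PERCENT[row["unit"]]
        yield {
            "sample_id": row["sample_id"],
            "analyte": row["analyte"],
            "value_wt_pct": float(row["value"]) * factor,
        }


def fuse(measurements, metadata):
    """Pivot long-format measurements to one record per sample and
    merge in that sample's metadata."""
    table = defaultdict(dict)
    for m in measurements:
        table[m["sample_id"]][m["analyte"]] = m["value_wt_pct"]
    return [
        {"sample_id": sid, **metadata.get(sid, {}), **analytes}
        for sid, analytes in sorted(table.items())
    ]


records = fuse(transform(CHEMISTRY_ROWS), SAMPLE_METADATA)
for record in records:
    print(record)
```

Each resulting record holds one sample with its metadata and one column per analyte, which is the wide layout most ML toolkits expect. Analytes not measured for a sample are simply absent from its record, so a downstream step would still need to decide how to handle missing values.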