MACHINE LEARNING IN THE EARTH SCIENCES: A BROAD SURVEY WITH USE CASES FROM THE THROUGHPUT DATABASE

Dominguez Vidana, Socorro

Paper No. 245-5

Presentation Time: 2:35 PM

DOMINGUEZ VIDANA, Socorro¹, GORING, Simon², LENARD, Michael³, WOFFORD, Morgan³ and THOMER, Andrea K.³, (1)Vancouver, BC V5N4E8, CANADA, (2)Department of Geography, University of Wisconsin – Madison, 550 N Park St, Madison, WI 53706, (3)School of Information, University of Michigan, 105 S. State St., Ann Arbor, MI 41804

Throughput (https://throughputdb.com) is a graph database developed with a focus on the geosciences. Throughput links code from public online code repositories to data resources within the geosciences, in particular data repositories such as LinkedEarth (http://linked.earth), the Neotoma Paleoecology Database (https://neotomadb.org), MagIC (https://www2.earthref.org/MagIC), EarthChem (https://www.earthchem.org/) and others. Throughput contains links to over 2000 data resources, drawn from the Registry of Research Data Repositories (http://re3data.org), and links these resources to almost 80,000 code repositories.
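
As a rough illustration of the linking idea described above, the following minimal sketch models code repositories and data resources as two sets of nodes joined by edges. It is not Throughput's actual schema or API; the class name `LinkGraph`, its methods, and the repository names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class LinkGraph:
    """Toy bipartite graph: code repositories <-> data resources (hypothetical)."""

    def __init__(self):
        # Each edge is stored in both directions so lookups are cheap either way.
        self._repos_by_resource = defaultdict(set)
        self._resources_by_repo = defaultdict(set)

    def link(self, code_repo, data_resource):
        """Record that a code repository references a data resource."""
        self._repos_by_resource[data_resource].add(code_repo)
        self._resources_by_repo[code_repo].add(data_resource)

    def repos_for(self, data_resource):
        """Return the code repositories linked to a data resource, sorted."""
        return sorted(self._repos_by_resource.get(data_resource, set()))

    def resources_for(self, code_repo):
        """Return the data resources linked to a code repository, sorted."""
        return sorted(self._resources_by_repo.get(code_repo, set()))


# Hypothetical repository names; the data resource names come from the abstract.
graph = LinkGraph()
graph.link("example-org/pollen-tools", "Neotoma Paleoecology Database")
graph.link("example-org/paleo-ml", "Neotoma Paleoecology Database")
graph.link("example-org/paleo-ml", "LinkedEarth")

print(graph.repos_for("Neotoma Paleoecology Database"))
print(graph.resources_for("example-org/paleo-ml"))
```

Storing each edge in both directions lets a researcher start from either side: from a data resource to the code that uses it, or from a code repository to the data it touches.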

Throughput provides a front-end application to help researchers discover new data resources and the code that can be used to explore them. Among the nearly 80,000 code repositories indexed in Throughput, over 400 apply machine learning to physical science data; however, standards, use of best practices, and the reusability of these resources vary considerably.

The extent of data within Throughput and its connected resources allows us to apply machine learning techniques directly to the data resources themselves, or to use the graph network and its associated unique identifiers (UIDs) to perform broader analyses of research products within the geosciences. We present work that highlights elements of code repositories that improve the discoverability and reuse of machine learning resources, and that applies best practices to machine learning applications involving Throughput. The first ML workflow uses natural language processing (NLP) to tag spatial data, via the EarthCube-funded GeoDeepDive infrastructure, to improve data acquisition and metadata quality within research databases. The second ML workflow uses metadata augmentation to improve the discoverability of geoscientific resources, with the aim of increasing code reuse and reducing the time-to-science for earth science researchers working with machine learning algorithms.
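
To make the idea of tagging spatial data in text concrete, the sketch below finds decimal-degree coordinate pairs with a regular expression. This is a deliberately simple stand-in, not the GeoDeepDive pipeline or the workflow presented here, whose details the abstract does not give. The pattern, the function name `tag_coordinates`, and the sample sentence are assumptions made for illustration.

```python
import re

# Assumed input format: decimal degrees with hemisphere letters,
# e.g. "45.52°N, 122.68°W". Other coordinate notations are not handled.
COORD_PATTERN = re.compile(
    r"(?<![\d.])(?P<lat>\d{1,2}(?:\.\d+)?)\s*°?\s*(?P<lat_hem>[NS])\s*,?\s*"
    r"(?P<lon>\d{1,3}(?:\.\d+)?)\s*°?\s*(?P<lon_hem>[EW])"
)


def tag_coordinates(text):
    """Return a list of tags, one per coordinate pair found in the text.

    Each tag holds the character span of the match and signed decimal degrees
    (south and west are negative). Out-of-range values are skipped.
    """
    tags = []
    for match in COORD_PATTERN.finditer(text):
        lat = float(match.group("lat"))
        lon = float(match.group("lon"))
        if match.group("lat_hem") == "S":
            lat = -lat
        if match.group("lon_hem") == "W":
            lon = -lon
        if abs(lat) <= 90 and abs(lon) <= 180:
            tags.append({"span": match.span(), "lat": lat, "lon": lon})
    return tags


# Hypothetical sentence of the kind that might appear in a publication.
sample = "Cores were collected at 45.52°N, 122.68°W near the river."
for tag in tag_coordinates(sample):
    print(tag)
```

Tags of this kind could in principle be attached to a database record as candidate location metadata for a curator to review, which is one way such tagging might support metadata quality.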

Session No. 245

T173. Machine Learning for Advancing Data Analysis Toolkit in Geoscience

Wednesday, 13 October 2021: 1:30 PM-5:30 PM

B113/B114 (Hybrid Room) (Oregon Convention Center)

Geological Society of America Abstracts with Programs. Vol 53, No. 6
doi: 10.1130/abs/2021AM-370665

© Copyright 2021 The Geological Society of America (GSA), all rights reserved. Permission is hereby granted to the author(s) of this abstract to reproduce and distribute it freely, for noncommercial purposes. Permission is hereby granted to any individual scientist to download a single copy of this electronic file and reproduce up to 20 paper copies for noncommercial purposes advancing science and education, including classroom use, providing all reproductions include the complete content shown here, including the author information. All other forms of reproduction and/or transmittal are prohibited without written permission from GSA Copyright Permissions.

GSA Connects 2021 in Portland, Oregon
