GSA Connects 2021 in Portland, Oregon

Paper No. 245-5
Presentation Time: 2:35 PM

MACHINE LEARNING IN THE EARTH SCIENCES: A BROAD SURVEY WITH USE CASES FROM THE THROUGHPUT DATABASE


DOMINGUEZ VIDANA, Socorro¹, GORING, Simon², LENARD, Michael³, WOFFORD, Morgan³ and THOMER, Andrea K.³, (1)Vancouver, BC V5N4E8, CANADA, (2)Department of Geography, University of Wisconsin – Madison, 550 N Park St, Madison, WI 53706, (3)School of Information, University of Michigan, 105 S. State St., Ann Arbor, MI 41804

Throughput (https://throughputdb.com) is a graph database developed with a focus on the geosciences. It links code from public online code repositories to geoscience data resources, in particular data repositories such as LinkedEarth (http://linked.earth), the Neotoma Paleoecology Database (https://neotomadb.org), MagIC (https://www2.earthref.org/MagIC), EarthChem (https://www.earthchem.org/) and others. Throughput indexes over 2,000 data resources through the Registry of Research Data Repositories (http://re3data.org) and links these resources to almost 80,000 code repositories.

Throughput provides a front-end application that helps researchers discover new data resources and the code that can be used to explore them. Among the nearly 80,000 code repositories indexed in Throughput, over 400 apply machine learning to physical science data; however, standards, adherence to best practices and the reusability of these resources vary considerably.

The extent of data within Throughput and its connected resources allows us to apply machine learning techniques directly to the data resources themselves, or to use the graph network and its associated unique identifiers (UIDs) to perform broader analyses of research products within the geosciences. We present work that highlights the elements of code repositories that improve the discoverability and reuse of machine learning resources, and that applies best practices to machine learning applications involving Throughput. The first ML workflow uses natural language processing (NLP) to tag spatial data through the EarthCube-funded GeoDeepDive infrastructure, with the aim of improving data acquisition and metadata quality within research databases. The second ML workflow uses metadata augmentation to improve the discoverability of geoscientific resources, increasing code reuse and reducing the time-to-science for Earth science researchers working with machine learning algorithms.