MACHINE LEARNING IN THE EARTH SCIENCES: A BROAD SURVEY WITH USE CASES FROM THE THROUGHPUT DATABASE
Throughput provides a front-end application that helps researchers discover data resources and the code that can be used to explore them. Of the nearly 80,000 code repositories indexed in Throughput, more than 400 apply machine learning to physical science data; however, standards, adherence to best practices, and the reusability of these resources vary considerably.
The extent of data within Throughput and its connected resources allows us either to apply machine learning (ML) techniques directly to the data resources themselves, or to use the graph network and its associated unique identifiers (UIDs) to perform broader analyses of research products within the geosciences. We present work that highlights the elements of code repositories that improve the discoverability and reuse of machine learning resources, and that applies best practices to machine learning applications involving Throughput. The first ML workflow uses natural language processing (NLP) to tag spatial data, building on the EarthCube-funded GeoDeepDive infrastructure, in order to improve data acquisition and metadata quality within research databases. The second ML workflow uses metadata augmentation to make geoscientific resources more discoverable, which in turn increases code reuse and reduces the time-to-science for earth science researchers working with machine learning algorithms.
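As a rough illustration of what tagging spatial data in text can mean in the first workflow, the following minimal sketch extracts decimal-degree coordinate pairs from a sentence with a regular expression. It is not the GeoDeepDive pipeline, which is not described here; the names SpatialTag, COORD_PATTERN and tag_coordinates are hypothetical, and a rule-based pattern stands in for whatever NLP model is actually used.

```python
import re
from typing import List, NamedTuple


class SpatialTag(NamedTuple):
    """A coordinate mention found in text (hypothetical structure)."""
    text: str
    latitude: float
    longitude: float


# Assumption: coordinates appear as decimal degrees with hemisphere letters,
# e.g. "43.07°N, 89.40°W" or "43.07 N 89.40 W". Other notations are ignored.
COORD_PATTERN = re.compile(
    r"(?<![\d.])(\d{1,2}(?:\.\d+)?)\s*°?\s*([NS])\b[,\s]+"
    r"(\d{1,3}(?:\.\d+)?)\s*°?\s*([EW])\b"
)


def tag_coordinates(sentence: str) -> List[SpatialTag]:
    """Return every plausible latitude/longitude pair mentioned in a sentence."""
    tags = []
    for match in COORD_PATTERN.finditer(sentence):
        lat_value, lat_hemi, lon_value, lon_hemi = match.groups()
        # Southern and western hemispheres map to negative values.
        latitude = float(lat_value) * (1 if lat_hemi == "N" else -1)
        longitude = float(lon_value) * (1 if lon_hemi == "E" else -1)
        # Discard matches that fall outside valid coordinate ranges.
        if abs(latitude) <= 90 and abs(longitude) <= 180:
            tags.append(SpatialTag(match.group(0), latitude, longitude))
    return tags


if __name__ == "__main__":
    example = "The core was collected at 43.07°N, 89.40°W in 2015."
    for tag in tag_coordinates(example):
        print(tag)
```

Tags of this kind could then be attached to a database record as candidate spatial metadata for a curator to review; how such tags are validated and stored in practice is outside the scope of this sketch.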