GSA Connects 2024 Meeting in Anaheim, California

Paper No. 51-1
Presentation Time: 1:35 PM

MINING THE PUBLISHED LITERATURE TO INDEX THE CHARACTERISTICS OF GEOLOGIC UNITS


QUINN, Daven, Department of Geoscience, University of Wisconsin – Madison, 1215 W Dayton, Madison, WI 53703

Geologic units are the core concepts used by geologists to divide and classify the rocks of Earth's crust. Units are defined in several ways, each with its own set of published data products: the spatial and temporal bounds of rock units are described in maps and stratigraphic columns, respectively, and their characteristics (e.g., mineral contents, fossils, sedimentary structures, igneous textures, paleoenvironment) in scientific papers. Formalized naming schemes (e.g., GeoLex) and digital systems such as Macrostrat (Peters et al., 2018) have helped connect unit descriptions across these different media. However, detailed information about geologic units must still be read from the narrative text of papers and reports, which complicates synthetic analysis of geology at scale.

Here, we describe new efforts to extract the characteristics of geologic units from the scientific literature using machine reading. Using the xDD digital library of more than 17 million (and growing) scientific papers, we have built a pipeline that finds mentions of geologic units in context, extracts structured data about their properties, and aligns those properties with Macrostrat data dictionaries. Several classes of models are being tested; initial results suggest that large language models (LLMs) produce more correct relationships for complex phrasings, while fine-tuned bidirectional transformers (BERT) provide results that more closely match the targeted structured vocabulary.
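
The three pipeline stages named above (find unit mentions, extract properties, align to a controlled vocabulary) can be illustrated with a deliberately simple sketch. This is not the pipeline described in this abstract: it substitutes a regular expression and a dictionary lookup for the LLM and BERT models, and the names `LITHOLOGY_VOCAB`, `UNIT_PATTERN`, `UnitAttribute`, and `extract_unit_attributes` are hypothetical. The small vocabulary stands in for Macrostrat's data dictionaries, and the example sentences are written for illustration only.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for a controlled vocabulary: surface form -> normalized term.
LITHOLOGY_VOCAB = {
    "sandstone": "sandstone",
    "sandstones": "sandstone",
    "shale": "shale",
    "shales": "shale",
    "limestone": "limestone",
    "limestones": "limestone",
    "dolostone": "dolomite",
    "dolomite": "dolomite",
}

# Assumed naming convention: one or more capitalized words followed by a rank
# or lithology term. A few sentence-initial words are excluded from the name.
UNIT_PATTERN = re.compile(
    r"\b((?:(?!(?:The|In|Of)\s)[A-Z][a-z]+\s)+"
    r"(?:Formation|Group|Member|Sandstone|Shale|Limestone))\b"
)


@dataclass(frozen=True)
class UnitAttribute:
    unit: str       # unit name as it appears in the text
    attribute: str  # kind of property, here always "lithology"
    value: str      # normalized vocabulary term
    evidence: str   # sentence the record was drawn from


def extract_unit_attributes(text: str) -> list[UnitAttribute]:
    """Pair each unit mention with vocabulary terms found in the same sentence."""
    records = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        units = UNIT_PATTERN.findall(sentence)
        if not units:
            continue
        # Blank out unit names so "Sandstone" in a name is not read as a descriptor.
        remainder = UNIT_PATTERN.sub(" ", sentence)
        values = []
        for token in re.findall(r"[a-z]+", remainder.lower()):
            normalized = LITHOLOGY_VOCAB.get(token)
            if normalized and normalized not in values:
                values.append(normalized)
        for unit in units:
            for value in values:
                records.append(UnitAttribute(unit, "lithology", value, sentence))
    return records


if __name__ == "__main__":
    sample = (
        "The Bonneterre Formation consists of dolostone and minor shale. "
        "It overlies the Lamotte Sandstone, a quartz-rich sandstone."
    )
    for record in extract_unit_attributes(sample):
        print(record.unit, "|", record.attribute, "|", record.value)
```

Sentence-level co-occurrence is the crudest possible relation model and will mis-assign properties when a sentence mentions several units. That limitation is the kind of complex phrasing for which the abstract reports that LLMs produce more correct relationships.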

A supporting software infrastructure allows batch processing to generate descriptions of many units at once, and a user interface supports training and model validation. By scaling this system up, targeting new descriptors (e.g., sedimentary structures, igneous textures, alteration mineralogy), and engaging geologists to curate the dataset, we hope to build an index of rock-record properties for mineral system modeling and other targeted analyses of crustal rocks. Accepted results will be incorporated into Macrostrat's structured database, improving a data resource in wide use by the community.