MINING THE PUBLISHED LITERATURE TO INDEX THE CHARACTERISTICS OF GEOLOGIC UNITS
Here, we describe new efforts to extract the characteristics of geologic units from the scientific literature using machine reading. Using the xDD digital library of over 17 million (and growing) scientific papers, we have built a pipeline that finds mentions of geologic units in context, extracts structured data about their properties, and aligns those data with Macrostrat data dictionaries. Several classes of models are being tested; initial results suggest that large language models (LLMs) extract relationships more accurately from complex phrasings, while fine-tuned bidirectional transformer models (BERT) produce results that more closely match the targeted structured vocabulary.
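As a rough illustration of the three pipeline stages named above (finding unit mentions, extracting properties, and aligning them to a controlled vocabulary), the following is a minimal sketch only. It is not the project's implementation: a regular expression and fuzzy string matching stand in for the LLM and BERT models, and the vocabulary list, the pattern, the function names, and the example sentence (including the unit name "Example Creek Formation") are all hypothetical.

```python
import re
from difflib import get_close_matches

# Hypothetical controlled vocabulary standing in for a Macrostrat data dictionary.
LITHOLOGY_TERMS = ["sandstone", "shale", "limestone", "dolomite", "siltstone"]

# Hypothetical pattern: one or more capitalized words followed by a rank term.
UNIT_PATTERN = re.compile(r"\b((?:[A-Z][a-z]+\s)+(?:Formation|Group|Member))\b")


def find_unit_mentions(text):
    """Return candidate geologic unit names found in the text."""
    names = [m.group(1) for m in UNIT_PATTERN.finditer(text)]
    # Drop a leading article that the pattern may capture at sentence start.
    return [re.sub(r"^The\s+", "", name) for name in names]


def align_term(raw, vocabulary, cutoff=0.8):
    """Map a raw word to the closest vocabulary term, or None if nothing is close."""
    matches = get_close_matches(raw.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def extract_record(sentence):
    """Pair each unit mention with the vocabulary terms found in the same sentence."""
    lithologies = []
    for word in re.findall(r"[A-Za-z]+", sentence):
        term = align_term(word, LITHOLOGY_TERMS)
        if term and term not in lithologies:
            lithologies.append(term)
    return [{"unit": unit, "lithologies": lithologies}
            for unit in find_unit_mentions(sentence)]


# Made-up sentence for illustration; not drawn from the literature.
sentence = "Outcrops of the Example Creek Formation consist of dolomites and minor shale."
records = extract_record(sentence)
print(records)
```

Pairing every unit with every term in the same sentence is a deliberately naive choice here; resolving which property belongs to which unit in complex phrasings is the relationship-extraction problem that the models under test are meant to address.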
A supporting software infrastructure allows batch processing to generate descriptions of many units at once, and a user interface supports training and model validation. By scaling this system up, targeting new descriptors (e.g., sedimentary structures, igneous textures, alteration mineralogy), and engaging geologists to curate the dataset, we hope to build an index of rock-record properties for mineral system modeling and other targeted analyses of crustal rocks. Accepted results will be incorporated into Macrostrat's structured database, improving a data resource in wide use by the community.