2015 GSA Annual Meeting in Baltimore, Maryland, USA (1-4 November 2015)

Paper No. 239-9
Presentation Time: 3:45 PM


PETERS, Shanan E.1, ZHANG, Ce2, ROSS, Ian2 and LIVNY, Miron2, (1)Department of Geoscience, University of Wisconsin–Madison, 1215 W Dayton St, Madison, WI 53706, (2)Computer Sciences, University of Wisconsin–Madison, Madison, WI 53706, peters@geology.wisc.edu

Recent developments in machine reading and learning approaches to text and data mining (TDM) hold considerable promise for accelerating the pace and improving the quality of literature-based data synthesis, but these advances have outpaced access to the published literature. For paleobiology, a discipline in which many important research questions require synthesizing data from multiple sources, this limitation is a significant handicap. Here we describe a computing infrastructure to support automated recognition and extraction of paleontological data from the published literature. Specifically, our infrastructure supports automated, rate-controlled fetching of original documents and their full bibliographic citation metadata from remote servers, the secure storage of these original documents, and the use of considerable high-throughput computing resources for their pre-processing by optical character recognition, natural language parsing, and other document annotation and parsing software tools. New tools and new versions of existing tools can be automatically deployed against all original documents as those tools become available. The products of these software tools (text/XML files) are managed by MongoDB and are available for use in data extraction applications. Documents containing paleontologically relevant information are identified using a combination of Elasticsearch and document classifiers. TDM-ready data for these documents are then retrieved and passed to the PaleoDeepDive (PDD) machine reading and learning system, which is used to populate a database with a structure similar to that of the Paleobiology Database. The PDD system is kept up to date as new relevant documents are incorporated into the digital library. Currently, our digital library contains more than 370K documents from Elsevier and the USGS, and we are actively seeking additional content providers.
By focusing on building a dependable infrastructure to support the retrieval, storage, and pre-processing of published content, we are establishing a foundation for complex, and continually improving, information integration and data extraction applications. The PaleoDeepDive application is one such system, and we invite collaborations for similar initiatives.
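
As a rough illustration of two ideas described above, rate-controlled fetching and the re-deployment of new tool versions against all stored documents, the following is a minimal Python sketch. It is not the authors' implementation: the names RateLimiter, fetch_all, and pending_jobs, the in-memory dictionary standing in for document storage, and the (document, tool, version) bookkeeping are all hypothetical, and a real deployment would involve remote servers, MongoDB, and a high-throughput job scheduler rather than plain Python objects.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple


class RateLimiter:
    """Enforce a minimum interval between successive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request, if any

    def wait(self) -> None:
        """Block until at least min_interval_s has passed since the last request."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval_s - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(
    doc_ids: Iterable[str],
    fetch: Callable[[str], bytes],
    limiter: RateLimiter,
) -> Dict[str, bytes]:
    """Fetch each document through the limiter; the dict stands in for secure storage."""
    store: Dict[str, bytes] = {}
    for doc_id in doc_ids:
        limiter.wait()
        store[doc_id] = fetch(doc_id)  # `fetch` is supplied by the caller (assumption)
    return store


def pending_jobs(
    doc_ids: Iterable[str],
    tools: Dict[str, str],
    done: Set[Tuple[str, str, str]],
) -> List[Tuple[str, str, str]]:
    """List (document, tool, version) triples that have not yet been processed.

    Adding a tool or bumping its version in `tools` makes every stored
    document pending again for that tool.
    """
    return [
        (doc_id, tool, version)
        for doc_id in doc_ids
        for tool, version in sorted(tools.items())
        if (doc_id, tool, version) not in done
    ]
```

In this sketch, the clock and sleep functions are injectable so that the rate limit can be exercised without real delays, and keying completed work by tool version is one simple way to ensure that an upgraded tool is run against the entire library.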