PALEODEEPDIVE: TOWARDS AN AUTOMATED SYSTEM FOR PALEONTOLOGICAL DATA DISCOVERY AND RETRIEVAL FROM THE PUBLISHED LITERATURE

PETERS, Shanan E.¹, ZHANG, Ce², ROSS, Ian² and LIVNY, Miron², (1)Department of Geoscience, University of Wisconsin–Madison, 1215 W Dayton St, Madison, WI 53706, (2)Computer Sciences, University of Wisconsin–Madison, Madison, WI 53706, peters@geology.wisc.edu

Recent developments in machine reading and learning approaches to text and data mining (TDM) hold considerable promise for accelerating the pace and improving the quality of literature-based data synthesis, but these advances have outpaced access to the published literature. For paleobiology, a discipline in which many important research questions require synthesizing data from multiple sources, this limitation is a significant handicap. Here we describe a computing infrastructure to support automated paleontological data recognition and extraction from the published literature. Specifically, our infrastructure supports automated, rate-controlled fetching of original documents and their full bibliographic citation metadata from remote servers; the secure storage of these original documents; and the use of considerable high-throughput computing resources to pre-process them with optical character recognition, natural language parsing, and other document annotation and parsing software tools. New tools, and new versions of existing tools, can be automatically deployed against all original documents as soon as the tools become available. The products of these software tools (text/XML files) are managed by MongoDB and are available for use in data extraction applications. Documents containing paleontologically relevant information are identified using a combination of ElasticSearch and document classifiers. TDM-ready data for these original documents are then retrieved and passed to the PaleoDeepDive (PDD) machine reading and learning system, which is used to populate a database structure similar to that of the Paleobiology Database. The PDD system is kept up to date as new relevant documents are incorporated into the digital library. Currently, our digital library contains more than 370,000 documents from Elsevier and the USGS, and we are actively seeking additional content providers.
By focusing on building a dependable infrastructure to support the retrieval, storage, and pre-processing of published content, we are establishing a foundation for complex and continually improving information integration and data extraction applications. The PaleoDeepDive application is one such system, and we invite collaborations for similar initiatives.
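
The abstract does not say how the rate-controlled fetching is implemented. As an illustration only, the idea can be sketched in Python as a minimum-interval limiter placed in front of a document fetcher. Every name here (RateLimiter, fetch_all, min_interval_s) is hypothetical and not taken from the PaleoDeepDive infrastructure, and the fetch function is supplied by the caller, so no particular publisher API is assumed.

```python
import time
from typing import Callable, Dict, Iterable, Optional


class RateLimiter:
    """Enforce a minimum interval between successive requests.

    Hypothetical helper for illustration; not part of the described system.
    """

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous request

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval_s - elapsed)
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


def fetch_all(
    doc_ids: Iterable[str],
    fetch: Callable[[str], bytes],
    limiter: RateLimiter,
) -> Dict[str, bytes]:
    """Fetch each document at most once, pausing between requests."""
    store: Dict[str, bytes] = {}  # stand-in for a real document store
    for doc_id in doc_ids:
        if doc_id in store:
            continue  # already retrieved; do not contact the server again
        limiter.wait()
        store[doc_id] = fetch(doc_id)
    return store
```

Calling fetch_all(ids, fetch, RateLimiter(2.0)) would issue at most one request every two seconds. The clock and sleep functions are injectable so the pacing logic can be checked without real delays.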

Session No. 239

T46. Using Digitized Data in Geological and Paleontological Research I

Tuesday, 3 November 2015: 1:30 PM-5:30 PM

Room 314 (Baltimore Convention Center)

Geological Society of America Abstracts with Programs. Vol. 47, No. 7, p. 613

© Copyright 2015 The Geological Society of America (GSA), all rights reserved. Permission is hereby granted to the author(s) of this abstract to reproduce and distribute it freely, for noncommercial purposes. Permission is hereby granted to any individual scientist to download a single copy of this electronic file and reproduce up to 20 paper copies for noncommercial purposes advancing science and education, including classroom use, providing all reproductions include the complete content shown here, including the author information. All other forms of reproduction and/or transmittal are prohibited without written permission from GSA Copyright Permissions.


2015 GSA Annual Meeting in Baltimore, Maryland, USA (1-4 November 2015)
