Paper No. 12-6
Presentation Time: 9:15 AM
TEXT-MINING THE BRYOZOAN FOSSIL RECORD
Paleontologists record an ever-increasing number of observations of fossil organisms. When studying the dynamics of speciation and extinction, researchers often rely on datasets compiled from the literature, such as those in the Paleobiology Database. While the Paleobiology Database is an excellent resource for many groups of fossil animals, its coverage of several groups, such as the Cenozoic bryozoans, has substantial gaps. Compiling observational data from the published literature is challenging and labor-intensive. Using Bryozoa as a case study, we apply natural language processing to extract the temporal distributions of fossils automatically. We perform named-entity recognition of bryozoan species names and geological time intervals in published articles and books using dictionaries of known names. Next, we apply supervised machine-learning techniques to discriminate between the observation and non-observation of a fossil species in a geological time interval, given their co-occurrence in a sentence. This type of information retrieval is reproducible from end to end, making tasks such as reference lookup and outlier inspection of the fossil record more transparent. Our preliminary results indicate that human and machine-based information retrieval are similarly accurate. Raw observed genus counts through time appear congruent, though not identical, with previous counts of richness in Cenozoic bryozoans. We present estimates of true richness and diversification rates obtained by applying capture-recapture approaches to our machine-compiled data. Remaining challenges include updating outdated taxonomic names, incorporating non-English texts, and acquiring large-volume documents in the presence of legal restrictions on intellectual property. We argue that this approach can be readily adapted to other groups of fossil animals and would be especially beneficial for groups that are under-represented in curated databases.
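
As an illustration only, the dictionary-based recognition and sentence-level co-occurrence steps described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the toy dictionaries, the naive sentence splitter, and the function names (`split_sentences`, `candidate_pairs`) are all hypothetical, and the supervised classifier that would label each candidate pair as an observation or non-observation is not shown.

```python
import re
from itertools import product

# Hypothetical toy dictionaries; a real run would load full lists of
# known bryozoan names and geological time intervals.
TAXA = {"Membranipora", "Cellaria", "Schizoporella"}
INTERVALS = {"Eocene", "Oligocene", "Miocene", "Pliocene"}


def _compile(names):
    """Build one whole-word regex from a dictionary of names."""
    # Longest names first, so longer entries win over shorter prefixes.
    ordered = sorted(names, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b")


TAXON_RE = _compile(TAXA)
INTERVAL_RE = _compile(INTERVALS)


def split_sentences(text):
    """Naive splitter (assumption): breaks after '.', '!' or '?'.

    It would mis-split abbreviated names such as 'M. membranacea';
    a proper sentence tokenizer would be needed in practice.
    """
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def candidate_pairs(text):
    """Return (taxon, interval, sentence) for every co-occurrence in a sentence."""
    pairs = []
    for sentence in split_sentences(text):
        taxa = sorted(set(TAXON_RE.findall(sentence)))
        intervals = sorted(set(INTERVAL_RE.findall(sentence)))
        for taxon, interval in product(taxa, intervals):
            pairs.append((taxon, interval, sentence))
    return pairs
```

Note that a sentence such as "Membranipora was absent from the Eocene beds but present in the Oligocene" yields two candidate pairs, only one of which is a true observation; this is why a classification step is needed after co-occurrence extraction.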