GSA Annual Meeting in Denver, Colorado, USA - 2016

Paper No. 105-10
Presentation Time: 10:15 AM

LINKING DATA SILOS VIA FUZZY MATCHING ALGORITHMS


LAUTERS, Jonathan David, Florida State University, 731 Rundell St, Iowa City, IA 52240 and NELSON, Gil, iDigBio, Florida State University, Tallahassee, FL 32306, jonathan.lauters@gmail.com

ePANDDA (enhanced PAleontological and Neontological Data Discovery API) is an EarthCube Integrative Activities project designed to increase the accessibility, linking, and discovery of paleontological and neontological data across existing siloed data stores. Initial collaborators include the Paleobiology Database (PBDB), iDigBio, and iDigPaleo. The ultimate goal of the project is to create an independent, transactional API that communicates with the APIs of participating databases, distributing and processing queries among them and returning formatted datasets. Parameter-driven, configurable RESTful API calls will allow search tools to be built that eliminate the need to search multiple data portals separately and then perform the time-consuming step of translating the results into a coherent dataset by hand. ePANDDA will enable users to leverage its data-matching logic to create tailored apps for visualization, outreach, and collaboration. This logic matches different types of data (e.g., specimens mentioned in publications to individual specimen records) through nontrivial means. Because of the complexity of these matching efforts and the volume of data, real-time access to the data providers' APIs proved impractical. Distributed computing practices are employed instead to perform bulk matching of the available datasets and to cache the resulting data. Identifiers (e.g., UUIDs, DOIs, ORCIDs) are becoming increasingly valuable and will be used to help establish relationships across data types. ePANDDA will use the OpenAnnotation model to push citation data back to iDigBio. Returned annotations will be incorporated into PBDB, iDigBio, and iDigPaleo, allowing each database to improve data completeness for its users while eliminating the need to replicate, duplicate, or mirror existing data across multiple data stores.
Future goals include building and demonstrating an innovative model for linking data, and providing avenues for bringing other collaborating databases online.
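
The bulk fuzzy-matching step described above can be illustrated with a minimal sketch. The abstract does not specify the matching algorithm, so this example substitutes Python's standard-library difflib.SequenceMatcher as a stand-in; the function names, the 0.85 threshold, the record identifiers, and the specimen labels are all hypothetical and are not taken from ePANDDA, PBDB, or iDigBio.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and replace punctuation with spaces so trivial differences are ignored."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def bulk_match(mentions, records, threshold=0.85):
    """Match each mention to its best-scoring record and cache the result.

    mentions:  iterable of free-text specimen mentions (hypothetical input)
    records:   dict mapping record id -> specimen label (hypothetical input)
    threshold: assumed cutoff below which a mention is left unmatched
    Returns a dict: mention -> (record_id, score).
    """
    cache = {}
    for mention in mentions:
        best_id, best_score = None, 0.0
        for record_id, label in records.items():
            score = similarity(mention, label)
            if score > best_score:
                best_id, best_score = record_id, score
        if best_id is not None and best_score >= threshold:
            cache[mention] = (best_id, best_score)
    return cache


# Hypothetical specimen records and publication mentions, for illustration only.
records = {
    "rec-001": "Examplesaurus primus XYZ 1234",
    "rec-002": "Samplodon minor XYZ 5678",
}
mentions = [
    "Examplesaurus primus (XYZ-1234)",  # differs only in punctuation
    "Samplodon minro XYZ 5678",         # contains a transposition typo
    "Unrelated text",                   # should stay unmatched
]
matches = bulk_match(mentions, records)
```

Comparing every mention against every record scales quadratically, which is one reason a precomputed, cached bulk match is more practical than matching at query time; a production system would likely add blocking or indexing to reduce the number of comparisons.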