Paper No. 280-5
Presentation Time: 9:00 AM
STRATEGIES FOR CURATING BIG DATA IN THE HETEROGENEOUS LONG TAIL: EXAMPLES FROM THE USGS SCIENCEBASE REPOSITORY AND SEAD DATA SERVICES
Properly curating big data remains a challenge for heterogeneous datasets in the long tail of research science data. Big data and long tail data used to sit at opposite ends of a spectrum, but with advances in data collection technology, long tail datasets can now fall into the category of big data. The creation of large datasets from high-resolution imagery and high-frequency sensors has been outgrowing capabilities for storing, maintaining, and serving the data; however, funder and publisher policies often require that all data supporting research products be made available. Data repositories are working with users to create solutions for curating these large datasets. We present two cases where capabilities have been developed in response to user needs to handle big data from the long tail: the ScienceBase repository and the SEAD data services. ScienceBase (sciencebase.gov) is a data platform developed and maintained by the U.S. Geological Survey (USGS) to provide shared, permission-controlled access to scientific data products and bureau resources. SEAD (sead-data.net) offers end-to-end data services for managing, sharing, and preserving data through partnering repositories across a broad range of physical and social science disciplines. Both platforms have developed tools for transferring large data collections that exceed the size that can currently be handled easily (tens of GB). We show examples from image- and video-heavy geomorphology datasets collected in the field and laboratory. This discussion of current strategies and challenges will help users understand their options for providing increased access to their valuable and unique big datasets.