AUTOMATED MULTI-DISCIPLINARY COLLECTION BUILDING
Geologic disciplines often need to combine both historic and current data, observations, and interpretations. Where data collections exist, they tend to be home-grown, often drawing on the isolated expertise of individual investigators, with minimal technological support. Shipboard data are expensive to acquire, at $25K per day, and generally impossible to re-create, so they are potentially valuable to a wide range of disciplines, far beyond the award that funded the original expedition. However, data must be discovered before they can be used, and appropriate metadata must be generated to support effective community-wide access.
SIOExplorer presents, as a scalable solution, a multi-disciplinary digital library of over 50 years of shipboard data. Based upon an extensible metadata scheme and implemented with technology from the San Diego Supercomputer Center, SIOExplorer enables discovery of data collected onboard Scripps Institution of Oceanography (SIO) research vessels. The SIOExplorer digital library consists of over 700 SIO cruises, with more than 100,000 digital objects, including datasets, documents and images.
The efforts are being extended to the collections of the Woods Hole Oceanographic Institution (WHOI) to include cruises, Alvin submersible dives and ROV lowerings. The technology also supports the efforts of thousands of scientists from dozens of nations with the Site Survey Data Bank of the Integrated Ocean Drilling Program (IODP).
Streamlined procedures have been developed to stage the data, extract metadata from data files, perform quality control and error correction, and publish metadata and data in a searchable digital library. Discovery tools search the metadata in a PostgreSQL database, and deliver the relevant objects from a Storage Resource Broker (SRB) instance. The system provides both a text-based AJAX web form and an interactive geographical Java interface.
Information Infrastructure
SIOExplorer is a unique collaboration among scientists, technologists, and educators. The lifecycle of information within the digital library includes data creation, quality control and data packaging, publication via the web, and data access, analysis, and use. To support each of these activities, a scalable scientific and technological infrastructure has been developed.
The infrastructure for the digital library is defined by a metadata scheme. Currently divided into eight sections, or blocks, the SIOExplorer metadata template file (mtf) comprehensively describes the scientific framework in which each digital object was created.
Figure 1. SIOExplorer Metadata Scheme. Every Arbitrary Digital Object (ADO) is associated with a metadata structure complete with descriptive information. Starred entries indicate fields that are managed with controlled vocabularies.
Interoperability and Controlled Vocabularies
Interoperability has long been a goal of digital library implementations. To facilitate collaboration, tools are developed that can be utilized by other projects, and/or adapted to specific implementation needs. One example is the recent development of the controlled vocabulary dictionary. In addition to a list of acceptable values for metadata parameters, the controlled vocabulary dictionary incorporates a human-readable explanation of each allowed value. Scientists, technologists and project managers can access this controlled vocabulary dictionary, which enables accurate interpretation of the metadata.
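As an illustration of the idea, a controlled vocabulary dictionary can be pictured as a mapping from each metadata parameter to its allowed values, with a human-readable explanation attached to every value. The sketch below is a minimal example under stated assumptions: the field name `data_type`, the terms, the explanations, and the helper functions are all hypothetical and are not taken from the SIOExplorer dictionary itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    """One allowed value plus the explanation shown to human readers."""
    value: str
    explanation: str


# Hypothetical entries, for illustration only.
VOCABULARY = {
    "data_type": [
        Term("bathymetry", "Seafloor depth measurements"),
        Term("gravity", "Shipboard gravimeter measurements"),
        Term("magnetics", "Towed or shipboard magnetometer measurements"),
    ],
}


def allowed_values(field: str) -> list:
    """Return the list of acceptable values for a metadata parameter."""
    return [term.value for term in VOCABULARY.get(field, [])]


def explain(field: str, value: str) -> Optional[str]:
    """Return the explanation for an allowed value, or None if not allowed."""
    for term in VOCABULARY.get(field, []):
        if term.value == value:
            return term.explanation
    return None
```

Keeping the explanation next to each value means the same structure can serve both software (which needs the list of values) and people (who need to know what each value means).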
As a member of the Marine Metadata Interoperability (MMI) project, SIO promotes the exchange, integration and use of marine data through enhanced data publishing, discovery, documentation and accessibility. The MMI community works to develop best practices and guidance on project development and implementation. Community members include metadata experts, scientific researchers, and project managers.
Auto Harvesting in an Imperfect World
Over the years, and across projects and disciplines, there is an unfortunate tendency for descriptive terminology within metadata to wander. Some of the variation is due to evolution in sensor technology, but some may be due to odd abbreviations, typographical errors on rolling decks, institutional practices, or a momentary inspiration to use a new term. As a consequence, we now face challenges in searching digital collections, and in designing re-usable tools that can be applied to multiple institutions.
Practical experience with SIOExplorer has enabled the development of techniques to assess variations in metadata values across collections. The assessment helps to guide the development of controlled vocabularies, which in turn can be used to enable automatic detection of metadata errors, and in some cases automatic correction.
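The assessment and correction steps described above can be sketched in a few lines. This is not the SIOExplorer implementation: the allowed terms, the function names, and the similarity cutoff are assumptions chosen for illustration, and the fuzzy matching from Python's standard `difflib` module stands in for whatever matching rules a real project would adopt.

```python
from collections import Counter
import difflib

# Hypothetical controlled vocabulary for one metadata field.
ALLOWED = ["bathymetry", "gravity", "magnetics"]


def assess(values):
    """Count how often each (case- and whitespace-normalized) value occurs."""
    return Counter(v.strip().lower() for v in values)


def unrecognized(counts, allowed):
    """Return the observed values that are not in the controlled vocabulary."""
    return {value: n for value, n in counts.items() if value not in allowed}


def correct(value, allowed, cutoff=0.8):
    """Return the allowed term for a value, or None if no confident match.

    Exact matches pass through; near misses such as typographical errors
    are mapped to the closest allowed term above the similarity cutoff.
    """
    normalized = value.strip().lower()
    if normalized in allowed:
        return normalized
    matches = difflib.get_close_matches(normalized, allowed, n=1, cutoff=cutoff)
    return matches[0] if matches else None


harvested = ["Bathymetry", "bathymetry ", "Bathymetery", "BATHY", "magnetcs"]
counts = assess(harvested)
suspects = unrecognized(counts, ALLOWED)
fixes = {value: correct(value, ALLOWED) for value in suspects}
```

With this cutoff, simple misspellings such as "bathymetery" and "magnetcs" resolve to an allowed term, while the abbreviation "bathy" returns None and is left for human review, which mirrors the distinction between automatic correction and automatic detection.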
Controlled vocabularies underlie an emerging set of tools that support web user interfaces, large-scale automatic harvesting of metadata and data, project status assessment, workflow management and overall quality control. They are a key resource for the user upload code in the IODP Site Survey Data Bank, which prompts for and enforces appropriate metadata values for ocean drilling proposal support data. Compared to previous generations of hard-wired code, access to controlled vocabularies allows a project to evolve with flexibility, and allows the code to be ported from one project to another.
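The contrast with hard-wired code can be made concrete with a small sketch of upload validation. Here the vocabulary is passed in as data, so the same function could serve a different project simply by supplying a different vocabulary. The field names, terms, and the `validate` function are hypothetical and do not reflect the actual Site Survey Data Bank code.

```python
# Hypothetical vocabulary: field name -> {allowed value: explanation}.
UPLOAD_VOCABULARY = {
    "data_type": {
        "bathymetry": "Seafloor depth measurements",
        "seismic reflection": "Seismic reflection profiles",
    },
    "file_format": {
        "netcdf": "Network Common Data Form file",
        "segy": "SEG-Y seismic data file",
    },
}


def validate(record, vocabulary):
    """Check an upload record against a vocabulary.

    Returns a list of messages; an empty list means the record is accepted.
    Each message lists the allowed values so the user can be prompted.
    """
    problems = []
    for field, terms in vocabulary.items():
        choices = ", ".join(sorted(terms))
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing; choose one of: {choices}")
        elif value not in terms:
            problems.append(f"{field}: '{value}' is not allowed; choose one of: {choices}")
    return problems


accepted = validate({"data_type": "bathymetry", "file_format": "netcdf"}, UPLOAD_VOCABULARY)
rejected = validate({"data_type": "bathy"}, UPLOAD_VOCABULARY)
```

Because no field name or allowed value appears inside `validate`, adding a term or a whole new field is a change to the vocabulary alone, not to the code.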
Related links
- SIOExplorer: http://SIOExplorer.ucsd.edu
- Site Survey Data Bank: http://ssdb.iodp.org/
- Marine Metadata Interoperability: http://marinemetadata.org/
- Storage Resource Broker: http://www.sdsc.edu/srb/index.php/