Geoinformatics 2007 Conference (17–18 May 2007)

Paper No. 14
Presentation Time: 11:45 AM

SODA — A SELF-SERVICE ONLINE DIGITAL ARCHIVE FOR UNLOVED SCIENTIFIC DATA


SANDERS, Rex, USGS, 400 Natural Bridges Drive, Santa Cruz, CA 95060, rsanders@usgs.gov

SODA (Self-service Online Digital Archive) is a project under development at USGS for archiving "unloved" scientific data.

Background

USGS collects many thousands of gigabytes of new data annually. Many data types have well-defined processing and archiving paths, but many do not — our so-called "unloved data". Unloved data types usually fall into two classes: data types that have not traditionally shown national significance (e.g., marine sediment analyses), and data types created from new technology and research (e.g., airborne and land-based LIDAR surveys). Unloved data are difficult to find, difficult to access, and often vanish completely upon the retirement or departure of key scientists and technicians.

Scientists with the best intentions frequently fail to archive their data well. One scientist carefully created and labeled three copies of digital core photos on a total of 90 CD-R disks. Three years later, none of these disks were readable, because the label adhesive had corrupted the data layer. As a non-professional archivist, he did not know the risk associated with using sticky labels on CD-R disks.

Goal and Use Cases

The SODA project aims to make archiving unloved scientific data easier than burning another CD-R.

We are building our system around two use cases — archiving data and finding data.

To archive data:

  1. Point your web browser to "soda.usgs.gov", and click the "Submit Data" button.
  2. Fill out a form with the data type, format, and minimum metadata (or more), and select a public release policy.
  3. Upload your data.
  4. Get a "permanent" link pointing to the data and metadata, with initial internal-only access.
  5. Archivist reviews the data, metadata, and release policy.
  6. If the review passes, the archivist enables public access to the data and metadata according to the release policy. If the review fails, the archivist contacts you for corrections.
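
The archiving steps above can be read as a small state change on each submission: internal-only access after upload, public access only after a passed review. The following Python sketch is an illustration under stated assumptions, not the SODA implementation. The names Submission, Access, submit, and review, the link scheme, and the "public" policy value are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    INTERNAL = "internal-only"  # step 4: initial access after upload
    PUBLIC = "public"           # step 6: only after a passed review


@dataclass
class Submission:
    """Hypothetical record for one archived data set."""
    link: str
    data_type: str
    data_format: str
    metadata: dict
    release_policy: str
    access: Access = Access.INTERNAL
    notes: list = field(default_factory=list)


def submit(archive, data_type, data_format, metadata, release_policy):
    """Steps 2-4: store the record and return it with its permanent link."""
    # The link scheme below is made up for illustration.
    link = f"https://soda.usgs.gov/data/{len(archive) + 1}"
    record = Submission(link, data_type, data_format, metadata, release_policy)
    archive[link] = record
    return record


def review(record, passed, comment=""):
    """Steps 5-6: enable public access, or record a request for corrections."""
    if not passed:
        record.notes.append(comment)      # archivist contacts the submitter
    elif record.release_policy == "public":
        record.access = Access.PUBLIC     # release according to the policy
    return record
```

In this sketch a failed review leaves the record internal-only and stores the archivist's comment, and a passed review changes access only when the chosen policy allows public release.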

To find data:

  1. Point your web browser to "soda.usgs.gov", and click the "Find Data" button.
  2. Fill out a form to search for data using any metadata fields, including geographic region, data type, or author.
  3. Get links to download the data and metadata directly to your desktop.
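
A search over metadata fields, as in step 2 above, might look like the following Python sketch. It assumes each record is a dictionary with a "link", a "location" given as (longitude, latitude), and free-form metadata fields. These field names and the functions in_region and find are hypothetical, chosen only for illustration.

```python
def in_region(point, region):
    """True if (lon, lat) lies inside region = (west, south, east, north)."""
    lon, lat = point
    west, south, east, north = region
    return west <= lon <= east and south <= lat <= north


def find(records, region=None, **fields):
    """Return the links of records that match every given metadata field."""
    hits = []
    for rec in records:
        # Every requested field must match exactly.
        if any(rec.get(key) != value for key, value in fields.items()):
            continue
        # The optional bounding box is checked against the record location.
        if region is not None and not in_region(rec["location"], region):
            continue
        hits.append(rec["link"])
    return hits
```

For example, find(records, data_type="LIDAR", region=(-125, 32, -114, 42)) would return links only for LIDAR records located inside that bounding box.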

Design Features

SODA will have other features, including:

  • Designed for users — using SODA must be easier than burning another CD-R.
  • Designed for re-use by other web-based tools, including ArcGIS, Geospatial One-Stop, MRIB (http://mrib.usgs.gov), InfoBank (http://octopus.wr.usgs.gov/infobank), etc.
  • Designed for longevity using open standards and simple technology.
  • Designed to scale to large numbers of very large files.
  • Separate searches for USGS-only and public data and metadata.
  • Archivists can easily add new data types, data formats, metadata forms, and release policies.

Benefits

Some of the anticipated benefits of SODA include:

  • Improves access to data and metadata by USGS scientists and the public.
  • Scientists and technicians can archive data easily and immediately.
  • Improves data preservation.
  • Reduces data rescue.
  • Scientists can cite permanent links in published papers.
  • Scientists and technicians don't need to respond to data requests.

SODA is intended to be the scientific data archive of last resort, dependent on the cooperation of overworked scientists and technicians to keep valuable data from being lost forever. As such, we cannot require very much effort to archive the data; the process must be simple and self-explanatory. Our reduced metadata and approval requirements disappoint many mainstream data archivists, but capturing more data with some metadata is better than capturing no data.

SODA is not intended to replace any other USGS data archiving mechanism, including Open File Reports, Data Series, or online databases like NWIS (http://waterdata.usgs.gov/nwis).

Technical Design

The SODA technical design is based on several principles:

  • SODA servers will be organizationally and geographically distributed: locally run, but centrally searchable.
  • SODA servers will be easy to set up and run by local IT personnel, using inexpensive hardware and software.
  • SODA is much more than hardware and software — SODA will include processes and procedures to ensure the longevity of the data archive.
  • A SODA "cookbook" will enable IT personnel to set up and run a SODA server with little outside support.
  • A central SODA server will enable searches and retrieval across the distributed SODA servers.
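
The "locally run, but centrally searchable" principle can be sketched as a central server that forwards one query to every local server and merges the answers. The Python sketch below is an assumption about how such a fan-out could work, not a description of the planned central search system. Each local server is modeled as a plain function, and central_search is a hypothetical name.

```python
def central_search(servers, query):
    """Send one query to every local server and merge the results.

    servers maps a site name to a function that takes the query and
    returns a list of links. A site that cannot be reached is reported
    separately instead of failing the whole search.
    """
    hits, unreachable = [], []
    for name, search in servers.items():
        try:
            hits.extend((name, link) for link in search(query))
        except ConnectionError:
            unreachable.append(name)
    return hits, unreachable
```

Reporting unreachable sites separately lets a search still return results when one locally run server is offline.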

Current Status of the SODA Project

SODA has been under development with minimal funding since early 2006. After surveying commercial and open-source projects, we concluded that writing our own software and designing our own system would best meet our needs. We have a core developer team with three members, and an email-based advisory group with about 45 members, all working at the USGS.

We have an initial, non-archival prototype running. We are using the prototype to work out many technical, user-interface, process, and procedural issues.

We anticipate release of our first production SODA server by the end of 2007. A few months after that release, we anticipate release of the SODA "cookbook" to enable other sites to set up and run local SODA servers. Development of the central search system and other SODA features is unscheduled and depends on the acquisition of further resources.

We will consider joint development of SODA with non-USGS partners.