2008 Geoinformatics Conference (11-13 June 2008)
Paper No. 5-4
Presentation Time: 10:40 AM-11:00 AM

LONG-TERM AVAILABILITY OF GEOSCIENCE DATA

KLUMP, Jens, GeoForschungsZentrum-Potsdam (DRZ), Telegrafenberg, Potsdam 14473 Germany, jklump@gfz-potsdam.de

In the last decade, research in the geological sciences has produced vast amounts of new data. In some cases it is the enormous volume of data that poses a technical challenge; in other cases it is their semantic complexity. Whatever the volume and format of the data may be, geoscience data are characterized by their origin in a heterogeneous and dynamic research environment. In contrast to a business or administrative context, scientific workflows are characterized by ad hoc changes that become necessary through the incorporation of new results into experimental working hypotheses (Barga and Gannon 2007).

To the individual scientist, data curation is not the focus of scientific work, and there are few incentives for scientists to make data accessible for re-use or re-purposing. Only a few science funding agencies ask grant recipients to make their data accessible, and even fewer journals make data access a prerequisite for publication. Furthermore, the roles and responsibilities in long-term curation of scientific data still need to be resolved (Lyon 2007). This situation leads to deficits in data management that put large portions of our scientific heritage at risk of loss. In addition, the inaccessibility of data might have a negative impact on the quality of research (Nature Editorial 2006).

Achieving sustainable long-term accessibility and re-usability of research data requires a combination of organizational and technical measures. On the organizational side, data curation needs to become an integral part of good scientific practice; at the same time, geoinformatics has to develop tools that facilitate the tasks needed for efficient and sustainable data curation. Several declarations by governmental, non-governmental and scientific bodies have called for open access to data and for better accountability for the long-term preservation of data, but with little success. Several studies (Lyon 2007, Klump 2008, and others) have investigated the requirements for long-term preservation of digital research materials. These studies also report on best-practice examples from existing data repositories.

A key to devising effective and sustainable strategies for the long-term preservation and accessibility of research data is to define “Levels of Persistence” in the data curation process and its supporting technical architecture. The idea is to distinguish between the domain of active research, where curation is the responsibility of the scientists, and the long-term preservation domain, where responsibility and expertise lie with the “memory institutions” (library, data center). These domains are not discrete but rather form the end-points of a “curation continuum” (Treloar, Groenewegen and Harboe-Ree 2007). Some scientific disciplines already have repositories for their data, but for the majority of researchers no data repositories exist. Best-practice examples of disciplinary data repositories are the World Data Centers (WDC) of the International Council for Science (ICSU). The use of institutional repositories for data curation at an institutional level is still a relatively recent idea (Lyon 2007; Treloar, Groenewegen and Harboe-Ree 2007).

In most cases, the existing disciplinary data repositories are not integrated into the scientific workflow, so only a small proportion of the data is archived in disciplinary repositories. This break in the workflow is also reflected in the problems observed in the generation and curation of metadata. More research needs to be done to determine which kinds of metadata are needed at which level of data curation (Treloar, Groenewegen and Harboe-Ree 2007), and how metadata can be generated automatically in the data curation processes (Robertson 2006).

The heterogeneity of data in the geological sciences requires special attention to data and file formats. Not all formats that are popular among scientists are suitable for long-term preservation (Lormant et al. 2005). This also means that preservation metadata need to encode more about the data format than just its MIME type. Data curation and long-term preservation of digital research data are motivated by the aim to re-use and re-purpose research data that already exist. This will only happen if the use and citation of data become part of scientific culture. Without demand from scientists, none of the data repositories can be operated on a sustainable basis. This requires that their data holdings can be found through catalogs and portals, and that the published data can be cited.
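The remark above that preservation metadata must record more than a MIME type can be illustrated with a minimal sketch. It is not taken from any of the cited studies: the record layout, the field names, the `describe` helper and the two example files are all hypothetical, chosen only to show that two files with the same MIME type may still need different format descriptions.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormatRecord:
    """Hypothetical preservation-metadata record; field names are illustrative."""
    filename: str
    mime_type: Optional[str]       # coarse media type guessed from the file name
    format_name: str               # specific format, supplied by the data producer
    format_version: Optional[str]  # version of the format specification, if any
    encoding: Optional[str]        # character encoding for text-based formats
    sha256: str                    # checksum for later integrity checks


def describe(filename, content, format_name, format_version=None, encoding=None):
    """Build a record that keeps format details a MIME type alone would lose."""
    mime_type, _ = mimetypes.guess_type(filename)
    return FormatRecord(
        filename=filename,
        mime_type=mime_type,
        format_name=format_name,
        format_version=format_version,
        encoding=encoding,
        sha256=hashlib.sha256(content).hexdigest(),
    )


# Two invented example files: same extension, very different structure.
grid = describe("elevation.txt", b"ncols 2\nnrows 1\n1 2\n",
                format_name="ASCII raster grid", encoding="US-ASCII")
table = describe("samples.txt", b"depth_m\tage_ka\n0.5\t1.2\n",
                 format_name="tab-delimited table", encoding="UTF-8")

# The MIME type is identical, so only the additional fields tell the formats apart.
print(grid.mime_type == table.mime_type, grid.format_name != table.format_name)
```

In practice a repository would probably draw the format name and version from a controlled vocabulary or format registry rather than free text; the free-text fields here are a simplification.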

Because Uniform Resource Locators (URLs) are transient, they are not suitable as a means of referencing data for the purpose of citation. The shortcomings of URLs are overcome by the use of persistent identifiers, such as Digital Object Identifiers (DOI) and Uniform Resource Names (URN) (Altman and King 2007).

The introduction to the Open Archival Information System reference model (OAIS, ISO 14721:2003) describes a digital archive as “an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community.” Data curation and long-term preservation of scientific data are therefore not only a technical issue, but also need an appropriate organizational framework. The increasing importance of data in the scientific process will, in the future, also highlight the importance of coherent and sustainable long-term data curation.

Altman, M. and King, G., 2007. A Proposed Standard for the Scholarly Citation of Quantitative Data. D-Lib Magazine, v. 13, no. 3/4. Available at: doi:10.1045/march2007-altman.

Barga, R. and Gannon, D.B., 2007. Scientific versus business workflows. In I. J. Taylor et al. (Eds.), Workflows for e-Science. London, UK: Springer-Verlag, p. 9-16.

Klump, J., 2008. Anforderungen von e-Science und Grid-Technologie an die Archivierung wissenschaftlicher Daten, Göttingen, Germany: Kompetenznetzwerk Langzeitarchivierung (nestor).

Lormant, N. et al., 2005. How to Evaluate the Ability of a File Format to Ensure Long-Term Preservation for Digital Information? In Ensuring Long-term Preservation and Adding Value to Scientific and Technical data (PV 2005). Edinburgh, UK, p. 11. Available at: http://www.ukoln.ac.uk/events/pv-2005/pv-2005-final-papers/003.pdf.

Lyon, L., 2007. Dealing with Data: Roles, Rights, Responsibilities and Relationships, UKOLN, Bath, UK.

Nature Editorial, 2006. A fair share. Nature, v. 444, no. 7120, p. 653-654.

Robertson, R.J., 2006. Evaluation of metadata workflows for the Glasgow ePrints and DSpace services, University of Strathclyde, Glasgow, UK.

Treloar, A., Groenewegen, D. and Harboe-Ree, C., 2007. The Data Curation Continuum - Managing Data Objects in Institutional Repositories. D-Lib Magazine, v. 13, no. 9/10.

Session No. 5
Geoinformatics Oral Session III
GeoForschungsZentrum Potsdam, Building H: Main Lecture Theater
9:00 AM-4:20 PM, Friday, 13 June 2008


© Copyright 2008 The Geological Society of America (GSA), all rights reserved. Permission is hereby granted to the author(s) of this abstract to reproduce and distribute it freely, for noncommercial purposes. Permission is hereby granted to any individual scientist to download a single copy of this electronic file and reproduce up to 20 paper copies for noncommercial purposes advancing science and education, including classroom use, providing all reproductions include the complete content shown here, including the author information. All other forms of reproduction and/or transmittal are prohibited without written permission from GSA Copyright Permissions.