2008 Geoinformatics Conference (11-13 June 2008)

Paper No. 4
Presentation Time: 10:40 AM

LONG-TERM AVAILABILITY OF GEOSCIENCE DATA


KLUMP, Jens, Data Center, German Research Centre for Geosciences (GFZ), Telegrafenberg, Potsdam, 14473, Germany, jens.klump@gfz-potsdam.de

Introduction

In the last decade, research in the geological sciences has produced vast amounts of new data. In some cases it is the enormous volume of data that poses a technical challenge; in other cases it is their semantic complexity. Whatever the volume and format of the data may be, geoscience data are characterized by their origin in a heterogeneous and dynamic research environment. In contrast to a business or administrative context, scientific workflows are characterized by ad hoc changes that become necessary through the incorporation of new results into experimental working hypotheses (Barga & Gannon, 2007).

To the individual scientist, data curation is not the focus of scientific work, and there are few incentives for scientists to make data accessible for re-use or re-purposing. Only a few science funding agencies ask grant recipients to make their data accessible, and even fewer journals make data access a prerequisite for publication. Furthermore, the roles and responsibilities in long-term curation of scientific data still need to be resolved (Lyon, 2007). This situation leads to deficits in data management that put large portions of our scientific heritage at risk of loss. In addition, the inaccessibility of data might have a negative impact on the quality of research (Nature Editorial, 2006).

Achieving sustainable long-term accessibility and re-usability of research data requires a combination of organizational and technical measures. On the organizational side, data curation needs to become an integral part of good scientific practice. At the same time, geoinformatics has to develop tools that facilitate the tasks needed for efficient and sustainable data curation.

Organizational Strategies

Several declarations by government, non-government and scientific bodies have called for open access to data and for better accountability for the long-term preservation of data, but with little success. Several studies (Lyon, 2007; Klump, 2008; and others) have investigated the requirements for long-term preservation of digital research materials. These studies also report on best-practice examples from existing data repositories.

A key to devising effective and sustainable strategies for the long-term preservation and accessibility of research data is to define "Levels of Persistence" in the data curation process and its supporting technical architecture. The domains of collaboration and publication in dealing with research data are not discrete but rather form the end-points of a "curation continuum". The implementation of data curation processes, however, requires the definition of a boundary between the two domains to be able to distinguish roles and responsibilities of the actors in the data curation processes. The idea is to distinguish the domain of active research, where curation is the responsibility of the scientists (collaboration domain), and the long-term preservation domain (publication domain), where responsibility and expertise lie with the "memory institutions" (library, data center) (Treloar et al., 2007). The diagram in Figure 1 shows the two data curation domains and the "curation boundary" with its interface between a university's research groups and its memory institution.


Figure 1: Schematic diagram of data curation in the collaboration domain and in the publication domain. Note that objects in the publication domain require more comprehensive metadata, while in the collaboration domain many metadata are available only as implicit information.

Technological Strategies

Some scientific disciplines already have repositories for their data, but for the majority of researchers no data repositories exist. Best practice examples for disciplinary data repositories are the World Data Centers (WDC) of the International Council for Science (ICSU). The use of institutional repositories for data curation on an institutional level is still a relatively recent idea (Lyon, 2007; Treloar et al., 2007).

In most cases, the existing disciplinary data repositories are not integrated into the scientific workflow, so that only a small proportion of the data is archived in disciplinary repositories. This break in the workflow is also reflected in the problems observed in the generation and curation of metadata. More research needs to be done to determine which kinds of metadata are needed at which level of data curation (Treloar et al., 2007), and how metadata can be generated automatically in the data curation processes (Robertson, 2006).

The heterogeneity of data in the geological sciences requires special attention to data and file formats. Not all formats that are popular among scientists are suitable for long-term preservation (Lormant et al., 2005). This also means that preservation metadata need to encode more about the data format than just its MIME type.
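As an illustration of what a format description beyond the MIME type might contain, the following is a minimal sketch in Python. It is not taken from any existing repository software: the record layout (FormatRecord), the function name (describe) and the small signature table are assumptions made for this example. A production system would consult a full format registry rather than a hand-written table.

```python
import hashlib
import mimetypes
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately tiny signature table for illustration:
# leading "magic" bytes -> (format name, format version or None).
SIGNATURES = {
    b"\x89HDF\r\n\x1a\n": ("HDF5", None),
    b"CDF\x01": ("NetCDF classic", "1"),
    b"CDF\x02": ("NetCDF 64-bit offset", "2"),
    b"%PDF-": ("PDF", None),
}


@dataclass
class FormatRecord:
    """Assumed layout of a minimal preservation metadata record."""
    filename: str
    mime_type: Optional[str]       # coarse type guessed from the file name
    format_name: Optional[str]     # finer-grained format from the signature
    format_version: Optional[str]  # version, if the signature encodes one
    size_bytes: int
    sha256: str                    # fixity information for later integrity checks


def describe(filename: str, content: bytes) -> FormatRecord:
    """Build a format record from a file name and the file's bytes."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    format_name = None
    format_version = None
    for magic, (name, version) in SIGNATURES.items():
        if content.startswith(magic):
            format_name, format_version = name, version
            break
    return FormatRecord(
        filename=filename,
        mime_type=mime_type,
        format_name=format_name,
        format_version=format_version,
        size_bytes=len(content),
        sha256=hashlib.sha256(content).hexdigest(),
    )


if __name__ == "__main__":
    # Twelve bytes that start with the NetCDF classic signature.
    print(describe("profile.nc", b"CDF\x01" + b"\x00" * 8))
```

The sketch shows that two files with the same MIME type can differ in format variant and version, which is the kind of detail a later migration or emulation step would need.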

Re-use and Re-purposing of Data

Data curation and long-term preservation of digital research data are motivated by the aim to re-use and re-purpose research data that already exist. This will only happen if the use and citation of data become part of scientific culture. Without demand from scientists none of the data repositories can be operated on a sustainable basis. This requires that their data holdings can be found through catalogs and portals, and that the published data can be cited (Klump et al., 2006).

Because Uniform Resource Locators (URLs) are transient, they are not suitable as a means of referencing data for the purpose of citation. The shortcomings of URLs are overcome by the use of persistent identifiers, such as Digital Object Identifiers (DOI) and Uniform Resource Names (URN) (Altman & King, 2007).
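The separation between a persistent identifier and its current location can be sketched in a few lines of Python. This is a sketch under stated assumptions: the function names (doi_to_url, format_citation), the simplified DOI pattern and the citation layout are hypothetical, and the resolver address https://doi.org/ is assumed as the resolution service. Only the DOI itself would be stored in a citation; the URL is derived when needed.

```python
import re
from urllib.parse import quote

# Simplified pattern: prefix "10.", a registrant code, "/", then a suffix.
# Real DOI syntax is more permissive; this is an assumption for the example.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Assumed resolver base address.
RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI such as 'doi:10.2481/dsj.5.79' into a resolver URL."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path.
    return RESOLVER + quote(doi, safe="/")


def format_citation(authors: str, year: int, title: str,
                    publisher: str, doi: str) -> str:
    """Hypothetical citation layout that carries the DOI, not a URL."""
    doi_to_url(doi)  # validate only; raises ValueError on malformed input
    bare = doi.strip()
    if bare.lower().startswith("doi:"):
        bare = bare[4:]
    return f"{authors} ({year}), {title}, {publisher}. doi:{bare}"


if __name__ == "__main__":
    print(doi_to_url("doi:10.2481/dsj.5.79"))
```

If the data set moves to a different server, only the record held by the resolver changes; the citation string stays valid.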

Conclusions

The introduction to the Open Archival Information System reference model (OAIS, ISO 14721:2003) describes a digital archive as "an organization of people and systems, that has accepted the responsibility to preserve information and make it available for a Designated Community". Data curation and long-term preservation of scientific data are therefore not only a technical issue, but also need an appropriate organizational framework.

Successful approaches to long-term availability of data need to recognise the roles and responsibilities in the data curation process. The identification of actors in the process is necessary to identify the right tools and incentives that are necessary components of technical and organisational strategies for long-term availability of data.

Literature

Altman, Micah, and Gary King (2007), A Proposed Standard for the Scholarly Citation of Quantitative Data, D-Lib Magazine, 13(3/4). doi:10.1045/march2007-altman

Barga, Roger, and Dennis B Gannon (2007), Scientific versus business workflows, in Workflows for e-Science, Eds.: I. J. Taylor, E. Deelman, et al., pp. 9-16, Springer-Verlag, London, UK.

Klump, Jens (2008), Anforderungen von e-Science und Grid-Technologie an die Archivierung wissenschaftlicher Daten, Kompetenznetzwerk Langzeitarchivierung (nestor), Göttingen, Germany.

Klump, Jens, Roland Bertelmann, et al. (2006), Data publication in the Open Access Initiative, Data Science Journal, 5, 79-83. doi:10.2481/dsj.5.79

Lormant, Nicolas, Claude Huc, et al. (2005), How to Evaluate the Ability of a File Format to Ensure Long-Term Preservation for Digital Information?, in Ensuring Long-term Preservation and Adding Value to Scientific and Technical data (PV 2005), p. 11, Edinburgh, UK. [online] Available from: http://www.ukoln.ac.uk/events/pv-2005/pv-2005-final-papers/003.pdf

Lyon, Liz (2007), Dealing with Data: Roles, Rights, Responsibilities and Relationships, consultancy report, UKOLN, Bath, UK. [online] Available from: http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/reports/dealing_with_data_report-final.pdf

Nature Editorial (2006), A fair share, Nature, 444(7120), 653-654.

Robertson, R. John (2006), Evaluation of metadata workflows for the Glasgow ePrints and DSpace services, University of Strathclyde, Glasgow, UK.

Treloar, Andrew, David Groenewegen, et al. (2007), The Data Curation Continuum - Managing Data Objects in Institutional Repositories, D-Lib Magazine, 13(9/10), 13. doi:10.1045/september2007-treloar