SEMANTICALLY ENABLED REGISTRATION AND INTEGRATION ENGINES (SEDRE AND DIA) FOR THE EARTH SCIENCES
We present both the justification for and a development initiative to design and implement a pair of service engines that use ontologies for semantically enabled discovery and integration of structurally heterogeneous earth science data. We also emphasize that the capabilities of these engines are likely to be transformative for earth science research and education. Our motivation in developing these engines is the recognized need to discover new knowledge by advancing semantic capabilities that can bridge disciplines.
Scientific studies of the Earth and solar system have resulted in massive volumes of data. However, most of the data sets are isolated from each other, and the ability to use these heterogeneous, disciplinary data to generate new knowledge has been limited. In our ongoing research to facilitate seamless exchange of heterogeneous data, we have developed a web-based system, DIA (Discovery, Integration and Analysis), that enables scientists to use ontologies to discover, integrate, and analyze Earth science data (Rezgui and others, 2007). In this paper, we present SEDRE (Semantically-Enabled Data Registration Engine), a system that complements DIA by enabling scientists to use ontologies to advertise their data sets so that they may be automatically discovered. We first summarize our efforts in ontology development for Earth sciences and then present SEDRE to show how it uses ontologies for data registration.
Ontology Development for Earth Sciences
The role of ontologies in enabling semantic integration is well established (Malik and others, 2007), as ontologies enable a community to associate well-defined, commonly accepted meanings with data. In recent years, several research efforts have recognized the potential of ontologies in promoting data integration in Earth sciences (Sinha and others, 2007). Several ontologies are being developed, e.g., SWEET, as well as the more data-oriented Earth and Planetary ONTology (EPONT) developed by Sinha and others (2007). EPONT imports and inherits properties from existing ontologies. The availability of these data ontologies is likely to have a significant impact in promoting intra- and inter-disciplinary interoperability.
Overview of DIA
Geoscientists have generated massive volumes of earth science data for decades. Most of the produced data, however, remain isolated knowledge islands; the ability to find, access, and properly interpret these large data repositories has been very limited because of the absence of data sharing infrastructures for advertising data, as well as the lack of a common language for properly interpreting other providers' data. This has precluded the meaningful use of the available data in answering complex questions that require information from several data sources. To address this problem, we have developed DIA, which provides a collaborative environment where scientists can share their resources for discovery and integration by registering them through well-defined ontologies (Sinha and others, 2007).
The DIA engine (Rezgui and others, 2007) provides three classes of functionalities: discovery, integration, and analysis. Data discovery enables users to retrieve data sets, while data integration enables users to query multiple resources along some common attributes to generate previously unknown information called data products.
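The idea of integrating resources "along some common attributes" can be illustrated with a minimal sketch. This is not DIA's implementation; the function name `integrate`, the attribute `sample_id`, and the sample values are all hypothetical and chosen only for illustration. The sketch joins two small resources on a shared attribute to yield a combined record that neither resource holds on its own.

```python
from collections import defaultdict


def integrate(left, right, on):
    """Inner-join two lists of records on the shared attributes named in `on`."""
    # Index the right-hand resource by the values of the common attributes.
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in on)].append(row)

    # Combine every left-hand record with each matching right-hand record.
    product = []
    for row in left:
        for match in index.get(tuple(row[k] for k in on), []):
            product.append({**match, **row})
    return product


# Hypothetical resources from two providers (values are made up).
geochemistry = [
    {"sample_id": "S1", "SiO2": 72.1},
    {"sample_id": "S2", "SiO2": 49.8},
]
ages = [
    {"sample_id": "S1", "age_ma": 310},
    {"sample_id": "S3", "age_ma": 1100},
]

data_product = integrate(geochemistry, ages, on=["sample_id"])
print(data_product)
```

Only sample S1 appears in both resources, so the resulting data product contains a single record carrying both the composition and the age.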
DIA's architecture consists of five components:
(i) User Interface: This is an ArcGIS Server .NET map viewer Web application.
(ii) Web Servers: DIA uses two Web servers. The first is responsible for routing users' queries to DIA's query processor, and the second ensures communication between DIA's query processor and its own map server.
(iii) Map Server: This component is an ArcGIS map server that provides maps to DIA's query processor.
(iv) Registry Servers: These servers provide directory functionalities (registration of data and tools, indexing, etc.) that providers use to advertise their resources.
(v) Query Processor: It is responsible for producing the results for users' queries and delivering them to the Web server.
The DIA engine identifies and retrieves the resources required to answer the user's query, and is linked to the SEDRE engine, which registers data for retrieval.
Overview of SEDRE
Semantic data registration is a precursor to improved data discovery and integration. To date, the majority of integrative solutions have been hindered by the adoption of personal acronyms, notations, etc., which make it difficult for other scientists to correctly understand the semantics of the produced data. To address this concern, SEDRE was developed with the goal of allowing researchers to associate one or more ontologies with their data files so that a unique and definite meaning is associated with each column.
SEDRE facilitates discovery through resource registration at three levels:
(i) Keyword-based registration: Discovery of data resources (e.g., gravity, geologic maps) requires registration through the use of high-level index terms.
(ii) Ontological class-based registration: Discovering item-level databases requires registration against data-level ontologies.
(iii) Item detail level registration: Item detail level, or fine-grained, registration consists of associating a column in a database with a specific concept or attribute of an ontology, thus allowing the resource to be queried using concepts instead of actual values. This level of registration is a requirement for semantic integration, i.e., the automatic processing (by tools) of shared data.
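As a rough illustration of the three registration levels, the following sketch models a registration record and one lookup per level. It is an assumption-laden example rather than SEDRE's actual data model: the class `Registration`, the lookup functions, the file names, and the class and concept labels are hypothetical placeholders and are not taken from EPONT.

```python
from dataclasses import dataclass, field


@dataclass
class Registration:
    """One registered resource, described at the three levels listed above."""
    resource: str
    keywords: set = field(default_factory=set)           # (i) high-level index terms
    classes: set = field(default_factory=set)            # (ii) ontology classes
    column_concepts: dict = field(default_factory=dict)  # (iii) column -> concept


def find_by_keyword(registry, keyword):
    """Level (i): resources advertised under a given index term."""
    return [r.resource for r in registry if keyword in r.keywords]


def find_by_class(registry, ontology_class):
    """Level (ii): resources registered against a given ontology class."""
    return [r.resource for r in registry if ontology_class in r.classes]


def find_columns(registry, concept):
    """Level (iii): (resource, column) pairs whose column maps to `concept`."""
    return [
        (r.resource, column)
        for r in registry
        for column, mapped in r.column_concepts.items()
        if mapped == concept
    ]


# All names below are hypothetical placeholders, not actual EPONT terms.
registry = [
    Registration(
        resource="granites.csv",
        keywords={"geochemistry"},
        classes={"RockAnalysis"},
        column_concepts={"SiO2 (wt%)": "SiO2", "Lat": "Latitude"},
    ),
    Registration(
        resource="volcanic_gas.csv",
        keywords={"geochemistry", "gas"},
        classes={"GasAnalysis"},
        column_concepts={"so2_ppm": "SO2", "latitude_deg": "Latitude"},
    ),
]

print(find_by_keyword(registry, "gas"))
print(find_by_class(registry, "RockAnalysis"))
print(find_columns(registry, "Latitude"))
```

The third lookup is the one that matters for integration: a query for the concept "Latitude" locates the relevant column in each resource even though the providers named those columns differently.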
Figure 1: A flow diagram showing the main components of SEDRE. When the data provider publishes the data, SEDRE would track its origin, assign a registration number and monitor its provenance. The geochemical data show structural heterogeneity that can be resolved easily through association of concepts with data columns.
Figure 1 shows the wiring diagram for SEDRE. We also show a small sub-section of the Planetary Material package (part of EPONT) so that individual data sets containing geochemical analyses with locations (from the Planetary Location ontology package of EPONT) can be mapped to terms defined in the ontologies. We recognize that data registration through ontologies is a time-consuming process, and as such, SEDRE has been developed as a downloadable service, where data owners connect to SEDRE's online repository only to upload the data-ontology mappings. This allows data owners to create their data-to-ontology mappings at their own convenience, while keeping ownership of the data.
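To make the notion of resolving structural heterogeneity through column-to-concept mappings more concrete, here is a small sketch under stated assumptions. The helper `to_concepts`, the column headers, and the concept labels are hypothetical and do not reflect SEDRE's interface or EPONT's vocabulary. Two providers use different headers for the same quantities; applying each provider's mapping rewrites the rows into a shared, concept-keyed form.

```python
def to_concepts(rows, mapping):
    """Rename each mapped column to its ontology concept; drop unmapped columns."""
    return [
        {mapping[col]: value for col, value in row.items() if col in mapping}
        for row in rows
    ]


# Hypothetical tables from two providers with different column headers.
provider_a = [{"SiO2 (wt%)": 72.1, "Lat": 37.2}]
provider_b = [{"silica": 49.8, "latitude_deg": 36.9, "notes": "altered"}]

# Hypothetical data-ontology mappings; only these would need to be shared.
mapping_a = {"SiO2 (wt%)": "SiO2", "Lat": "Latitude"}
mapping_b = {"silica": "SiO2", "latitude_deg": "Latitude"}

merged = to_concepts(provider_a, mapping_a) + to_concepts(provider_b, mapping_b)
print(merged)
```

Note that the mappings themselves contain no data values, which is in keeping with the idea that owners share mappings while retaining their data. The sketch deliberately ignores units and other metadata that a real registration would also need to capture.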
Figure 2: Schematic representation of the registration of data through SEDRE and discovery/integration through DIA.
SEDRE is designed to be used as a desktop application. As shown in Figure 2, SO2 data gathered on any given date can be registered to the concept of SO2 in the EPONT ontology. Conceptual mappings of locations and analyzed element abundances of liquids, gases or solids can be captured through the SEDRE user interface. DIA accesses these semantically registered data sets for integration and analysis. We suggest that semantic interoperability challenges (Malik and others, in press) can be readily overcome through the deployment of SEDRE and DIA in a Web environment.
Acknowledgements: This research is supported by the National Science Foundation; award EAR 0225588 to A. K. Sinha and as a subcontract from a NASA award (SEDRE) to Peter Fox, UCAR.
References Cited
A. Rezgui, Z. Malik, and A. K. Sinha, 2007, DIA Engine: Discovery, Integration and Analysis of Earth Science Data, International Geoinformatics Conference, San Diego, CA, USA, May 2007.
Z. Malik, A. Rezgui, and A. K. Sinha, Semantic Integration in Geosciences, Computers and Geosciences, in press.