CONSTRUCTING AN INTERNATIONAL GEOSCIENCE INTEROPERABILITY TESTBED TO ACCESS DATA FROM DISTRIBUTED SOURCES: LESSONS LEARNED FROM A GEOSCIML TESTBED
Geoscience data are being generated at exponentially increasing volumes, and it is no longer feasible to develop centralized warehouses from which data are accessed. Efficient access to such data online in real time from distributed sources is rapidly becoming one of the major challenges in building cyberinfrastructures for the Earth Sciences.
Web-based delivery of eXtensible Markup Language (XML) documents is a proven technology that allows access to standardized data on the fly via the internet. GeoSciML (GeoScience Markup Language) is a geoscience-specific, XML-based application of GML (Geography Markup Language) that supports interchange of geoscience information. It has been built from various existing geoscience data model sources, particularly the North American Data Model (NADM) and XMML (eXploration and Mining Markup Language). It is being developed through the Interoperability Working Group of the Commission for the Management and Application of Geoscience Information (CGI), which is a commission of the International Union of Geological Sciences (IUGS). The Working Group currently comprises geology and information technology specialists from agencies in North America, Europe, Australia and Asia.
The GeoSciML Testbed
In 2006, representatives from geological surveys in the USA, Canada, the UK, France, Sweden and Australia came together to develop a testbed that would use GeoSciML to access globally distributed geoscience map data (Duffy et al., 2006).
Data were served from seven sites in six countries, using several different WMS/WFS (Web Map Service/Web Feature Service) software solutions. Geological surveys in Canada, the USA and Sweden used an ESRI ArcIMS platform (and in one case a MapServer platform) with a Cocoon wrapper to handle queries and transformations of XML documents. The UK and Australian geological surveys employed the open source GeoServer software to serve data from ArcSDE and Oracle sources. The French geological survey used an Ionic RedSpider server for its WMS and client, and custom-developed software for its WFS. Web clients were constructed in Vancouver, Canada, using Phoenix, and later in Canberra, Australia, using Moximedia IMF software, to test various use cases for the WMS/WFS services. Generic web clients, such as Carbon Tools' Gaia 2, were also used to test some use cases.
In addition to geologic map data, the testbed also demonstrated the capacity to share borehole data as GeoSciML. Two WFS (French and British) provided borehole data to a client able to display the borehole logs.
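As a rough illustration of how a client addresses any one of these services, the sketch below builds a WFS 1.1.0 GetFeature request using key-value-pair encoding. The endpoint URL and the feature type name gsml:MappedFeature are assumptions chosen for illustration; they are not the addresses or type names used by the testbed services.

```python
from urllib.parse import urlencode

def build_getfeature_url(endpoint, type_name, max_features=10, bbox=None):
    """Build a WFS 1.1.0 GetFeature URL using key-value-pair (KVP) encoding."""
    params = {
        "service": "WFS",
        "version": "1.1.0",
        "request": "GetFeature",
        "typeName": type_name,
        "maxFeatures": str(max_features),
    }
    if bbox is not None:
        # Four corner coordinates; the axis order depends on the service's CRS.
        params["bbox"] = ",".join(str(v) for v in bbox)
    return endpoint + "?" + urlencode(params)

# Hypothetical endpoint and feature type, for illustration only.
url = build_getfeature_url(
    "http://example.org/wfs",
    "gsml:MappedFeature",
    max_features=5,
    bbox=(-10, 40, 5, 60),
)
print(url)
```

The same request could be sent to each participating server in turn. Whether every server accepts it depends on how much of the WFS specification that server implements.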
System (OGC) compliance
There are three important things to consider when establishing an OGC (Open Geospatial Consortium)-compliant interoperability testbed - compliance, compliance, and compliance. However, working at the cutting edge of WFS implementation in the GeoSciML Testbed strained existing WFS software implementations to breaking point. As the deadline for the public release of the Testbed approached, rigorous OGC standards compliance became an unrealistic goal. The focus of the project necessarily turned from OGC standards compliance to ensuring a useful degree of data exchange, including display, download and some simple query functionality.
In comparison with other data types that have successfully used OGC services, GeoSciML deals with extremely complex data. Thus, although the GeoSciML Testbed did prove that it is possible to make geoscience data interoperable, it also showed that semantic compliance is not going to be a trivial exercise, particularly for the more descriptive components of the earth sciences.
Proprietary vendor and open source software that aims to fully support the detail of OGC web service specifications is still at the developmental stage. The complexity of both the WFS query framework and the XML implementation model makes implementation of such software an onerous task. It may eventually be found that WFS as a generic query framework over an XML model of GeoSciML's complexity is not achievable, but this will only be tested by confronting the OGC standards with a conceptually well-modeled schema from a real domain, such as GeoSciML in geoscience.
Support for a subset of WFS services was achieved in the GeoSciML Testbed, but there is no standard mechanism to expose or describe the set of functionality that is implemented. In time, vendor and/or open source software will likely provide more rigorous and powerful WFS software implementations. The GeoSciML Testbed proved an effective mechanism to push further development of software capability in this area.
Semantic compliance
The GeoSciML Testbed highlighted firstly the importance of strict compliance with standard vocabularies of controlled concepts for true interoperability, and secondly the complexity of the concepts that we were trying to standardize and make readable by computers. Humans easily cope with a degree of fuzziness in data structures or ontologies. It is in our nature that many geologists see no problem with describing a sandstone as either "cross bedded" or "cross-bedded", but the distinction is vital to computer-based queries of digital data.
A lot of work is still to be done (and is underway) in the vocabulary arena to make data exchange and query more interoperable. A geologist knows that an igneous extrusive rock and a volcanic rock are the same thing, but a computer searching for volcanic rocks will not find rocks coded as igneous extrusive unless rules of equivalence and hierarchy are established in complex vocabularies. As with many other international initiatives for sharing information, the multilingual aspect must also be taken into account in any vocabulary development.
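The role of equivalence and hierarchy rules in such a query can be sketched as follows. This is a minimal illustration that assumes a tiny hand-built vocabulary; the terms, the two lookup tables and the function names are hypothetical and are not drawn from any published GeoSciML vocabulary.

```python
# Hypothetical mini-vocabulary, for illustration only.
# BROADER maps a term to its parent concept (hierarchy).
BROADER = {
    "basalt": "igneous extrusive rock",
    "rhyolite": "igneous extrusive rock",
    "igneous extrusive rock": "igneous rock",
    "granite": "igneous intrusive rock",
    "igneous intrusive rock": "igneous rock",
}
# EQUIVALENT maps a synonym to its preferred term (equivalence).
EQUIVALENT = {
    "volcanic rock": "igneous extrusive rock",
}

def canonical(term):
    """Normalize case and whitespace, then resolve synonyms to the preferred term."""
    term = term.strip().lower()
    return EQUIVALENT.get(term, term)

def matches(record_term, query_term):
    """Return True if record_term equals query_term or is a narrower concept of it."""
    target = canonical(query_term)
    current = canonical(record_term)
    seen = set()  # guards against accidental cycles in the hierarchy
    while current is not None and current not in seen:
        if current == target:
            return True
        seen.add(current)
        current = BROADER.get(current)
    return False

records = ["Igneous extrusive rock", "basalt", "granite"]
hits = [r for r in records if matches(r, "volcanic rock")]
print(hits)
```

Without the EQUIVALENT table, a search for "volcanic rock" would miss the record coded as "Igneous extrusive rock". Without the BROADER table, it would miss "basalt".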
Schematic compliance
So you have a data model. It is entirely scientifically logical and robust, with complex hierarchical structure and vocabularies to accommodate your complex and hierarchical data. But just how practically interoperable is it?
Participants in the GeoSciML WFS Testbed all provided information on the age of the geological units that they served. However, the schematic flexibility of the GeoSciML data model allows services to provide their age information in fully compliant, yet slightly different ways - as single terms, as multiple hierarchical terms, or as maximum and minimum terms. This meant that querying and reclassifying the data based on age information had to be done differently for each dataset, and no single standard query could be applied to all the GeoSciML datasets.
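One way a client might work around this is to normalize each encoding to a common interval before querying. The sketch below assumes three hypothetical record shapes (a single term, a list of hierarchical terms, and an older/younger pair) and a deliberately small slice of the geological time scale. The dictionary keys and function names are illustrative and do not reflect the GeoSciML schema.

```python
# Periods ordered oldest to youngest; a small illustrative subset of the time scale.
PERIODS = ["Triassic", "Jurassic", "Cretaceous", "Paleogene", "Neogene", "Quaternary"]
# Each era is given as its (oldest period, youngest period).
ERAS = {
    "Mesozoic": ("Triassic", "Cretaceous"),
    "Cenozoic": ("Paleogene", "Quaternary"),
}

def term_span(term):
    """Return (oldest_index, youngest_index) into PERIODS for a period or era name."""
    if term in ERAS:
        oldest, youngest = ERAS[term]
        return PERIODS.index(oldest), PERIODS.index(youngest)
    i = PERIODS.index(term)  # raises ValueError for an unknown term
    return i, i

def normalize_age(record):
    """Convert any of the three hypothetical encodings to one index interval."""
    if "age" in record:  # single term
        return term_span(record["age"])
    if "ages" in record:  # hierarchical terms: keep the most specific one
        spans = [term_span(t) for t in record["ages"]]
        return min(spans, key=lambda s: s[1] - s[0])
    if "older" in record and "younger" in record:  # maximum and minimum terms
        return term_span(record["older"])[0], term_span(record["younger"])[1]
    raise ValueError("unrecognised age encoding")

def overlaps(record, query_term):
    """Return True if the record's age interval overlaps the queried term."""
    rec_old, rec_young = normalize_age(record)
    q_old, q_young = term_span(query_term)
    return rec_old <= q_young and q_old <= rec_young

units = [
    {"age": "Jurassic"},
    {"ages": ["Mesozoic", "Jurassic"]},
    {"older": "Triassic", "younger": "Jurassic"},
    {"age": "Neogene"},
]
print([overlaps(u, "Jurassic") for u in units])
```

After normalization, a single overlap test serves all three encodings. The cost is that each client must maintain its own mapping rules, which is the burden a stricter schema profile would remove.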
Usability issues such as these will only be solved with the increasing maturity of emerging complex scientific spatial data models like GeoSciML. Use cases for data models and WFS services must be developed recognizing the capabilities of existing and future WFS/WMS and GIS software, as well as scientific user needs. However, as the GeoSciML Testbed showed, some of the limiting factors in a cutting edge project do not become apparent until the project is well under way.
Implications of the GeoSciML WFS Testbed
While WFS/WMS standards, data models and supporting software are still being developed, demonstrator projects such as the GeoSciML Testbed are vital to the progress of interoperability because they show users that the technology can deliver access to distributed data sources in real time. Further testbeds for GeoSciML should result in more robust and functional WFS/WMS services, which are expected to become mainstream data delivery services in the near future. Above all, this testbed highlighted the complexity of geoscience data and showed that strict adherence to controlled vocabularies is essential to making geoscience data semantically interoperable.
References Cited
Duffy, T., Boisvert, E., Cox, S., Johnson, B.R., Raymond, O., Richard, S.M., Robida, F., Serrano, J.J., Simons, B., and Stolen, L.-K., 2006, The IUGS-CGI International Geoscience Information Interoperability Testbed: International Association for Mathematical Geology XIth International Congress, Liège, Belgium.