HYDROSEEK – A SEARCH ENGINE FOR HYDROLOGISTS

BERAN, Bora and PIASECKI, Michael, Civil, Architectural and Environmental Engineering, Drexel University, 3141 Chestnut Str, Philadelphia, PA 19104, bb63@drexel.edu

Search engines have changed the way we see the Internet. The ability to find information simply by typing in keywords was a major contribution to the overall web experience. While the conventional search engine methodology works well for textual documents, locating scientific data remains a problem because the data are stored in databases that are not readily accessible to search engine bots.

Because databases differ in their temporal, spatial and thematic coverage, it is typically necessary to work with multiple data sources, especially for interdisciplinary research. These sources can be federal agencies, which generally offer national coverage, or regional sources, which cover a smaller area in greater detail. For a given geographic area of interest, however, there often exists more than one database with relevant data. Being able to query multiple databases simultaneously is therefore a desirable feature that would be tremendously useful for scientists. Development of such a search engine requires dealing with various heterogeneity issues. Scientific database systems often impose controlled vocabularies that ensure homogeneity within each system, so heterogeneity becomes a problem when more than one database is involved. Having controlled vocabularies at the individual database level bounds the variety of the vocabulary, making the semantic heterogeneity problem easier to solve than it is for conventional search engines that deal with free text.

However, structural, syntactic and information system heterogeneities emerge as additional types of incompatibility that these systems have to resolve. Structural heterogeneity is generally defined as different information systems storing their data in different document layouts and formats. Among current hydrologic data providers we find HTML tables, XML documents and text files, and the file format alone does not guarantee homogeneity, since data output can be organized in many different ways. Syntactic heterogeneity is the presence of different representations or encodings of data. Date/time formats are an example; common differences are local time vs. UTC, 12-hour clock vs. 24-hour clock, and standard date format vs. Julian day, which is common in Ameriflux data. Information system heterogeneity requires methods of communication specifically tailored to each data provider's servers because of differences in interfaces, e.g. REST services vs. SOAP services; it also encompasses the difficulties arising from differences in the arguments that each service requires. Sometimes even requests and responses use different formats. In EPA STORET, data requests (through the available REST services) require dates to be provided in Dublin Julian days¹, while the server returns Gregorian dates with the data.
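The date mismatches described above can be illustrated with a minimal Python sketch. It is not the HydroSeek implementation: the function names are hypothetical, and it assumes that the "Julian day" used in flux data means day of year and that Dublin Julian days are counted from noon UTC on December 31st, 1899. Whether a given service expects whole or fractional day values is not stated here and would have to be checked against that service.

```python
from datetime import date, datetime, timedelta, timezone

# Assumed epoch: Dublin Julian day 0 is noon UTC on 31 December 1899.
DJD_EPOCH = datetime(1899, 12, 31, 12, 0, tzinfo=timezone.utc)
ONE_DAY = timedelta(days=1)


def to_dublin_julian_day(moment: datetime) -> float:
    """Convert a timezone-aware datetime to (possibly fractional) Dublin Julian days.

    A naive datetime raises TypeError, which avoids silently mixing
    local time and UTC.
    """
    return (moment - DJD_EPOCH) / ONE_DAY


def from_dublin_julian_day(djd: float) -> datetime:
    """Convert Dublin Julian days back to a UTC datetime."""
    return DJD_EPOCH + timedelta(days=djd)


def day_of_year_to_date(year: int, day_of_year: int) -> date:
    """Convert a year plus a 1-based day of year to a calendar date."""
    return date(year, 1, 1) + timedelta(days=day_of_year - 1)
```

A wrapper around a service that takes Dublin Julian days in the request and returns Gregorian dates in the response would call `to_dublin_julian_day` on the way out and parse ordinary calendar dates on the way back, so that callers only ever see one date representation.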

We have developed a search engine² that enables querying multiple data sources simultaneously and returns data in a standardized output despite the aforementioned heterogeneity issues between the underlying systems. This application relies mainly on metadata catalogs or indexing databases, ontologies and web services, with virtual globe and AJAX technologies for the graphical user interface. Users can trigger a search of dozens of different parameters over hundreds of thousands of stations from multiple agencies by providing a keyword, a spatial extent (i.e. a bounding box) and a temporal bracket.
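As a rough illustration of a search driven by a keyword, a bounding box and a temporal bracket, the following Python sketch filters a small in-memory metadata catalog. Everything in it is an assumption made for illustration: the class and function names, the synonym dictionary standing in for an ontology, and the made-up sources and stations. It does not describe the actual catalog schema or services of the search engine.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BoundingBox:
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        return self.west <= lon <= self.east and self.south <= lat <= self.north


@dataclass(frozen=True)
class SeriesRecord:
    source: str    # data provider (hypothetical names below)
    station: str   # station identifier within that provider
    concept: str   # shared concept the provider's variable maps to
    lon: float
    lat: float
    start: date    # first observation in the series
    end: date      # last observation in the series


def search(catalog, synonyms, keyword, box, start, end):
    """Return catalog records matching the keyword, extent and period."""
    # Resolve the user's keyword to a shared concept (ontology stand-in).
    concept = synonyms.get(keyword.strip().lower())
    if concept is None:
        return []
    return [
        r for r in catalog
        if r.concept == concept
        and box.contains(r.lon, r.lat)
        and r.start <= end and r.end >= start  # periods overlap
    ]


# Made-up example data, for illustration only.
SYNONYMS = {"streamflow": "discharge", "flow": "discharge",
            "discharge": "discharge", "nitrate": "nitrate"}
CATALOG = [
    SeriesRecord("source_a", "A-001", "discharge", -75.2, 39.9, date(1990, 1, 1), date(2006, 12, 31)),
    SeriesRecord("source_b", "B-17", "discharge", -75.0, 40.1, date(2000, 1, 1), date(2003, 12, 31)),
    SeriesRecord("source_b", "B-42", "nitrate", -75.1, 40.0, date(1995, 1, 1), date(2005, 12, 31)),
    SeriesRecord("source_a", "A-900", "discharge", -120.0, 38.0, date(1990, 1, 1), date(2006, 12, 31)),
]
```

In this sketch a call such as `search(CATALOG, SYNONYMS, "Streamflow", BoundingBox(-76, 39, -74, 41), date(2001, 1, 1), date(2002, 12, 31))` returns the discharge series from both made-up sources inside the box. Retrieving the data values themselves would still require a per-provider adapter for each service interface.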

¹Days since the noon between December 31st, 1899 and January 1st, 1900

²http://cbe.cae.drexel.edu/search/

Session No. 5

Geoinformatics Oral Session III

Friday, 18 May 2007: 8:15 AM-3:00 PM

© Copyright 2007 The Geological Society of America (GSA), all rights reserved. Permission is hereby granted to the author(s) of this abstract to reproduce and distribute it freely, for noncommercial purposes. Permission is hereby granted to any individual scientist to download a single copy of this electronic file and reproduce up to 20 paper copies for noncommercial purposes advancing science and education, including classroom use, providing all reproductions include the complete content shown here, including the author information. All other forms of reproduction and/or transmittal are prohibited without written permission from GSA Copyright Permissions.


Geoinformatics 2007 Conference (17–18 May 2007)