Geoinformatics 2007 Conference (17–18 May 2007)

Paper No. 1
Presentation Time: 8:15 AM

HYDROSEEK – A SEARCH ENGINE FOR HYDROLOGISTS


BERAN, Bora and PIASECKI, Michael, Civil, Architectural and Environmental Engineering, Drexel University, 3141 Chestnut Str, Philadelphia, PA 19104, bb63@drexel.edu

Search engines have changed the way we see the Internet. The ability to find information simply by typing in keywords has contributed greatly to the overall web experience. While the conventional search engine methodology works well for textual documents, locating scientific data remains a problem because the data are stored in databases that search engine bots cannot readily access.

Because databases differ in temporal, spatial and thematic coverage, it is typically necessary to work with multiple data sources, especially in interdisciplinary research. These sources can be federal agencies, which generally offer national coverage, or regional sources, which cover a smaller area in greater detail. For a given geographic area of interest, there often exists more than one database with relevant data. The ability to query multiple databases simultaneously is therefore a desirable feature for scientists. Developing such a search engine requires dealing with several kinds of heterogeneity. Scientific databases often impose controlled vocabularies that ensure homogeneity within each system, so heterogeneity becomes a problem only when more than one database is involved. Controlled vocabularies at the individual database level bound the variety of terms, which makes the semantic heterogeneity problem easier to solve than it is for conventional search engines that deal with free text.
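As a minimal sketch of why bounded vocabularies simplify the semantic problem, each source's terms can be mapped once onto a shared concept and then looked up in reverse at query time. The source names, terms and concept labels below are hypothetical placeholders, not the vocabularies of any actual agency, and the sketch is not a description of the ontology used in the application.

```python
# Hypothetical controlled vocabularies: each source term maps to a shared concept.
# Source names, terms and concept labels are illustrative assumptions only.
SOURCE_VOCABULARIES = {
    "source_a": {
        "Discharge": "streamflow",
        "Water temperature": "water_temperature",
    },
    "source_b": {
        "Flow": "streamflow",
        "Temperature, water": "water_temperature",
    },
}


def terms_for_concept(concept):
    """Return, per source, the local terms that map to the shared concept."""
    matches = {}
    for source, vocabulary in SOURCE_VOCABULARIES.items():
        local_terms = [term for term, shared in vocabulary.items() if shared == concept]
        if local_terms:
            matches[source] = local_terms
    return matches
```

Because every source exposes a finite list of terms, the mapping can be enumerated in advance, which is not possible for free text.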

However, structural, syntactic and information system heterogeneities emerge as additional incompatibilities that such a search engine has to resolve. Structural heterogeneity is generally defined as different information systems storing their data in different document layouts and formats. Hydrologic data providers currently deliver HTML tables, XML documents or text files, and the file format alone does not guarantee homogeneity, since the data output can be organized in many different ways. Syntactic heterogeneity is the presence of different representations or encodings of data. Date/time formats are an example, where common differences are local time vs. UTC, 12-hour clock vs. 24-hour clock, and standard date format vs. Julian day, which is common in Ameriflux data. Information system heterogeneity requires methods of communication specifically tailored to each data provider's servers because of differences in interfaces, e.g. REST services vs. SOAP services; it also encompasses the difficulties arising from the different arguments that each service requires. Sometimes even requests and responses use different formats. In EPA STORET, data requests (through the available REST services) require dates to be provided in Dublin Julian days¹, while the server returns Gregorian dates with the data.
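The date/time differences described above can be illustrated with a short sketch that normalizes three representations to UTC and converts to and from Dublin Julian days, taking the epoch as noon UTC on 31 December 1899. The function names and input layouts (for example `MM/DD/YYYY hh:mm AM` and `YYYYDDD`) are assumptions chosen for illustration; they are not the documented formats of Ameriflux or EPA STORET.

```python
from datetime import datetime, timedelta, timezone

# Dublin Julian Day epoch: noon UTC on 31 December 1899.
DJD_EPOCH = datetime(1899, 12, 31, 12, 0, 0, tzinfo=timezone.utc)


def local_12h_to_utc(text, utc_offset_hours):
    """Parse an assumed 'MM/DD/YYYY hh:mm AM' local timestamp and return it in UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    naive = datetime.strptime(text, "%m/%d/%Y %I:%M %p")
    return naive.replace(tzinfo=local_zone).astimezone(timezone.utc)


def day_of_year_to_utc(text):
    """Parse an assumed 'YYYYDDD' (year plus day of year) string as midnight UTC."""
    return datetime.strptime(text, "%Y%j").replace(tzinfo=timezone.utc)


def to_dublin_julian_day(moment):
    """Convert a timezone-aware datetime to a fractional Dublin Julian day."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return (moment - DJD_EPOCH) / timedelta(days=1)


def from_dublin_julian_day(djd):
    """Convert a fractional Dublin Julian day back to a UTC datetime."""
    return DJD_EPOCH + timedelta(days=djd)
```

With every timestamp reduced to one timezone-aware representation, a request can be encoded in whatever form a given service expects, and the response decoded back, without the caller seeing the difference.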

We have developed a search engine² that enables querying multiple data sources simultaneously and returns data in a standardized output format despite the aforementioned heterogeneity among the underlying systems. The application relies mainly on metadata catalogs or indexing databases, ontologies and web services, with virtual globe and AJAX technologies for the graphical user interface. Users can trigger a search of dozens of different parameters over hundreds of thousands of stations from multiple agencies by providing a keyword, a spatial extent (i.e. a bounding box) and a temporal bracket.
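The three search inputs (keyword, bounding box and temporal bracket) can be sketched as a filter over a metadata catalog that holds station records from several sources in one common structure. Everything below, including the `SearchRequest` and `Station` types, their field names and the in-memory catalog, is a hypothetical illustration under simplifying assumptions (plain substring keyword matching, no bounding boxes that cross the antimeridian); it is not the implementation of the application.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class SearchRequest:
    keyword: str
    west: float   # bounding box, decimal degrees
    south: float
    east: float
    north: float
    start: date   # temporal bracket
    end: date


@dataclass(frozen=True)
class Station:
    source: str
    station_id: str
    parameter: str
    longitude: float
    latitude: float
    first_record: date
    last_record: date


def matches(station: Station, request: SearchRequest) -> bool:
    """True if the station lies in the box, overlaps the period and matches the keyword."""
    in_box = (request.west <= station.longitude <= request.east
              and request.south <= station.latitude <= request.north)
    overlaps = (station.first_record <= request.end
                and station.last_record >= request.start)
    keyword_hit = request.keyword.lower() in station.parameter.lower()
    return in_box and overlaps and keyword_hit


def search(catalogs: Dict[str, List[Station]], request: SearchRequest) -> List[Station]:
    """Apply the same request to every source catalog and merge the results."""
    results: List[Station] = []
    for stations in catalogs.values():
        results.extend(s for s in stations if matches(s, request))
    return results
```

Searching a catalog of station metadata rather than the remote databases themselves keeps the query cheap; the data values would then be requested only for the stations that match.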

¹ Days since noon on December 31st, 1899

² http://cbe.cae.drexel.edu/search/