Geoinformatics 2007 Conference (17–18 May 2007)

Paper No. 10
Presentation Time: 2:00 PM

CYBER-INTEGRATOR: A HIGHLY INTERACTIVE SCIENTIFIC PROCESS MANAGEMENT ENVIRONMENT TO SUPPORT EARTH OBSERVATORIES


KOOPER, Rob1, MARINI, Luigi1, MYERS, Jim1, MINSKER, Barbara1 and BAJCSY, Peter2, (1)NCSA, University of Illinois at Urbana-Champaign, 1205 W. Clark St, Room 1008, MC-257, Urbana, IL 61801, (2)National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, Urbana, IL 61801, pbajcsy@ncsa.uiuc.edu

The concept of Earth Observatories has been evolving over the past decade, and a successful deployment must address a multitude of “informatics” needs. In the context of Earth Observatories, informatics refers to problems arising from the large volumes of often highly complex data that have to be analyzed: information has to be interactively extracted from raw data and then understood by domain scientists. In all application areas where Earth Observatories are being designed and built (e.g., WATERS, CUAHSI, NEON, GEON or ORION), scientists want to learn from their data about a spectrum of complex phenomena. However, the informatics challenges can significantly hinder a domain expert's progress. Regardless of the domain (geo, hydro, eco, environmental or sensor), general informatics challenges exist related to (a) data volume and computational requirements, (b) management of data, analysis and resource complexity, and (c) the heterogeneity of information technologies supporting scientists. Our goal is to help scientists building Earth Observatories overcome these general challenges.

This paper presents a novel process management environment called CyberIntegrator to support diverse analyses in Earth Observatories. These analyses consume a great deal of human time and are hard to reproduce because of the lack of in-silico scientific process management and because of the diversity of data, software and computational requirements. The motivation for our work comes from the need to build the next generation of in-silico scientific discovery processes, which require (a) access to heterogeneous and distributed data and computational resources, (b) integration of heterogeneous software packages, tools and services, (c) formation and execution of complex analytical processing sequences, (d) preservation of traces of scientific analyses, and (e) design of secure collaborative web-based frameworks for sharing information and resources.

The goal of the presented work is to describe the modular architecture and key features of a workflow system that provides a process management environment for automating science processes, reducing the human time involved and enabling scientific discoveries that would not be possible without supporting software and hardware infrastructure. Our approach is based on adopting object-oriented software engineering principles, designing a modular software prototype, and focusing on user interfaces that simplify complex analyses, including heterogeneous software integration.

Figure 1 shows the overall architecture of the developed process management environment, bringing together the top-level features with the low-level object-oriented software engineering principles. From the functionality perspective, the CyberIntegrator software can be viewed as a system for (a) browsing and searching available data, tools and computational resources, (b) accessing available data sets, tools and computational resources, (c) bringing them together, (d) executing one tool at a time or a sequence of tools, (e) monitoring and controlling executions, (f) efficiently utilizing available data, tools and computational resources, (g) collecting information, or provenance, about the process flow to help later reconstruct the thought process of the scientist, and (h) assisting the scientist using the provenance gathered by the community. The key architectural components are the editor, engine, executors, applications, collections of registries, and an optional meta-data repository and event broker. They are all written in Java. The registries can be viewed as repositories of high-level descriptions of available data, tools and resources. The meta-data store provides a repository for gathered information about workflow execution and is based on the Resource Description Framework (RDF) format. Finally, the event broker is a component for handling a stream of data or events, and it is based on the Java Message Service (JMS) application programming interface (API) for sending messages between two or more clients.
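The interplay of registry, engine and provenance store described above can be illustrated with a minimal Java sketch. It is not taken from the CyberIntegrator code base: the class name WorkflowSketch, the tool names "scale" and "offset", and the "wasDerivedFrom" predicate are assumptions made for illustration, and the provenance log is a plain in-memory list of subject-predicate-object triples rather than a real RDF store. The JMS event broker is left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: a tool registry, a local engine, and a triple-style provenance log. */
public class WorkflowSketch {

    /** Registry stand-in: maps a tool name to a callable implementation. */
    static final Map<String, Function<double[], double[]>> TOOLS = new LinkedHashMap<>();

    /** Provenance log of {subject, predicate, object} entries, loosely modeled on RDF triples. */
    static final List<String[]> PROVENANCE = new ArrayList<>();

    static {
        // Two toy tools, invented for this example.
        TOOLS.put("scale", data -> Arrays.stream(data).map(v -> v * 2.0).toArray());
        TOOLS.put("offset", data -> Arrays.stream(data).map(v -> v + 1.0).toArray());
    }

    /** Engine stand-in: runs the named tools in order and records one triple per step. */
    static double[] run(List<String> steps, double[] input) {
        double[] current = input;
        String previous = "input";
        for (int i = 0; i < steps.size(); i++) {
            String name = steps.get(i);
            Function<double[], double[]> tool = TOOLS.get(name);
            if (tool == null) {
                throw new IllegalArgumentException("Tool not in registry: " + name);
            }
            current = tool.apply(current);
            String stepId = "step" + (i + 1) + ":" + name;
            // Record which earlier artifact this step's output was derived from.
            PROVENANCE.add(new String[] {stepId, "wasDerivedFrom", previous});
            previous = stepId;
        }
        return current;
    }

    public static void main(String[] args) {
        double[] out = run(Arrays.asList("scale", "offset"), new double[] {1.0, 2.5});
        System.out.println(Arrays.toString(out));
        for (String[] triple : PROVENANCE) {
            System.out.println(String.join(" ", triple));
        }
    }
}
```

Keeping the provenance log separate from the engine loop mirrors the optional nature of the meta-data repository: in this sketch the engine would still run if the recording line were removed.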

 

 

 

Figure 1. The overall architecture of CyberIntegrator

 

Our aim has been to design a workflow system that works with descriptions (meta-data) of data, software tools and computational resources for easy integration and hierarchical organization. These distributed descriptions (also called “registries”) can be modified with a text or XML editor. The workflow system pulls in all the information from the registries, loads a requested workflow and executes it. The benefits of such workflows for domain scientists are (a) the simplicity of integrating existing software within the workflow system, (b) the ability to run, reuse, re-purpose and share workflows with other scientists, and (c) feedback from the system during workflow creation based on the provenance gathered.
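As an illustration of how a workflow system might read such a registry, the following Java sketch parses a small XML description and selects the tools that accept a given data type. The XML layout (a registry element containing tool elements with name, input and output attributes) and the class name RegistrySketch are assumptions invented for this example. The actual CyberIntegrator registry format is not specified here. Only the standard Java XML parser is used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: reading tool descriptions from an XML registry. */
public class RegistrySketch {

    // Invented registry content; element and attribute names are assumptions.
    static final String REGISTRY_XML =
        "<registry>"
        + "<tool name=\"interpolate\" input=\"point-series\" output=\"grid\"/>"
        + "<tool name=\"render-map\" input=\"grid\" output=\"image\"/>"
        + "</registry>";

    /** Returns the names of all tools whose declared input type equals dataType. */
    static List<String> toolsAccepting(String xml, String dataType) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList tools = doc.getElementsByTagName("tool");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < tools.getLength(); i++) {
                Element tool = (Element) tools.item(i);
                if (dataType.equals(tool.getAttribute("input"))) {
                    names.add(tool.getAttribute("name"));
                }
            }
            return names;
        } catch (Exception e) {
            // Wrap parser errors so callers do not need to handle checked exceptions.
            throw new IllegalStateException("Could not read registry", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toolsAccepting(REGISTRY_XML, "grid"));
    }
}
```

Because the description is plain XML, adding a tool in this sketch amounts to adding one line to the registry text, which is the kind of low-effort integration the registries are meant to allow.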

The CyberIntegrator editor provides a user-friendly interface for browsing registries of data, tools and computational resources; creating workflows in a step-by-step exploration mode; re-using and re-purposing workflows; executing process flows locally or remotely; aiding research explorations using a provenance-to-recommendation pipeline; and incorporating heterogeneous tools and linking them transparently. Figure 2 shows the graphical user interface of the CyberIntegrator editor. There are three top panes (from left to right: the Data pane, the Tools pane and the Resources pane) and one pane on the bottom. The Data pane lists all data sets loaded or generated so far for processing. The Tools pane lists the tools currently loaded from local and remote registries. The Resources pane lists the computational resources (executors) available for a particular tool. The bottom pane contains several tabs with information about the CyberIntegrator execution. The Help tab contains help text about the currently selected tool. The Steps tab contains a list of the tools run so far with information about each execution. The Graph tab shows a graphical representation of the sequence of steps. The current workflow can be saved and later reloaded or shared with others. The scientist can ask the system to recommend a tool based on the current data by pushing the button above the Data pane.
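One simple way to turn gathered provenance into a tool recommendation is to count which tool was applied most often to data of the same type in past steps. The Java sketch below shows that idea only. It is an assumed frequency-count heuristic and not the recommendation method used by CyberIntegrator, and the class name RecommendationSketch, the data type labels and the tool names are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: recommending a tool from recorded (data type, tool) steps. */
public class RecommendationSketch {

    /**
     * Returns the tool most often applied to the given data type in the history,
     * or an empty Optional if that data type was never seen. Ties are broken arbitrarily.
     */
    static Optional<String> recommend(List<String[]> history, String dataType) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] step : history) {
            // step[0] = data type consumed, step[1] = tool that was run on it
            if (step[0].equals(dataType)) {
                counts.merge(step[1], 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
            .max(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Invented provenance records for illustration.
        List<String[]> history = Arrays.asList(
            new String[] {"point-series", "interpolate"},
            new String[] {"grid", "render-map"},
            new String[] {"grid", "export"},
            new String[] {"grid", "render-map"});
        System.out.println(recommend(history, "grid").orElse("no recommendation"));
    }
}
```

A community-wide provenance store would enlarge the history that such a count draws on, which is why shared provenance can improve the feedback given to an individual scientist.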

 

Figure 2. The graphical user interface of the CyberIntegrator editor

The key contributions of our work on CyberIntegrator can be summarized as follows. The main computer science novelty lies in (1) formalizing the software integration framework using object-oriented software engineering principles, (2) designing a browser-based modeling paradigm for step-by-step composition of workflows, (3) gathering provenance during workflow creation and execution, (4) using the provenance to provide tool recommendations for workflow auto-completion, and (5) providing capabilities to publish, run, monitor, retrieve, re-use and re-purpose workflows on local and remote computational resources, including long-running workflows (e.g., large simulations [2] and streaming data analyses [1]). The main technical contributions also include testing and demonstrating the prototype process management system with several application scenarios from environmental and hydrologic engineering sciences [1]. Our process management prototype also supports the WATERS/CLEANER and CUAHSI communities in the context of building Earth Observatories. The software is available for download at http://isda.ncsa.uiuc.edu/download.