CHAPTER 4

 

THE NEOWORK RUN-TIME ARCHITECTURE

 

In this chapter, we discuss the design of the run-time architecture of the NEOWork enactment system for the METEOR2 WFMS. The NEOWork run-time system uses both CORBA (NEO) and the Web as its primary communication infrastructures. The system is built on a refined centralized-scheduling architecture that improves system scalability and employs a comprehensive mechanism for system recovery. This chapter focuses on the architectural design of the system; the error handling and recovery framework is discussed in the next chapter.

 

4.1 MOTIVATIONS

 

The initial motivation of this thesis project was to design a recovery mechanism for the METEOR centralized architecture originally discussed in [Wan95] using the CORBA object services provided in Sun’s DOE (NEO’s predecessor). However, after we studied the architecture, we found that the original design has the following problems:

 

We have redesigned the system to avoid these problems by using different schedulers for different workflows, handling critical data using CORBA object locking, and clustering workflow engines to avoid a single point of failure. We will discuss these in detail later.

The advent of the CORBA object services (especially CORBA 2.0) provides a new and powerful way to design and implement workflow systems. For example, the object persistence service provides a mechanism to make the attribute data associated with an object persistent, which could help us recover the state of the workflow system in case of failures.

CGI scripts are a common way to connect the Web to CORBA objects. However, the lack of sophisticated error handling in CGI scripts makes the recovery of workflow systems difficult. In our design, instead of relying on CGI scripts, we implemented a special kind of CORBA object (the interface server) that bypasses CGI scripts and communicates with the Web through Java interfaces (Java Applets) using direct socket connections.

4.2 THE NEOWORK RUN-TIME ENVIRONMENT

Figure 4.1: The NEOWork run-time environment

 

Figure 4.1 shows the run-time environment of NEOWork. The figure does not show the components for system recovery; recovery is discussed in detail in the next chapter. In addition to the basic components of the original centralized architecture (scheduler, task managers and tasks), this design incorporates a workflow instance factory, a task manager factory, and an interface server into the run-time system. The architecture also shows three types of tasks at run time: transactional tasks, non-transactional tasks and user tasks. They are managed by their corresponding task managers.

 

4.3 THE RUN-TIME COMPONENTS

 

The underlying workflow infrastructures influence the design of the run-time WFMS architecture. One of our design strategies is to take advantage of the special characteristics of the infrastructures and maximize their ability to support the run-time system. Since we decided to use Sun's CORBA product NEO as our principal communication infrastructure, the NEOWork run-time components are designed to utilize the advanced features of the NEO system while preserving the generality of these components.

4.3.1 SCHEDULERS

 

The role of a scheduler in a centralized architecture is to enforce the inter-task dependencies and coordinate the workflow executions. To avoid overloading a single scheduler with concurrently running workflows and to make the system more scalable, we apply the object-oriented principle in our design: multiple schedulers are used for workflow scheduling. Each workflow is seen as an object instance and has its own scheduler, and the scheduler is responsible for coordinating the execution of the workflow's tasks by evaluating the inter-task dependencies of that individual workflow. There are no inter-dependencies among schedulers. Schedulers run concurrently and share the resources in the system. The object-oriented nature of the NEO system leads us to implement schedulers as CORBA objects. The life of the scheduler as a CORBA object spans the entire execution time of the workflow instance. After the workflow instance is completed, the scheduler is destroyed.

 

4.3.2 TASK MANAGERS

 

Task managers are the execution coordinators of tasks. As soon as a task is to run, a task manager is created by a task manager factory (discussed later) on the task client machine. The task manager in turn gathers the necessary initial data for the task (a mapping to the task-specific data format may be required) and starts the task through the task interface. The role of a task manager in NEOWork is to supervise the task's execution, report execution status to the scheduler, and participate in application data transfer for tasks. A task manager is also responsible for logging task execution states and the run-time task data (defined at task design time) into local persistent storage that can be used for task recovery in case of task failures.

According to the METEOR2 task model, which differentiates several types of tasks, each type of task has its corresponding type of task managers so that the semantics of tasks can be recognized and enforced by their specific task managers. NEOWork produces three types of task managers (TMs): Transactional TM, Non-transactional TM and User TM. All of them are logically inherited from a base class called basic task manager so that the common functionality among TMs can be reused.
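
The inheritance relationship described above can be illustrated with a minimal sketch. It is not the NEOWork implementation (which is built from CORBA objects); the class names, the `report` and `run` methods, and the state names are assumptions introduced only to show how shared logging could live in a basic task manager while each subtype enforces its own task semantics.

```python
from abc import ABC, abstractmethod


class BasicTaskManager(ABC):
    """Hypothetical base class holding behaviour common to all task managers."""

    def __init__(self, task_name):
        self.task_name = task_name
        self.log = []  # stands in for the local persistent storage

    def report(self, state):
        # Shared by every TM: log the task state before reporting it.
        self.log.append((self.task_name, state))
        return state

    @abstractmethod
    def run(self, task):
        """Run the task with the semantics of this task type."""


class TransactionalTM(BasicTaskManager):
    def run(self, task):
        self.report("executing")
        try:
            result = task()
        except Exception:
            # Assumed semantics: a failed transactional task is aborted.
            self.report("aborted")
            return None
        self.report("committed")
        return result


class NonTransactionalTM(BasicTaskManager):
    def run(self, task):
        self.report("executing")
        try:
            result = task()
        except Exception:
            self.report("failed")
            return None
        self.report("done")
        return result


class UserTM(BasicTaskManager):
    def run(self, task):
        # The callable stands in for the interaction with a human participant.
        self.report("waiting_for_user")
        result = task()
        self.report("done")
        return result
```

In this sketch, only `run` differs between the three types, so any logging or reporting logic added to the base class is reused by all of them.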

 

4.3.3 TASKS

 

The METEOR2 WFMS run-time supports workflows with enterprise-wide heterogeneous tasks. Tasks in the METEOR2 NEOWork system can be legacy applications, database applications that support transactional properties, or user tasks that require graphical user interfaces for human participation and interaction (see Figure 4.1). Tasks are typically distributed on different client machines under the supervision of their corresponding task managers.

 

4.3.4 COMPONENT FACTORIES

 

The METEOR2 WFMS system is designed to handle a large number of workflows executing simultaneously. In NEOWork, an executing workflow can be seen as a workflow instance in the system. The run-time components of a workflow instance are a scheduler object, several task manager objects and the corresponding executing tasks. Each workflow instance has its own workflow schedule maintained by its scheduler, and workflow instances can run concurrently without interfering with each other. The Scheduler Factory (SF) is a component factory that is responsible for creating new scheduler components for different workflow instances. Likewise, the Task Manager Factory (TMF) is a component factory that is responsible for producing task managers for the workflows. When a workflow starts, a scheduler is created by the SF and loaded with the scheduling information (the inter-task dependencies) specified at workflow design time. When a task is scheduled for execution, a request is sent to the TMF (located on the task host) to create a task manager object for the task. Different types of tasks have different types of task managers, so the TMF must be able to create different types of task managers based on the request from the scheduler. The workflow repository could be the source from which SFs and TMFs obtain the information they need to create schedulers and task managers, such as the scheduling information and the data for creating different types of task managers. The object-oriented approach to system component design makes NEOWork a more scalable workflow system than the one developed in [Wan95].
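
As a rough illustration of the two factories, the following sketch uses plain Python objects in place of CORBA objects. The repository layout (a dictionary from workflow name to inter-task dependencies), the instance naming scheme, and the method names are all assumptions made for this example.

```python
import copy
from itertools import count


class Scheduler:
    """One scheduler per workflow instance, holding its own schedule."""

    def __init__(self, instance_id, dependencies):
        self.instance_id = instance_id
        self.dependencies = dependencies


class SchedulerFactory:
    def __init__(self, repository):
        # Hypothetical repository: workflow name -> inter-task dependencies.
        self.repository = repository
        self._ids = count(1)

    def create_scheduler(self, workflow_name):
        # Each instance gets a private copy of the design-time dependencies,
        # so concurrently running instances cannot interfere with each other.
        dependencies = copy.deepcopy(self.repository[workflow_name])
        return Scheduler(f"{workflow_name}#{next(self._ids)}", dependencies)


class TaskManagerFactory:
    TYPES = ("transactional", "non-transactional", "user")

    def __init__(self, host):
        self.host = host  # the task host on which this factory runs

    def create_task_manager(self, task_name, task_type):
        if task_type not in self.TYPES:
            raise ValueError(f"unknown task type: {task_type!r}")
        # A dictionary stands in for the task manager object.
        return {"task": task_name, "type": task_type, "host": self.host}
```

Because the schedule is loaded when the scheduler is created, the same factory can serve workflows with identical or different dependency structures.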

Figure 4.2: Distribution of the Scheduler and Task Manager Factories

Figure 4.2 depicts the run-time distribution of the Scheduler Factories and Task Manager Factories in the NEOWork environment. Scheduler Factories and Task Manager Factories are implemented as CORBA registered servers distributed on different hosts in the NEOWork run-time. Although only one scheduler factory is needed to handle scheduler creation for workflows, NEOWork avoids a single point of failure by including several scheduler factories, distributed on different hosts, as a clustered scheduler factory service template. When a new workflow instance is requested, NEOWork finds one of the scheduler factories in the template to create the scheduler object. This greatly reduces the chance that scheduler creation fails and provides higher availability for the NEOWork system.
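
The effect of clustering can be sketched as a simple failover loop. The selection policy shown here (try the factories in order and use the first one that responds) is an assumption; the text does not specify how NEOWork picks a factory from the template, and the class and exception names are hypothetical.

```python
class FactoryUnavailable(Exception):
    """Raised when a scheduler factory host cannot be reached."""


class SchedulerFactoryStub:
    # Stand-in for a remote scheduler factory registered on some host.
    def __init__(self, host, alive=True):
        self.host = host
        self.alive = alive

    def create_scheduler(self, workflow_id):
        if not self.alive:
            raise FactoryUnavailable(self.host)
        return f"scheduler[{workflow_id}]@{self.host}"


def create_scheduler_from_cluster(factories, workflow_id):
    # Try each factory in the cluster until one succeeds.
    for factory in factories:
        try:
            return factory.create_scheduler(workflow_id)
        except FactoryUnavailable:
            continue
    raise FactoryUnavailable("no scheduler factory reachable")
```

With this policy, a workflow instance can still be started as long as at least one factory in the cluster is running.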

In summary, the design of SF and TMF in NEOWork provides a clean and unified run-time environment for distributed hosts within the scope of the WFMS. A host equipped with a scheduler factory can participate in workflow scheduling, while a host equipped with a task manager factory becomes a processing entity for task execution. The run-time workflow components (schedulers and task managers) are created as needed while the workflow runs, which makes efficient use of system resources. The design also supports generality and reusability in implementing WFMS enactment systems: the code for SF and TMF can be standardized and reused. Since the scheduling structure of a workflow is constructed by the SF when a scheduler is created, the NEOWork run-time can support both homogeneous workflows (workflows having the same inter-task dependencies) and heterogeneous workflows (workflows having different inter-task dependencies).

 

 

4.4 LIFE CYCLE OF THE RUN-TIME COMPONENTS

 

As we have discussed, schedulers and task managers are created during a workflow execution and destroyed after the workflow completes. We call schedulers and task managers NEOWork’s dynamic components. Scheduler factories and task manager factories are registered persistently in NEOWork during system setup, and we call them static components.

Figure 4.3: Lifecycle of the run-time components

 

Figure 4.3 shows an example of the lifetime of the dynamic components during a workflow’s execution. When a new workflow instance is requested, a scheduler is created, and the lifetime of the scheduler (ts) spans the entire execution of the workflow (tw). The scheduler then enters a preparation period t to load the scheduling information for the workflow instance and start the scheduling routine. Task managers are created when tasks are scheduled to execute. The lifetime of each task manager (ttm1, ttm2, ttm3 and ttm4) covers the entire execution time of its task (tt1, tt2, tt3 and tt4). After a task completes execution, the task manager is not destroyed immediately. There is a period t during which the task manager helps transfer the task's output data to all of its successors (task managers). In the NEOWork implementation, the object reference of the task manager is passed to the successors. The successors in turn bind to the object and get the data object of the task. The task manager object is destroyed automatically by the NEO system when the object reference is invalidated (released) by the successors.

Figure 4.4: Different Phases in Components’ Lifecycle

 

The lifecycle of dynamic components has four phases: created, active, inactive and destroyed (Figure 4.4). After creation, components enter an active phase in which a scheduler loads the scheduling procedures and sends messages to create new task managers, and task managers prepare and start the task executions. The scheduler object will time out and enter an inactive phase when no messages come from the task managers. While waiting for task executions, task managers will also enter an inactive phase. They will be activated again by the NEOWork system as soon as there are further references. The lifecycle of the static components has four phases as well: registered, running, not running and unregistered (Figure 4.4). During NEOWork installation, the object factories are registered as object servers that respond to object creation requests. When a request is received, the factory server starts running again. A timeout mechanism puts the server into the inactive (not running) phase when no requests arrive for a period of time. The implementation of these lifecycles is easy to achieve using NEO’s lifecycle service.
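
The four phases of a dynamic component can be modelled as a small state machine. The sketch below is illustrative only: the set of allowed transitions is our reading of the description above (Figure 4.4 may show additional or different arcs), and the class and method names are hypothetical. In NEOWork itself these transitions are driven by NEO's lifecycle service rather than by application code.

```python
from enum import Enum


class Phase(Enum):
    CREATED = "created"
    ACTIVE = "active"
    INACTIVE = "inactive"
    DESTROYED = "destroyed"


# Assumed transitions between the phases of a dynamic component.
TRANSITIONS = {
    Phase.CREATED: {Phase.ACTIVE},
    Phase.ACTIVE: {Phase.INACTIVE, Phase.DESTROYED},    # timeout or completion
    Phase.INACTIVE: {Phase.ACTIVE, Phase.DESTROYED},    # new reference or release
    Phase.DESTROYED: set(),
}


class DynamicComponent:
    """A scheduler or task manager tracked through its lifecycle."""

    def __init__(self, name):
        self.name = name
        self.phase = Phase.CREATED

    def move_to(self, new_phase):
        if new_phase not in TRANSITIONS[self.phase]:
            raise ValueError(
                f"{self.name}: {self.phase.value} -> {new_phase.value} not allowed"
            )
        self.phase = new_phase
```

A scheduler that times out and is later reactivated would pass through active, inactive, and active again before being destroyed.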

Figure 4.5: The centralized scheduling mechanism of the NEOWork

 

4.5 CENTRALIZED SCHEDULING AND TASK ACTIVATION

 

In NEOWork, task scheduling is done through a central scheduler and task managers are responsible for task activation. Figure 4.5 depicts the centralized scheduling mechanism of the NEOWork.

The inter-task dependencies of a workflow are specified at design time, saved in WIL and translated into a set of internal representations that can be interpreted by the run-time scheduler. A task in METEOR2 is logically represented by a set of internal states that depends on the task type. The transitions of controllable states are guarded by gates. A gate is a normalized AND-OR tree that represents the conditions that must be satisfied for a state transition. A typical controllable transition is the activation of a task: if the gate is evaluated as open by the scheduler, the task is scheduled to start executing. In NEOWork, the scheduling information of a workflow is stored in a persistent storage PSM (Figure 4.5).
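
To make the notion of a gate concrete, the following sketch evaluates an AND-OR tree against the set of events that have occurred so far. The tuple encoding of the tree, the event names, and the example gate are assumptions for illustration; they are not the internal representation used by the NEOWork scheduler.

```python
def evaluate_gate(gate, occurred):
    """Return True if the AND-OR tree `gate` is open for the given events."""
    kind, *children = gate
    if kind == "event":
        # Leaf node: open once the named event has occurred.
        return children[0] in occurred
    if kind == "and":
        return all(evaluate_gate(child, occurred) for child in children)
    if kind == "or":
        return any(evaluate_gate(child, occurred) for child in children)
    raise ValueError(f"unknown gate node: {kind!r}")


# Hypothetical gate guarding the activation of a task T3:
# (T1 done AND T2 done) OR T1 failed
T3_START = (
    "or",
    ("and", ("event", "T1.done"), ("event", "T2.done")),
    ("event", "T1.failed"),
)
```

For example, `evaluate_gate(T3_START, {"T1.done"})` is false because T2 has not finished, while adding `"T2.done"` to the set opens the gate.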

The run-time structure of a scheduler contains the scheduler engine and a group of task agents. The scheduler engine is a service that evaluates the inter-task dependencies (AND-OR trees) in the persistent storage. If the condition of a state transition is satisfied, the transition information is sent to the task manager through the task agents. Task agents represent tasks inside the scheduler. In NEOWork, they are thread agents that communicate with task managers through synchronous IDL interface calls. The role of task agents is to send control messages to task managers, update task states in the scheduling structure in the persistent storage, and perform failure detection and recovery for task managers. The scheduler engine creates a task agent to request and synchronize with a task manager when a task is scheduled to run.

A task manager is responsible for initializing the execution environment for a task, and for activating and supervising the task execution. During initialization, task data are gathered and possibly converted into a data format recognizable by the task. In NEOWork, data transfer is performed by task managers. During task execution, the task manager reports the task states to the task agent and logs task states in its local persistent storage. In the case of task failure, the task manager will also try to recover the task. The details will be discussed in the next chapter. The persistence service of the NEO system provides a feasible and reliable way to implement data logging down to the per-object level.

 

4.6 USER INTERFACES AND THE INTERFACE SERVER

 

The importance of using a common Web browser like Netscape as the primary GUI to interact with users is emphasized in the design of all METEOR2 WFMSs. Dynamically generated HTML forms give WFMS developers a fast way to build an application-like interface that lets users input data, with an underlying CGI script that receives the data and communicates with the CORBA objects. However, concerns about the execution efficiency, security control and run-time failure handling of CGI scripts suggest searching for better ways to integrate with the Web. NEOWork supports user interaction through the Web by providing Java Applet interfaces. The Java Applets give users not only a real-time application to work with but also a more secure communication environment on the Web. The underlying data communication between Java Applets and CORBA objects goes through TCP/IP socket connections instead of using CGI scripts and a Web server as the User-CORBA gateway. The Interface Server is designed to manage these data communications.

The primary role of the interface server is to facilitate and manage the data communications between user task managers and user tasks. In chapter 2, we discussed how Java Applets communicate with CORBA objects using TCP/IP socket connections. In NEOWork, the interface server is implemented as a CORBA object that, on behalf of the task manager, can open different ports to exchange data with Java Applets on the user host. We will discuss the details later.

Figure 4.6: Run-time structure of the interface server

 

Another role of the interface server is to manage the worklists for the users. A worklist is a list of workitems associated with a user task. Each workitem represents a scheduled task associated with a workflow instance. Each user has a worklist, and every time a user task is ready for execution, it is broadcast to the worklists of the users. When a user selects a task from the worklist, the task has to be deleted from all other users’ worklists. Once the execution of that task is completed, the task is deleted from all the worklists. As a result of its execution, new tasks may become eligible for execution and the cycle begins anew.

The run-time structure of the interface server contains a set of User Agents and a Controller (Figure 4.6). A user agent represents a user in the interface server during task execution and facilitates data communication between the user task and the user task manager. The controller is an object factory that creates a user agent for a user task and provides functionality to logically maintain the worklists on the user interfaces. The user task managers communicate with the interface server through IDL calls, while user tasks communicate with the interface server using TCP/IP socket connections.

In NEOWork, the interface server is implemented as a registered NEO object server that provides services to facilitate data communication with the user interfaces (Java Applets) distributed on different hosts over the Web. When a user task is scheduled to run, the user task manager finds the interface server and registers the user task with the controller. The controller in turn opens a new socket port, broadcasts the user task and the port number to the worklists on the user interfaces and waits for a user to select the task. Every user machine has a common port that is registered with the controller of the interface server. This port is called the control port because it is reserved only for communication with the controller. The controller sends control data through this port to maintain the worklist on the user interface. The following steps outline how the controller of an interface server maintains worklists for the users:

    1. The controller broadcasts the scheduled user task to the user machines through their control ports. The user task is then added to the worklist on the user interfaces by a Java Applet routine running on the user’s machine.
    2. When a user selects the task, the user interface first sends a request to the controller for reservation.
    3. The controller determines whether the task has already been selected by another user (i.e., whether a user agent has been created for the task). If the task has not been selected, the controller creates a user agent, which in turn opens a new socket port to communicate with the user interface, and approves the request by sending an approval notice back to the user interface. The controller then sends a "delete" message to the other user interfaces to delete the selected task from their worklists.
    4. If the task has been selected (a user agent is created), the controller sends back a "deny" message to the user interface and the task will be deleted from the user’s worklist.
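
The four steps above can be sketched as follows. This is a single-process simplification with assumed names: the worklists are in-memory lists rather than Java Applets reached through control ports, and recording the reserving user stands in for creating a user agent.

```python
class Controller:
    """Simplified, in-memory stand-in for the interface server's controller."""

    def __init__(self, users):
        self.worklists = {user: [] for user in users}
        self.user_agents = {}  # task -> user for whom an agent was created

    def broadcast(self, task):
        # Step 1: add the scheduled user task to every user's worklist.
        for worklist in self.worklists.values():
            worklist.append(task)

    def _delete(self, user, task):
        if task in self.worklists[user]:
            self.worklists[user].remove(task)

    def request(self, user, task):
        # Step 2: a user interface asks to reserve the task.
        if task in self.user_agents:
            # Step 4: already selected, so deny and drop it from this worklist.
            self._delete(user, task)
            return "deny"
        # Step 3: create the user agent and approve the request, then send
        # "delete" to the other user interfaces.
        self.user_agents[task] = user
        for other in self.worklists:
            if other != user:
                self._delete(other, task)
        return "approve"
```

In the distributed setting the "deny" branch matters because two users may select the same task before the "delete" message reaches the second user's worklist; in this sequential sketch it is reached only when a request arrives for a task that has already been reserved.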

 

The user agent object created by the controller of the interface server facilitates data communication between the user task manager and the user interface. The following steps outline how the user agent manages data communication:

    1. After being created by the controller, the user agent opens a socket port to listen for data from the user interface.
    2. When receiving the connection request from the user interface, the user agent sets up the socket connection to the target user machine.
    3. The user agent then transfers the task initialization data from the task manager to the user interface to prepare for the task execution. The user agent also saves the data to its persistent storage for recovery purpose.
    4. After the user completes the task, data are sent back from the user interface to the user agent; and the user agent in turn transfers the data back to the task manager.
    5. After finishing all these steps, the user agent shuts down the connection with the user machine and recycles the socket port.
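
The exchange in the steps above can be sketched with ordinary TCP sockets on the local machine. This is only an approximation under several assumptions: both ends run in one Python process, messages are single newline-terminated text lines, a list stands in for the agent's persistent storage, and all function names are hypothetical. In NEOWork the user-interface side is a Java Applet and the agent is part of a CORBA object.

```python
import socket
import threading


def user_agent(server_sock, init_data, storage, results):
    # Step 2: accept the connection request from the user interface.
    conn, _ = server_sock.accept()
    conn.settimeout(5)
    with conn, conn.makefile("rw", encoding="utf-8", newline="\n") as stream:
        # Step 3: save the initialization data, then send it to the user interface.
        storage.append(init_data)
        stream.write(init_data + "\n")
        stream.flush()
        # Step 4: receive the task output to hand back to the task manager.
        results.append(stream.readline().rstrip("\n"))
    # Step 5: shut down and release the port.
    server_sock.close()


def user_interface(port, fill_in):
    # Stand-in for the applet: read the initial data, reply with the user's input.
    with socket.create_connection(("127.0.0.1", port), timeout=5) as sock:
        with sock.makefile("rw", encoding="utf-8", newline="\n") as stream:
            init_data = stream.readline().rstrip("\n")
            stream.write(fill_in(init_data) + "\n")
            stream.flush()


def run_user_task(init_data, fill_in):
    # Step 1: the agent opens a socket port and listens on it.
    server_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server_sock.settimeout(5)
    server_sock.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server_sock.listen(1)
    port = server_sock.getsockname()[1]
    storage, results = [], []
    agent = threading.Thread(
        target=user_agent, args=(server_sock, init_data, storage, results)
    )
    agent.start()
    user_interface(port, fill_in)
    agent.join()
    return results[0], storage
```

Calling `run_user_task("name=?", lambda form: form.replace("?", "Ada"))` plays both roles once: the agent sends the form, the simulated user fills it in, and the agent hands the completed data back to the caller, which stands in for the task manager.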

 

In summary, the interface server provides TCP/IP socket hook-ups that let the CORBA world communicate with Java Applets and applications while bypassing CGI scripts and Web servers. An alternative is to use an IDL-to-Java mapping (such as JOE from Sun) to generate CORBA Java stubs that can be compiled into the user interface applications to handle data communication automatically.

 

4.7 ADMINISTRATION TOOLS

 

NEOWork also includes a set of preliminary administration and monitoring tools to manage the run-time environment of the METEOR2 WFMS: