Run-Time Selection of Block Size in Pipelined Parallel Programs

Students

Michael James

Karthik Balasubramanian

Papers

David K. Lowenthal. Accurately Selecting Block Size at Run Time in Pipelined Parallel Programs. International Journal of Parallel Programming., 28(3):245-274 (2000). PS , PDF

David K. Lowenthal and Michael James. Run-Time Selection of Block Size in Pipelined Parallel Programs. Second Merged IPPS/SPDP Symposium , p. 82-87, 1999. PS

Karthik Balasubramanian and David K. Lowenthal. Efficient Support for Pipelining in Distributed Shared Memory Systems. Parallel and Distributed Computing Practices. PS

Project Overview

Distributed-memory parallel computing is widely used to speed up scientific applications. In practice, parallel programming has proven to be very difficult, so compilers have been developed to generate parallel programs. For example, automatic methods exist to discover parallelism in sequential code, distribute data and computation to nodes, and communicate between nodes. Most parallelizing compilers look for parallelism only in loops, because that is where most of the time is spent in typical scientific applications.

One key obstacle to automatic parallelization is the presence of data dependencies, which enforce a certain ordering; their existence in loops may result in sequential execution of the entire loop. A significant number of scientific applications contain loops whose data dependencies prevent all iterations from being executed in parallel, including alternate direction implicit (ADI) integration, implicit hydrodynamics codes, three-dimensional Multigrid methods, and air pollution simulations. Loops with data dependencies are often referred to as DOACROSS loops; if the dependencies can be precisely determined by the compiler, we call them regular DOACROSS loops. These can often be executed efficiently using a technique called pipelining, where each node computes its work in blocks; after each, it passes the necessary results (via an explicit message) to the next node in the pipeline. Once the pipe is full, the nodes execute in parallel; the data dependencies are satisfied by forcing a node to block awaiting data from the node that precedes it in the pipeline. Sophisticated compilers such as SUIF often can automatically transform sequential loops with data dependencies to pipelined loops.

One important parameter in a pipelined program is the amount of work performed by each node before communicating. This is often referred to as the tile or block size and is critical to the efficiency of the program. A small block size decreases node idle time but increases the amount of communication, while a large block size decreases the amount of communication but increases node idle time. Unfortunately, compilers are not always successful at choosing an effective block size. There are several reasons for this. First, compilers use static estimates of workloads, which can be inexact. Even if the estimate is reasonable, it is still nontrivial to choose an effective block size. Second, compilers generally divide the computation into equal-sized blocks, which may not be flexible enough to accommodate scientific applications with unbalanced workloads; for example, an update may be guarded by a conditional. Finally, the best block size may be unknown until run time if the amount of work depends on input values. Fortunately, most scientific programs are iterative (i.e., contain an outermost loop that encloses the entire computation), and many exhibit similar characteristics on each iteration. This makes it possible to monitor one iteration of the outermost loop and use the results to guide decisions on future ones. This work seeks to develop novel run-time analysis, to be used in conjunction with existing compiler detection methods of pipelining opportunities, that finds the best block size for pipelined code.



David Lowenthal
Thu Sep 11 20:57:29 EDT 1997