Reading List for Parallel and Distributed Computing (partial)

General

Alan Jay Smith. The Task of the Referee. IEEE Computer, pages 65-71, April 1990.

Parallel Architectures

Jeffrey Kuskin, David Ofelt, Mark Heinrich, John Heinlein, Richard Simoni, Kourosh Gharachorloo, John Chapin, David Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Stanford FLASH Multiprocessor. In Proceedings of the 21st Annual International Symposium on Computer Architecture, pages 302-313, April 1994.

Thomas E. Anderson, David E. Culler, David A. Patterson, and the NOW team. A Case for Networks of Workstations: NOW. In Principles of Distributed Computing, August 1994.

N. Adiga, G. Almasi, et al. An Overview of the BlueGene/L Supercomputer. Proc. SC 2002.

Caching

David Chaiken, Craig Fields, Kiyoshi Kurihara, and Anant Agarwal. Directory-Based Cache Coherence in Large-Scale Multiprocessors. IEEE Computer, pages 49-58, June 1990.

Per Stenström. A Survey of Cache Coherence Schemes for Multiprocessors. IEEE Computer, pages 12-24, June 1990.

John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach (2nd Edition). Chapter 8 (Multiprocessors). Morgan Kaufmann.

Distributed Shared Memory (DSM)

John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and Performance of Munin. SOSP 1991, pp. 152-164.

Honghui Lu, Alan Cox, R. Rajamony, Willy Zwaenepoel, and Sandhya Dwarkadas. Compiler and Software Distributed Shared Memory Support for Irregular Applications. Proceedings of the Sixth PPOPP, Las Vegas, NV, June 1997.

E. Pinheiro et al. S-DSM for Heterogeneous Machine Architectures. Second Workshop on Software Distributed Shared Memory, May 2000.

Honghui Lu et al. Contention Elimination by Replication of Sequential Sections in Distributed Shared Memory Programs. In PPOPP, 2001.

A. Itzkovitz and A. Schuster. MultiView and Millipage: Fine-Grain Sharing in Page-Based DSMs. In OSDI, 1999.

Compilation

Seema Hiranandani, Ken Kennedy, Chau-Wen Tseng. Compiling Fortran D for MIMD Distributed-Memory Machines. Communications of the ACM, pages 66-80, August 1992.

Todd C. Mowry, Monica S. Lam, and Anoop Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 62-73, October 1992.

David Padua and Michael J. Wolfe. Advanced Compiler Optimizations for Supercomputers. Communications of the ACM, 29(12), December 1986.

Ken Kennedy and Kathryn McKinley. Optimizing for Parallelism and Data Locality. International Conference on Supercomputing, 1992.

Mike Voss and Rudi Eigenmann. High-Level Adaptive Program Optimization with ADAPT. In PPOPP, 2001.

Alexandru Salcianu and Martin Rinard. Pointer and Escape Analysis for Multithreaded Programs. In PPOPP, 2001.

Thread-Level Speculation/Run-Time Parallelization

J. Gregory Steffan and Todd C. Mowry. The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization. Proceedings of HPCA, 1998.

Lawrence Rauchwerger and David Padua. The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization. Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995, pp. 218-232.

Raja Das, Mustafa Uysal, Joel Saltz, and Yuan-Shin Hwang. Communication Optimizations for Irregular Scientific Computations on Distributed Memory Architectures. Journal of Parallel and Distributed Computing, 1993.

Communication

Ahmad Faraj and Xin Yuan. Automatic Generation and Tuning of MPI Collective Communication Routines. ICS 2005.

Ernie Chan, William Gropp, Rajeev Thakur, and Robert van de Geijn. Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links. PPOPP 2006.

Chao Huang, Gengbin Zheng, Sameer Kumar, and Laxmikant Kale. Performance Evaluation of Adaptive MPI. PPOPP 2006.

Data Distribution

M. Gupta and P. Banerjee. PARADIGM: A Compiler for Automated Data Distribution on Multicomputers. Proceedings of the 1993 ACM International Conference on Supercomputing, pp. 357-367, July 1993.

Sotiris Ioannidis and Sandhya Dwarkadas. Compiler and Run-Time Support for Adaptive Load Balancing in Software Distributed Shared Memory Systems. Fourth Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, May 1998

Donald G. Morris III and David Lowenthal. Accurate Data Redistribution Cost Estimation in Software Distributed Shared Memory Systems. PPOPP 2001.

Ken Kennedy and Uli Kremer. Automatic Data Layout for Distributed Memory Machines. ACM Transactions on Programming Languages and Systems (TOPLAS), 20(4), ACM Press, 1998.

Jaydeep Marathe and Frank Mueller. Hardware Profile-Guided Automatic Page Placement for ccNUMA Systems. PPOPP 2006.

Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesus Labarta, and Eduard Ayguade. A Case for User-Level Dynamic Page Migration. ICS '00.

Gosia Wrzesinska, Jason Maassen, and Henri Bal. Self-Adaptive Applications on the Grid. PPOPP 2007.

Modeling Parallel Programs

David Culler, Richard Karp, et al. LogP: Towards a Realistic Model of Parallel Computation. In Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1-12, July 1993.

Logical Clocks

Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558-565, July 1978.

Power-Aware Computing

T. Heath, E. Pinheiro, J. Hom, U. Kremer, and R. Bianchini. Application Transformations for Energy and Performance-Aware Device Management. PACT 2002.

T. Heath, B. Diniz, E. V. Carrera, W. Meira Jr., and R. Bianchini. Energy Conservation in Heterogeneous Server Clusters.

Robert Springer, Barry Rountree, David Lowenthal, and Vince Freeh. Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster. 11th ACM Symposium on Principles and Practice of Parallel Programming (PPOPP), March 2006

Reducing Communication Latency

Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256-266, May 1992.

Cezary Dubnicki, Liviu Iftode, Edward W. Felten, and Kai Li. Software Support for Virtual Memory-Mapped Communication. Intl. Parallel Processing Symposium, April 1996.

Vincent W. Freeh, David K. Lowenthal, and Gregory R. Andrews. Distributed Filaments: Efficient Fine-Grain Parallelism on a Cluster of Workstations. First Symposium on Operating Systems Design and Implementation, p. 201-212, Monterey, CA, November 14-17, 1994

Combining Task and Data Parallelism

J. Subhlok and G. Vondran. Optimal Mapping of Sequences of Data Parallel Tasks. In Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 134-143, July 1995.

A. Radulescu et al. CPR: Mixed Task and Data Scheduling for Distributed Systems. In IPDPS, May 2001.

T. Gross, D. O'Hallaron, and J. Subhlok. Task Parallelism in a High Performance Fortran Framework. IEEE Parallel and Distributed Technology: Systems and Applications, 1994.

Out of Core Applications

Todd Mowry, Angela Demke, and Orran Krieger. Automatic Compiler-Inserted I/O Prefetching for Out-of-Core Applications. In Second Symposium on Operating Systems Design and Implementation, October 1996.

R. Bordawekar et al. A Model and Compilation Strategy for Out-of-Core Data Parallel Programs. In PPOPP '95.

Message Passing Performance Analysis

Philip Roth and Barton Miller. On-line Automated Performance Diagnosis on Thousands of Processors. PPOPP 2006.

Sadaf Alam, Pratul Agarwal, Al Geist, and Jeffrey Vetter. Performance Characterization of Bio-Molecular Simulations Using Molecular Dynamics. PPOPP 2006.

Jeff Vetter and Michael McCracken. Statistical Scalability Analysis of Communication Operations in Distributed Applications. In PPOPP, 2001.

J.S. Vetter and P. Worley. Asserting Performance Expectations. Proc. SC 2002.

J.S. Vetter. Dynamic Statistical Profiling of Communication Activity in Distributed Applications. Proc. SIGMETRICS: Joint International Conference on Measurement and Modeling of Computer Systems.

J.S. Vetter. Performance Analysis of Distributed Applications Using Automatic Classification of Communication Inefficiencies. Proc. ACM Int'l Conf. Supercomputing.

Memory Consistency Models

David Mosberger. Memory Consistency Models. Operating Systems Review, 1993.

Miscellaneous

Enrique Carrera and Ricardo Bianchini. Efficiency vs. Portability in Cluster-Based Network Servers. In PPOPP, 2001.

Michael Scott and William Scherer. Scalable Queue-Based Spin Locks with Timeout. In PPOPP, 2001.