I. A One-Week Primer on (Scalable Scientific) High Performance Computing

A. Defining some terms

B. Motivation

C. Some history

D. What do I need to know about hardware?

  1. Achieving ``high performance'' without parallelism:
    • Computer engineering: VLSI, fabrication, cooling, etc.
    • Low-level parallelism (e.g., independent functional units).
    • RISC.
    • Instruction pipelining, superscalar execution, VLIW.
    • Hierarchical memory.

  2. Flynn's taxonomy classifies machines by single vs. multiple instruction and data streams: SISD, SIMD, MIMD (Flynn, 1966).

  3. Shared vs. distributed memory.

  4. A typical distributed memory multiprocessor architecture:
    • Processors are interconnected by a network (tightly- vs. loosely-coupled).
    • Memory is distributed among the processors.
    • Communication is by message passing.

  5. Issues, variations, and details:

    • An alternative architecture: connect processors to separate memory modules.

    • Network topologies: ring, mesh, hypercube, fat tree, completely connected (e.g., with a crossbar switch or multistage network), ...
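    As a concrete instance of how topology shapes communication cost, a small sketch (in Python, purely illustrative) of routing distance on a hypercube: nodes carry d-bit labels, the hop count between two nodes is the Hamming distance of their labels, and so a 2^d-node machine has diameter only d.

    ```python
    # Hypercube routing distance: on a d-dimensional hypercube, nodes are
    # labeled with d-bit strings and neighbors differ in exactly one bit,
    # so the shortest path between two nodes is the Hamming distance of
    # their labels. The diameter of a p = 2**d node machine is d = log2(p).

    def hypercube_distance(a, b):
        """Hops between nodes a and b: the Hamming distance of their labels."""
        return bin(a ^ b).count("1")

    # On a 3-cube (8 nodes), opposite corners 0b000 and 0b111 are 3 hops apart.
    ```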

    • Issues in message passing:
      • Routing technologies, special hardware.
      • Overhead: startup cost, buffers, message length effects.
      • Latency vs. bandwidth.
      • Synchronization.
      • Contention, bottlenecks.
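      The startup-cost, message-length, and latency-vs.-bandwidth issues above are often summarized by a first-order linear cost model; a minimal Python sketch (the constants here are illustrative, not measurements of any machine):

      ```python
      # A common first-order model of point-to-point message cost:
      #   T(n) = t_s + n * t_b
      # where t_s is the startup (latency) cost, t_b the per-byte transfer
      # time (the reciprocal of bandwidth), and n the message length in bytes.
      # The default constants below are illustrative only.

      def message_time(n_bytes, t_startup=1e-4, t_per_byte=1e-8):
          """Estimated time to send one n-byte message."""
          return t_startup + n_bytes * t_per_byte

      # Short messages are dominated by startup cost, long ones by bandwidth,
      # which is why bundling many small messages into one large one pays off.
      ```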

    • Some famous examples of distributed memory multiprocessors.

      • Intel iPSC
      • Intel Paragon
      • IBM SP2
      • Thinking Machines CM-5
      • Convex Exemplar
      • KSR-1
      • Cray T3D, T3E

E. What do I need to know about software and programming?

  1. One can make a case that all major aspects of computing are made significantly more difficult in a parallel setting:
    • algorithm design
    • implementation
    • debugging, testing, verification
    • maintenance
    • portability

  2. Achieving ``high performance'' without parallelism:
    • Optimizing compilers.
    • Optimized kernels.
    • Algorithms that exploit hierarchical memory: column-oriented, block-oriented.
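    To illustrate the block-oriented idea, a Python sketch of a tiled (blocked) matrix multiply; in pure Python only the loop structure matters, but the same ordering in a compiled kernel keeps each small tile resident in fast memory while it is reused:

    ```python
    # Loop blocking (tiling): a block-oriented matrix multiply that works on
    # small bs-by-bs submatrices so that each tile can stay in cache while
    # it is reused across the inner loops.

    def blocked_matmul(A, B, n, bs=2):
        """Return C = A * B for n-by-n matrices stored as lists of lists."""
        C = [[0.0] * n for _ in range(n)]
        for ii in range(0, n, bs):            # tile row of C
            for kk in range(0, n, bs):        # tile of A, tile row of B
                for jj in range(0, n, bs):    # tile column of C
                    for i in range(ii, min(ii + bs, n)):
                        for k in range(kk, min(kk + bs, n)):
                            a = A[i][k]       # reused across the whole j loop
                            for j in range(jj, min(jj + bs, n)):
                                C[i][j] += a * B[k][j]
        return C
    ```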

  3. Approaches to parallel processing:

    • Automatic parallelization of existing sequential code is a very difficult problem.

    • Low-level (manufacturer-supplied) primitives: Fortran or C plus explicit message-passing calls.

    • ``Standards'' for explicit message passing (e.g., PVM, MPI).

    • Extensions to existing high-level languages.

    • New parallel languages.

    • Parallel programming tools and environments.

    • Distributed shared memory (DSM), virtual shared memory (VSM).
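    A toy version of the explicit message-passing style listed above: two processes that share no memory exchange data through send/recv calls, as MPI-style programs do. Python's standard multiprocessing module stands in here for a vendor or standard message-passing library.

    ```python
    # Explicit message passing between two processes: the worker blocks on a
    # receive, computes on the data, and sends its result back. No memory is
    # shared; all communication is through the channel.
    from multiprocessing import Process, Pipe

    def worker(conn):
        data = conn.recv()            # blocking receive from the master
        conn.send(sum(data))          # send the partial result back
        conn.close()

    def main():
        parent, child = Pipe()
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send([1, 2, 3, 4])     # explicit send: data moves, memory does not
        result = parent.recv()
        p.join()
        return result

    if __name__ == "__main__":
        print(main())
    ```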

F. What do I need to know about algorithms?

  1. Important issues in exploiting parallelism:

    • Functional (control) vs. data parallelism.

    • Granularity: fine vs. coarse.

    • Computation vs. communication.

    • Redundant computation is not necessarily bad.

    • Use of memory and communication systems is critical: ``where is the data?!''

    • Mapping and load balancing are critical for irregular computations.

    • Synchronization and nondeterminism cause problems.

    • Use of existing (sequential) software would be nice ...

  2. Common techniques in exploiting parallelism:

    • Vectorize and/or parallelize loops.

    • Divide and conquer.

    • Pipelining.

    • Master/slave paradigm.

    • Data flow paradigm.
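    A minimal sketch of the master/slave (task-farm) paradigm: the master hands independent work units to a pool of workers and gathers the results. Threads are used here only to keep the sketch self-contained and portable; a production code would use worker processes or message-passing ranks, and the task itself is a placeholder.

    ```python
    # Master/slave task farm: independent tasks are distributed to a worker
    # pool, which gives automatic load balance when task costs vary.
    from concurrent.futures import ThreadPoolExecutor

    def slave_task(x):
        return x * x                  # stand-in for one real unit of work

    def master(tasks, n_workers=4):
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            return list(pool.map(slave_task, tasks))

    # master(range(5)) -> [0, 1, 4, 9, 16]
    ```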

  3. Theoretical results: there is an extensive literature.

G. What do I need to know about performance?

  1. Theory and experimentation are both important.

  2. What to measure?
    • Speedup and efficiency.
    • Scaled speedup, iso-efficiency.
    • Peak speeds vs. benchmarks.
    • Time vs. computational rate.
    • Time to achieve a given level of accuracy.
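  The first two measures have standard definitions, sketched below in Python: speedup is the ratio of the one-processor time to the p-processor time, and efficiency normalizes that by p. The example timings are made up for illustration.

  ```python
  # Speedup and efficiency:
  #   S(p) = T(1) / T(p)        ideal value: p
  #   E(p) = S(p) / p           ideal value: 1

  def speedup(t1, tp):
      """Speedup from sequential time t1 to parallel time tp."""
      return t1 / tp

  def efficiency(t1, tp, p):
      """Fraction of ideal speedup achieved on p processors."""
      return speedup(t1, tp) / p

  # e.g. a run taking 100 s on 1 processor and 20 s on 8 processors has
  # speedup 5.0 but efficiency only 0.625.
  ```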

  3. Why is performance evaluation difficult?
    • Complex interactions among hardware and software.
    • Nonlinearities.
    • Nondeterminism.

  4. Limits on performance: Amdahl's Law.
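  Amdahl's Law bounds speedup by the serial fraction of the work; a short Python sketch of the formula (the 5% figure below is just an example):

  ```python
  # Amdahl's Law: if a fraction f of the work is inherently serial, then
  # the speedup on p processors is
  #   S(p) = 1 / (f + (1 - f) / p),
  # which approaches the ceiling 1/f as p grows without bound.

  def amdahl_speedup(f, p):
      """Predicted speedup with serial fraction f on p processors."""
      return 1.0 / (f + (1.0 - f) / p)

  # Even 5% serial work caps the speedup at 20 regardless of p:
  # amdahl_speedup(0.05, 16) is only about 9.14, and no processor count
  # pushes it past 20.
  ```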


CS6404 class account (cs6404@ei.cs.vt.edu)