I. A One-Week Primer on (Scalable Scientific) High Performance Computing
- A. Defining some terms
- High performance
- Scientific
- Scalable
- architecture
- algorithm
- system
- B. Motivation
- Solve bigger problems, more quickly, more accurately, and at
less cost.
- Grand challenge problems.
- National challenge problems (see
1996 blue book).
- Computational science.
- Why parallelism?
- Physical arguments: there are physical limits to reducing cycle
time, dissipating heat, laying out circuits, etc.
- Computer science and engineering arguments:
it makes sense to exploit the typical behavior of
programs---regularity, uniformity, independent
operations, ...
- Economic arguments: share resources, remove bottlenecks, use
resources efficiently, exploit commodity microprocessors.
- Philosophical arguments: the systems being modeled have
tremendous inherent parallelism, so ...
- Why scalable parallelism?
- C. Some history
- Low-level, ``hidden'' parallelism has been used for decades.
- But we follow Duncan (90):
``... a parallel architecture provides
an explicit, high-level framework for the development of parallel
programming solutions by providing multiple processors, whether
simple or complex, that cooperate to solve problems through
concurrent execution.''
- HPSC in the 1980s was dominated by vector supercomputers with
modest (shared-memory) parallelism.
- Now scalable, massively parallel machines are the most
powerful, and networks of workstations are becoming a very
attractive alternative as well.
(See the
performance database at netlib for an extensive up-to-date list
of benchmarks.)
- D. What do I need to know about hardware?
- Achieving ``high performance'' without parallelism:
- Computer engineering: VLSI, fabrication, cooling, etc.
- Low-level parallelism (e.g., independent functional units).
- RISC.
- Instruction pipelining, superscalar, LIW.
- Hierarchical memory.
- Flynn's taxonomy: SISD, SIMD, MIMD (Flynn, 66).
- Shared vs. distributed memory.
- A typical distributed memory multiprocessor architecture:
- Processors are interconnected by a network
(tightly- vs. loosely-coupled).
- Memory is distributed among the processors.
- Communication is by message passing.
- Issues, variations, and details:
- An alternative architecture: connect processors to separate
memory modules.
- Network topologies: ring, mesh, hypercube,
fat tree, completely connected (e.g., with a cross-bar switch or
multistage network), ...
- Issues in message passing:
- Routing technologies, special hardware.
- Overhead: startup cost, buffers, message length effects.
- Latency vs. bandwidth.
- Synchronization.
- Contention, bottlenecks.
- Some famous examples of distributed memory multiprocessors.
- Intel iPSC
- Intel Paragon
- IBM SP2
- Thinking Machines CM5
- Convex Exemplar
- KSR-1
- Cray T3D, T3E
- E. What do I need to know about software and programming?
- One can make a case that all major aspects of computing
are made significantly more difficult in a parallel setting:
- algorithm design
- implementation
- debugging, testing, verification
- maintenance
- portability
- Achieving ``high performance'' without parallelism:
- Optimizing compilers.
- Optimized kernels.
- Algorithms that exploit hierarchical memory:
column-oriented, block-oriented.
- Approaches to parallel processing:
- Automatic parallelization of existing sequential code
is a very difficult problem.
- Low-level (manufacturer-supplied) primitives: Fortran or C
plus explicit message passing.
- ``Standards'' for explicit message passing.
- Extensions to existing high-level languages.
- New parallel languages.
- Parallel programming tools and environments.
- Distributed shared memory (DSM), virtual shared memory (VSM).
- F. What do I need to know about algorithms?
- Important issues in exploiting parallelism:
- Functional (control) vs. data parallelism.
- Granularity: fine vs. coarse.
- Computation vs. communication.
- Redundant computation is not necessarily bad.
- Use of memory and communication systems is critical:
``where is the data?!''
- Mapping and load balancing are critical for irregular
computations.
- Synchronization and nondeterminism cause problems.
- Use of existing (sequential) software would be nice ...
- Common techniques in exploiting parallelism:
- Vectorize and/or parallelize loops.
- Divide and conquer.
- Pipelining.
- Master/slave paradigm.
- Data flow paradigm.
- Theoretical results: there is an extensive literature.
- G. What do I need to know about performance?
- Theory and experimentation are both important.
- What to measure?
- Speedup and efficiency.
- Scaled-speedup, iso-efficiency.
- Peak speeds vs. benchmarks.
- Time vs. computational rate.
- Time to achieve a given level of accuracy.
- Why is performance evaluation difficult?
- Complex interactions among hardware and software.
- Nonlinearities.
- Nondeterminism.
- Limits on performance: Amdahl's Law.
CS6404 class account (cs6404@ei.cs.vt.edu)