Note

These notes were written after the fact, based on the slides. Anything mentioned only in lecture is not included.

Parallel Computing

  • cliché: “cannot see the forest for the trees”
    • Trees: cores, hardware threads, vectors, caches, heterogeneity, etc.
    • Forest: parallelism & locality
  • important to understand the forest (overall concepts) before diving into low-level hardware details

Key Ideas

  • Increase parallelism exposure
  • Increase locality of reference
    • both are essential for performance today and in the future

Sequential Computing

  • traditionally, software = serial execution
    • instructions executed one after another
    • runs on single processor
    • only one instruction active at any moment

Parallel Computing

  • using multiple compute resources simultaneously
    • break the problem into discrete parts that can be solved concurrently
    • each part = series of instructions
    • instructions run on different processors at the same time
    • requires overall control/coordination mechanism

Example Programs

  • Payroll: serial vs parallel processing
  • Word counter across directory tree
  • Mandelbrot computation

Considerations

  • problem must:
    • be divisible into discrete pieces that can run simultaneously
    • allow multiple program instructions at same time
    • actually solve faster on multiple resources
  • compute resources:
    • a single computer with multiple processors/cores
    • multiple computers connected via network

Parallel Computers

  • modern computers are inherently parallel:
    • multiple functional units (L1/L2 cache, floating-point, GPU, integer units, etc.)
    • multiple execution cores
    • multiple hardware threads
  • ex: IBM BG/Q chip with 18 cores and 16 L2 cache units

Networking & Clusters

  • networks connect stand-alone computers to form clusters
  • cluster design:
    • each node = multi-processor system
    • nodes connected via InfiniBand, Ethernet, or optical fiber
    • may include special-purpose nodes
  • most supercomputers = clusters from major vendors

Why Parallel Computing?

  • Real world is massively parallel
    • better for modeling complex natural systems (weather, biology, etc.)
  • Save time & money
    • more resources can mean shorter time to completion
    • commodity hardware can be used
  • Match parallel hardware
    • all modern CPUs = multi-core
    • serial programs waste potential performance
  • End of single-core growth
    • microprocessor performance increased ~50%/yr until 2002
    • then slowed to ~20%/yr, driving the shift to parallelism (multicore chips)

Applications & Projects

  • Science/Engineering
    • physics, biosciences, chemistry, geology, engineering, defense
  • Industrial/Commercial
    • big data, search engines, medical imaging, finance, AI, VR/graphics, pharma, oil exploration
  • Global applications
    • widely used across the world (see Top500.org for supercomputer rankings)

Need for Performance

  • growing problem complexity: climate models, protein folding, drug discovery, energy, data analysis
  • some problems require petaFLOPS of compute & petabytes of data
    • ex: grand challenge problems, search engines

Additional Motivations

  • Concurrency: multiple tasks in parallel vs one at a time
  • Non-local resources: use distributed computing (e.g. SETI@home, Folding@home)
  • Trends:
    • faster networks + distributed systems
    • 500,000x increase in supercomputer performance (last 20 years)
    • exascale computing race (10^18 operations/sec)

Challenges of Writing Parallel Code

  • Running multiple instances of a serial program = not ideal
  • Translating serial code directly to parallel code = inefficient
    • requires new algorithms (not just translation)
    • ex: matrix multiplication, global sum example
  • parallel programs must be written, not converted

Parallelization Approaches

  • Partitioning is the core idea
  • Task parallelism
    • divide the work into different tasks and assign them to cores
  • Data parallelism
    • split the dataset and perform the same operations on each subset (both styles sketched in the code below)
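
A minimal sketch of the two styles (my own illustration, not from the slides), assuming OpenMP and compilation with something like gcc -fopenmp: "sections" hands two different tasks to different threads (task parallelism), while "parallel for" applies the same operation to different chunks of one array (data parallelism).

```c
/* Sketch only: toy tasks and data, OpenMP assumed. */
#include <omp.h>
#include <stdio.h>

#define N 8

static void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
static void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void) {
    int x[N];

    /* task parallelism: different tasks assigned to different threads */
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }

    /* data parallelism: the same operation on different pieces of the data */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        x[i] = 2 * i;

    for (int i = 0; i < N; i++)
        printf("%d ", x[i]);
    printf("\n");
    return 0;
}
```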

Examples

  • Grading 120 exams
    • task-parallel: each grader handles 1 question across all exams
    • data-parallel: each grader handles a set of exams fully
  • Global sum (sketched below)
    • compute values in parallel (data-parallelism)
    • master core collects partial sums (task-parallelism)
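
A sketch of the global sum (OpenMP assumed; the array contents and sizes are made up): each thread computes a partial sum over its chunk of the data, and the master thread adds the partial sums after the parallel region, matching the data-parallel/task-parallel split above.

```c
/* Sketch only: OpenMP assumed, compile with -fopenmp.
   Assumes no more than MAX_THREADS threads are used. */
#include <omp.h>
#include <stdio.h>

#define N 1000000
#define MAX_THREADS 64

static double x[N];

int main(void) {
    double partial[MAX_THREADS] = {0.0};
    int nthreads = 1;

    for (int i = 0; i < N; i++) x[i] = 1.0;   /* dummy data */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        double my_sum = 0.0;

        #pragma omp single
        nthreads = omp_get_num_threads();

        /* data parallelism: each thread sums its chunk of x */
        #pragma omp for
        for (int i = 0; i < N; i++)
            my_sum += x[i];

        partial[id] = my_sum;                 /* store this core's partial sum */
    }

    /* task parallelism: the master collects the partial sums */
    double global_sum = 0.0;
    for (int t = 0; t < nthreads; t++)
        global_sum += partial[t];

    printf("global sum = %f (expected %d)\n", global_sum, N);
    return 0;
}
```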

Complexity in Parallelism

  • independent tasks = simple
  • coordinated tasks = complex
    • communication
    • load balancing
    • synchronization
  • ex: tree-structured sum with master/slave cores (sketched below)
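
A sketch of the tree-structured combine, written serially so only the communication pattern is shown (the per-core partial sums are made-up values): at each step, core i adds in the partial sum held by core i + stride, halving the number of active cores until core 0 holds the global sum.

```c
/* Sketch only: serial simulation of the tree-structured reduction. */
#include <stdio.h>

#define NCORES 8

int main(void) {
    double partial[NCORES] = {1, 2, 3, 4, 5, 6, 7, 8};   /* per-core partial sums (made up) */

    for (int stride = 1; stride < NCORES; stride *= 2)
        for (int i = 0; i < NCORES; i += 2 * stride)
            partial[i] += partial[i + stride];   /* core i "receives" from core i+stride */

    printf("global sum held by core 0: %f\n", partial[0]);
    return 0;
}
```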

Constructs

  • Explicit constructs
    • language extensions (C/C++)
    • explicit assignment (core 0 runs task 0, synchronize, etc.)
  • High-level abstractions
    • OpenMP, CUDA
    • simpler but may sacrifice some performance (contrast sketched below)
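
A sketch of the contrast, with made-up work: the explicit version uses Pthreads and hands each thread its index range by hand, while the high-level version in the comment lets an OpenMP pragma make the same assignment. Link with -pthread.

```c
/* Sketch only: explicit (Pthreads) vs. high-level (OpenMP) parallelization. */
#include <pthread.h>
#include <stdio.h>

#define N 8
#define NTHREADS 4

static int a[N];

/* explicit style: each thread works on its own slice of a[] */
void *work(void *arg) {
    long id = (long)arg;
    for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
        a[i] = (int)(i * i);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, work, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);          /* explicit synchronization */

    /* high-level style: the same loop as one OpenMP pragma (compile with -fopenmp):
           #pragma omp parallel for
           for (int i = 0; i < N; i++) a[i] = i * i;                               */

    for (int i = 0; i < N; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```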

What This Course Covers

  • explicit parallel programming (C, MPI, Pthreads, OpenMP)
  • major models:
    • shared memory
    • distributed memory

Concurrent vs Parallel vs Distributed

  • Concurrent: multiple tasks might be in progress
  • Parallel: tasks cooperate closely to solve one problem
  • Distributed: tasks loosely coupled, often separate programs
    • parallel vs distributed distinction often blurry

Note

End of l1a.pdf, moving to l1b.pdf

von Neumann Architecture

  • named after John von Neumann (1945 papers)
  • aka stored-program computer
    • both program instructions & data stored in electronic memory
    • earlier computers required hard wiring
  • classic design
    • main memory
    • CPU (control unit + ALU, registers, program counter)
    • bus (interconnection between CPU & memory)
    • executes one instruction at a time
    • von Neumann bottleneck: speed limited by the rate of data transfer between CPU and memory
  • four main components:
    • memory
    • control unit
    • arithmetic/logic unit
    • input/output
  • why important? parallel computers still follow this basic design, just with the units multiplied

Processes, Multitasking, Threads

  • Process = instance of program
    • contains executable code, call stack, heap, I/O descriptors, security data, execution state
  • Multitasking: multiprogramming & time slicing
    • maximize CPU utilization
  • Threads: lighter-weight execution contexts within a process
    • can fork off from and join back into a process (sketched below)
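
A minimal Pthreads sketch of fork/join (illustrative only): the forked thread writes into heap memory owned by the process, and the main thread joins it before reading the result, since all threads of a process share its address space. Link with -pthread.

```c
/* Sketch only: one thread forked off and joined back. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

void *worker(void *arg) {
    int *shared = arg;      /* same heap memory as the main thread */
    *shared = 42;
    return NULL;
}

int main(void) {
    int *value = malloc(sizeof *value);
    pthread_t t;

    pthread_create(&t, NULL, worker, value);  /* fork off a thread */
    pthread_join(t, NULL);                    /* join it back      */

    printf("value written by thread: %d\n", *value);
    free(value);
    return 0;
}
```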

Modifications to von Neumann Model

  • addressing performance issues:
    • caching (principle of locality)
      • CPU caches (L1, L2)
      • small but fast
      • policies like LRU
      • spatial/temporal locality
    • cache blocks/lines
      • cache hits vs misses
      • write-through vs write-back

Cache Mapping

  • where to store cache lines (mapping sketched below):
    • fully associative: anywhere
    • direct mapped: unique location
    • set associative: N possible locations
  • eviction via replacement policies (e.g., LRU)
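
An illustrative sketch of the mapping rules; the line size, cache size, associativity, and address are all assumptions, not figures from the slides. Direct-mapped placement is a modulus over the number of lines; N-way set-associative placement picks a set the same way and then allows any of the N ways.

```c
/* Sketch only: all sizes are assumed for illustration. */
#include <stdio.h>

#define LINE_SIZE 64    /* bytes per cache line (assumed) */
#define NLINES    256   /* total lines in the cache (assumed) */
#define NWAYS     4     /* associativity for the set-associative case (assumed) */

int main(void) {
    unsigned long addr  = 0x12345678;          /* arbitrary byte address */
    unsigned long block = addr / LINE_SIZE;    /* which memory block it falls in */

    printf("direct mapped   -> line %lu\n", block % NLINES);
    printf("%d-way set assoc -> set %lu (any of %d ways)\n",
           NWAYS, block % (NLINES / NWAYS), NWAYS);
    /* fully associative: the block may go in any of the NLINES lines */
    return 0;
}
```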

Caches & Programs

  • controlled by hardware
  • programmers can indirectly improve cache use
    • exploit spatial & temporal locality
    • ex: row-major order in C (looping row-first vs. column-first over the same array changes cache efficiency; sketched below)
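
A sketch of the row-major effect (array size is arbitrary): both loop nests compute the same sum, but the first walks memory contiguously while the second jumps N doubles between consecutive accesses, so it makes far worse use of cache lines.

```c
/* Sketch only: same computation, two traversal orders. */
#include <stdio.h>

#define N 1000

static double A[N][N];   /* zero-initialized; the values don't matter here */

int main(void) {
    double sum = 0.0;

    /* cache-friendly: row index in the outer loop (contiguous accesses) */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += A[i][j];

    /* cache-unfriendly: column index in the outer loop
       (strides through memory N doubles at a time) */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += A[i][j];

    printf("%f\n", sum);
    return 0;
}
```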

Virtual Memory

  • extends main memory capacity
  • required for multiprogramming
  • main memory acts like a cache for secondary storage
  • secondary = swap space
  • unit = page
  • page table manages virtual addressing (address split sketched below)
    • TLB (translation lookaside buffer) for fast lookup
  • concepts: page hit, page fault
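
A sketch of how a virtual address splits into a page number and an offset; the 4 KiB page size and the address are assumptions for illustration. The page number is what the page table (and TLB) translate; the offset is unchanged by translation.

```c
/* Sketch only: page size and address are assumed values. */
#include <stdio.h>

#define PAGE_SIZE 4096UL   /* 4 KiB pages (assumption) */

int main(void) {
    unsigned long vaddr = 0x0003A7F4;          /* arbitrary virtual address */

    unsigned long page   = vaddr / PAGE_SIZE;  /* looked up in page table / TLB */
    unsigned long offset = vaddr % PAGE_SIZE;  /* unchanged by translation      */

    printf("virtual 0x%lx -> page %lu, offset 0x%lx\n", vaddr, page, offset);
    return 0;
}
```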

Instruction-level Parallelism (ILP)

  • performance improvement via multiple instructions running at once
  • approaches:
    • pipelining: divide execution into stages (assembly line)
    • multiple issue: initiate several instructions at once
  • both are standard in modern CPUs

Pipelining

  • analogy: factory assembly line, with no units left idle
  • example: a floating-point operation split into 7 stages (1 ns each), applied to 1000 operations
    • sequential: 1000 × 7 ns = 7000 ns
    • pipelined: 7 ns for the first result, then 1 ns per result afterward ≈ 1006 ns
  • limitations:
    • less than a k-fold speedup from a k-stage pipeline
    • bottleneck = slowest stage
    • dependencies create delays

Multiple Issue

  • replicate functional units for simultaneous execution
  • ex: 2 floating-point adders working alternately
  • scheduling:
    • static (compile time)
    • dynamic (runtime, superscalar processors)
  • speculation: execute ahead on guessed outcomes

Hardware Multithreading

  • ILP alone is limited, so hardware also exploits thread-level parallelism (TLP)
  • methods:
    • fine-grained: switch threads after every instruction, skipping stalled threads
    • coarse-grained: switch only when a thread stalls on a long operation (the switch itself adds delay)
    • simultaneous multithreading (SMT): multiple threads use multiple functional units at once

Multithreading Support

  • popular and widely supported by:
    • hardware
    • OS
    • languages (Java, C++ examples)

General Parallel Terminology

  • Supercomputing/HPC: using world’s largest systems
  • Node: standalone computer-in-a-box with CPUs, memory, NICs
  • CPU / socket / processor / core:
    • CPU = once a single execution unit
    • now subdivided into multiple cores
    • multiple CPUs per node (each with multiple cores)
  • Task: logical unit of computational work, executed by a processor
  • Pipelining: task broken into steps with streamed inputs

Shared vs Distributed Memory

  • Shared memory:
    • hardware: all processors have direct access
    • programming: tasks see same logical memory space
  • SMP (symmetric multi-processor):
    • multiple processors, single address space
  • Distributed memory:
    • hardware: each node has local memory
    • programming: tasks only see local memory
    • communication needed to access memory on other nodes (MPI sketch below)
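
A minimal MPI sketch of the distributed-memory model (illustrative; compile with mpicc and run with two processes, e.g. mpiexec -n 2): rank 1 cannot read rank 0's local variable directly, so the value has to be sent and received explicitly.

```c
/* Sketch only: two ranks exchanging one integer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, local = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        local = 42;                        /* lives in rank 0's local memory */
        MPI_Send(&local, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", local);
    }

    MPI_Finalize();
    return 0;
}
```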

Communications & Synchronization

  • Communications:
    • tasks must exchange data (via shared memory bus or network)
  • Synchronization:
    • coordinating tasks in real time
    • usually requires tasks to wait at synchronization points (mutex sketch below)
    • may increase wall-clock execution time
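
A Pthreads sketch of a synchronization point (thread and iteration counts are arbitrary): each increment of the shared counter must wait for the lock, which keeps the result correct but adds the waiting/wall-clock cost noted above. Link with -pthread.

```c
/* Sketch only: a mutex serializing updates to a shared counter. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS 100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);    /* synchronization point: wait for the lock */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, increment, NULL);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("counter = %ld (expected %d)\n", counter, NTHREADS * ITERS);
    return 0;
}
```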

Granularity

  • measure of computation vs communication
    • Coarse: lots of work between communication
    • Fine: little work between communication events