Note
These notes are taken after the fact based on the slides. Anything mentioned only in lecture is not included.
Parallel Computing
- cliché: “cannot see the forest for the trees”
- Trees: cores, hardware threads, vectors, caches, heterogeneity, etc.
- Forest: parallelism & locality
- important to understand the forest (overall concepts) before diving into low-level hardware details
Key Ideas
- Increase parallelism exposure
- Increase locality of reference
- both are essential for performance today and in the future
Sequential Computing
- traditionally, software = serial execution
- instructions executed one after another
- runs on single processor
- only one instruction active at any moment
Parallel Computing
- using multiple compute resources simultaneously
- break problem into discrete parts → solve concurrently
- each part = series of instructions
- instructions run on different processors at the same time
- requires overall control/coordination mechanism
Example Programs
- Payroll: serial vs parallel processing
- Word counter across directory tree
- Mandelbrot computation
Considerations
- problem must:
- be divisible into discrete pieces that can run simultaneously
- allow multiple program instructions to execute at any moment in time
- actually solve faster on multiple resources
- compute resources:
- multi-core single computer
- multiple computers connected via network
Parallel Computers
- modern computers are inherently parallel:
- multiple functional units (L1/L2 cache, floating-point, GPU, integer units, etc.)
- multiple execution cores
- multiple hardware threads
- ex: IBM BG/Q chip → 18 cores, 16 L2 caches
Networking & Clusters
- networks connect stand-alone computers → clusters
- cluster design:
- each node = multi-processor system
- nodes connected via Infiniband, Ethernet, optical fibers
- may include special-purpose nodes
- most supercomputers = clusters from major vendors
Why Parallel Computing?
- Real world is massively parallel
- better for modeling complex natural systems (weather, biology, etc.)
- Save time & money
- more resources → shorter completion time
- commodity hardware can be used
- Match parallel hardware
- all modern CPUs = multi-core
- serial programs waste potential performance
- End of single-core growth
- microprocessor performance increased ~50%/yr until 2002
- slowed to ~20%/yr → shift to parallelism (multicore chips)
Applications & Projects
- Science/Engineering
- physics, biosciences, chemistry, geology, engineering, defense
- Industrial/Commercial
- big data, search engines, medical imaging, finance, AI, VR/graphics, pharma, oil exploration
- Global applications
- widely used across the world (see supercomputer rankings at Top500.org)
Need for Performance
- growing complexity → climate models, protein folding, drug discovery, energy, data analysis
- some problems require petaFLOPS & petaBytes
- ex: grand challenge problems, search engines
Additional Motivations
- Concurrency: multiple tasks in parallel vs one at a time
- Non-local resources: use distributed computing (e.g. SETI@home, Folding@home)
- Trends:
- faster networks + distributed systems
- 500,000x increase in supercomputer performance (last 20 years)
- exascale computing race (10^18 operations/sec)
Challenges of Writing Parallel Code
- Running multiple instances of a serial program = not a solution (doesn't solve a single problem faster)
- Translating serial → parallel line by line = often inefficient
- requires new algorithms (not just translation)
- ex: matrix multiplication, global sum example
- parallel programs must be written, not converted
Parallelization Approaches
- Partitioning is the core idea
- Task parallelism
- divide into different tasks assigned to cores
- Data parallelism
- split dataset, same operations on subsets
Examples
- Grading 120 exams
- task-parallel: each grader handles 1 question across all exams
- data-parallel: each grader handles a set of exams fully
- Global sum
- compute values in parallel (data-parallelism)
- master core collects partial sums (task-parallelism)
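A minimal C sketch (assumed example, not from the slides) of this global-sum pattern: each "core" runs the data-parallel loop over its own block of the values, then a master collects the partial sums. The rounds of parallel work are simulated serially here; the sizes N and P are made up for illustration.

```c
#include <stdio.h>

#define N 16   /* total number of values (assumed) */
#define P 4    /* number of cores (assumed)        */

int main(void) {
    double x[N], partial[P], global_sum = 0.0;
    for (int i = 0; i < N; i++) x[i] = i + 1;            /* fill with 1..N */

    /* Data-parallel phase: in a real parallel program each core would run
     * one iteration of this outer loop at the same time. */
    for (int core = 0; core < P; core++) {
        int first = core * (N / P), last = first + (N / P);
        partial[core] = 0.0;
        for (int i = first; i < last; i++)
            partial[core] += x[i];
    }

    /* Master phase: core 0 adds up the partial sums (task parallelism). */
    for (int core = 0; core < P; core++)
        global_sum += partial[core];

    printf("global sum = %f\n", global_sum);             /* 136 for 1..16 */
    return 0;
}
```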
Complexity in Parallelism
- independent tasks = simple
- coordinated tasks = complex
- communication
- load balancing
- synchronization
- ex: tree-structured sum with master/slave cores
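A sketch (assumed example) of the tree-structured version: instead of the master adding all P partial sums itself, cores pair up and combine results in log2(P) rounds, again simulated serially here.

```c
#include <stdio.h>

#define P 8   /* number of cores; assumed a power of two for simplicity */

int main(void) {
    double partial[P] = {1, 2, 3, 4, 5, 6, 7, 8};        /* per-core partial sums */

    /* Round 1: cores 0,2,4,6 add in values from 1,3,5,7; round 2: 0,4 from 2,6; ... */
    for (int stride = 1; stride < P; stride *= 2)
        for (int core = 0; core + stride < P; core += 2 * stride)
            partial[core] += partial[core + stride];

    printf("global sum = %f\n", partial[0]);             /* 36 after log2(8) = 3 rounds */
    return 0;
}
```

This trades P−1 additions on the master for log2(P) rounds of paired communication, which is where the coordination (communication, synchronization) cost shows up.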
Constructs
- Explicit constructs
- language extensions (C/C++)
- explicit assignment (core 0 runs task 0, synchronize, etc.)
- High-level abstractions
- OpenMP, CUDA
- simpler but may sacrifice some performance
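As an illustration of the high-level route, here is a minimal OpenMP version of the same global sum (assumed example; the slides only name OpenMP/CUDA as options). The reduction clause hides the explicit partitioning and synchronization that the earlier sketches spelled out.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 16;
    double sum = 0.0;

    /* OpenMP creates per-thread partial sums and combines them automatically. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += i;

    printf("sum = %f\n", sum);                           /* 136 for 1..16 */
    return 0;
}
```

Compile with gcc -fopenmp; one pragma replaces the explicit assignment of work to cores and the collection of partial sums.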
What This Course Covers
- explicit parallel programming (C, MPI, Pthreads, OpenMP)
- major models:
- shared memory
- distributed memory
Concurrent vs Parallel vs Distributed
- Concurrent: multiple tasks can be in progress at any instant
- Parallel: tasks cooperate closely to solve one problem
- Distributed: tasks loosely coupled, often separate programs
- parallel vs distributed → distinction often blurry
Note
End of l1a.pdf, moving to l1b.pdf
von Neumann Architecture
- named after John von Neumann (1945 papers)
- aka stored-program computer
- both program instructions & data stored in electronic memory
- earlier computers had to be programmed through hard wiring
- classic design
- main memory
- CPU (control unit + ALU, registers, program counter)
- bus (interconnection between CPU & memory)
- executes one instruction at a time
- von Neumann bottleneck: limited by memory access
- four main components:
- memory
- control unit
- arithmetic/logic unit
- input/output
- why important? → parallel computers still follow this design (just multiplied in units)
Processes, Multitasking, Threads
- Process = instance of program
- contains executable code, call stack, heap, I/O descriptors, security data, execution state
- Multitasking: multiprogramming & time slicing
- maximize CPU utilization
- Threads: lighter-weight execution contexts within a process
- can fork off or join back into a process
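A minimal Pthreads sketch (assumed example) of threads forking off from and joining back into a process; compile with -lpthread.

```c
#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

void *hello(void *rank) {
    long my_rank = (long) rank;   /* thread id passed in by main */
    printf("Hello from thread %ld\n", my_rank);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    /* Fork: start the threads. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, hello, (void *) t);

    /* Join: the process waits for every thread to finish before exiting. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}
```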
Modifications to von Neumann Model
- addressing performance issues:
- caching (principle of locality)
- CPU caches (L1, L2)
- small but fast
- policies like LRU
- spatial/temporal locality
- cache blocks/lines
- cache hits vs misses
- write-through vs write-back
Cache Mapping
- where to store cache lines
- fully associative: anywhere
- direct mapped: unique location
- set associative: N possible locations
- eviction via replacement policies (e.g., LRU)
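A small sketch (assumed cache geometry, not from the slides) of how a direct-mapped cache derives the unique line for an address: offset, index, and tag are just fields of the address.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint64_t line_size = 64, num_lines = 256;       /* assumed: 64 B lines, 16 KiB cache */
    uint64_t addr = 0x12345;                              /* arbitrary example address */

    uint64_t offset = addr % line_size;                   /* byte within the cache line        */
    uint64_t index  = (addr / line_size) % num_lines;     /* the one line this block can occupy */
    uint64_t tag    = addr / (line_size * num_lines);     /* distinguishes blocks sharing a line */

    printf("offset=%llu index=%llu tag=%llu\n",
           (unsigned long long) offset,
           (unsigned long long) index,
           (unsigned long long) tag);
    return 0;
}
```

An N-way set-associative cache uses the same index computation but over sets instead of single lines, leaving N candidate lines per set; a fully associative cache skips the index entirely.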
Caches & Programs
- controlled by hardware
- programmers can indirectly improve cache use
- exploit spatial & temporal locality
- ex: row-major order in C (nested loops row-first vs col-first affects cache efficiency)
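A sketch of that row-major example: both loops compute the same sum, but the first walks memory contiguously (good spatial locality) while the second jumps a full row between consecutive accesses.

```c
#include <stdio.h>

#define N 1000

static double A[N][N];   /* stored row-major in C: A[i][j] and A[i][j+1] are adjacent */

int main(void) {
    double sum = 0.0;

    /* row-first: each cache line loaded is fully used before moving on */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += A[i][j];

    /* column-first: consecutive accesses are N*sizeof(double) bytes apart,
     * so for large N nearly every access misses the cache */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += A[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```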
Virtual Memory
- extends main memory capacity
- required for multiprogramming
- main memory acts like a cache for secondary storage
- secondary = swap space
- unit = page
- page table manages virtual addressing
- TLB (translation lookaside buffer) for fast lookup
- concepts: page hit, page fault
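A small sketch (assumed 4 KiB pages) of the address split that the page table and TLB work with: only the page number is translated; the offset passes through unchanged.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint64_t page_size = 4096;                      /* assumed page size */
    uint64_t vaddr = 0x7ffd1234;                          /* arbitrary example virtual address */

    uint64_t page_number = vaddr / page_size;             /* looked up in the page table / TLB      */
    uint64_t offset      = vaddr % page_size;             /* copied straight into the physical address */

    printf("page %llu, offset %llu\n",
           (unsigned long long) page_number,
           (unsigned long long) offset);
    return 0;
}
```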
Instruction-level Parallelism (ILP)
- performance improvement via multiple instructions running at once
- approaches:
- pipelining: divide execution into stages (assembly line)
- multiple issue: initiate several instructions at once
- both are standard in modern CPUs
Pipelining
- analogy: factory line → no idle units
- example: floating-point addition split into 7 stages (1 ns each), applied to 1000 operations
- sequential: 7000 ns (1000 × 7 ns)
- pipelined: ~1006 ns (7 ns for the first result, then one result per ns: 7 + 999)
- limitations:
- speedup is less than k-fold for a k-stage pipeline
- bottleneck = slowest stage
- dependencies create delays
Multiple Issue
- replicate functional units for simultaneous execution
- ex: 2 floating-point adders working alternately
- scheduling:
- static (compile time)
- dynamic (runtime, superscalar processors)
- speculation: execute ahead on guessed outcomes
Hardware Multithreading
- ILP is limited → use thread-level parallelism (TLP)
- methods:
- fine-grained: switch threads after every instruction, skipping stalled threads (individual threads progress slowly)
- coarse-grained: switch only on long stalls (e.g., main-memory loads); switching itself adds delay
- simultaneous multithreading (SMT): multiple threads use multiple functional units at once
Multithreading Support
- popular and widely supported by:
- hardware
- OS
- languages (Java, C++ examples)
General Parallel Terminology
- Supercomputing/HPC: using world’s largest systems
- Node: standalone computer-in-a-box with CPUs, memory, NICs
- CPU / socket / processor / core:
- CPU = once a single execution unit
- now subdivided into multiple cores
- multiple CPUs per node (each with multiple cores)
- Task: logical unit of computational work, executed by a processor
- Pipelining: task broken into steps with streamed inputs
Shared vs Distributed Memory
- Shared memory:
- hardware: all processors have direct access
- programming: tasks see same logical memory space
- SMP (symmetric multi-processor):
- multiple processors, single address space
- Distributed memory:
- hardware: each node has local memory
- programming: tasks only see local memory
- communication needed to access remote memory
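A minimal MPI sketch (assumed example) of the distributed-memory model: each process only sees its own local memory, so process 1 must explicitly send a value for process 0 to receive it. Build with mpicc and run with at least two processes under mpiexec.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        value = 42;   /* exists only in process 1's local memory */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 0 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

In a shared-memory program the same exchange would just be a read of a variable both tasks can see, plus synchronization.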
Communications & Synchronization
- Communications:
- tasks must exchange data (via shared memory bus or network)
- Synchronization:
- coordinating tasks in real time
- usually requires waiting at synchronization points
- may increase wall-clock execution time
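A sketch (assumed example) of a synchronization point using a POSIX barrier: no thread proceeds to phase 2 until all threads finish phase 1, so fast threads wait, which is exactly the wall-clock cost mentioned above. Compile with -lpthread.

```c
#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

pthread_barrier_t barrier;

void *work(void *rank) {
    long my_rank = (long) rank;
    printf("thread %ld: phase 1 done\n", my_rank);
    pthread_barrier_wait(&barrier);                       /* synchronization point */
    printf("thread %ld: phase 2 starting\n", my_rank);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);    /* all NUM_THREADS must arrive */

    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, work, (void *) t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    pthread_barrier_destroy(&barrier);
    return 0;
}
```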
Granularity
- measure of computation vs communication
- Coarse: lots of work between communication
- Fine: little work between communication events