Note
These notes are taken after the fact based on the slides. Anything mentioned only in lecture is not included.
Parallel Computing
- cliché: “cannot see the forest for the trees”
- Trees: cores, hardware threads, vectors, caches, heterogeneity, etc.
- Forest: parallelism & locality
- important to understand the forest (overall concepts) before diving into low-level hardware details
Key Ideas
- Increase parallelism exposure
- Increase locality of reference
- both are essential for performance today and in the future
Sequential Computing
- traditionally, software = serial execution
- instructions executed one after another
- runs on single processor
- only one instruction active at any moment
Parallel Computing
- using multiple compute resources simultaneously
- break problem into discrete parts → solve concurrently
- each part = series of instructions
- instructions run on different processors at the same time
- requires overall control/coordination mechanism
Example Programs
- Payroll: serial vs parallel processing
- Word counter across directory tree
- Mandelbrot computation
Considerations
- problem must:
- be divisible into discrete pieces that can run simultaneously
- allow multiple program instructions to execute at any moment in time
- actually solve faster on multiple resources
- compute resources:
- multi-core single computer
- multiple computers connected via network
Parallel Computers
- modern computers are inherently parallel:
- multiple functional units (L1/L2 cache, floating-point, GPU, integer units, etc.)
- multiple execution cores
- multiple hardware threads
- ex: IBM BG/Q chip → 18 cores, 16 L2 caches
Networking & Clusters
- networks connect stand-alone computers → clusters
- cluster design:
- each node = multi-processor system
- nodes connected via Infiniband, Ethernet, optical fibers
- may include special-purpose nodes
- most supercomputers = clusters from major vendors
Why Parallel Computing?
- Real world is massively parallel
- better for modeling complex natural systems (weather, biology, etc.)
- Save time & money
- more resources → shorter completion time
- commodity hardware can be used
- Match parallel hardware
- all modern CPUs = multi-core
- serial programs waste potential performance
- End of single-core growth
- microprocessor performance increased ~50%/yr until 2002
- slowed to ~20%/yr → shift to parallelism (multicore chips)
Applications & Projects
- Science/Engineering
- physics, biosciences, chemistry, geology, engineering, defense
- Industrial/Commercial
- big data, search engines, medical imaging, finance, AI, VR/graphics, pharma, oil exploration
- Global applications
- widely used across the world (see supercomputer rankings at Top500.org)
Need for Performance
- growing complexity → climate models, protein folding, drug discovery, energy, data analysis
- some problems require petaFLOPS & petaBytes
- ex: grand challenge problems, search engines
Additional Motivations
- Concurrency: multiple tasks in parallel vs one at a time
- Non-local resources: use distributed computing (e.g. SETI@home, Folding@home)
- Trends:
- faster networks + distributed systems
- 500,000x increase in supercomputer performance (last 20 years)
- exascale computing race (10^18 operations/sec)
Challenges of Writing Parallel Code
- Running multiple instances of a serial program = not a solution (doesn't solve a single problem faster)
- Translating serial → parallel line by line = often inefficient
- requires new algorithms (not just translation)
- ex: matrix multiplication, global sum example
- parallel programs must be written, not converted
Parallelization Approaches
- Partitioning is the core idea
- Task parallelism
- divide into different tasks assigned to cores
- Data parallelism
- split dataset, same operations on subsets
Examples
- Grading 120 exams
- task-parallel: each grader handles 1 question across all exams
- data-parallel: each grader handles a set of exams fully
- Global sum
- compute values in parallel (data-parallelism)
- master core collects partial sums (task-parallelism)
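A minimal C sketch (assumed example, not from the slides) of this global-sum pattern: each "core" runs the data-parallel loop over its own block of the values, then a master collects the partial sums. The rounds of parallel work are simulated serially here; the sizes N and P are made up for illustration.

```c
#include <stdio.h>

#define N 16   /* total number of values (assumed) */
#define P 4    /* number of cores (assumed)        */

int main(void) {
    double x[N], partial[P], global_sum = 0.0;
    for (int i = 0; i < N; i++) x[i] = i + 1;            /* fill with 1..N */

    /* Data-parallel phase: in a real parallel program each core would run
     * one iteration of this outer loop at the same time. */
    for (int core = 0; core < P; core++) {
        int first = core * (N / P), last = first + (N / P);
        partial[core] = 0.0;
        for (int i = first; i < last; i++)
            partial[core] += x[i];
    }

    /* Master phase: core 0 adds up the partial sums (task parallelism). */
    for (int core = 0; core < P; core++)
        global_sum += partial[core];

    printf("global sum = %f\n", global_sum);             /* 136 for 1..16 */
    return 0;
}
```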
Complexity in Parallelism
- independent tasks = simple
- coordinated tasks = complex
- communication
- load balancing
- synchronization
- ex: tree-structured sum with master/slave cores
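A sketch (assumed example) of the tree-structured version: instead of the master adding all P partial sums itself, cores pair up and combine results in log2(P) rounds, again simulated serially here.

```c
#include <stdio.h>

#define P 8   /* number of cores; assumed a power of two for simplicity */

int main(void) {
    double partial[P] = {1, 2, 3, 4, 5, 6, 7, 8};        /* per-core partial sums */

    /* Round 1: cores 0,2,4,6 add in values from 1,3,5,7; round 2: 0,4 from 2,6; ... */
    for (int stride = 1; stride < P; stride *= 2)
        for (int core = 0; core + stride < P; core += 2 * stride)
            partial[core] += partial[core + stride];

    printf("global sum = %f\n", partial[0]);             /* 36 after log2(8) = 3 rounds */
    return 0;
}
```

This trades P−1 additions on the master for log2(P) rounds of paired communication, which is where the coordination (communication, synchronization) cost shows up.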
Constructs
- Explicit constructs
- language extensions (C/C++)
- explicit assignment (core 0 runs task 0, synchronize, etc.)
- High-level abstractions
- OpenMP, CUDA
- simpler but may sacrifice some performance
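As an illustration of the high-level route, here is a minimal OpenMP version of the same global sum (assumed example; the slides only name OpenMP/CUDA as options). The reduction clause hides the explicit partitioning and synchronization that the earlier sketches spelled out.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const int n = 16;
    double sum = 0.0;

    /* OpenMP creates per-thread partial sums and combines them automatically. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += i;

    printf("sum = %f\n", sum);                           /* 136 for 1..16 */
    return 0;
}
```

Compile with gcc -fopenmp; one pragma replaces the explicit assignment of work to cores and the collection of partial sums.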
What This Course Covers
- explicit parallel programming (C, MPI, Pthreads, OpenMP)
- major models:
- shared memory
- distributed memory
Concurrent vs Parallel vs Distributed
- Concurrent: multiple tasks can be in progress at any instant
- Parallel: tasks cooperate closely to solve one problem
- Distributed: tasks loosely coupled, often separate programs
- parallel vs distributed → distinction often blurry
Note
End of l1a.pdf, moving to l1b.pdf
von Neumann Architecture
- named after John von Neumann (1945 papers)
- aka stored-program computer
- both program instructions & data stored in electronic memory
- earlier computers had to be programmed through hard wiring
- classic design
- main memory
- CPU (control unit + ALU, registers, program counter)
- bus (interconnection between CPU & memory)
- executes one instruction at a time
- von Neumann bottleneck: limited by memory access
- four main components:
- memory
- control unit
- arithmetic/logic unit
- input/output
- why important? → parallel computers still follow this design (just multiplied in units)
Processes, Multitasking, Threads
- Process = instance of program
- contains executable code, call stack, heap, I/O descriptors, security data, execution state
- Multitasking: multiprogramming & time slicing
- maximize CPU utilization
- Threads: lighter-weight execution contexts within a process
- can fork off or join back into a process
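A minimal Pthreads sketch (assumed example) of threads forking off from and joining back into a process; compile with -lpthread.

```c
#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

void *hello(void *rank) {
    long my_rank = (long) rank;   /* thread id passed in by main */
    printf("Hello from thread %ld\n", my_rank);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];

    /* Fork: start the threads. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, hello, (void *) t);

    /* Join: the process waits for every thread to finish before exiting. */
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}
```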
Modifications to von Neumann Model
- addressing performance issues:
- caching (principle of locality)
- CPU caches (L1, L2)
- small but fast
- policies like LRU
- spatial/temporal locality
- cache blocks/lines
- cache hits vs misses
- write-through vs write-back
Cache Mapping
- where to store cache lines
- fully associative: anywhere
- direct mapped: unique location
- set associative: N possible locations
- eviction via replacement policies (e.g., LRU)
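A small sketch (assumed cache geometry, not from the slides) of how a direct-mapped cache derives the unique line for an address: offset, index, and tag are just fields of the address.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint64_t line_size = 64, num_lines = 256;       /* assumed: 64 B lines, 16 KiB cache */
    uint64_t addr = 0x12345;                              /* arbitrary example address */

    uint64_t offset = addr % line_size;                   /* byte within the cache line        */
    uint64_t index  = (addr / line_size) % num_lines;     /* the one line this block can occupy */
    uint64_t tag    = addr / (line_size * num_lines);     /* distinguishes blocks sharing a line */

    printf("offset=%llu index=%llu tag=%llu\n",
           (unsigned long long) offset,
           (unsigned long long) index,
           (unsigned long long) tag);
    return 0;
}
```

An N-way set-associative cache uses the same index computation but over sets instead of single lines, leaving N candidate lines per set; a fully associative cache skips the index entirely.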
Caches & Programs
- controlled by hardware
- programmers can indirectly improve cache use
- exploit spatial & temporal locality
- ex: row-major order in C (nested loops row-first vs col-first affects cache efficiency)
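A sketch of that row-major example: both loops compute the same sum, but the first walks memory contiguously (good spatial locality) while the second jumps a full row between consecutive accesses.

```c
#include <stdio.h>

#define N 1000

static double A[N][N];   /* stored row-major in C: A[i][j] and A[i][j+1] are adjacent */

int main(void) {
    double sum = 0.0;

    /* row-first: each cache line loaded is fully used before moving on */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += A[i][j];

    /* column-first: consecutive accesses are N*sizeof(double) bytes apart,
     * so for large N nearly every access misses the cache */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += A[i][j];

    printf("sum = %f\n", sum);
    return 0;
}
```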
Virtual Memory
- extends main memory capacity
- required for multiprogramming
- main memory acts like a cache for secondary storage
- secondary = swap space
- unit = page
- page table manages virtual addressing
- TLB (translation lookaside buffer) for fast lookup
- concepts: page hit, page fault
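A small sketch (assumed 4 KiB pages) of the address split that the page table and TLB work with: only the page number is translated; the offset passes through unchanged.

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint64_t page_size = 4096;                      /* assumed page size */
    uint64_t vaddr = 0x7ffd1234;                          /* arbitrary example virtual address */

    uint64_t page_number = vaddr / page_size;             /* looked up in the page table / TLB      */
    uint64_t offset      = vaddr % page_size;             /* copied straight into the physical address */

    printf("page %llu, offset %llu\n",
           (unsigned long long) page_number,
           (unsigned long long) offset);
    return 0;
}
```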
Instruction-level Parallelism (ILP)
- performance improvement via multiple instructions running at once
- approaches:
- pipelining: divide execution into stages (assembly line)
- multiple issue: initiate several instructions at once
- both are standard in modern CPUs
Pipelining
- analogy: factory line → no idle units
- example: floating-point addition split into 7 stages (1 ns each), applied to 1000 operations
- sequential: 7000 ns (1000 × 7 ns)
- pipelined: ~1006 ns (7 ns for the first result, then one result per ns: 7 + 999)
- limitations:
- speedup is less than k-fold for a k-stage pipeline
- bottleneck = slowest stage
- dependencies create delays
Multiple Issue
- replicate functional units for simultaneous execution
- ex: 2 floating-point adders working alternately
- scheduling:
- static (compile time)
- dynamic (runtime, superscalar processors)
- speculation: execute ahead on guessed outcomes
Hardware Multithreading
- ILP is limited → use thread-level parallelism (TLP)
- methods:
- fine-grained: switch threads after every instruction, skipping stalled threads (individual threads progress slowly)
- coarse-grained: switch only on long stalls (e.g., main-memory loads); switching itself adds delay
- simultaneous multithreading (SMT): multiple threads use multiple functional units at once
Multithreading Support
- popular and widely supported by:
- hardware
- OS
- languages (Java, C++ examples)
General Parallel Terminology
- Supercomputing/HPC: using world’s largest systems
- Node: standalone computer-in-a-box with CPUs, memory, NICs
- CPU / socket / processor / core:
- CPU = once a single execution unit
- now subdivided into multiple cores
- multiple CPUs per node (each with multiple cores)
- Task: logical unit of computational work, executed by a processor
- Pipelining: task broken into steps with streamed inputs
Shared vs Distributed Memory
- Shared memory:
- hardware: all processors have direct access
- programming: tasks see same logical memory space
- SMP (symmetric multi-processor):
- multiple processors, single address space
- Distributed memory:
- hardware: each node has local memory
- programming: tasks only see local memory
- communication needed to access remote memory
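A minimal MPI sketch (assumed example) of the distributed-memory model: each process only sees its own local memory, so process 1 must explicitly send a value for process 0 to receive it. Build with mpicc and run with at least two processes under mpiexec.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        value = 42;   /* exists only in process 1's local memory */
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 0 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

In a shared-memory program the same exchange would just be a read of a variable both tasks can see, plus synchronization.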
Communications & Synchronization
- Communications:
- tasks must exchange data (via shared memory bus or network)
- Synchronization:
- coordinating tasks in real time
- usually requires waiting at synchronization points
- may increase wall-clock execution time
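A sketch (assumed example) of a synchronization point using a POSIX barrier: no thread proceeds to phase 2 until all threads finish phase 1, so fast threads wait, which is exactly the wall-clock cost mentioned above. Compile with -lpthread.

```c
#include <stdio.h>
#include <pthread.h>

#define NUM_THREADS 4

pthread_barrier_t barrier;

void *work(void *rank) {
    long my_rank = (long) rank;
    printf("thread %ld: phase 1 done\n", my_rank);
    pthread_barrier_wait(&barrier);                       /* synchronization point */
    printf("thread %ld: phase 2 starting\n", my_rank);
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);    /* all NUM_THREADS must arrive */

    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, work, (void *) t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    pthread_barrier_destroy(&barrier);
    return 0;
}
```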
Granularity
- measure of computation vs communication
- Coarse: lots of work between communication
- Fine: little work between communication events