Info
Condensed notes separated by section (Introduction, Parallel Hardware and Software, Shared-memory with threads, Shared-memory with OpenMP, Performance)
Introduction
-
Parallel computing: when multiple processors work together at the same time to solve a problem faster → splits a task into smaller parts that run simultaneously instead of one after another
-
Considerations for a possible parallel problem:
- be divisible into discrete pieces that can run simultaneously
- allow multiple program instructions at the same time
- actually solve faster on multiple resources
-
Concurrency: multiple tasks in progress over overlapping time periods vs strictly one at a time
-
Task parallelism: divide into different tasks assigned to cores
- e.g. each grader handles 1 question across all exams
-
Data parallelism: split dataset & run the same operations on subsets
- e.g. each grader handles a set of full exams (all questions)
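To make the grading analogy concrete, here is a minimal C/OpenMP sketch (not from the notes; `grade_q1`, `grade_q2`, and the sizes are illustrative placeholders) contrasting the two decompositions. Compile with `-fopenmp`.

```c
#include <omp.h>
#include <stdio.h>

#define N_EXAMS 8

/* Hypothetical scoring functions, one per exam question. */
static double grade_q1(int exam) { return exam % 5; }
static double grade_q2(int exam) { return exam % 3; }

int main(void) {
    double q1[N_EXAMS], q2[N_EXAMS];

    /* Data parallelism: the exams are split among threads and every
       thread performs the same operations on its own subset. */
#pragma omp parallel for
    for (int i = 0; i < N_EXAMS; i++) {
        q1[i] = grade_q1(i);
        q2[i] = grade_q2(i);
    }

    /* Task parallelism: different threads perform different tasks;
       one "grader" handles question 1 for all exams, another
       handles question 2. */
#pragma omp parallel sections
    {
#pragma omp section
        for (int i = 0; i < N_EXAMS; i++) q1[i] = grade_q1(i);
#pragma omp section
        for (int i = 0; i < N_EXAMS; i++) q2[i] = grade_q2(i);
    }

    printf("exam 0: q1=%.0f q2=%.0f\n", q1[0], q2[0]);
    return 0;
}
```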
-
Concurrent: multiple tasks might be in progress
-
Parallel: tasks cooperate closely to solve one problem
-
Distributed: tasks loosely coupled, often separate problems → parallel/distributed distinction is often blurry
-
Von Neumann architecture: computer design where the CPU, memory, and input/output all share a single memory and data bus
- stores both instructions and data in the same memory, and the CPU fetches and executes them one at a time
- Extensions: vector processors, pipelining, or multiple execution units built on top of the base design
-
Process: instance of a program
-
Thread: a lighter-weight execution context within a process
- Hardware multithreading: allows multiple instruction streams per core (SMT is one example)
-
Multitasking: the OS rapidly switches between processes via time slicing, giving the appearance that they run simultaneously
-
Cache mapping: where to store cache lines
- Fully associative: a line can go anywhere in the cache
- Direct mapped: each line maps to exactly one location
- Set associative: each line maps to a small set of possible locations
-
Virtual memory: lets a computer use part of its storage as extra RAM
- main memory acts like a cache for secondary storage
-
Instruction-level parallelism (ILP): performance improvement via multiple instructions running at once
- pipelining: divide execution into stages
- multiple issue: initiate several instructions at once
-
Thread-level parallelism (TLP): runs multiple threads or programs in parallel on separate cores to improve performance
-
Simultaneous multithreading (SMT): lets a single core execute multiple threads at once by sharing its resources → increasing hardware efficiency
-
Supercomputing/HPC: extremely powerful computers
-
Node: standalone computer (w/ CPUs, memory, …)
-
CPU/socket/processor/core: one single execution unit
-
Task: logical unit of computational work, executed by a processor
-
Shared memory: all processors have direct access; tasks see the same logical memory space
-
Symmetric multi-processor (SMP): multiple processors, single address space
-
Distributed memory: each node has local memory, tasks only see local memory, communication needed to access remote memory
-
Communication: tasks must exchange data (via shared memory bus or network)
- Coarse-grained communication: lots of computational work is done between communication events
- Fine-grained communication: little computational work is done between communication events
-
Synchronization: coordination of tasks (usually waiting at synchronization points)
-
Observed speedup: how much faster the parallelized code runs compared to the serial version (serial execution time ÷ parallel execution time)
-
Parallel overhead: the amount of time required to coordinate parallel tasks (as opposed to doing useful work)
- e.g. task start-up time, synchronizations, data communications, task termination time, software overhead imposed by parallel languages, libraries, OS, etc., …
-
Scalability: the ability to increase performance with the addition of more resources
- Massively parallel: close to linear growth in observed speedup
- Embarrassingly parallel: little to no need for coordination between tasks
- e.g. image processing (split into multiple parts, each part handled independently)
Parallel Hardware and Software
-
Data parallelism vs task parallelism: data parallel splits datasets into chunks processed in parallel; task parallel assigns distinct operations or tasks to different threads or processes.
-
Shared memory: uses threads that can be static or dynamic.
-
Pros and cons of threads:
- Pros: lightweight, fast communication, shared address space.
- Cons: synchronization complexity, race conditions, limited scalability.
-
Global vs shared variables: global variables are visible to all threads; shared variables must be protected using synchronization.
-
Coordinating threads: prevents nondeterminism and race conditions using locks, mutexes, semaphores, and monitors.
-
Transactions: group operations atomically to ensure consistency.
-
Distributed memory: each process has its own memory space, communicating via message passing (MPI).
- MPI communication types: blocking, non-blocking, broadcast, reduction.
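A minimal C sketch of blocking point-to-point MPI communication plus one collective (broadcast); the payload value, tag, and ranks are arbitrary illustrations. Run with at least two processes, e.g. `mpiexec -n 2 ./a.out`.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                       /* needs at least two ranks */
        if (rank == 0) printf("run with at least 2 processes\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        value = 42;                       /* arbitrary payload */
        /* Blocking send: returns once the send buffer can be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: waits until the message has arrived. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    /* Collective: broadcast rank 0's value to every process. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```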
-
One-sided communication: one process can directly read/write to another’s memory without cooperation.
-
Partitioned Global Address Space (PGAS): shared logical address space, but each processor has faster access to its local portion.
-
Parallel I/O: efficient data access across processors is critical for performance and scalability.
-
Flynn’s Taxonomy: classifies multi-processor computer architectures along two independent dimensions: the instruction stream and the data stream
Type | Full Form | Description | Example |
---|---|---|---|
SISD | Single Instruction, Single Data | One processor executes one instruction on one data item at a time | Traditional single-core CPU |
SIMD | Single Instruction, Multiple Data | One instruction operates on multiple data items in parallel | GPU or vector processor |
MISD | Multiple Instruction, Single Data | Multiple instructions operate on the same data stream | Rare, used in fault-tolerant systems |
MIMD | Multiple Instruction, Multiple Data | Multiple processors execute different instructions on different data | Multicore processors or computer clusters |
-
MIMD
- Shared-memory systems: use a common global memory that all processors can access directly (fast and easy → can cause conflicts if multiple processors try to access the same data at once)
- Non-uniform memory access (NUMA): each processor has its own local memory, but can still access other processors’ memory through an interconnect (remote accesses are slower than local ones)
- Uniform memory access (UMA): all processors share the same physical memory and access it at the same speed (simple, consistent → memory bandwidth can become a bottleneck)
- Distributed-memory systems: give each processor its own private memory (avoids memory conflicts but makes communication slower and more complex)
- Advantages & disadvantages:
- Shared memory: easy to program but limited scalability and prone to memory contention
- Distributed memory: scalable and fault-tolerant but requires explicit message passing
-
Processor-to-processor mapping:
- Anonymous: any process on any processor, dynamic scheduling, global ready list
- Dedicated: static assignment at load time, each node has its own ready list, occasional re-allocation for fault-tolerance/load-balancing
- Access latency & memory conflicts: can slow performance when multiple processors access shared memory simultaneously
- Solutions & drawbacks: caches and memory interleaving reduce latency but introduce complexity and coherence challenges
-
Bisection width: the minimum number of connections that must be cut to divide a network into two equal halves
-
Bandwidth: the rate at which data can be transferred across the network
-
Latency: the delay before data transfer begins following an instruction
-
Cache coherence: keeping copies of the same data in different caches consistent → if one processor changes a value in its cache, all other caches must see that update so everyone works with the same data
- Snooping-based coherence: caches monitor a shared bus for read/write operations to maintain consistency (good for small-scale systems with a common bus)
- Directory-based coherence: a directory keeps track of which caches have copies of each memory block
-
Hypercube: a network topology where each node is connected to others in a multi-dimensional cube structure
-
Indirect interconnects: use switches or routers between processors rather than direct point-to-point links
- allows scalable communication by routing messages through intermediate nodes
- Direct interconnect features: include mode of operation, control strategy, and switching techniques that define how data flows between nodes
-
False sharing: occurs when processors cache different variables that are on the same cache line
- causes severe performance degradation → avoid by using proper data alignment and padding
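A minimal C/OpenMP sketch of padding as a fix, assuming a 64-byte cache line (thread and iteration counts are arbitrary; compile with `-fopenmp`). Each thread updates only its own counter, so there is no race — only false sharing in the unpadded version.

```c
#include <omp.h>
#include <stdio.h>

#define N_THREADS 4
#define ITERS 10000000L

/* Unpadded: the counters share a cache line, so every update by one
   thread invalidates that line in the other threads' caches. */
volatile long counters[N_THREADS];

/* Padded: each counter occupies its own (assumed 64-byte) cache line. */
struct padded { volatile long value; char pad[64 - sizeof(long)]; };
struct padded padded_counters[N_THREADS];

int main(void) {
    double t0 = omp_get_wtime();
#pragma omp parallel num_threads(N_THREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            counters[id]++;              /* false sharing */
    }
    double t1 = omp_get_wtime();
#pragma omp parallel num_threads(N_THREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            padded_counters[id].value++; /* no false sharing */
    }
    double t2 = omp_get_wtime();
    printf("unpadded: %.3fs  padded: %.3fs\n", t1 - t0, t2 - t1);
    return 0;
}
```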
-
Message transmission time: the amount of time it takes to push all the bits of a message onto the transmission medium
- Static threads: created once and persist for the lifetime of the program (useful for predictable workloads)
- Dynamic threads: created and destroyed as needed (provide flexibility but add overhead)
Aspect | Static Threads | Dynamic Threads |
---|---|---|
Lifetime | Fixed for program duration | Created and destroyed as needed |
Overhead | Low | Higher due to creation/destruction |
Flexibility | Less flexible | More flexible |
Use case | Predictable workloads | Irregular or dynamic workloads |
- Pros and cons of threads: threads share memory and communicate quickly but need synchronization to prevent conflicts.
- Global vs shared variables: global variables are accessible by all threads, while shared variables in parallel regions require synchronization mechanisms.
- Thread communication: exchange of data between threads through shared memory or synchronization constructs.
Shared-Memory Programming with Threads
-
Nondeterminism: arises when thread execution order is unpredictable
-
Race conditions: occur when multiple threads access shared data concurrently, and at least one thread modifies it without proper synchronization
-
Mutexes (locks): ensure mutual exclusion by allowing only one thread to access a critical section at a time
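A minimal Pthreads sketch of a mutex protecting a shared counter (thread and iteration counts are arbitrary; compile with `-pthread`). Without the lock, concurrent `counter++` updates would race and the final value would usually come out too low.

```c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define ITERS 100000

long counter = 0;                        /* shared data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);       /* enter critical section */
        counter++;                       /* one thread at a time */
        pthread_mutex_unlock(&lock);     /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(threads[i], NULL);  /* join: wait for each thread */
    printf("counter = %ld (expected %d)\n", counter, N_THREADS * ITERS);
    return 0;
}
```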
-
Busy waiting: threads continuously check for a condition, wasting CPU cycles
-
Semaphores: counting mechanism to control access to resources
- Monitors: synchronization construct combining mutual exclusion and condition variables
- Join vs barrier: join waits for one thread; barrier waits for all threads to reach the same point
- Condition variables: allow threads to wait for specific conditions before continuing
- Read-write locks: let multiple readers or one writer access a resource at once
- Caches, cache coherence, and false sharing: cache updates by one thread must be visible to others; false sharing occurs when threads modify variables on the same cache line.
- Reentrant functions: can be safely called by multiple threads simultaneously
-
Transactions: group operations that execute atomically, rolling back if conflicts occur
-
Message passing interface (MPI): widely used standard
-
Two-sided communication: sender and receiver both participate in message exchange
-
One-sided communication: one process can directly access memory of another without explicit cooperation
-
Partitioned global address space (PGAS): all processors share one big memory space, but each has its own local part that is faster to access
-
Hybrid programming: combines shared memory and distributed memory models
-
POSIX threads: threads standard for Unix-like operating systems (can be linked with C programs) (also known as Pthreads)
-
I/O: parallel I/O performance can affect scalability and overall throughput
Shared-Memory Programming with OpenMP
- OpenMP: API for shared memory programming → designed for systems in which each thread or process can potentially have access to all available memory
- Team: collection of threads executing the parallel block → original thread and new threads
- Master: the original thread
- Slave/worker: all additional threads created
- Shared scope: a variable that can be accessed by all the threads in the team (default)
- Private scope: a variable that can only be accessed by a single thread
- Pragmas: special instructions for the compiler that tell it how to handle certain parts of your code
- Clause: text that modifies a directive → e.g. the `num_threads` clause allows the programmer to specify the number of threads that should execute the block
- Reduction clause: combines each thread’s partial result into a single value after the parallel work finishes
  - e.g. `reduction(+: sum)` gives each thread a private copy of `sum` and adds them all together into one final sum at the end (see the sketch after the mutual-exclusion table below)
- Parallel for directive: makes a loop run in parallel, splitting its iterations among multiple threads
- Default clause: allows the programmer to specify the default scope of the variables in a block
- Schedule clause: controls how loop iterations are assigned to threads in a parallel for region
- Schedule types: static, dynamic, guided, auto, runtime
- `#pragma omp parallel` syntax: used to define parallel regions in C/C++
- Data dependencies: handled through private variables to avoid race conditions
- Thread pooling: reusing a set of threads instead of making new ones every time → helps programs run faster and use less memory
- Producer: produces requests for data
- Consumer: “consumes” the request by finding or generating the requested data
- Message passing: each thread could have a shared message queue → when one thread wants to send a message, it enqueues the message in the destination thread’s queue
- Sending messages: adding messages to the queue
- Receiving messages: dequeueing messages from the queue
- Thread startup: the process of creating and beginning a new thread’s execution
- Atomic directive: can only protect critical sections that consist of a single C assignment statement (e.g. `x++;`, `++x;`, `x--;`, `--x;`)
- Critical sections: can add a name to a critical directive → two blocks protected by critical directives with different names can be executed simultaneously
Method | What It Does | When to Use |
---|---|---|
critical | Only one thread runs the enclosed code block | For larger sections of code |
atomic | Makes a single variable update thread-safe | For simple updates like x++ |
lock | Manually control when threads lock/unlock | For custom or complex synchronization |
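A minimal C sketch tying together the reduction clause, the schedule clause, and the three mutual-exclusion methods in the table above (loop bounds and variable names are illustrative; compile with `-fopenmp`).

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    double sum = 0.0;
    long hits = 0, manual = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);

    /* reduction: each thread accumulates a private copy of sum, and
       the private copies are added together when the loop finishes.
       schedule(static) splits the iterations into equal-sized blocks. */
#pragma omp parallel for schedule(static) reduction(+: sum)
    for (int i = 0; i < N; i++)
        sum += 1.0 / (i + 1);

#pragma omp parallel for schedule(dynamic, 1000)
    for (int i = 0; i < N; i++) {
        /* atomic: thread-safe update of a single variable. */
#pragma omp atomic
        hits++;

        /* critical (named): only one thread at a time runs this block. */
#pragma omp critical(update_manual)
        {
            manual++;
        }
    }

    /* lock: explicit lock/unlock for custom synchronization. */
#pragma omp parallel
    {
        omp_set_lock(&lock);
        printf("thread %d holds the lock\n", omp_get_thread_num());
        omp_unset_lock(&lock);
    }
    omp_destroy_lock(&lock);

    printf("sum=%.4f hits=%ld manual=%ld\n", sum, hits, manual);
    return 0;
}
```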
-
do not mix different types of mutual exclusion for a single critical section
-
Thread-safe code: code is thread safe if it can be simultaneously executed by multiple threads without causing issues
- some libraries might be strictly sequential
-
Embarrassingly parallel computations: tasks that require no communication between processors, easily parallelized.
Performance
- Performance: how efficiently the system uses multiple processors or threads to finish a task
- Definition: performance measures how much useful work is completed per unit time relative to resource usage.
- Scalability: how the program behaves in terms of complexity and system size (number of processors)
- Strongly scalable: efficiency stays the same when increasing the number of processors while the problem size stays fixed
- Weakly scalable: efficiency stays the same when the problem size grows at the same rate as the number of processors
- Speedup: $S = \frac{T_{serial}}{T_{parallel}}$
- Efficiency: $E = \frac{S}{p} = \frac{T_{serial}}{p \cdot T_{parallel}}$
- Cost: $C = p \cdot T_{parallel}$
- Amdahl’s Law: shows the speedup you can get from parallelizing a program
  - $S = \frac{1}{(1 - P) + \frac{P}{N}}$
    - $S$: overall speedup
    - $P$: fraction of the program that can run in parallel
    - $N$: number of processors
  - major limiting factors:
    - after a certain number of processors, adding more gives little to no improvement
    - if processors don’t get equal work, total performance drops
    - sequential portion ($1 - P$): even a small serial part limits total speedup
  - (a worked example using these formulas appears at the end of this section)
- Limits & costs: non-parallelizable sections and synchronization overheads reduce achievable speedup
- Overhead: the extra time or work caused by managing parallel tasks
  - $T_o = p \cdot T_{parallel} - T_{serial}$
    - $T_o$: total overhead time
    - $p$: number of processors
    - $T_{parallel}$: parallel execution time
    - $T_{serial}$: sequential execution time
- Gustafson-Barsis’ Law: as problem size grows, parallel systems can achieve better speedup than Amdahl’s Law predicts
  - $S = N - (1 - P)(N - 1)$
    - $S$: scaled speedup
    - $N$: number of processors
    - $P$: fraction of the program that runs in parallel
- Workflow DAG (directed acyclic graph): shows how tasks in a parallel program depend on each other
- task cannot execute until all the inputs to the task are available → come from outputs of earlier executing tasks
- Work-span model:
  - Work ($T_1$): the total time if you run all tasks one by one on a single processor
  - Span ($T_\infty$): the time of the longest chain of dependent tasks → the fastest you could ever finish, even with unlimited processors
  - Bounds (with $p$ processors):
    - Lower bound: $T_p \ge \max(T_1 / p,\ T_\infty)$ → no matter how good the scheduler is, the parallel time can’t be less than either the average load per processor or the critical path
    - Upper bound: $T_p \le T_1 / p + T_\infty$ → assumes near-perfect (greedy) scheduling and minimal overhead
- Asymptotic complexity analysis: shows how an algorithm’s runtime or work grows as the input gets larger
- O (Big O): upper bound → at most this fast
- $\Omega$ (Omega): lower bound → at least this fast
- $\Theta$ (Theta): tight bound → exactly this fast (up to constants)
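To make the formulas above concrete, here is a short worked example with assumed values: a parallel fraction $P = 0.9$ and $N = 8$ processors for Amdahl’s and Gustafson-Barsis’ Laws, and work $T_1 = 100$, span $T_\infty = 10$, $p = 4$ processors for the work-span bounds (all numbers are illustrative, not from the notes).

```latex
\begin{align*}
\text{Amdahl:}    \quad & S = \frac{1}{(1 - 0.9) + 0.9/8} = \frac{1}{0.2125} \approx 4.7 \\
\text{Gustafson:} \quad & S = 8 - (1 - 0.9)(8 - 1) = 8 - 0.7 = 7.3 \\
\text{Work-span:} \quad & \max(100/4,\ 10) = 25 \;\le\; T_4 \;\le\; 100/4 + 10 = 35
\end{align*}
```

Note how, for the same parallel fraction, the scaled (Gustafson) speedup exceeds the fixed-size (Amdahl) speedup, matching the statement above that parallel systems can beat Amdahl’s prediction when the problem size grows.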
Practice Questions
- Which of the following statements best explains the motivation for parallel computing in modern computer systems?
- Increasing the processor clock frequency indefinitely is the most practical way to enhance system performance.
- Memory access speeds have increased at the same rate as CPU speeds, eliminating performance bottlenecks.
- Power, heat dissipation, and physical design limitations restrict further increases in single-core clock frequency, making parallel execution across multiple cores a more scalable performance solution.
- Sequential programming models can automatically achieve high throughput without explicit parallelism.
Answer
- Power, heat dissipation, and physical design limitations restrict further increases in single-core clock frequency, making parallel execution across multiple cores a more scalable performance solution.
- The table below shows the execution time (in seconds) for a parallel algorithm executed on multiple processors.
Processors ($p$) | Execution Time ($T_p$) |
---|---|
1 | 80 |
2 | 45 |
4 | 25 |
8 | 18 |
16 | 15 |
i) Calculate the speedup and efficiency for each processor count.
Answer
Processors ($p$) | Execution Time ($T_p$) | Speedup ($S$) | Efficiency ($E$) |
---|---|---|---|
1 | 80 | 1.000 | 1.000 |
2 | 45 | 1.778 | 0.889 |
4 | 25 | 3.200 | 0.800 |
8 | 18 | 4.444 | 0.556 |
16 | 15 | 5.333 | 0.333 |
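For example, the $p = 4$ row follows directly from the speedup and efficiency definitions: $S(4) = \frac{T_{serial}}{T_{parallel}} = \frac{80}{25} = 3.200$ and $E(4) = \frac{S(4)}{4} = 0.800$.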
ii) Based on your results, determine if the program exhibits good strong scalability.
Answer
Not great. With a fixed problem size, efficiency drops from 0.889 to 0.333 as cores increase, and speedup is far from linear (going from 8 to 16 cores improves speedup only from 4.444 to 5.333). This indicates rising parallel overhead/serialization and weak strong scaling beyond ~4-8 cores.
iii) Assuming the problem size doubles while keeping the same number of processors, predict how the efficiency would change and justify your reasoning.
Answer
If the problem size doubles while keeping the same number of processors, efficiency will increase. Each processor gets more work, so the fixed overhead and idle time become less significant. The program uses resources more effectively, showing better performance scaling across all processor counts.