Info
Condensed notes separated by section (Introduction, Parallel Hardware and Software, Shared-memory with threads, Shared-memory with OpenMP, Performance)
Introduction
-
Parallel computing: when multiple processors work together at the same time to solve a problem faster → splits a task into smaller parts that run simultaneously instead of one after another
-
Considerations for a possible parallel problem:
- be divisible into discrete pieces that can run simultaneously
- allow multiple program instructions at the same time
- actually solve faster on multiple resources
-
Concurrency: multiple tasks in progress over overlapping time periods vs strictly one at a time
-
Task parallelism: divide into different tasks assigned to cores
- e.g. each grader handles 1 question across all exams
-
Data parallelism: split dataset & run the same operations on subsets
- e.g. each grader handles a set of full exams (all questions)
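To make the grading analogy concrete, here is a minimal C/OpenMP sketch (not from the notes; `grade_q1`, `grade_q2`, and the sizes are illustrative placeholders) contrasting the two decompositions. Compile with `-fopenmp`.

```c
#include <omp.h>
#include <stdio.h>

#define N_EXAMS 8

/* Hypothetical scoring functions, one per exam question. */
static double grade_q1(int exam) { return exam % 5; }
static double grade_q2(int exam) { return exam % 3; }

int main(void) {
    double q1[N_EXAMS], q2[N_EXAMS];

    /* Data parallelism: the exams are split among threads and every
       thread performs the same operations on its own subset. */
#pragma omp parallel for
    for (int i = 0; i < N_EXAMS; i++) {
        q1[i] = grade_q1(i);
        q2[i] = grade_q2(i);
    }

    /* Task parallelism: different threads perform different tasks;
       one "grader" handles question 1 for all exams, another
       handles question 2. */
#pragma omp parallel sections
    {
#pragma omp section
        for (int i = 0; i < N_EXAMS; i++) q1[i] = grade_q1(i);
#pragma omp section
        for (int i = 0; i < N_EXAMS; i++) q2[i] = grade_q2(i);
    }

    printf("exam 0: q1=%.0f q2=%.0f\n", q1[0], q2[0]);
    return 0;
}
```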
-
Concurrent: multiple tasks might be in progress
-
Parallel: tasks cooperate closely to solve one problem
-
Distributed: tasks loosely coupled, often separate problems → parallel/distributed distinction is often blurry
-
Von Neumann architecture: computer design where the CPU, memory, and input/output all share a single memory and data bus
- stores both instructions and data in the same memory, and the CPU fetches and executes them one at a time
- Extensions: vector processors, pipelining, or multiple execution units built on top of the base design
-
Process: instance of a program
-
Thread: a lighter-weight execution context within a process
- Hardware multithreading: allows multiple instruction streams per core (SMT is one example)
-
Multitasking: the OS rapidly switches between processes via time slicing, giving the appearance that they run simultaneously
-
Cache mapping: where to store cache lines
- Fully associative: a line can go anywhere in the cache
- Direct mapped: each line maps to exactly one location
- Set associative: each line maps to a small set of possible locations
-
Virtual memory: lets a computer use part of its storage as extra RAM
- main memory acts like a cache for secondary storage
-
Instruction-level parallelism (ILP): performance improvement via multiple instructions running at once
- pipelining: divide execution into stages
- multiple issue: initiate several instructions at once
-
Thread-level parallelism (TLP): runs multiple threads or programs in parallel on separate cores to improve performance
-
Simultaneous multithreading (SMT): lets a single core execute multiple threads at once by sharing its resources → increasing hardware efficiency
-
Supercomputing/HPC: extremely powerful computers
-
Node: standalone computer (w/ CPUs, memory, …)
-
CPU/socket/processor/core: one single execution unit
-
Task: logical unit of computational work, executed by a processor
-
Shared memory: all processors have direct access; tasks see the same logical memory space
-
Symmetric multi-processor (SMP): multiple processors, single address space
-
Distributed memory: each node has local memory, tasks only see local memory, communication needed to access remote memory
-
Communication: tasks must exchange data (via shared memory bus or network)
- Coarse-grained communication: lots of computational work is done between communication events
- Fine-grained communication: little computational work is done between communication events
-
Synchronization: coordination of tasks (usually waiting at synchronization points)
-
Observed speedup: how much faster the parallelized code runs compared to the serial version (serial execution time ÷ parallel execution time)
-
Parallel overhead: the amount of time required to coordinate parallel tasks (as opposed to doing useful work)
- e.g. task start-up time, synchronizations, data communications, task termination time, software overhead imposed by parallel languages, libraries, OS, etc., …
-
Scalability: the ability to increase performance with the addition of more resources
- Massively parallel: close to linear growth in observed speedup
- Embarrassingly parallel: little to no need for coordination between tasks
- e.g. image processing (split into multiple parts, each part handled independently)
Parallel Hardware and Software
-
Data parallelism vs task parallelism: data parallel splits datasets into chunks processed in parallel; task parallel assigns distinct operations or tasks to different threads or processes.
-
Shared memory: uses threads that can be static or dynamic.
-
Pros and cons of threads:
- Pros: lightweight, fast communication, shared address space.
- Cons: synchronization complexity, race conditions, limited scalability.
-
Global vs shared variables: global variables are visible to all threads; shared variables must be protected using synchronization.
-
Coordinating threads: prevents nondeterminism and race conditions using locks, mutexes, semaphores, and monitors.
-
Transactions: group operations atomically to ensure consistency.
-
Distributed memory: each process has its own memory space, communicating via message passing (MPI).
- MPI communication types: blocking, non-blocking, broadcast, reduction.
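A minimal C sketch of blocking point-to-point MPI communication plus one collective (broadcast); the payload value, tag, and ranks are arbitrary illustrations. Run with at least two processes, e.g. `mpiexec -n 2 ./a.out`.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, size, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                       /* needs at least two ranks */
        if (rank == 0) printf("run with at least 2 processes\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        value = 42;                       /* arbitrary payload */
        /* Blocking send: returns once the send buffer can be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: waits until the message has arrived. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    /* Collective: broadcast rank 0's value to every process. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```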
-
One-sided communication: one process can directly read/write to another’s memory without cooperation.
-
Partitioned Global Address Space (PGAS): shared logical address space, but each processor has faster access to its local portion.
-
Parallel I/O: efficient data access across processors is critical for performance and scalability.
-
Flynn’s Taxonomy: classifies multi-processor computer architectures along two independent dimensions: the instruction stream and the data stream
Type | Full Form | Description | Example |
---|---|---|---|
SISD | Single Instruction, Single Data | One processor executes one instruction on one data item at a time | Traditional single-core CPU |
SIMD | Single Instruction, Multiple Data | One instruction operates on multiple data items in parallel | GPU or vector processor |
MISD | Multiple Instruction, Single Data | Multiple instructions operate on the same data stream | Rare, used in fault-tolerant systems |
MIMD | Multiple Instruction, Multiple Data | Multiple processors execute different instructions on different data | Multicore processors or computer clusters |
-
MIMD
- Shared-memory systems: use a common global memory that all processors can access directly (fast and easy → can cause conflicts if multiple processors try to access the same data at once)
- Non-uniform memory access (NUMA): each processor has its own local memory, but can still access other processors’ memory through an interconnect (remote accesses are slower than local ones)
- Uniform memory access (UMA): all processors share the same physical memory and access it at the same speed (simple, consistent → memory bandwidth can become a bottleneck)
- Distributed-memory systems: give each processor its own private memory (avoids memory conflicts but makes communication slower and more complex)
- Advantages & disadvantages:
- Shared memory: easy to program but limited scalability and prone to memory contention
- Distributed memory: scalable and fault-tolerant but requires explicit message passing
-
Processor-to-processor mapping:
- Anonymous: any process on any processor, dynamic scheduling, global ready list
- Dedicated: static assignment at load time, each node has its own ready list, occasional re-allocation for fault-tolerance/load-balancing
- Access latency & memory conflicts: can slow performance when multiple processors access shared memory simultaneously
- Solutions & drawbacks: caches and memory interleaving reduce latency but introduce complexity and coherence challenges
-
Bisection width: the minimum number of connections that must be cut to divide a network into two equal halves
-
Bandwidth: the rate at which data can be transferred across the network
-
Latency: the delay before data transfer begins following an instruction
-
Cache coherence: keeping copies of the same data in different caches consistent → if one processor changes a value in its cache, all other caches must see that update so everyone works with the same data
- Snooping-based coherence: caches monitor a shared bus for read/write operations to maintain consistency (good for small-scale systems with a common bus)
- Directory-based coherence: a directory keeps track of which caches have copies of each memory block
-
Hypercube: a network topology where each node is connected to others in a multi-dimensional cube structure
-
Indirect interconnects: use switches or routers between processors rather than direct point-to-point links
- allows scalable communication by routing messages through intermediate nodes
- Direct interconnect features: include mode of operation, control strategy, and switching techniques that define how data flows between nodes
-
False sharing: occurs when processors cache different variables that are on the same cache line
- causes severe performance degradation → avoid by using proper data alignment and padding
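A minimal C/OpenMP sketch of padding as a fix, assuming a 64-byte cache line (thread and iteration counts are arbitrary; compile with `-fopenmp`). Each thread updates only its own counter, so there is no race — only false sharing in the unpadded version.

```c
#include <omp.h>
#include <stdio.h>

#define N_THREADS 4
#define ITERS 10000000L

/* Unpadded: the counters share a cache line, so every update by one
   thread invalidates that line in the other threads' caches. */
volatile long counters[N_THREADS];

/* Padded: each counter occupies its own (assumed 64-byte) cache line. */
struct padded { volatile long value; char pad[64 - sizeof(long)]; };
struct padded padded_counters[N_THREADS];

int main(void) {
    double t0 = omp_get_wtime();
#pragma omp parallel num_threads(N_THREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            counters[id]++;              /* false sharing */
    }
    double t1 = omp_get_wtime();
#pragma omp parallel num_threads(N_THREADS)
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITERS; i++)
            padded_counters[id].value++; /* no false sharing */
    }
    double t2 = omp_get_wtime();
    printf("unpadded: %.3fs  padded: %.3fs\n", t1 - t0, t2 - t1);
    return 0;
}
```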
-
Message transmission time: the amount of time it takes to push all the bits of a message onto the transmission medium
- Static threads: created once and persist for the lifetime of the program (useful for predictable workloads)
- Dynamic threads: created and destroyed as needed (provide flexibility but add overhead)
Aspect | Static Threads | Dynamic Threads |
---|---|---|
Lifetime | Fixed for program duration | Created and destroyed as needed |
Overhead | Low | Higher due to creation/destruction |
Flexibility | Less flexible | More flexible |
Use case | Predictable workloads | Irregular or dynamic workloads |
- Pros and cons of threads: threads share memory and communicate quickly but need synchronization to prevent conflicts.
- Global vs shared variables: global variables are accessible by all threads, while shared variables in parallel regions require synchronization mechanisms.
- Thread communication: exchange of data between threads through shared memory or synchronization constructs.
Shared-Memory Programming with Threads
-
Nondeterminism: arises when thread execution order is unpredictable
-
Race conditions: occur when multiple threads access shared data concurrently, and at least one thread modifies it without proper synchronization
-
Mutexes (locks): ensure mutual exclusion by allowing only one thread to access a critical section at a time
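A minimal Pthreads sketch of a mutex protecting a shared counter (thread and iteration counts are arbitrary; compile with `-pthread`). Without the lock, concurrent `counter++` updates would race and the final value would usually come out too low.

```c
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define ITERS 100000

long counter = 0;                        /* shared data */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);       /* enter critical section */
        counter++;                       /* one thread at a time */
        pthread_mutex_unlock(&lock);     /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[N_THREADS];
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(threads[i], NULL);  /* join: wait for each thread */
    printf("counter = %ld (expected %d)\n", counter, N_THREADS * ITERS);
    return 0;
}
```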
-
Busy waiting: threads continuously check for a condition, wasting CPU cycles
-
Semaphores: counting mechanism to control access to resources
- Monitors: synchronization construct combining mutual exclusion and condition variables
- Join vs barrier: join waits for one thread; barrier waits for all threads to reach the same point
- Condition variables: allow threads to wait for specific conditions before continuing
- Read-write locks: let multiple readers or one writer access a resource at once
- Caches, cache coherence, and false sharing: cache updates by one thread must be visible to others; false sharing occurs when threads modify variables on the same cache line.
- Reentrant functions: can be safely called by multiple threads simultaneously
-
Transactions: group operations that execute atomically, rolling back if conflicts occur
-
Message passing interface (MPI): widely used standard
-
Two-sided communication: sender and receiver both participate in message exchange
-
One-sided communication: one process can directly access memory of another without explicit cooperation
-
Partitioned global address space (PGAS): all processors share one big memory space, but each has its own local part that is faster to access
-
Hybrid programming: combines shared memory and distributed memory models
-
POSIX threads: threads standard for Unix-like operating systems (can be linked with C programs) (also known as Pthreads)
-
I/O: parallel I/O performance can affect scalability and overall throughput
Shared-Memory Programming with OpenMP
- OpenMP: API for shared memory programming → designed for systems in which each thread or process can potentially have access to all available memory
- Team: collection of threads executing the parallel block → original thread and new threads
- Master: the original thread
- Slave/worker: all additional threads created
- Shared scope: a variable that can be accessed by all the threads in the team (default)
- Private scope: a variable that can only be accessed by a single thread
- Pragmas: special instructions for the compiler that tell it how to handle certain parts of your code
- Clause: text that modifies a directive → e.g. the `num_threads` clause allows the programmer to specify the number of threads that should execute the block
- Reduction clause: combines each thread’s partial result into a single value after the parallel work finishes
  - e.g. `reduction(+: sum)` gives each thread a private copy of `sum` and adds them all together into one final sum at the end (see the sketch after the mutual-exclusion table below)
- Parallel for directive: makes a loop run in parallel, splitting its iterations among multiple threads
- Default clause: allows the programmer to specify the default scope of the variables in a block
- Schedule clause: controls how loop iterations are assigned to threads in a parallel for region
- Schedule types: static, dynamic, guided, auto, runtime
- `#pragma omp parallel` syntax: used to define parallel regions in C/C++
- Data dependencies: handled through private variables to avoid race conditions
- Thread pooling: reusing a set of threads instead of making new ones every time → helps programs run faster and use less memory
- Producer: produces requests for data
- Consumer: “consumes” the request by finding or generating the requested data
- Message passing: each thread could have a shared message queue → when one thread wants to send a message, it enqueues the message in the destination thread’s queue
- Sending messages: adding messages to the queue
- Receiving messages: dequeueing messages from the queue
- Thread startup: the process of creating and beginning a new thread’s execution
- Atomic directive: can only protect critical sections that consist of a single C assignment statement (e.g. `x++;`, `++x;`, `x--;`, `--x;`)
- Critical sections: can add a name to a critical directive → two blocks protected by critical directives with different names can be executed simultaneously
Method | What It Does | When to Use |
---|---|---|
critical | Only one thread runs the enclosed code block | For larger sections of code |
atomic | Makes a single variable update thread-safe | For simple updates like x++ |
lock | Manually control when threads lock/unlock | For custom or complex synchronization |
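A minimal C sketch tying together the reduction clause, the schedule clause, and the three mutual-exclusion methods in the table above (loop bounds and variable names are illustrative; compile with `-fopenmp`).

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    double sum = 0.0;
    long hits = 0, manual = 0;
    omp_lock_t lock;
    omp_init_lock(&lock);

    /* reduction: each thread accumulates a private copy of sum, and
       the private copies are added together when the loop finishes.
       schedule(static) splits the iterations into equal-sized blocks. */
#pragma omp parallel for schedule(static) reduction(+: sum)
    for (int i = 0; i < N; i++)
        sum += 1.0 / (i + 1);

#pragma omp parallel for schedule(dynamic, 1000)
    for (int i = 0; i < N; i++) {
        /* atomic: thread-safe update of a single variable. */
#pragma omp atomic
        hits++;

        /* critical (named): only one thread at a time runs this block. */
#pragma omp critical(update_manual)
        {
            manual++;
        }
    }

    /* lock: explicit lock/unlock for custom synchronization. */
#pragma omp parallel
    {
        omp_set_lock(&lock);
        printf("thread %d holds the lock\n", omp_get_thread_num());
        omp_unset_lock(&lock);
    }
    omp_destroy_lock(&lock);

    printf("sum=%.4f hits=%ld manual=%ld\n", sum, hits, manual);
    return 0;
}
```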
-
do not mix different types of mutual exclusion for a single critical section
-
Thread-safe code: code is thread safe if it can be simultaneously executed by multiple threads without causing issues
- some libraries might be strictly sequential
-
Embarrassingly parallel computations: tasks that require no communication between processors, easily parallelized.
Performance
- Performance: how efficiently the system uses multiple processors or threads to finish a task
- Definition: performance measures how much useful work is completed per unit time relative to resource usage.
- Scalability: how the program behaves in terms of complexity and system size (number of processors)
- Strongly scalable: efficiency stays the same when increasing the number of processors while the problem size stays fixed
- Weakly scalable: efficiency stays the same when the problem size grows at the same rate as the number of processors
- Speedup: $S = \frac{T_{serial}}{T_{parallel}}$
- Efficiency: $E = \frac{S}{p} = \frac{T_{serial}}{p \cdot T_{parallel}}$
- Cost: $C = p \cdot T_{parallel}$
- Amdahl’s Law: shows the speedup you can get from parallelizing a program
  - $S = \frac{1}{(1 - P) + \frac{P}{N}}$
    - $S$: overall speedup
    - $P$: fraction of the program that can run in parallel
    - $N$: number of processors
  - major limiting factors:
    - after a certain number of processors, adding more gives little to no improvement
    - if processors don’t get equal work, total performance drops
    - sequential portion ($1 - P$): even a small serial part limits total speedup
  - (a worked example using these formulas appears at the end of this section)
- Limits & costs: non-parallelizable sections and synchronization overheads reduce achievable speedup
- Overhead: the extra time or work caused by managing parallel tasks
  - $T_o = p \cdot T_{parallel} - T_{serial}$
    - $T_o$: total overhead time
    - $p$: number of processors
    - $T_{parallel}$: parallel execution time
    - $T_{serial}$: sequential execution time
- Gustafson-Barsis’ Law: as problem size grows, parallel systems can achieve better speedup than Amdahl’s Law predicts
  - $S = N - (1 - P)(N - 1)$
    - $S$: scaled speedup
    - $N$: number of processors
    - $P$: fraction of the program that runs in parallel
- Workflow DAG (directed acyclic graph): shows how tasks in a parallel program depend on each other
- task cannot execute until all the inputs to the task are available → come from outputs of earlier executing tasks
- Work-span model:
  - Work ($T_1$): the total time if you run all tasks one by one on a single processor
  - Span ($T_\infty$): the time of the longest chain of dependent tasks → the fastest you could ever finish, even with unlimited processors
  - Bounds (with $p$ processors):
    - Lower bound: $T_p \ge \max(T_1 / p,\ T_\infty)$ → no matter how good the scheduler is, the parallel time can’t be less than either the average load per processor or the critical path
    - Upper bound: $T_p \le T_1 / p + T_\infty$ → assumes near-perfect (greedy) scheduling and minimal overhead
- Asymptotic complexity analysis: shows how an algorithm’s runtime or work grows as the input gets larger
- O (Big O): upper bound → at most this fast
- $\Omega$ (Omega): lower bound → at least this fast
- $\Theta$ (Theta): tight bound → exactly this fast (up to constants)
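To make the formulas above concrete, here is a short worked example with assumed values: a parallel fraction $P = 0.9$ and $N = 8$ processors for Amdahl’s and Gustafson-Barsis’ Laws, and work $T_1 = 100$, span $T_\infty = 10$, $p = 4$ processors for the work-span bounds (all numbers are illustrative, not from the notes).

```latex
\begin{align*}
\text{Amdahl:}    \quad & S = \frac{1}{(1 - 0.9) + 0.9/8} = \frac{1}{0.2125} \approx 4.7 \\
\text{Gustafson:} \quad & S = 8 - (1 - 0.9)(8 - 1) = 8 - 0.7 = 7.3 \\
\text{Work-span:} \quad & \max(100/4,\ 10) = 25 \;\le\; T_4 \;\le\; 100/4 + 10 = 35
\end{align*}
```

Note how, for the same parallel fraction, the scaled (Gustafson) speedup exceeds the fixed-size (Amdahl) speedup, matching the statement above that parallel systems can beat Amdahl’s prediction when the problem size grows.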
Practice Questions
- Which of the following statements best explains the motivation for parallel computing in modern computer systems?
- Increasing the processor clock frequency indefinitely is the most practical way to enhance system performance.
- Memory access speeds have increased at the same rate as CPU speeds, eliminating performance bottlenecks.
- Power, heat dissipation, and physical design limitations restrict further increases in single-core clock frequency, making parallel execution across multiple cores a more scalable performance solution.
- Sequential programming models can automatically achieve high throughput without explicit parallelism.
Answer
- Power, heat dissipation, and physical design limitations restrict further increases in single-core clock frequency, making parallel execution across multiple cores a more scalable performance solution.
- The table below shows the execution time (in seconds) for a parallel algorithm executed on multiple processors.
Processors ($p$) | Execution Time ($T_p$) |
---|---|
1 | 80 |
2 | 45 |
4 | 25 |
8 | 18 |
16 | 15 |
i) Calculate the speedup and efficiency for each processor count.
Answer
Processors ($p$) | Execution Time ($T_p$) | Speedup ($S$) | Efficiency ($E$) |
---|---|---|---|
1 | 80 | 1.000 | 1.000 |
2 | 45 | 1.778 | 0.889 |
4 | 25 | 3.200 | 0.800 |
8 | 18 | 4.444 | 0.556 |
16 | 15 | 5.333 | 0.333 |
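For example, the $p = 4$ row follows directly from the speedup and efficiency definitions: $S(4) = \frac{T_{serial}}{T_{parallel}} = \frac{80}{25} = 3.200$ and $E(4) = \frac{S(4)}{4} = 0.800$.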
ii) Based on your results, determine if the program exhibits good strong scalability.
Answer
Not great. With a fixed problem size, efficiency drops from 0.889 to 0.333 as cores increase, and speedup is far from linear (going from 8 to 16 cores improves speedup only from 4.444 to 5.333). This indicates rising parallel overhead/serialization and weak strong scaling beyond ~4-8 cores.
iii) Assuming the problem size doubles while keeping the same number of processors, predict how the efficiency would change and justify your reasoning.
Answer
If the problem size doubles while keeping the same number of processors, efficiency will increase. Each processor gets more work, so the fixed overhead and idle time become less significant. The program uses resources more effectively, showing better performance scaling across all processor counts.