Network topologies and interconnects
Bisection width and bandwidth
- Bisection width refers to the minimum number of connections that must be cut to divide a network into two equal halves.
- a higher bisection width generally indicates better potential for communication bandwidth across the network.
- Bandwidth is the rate at which data can be transferred across a link or the network; the bisection bandwidth is the combined bandwidth of the links crossing the bisection cut.
- network performance depends on both bandwidth and latency.
- effective communication requires balancing these factors.
Topology | Bisection Width |
---|---|
Ring | 2 |
Mesh (2D) | √N (for N nodes) |
Torus | 2√N |
Hypercube (nD) | N/2 |
Fully Connected | N²/4 |
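As a quick check of these formulas, the sketch below evaluates each table entry in plain C for N = 64; the node count is an arbitrary choice for illustration.

```c
#include <math.h>
#include <stdio.h>

/* Evaluate the bisection-width formulas from the table for one node count. */
int main(void) {
    int N = 64;  /* example node count (a power of two, so the formulas apply cleanly) */
    printf("Ring:            %d\n", 2);
    printf("2D mesh:         %.0f\n", sqrt((double)N));      /* sqrt(N)   */
    printf("2D torus:        %.0f\n", 2 * sqrt((double)N));  /* 2*sqrt(N) */
    printf("Hypercube:       %d\n", N / 2);                  /* N/2       */
    printf("Fully connected: %d\n", (N / 2) * (N / 2));      /* N^2/4     */
    return 0;
}
```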
Hypercube networks
- a hypercube is a network topology where each node is connected to others in a multi-dimensional cube structure.
- an n-dimensional hypercube has N = 2^n nodes, and each node has n connections (one per dimension).
- hypercubes provide low diameter and high connectivity, enabling efficient parallel communication.
- advantages:
- logarithmic diameter relative to the number of nodes.
- good scalability and fault tolerance.
- used in some parallel computer architectures due to efficient routing properties.
```mermaid
graph LR
    A0((00)) --- A1((01))
    A0 --- A2((10))
    A1 --- A3((11))
    A2 --- A3
```
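The connectivity is easy to compute: node labels are n-bit numbers, and two nodes are neighbors exactly when their labels differ in a single bit. A minimal C sketch, not tied to any particular machine:

```c
#include <stdio.h>

/* Print the n neighbors of node `id` in an n-dimensional hypercube:
 * flipping one bit of the label gives one neighbor per dimension. */
void print_hypercube_neighbors(unsigned id, int n) {
    for (int d = 0; d < n; d++)
        printf("node %u <-> node %u (dimension %d)\n", id, id ^ (1u << d), d);
}

int main(void) {
    print_hypercube_neighbors(0u, 2);  /* reproduces the 4-node (2D) diagram above */
    return 0;
}
```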
Indirect interconnects
- indirect interconnects use switches or routers between processors rather than direct point-to-point links.
- examples include crossbars, multistage networks, and meshes.
- these interconnects allow scalable communication by routing messages through intermediate nodes.
- trade-offs exist between complexity, cost, and communication latency.
Latency and bandwidth
- Latency is the delay between initiating a communication and the moment data transfer actually begins.
- includes startup time, propagation delay, and queuing delays.
- Bandwidth is the volume of data that can be transmitted per unit time.
- both latency and bandwidth affect overall communication performance.
- high latency can be mitigated by overlapping communication and computation.
- bandwidth limits the sustained data transfer rate.
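A common way to reason about this is the linear cost model, transfer time ≈ latency + message size / bandwidth. The sketch below uses illustrative numbers (1 µs latency, 10 GB/s bandwidth), not measurements of any real network, to show that small messages are latency-bound and large ones bandwidth-bound:

```c
#include <stdio.h>

/* Simple linear cost model: time to send n bytes = latency + n / bandwidth.
 * The parameters are illustrative assumptions, not measured values. */
int main(void) {
    double latency_s     = 1e-6;   /* 1 microsecond startup cost per message */
    double bandwidth_Bps = 1e10;   /* 10 GB/s sustained transfer rate */
    for (double n = 1e3; n <= 1e9; n *= 1e3) {
        double t = latency_s + n / bandwidth_Bps;
        printf("%10.0f bytes -> %.6f s\n", n, t);
    }
    return 0;
}
```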
Memory architectures and cache coherence
Shared vs. distributed memory
- Shared memory systems have a single physical memory accessible by all processors.
- easier programming model but can suffer from contention and scalability issues.
- Distributed memory systems have separate local memories for each processor.
- communication occurs via message passing.
- scales well but requires explicit data movement.
Cache coherence
- ensures that all processors see a consistent view of memory when caching is used.
- two main protocols:
- Snooping-based coherence:
- caches monitor a shared bus for read/write operations to maintain consistency.
- efficient for small-scale systems with a common bus.
- Directory-based coherence:
- a directory keeps track of which caches have copies of each memory block.
- scales better for large systems by reducing broadcast traffic.
False sharing
- occurs when processors cache different variables that reside on the same cache line.
- even if processors access different variables, the entire cache line is invalidated or updated.
- leads to unnecessary coherence traffic and performance degradation.
- avoided by careful data alignment or padding.
Warning
False sharing causes severe performance degradation due to frequent invalidations of cache lines, even when threads access distinct variables. Proper data alignment and padding are essential to avoid this pitfall.
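A minimal OpenMP sketch of the padding remedy; the 64-byte cache-line size, thread count, and loop length are assumptions chosen for illustration:

```c
#include <omp.h>
#include <stdio.h>

#define CACHE_LINE 64              /* assumed cache-line size in bytes */
#define NTHREADS   4

/* Each counter is padded to a full cache line so that threads writing
 * their own counter do not keep invalidating each other's lines. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

int main(void) {
    struct padded_counter counts[NTHREADS] = {{0}};

    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 10000000; i++)
            counts[t].value++;     /* each thread updates its own padded slot */
    }

    long total = 0;
    for (int t = 0; t < NTHREADS; t++) total += counts[t].value;
    printf("total = %ld\n", total);
    return 0;
}
```

Without the `pad` member, all four counters would sit on one or two cache lines and every increment would trigger coherence traffic, even though no data is logically shared.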
Parallel programming models
SPMD vs. SIMD
- SPMD (Single program multiple data):
- multiple processors run the same program but operate on different data.
- common in distributed memory and multi-core systems.
- SIMD (Single instruction multiple data):
- a single instruction controls multiple processing elements performing the same operation on multiple data points.
- common in vector processors and GPUs.
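A minimal MPI sketch of the SPMD style: every process executes the same program, and behavior differs only through the rank each process is assigned at startup:

```c
#include <mpi.h>
#include <stdio.h>

/* SPMD: all processes run this same program; the rank distinguishes them. */
int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("rank 0 of %d: I could coordinate the others\n", size);
    else
        printf("rank %d of %d: I work on my slice of the data\n", rank, size);

    MPI_Finalize();
    return 0;
}
```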
Thread and process coordination
- threads/processes must coordinate access to shared resources to avoid conflicts.
- coordination mechanisms include locks, barriers, and signals.
- proper synchronization is essential to avoid race conditions and ensure correctness.
Shared memory programming
Dynamic vs static threads
- Static threads:
- created once and persist for the lifetime of the program.
- useful for predictable workloads.
- Dynamic threads:
- created and destroyed as needed.
- provide flexibility but add overhead.
Aspect | Static Threads | Dynamic Threads |
---|---|---|
Lifetime | Fixed for program duration | Created and destroyed as needed |
Overhead | Low | Higher due to creation/destruction |
Flexibility | Less flexible | More flexible |
Use case | Predictable workloads | Irregular or dynamic workloads |
Nondeterminism and race conditions
- Nondeterminism arises when thread execution order is unpredictable.
- Race conditions occur when multiple threads access shared data concurrently, and at least one thread modifies it without proper synchronization.
- can lead to inconsistent or incorrect results.
Note
Nondeterminism means that the exact order and timing of thread execution cannot be predicted, which can cause different program outcomes on different runs even with the same input.
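A minimal pthreads sketch of such a race: two threads increment an unprotected shared counter, so the printed total is typically less than expected and varies between runs:

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;                 /* shared, unprotected */

void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                /* read-modify-write: not atomic */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but the unsynchronized updates race, so the
     * result usually differs from run to run. */
    printf("counter = %ld\n", counter);
    return 0;
}
```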
Synchronization primitives
- Mutexes (locks):
- ensure mutual exclusion by allowing only one thread to access a critical section at a time.
- Busy waiting:
- threads continuously check for a condition, wasting CPU cycles.
- Semaphores:
- counting mechanisms to control access to resources.
- Transactions:
- group operations that execute atomically, rolling back if conflicts occur.
- these primitives help enforce thread safety and prevent race conditions.
```mermaid
flowchart TD
    Start[Attempt to acquire mutex] --> CheckLock{Is mutex locked?}
    CheckLock -- No --> AcquireLock[Acquire mutex]
    AcquireLock --> CriticalSection[Enter critical section]
    CriticalSection --> ReleaseLock[Release mutex]
    ReleaseLock --> End[Exit critical section]
    CheckLock -- Yes --> Wait[Wait or retry]
    Wait --> CheckLock
```
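A minimal pthreads sketch of the mutex path shown in the flowchart; it also repairs the race from the earlier counter example:

```c
#include <pthread.h>
#include <stdio.h>

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* block here if another thread holds it */
        counter++;                    /* critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, work, NULL);
    pthread_create(&t2, NULL, work, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* now always 2000000 */
    return 0;
}
```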
Tip
Transactional memory allows a group of operations to execute atomically by tracking changes and rolling back if conflicts occur, simplifying synchronization and avoiding deadlocks.
Distributed memory and message passing
- processes communicate by explicitly sending and receiving messages.
- Message Passing Interface (MPI) is a widely used standard.
- communication models:
- Two-sided communication: sender and receiver both participate in message exchange.
- One-sided communication: one process can directly read or write the memory of another without the explicit participation of the target process.
- programming distributed systems requires managing data distribution and communication efficiently.
```mermaid
sequenceDiagram
    participant P1 as Process 1
    participant P2 as Process 2
    P1->>P2: Send Message
    P2-->>P1: Acknowledge
```
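A minimal MPI sketch of the two-sided exchange in the diagram, assuming it is launched with at least two processes:

```c
#include <mpi.h>
#include <stdio.h>

/* Two-sided message passing: the sender calls MPI_Send and the receiver
 * calls a matching MPI_Recv. */
int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```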
Two-sided vs one-sided communication
Aspect | Two-sided Communication | One-sided Communication |
---|---|---|
Participation | Both sender and receiver actively involved | Only one process initiates communication |
Synchronization | Requires synchronization between processes | No explicit synchronization required |
Complexity | Simpler to understand and implement | More complex but potentially more efficient |
Use cases | General message passing | Remote memory access, shared data structures |
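For contrast, a minimal sketch of one-sided communication using MPI's RMA interface: rank 0 writes directly into a memory window exposed by rank 1, which only opens and closes the access epoch (again assuming at least two processes):

```c
#include <mpi.h>
#include <stdio.h>

/* One-sided communication with MPI RMA: rank 0 puts a value into rank 1's
 * window; rank 1 does not issue a matching receive. */
int main(int argc, char *argv[]) {
    int rank;
    int buf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Every process exposes one int of its memory as a window. */
    MPI_Win_create(&buf, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the access epoch */
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);  /* write into rank 1 */
    }
    MPI_Win_fence(0, win);                 /* complete the epoch */

    if (rank == 1)
        printf("rank 1's window now holds %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```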
Partitioned global address space (PGAS)
- a programming model that provides a global memory address space partitioned among processes.
- allows one-sided communication with a global view of memory.
- simplifies programming compared to pure message passing.
Hybrid programming
- combines shared memory and distributed memory models.
- for example, MPI between nodes and threads (e.g., OpenMP) within a node.
- enables efficient use of hierarchical hardware architectures.
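A minimal hybrid sketch: MPI supplies the ranks (typically one per node or per socket), and OpenMP spawns threads inside each rank. It is usually compiled with an MPI wrapper plus the OpenMP flag, e.g. mpicc -fopenmp:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid model: MPI ranks across nodes, OpenMP threads within each rank. */
int main(int argc, char *argv[]) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("MPI rank %d, OpenMP thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```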
Parallel software caveats
Challenges in parallel programming
- debugging is more difficult due to nondeterminism.
- performance tuning requires understanding of hardware and communication costs.
- load balancing is critical to avoid idle processors.
- careful design needed to minimize synchronization overhead and contention.
I/O issues in parallel systems
- parallel applications often require efficient input/output operations.
- I/O bottlenecks can limit scalability.
- strategies include:
- collective I/O operations.
- asynchronous I/O to overlap computation and data transfer.
- using parallel file systems optimized for concurrent access.
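As a concrete example of collective I/O, the MPI-IO sketch below has each rank write its own block to a shared file in a single collective call; the file name and block size are arbitrary choices for illustration:

```c
#include <mpi.h>

/* Collective I/O sketch: every rank writes its block of data to a shared
 * file at a rank-dependent offset, letting the MPI-IO layer merge the
 * requests into large contiguous accesses. */
int main(int argc, char *argv[]) {
    int rank;
    int block[1024];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < 1024; i++) block[i] = rank;   /* fill local block */

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * sizeof(block);
    MPI_File_write_at_all(fh, offset, block, 1024, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```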