Network topologies and interconnects

Bisection width and bandwidth

  • Bisection width refers to the minimum number of connections that must be cut to divide a network into two equal halves.
    • a higher bisection width generally indicates better potential for communication bandwidth across the network.
  • Bandwidth is the rate at which data can be transferred across the network.
    • network performance depends on both bandwidth and latency.
    • effective communication requires balancing these factors.
| Topology | Bisection width (N nodes) |
| --- | --- |
| Ring | 2 |
| Mesh (2D) | √N |
| Torus (2D) | 2√N |
| Hypercube (n-dimensional, N = 2ⁿ) | N/2 |
| Fully connected | N²/4 |
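As a quick sanity check on the table, a minimal C sketch that evaluates these formulas for one illustrative node count (N = 64 here is arbitrary):

```c
#include <math.h>
#include <stdio.h>

/* Bisection widths from the table above, as functions of node count N.
   Illustrative only: assumes N is a perfect square for mesh/torus and a
   power of two for the hypercube. */
int main(void) {
    int N = 64;
    printf("ring:            %d\n", 2);
    printf("2D mesh:         %.0f\n", sqrt((double)N));
    printf("2D torus:        %.0f\n", 2 * sqrt((double)N));
    printf("hypercube:       %d\n", N / 2);
    printf("fully connected: %d\n", (N / 2) * (N / 2)); /* N^2 / 4 */
    return 0;
}
```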

Hypercube networks

  • a hypercube is a network topology where each node is connected to others in a multi-dimensional cube structure.
    • for an n-dimensional hypercube with 2ⁿ nodes, each node has n connections, one per dimension.
    • hypercubes provide low diameter and high connectivity, enabling efficient parallel communication.
  • advantages:
    • logarithmic diameter relative to the number of nodes.
    • good scalability and fault tolerance.
  • used in some parallel computer architectures due to efficient routing properties.
```mermaid
graph LR
    A0((00)) --- A1((01))
    A0 --- A2((10))
    A1 --- A3((11))
    A2 --- A3
```
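The diagram above is the 2-dimensional case (a square). In general, a node's neighbors are the labels obtained by flipping one bit; a minimal C sketch (the node label and dimension are chosen arbitrarily for illustration):

```c
#include <stdio.h>

/* Print the n neighbors of a node in an n-dimensional hypercube.
   Each neighbor's label differs from the node's in exactly one bit. */
void print_neighbors(unsigned node, unsigned n) {
    for (unsigned d = 0; d < n; d++)
        printf("%u ", node ^ (1u << d)); /* flip bit d */
    printf("\n");
}

int main(void) {
    /* 3-dimensional hypercube: node 0b101 (5) has neighbors 4, 7, 1. */
    print_neighbors(5, 3);
    return 0;
}
```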

Indirect interconnects

  • indirect interconnects use switches or routers between processors rather than direct point-to-point links.
    • examples include crossbar switches and multistage networks (e.g., omega or butterfly networks).
  • these interconnects allow scalable communication by routing messages through intermediate nodes.
  • trade-offs exist between complexity, cost, and communication latency.

Latency and bandwidth

  • Latency is the delay between initiating a communication operation and the arrival of the first data.
    • includes startup time, propagation delay, and queuing delays.
  • Bandwidth is the volume of data that can be transmitted per unit time.
  • both latency and bandwidth affect overall communication performance.
    • high latency can be mitigated by overlapping communication and computation.
    • bandwidth limits the sustained data transfer rate.
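A common first-order model combines the two as transfer time = latency + message size / bandwidth; a small C sketch with illustrative (not measured) numbers:

```c
#include <stdio.h>

/* Simple linear cost model: time = latency + message_size / bandwidth.
   The parameters below are illustrative, not measurements. */
double transfer_time(double latency_s, double bandwidth_bytes_per_s,
                     double message_bytes) {
    return latency_s + message_bytes / bandwidth_bytes_per_s;
}

int main(void) {
    double latency = 1e-6;    /* 1 microsecond startup cost  */
    double bandwidth = 10e9;  /* 10 GB/s sustained bandwidth */
    /* Small messages are latency-bound, large ones bandwidth-bound. */
    printf("1 KiB: %.3g s\n", transfer_time(latency, bandwidth, 1024.0));
    printf("1 GiB: %.3g s\n", transfer_time(latency, bandwidth, 1073741824.0));
    return 0;
}
```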

Memory architectures and cache coherence

Shared vs. distributed memory

  • Shared memory systems have a single physical memory accessible by all processors.
    • easier programming model but can suffer from contention and scalability issues.
  • Distributed memory systems have separate local memories for each processor.
    • communication occurs via message passing.
    • scales well but requires explicit data movement.

Cache coherence

  • ensures that all processors see a consistent view of memory when caching is used.
  • two main protocols:
    • Snooping-based coherence:
      • caches monitor a shared bus for read/write operations to maintain consistency.
      • efficient for small-scale systems with a common bus.
    • Directory-based coherence:
      • a directory keeps track of which caches have copies of each memory block.
      • scales better for large systems by reducing broadcast traffic.

False sharing

  • occurs when processors cache different variables that reside on the same cache line.
    • even if processors access different variables, the entire cache line is invalidated or updated.
  • leads to unnecessary coherence traffic and performance degradation.
  • avoided by careful data alignment or padding.

Warning

False sharing causes severe performance degradation due to frequent invalidations of cache lines, even when threads access distinct variables. Proper data alignment and padding are essential to avoid this pitfall.
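A minimal sketch of the padding approach, assuming a 64-byte cache line (the actual line size is hardware dependent): each thread's counter is padded and aligned to occupy its own line, so updates by different threads never touch the same line.

```c
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache line size; hardware dependent */
#define NTHREADS   4
#define ITERS      10000000L

/* Each counter fills a whole cache line, so two threads never write to
   the same line and no false-sharing coherence traffic is generated. */
struct padded_counter {
    _Alignas(CACHE_LINE) long value;      /* start on a line boundary */
    char pad[CACHE_LINE - sizeof(long)];  /* fill the rest of the line */
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;   /* private line: no coherence ping-pong */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("counter %d = %ld\n", i, counters[i].value);
    return 0;
}
```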

Parallel programming models

SPMD vs SIMD

  • SPMD (Single program multiple data):
    • multiple processors run the same program but operate on different data.
    • common in distributed memory and multi-core systems.
  • SIMD (Single instruction multiple data):
    • a single instruction controls multiple processing elements performing the same operation on multiple data points.
    • common in vector processors and GPUs.
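A minimal SPMD sketch using standard MPI calls: every process runs the same program and selects its share of the data from its rank (the block decomposition below is illustrative).

```c
#include <mpi.h>
#include <stdio.h>

/* SPMD: one program, launched as many processes; each process picks
   its portion of the data based on its rank. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative decomposition: N elements split evenly across ranks. */
    const int N = 1000000;
    int chunk = N / size;
    int start = rank * chunk;
    int end   = (rank == size - 1) ? N : start + chunk;

    printf("rank %d of %d handles elements [%d, %d)\n", rank, size, start, end);

    MPI_Finalize();
    return 0;
}
```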

Thread and process coordination

  • threads/processes must coordinate access to shared resources to avoid conflicts.
  • coordination mechanisms include locks, barriers, and signals.
  • proper synchronization is essential to avoid race conditions and ensure correctness.

Shared memory programming

Dynamic vs static threads

  • Static threads:
    • created once and persist for the lifetime of the program.
    • useful for predictable workloads.
  • Dynamic threads:
    • created and destroyed as needed.
    • provide flexibility but add overhead.
| Aspect | Static threads | Dynamic threads |
| --- | --- | --- |
| Lifetime | Fixed for program duration | Created and destroyed as needed |
| Overhead | Low | Higher due to creation/destruction |
| Flexibility | Less flexible | More flexible |
| Use case | Predictable workloads | Irregular or dynamic workloads |
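A minimal sketch of dynamic threads with pthreads: workers are created for one batch of work and joined when it finishes, which is flexible but pays the creation/destruction cost on every batch (the task body is a placeholder).

```c
#include <pthread.h>
#include <stdio.h>

/* Dynamic threads: created when a batch of work arrives, destroyed
   (joined) when it completes. Repeated creation adds overhead compared
   with a static pool that persists for the program's lifetime. */
static void *task(void *arg) {
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    enum { NWORKERS = 4 };
    pthread_t workers[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, task, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);   /* threads end with the batch */
    return 0;
}
```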

Nondeterminism and race conditions

  • Nondeterminism arises when thread execution order is unpredictable.
  • Race conditions occur when multiple threads access shared data concurrently, and at least one thread modifies it without proper synchronization.
  • can lead to inconsistent or incorrect results.

Note

Nondeterminism means that the exact order and timing of thread execution cannot be predicted, which can cause different program outcomes on different runs even with the same input.
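A minimal C sketch of such a race: two threads increment a shared counter without synchronization, so updates can be lost and the result varies from run to run.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;   /* shared, unprotected */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;   /* read-modify-write is not atomic: updates can be lost */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but the race typically produces a smaller,
       run-to-run varying result. */
    printf("counter = %ld\n", counter);
    return 0;
}
```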

Synchronization primitives

  • Mutexes (locks):
    • ensure mutual exclusion by allowing only one thread to access a critical section at a time.
  • Busy waiting:
    • threads continuously check for a condition, wasting CPU cycles.
  • Semaphores:
    • counting mechanisms to control access to resources.
  • Transactions:
    • group operations that execute atomically, rolling back if conflicts occur.
  • these primitives help enforce thread safety and prevent race conditions.
```mermaid
flowchart TD
    Start[Attempt to acquire mutex]
    Start --> CheckLock{Is mutex locked?}
    CheckLock -- No --> AcquireLock[Acquire mutex]
    AcquireLock --> CriticalSection[Enter critical section]
    CriticalSection --> ReleaseLock[Release mutex]
    ReleaseLock --> End[Exit critical section]
    CheckLock -- Yes --> Wait[Wait or retry]
    Wait --> CheckLock
```
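The flowchart above, written as a pthread sketch: the lock serializes the increments, so the race from the earlier example disappears.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* block until the mutex is free */
        counter++;                    /* critical section              */
        pthread_mutex_unlock(&lock);  /* let another thread proceed    */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 2000000 */
    return 0;
}
```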

Tip

Transactional memory allows a group of operations to execute atomically by tracking changes and rolling back if conflicts occur, simplifying synchronization and avoiding deadlocks.

Distributed memory and message passing

  • processes communicate by explicitly sending and receiving messages.
  • Message Passing Interface (MPI) is a widely used standard.
  • communication models:
    • Two-sided communication: sender and receiver both participate in message exchange.
    • One-sided communication: one process can directly access memory of another without explicit cooperation.
  • programming distributed systems requires managing data distribution and communication efficiently.
```mermaid
sequenceDiagram
    participant P1 as Process 1
    participant P2 as Process 2
    P1->>P2: Send Message
    P2-->>P1: Acknowledge
```
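A minimal two-sided MPI sketch (assumes at least two processes): rank 0 sends an integer and rank 1 posts the matching receive.

```c
#include <mpi.h>
#include <stdio.h>

/* Two-sided communication: the sender calls MPI_Send and the receiver
   must post a matching MPI_Recv. */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```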

Two-sided vs one-sided communication

| Aspect | Two-sided communication | One-sided communication |
| --- | --- | --- |
| Participation | Both sender and receiver actively involved | Only the origin process initiates the transfer |
| Synchronization | Implicit in matching send/receive calls | No matching call from the target; synchronization handled separately (e.g., fences or access epochs) |
| Complexity | Simpler to understand and implement | More complex but potentially more efficient |
| Use cases | General message passing | Remote memory access, shared data structures |
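A minimal one-sided MPI sketch using MPI_Put and fence synchronization (assumes at least two processes): rank 0 writes directly into a window exposed by rank 1, which never posts a receive.

```c
#include <mpi.h>
#include <stdio.h>

/* One-sided communication: rank 0 writes into rank 1's memory window
   with MPI_Put; rank 1 does not post a receive. The fences delimit the
   synchronization epochs. */
int main(int argc, char **argv) {
    int rank, local = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process exposes one int through the window. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1's window now holds %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```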

Partitioned global address space (PGAS)

  • a programming model that provides a global memory address space partitioned among processes.
  • allows one-sided communication with a global view of memory.
  • simplifies programming compared to pure message passing.

Hybrid programming

  • combines shared memory and distributed memory models.
    • for example, MPI between nodes and threads (e.g., OpenMP) within a node.
  • enables efficient use of hierarchical hardware architectures.
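A minimal hybrid sketch, assuming an MPI library built with thread support: OpenMP threads share the work within each rank, and MPI combines the per-rank results across nodes.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid model: MPI between nodes, OpenMP threads within each node.
   MPI_THREAD_FUNNELED means only the main thread makes MPI calls. */
int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    /* Threads share the work on this rank's portion of the data. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0;   /* placeholder for real per-element work */

    /* The main thread combines per-rank results across nodes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %.0f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```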

Parallel software caveats

Challenges in parallel programming

  • debugging is more difficult due to nondeterminism.
  • performance tuning requires understanding of hardware and communication costs.
  • load balancing is critical to avoid idle processors.
  • careful design needed to minimize synchronization overhead and contention.

I/O issues in parallel systems

  • parallel applications often require efficient input/output operations.
  • I/O bottlenecks can limit scalability.
  • strategies include:
    • collective I/O operations.
    • asynchronous I/O to overlap computation and data transfer.
    • using parallel file systems optimized for concurrent access.
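A minimal collective I/O sketch using MPI-IO (the filename and block size are illustrative): each rank writes its block to a shared file at a rank-dependent offset with a collective call, so the I/O layer can merge the requests.

```c
#include <mpi.h>

/* Collective I/O: every rank writes its block of data to a shared file
   with MPI_File_write_at_all, letting the MPI-IO layer combine requests. */
int main(int argc, char **argv) {
    int rank;
    enum { COUNT = 1024 };
    int buf[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < COUNT; i++)
        buf[i] = rank;   /* each rank writes its own rank id */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank-dependent offset so blocks land contiguously in the file. */
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```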