Network topologies and interconnects

Bisection width and bandwidth

  • Bisection width refers to the minimum number of connections that must be cut to divide a network into two equal halves.
    • a higher bisection width generally indicates better potential for communication bandwidth across the network.
  • Bandwidth is the rate at which data can be transferred across the network.
    • network performance depends on both bandwidth and latency.
    • effective communication requires balancing these factors.
| Topology | Bisection width (N nodes) |
| --- | --- |
| Ring | 2 |
| Mesh (2D) | √N |
| Torus (2D) | 2√N |
| Hypercube (n-dimensional, N = 2ⁿ) | N/2 |
| Fully connected | N²/4 |
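As a quick sanity check on the table, a minimal C sketch that evaluates these formulas for one illustrative node count (N = 64 here is arbitrary):

```c
#include <math.h>
#include <stdio.h>

/* Bisection widths from the table above, as functions of node count N.
   Illustrative only: assumes N is a perfect square for mesh/torus and a
   power of two for the hypercube. */
int main(void) {
    int N = 64;
    printf("ring:            %d\n", 2);
    printf("2D mesh:         %.0f\n", sqrt((double)N));
    printf("2D torus:        %.0f\n", 2 * sqrt((double)N));
    printf("hypercube:       %d\n", N / 2);
    printf("fully connected: %d\n", (N / 2) * (N / 2)); /* N^2 / 4 */
    return 0;
}
```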

Hypercube networks

  • a hypercube is a network topology where each node is connected to others in a multi-dimensional cube structure.
    • for an n-dimensional hypercube with 2ⁿ nodes, each node has n connections, one per dimension.
    • hypercubes provide low diameter and high connectivity, enabling efficient parallel communication.
  • advantages:
    • logarithmic diameter relative to the number of nodes.
    • good scalability and fault tolerance.
  • used in some parallel computer architectures due to efficient routing properties.
```mermaid
graph LR
    A0((00)) --- A1((01))
    A0 --- A2((10))
    A1 --- A3((11))
    A2 --- A3
```
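The diagram above is the 2-dimensional case (a square). In general, a node's neighbors are the labels obtained by flipping one bit; a minimal C sketch (the node label and dimension are chosen arbitrarily for illustration):

```c
#include <stdio.h>

/* Print the n neighbors of a node in an n-dimensional hypercube.
   Each neighbor's label differs from the node's in exactly one bit. */
void print_neighbors(unsigned node, unsigned n) {
    for (unsigned d = 0; d < n; d++)
        printf("%u ", node ^ (1u << d)); /* flip bit d */
    printf("\n");
}

int main(void) {
    /* 3-dimensional hypercube: node 0b101 (5) has neighbors 4, 7, 1. */
    print_neighbors(5, 3);
    return 0;
}
```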

Indirect interconnects

  • indirect interconnects use switches or routers between processors rather than direct point-to-point links.
    • examples include crossbar switches and multistage networks (e.g., omega or butterfly networks).
  • these interconnects allow scalable communication by routing messages through intermediate nodes.
  • trade-offs exist between complexity, cost, and communication latency.

Latency and bandwidth

  • Latency is the delay between initiating a communication operation and the arrival of the first data.
    • includes startup time, propagation delay, and queuing delays.
  • Bandwidth is the volume of data that can be transmitted per unit time.
  • both latency and bandwidth affect overall communication performance.
    • high latency can be mitigated by overlapping communication and computation.
    • bandwidth limits the sustained data transfer rate.
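A common first-order model combines the two as transfer time = latency + message size / bandwidth; a small C sketch with illustrative (not measured) numbers:

```c
#include <stdio.h>

/* Simple linear cost model: time = latency + message_size / bandwidth.
   The parameters below are illustrative, not measurements. */
double transfer_time(double latency_s, double bandwidth_bytes_per_s,
                     double message_bytes) {
    return latency_s + message_bytes / bandwidth_bytes_per_s;
}

int main(void) {
    double latency = 1e-6;    /* 1 microsecond startup cost  */
    double bandwidth = 10e9;  /* 10 GB/s sustained bandwidth */
    /* Small messages are latency-bound, large ones bandwidth-bound. */
    printf("1 KiB: %.3g s\n", transfer_time(latency, bandwidth, 1024.0));
    printf("1 GiB: %.3g s\n", transfer_time(latency, bandwidth, 1073741824.0));
    return 0;
}
```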

Memory architectures and cache coherence

Shared vs. distributed memory

  • Shared memory systems have a single physical memory accessible by all processors.
    • easier programming model but can suffer from contention and scalability issues.
  • Distributed memory systems have separate local memories for each processor.
    • communication occurs via message passing.
    • scales well but requires explicit data movement.

Cache coherence

  • ensures that all processors see a consistent view of memory when caching is used.
  • two main protocols:
    • Snooping-based coherence:
      • caches monitor a shared bus for read/write operations to maintain consistency.
      • efficient for small-scale systems with a common bus.
    • Directory-based coherence:
      • a directory keeps track of which caches have copies of each memory block.
      • scales better for large systems by reducing broadcast traffic.

False sharing

  • occurs when processors cache different variables that reside on the same cache line.
    • even if processors access different variables, the entire cache line is invalidated or updated.
  • leads to unnecessary coherence traffic and performance degradation.
  • avoided by careful data alignment or padding.

Warning

False sharing causes severe performance degradation due to frequent invalidations of cache lines, even when threads access distinct variables. Proper data alignment and padding are essential to avoid this pitfall.
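A minimal sketch of the padding approach, assuming a 64-byte cache line (the actual line size is hardware dependent): each thread's counter is padded and aligned to occupy its own line, so updates by different threads never touch the same line.

```c
#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE 64   /* assumed cache line size; hardware dependent */
#define NTHREADS   4
#define ITERS      10000000L

/* Each counter fills a whole cache line, so two threads never write to
   the same line and no false-sharing coherence traffic is generated. */
struct padded_counter {
    _Alignas(CACHE_LINE) long value;      /* start on a line boundary */
    char pad[CACHE_LINE - sizeof(long)];  /* fill the rest of the line */
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;   /* private line: no coherence ping-pong */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("counter %d = %ld\n", i, counters[i].value);
    return 0;
}
```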

Parallel programming models

SPMD vs SIMD

  • SPMD (Single program multiple data):
    • multiple processors run the same program but operate on different data.
    • common in distributed memory and multi-core systems.
  • SIMD (Single instruction multiple data):
    • a single instruction controls multiple processing elements performing the same operation on multiple data points.
    • common in vector processors and GPUs.
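A minimal SPMD sketch using standard MPI calls: every process runs the same program and selects its share of the data from its rank (the block decomposition below is illustrative).

```c
#include <mpi.h>
#include <stdio.h>

/* SPMD: one program, launched as many processes; each process picks
   its portion of the data based on its rank. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Illustrative decomposition: N elements split evenly across ranks. */
    const int N = 1000000;
    int chunk = N / size;
    int start = rank * chunk;
    int end   = (rank == size - 1) ? N : start + chunk;

    printf("rank %d of %d handles elements [%d, %d)\n", rank, size, start, end);

    MPI_Finalize();
    return 0;
}
```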

Thread and process coordination

  • threads/processes must coordinate access to shared resources to avoid conflicts.
  • coordination mechanisms include locks, barriers, and signals.
  • proper synchronization is essential to avoid race conditions and ensure correctness.

Shared memory programming

Dynamic vs static threads

  • Static threads:
    • created once and persist for the lifetime of the program.
    • useful for predictable workloads.
  • Dynamic threads:
    • created and destroyed as needed.
    • provide flexibility but add overhead.
| Aspect | Static threads | Dynamic threads |
| --- | --- | --- |
| Lifetime | Fixed for program duration | Created and destroyed as needed |
| Overhead | Low | Higher due to creation/destruction |
| Flexibility | Less flexible | More flexible |
| Use case | Predictable workloads | Irregular or dynamic workloads |
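A minimal sketch of dynamic threads with pthreads: workers are created for one batch of work and joined when it finishes, which is flexible but pays the creation/destruction cost on every batch (the task body is a placeholder).

```c
#include <pthread.h>
#include <stdio.h>

/* Dynamic threads: created when a batch of work arrives, destroyed
   (joined) when it completes. Repeated creation adds overhead compared
   with a static pool that persists for the program's lifetime. */
static void *task(void *arg) {
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void) {
    enum { NWORKERS = 4 };
    pthread_t workers[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&workers[i], NULL, task, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(workers[i], NULL);   /* threads end with the batch */
    return 0;
}
```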

Nondeterminism and race conditions

  • Nondeterminism arises when thread execution order is unpredictable.
  • Race conditions occur when multiple threads access shared data concurrently, and at least one thread modifies it without proper synchronization.
  • can lead to inconsistent or incorrect results.

Note

Nondeterminism means that the exact order and timing of thread execution cannot be predicted, which can cause different program outcomes on different runs even with the same input.
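A minimal C sketch of such a race: two threads increment a shared counter without synchronization, so updates can be lost and the result varies from run to run.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;   /* shared, unprotected */

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;   /* read-modify-write is not atomic: updates can be lost */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Expected 2000000, but the race typically produces a smaller,
       run-to-run varying result. */
    printf("counter = %ld\n", counter);
    return 0;
}
```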

Synchronization primitives

  • Mutexes (locks):
    • ensure mutual exclusion by allowing only one thread to access a critical section at a time.
  • Busy waiting:
    • threads continuously check for a condition, wasting CPU cycles.
  • Semaphores:
    • counting mechanisms to control access to resources.
  • Transactions:
    • group operations that execute atomically, rolling back if conflicts occur.
  • these primitives help enforce thread safety and prevent race conditions.
```mermaid
flowchart TD
    Start[Attempt to acquire mutex]
    Start --> CheckLock{Is mutex locked?}
    CheckLock -- No --> AcquireLock[Acquire mutex]
    AcquireLock --> CriticalSection[Enter critical section]
    CriticalSection --> ReleaseLock[Release mutex]
    ReleaseLock --> End[Exit critical section]
    CheckLock -- Yes --> Wait[Wait or retry]
    Wait --> CheckLock
```
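The flowchart above, written as a pthread sketch: the lock serializes the increments, so the race from the earlier example disappears.

```c
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);    /* block until the mutex is free */
        counter++;                    /* critical section              */
        pthread_mutex_unlock(&lock);  /* let another thread proceed    */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);   /* always 2000000 */
    return 0;
}
```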

Tip

Transactional memory allows a group of operations to execute atomically by tracking changes and rolling back if conflicts occur, simplifying synchronization and avoiding deadlocks.

Distributed memory and message passing

  • processes communicate by explicitly sending and receiving messages.
  • Message Passing Interface (MPI) is a widely used standard.
  • communication models:
    • Two-sided communication: sender and receiver both participate in message exchange.
    • One-sided communication: one process can directly access memory of another without explicit cooperation.
  • programming distributed systems requires managing data distribution and communication efficiently.
```mermaid
sequenceDiagram
    participant P1 as Process 1
    participant P2 as Process 2
    P1->>P2: Send Message
    P2-->>P1: Acknowledge
```
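A minimal two-sided MPI sketch (assumes at least two processes): rank 0 sends an integer and rank 1 posts the matching receive.

```c
#include <mpi.h>
#include <stdio.h>

/* Two-sided communication: the sender calls MPI_Send and the receiver
   must post a matching MPI_Recv. */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```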

Two-sided vs one-sided communication

| Aspect | Two-sided communication | One-sided communication |
| --- | --- | --- |
| Participation | Both sender and receiver actively involved | Only the origin process initiates the transfer |
| Synchronization | Implicit in matching send/receive calls | No matching call from the target; synchronization handled separately (e.g., fences or access epochs) |
| Complexity | Simpler to understand and implement | More complex but potentially more efficient |
| Use cases | General message passing | Remote memory access, shared data structures |
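A minimal one-sided MPI sketch using MPI_Put and fence synchronization (assumes at least two processes): rank 0 writes directly into a window exposed by rank 1, which never posts a receive.

```c
#include <mpi.h>
#include <stdio.h>

/* One-sided communication: rank 0 writes into rank 1's memory window
   with MPI_Put; rank 1 does not post a receive. The fences delimit the
   synchronization epochs. */
int main(int argc, char **argv) {
    int rank, local = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process exposes one int through the window. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1 /* target rank */, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1's window now holds %d\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```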

Partitioned global address space (PGAS)

  • a programming model that provides a global memory address space partitioned among processes.
  • allows one-sided communication with a global view of memory.
  • simplifies programming compared to pure message passing.

Hybrid programming

  • combines shared memory and distributed memory models.
    • for example, MPI between nodes and threads (e.g., OpenMP) within a node.
  • enables efficient use of hierarchical hardware architectures.
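A minimal hybrid sketch, assuming an MPI library built with thread support: OpenMP threads share the work within each rank, and MPI combines the per-rank results across nodes.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Hybrid model: MPI between nodes, OpenMP threads within each node.
   MPI_THREAD_FUNNELED means only the main thread makes MPI calls. */
int main(int argc, char **argv) {
    int provided, rank;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    /* Threads share the work on this rank's portion of the data. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0;   /* placeholder for real per-element work */

    /* The main thread combines per-rank results across nodes. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %.0f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```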

Parallel software caveats

Challenges in parallel programming

  • debugging is more difficult due to nondeterminism.
  • performance tuning requires understanding of hardware and communication costs.
  • load balancing is critical to avoid idle processors.
  • careful design needed to minimize synchronization overhead and contention.

I/O issues in parallel systems

  • parallel applications often require efficient input/output operations.
  • I/O bottlenecks can limit scalability.
  • strategies include:
    • collective I/O operations.
    • asynchronous I/O to overlap computation and data transfer.
    • using parallel file systems optimized for concurrent access.
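A minimal collective I/O sketch using MPI-IO (the filename and block size are illustrative): each rank writes its block to a shared file at a rank-dependent offset with a collective call, so the I/O layer can merge the requests.

```c
#include <mpi.h>

/* Collective I/O: every rank writes its block of data to a shared file
   with MPI_File_write_at_all, letting the MPI-IO layer combine requests. */
int main(int argc, char **argv) {
    int rank;
    enum { COUNT = 1024 };
    int buf[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < COUNT; i++)
        buf[i] = rank;   /* each rank writes its own rank id */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank-dependent offset so blocks land contiguously in the file. */
    MPI_Offset offset = (MPI_Offset)rank * COUNT * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf, COUNT, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```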