Parallel hardware

MIMD systems

  • two types (based on memory arrangement):
    • Shared-memory systems: processors connected to a common memory through an interconnect; they communicate implicitly through shared data structures (see the sketch below)
    • Distributed-memory systems: each processor has its own private memory; they communicate explicitly by message passing / special access functions
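
A minimal sketch of the shared-memory style, assuming POSIX threads (the producer/consumer names are illustrative, not from the notes): two threads communicate simply by writing and reading a shared variable. In a distributed-memory system the same exchange would instead use explicit messages (e.g., MPI_Send/MPI_Recv).

    /* Shared-memory communication: the "message" is just a shared variable.
     * Compile with: cc -pthread shared.c */
    #include <pthread.h>
    #include <stdio.h>

    static int shared_value;   /* visible to all threads: the shared data structure */
    static int ready;          /* flag: producer has written the value */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static void *producer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_value = 42;              /* communication = writing shared data */
        ready = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!ready)                  /* wait until the producer has written */
            pthread_cond_wait(&cond, &lock);
        printf("received %d via shared memory\n", shared_value);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }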

Shared-memory systems

  • multiprocessor architecture (general-purpose MIMD)
  • widely used in servers and workstations
  • usually homogeneous: n identical processors
  • each node = CPU + local memory/caches + I/O
  • all nodes share the main memory (sharing implemented at the firmware level)
  • any processor can access any memory location
  • some cache levels may be shared
  • Multicore chips: the most common case (private L1 caches; higher-level caches sometimes shared)

Processing nodes

  • Interface unit (W): adapts CPU memory requests to the memory modules + interconnect, packaging requests into messages (source/destination, routing info, etc.)
  • I/O unit: handles direct node-to-node communication, uses DMA, shares the same interconnect

Physical addressing

  • Multiprocessor memory: scales to TBs (40+ bit physical addresses)
  • CPU: must be able to generate such large physical addresses
  • requires indivisible access sequences + cache coherence (see the atomics sketch below)
  • logical and physical address sizes are independent of each other
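
A sketch of what an indivisible access sequence buys, assuming C11 atomics: atomic_fetch_add is a read-modify-write that the hardware executes as one indivisible operation; with a plain counter++ the interleaved loads and stores of two processors could lose updates.

    /* Indivisible read-modify-write via C11 atomics.
     * Compile with: cc -pthread atomic.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static atomic_int counter = 0;
    #define N_INCS 100000

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < N_INCS; i++)
            atomic_fetch_add(&counter, 1);   /* indivisible access sequence */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* Always 200000 here; with a plain int and counter++, two processors'
         * load/add/store sequences may interleave and updates get lost. */
        printf("counter = %d\n", atomic_load(&counter));
        return 0;
    }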

Classes of multiprocessor architectures

  • Process-to-processor mapping:
    • Anonymous: any process can run on any processor; dynamic scheduling with a global ready list
    • Dedicated: static assignment at load time; each node keeps its own ready list; occasional reallocation for fault tolerance / load balancing (see the affinity sketch below)
  • Modular memory organization:
    • UMA (SMP): uniform memory access time
    • NUMA: non-uniform memory access time
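
A sketch of dedicated mapping at the OS level, assuming Linux and its sched_setaffinity call (and that core 0 exists): the calling thread is pinned to one core. The anonymous case is simply the scheduler's default, where the thread may migrate to any core.

    /* Pin the calling thread to core 0 (Linux-specific, illustrative). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                        /* allow core 0 only */
        if (sched_setaffinity(0, sizeof set, &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to core %d\n", sched_getcpu());
        return 0;
    }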

Shared-memory types

Type 1 – UMA

  • Definition: all processors directly connect to the same memory, with equal latency regardless of which processor i accesses which memory module j
  • tightly coupled, resource-sharing
  • easier to program
  • Local access: L1/L2 caches
  • Remote access: L3/main memory
  • Symmetric multiprocessors: all processors identical, equal access to devices (Intel Xeon SMP)
  • Asymmetric multiprocessors: master processor runs OS, others specialized (ARM big.LITTLE)

Type 2 – NUMA

  • Definition: each processor has its own local memory, but all local memories together form a single global address space
  • local access is faster than remote; several distinct access times (cache, local, remote)
  • COMA (Cache-Only Memory Architecture): experimental variant that treats local memory as a cache
  • examples: AMD EPYC, Intel Xeon Scalable
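
A sketch of NUMA-aware placement, assuming Linux with libnuma installed (link with -lnuma): numa_alloc_local places the buffer's pages on the calling CPU's node, so subsequent accesses are local rather than remote.

    /* Allocate memory on the local NUMA node (illustrative).
     * Compile with: cc numa.c -lnuma */
    #define _GNU_SOURCE
    #include <numa.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        size_t size = 1 << 20;                 /* 1 MiB */
        void *buf = numa_alloc_local(size);    /* pages on the caller's node */
        if (!buf) return 1;
        int cpu = sched_getcpu();
        printf("running on CPU %d (node %d), buffer allocated locally\n",
               cpu, numa_node_of_cpu(cpu));
        numa_free(buf, size);
        return 0;
    }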

Architecture combinations

  • anonymous + UMA
  • anonymous + NUMA
  • dedicated + UMA
  • dedicated + NUMA
  • most natural pairings: anonymous + UMA and dedicated + NUMA
  • Memory hierarchies: smooth out the differences between the combinations

Issues in shared-memory

  • Access latency
  • Memory conflicts
  • optimization target = minimize both

Minimizing access latency

  • Interconnection network latency (worked numbers after this list):
    • bus: grows linearly with n
    • butterflies / high-dimensional cubes / trees: grows logarithmically with n
    • low-dimensional cubes (meshes): grows as √n
  • shared-memory accesses are expensive
    • UMA: all accesses are remote
    • NUMA: a mix of local + remote accesses
  • UMA goal: dynamic allocation of shared data in caches
  • NUMA goal: static allocation in local memory + dynamic caching
    • dedicated mapping: private info (code, data) mapped to local memory; remote accesses only for shared info
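
Illustrative numbers (not from the notes) for n = 1024 nodes, in units of a single link traversal, showing why topology matters:

    T_{\text{bus}} \propto n = 1024, \qquad
    T_{\text{butterfly/tree}} \propto \log_2 n = 10, \qquad
    T_{\text{2D mesh}} \propto \sqrt{n} = 32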

Minimizing memory conflicts

  • Queueing model: nodes = clients, memory modules = servers
  • latency = server response time
  • depends on server utilization, i.e., on conflicts (see the formula below)
  • longer interarrival times = less congestion
  • Local accesses: key for performance (NUMA local memories, SMP caches)
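
One way to make this quantitative, assuming the standard M/M/1 server approximation (the notes do not fix a queueing discipline): with mean service time S and mean interarrival time T_A,

    \rho = \frac{S}{T_A}, \qquad R = \frac{S}{1 - \rho}

so raising T_A (fewer requests per module, i.e., more local accesses) lowers the utilization ρ and hence the response time R.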

Cache coherence

  • Problem: copies of writable shared data held in multiple caches can become inconsistent
  • Solution: hardware/firmware mechanisms are needed to keep the cached copies consistent (see the sketch below)
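
A minimal sketch of the sharing pattern a coherence mechanism must support, assuming C11 atomics on coherent hardware: one thread publishes data behind a flag, another spins on the flag and then reads the data. The coherence protocol (e.g., invalidation-based schemes such as MESI) is what prevents the reader from serving a stale cached copy of data.

    /* Writer publishes, reader consumes: coherence keeps the caches in sync.
     * Compile with: cc -pthread coherence.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int data;             /* written by one thread, read by another */
    static atomic_int flag = 0;  /* publication flag */

    static void *writer(void *arg) {
        (void)arg;
        data = 123;              /* this store lands in the writer's cache */
        /* The release store, together with hardware coherence, guarantees the
         * reader observes the new value of data, not a stale cached copy. */
        atomic_store_explicit(&flag, 1, memory_order_release);
        return NULL;
    }

    static void *reader(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ;                    /* spin until the write is published */
        printf("read data = %d\n", data);  /* prints 123 on coherent hardware */
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }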