Parallel hardware
MIMD systems
- two types (based on memory arrangement):
- Shared-memory systems: processors connected to a common memory via an interconnect; communication through shared data structures (see the sketch after this list)
- Distributed-memory systems: each processor has its own private memory; communication by message passing / special access functions
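A minimal sketch of the shared-memory style in C with POSIX threads: the threads communicate purely by reading and writing a shared counter (the mutex, thread count, and iteration count are illustrative choices, not from the notes; compile with -pthread):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Shared data structure: every thread sees the same counter. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* indivisible access sequence */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    /* Communication happened through shared memory, not messages. */
    printf("counter = %ld\n", counter);
    return 0;
}
```

In a distributed-memory system the same exchange would instead go through explicit sends and receives (e.g. message passing with MPI).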
Shared-memory systems
- general-purpose MIMD multiprocessor architecture
- widely used in servers, workstations
- usually homogeneous: n identical processors
- each node = CPU + local memory/caches + I/O
- all nodes share the main memory (at the firmware level)
- any processor can access any memory
- some cache levels may be shared
- Multicore chips: most common case (private L1, higher caches sometimes shared)
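On Linux/glibc this topology can be inspected at run time; a small sketch (the _SC_LEVEL1_DCACHE_LINESIZE query is a glibc extension and may return 0 where unsupported):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Cores visible to the OS (one hardware context each). */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    /* Line size of the per-core private L1 data cache. */
    long l1_line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("cores: %ld, L1d line: %ld bytes\n", cores, l1_line);
    return 0;
}
```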
Processing nodes
- Interface unit (W): adapts CPU memory requests to memory modules + interconnect, packages requests into messages (src/dest, routing info, etc.)
- I/O unit: handles direct node-to-node comm, uses DMA, same interconnect
Physical addressing
- Multiprocessor memory: scales to TBs (40+ bit addresses)
- CPU: must generate large physical addresses
- requires indivisible access sequences + cache coherence
- logical and physical address sizes are independent of each other (see the example below)
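Worked numbers: a w-bit physical address reaches 2^w bytes, so TB-scale memory needs 40+ bits (2^40 B = 1 TiB). A quick check (the bit widths are just examples):

```c
#include <stdio.h>

int main(void) {
    /* Addressable bytes for a w-bit physical address: 2^w. */
    for (int w = 32; w <= 48; w += 8) {
        unsigned long long bytes = 1ULL << w;
        printf("%2d-bit addresses -> %llu GiB\n", w, bytes >> 30);
    }
    return 0;
}
```

This prints 4 GiB for 32 bits, 1024 GiB (1 TiB) for 40 bits, and 262144 GiB (256 TiB) for 48 bits.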
Classes of multiprocessor architectures
- Process-to-processor mapping:
- Anonymous: any process on any processor, dynamic scheduling, global ready list
- Dedicated: static assignment at load time, each node has its own ready list, occasional re-allocation for fault-tolerance/load-balancing (a pinning sketch follows this list)
- Modular memory organization:
- UMA (SMP): uniform access time
- NUMA: non-uniform access time
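Dedicated mapping can be approximated from user space by pinning a process to a core so the scheduler never migrates it; a Linux-specific sketch with sched_setaffinity (the core number 2 is arbitrary):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);  /* allow only core 2 (arbitrary example) */
    /* Pin the calling process (pid 0 = self) to that core. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* From here on the process is "dedicated" to core 2; without
       the call it would be scheduled anonymously on any core. */
    printf("pinned to core 2\n");
    return 0;
}
```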
Shared-memory types
Type 1 – UMA
- Definition: all processors directly connect to the same memory with equal latency (access time from processor i to module j is independent of i and j)
- tightly-coupled, resource-sharing
- easier to program
- Local access: served by the private L1/L2 caches
- Remote access: shared L3 / main memory, through the interconnect
- Symmetric multiprocessors: all processors identical, equal access to devices (Intel Xeon SMP)
- Asymmetric multiprocessors: master processor runs OS, others specialized (ARM big.LITTLE)
Type 2 – NUMA
- Definition: each processor has local memory, but all form global address space
- local faster than remote, different access times (cache, local, remote)
- COMA (Cache-Only Memory Architecture): experimental, treats local memory as a cache
- examples: AMD EPYC, Intel Xeon Scalable
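On Linux, NUMA placement can be steered explicitly through libnuma; a minimal sketch, assuming libnuma is installed (link with -lnuma):

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }
    size_t size = 1 << 20;  /* 1 MiB, arbitrary */
    /* Allocate on the node the calling thread runs on: local, fast. */
    void *local = numa_alloc_local(size);
    /* Allocate on the highest-numbered (likely remote) node. */
    void *remote = numa_alloc_onnode(size, numa_max_node());
    if (!local || !remote) return 1;
    memset(local, 1, size);   /* local access times */
    memset(remote, 1, size);  /* remote access times */
    numa_free(local, size);
    numa_free(remote, size);
    return 0;
}
```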
Architecture combinations
- anonymous + UMA
- anonymous + NUMA
- dedicated + UMA
- dedicated + NUMA
- most natural pairings: anonymous + UMA and dedicated + NUMA
- Memory hierarchies smooth out the differences between the combinations
Issues in shared-memory
- Access latency
- Memory conflicts
- optimization target = minimize both
Minimizing access latency
- Interconnection network latency, as a function of node count n (compared in the sketch below):
- bus: linear, O(n)
- butterflies / high-dimensional cubes / trees: logarithmic, O(log n)
- low-dimensional cubes (meshes): O(√n)
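A back-of-the-envelope comparison of how distance grows with n for the three families above (constants and per-hop costs omitted; only the asymptotic shape is meaningful; compile with -lm):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* Network distance for n nodes: bus ~ n,
       butterfly/hypercube/tree ~ log2(n), 2D mesh ~ sqrt(n). */
    for (int n = 16; n <= 1024; n *= 4)
        printf("n=%4d  bus:%5d  log:%5.1f  sqrt:%5.1f\n",
               n, n, log2((double)n), sqrt((double)n));
    return 0;
}
```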
- accessing shared memory through the network is expensive
- UMA: all main-memory accesses are remote
- NUMA: a mix of local and remote accesses
- UMA goal: dynamic allocation in caches
- NUMA goal: static allocation in local memory + dynamic caching
- dedicated: private info (code, data) mapped to local memory, remote only for shared info
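On Linux the default first-touch policy places each page on the node of the first thread that writes it, so initializing data in parallel implements the "static allocation in local memory" goal; a sketch with OpenMP (compile with -fopenmp; the array size is arbitrary):

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1L << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;

    /* First touch: each thread writes its own chunk, so those pages
       end up in that thread's local NUMA memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    /* Later phases with the same static schedule mostly hit local
       memory; caches absorb the remaining remote accesses. */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f\n", sum);
    free(a);
    return 0;
}
```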
Minimizing memory conflicts
- Queueing model: nodes = clients, memory modules = servers (a worked example follows this section)
- latency = server response time
- depends on server utilization (conflicts)
- longer interarrival times (lower request rate) = less congestion
- Local accesses: key for performance (NUMA memories, SMP caches)
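A worked example, assuming the simplest M/M/1 model for a single memory module: with request arrival rate λ and service rate μ, utilization ρ = λ/μ and mean response time T = 1/(μ − λ), so latency explodes as conflicts push ρ toward 1:

```c
#include <stdio.h>

int main(void) {
    double mu = 1.0;  /* service rate of one module (normalized) */
    /* Sweep the arrival rate: shorter interarrival time = more load. */
    for (double lam = 0.1; lam < 1.0; lam += 0.2) {
        double rho = lam / mu;        /* server utilization */
        double t = 1.0 / (mu - lam);  /* M/M/1 mean response time */
        printf("rho = %.1f  ->  response time = %5.2f\n", rho, t);
    }
    return 0;
}
```

At ρ = 0.9 the response time is already 10x the uncontended service time, which is why keeping accesses local matters.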
Cache coherence
- Problem: writable shared info in caches can cause inconsistencies
- Solution: mechanisms needed to ensure consistency across caches
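Coherence traffic is easy to provoke; a classic sketch of false sharing, where two threads write different variables that happen to share one cache line and the coherence protocol bounces the line between their caches (the 64-byte line size is an assumption; compile with -pthread):

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 50000000L

/* Both counters sit in the same cache line: each write by one
   thread invalidates the line in the other thread's cache. */
struct { long a, b; } shared_line;

/* Padding each counter onto its own 64-byte line avoids that. */
struct { _Alignas(64) long a; _Alignas(64) long b; } padded;

static void *bump(void *arg) {
    long *p = arg;
    for (long i = 0; i < ITERS; i++)
        (*p)++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    /* Time this, then swap in &padded.a / &padded.b and compare. */
    pthread_create(&t1, NULL, bump, &shared_line.a);
    pthread_create(&t2, NULL, bump, &shared_line.b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", shared_line.a, shared_line.b);
    return 0;
}
```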