SIMT vs SIMD vs SMT
Noted from http://goo.gl/S6HOZD
SIMD (Single Instruction, Multiple Data):
- elements of short vectors are processed in parallel.
- SIMD uses a "short vector" spelling – you break your data into short vectors, and your loop processes them using intrinsics with ugly, assembly-like names.
#include <arm_neon.h> // ARM NEON intrinsics
#include <stdint.h>

// Assumes n is a multiple of 4.
void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) {
    for (int i = 0; i < n; i += 4) {
        // compute c[i], c[i+1], c[i+2], c[i+3] in one step
        uint32x4_t a4 = vld1q_u32(a + i);
        uint32x4_t b4 = vld1q_u32(b + i);
        uint32x4_t c4 = vaddq_u32(a4, b4);
        vst1q_u32(c + i, c4);
    }
}
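For contrast, here is a scalar reference version (a sketch; the name add_scalar is mine, not from the source) showing what the intrinsics above replace – one element per iteration instead of four:

```c
#include <stdint.h>

// Scalar reference version: one addition per iteration.
// The SIMD spelling above makes the 4-wide parallelism explicit,
// instead of hoping the compiler auto-vectorizes this loop.
void add_scalar(uint32_t *a, uint32_t *b, uint32_t *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```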
SMT (Simultaneous Multithreading):
- instructions of several threads are run in parallel, sharing the execution units of a single core.
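As a rough illustration (a sketch of mine, not from the source – the helper names are hypothetical): under SMT the programmer just writes ordinary threads, and the hardware interleaves their instruction streams on one core. A plain pthreads program exposes that kind of parallelism:

```c
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

// Each thread runs an independent instruction stream; on an SMT core,
// two such streams can occupy one core's execution units simultaneously.
static void *sum_range(void *arg) {
    // arg points to {start, end, result}
    uint64_t *bounds = (uint64_t *)arg;
    uint64_t s = 0;
    for (uint64_t i = bounds[0]; i < bounds[1]; i++) s += i;
    bounds[2] = s;
    return NULL;
}

// Hypothetical helper: sums 0..n-1 using two threads.
uint64_t sum_two_threads(uint64_t n) {
    uint64_t lo[3] = {0, n / 2, 0};
    uint64_t hi[3] = {n / 2, n, 0};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, sum_range, lo);
    pthread_create(&t2, NULL, sum_range, hi);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return lo[2] + hi[2];
}
```

Note the contrast with SIMT below: these threads are fully independent, can diverge freely, and can use any OS synchronization primitive.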
SIMT (Single Instruction, Multiple Threads):
- A hybrid between vector processing and hardware threading.
- SIMT uses a "scalar spelling": you write the code of a single thread using standard arithmetic operators, which is arguably a better interface than SIMD loops with ugly, assembly-like opcodes.
__global__ void add(float *a, float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i]; // no loop!
}
- Several threads – a "warp" in NVIDIA terminology – run simultaneously, so each thread needs its own set of registers.
- Several warps, making up a "block", are mapped to an SM (streaming multiprocessor), and the SM switches between the warps of a block instantaneously – so the registers of every thread of every warp in the block must stay resident on the SM.
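The index arithmetic in the kernel above flattens this block/thread hierarchy into one global index per thread. A plain-C sketch of the mapping (the function name is mine; the parameter names mirror CUDA's built-ins):

```c
// Mirrors i = blockIdx.x * blockDim.x + threadIdx.x:
// each of the blockDim_x threads in each block gets a unique global index,
// so thread 5 of block 2 (with 256-thread blocks) handles element 517.
int global_index(int blockIdx_x, int blockDim_x, int threadIdx_x) {
    return blockIdx_x * blockDim_x + threadIdx_x;
}
```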
Conclusion
SIMT is more flexible than SIMD in three areas:
- Single instruction, multiple register sets
- Single instruction, multiple addresses
- Single instruction, multiple flow paths
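A toy C sketch (mine, with hypothetical names – real hardware does this with predicate masks, not loops) of how "multiple flow paths" works: the warp executes both sides of a branch, and a per-lane mask selects which result each lane keeps:

```c
#include <stdint.h>

#define WARP 4 // toy warp width

// Emulates one warp executing: out[lane] = odd(in[lane]) ? in[lane]*3 : in[lane]/2;
// All lanes run both paths; the mask decides which result each lane commits.
void warp_branch(const int32_t *in, int32_t *out) {
    int mask[WARP];
    int32_t then_val[WARP], else_val[WARP];
    for (int lane = 0; lane < WARP; lane++)
        mask[lane] = in[lane] % 2;     // predicate, computed per lane
    for (int lane = 0; lane < WARP; lane++)
        then_val[lane] = in[lane] * 3; // "then" path, executed by all lanes
    for (int lane = 0; lane < WARP; lane++)
        else_val[lane] = in[lane] / 2; // "else" path, also executed by all lanes
    for (int lane = 0; lane < WARP; lane++)
        out[lane] = mask[lane] ? then_val[lane] : else_val[lane];
}
```

Both paths cost execution time whether or not a lane needs them – which is exactly why flow divergence reduces performance, as noted in the SMT comparison below.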
SIMT is less flexible than SMT in three areas:
- Low occupancy greatly reduces performance
- Flow divergence greatly reduces performance
- Synchronization options are very limited