SIMT vs SIMD vs SMT

Noted from http://goo.gl/S6HOZD

SIMD (Single Instruction, Multiple Data)

  • elements of short vectors are processed in parallel.
  • SIMD uses a "short vector" spelling – you break your data into short vectors, and your loop processes them using intrinsics with ugly, assembly-like names.
#include <arm_neon.h>

// NEON version: each iteration adds four 32-bit lanes at once.
// Assumes n is a multiple of 4; a scalar tail loop would handle the rest.
void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) {
  for (int i = 0; i < n; i += 4) {
    // compute c[i], c[i+1], c[i+2], c[i+3] in one vector add
    uint32x4_t a4 = vld1q_u32(a + i);
    uint32x4_t b4 = vld1q_u32(b + i);
    uint32x4_t c4 = vaddq_u32(a4, b4);
    vst1q_u32(c + i, c4);
  }
}

SMT (Simultaneous Multithreading)

  • instructions of several threads are run in parallel.

SIMT (Single Instruction, Multiple Threads)

  • A hybrid between vector processing and hardware threading.
  • "Scalar spelling", where you write the code of a single thread using standard arithmetic operators, is arguably a better interface than SIMD loops with ugly assembly-like opcodes.
    __global__ void add(float *a, float *b, float *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        c[i] = a[i] + b[i]; // no loop!
    }
    
  • Several threads – a "warp" in NVIDIA terminology – run simultaneously. So each thread needs its own registers.
  • Several warps, making up a "block", are mapped to an SM, and an SM instantaneously switches between the warps of a block. So each warp needs separate registers for each of its threads.

Conclusion

SIMT is more flexible than SIMD in three areas

  • Single instruction, multiple register sets
  • Single instruction, multiple addresses
  • Single instruction, multiple flow paths

SIMT is less flexible than SMT in three areas

  • Low occupancy greatly reduces performance
  • Flow divergence greatly reduces performance
  • Synchronization options are very limited
