Shared Memory
Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory.
- Shared memory latency is roughly 100x lower than uncached global memory latency, provided there are no bank conflicts between the threads.
Bank Conflict
- Bank conflicts are specific to shared memory, and they are one of many reasons a GPU kernel can slow down.
- Bank conflicts arise from specific patterns of access to data in shared memory. Whether a given pattern conflicts also depends on the hardware.
- For example, an access pattern that causes a bank conflict on a device with compute capability 1.x may not cause one on a device with compute capability 2.x.
Fast shared memory access is restricted to the threads of a block. Shared memory is divided into multiple banks (similar to banks in DRAM modules).
- Each bank can service only one request at a time. The shared memory is therefore interleaved to increase the throughput.
- If the shared memory is interleaved by 32 bits, then each bank delivers 32 bits (one float) per access. The total number of banks is fixed. It is 16 on older GPUs (compute capability 1.x) and 32 on newer GPUs (compute capability 2.x and later).
- To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously.
- Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.
- However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized.
- The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests.
- If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts. To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts.
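To make the idea of an n-way bank conflict concrete, here is a minimal host-side sketch in plain C++ (it does not run on the GPU). It assumes a 4-byte bank width and counts, for each bank, how many distinct 32-bit words a request touches. The names `conflict_degree` and `strided_addresses` are made up for this illustration, and the model deliberately ignores hardware details such as how broadcasts of a single word are handled on different compute capabilities.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Returns n for an n-way bank conflict (1 means conflict-free).
// byte_addresses: one shared-memory byte address per thread in the request.
// num_banks: 16 or 32 in the text. Assumption: each bank is 4 bytes wide.
int conflict_degree(const std::vector<std::uint32_t>& byte_addresses, int num_banks) {
    assert(num_banks > 0);
    // For each bank, collect the distinct 32-bit words requested from it.
    std::vector<std::set<std::uint32_t>> words_per_bank(num_banks);
    for (std::uint32_t addr : byte_addresses) {
        std::uint32_t word = addr / 4;  // index of the 32-bit word
        words_per_bank[word % static_cast<std::uint32_t>(num_banks)].insert(word);
    }
    // The busiest bank decides how many serialized passes are needed.
    std::size_t degree = 1;
    for (const auto& words : words_per_bank) {
        if (words.size() > degree) degree = words.size();
    }
    return static_cast<int>(degree);
}

// Hypothetical helper: byte addresses for threads 0..num_threads-1,
// where thread tid reads the float at element index tid * stride.
std::vector<std::uint32_t> strided_addresses(int num_threads, int stride) {
    assert(num_threads >= 0 && stride >= 0);
    std::vector<std::uint32_t> addrs;
    for (int tid = 0; tid < num_threads; ++tid) {
        addrs.push_back(static_cast<std::uint32_t>(tid * stride) * 4u);
    }
    return addrs;
}
```

Under this model, 32 threads reading consecutive floats with 32 banks give a degree of 1, a stride of 2 gives a 2-way conflict, and a stride of 32 puts every thread in bank 0 for a 32-way conflict.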
Organization of Shared memory Banks in CUDA
Shared memory banks are organized such that successive 32-bit words are assigned to successive banks, and each bank has a bandwidth of 32 bits per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16.
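Because successive 32-bit words go to successive banks, the bank of a word is just its index modulo the number of banks. For the common case where thread tid reads word tid * stride and the request has as many threads as there are banks, the conflict degree then works out to gcd(stride, number of banks). The sketch below is plain host-side C++ under exactly those assumptions; `bank_of_word` and `strided_conflict_degree` are illustrative names, not CUDA functions.

```cpp
#include <cassert>

// Bank that holds a given 32-bit word: successive words map to successive banks.
int bank_of_word(int word_index, int num_banks) {
    assert(word_index >= 0 && num_banks > 0);
    return word_index % num_banks;
}

// Conflict degree when thread tid accesses word tid * stride_words.
// Assumption: the request has exactly num_banks threads (for example,
// a half warp of 16 threads with 16 banks, or 32 threads with 32 banks).
int strided_conflict_degree(int stride_words, int num_banks) {
    assert(stride_words > 0 && num_banks > 0);
    // Euclid's algorithm: gcd(stride_words, num_banks) threads share each used bank.
    int a = stride_words;
    int b = num_banks;
    while (b != 0) {
        int r = a % b;
        a = b;
        b = r;
    }
    return a;
}
```

One consequence of this formula is that any odd stride is conflict-free when the number of banks is a power of two, while a stride equal to the number of banks is the worst case.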
How a Shared Memory Request Works in a Warp
On compute capability 1.x devices, a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.
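The effect of the half-warp split can be sketched on the host in plain C++. The model below assumes 16 banks of 32-bit words and evaluates the two halves of a 32-thread warp as separate requests. All names (`degree_in_range`, `half_warp_degrees`, `strided_words`) are hypothetical helpers for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Worst conflict degree among threads [first, last) of a warp.
// word_index[t] is the 32-bit word that thread t accesses.
int degree_in_range(const std::vector<int>& word_index,
                    std::size_t first, std::size_t last, int num_banks) {
    assert(first <= last && last <= word_index.size() && num_banks > 0);
    std::vector<std::vector<int>> words(num_banks);  // distinct words seen per bank
    for (std::size_t t = first; t < last; ++t) {
        assert(word_index[t] >= 0);
        std::vector<int>& seen = words[word_index[t] % num_banks];
        if (std::find(seen.begin(), seen.end(), word_index[t]) == seen.end()) {
            seen.push_back(word_index[t]);
        }
    }
    std::size_t degree = 1;
    for (const auto& w : words) degree = std::max(degree, w.size());
    return static_cast<int>(degree);
}

struct HalfWarpDegrees {
    int first_half;   // threads 0..15
    int second_half;  // threads 16..31
};

// Treats the warp as two independent half-warp requests over 16 banks.
HalfWarpDegrees half_warp_degrees(const std::vector<int>& word_index) {
    assert(word_index.size() == 32);
    const int kBanks = 16;
    return {degree_in_range(word_index, 0, 16, kBanks),
            degree_in_range(word_index, 16, 32, kBanks)};
}

// Hypothetical helper: thread tid of a 32-thread warp accesses word tid * stride.
std::vector<int> strided_words(int stride) {
    std::vector<int> words;
    for (int tid = 0; tid < 32; ++tid) words.push_back(tid * stride);
    return words;
}
```

With consecutive floats (stride 1), threads 0 and 16 both land in bank 0, but because they belong to different half-warp requests neither half sees a conflict. Evaluating all 32 threads as one request over 16 banks would instead report a 2-way conflict.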
Shared memory enables cooperation between threads in a block. When multiple threads in a block use the same data from global memory, shared memory can be used to access the data from global memory only once.
Shared memory can also be used to avoid uncoalesced memory accesses (discussed below) by loading and storing data in a coalesced pattern from global memory and then reordering it in shared memory.
Aside from memory bank conflicts, there is no penalty for non-sequential or unaligned accesses by a half warp in shared memory. To understand this better, it helps to know what coalesced memory access is in CUDA, which is discussed below.
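The point about non-sequential and unaligned accesses can be checked with a small host-side C++ sketch. It assumes a half warp of 16 threads and 16 banks of 32-bit words, and tests whether a pattern touches at most one location per bank. The function names are illustrative only.

```cpp
#include <cassert>
#include <vector>

// True if the request touches at most one distinct 32-bit word in every bank.
bool one_location_per_bank(const std::vector<int>& word_index, int num_banks) {
    assert(num_banks > 0);
    std::vector<int> owner(num_banks, -1);  // word claimed in each bank, -1 = none yet
    for (int w : word_index) {
        assert(w >= 0);
        int bank = w % num_banks;
        if (owner[bank] != -1 && owner[bank] != w) return false;  // second word in same bank
        owner[bank] = w;
    }
    return true;
}

// Hypothetical helper: thread tid accesses word scale * tid + offset.
std::vector<int> affine_pattern(int scale, int offset, int num_threads) {
    std::vector<int> words;
    for (int tid = 0; tid < num_threads; ++tid) {
        words.push_back(scale * tid + offset);
    }
    return words;
}
```

In this model a reversed order (`affine_pattern(-1, 15, 16)`), a shifted start (`affine_pattern(1, 5, 16)`), and an odd stride (`affine_pattern(3, 0, 16)`) all pass the check, while a stride of 2 fails because words 0 and 16 both fall in bank 0.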
GTX 670 (compute capability 3.0) Specification
- Shared memory has 32 banks with two addressing modes that are described below.
- The addressing mode can be
  - queried using cudaDeviceGetSharedMemConfig()
  - set using cudaDeviceSetSharedMemConfig()
- Each bank has a bandwidth of 64 bits per clock cycle.
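As a rough illustration of how the addressing mode changes the address-to-bank mapping, the host-side C++ sketch below assumes the two modes correspond to 4-byte and 8-byte bank widths (the CUDA runtime names these `cudaSharedMemBankSizeFourByte` and `cudaSharedMemBankSizeEightByte`). The `BankWidth` enum and `bank_of` function are hypothetical and do not call the CUDA runtime. The sketch models only the mapping, not the full conflict rules of the hardware.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two addressing modes (value = bank width in bytes).
enum class BankWidth : std::uint32_t { FourByte = 4, EightByte = 8 };

// Bank index of a shared-memory byte address, assuming 32 banks and that
// successive words of the chosen width map to successive banks.
int bank_of(std::uint32_t byte_address, BankWidth width) {
    const std::uint32_t kNumBanks = 32;
    const std::uint32_t width_bytes = static_cast<std::uint32_t>(width);
    return static_cast<int>((byte_address / width_bytes) % kNumBanks);
}
```

For example, byte address 4 falls in bank 1 with a 4-byte width but still in bank 0 with an 8-byte width, so 32 consecutive doubles occupy 32 distinct banks only in the 8-byte mode of this model.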