What’s the Difference between Multicore and Manycore?
Fundamentally different design philosophy
Multicore
- Yoke of oxen
- Optimized for reducing execution latency of a few threads
- Sophisticated instruction controls, large caches
- Each core optimized for executing a single thread
Manycore
- Flock of chickens
- Assumes 1000-way concurrency readily available in applications
- More resources dedicated to compute
- Cores optimized for aggregate throughput, deemphasizing individual performance
When does using a GPU make sense?
- Applications with a lot of concurrency
- 1000-way, fine-grained concurrency
- Some memory intensive applications
- Aggregate memory bandwidth is higher on the GPU
- Advantage diminishes when task granularity becomes too large to fit in shared memory
What is the memory hierarchy inversion? And why is it there?
- Manycore's philosophy: maximize throughput
- Factors: task, data, synchronization
- Difficulty: the latency of getting data
- What the GPU tries to do: maximize concurrency so that the total number of tasks managed at once, and hence the total throughput of the processor, is maximized
What is the memory wall?
The increasing gap between the speed of processing data on the processor and the speed of fetching that data from memory
How to get around memory wall?
- Manycore Processors try to utilize application concurrency to hide memory latency
- Each Fermi core can maintain 48 warps of architectural context
- core = streaming multiprocessor
- Each warp manages a 32-wide SIMD vector worth of computation
- With ~20 registers for each thread:
- 4 (bytes/register) × 20 (registers) × 32 (SIMD lanes) × 48 (warps) → ~128 KB per core → × 16 (cores) = 2 MB of register files in total
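As a sanity check on the numbers above, a minimal host-side sketch (assuming the CUDA runtime API; the fields regsPerMultiprocessor, maxThreadsPerMultiProcessor, etc. come from cudaDeviceProp) that queries the per-core register file and warp context of whatever device is installed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the actual per-SM register file and warp context, to compare against
// the Fermi back-of-the-envelope numbers above (48 warps, ~128 KB per core).
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int warpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    size_t regBytesPerSM = (size_t)prop.regsPerMultiprocessor * 4;  // 32-bit registers

    printf("SMs (cores):          %d\n", prop.multiProcessorCount);
    printf("Warp contexts per SM: %d\n", warpsPerSM);
    printf("Register file per SM: %zu KB\n", regBytesPerSM / 1024);
    printf("Register file total:  %zu KB\n",
           regBytesPerSM * prop.multiProcessorCount / 1024);
    return 0;
}
```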
Why warps?
- Software abstraction to hide an extra level of architectural complexity
- A 128KB register file is a large memory
- It takes more than one clock cycle to retrieve information
- Hardware provides 16-wide physical SIMD units and half-pumped register files
- These provide half of the operands per clock cycle, then the other half in the following cycle
- To simplify the programming model, assume we are working with a 32-wide SIMD unit, where each 32-wide instruction has a bit more latency
How do we deal with GPUs of different sizes?
Thread block abstraction
- CUDA provides an abstraction for concurrency to be fully exposed
What are the implications of the thread block abstraction?
- Computation is grouped into blocks of independent, concurrently executable work
- The HW/Runtime makes the decision to selectively sequentialize the execution as necessary
- One can often achieve higher performance with fewer threads per block but multiple blocks running concurrently on the same core
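A minimal sketch of the abstraction in practice (vecAdd and the launch parameters are illustrative, not from the notes): the grid is sized from the problem, each block is an independent chunk of work, and the hardware/runtime decides how many blocks run concurrently per core:

```cuda
#include <cuda_runtime.h>

// Each block is an independent, concurrently executable chunk of work;
// the runtime schedules as many blocks per core as resources allow and
// sequentializes the rest.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

void launchVecAdd(const float* a, const float* b, float* c, int n) {
    // A moderate block size (e.g., 256 threads) lets several blocks be
    // resident on one core at the same time.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
}
```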
How do threads communicate with each other?
- Manycore processors provide memory local to each core
- Computations in SIMD-lanes in the same core can communicate via memory read/write
- Two types of memory:
- Programmer-managed scratch pad memory
- HW-managed L1 cache
- For the NVIDIA Fermi architecture, you get 64 KB per core with two configurations
- 48 KB scratch pad (Shared Memory), 16 KB L1 cache
- 16 KB scratch pad (Shared Memory), 48 KB L1 cache
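A minimal sketch of both ideas (the reverseTile kernel is a made-up illustration): threads of one block communicate through the scratch pad, and cudaFuncSetCacheConfig requests the 48 KB shared / 16 KB L1 split for that kernel:

```cuda
#include <cuda_runtime.h>

// Threads of one block exchange data through the programmer-managed
// scratch pad (__shared__); reversing a 256-element tile is a toy example.
__global__ void reverseTile(float* data) {
    __shared__ float tile[256];               // lives in the per-core scratch pad
    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * blockDim.x + t];
    __syncthreads();                          // make all writes visible to the block
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
}

void configure() {
    // On Fermi-class parts the 64 KB can be split per kernel:
    // prefer 48 KB shared / 16 KB L1 for scratch-pad-heavy kernels.
    cudaFuncSetCacheConfig(reverseTile, cudaFuncCachePreferShared);
}
```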
What is the caveat in synchronizing threads in a thread block with __syncthreads()?
- __syncthreads() waits until all threads in the thread block have reached this point and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block
- It is used to coordinate communication between the threads of the same block
- Keep __syncthreads() outside any conditional that not all threads of the block execute!!
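A sketch of the caveat (broadcast is a made-up kernel): the commented-out version places the barrier inside a branch that only thread 0 takes, so the other threads never reach it; the correct version keeps only the work conditional:

```cuda
__global__ void broadcast(const float* in, float* out) {
    __shared__ float value;

    // WRONG: the barrier sits inside a condition only thread 0 takes, so the
    // remaining threads never reach it -> the block can hang (undefined behavior).
    // if (threadIdx.x == 0) {
    //     value = in[blockIdx.x];
    //     __syncthreads();
    // }

    // RIGHT: only the work is conditional; every thread reaches the barrier.
    if (threadIdx.x == 0) value = in[blockIdx.x];
    __syncthreads();                       // all threads of the block get here
    out[blockIdx.x * blockDim.x + threadIdx.x] = value;
}
```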
Misc: SIMT
- Threads are the computation performed in each SIMD lane in a core
CUDA provides a SIMT programming abstraction to assist users
SIMT: Single Instruction Multiple Threads
- A single instruction controls multiple processing elements
- Different from SIMD
- SIMD exposes the SIMD width to the programmer
- SIMT abstracts the # of threads in a thread block as a user-specified parameter
- SIMT enables programmers to write thread-level parallel code for independent, scalar threads
- Data-parallel code for coordinated threads
- For functional correctness, programmers can ignore SIMT behavior
- For performance, programmers can tune applications with SIMT in mind
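A small illustration of writing scalar, thread-level code under SIMT (clampNegatives is illustrative): the per-thread branch is functionally correct regardless of SIMD width, but threads of one warp that diverge execute both paths serially, which is the performance aspect to keep in mind:

```cuda
// Thread-level (scalar) code: each thread handles its own element.
// Correctness does not depend on the SIMD width; only performance does,
// because divergent threads within a warp serialize the branch.
__global__ void clampNegatives(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] < 0.0f)        // per-thread decision
        x[i] = 0.0f;
}
```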
Misc: Hardware and Software Mental Models
Misc: why not just put all my work in one thread block?
- A manycore processor has more than one core
- If one uses only one core, one is not fully utilizing available HW
- Hardware must maintain the context of all threads in a thread block
- There is a limited amount of resources on-chip
- In Fermi
- 48 warps of context are maintained per core
- Each thread block can have up to 1024 threads
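One way to check how many blocks of a given size actually fit per core, given a kernel's register and shared-memory footprint, is the occupancy API (cudaOccupancyMaxActiveBlocksPerMultiprocessor; note it was added in CUDA toolkits later than the Fermi era, and the work kernel below is a placeholder):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void work(float* x) { x[threadIdx.x] *= 2.0f; }  // placeholder kernel

int main() {
    // How many 256-thread blocks can be resident on one core at once?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, work,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);
    printf("Resident blocks per core: %d\n", blocksPerSM);
    return 0;
}
```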
Misc: Compute capability
- The compute capability of a device is defined by a major revision number and a minor revision number.
- Devices with the same major revision number are of the same core architecture.
- The major revision number is 3 for devices based on the Kepler architecture, 2 for devices based on the Fermi architecture, and 1 for devices based on the Tesla architecture.
- The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.
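In device code the compute capability is also visible at compile time as __CUDA_ARCH__ (200 for 2.0, 300 for 3.0, and so on), which lets you guard architecture-specific features; a small sketch (flushToHostVisible is a made-up helper) guarding a 2.x-only intrinsic:

```cuda
// __CUDA_ARCH__ is defined only when compiling device code: 200 for compute
// capability 2.0 (Fermi), 300 for 3.0 (Kepler), etc.
__device__ void flushToHostVisible() {
#if __CUDA_ARCH__ >= 200
    __threadfence_system();   // only available on compute capability 2.x and up
#else
    __threadfence();          // fall back to the device-wide fence on 1.x parts
#endif
}
```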
Misc: Fence Synchronization
- __threadfence_block() waits until all global and shared memory accesses made by the calling thread prior to __threadfence_block() are visible to all threads in the thread block
- __threadfence() waits until all global and shared memory accesses made by the calling thread prior to __threadfence() are visible to:
- All threads in the thread block for shared memory accesses
- All threads in the device for global memory accesses
- __threadfence_system() waits until all global and shared memory accesses made by the calling thread prior to __threadfence_system() are visible to:
- All threads in the thread block for shared memory accesses
- All threads in the device for global memory accesses
- Host threads for page-locked host memory accesses
- (only supported by devices of compute capability 2.x)
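A minimal sketch of why the device-wide fence matters (publish, payload, and ready are made-up names): the fence orders the data write before the flag write, so a thread in another block that later sees the flag also sees the data:

```cuda
// Publish a result to other blocks: write the data, fence, then set the flag.
// A thread in another block that observes ready == 1 is guaranteed to also
// observe the payload, because __threadfence() orders the two writes
// device-wide. Without the fence, the flag could become visible first.
__device__ volatile float payload;
__device__ volatile int   ready = 0;

__device__ void publish(float v) {
    payload = v;
    __threadfence();   // device-wide ordering: payload before ready
    ready = 1;
}
```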
Misc: Atomics
An atomic function performs a read-modify-write atomic operation on one word
- 32-bit or 64-bit word
- Residing in global or shared memory
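A typical use, sketched as a per-block histogram (histogram, local, and bins are illustrative names): atomics on a shared-memory copy within the block, then atomics on the global copy to merge the blocks; both are read-modify-write operations on 32-bit words as described above:

```cuda
// 256-bin histogram: accumulate per block in the scratch pad, then merge.
__global__ void histogram(const unsigned char* in, int n, unsigned int* bins) {
    __shared__ unsigned int local[256];
    for (int b = threadIdx.x; b < 256; b += blockDim.x) local[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[in[i]], 1u);          // atomic on shared memory
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);         // atomic on global memory
}
```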