GPUs are known as parallel processors because they perform many tasks simultaneously. Work is divided into smaller sub-tasks that are executed at the same time by multiple processing units, and the partial results are then combined to produce the final output. These units of execution, from threads, warps, and thread blocks up to cores and streaming multiprocessors, share resources such as memory; this sharing enhances collaboration among them and improves the overall efficiency of the GPU.
One such unit, the warp, is a cornerstone of GPU parallel processing. By grouping threads into a single execution unit, warps simplify thread management, let threads share data and resources, and mask memory latency through effective scheduling.
It may be helpful to read this “CUDA refresher” before proceeding.
In this article, we will outline how warps are useful for optimizing the performance of GPU-accelerated applications. By building an intuition around warps, developers can achieve significant gains in computational speed and efficiency.
Thread blocks are partitioned into warps of 32 threads each; all threads in a warp run on the same Streaming Multiprocessor. Figure from an NVIDIA presentation on GPGPU and Accelerator Trends.
When a Streaming Multiprocessor (SM) is assigned thread blocks for execution, it subdivides their threads into warps. Modern GPU architectures use a warp size of 32 threads. The number of warps in a thread block depends on the thread block size configured by the CUDA programmer. For example, with a thread block size of 96 threads and a warp size of 32 threads, each thread block contains 96 threads / 32 threads per warp = 3 warps.
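As a quick, hedged illustration (the kernel name, block size, and array size are mine, not from the article), the same arithmetic can be checked on the host before launching a kernel:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Minimal kernel: each thread writes its global index (illustrative only).
__global__ void fill_indices(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main() {
    const int threadsPerBlock = 96;  // block size chosen by the programmer
    const int kWarpSize = 32;        // warp size on current NVIDIA GPUs
    // Round up: a block whose size is not a multiple of 32 still occupies a
    // full warp for its leftover threads.
    const int warpsPerBlock = (threadsPerBlock + kWarpSize - 1) / kWarpSize;  // 96 / 32 = 3
    printf("Warps per thread block: %d\n", warpsPerBlock);

    const int n = 1 << 20;
    int *d_out;
    cudaMalloc(&d_out, n * sizeof(int));
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    fill_indices<<<blocks, threadsPerBlock>>>(d_out, n);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```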
Figure: GPU Compute and Memory Architecture
In this figure, three thread blocks are assigned to the SM, and each thread block is composed of 3 warps. A warp contains 32 consecutive threads. Note how the threads are indexed, starting at zero and continuing across the warps in the thread block: the first warp comprises threads 0-31, the next warp threads 32-63, and so forth. The sketch below shows how a thread can derive its warp and lane indices from this layout.
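As a minimal sketch (the kernel name is mine, not from the article), each thread can compute which warp it belongs to and its lane, its position within that warp, from its thread index:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread derives its warp index and lane index within the block.
__global__ void print_warp_layout() {
    int tid = threadIdx.x;            // linear index within a 1-D block
    int warp_id = tid / warpSize;     // warpSize is a built-in device constant (32)
    int lane_id = tid % warpSize;     // position inside the warp, 0-31

    // Print once per warp: with 96 threads per block, warp_id is 0, 1, or 2.
    if (lane_id == 0)
        printf("block %d, warp %d begins at thread %d\n", blockIdx.x, warp_id, tid);
}

int main() {
    print_warp_layout<<<1, 96>>>();   // one block of 96 threads = 3 warps
    cudaDeviceSynchronize();
    return 0;
}
```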
Now that we’ve defined warps, let’s take a step back and look at Flynn’s Taxonomy and how this classification scheme applies to GPUs and warp-level thread management. Flynn’s Taxonomy classifies computer architectures based on their instruction and data streams, dividing them into four categories: SISD, SIMD, MISD, and MIMD. GPUs typically fall under the SIMD (Single Instruction Multiple Data) category, as they execute the same operation across multiple data points simultaneously. However, NVIDIA introduced SIMT (Single Instruction Multiple Thread) to better describe the thread-level parallelism of its GPUs. In a SIMT architecture, multiple threads execute the same instructions on different data, with the CUDA compiler and GPU working together to keep the threads of a warp synchronized. This synchronization maximizes efficiency by ensuring that threads execute identical instructions in unison whenever possible. While both SIMD and SIMT exploit data-level parallelism, they differ in their approaches: SIMD excels at uniform data processing, whereas SIMT offers more flexibility through dynamic thread management and conditional execution.
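A minimal sketch of the SIMT model (the kernel and launch parameters are illustrative, not from the article): every thread below runs the same instructions, but each on its own element of the input, and each thread evaluates the conditional independently.

```cpp
// Each thread applies the same instructions to its own data element (SIMT).
// The per-thread conditional is the kind of flexibility SIMT adds over
// fixed-width SIMD lanes.
__global__ void relu(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;
}

// Example launch (sizes are illustrative):
// relu<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```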
In the context of warps, latency is the number of clock cycles it takes for a warp to finish executing an instruction and become available to process the next one.
Figure from Caltech’s CS 179: W denotes a warp, and T denotes a thread. GPUs leverage warp scheduling to hide latency, whereas CPUs execute sequentially and rely on context switching.
Maximum utilization is achieved when all warp schedulers have instructions to issue during every clock cycle. The number of resident warps, those being executed on the Streaming Multiprocessor (SM) at any given moment, directly impacts utilization: there must be warps available for the warp schedulers to issue instructions to. Having multiple resident warps allows the SM to switch between them, effectively hiding latency and maximizing throughput.
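As a hedged sketch (the kernel and block size are illustrative), the CUDA occupancy API can report how many blocks of a given kernel fit on one SM, from which the number of resident warps follows:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel whose register/shared-memory usage determines occupancy.
__global__ void saxpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int blockSize = 256;

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of this kernel can be resident on a single SM?
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, saxpy, blockSize, 0);

    int warpsPerBlock = (blockSize + prop.warpSize - 1) / prop.warpSize;
    int residentWarps = blocksPerSM * warpsPerBlock;
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    printf("Resident warps per SM: %d of a possible %d\n", residentWarps, maxWarpsPerSM);
    return 0;
}
```

More resident warps give the schedulers more candidates to issue from while other warps wait on memory, which is exactly the latency hiding described above.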
A program counter increments each instruction cycle to fetch the next instruction in the program sequence from memory, guiding the flow of the program’s execution. While threads in a warp share a common starting program address, they maintain separate program counters, allowing for autonomous execution and branching of individual threads.
Pre-Volta GPUs had a single program counter per 32-thread warp. With the introduction of the Volta microarchitecture, each thread has its own program counter. As Stephen Jones put it in his GTC ’17 talk: “So now all these threads are wholly independent - they still work better if you gang them together… but you’re no longer dead in the water if you split them up.”
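A small, hedged sketch of what this per-thread independence looks like in code (the kernel and data layout are illustrative): threads in the same warp take different branches, and __syncwarp() explicitly reconverges them before any code that assumes the warp is back in lockstep.

```cpp
// Threads within one warp take different paths; with per-thread program
// counters (Volta and later), each path can make forward progress.
__global__ void diverge_then_reconverge(float *data) {
    int lane = threadIdx.x % warpSize;

    if (lane < 16)
        data[threadIdx.x] *= 2.0f;   // first half of the warp
    else
        data[threadIdx.x] += 1.0f;   // second half of the warp

    // Explicitly reconverge before code that expects all 32 threads
    // of the warp to be executing together again.
    __syncwarp();
}

// Example launch (illustrative): diverge_then_reconverge<<<1, 32>>>(d_data);
```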
Separate program counters allow for branching: in an if-then-else structure, instructions are processed only by the threads that are active on that path. Since optimal performance is attained when all 32 threads of a warp converge on the same instruction, programmers are advised to write code that minimizes the cases where threads within a warp take divergent paths. The sketch below contrasts a condition that splits every warp with one that is uniform within each warp.
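As a hedged example (kernel names and the specific conditions are mine), both kernels below do comparable work, but the first branches on a per-thread condition that splits every warp, while the second branches on a condition that is uniform within each warp, so no warp diverges:

```cpp
// Divergent: even and odd threads within the same warp take different paths,
// so each warp executes both branches one after the other.
__global__ void branch_per_thread(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// Warp-uniform: the condition depends only on the warp index, so all 32
// threads of any given warp take the same path and never diverge.
__global__ void branch_per_warp(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warp_id = i / warpSize;
    if (warp_id % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}
```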