Every texture sample in your shader triggers a memory request. Whether that request takes 4 cycles or 400+ cycles depends on where the data lives in the GPU's memory hierarchy. I've seen shaders go from 30fps to 60fps just by fixing cache access patterns—no algorithmic changes, just better memory locality.
TL;DR — Quick Reference
┌──────────────────┬─────────────┬──────────────────┬───────────────┐
│ Memory Level     │ Latency     │ Size             │ Bandwidth     │
├──────────────────┼─────────────┼──────────────────┼───────────────┤
│ Registers        │ ~1 cycle    │ 256KB per SM     │ ~8 TB/s       │
│ L1 Cache         │ ~30 cycles  │ 128-192KB per SM │ ~15-20 TB/s   │
│ L2 Cache         │ ~150 cycles │ 6-96MB (shared)  │ ~3 TB/s       │
│ VRAM (GDDR6)     │ ~500 cycles │ 8-24GB           │ ~1 TB/s       │
└──────────────────┴─────────────┴──────────────────┴───────────────┘
L1 hit = ~30 cycles, VRAM miss = ~500 cycles. That's ~15x.
The Memory Hierarchy
GPU memory is a hierarchy. Registers are tiny but essentially free to access. L1 and L2 caches sit in between, staging frequently-used data. VRAM holds everything—textures, buffers, render targets—but hitting it cold costs you hundreds of cycles.
- Registers: ~1 cycle access, 256KB total per SM (shared across thousands of threads)
- L1 Cache: ~28-35 cycles, 128-192KB per SM, dedicated to each streaming multiprocessor
- L2 Cache: ~150-200 cycles, 6-96MB shared across all SMs (up to 128MB with AMD Infinity Cache), unified memory controller
- VRAM (GDDR6/HBM): ~400-800 cycles, 8-24GB, massive bandwidth but high latency
Cache Lines and Spatial Locality
When you sample a single texel, the GPU doesn't fetch just that pixel—it grabs an entire cache line. That's 128 bytes for L1, 32 bytes for texture/L2 transactions. For RGBA8, an L1 cache line holds 32 texels. The hardware bets that if you're reading one pixel, you'll probably want its neighbors too. Usually it's right.
// GOOD: Coherent access - neighboring fragments sample neighboring texels
vec4 color = texture(uTexture, vTexCoord);

// BAD: Random access - cache thrashing, every sample misses
vec2 randomUV = hash2D(gl_FragCoord.xy);             // Random UV per pixel
vec4 color = texture(uTexture, randomUV);            // Cache line wasted

// UGLY: Dependent reads - latency stacking
vec2 offset = texture(uOffsetMap, vTexCoord).xy;     // First fetch
vec4 color = texture(uTexture, vTexCoord + offset);  // Must wait!
The Warp and Cache Coalescing
GPUs run threads in groups: warps (32 threads on NVIDIA) or wavefronts (32-64 on AMD). When a warp hits memory, the hardware coalesces requests—if all 32 threads need addresses in the same 128-byte chunk, that's one fetch instead of 32.
Memory Access Pattern Hierarchy
Not all access patterns are equal. Here's a rough ranking from best to worst performance:
- Broadcast: All threads read the same address. Hardware broadcasts the single value to all threads—extremely efficient.
- Coalesced: Each thread reads a unique cell within a contiguous chunk (~128 bytes). One memory transaction serves the entire warp.
- Partially coalesced: Threads mostly hit the same chunk with some outliers. Slightly more transactions, still reasonable.
- Strided: Threads access memory at regular intervals (e.g., every 4th element). May require multiple transactions depending on stride.
- Scattered: Completely random addresses. Worst case—potentially 32 separate memory transactions for one warp.
- Bank conflicts: Multiple threads access the same memory bank but different addresses within it. Serializes those accesses—up to 32x slower in the worst case.
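To make the top and bottom of that ranking concrete, here is a minimal compute-shader sketch contrasting a coalesced read with a strided one. The buffer names, the stride uniform, and the workgroup size are illustrative assumptions, not part of any real pipeline:

// Sketch: coalesced vs. strided buffer reads (names are illustrative).
#version 450
layout(local_size_x = 32) in;  // one workgroup = one warp-sized group of threads

layout(std430, binding = 0) readonly buffer Src { float data[]; };
layout(std430, binding = 1) writeonly buffer Dst { float results[]; };

uniform uint uStride;  // e.g. 1 (coalesced) vs. 16 (strided); buffer must be sized for the strided reads

void main() {
    uint i = gl_GlobalInvocationID.x;

    // Coalesced: thread 0 reads data[0], thread 1 reads data[1], ...
    // All 32 floats fall in one ~128-byte chunk -> one transaction per warp.
    float coalesced = data[i];

    // Strided: thread 0 reads data[0], thread 1 reads data[uStride], ...
    // With a large stride the addresses span many cache lines -> many transactions per warp.
    float strided = data[i * uStride];

    results[i] = coalesced + strided;
}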
Fragment shaders get coalescing almost for free. The rasterizer processes 2x2 quads, groups them into tiles, and those tiles map to consecutive threads. Neighboring pixels sample neighboring texels—spatial locality happens automatically unless you go out of your way to break it.
What Causes Cache Misses?
- Random UVs: Procedural distortion, noise-based sampling, stippling patterns
- Dependent texture reads: UV offset stored in another texture (parallax, flow maps)
- Large UV jumps: Sampling distant mip levels, cubemap edge cases
- Texture thrashing: Too many unique textures accessed in same shader pass
- Working set overflow: Total active texture data exceeds L2 capacity
Mipmapping: The Cache's Best Friend
Mipmaps are secretly a cache optimization. When a surface is far away, sampling the full-res texture would pull in texels scattered across memory. The mipmap is smaller, so nearby samples actually land in the same cache lines. This matters more than the visual quality argument—disable mipmaps and watch your frame rate tank.
// Auto mip selection (usually correct)
vec4 color = texture(uTexture, uv);

// Force lower-res mip for cache efficiency in blur passes
// Saves bandwidth when you're averaging anyway
vec4 color = textureLod(uTexture, uv, 2.0);  // Mip level 2

// Bias toward higher mips (cache-friendly, softer)
vec4 color = texture(uTexture, uv, 1.0);     // +1 mip bias

// textureGrad: Explicit derivatives for anisotropic filtering
// Controls footprint precisely for cache-aware sampling
vec4 color = textureGrad(uTexture, uv, dFdx(uv) * 0.5, dFdy(uv) * 0.5);
Measuring Cache Efficiency
GPU profilers like NVIDIA Nsight, AMD Radeon GPU Profiler, and RenderDoc expose cache hit rates and memory throughput. Key metrics to watch:
- L1 Hit Rate: Should be >80% for well-optimized shaders. Below 60% indicates access pattern problems.
- L2 Hit Rate: Measures working set fit. Below 70% suggests texture thrashing or too many unique textures.
- Texture Memory Throughput: Compare to theoretical max. Low throughput with high latency = cache misses.
- Memory-bound vs Compute-bound: If memory is the bottleneck, cache optimization has highest impact.
Interactive Demo: Cache Access Patterns
This demo visualizes how different sampling patterns affect cache efficiency. Coherent access (left) keeps samples within cache lines, while random access (right) causes cache thrashing. Watch how the 'heat' of memory access spreads differently.
Texture Fetch Latency Hiding
The GPU's trick for surviving 500-cycle memory stalls: while one warp waits for data, the scheduler runs another warp that's ready. More warps in flight = more chances to do useful work while waiting. This is why occupancy matters—low occupancy means the GPU sits idle during memory fetches.
What actually matters: avoid dependent texture reads (where the UV comes from another texture), keep your working set small enough to fit in cache, and maintain high occupancy so the scheduler has warps to switch between.
Texture Memory Layout
Textures aren't stored row-by-row in memory. GPUs use Morton order (Z-order curves) or similar swizzle patterns. The idea: keep 2D-neighboring texels close in the 1D memory address space, so a cache line actually contains a useful block of pixels instead of a random horizontal stripe.
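As a rough sketch of the idea (not the exact swizzle pattern any particular GPU uses), here is how a 2D coordinate can be interleaved into a Morton index:

// Sketch of Morton (Z-order) addressing -- illustrative only.
// Spreads the low 16 bits of v so there is a zero bit between each original bit.
uint part1By1(uint v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Interleave x and y bits: neighboring (x, y) texels get nearby 1D addresses,
// so a cache line covers a small 2D block instead of a horizontal stripe.
uint mortonIndex(uint x, uint y) {
    return part1By1(x) | (part1By1(y) << 1);
}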
Practical Optimization Strategies
1. Texture Atlasing
Throw related textures into an atlas and they'll share cache space. Draw a bunch of sprites from the same atlas and the cache stays warm. Downside: UV math gets messier, and you'll fight filtering artifacts at tile edges if you're not careful with padding. A minimal sketch of the remapping follows.
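The uniform names below are illustrative; the half-texel inset is one common way to keep bilinear filtering from bleeding into the neighboring tile:

// Sketch of atlas UV remapping (uTileOffset/uTileScale are assumed uniforms
// describing where this sprite's tile lives, in [0,1] atlas UV space).
uniform sampler2D uAtlas;
uniform vec2 uTileOffset;  // top-left corner of the tile
uniform vec2 uTileScale;   // tile size

vec4 sampleAtlas(vec2 localUV) {
    // Inset by half a texel so bilinear filtering never reads the adjacent tile.
    vec2 halfTexel = 0.5 / vec2(textureSize(uAtlas, 0));
    vec2 uv = uTileOffset + halfTexel + localUV * (uTileScale - 2.0 * halfTexel);
    return texture(uAtlas, uv);
}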
2. Channel Packing
// BAD: 4 separate texture fetches (4x memory traffic)
vec4 albedo = texture(uAlbedo, uv);       // RGBA
float rough = texture(uRoughness, uv).r;  // R (wastes GBA)
float metal = texture(uMetallic, uv).r;   // R (wastes GBA)
float ao    = texture(uAO, uv).r;         // R (wastes GBA)

// GOOD: 2 fetches with packed data
vec4 albedo = texture(uAlbedo, uv);  // RGB = color, A = unused
vec4 orm    = texture(uORM, uv);     // R = AO, G = Roughness, B = Metallic
float ao    = orm.r;
float rough = orm.g;
float metal = orm.b;
3. Reduce Dependent Reads
Dependent reads are latency killers. If your UV comes from a texture lookup (parallax mapping, flow maps), the GPU can't even start the second fetch until the first one finishes. Two 500-cycle stalls back-to-back. Compute offsets mathematically when you can, or accept the hit.
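As a sketch of "compute the offset mathematically," here is a flow-map-style distortion replaced by a procedural offset. The texture names, uTime, and the wave constants are all illustrative assumptions:

// Dependent version: the second fetch cannot start until the first returns.
vec2 flow = texture(uFlowMap, vTexCoord).xy * 2.0 - 1.0;       // fetch #1
vec4 dependent = texture(uTexture, vTexCoord + flow * uTime);  // fetch #2, stalls on #1

// Independent version: the offset comes from math on values already in registers,
// so the single texture fetch can be issued immediately.
vec2 procedural = vec2(sin(vTexCoord.y * 8.0 + uTime),
                       cos(vTexCoord.x * 8.0 + uTime)) * 0.01;
vec4 independent = texture(uTexture, vTexCoord + procedural);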
4. Bindless Textures
Bindless textures (GL_ARB_bindless_texture, Vulkan descriptor indexing) skip the binding overhead and keep all textures resident. Sometimes this helps cache reuse across the frame. But there's a catch: if threads in a warp index into different textures, you pay for the divergence. Measure it—bindless isn't automatically faster.
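For the Vulkan flavor, a minimal sketch of descriptor indexing looks roughly like this (array and varying names are illustrative; it assumes GL_EXT_nonuniform_qualifier and runtime descriptor arrays are available):

// Sketch: Vulkan-style descriptor indexing in a fragment shader.
#version 450
#extension GL_EXT_nonuniform_qualifier : require

layout(set = 0, binding = 0) uniform sampler2D uTextures[];

layout(location = 0) in vec2 vTexCoord;
layout(location = 1) flat in uint vMaterialIndex;
layout(location = 0) out vec4 outColor;

void main() {
    // nonuniformEXT tells the compiler the index may differ across the warp;
    // when it does differ, the hardware serializes per unique texture.
    outColor = texture(uTextures[nonuniformEXT(vMaterialIndex)], vTexCoord);
}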
Architecture Differences
Cache behavior isn't portable. NVIDIA's L1 shares silicon with shared memory (you configure the split). AMD threw in Infinity Cache as an extra layer. Mobile GPUs have smaller caches but lean on tile-based rendering to compensate. What flies on desktop might crawl on mobile, and vice versa.
Quick Wins: Copy-Paste Patterns
Stuff you can copy into your shaders right now:
// Pack Occlusion, Roughness, Metallic into one texture

// BEFORE: 3 texture fetches (wasteful)
float ao    = texture(uAO, uv).r;     // Only uses R channel
float rough = texture(uRough, uv).r;  // Only uses R channel
float metal = texture(uMetal, uv).r;  // Only uses R channel

// AFTER: 1 texture fetch instead of 3
vec3 orm = texture(uORM, uv).rgb;
float ao    = orm.r;
float rough = orm.g;
float metal = orm.b;
// When doing blur/downsample, use higher mip levels
// This saves bandwidth since you're averaging anyway

// Standard blur (fetches full-res texels)
vec4 blur = vec4(0.0);
for (int i = 0; i < 9; i++) {
    blur += texture(uTexture, uv + offsets[i]) * weights[i];
}

// Cache-friendly blur (uses lower mip = fewer cache lines)
vec4 blur = vec4(0.0);
float mipBias = 1.0;  // Use mip level 1 (half res)
for (int i = 0; i < 9; i++) {
    blur += texture(uTexture, uv + offsets[i], mipBias) * weights[i];
}
Key Takeaways
- Cache hits are ~15x faster than VRAM fetches (30 vs 500 cycles). Coherent access patterns matter.
- Mipmaps are a cache optimization, not just a visual quality thing.
- Pack related data into fewer textures. One RGBA fetch beats four R fetches.
- Avoid dependent texture reads—each one serializes memory access.
- Don't manually "prefetch" textures. The compiler and hardware scheduler handle latency hiding.
- Profile with Nsight/RGP. Cache hit rates reveal problems that frame timing won't.
Sources & Further Reading
- NVIDIA Ada GPU Architecture Tuning Guide — Official L1/L2 cache sizes, shared memory configuration
- NVIDIA Ampere GPU Architecture Tuning Guide — A100 cache hierarchy details
- Chips and Cheese: Measuring GPU Memory Latency — Empirical latency measurements for NVIDIA and AMD
- Chips and Cheese: Microbenchmarking NVIDIA's RTX 4090 — Ada Lovelace cache architecture analysis
- NVIDIA Developer Blog: Using Shared Memory in CUDA — Bank conflicts and shared memory optimization
- Fabian Giesen: Texture Tiling and Swizzling — Morton order and GPU texture memory layout
- RasterGrid: Understanding GPU Caches — Cache line sizes and memory hierarchy overview