Every texture sample in your shader triggers a memory request. Whether that request takes 4 cycles or 400+ cycles depends on where the data lives in the GPU's memory hierarchy. I've seen shaders go from 30fps to 60fps just by fixing cache access patterns—no algorithmic changes, just better memory locality.
TL;DR — Quick Reference
┌──────────────────┬─────────────┬──────────────────┬───────────────┐
│ Memory Level     │ Latency     │ Size             │ Bandwidth     │
├──────────────────┼─────────────┼──────────────────┼───────────────┤
│ Registers        │ ~1 cycle    │ 256KB per SM     │ ~8 TB/s       │
│ L1 Cache         │ ~30 cycles  │ 128-192KB per SM │ ~15-20 TB/s   │
│ L2 Cache         │ ~150 cycles │ 6-96MB (shared)  │ ~3 TB/s       │
│ VRAM (GDDR6)     │ ~500 cycles │ 8-24GB           │ ~1 TB/s       │
└──────────────────┴─────────────┴──────────────────┴───────────────┘
L1 hit = ~30 cycles, VRAM miss = ~500 cycles. That's ~15x.
The Memory Hierarchy
GPU memory is a hierarchy. Registers are tiny but essentially free to access. L1 and L2 caches sit in between, staging frequently-used data. VRAM holds everything—textures, buffers, render targets—but hitting it cold costs you hundreds of cycles.
- Registers: ~1 cycle access, 256KB total per SM (shared across thousands of threads)
- L1 Cache: ~28-35 cycles, 128-192KB per SM, dedicated to each streaming multiprocessor
- L2 Cache: ~150-200 cycles, 6-96MB shared across all SMs (up to 128MB with AMD Infinity Cache), unified memory controller
- VRAM (GDDR6/HBM): ~400-800 cycles, 8-24GB, massive bandwidth but high latency
Cache Lines and Spatial Locality
When you sample a single texel, the GPU doesn't fetch just that pixel—it grabs an entire cache line. That's 128 bytes for L1, 32 bytes for texture/L2 transactions. For RGBA8, an L1 cache line holds 32 texels. The hardware bets that if you're reading one pixel, you'll probably want its neighbors too. Usually it's right.
// GOOD: Coherent access - neighboring fragments sample neighboring texels
vec4 color = texture(uTexture, vTexCoord);

// BAD: Random access - cache thrashing, every sample misses
vec2 randomUV = hash2D(gl_FragCoord.xy);             // Random UV per pixel
vec4 color = texture(uTexture, randomUV);            // Cache line wasted

// UGLY: Dependent reads - latency stacking
vec2 offset = texture(uOffsetMap, vTexCoord).xy;     // First fetch
vec4 color = texture(uTexture, vTexCoord + offset);  // Must wait!
The Warp and Cache Coalescing
GPUs run threads in groups: warps (32 threads on NVIDIA) or wavefronts (32-64 on AMD). When a warp hits memory, the hardware coalesces requests—if all 32 threads need addresses in the same 128-byte chunk, that's one fetch instead of 32.
Memory Access Pattern Hierarchy
Not all access patterns are equal. Here's a rough ranking from best to worst performance:
- Broadcast: All threads read the same address. Hardware broadcasts the single value to all threads—extremely efficient.
- Coalesced: Each thread reads a unique cell within a contiguous chunk (~128 bytes). One memory transaction serves the entire warp.
- Partially coalesced: Threads mostly hit the same chunk with some outliers. Slightly more transactions, still reasonable.
- Strided: Threads access memory at regular intervals (e.g., every 4th element). May require multiple transactions depending on stride.
- Scattered: Completely random addresses. Worst case—potentially 32 separate memory transactions for one warp.
- Bank conflicts: Multiple threads access the same memory bank but different addresses within it. Serializes those accesses—up to 32x slower in the worst case.
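To make the top and bottom of that ranking concrete, here is a minimal compute-shader sketch contrasting a coalesced read with a strided one. The buffer names, the stride uniform, and the workgroup size are illustrative assumptions, not part of any real pipeline:

// Sketch: coalesced vs. strided buffer reads (names are illustrative).
#version 450
layout(local_size_x = 32) in;  // one workgroup = one warp-sized group of threads

layout(std430, binding = 0) readonly buffer Src { float data[]; };
layout(std430, binding = 1) writeonly buffer Dst { float results[]; };

uniform uint uStride;  // e.g. 1 (coalesced) vs. 16 (strided); buffer must be sized for the strided reads

void main() {
    uint i = gl_GlobalInvocationID.x;

    // Coalesced: thread 0 reads data[0], thread 1 reads data[1], ...
    // All 32 floats fall in one ~128-byte chunk -> one transaction per warp.
    float coalesced = data[i];

    // Strided: thread 0 reads data[0], thread 1 reads data[uStride], ...
    // With a large stride the addresses span many cache lines -> many transactions per warp.
    float strided = data[i * uStride];

    results[i] = coalesced + strided;
}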
Fragment shaders get coalescing almost for free. The rasterizer processes 2x2 quads, groups them into tiles, and those tiles map to consecutive threads. Neighboring pixels sample neighboring texels—spatial locality happens automatically unless you go out of your way to break it.
What Causes Cache Misses?
- Random UVs: Procedural distortion, noise-based sampling, stippling patterns
- Dependent texture reads: UV offset stored in another texture (parallax, flow maps)
- Large UV jumps: Sampling distant mip levels, cubemap edge cases
- Texture thrashing: Too many unique textures accessed in same shader pass
- Working set overflow: Total active texture data exceeds L2 capacity
Mipmapping: The Cache's Best Friend
Mipmaps are secretly a cache optimization. When a surface is far away, sampling the full-res texture would pull in texels scattered across memory. The mipmap is smaller, so nearby samples actually land in the same cache lines. This matters more than the visual quality argument—disable mipmaps and watch your frame rate tank.
// Auto mip selection (usually correct)
vec4 color = texture(uTexture, uv);

// Force lower-res mip for cache efficiency in blur passes
// Saves bandwidth when you're averaging anyway
vec4 color = textureLod(uTexture, uv, 2.0);  // Mip level 2

// Bias toward higher mips (cache-friendly, softer)
vec4 color = texture(uTexture, uv, 1.0);     // +1 mip bias

// textureGrad: Explicit derivatives for anisotropic filtering
// Controls footprint precisely for cache-aware sampling
vec4 color = textureGrad(uTexture, uv, dFdx(uv) * 0.5, dFdy(uv) * 0.5);
Measuring Cache Efficiency
GPU profilers like NVIDIA Nsight, AMD Radeon GPU Profiler, and RenderDoc expose cache hit rates and memory throughput. Key metrics to watch:
- L1 Hit Rate: Should be >80% for well-optimized shaders. Below 60% indicates access pattern problems.
- L2 Hit Rate: Measures working set fit. Below 70% suggests texture thrashing or too many unique textures.
- Texture Memory Throughput: Compare to theoretical max. Low throughput with high latency = cache misses.
- Memory-bound vs Compute-bound: If memory is the bottleneck, cache optimization has highest impact.
Interactive Demo: Cache Access Patterns
This demo visualizes how different sampling patterns affect cache efficiency. Coherent access (left) keeps samples within cache lines, while random access (right) causes cache thrashing. Watch how the 'heat' of memory access spreads differently.
Texture Fetch Latency Hiding
The GPU's trick for surviving 500-cycle memory stalls: while one warp waits for data, the scheduler runs another warp that's ready. More warps in flight = more chances to do useful work while waiting. This is why occupancy matters—low occupancy means the GPU sits idle during memory fetches.
What actually matters: avoid dependent texture reads (where the UV comes from another texture), keep your working set small enough to fit in cache, and maintain high occupancy so the scheduler has warps to switch between.
Texture Memory Layout
Textures aren't stored row-by-row in memory. GPUs use Morton order (Z-order curves) or similar swizzle patterns. The idea: keep 2D-neighboring texels close in the 1D memory address space, so a cache line actually contains a useful block of pixels instead of a random horizontal stripe.
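As a rough sketch of the idea (not the exact swizzle pattern any particular GPU uses), here is how a 2D coordinate can be interleaved into a Morton index:

// Sketch of Morton (Z-order) addressing -- illustrative only.
// Spreads the low 16 bits of v so there is a zero bit between each original bit.
uint part1By1(uint v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Interleave x and y bits: neighboring (x, y) texels get nearby 1D addresses,
// so a cache line covers a small 2D block instead of a horizontal stripe.
uint mortonIndex(uint x, uint y) {
    return part1By1(x) | (part1By1(y) << 1);
}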
Practical Optimization Strategies
1. Texture Atlasing
Throw related textures into an atlas and they'll share cache space. Draw a bunch of sprites from the same atlas and the cache stays warm. Downside: UV math gets messier, and you'll fight filtering artifacts at tile edges if you're not careful with padding. A minimal sketch of the remapping follows.
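The uniform names below are illustrative; the half-texel inset is one common way to keep bilinear filtering from bleeding into the neighboring tile:

// Sketch of atlas UV remapping (uTileOffset/uTileScale are assumed uniforms
// describing where this sprite's tile lives, in [0,1] atlas UV space).
uniform sampler2D uAtlas;
uniform vec2 uTileOffset;  // top-left corner of the tile
uniform vec2 uTileScale;   // tile size

vec4 sampleAtlas(vec2 localUV) {
    // Inset by half a texel so bilinear filtering never reads the adjacent tile.
    vec2 halfTexel = 0.5 / vec2(textureSize(uAtlas, 0));
    vec2 uv = uTileOffset + halfTexel + localUV * (uTileScale - 2.0 * halfTexel);
    return texture(uAtlas, uv);
}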
2. Channel Packing
// BAD: 4 separate texture fetches (4x memory traffic)
vec4 albedo = texture(uAlbedo, uv);       // RGBA
float rough = texture(uRoughness, uv).r;  // R (wastes GBA)
float metal = texture(uMetallic, uv).r;   // R (wastes GBA)
float ao    = texture(uAO, uv).r;         // R (wastes GBA)

// GOOD: 2 fetches with packed data
vec4 albedo = texture(uAlbedo, uv);  // RGB = color, A = unused
vec4 orm    = texture(uORM, uv);     // R = AO, G = Roughness, B = Metallic
float ao    = orm.r;
float rough = orm.g;
float metal = orm.b;
3. Reduce Dependent Reads
Dependent reads are latency killers. If your UV comes from a texture lookup (parallax mapping, flow maps), the GPU can't even start the second fetch until the first one finishes. Two 500-cycle stalls back-to-back. Compute offsets mathematically when you can, or accept the hit.
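As a sketch of "compute the offset mathematically," here is a flow-map-style distortion replaced by a procedural offset. The texture names, uTime, and the wave constants are all illustrative assumptions:

// Dependent version: the second fetch cannot start until the first returns.
vec2 flow = texture(uFlowMap, vTexCoord).xy * 2.0 - 1.0;       // fetch #1
vec4 dependent = texture(uTexture, vTexCoord + flow * uTime);  // fetch #2, stalls on #1

// Independent version: the offset comes from math on values already in registers,
// so the single texture fetch can be issued immediately.
vec2 procedural = vec2(sin(vTexCoord.y * 8.0 + uTime),
                       cos(vTexCoord.x * 8.0 + uTime)) * 0.01;
vec4 independent = texture(uTexture, vTexCoord + procedural);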
4. Bindless Textures
Bindless textures (GL_ARB_bindless_texture, Vulkan descriptor indexing) skip the binding overhead and keep all textures resident. Sometimes this helps cache reuse across the frame. But there's a catch: if threads in a warp index into different textures, you pay for the divergence. Measure it—bindless isn't automatically faster.
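For the Vulkan flavor, a minimal sketch of descriptor indexing looks roughly like this (array and varying names are illustrative; it assumes GL_EXT_nonuniform_qualifier and runtime descriptor arrays are available):

// Sketch: Vulkan-style descriptor indexing in a fragment shader.
#version 450
#extension GL_EXT_nonuniform_qualifier : require

layout(set = 0, binding = 0) uniform sampler2D uTextures[];

layout(location = 0) in vec2 vTexCoord;
layout(location = 1) flat in uint vMaterialIndex;
layout(location = 0) out vec4 outColor;

void main() {
    // nonuniformEXT tells the compiler the index may differ across the warp;
    // when it does differ, the hardware serializes per unique texture.
    outColor = texture(uTextures[nonuniformEXT(vMaterialIndex)], vTexCoord);
}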
Architecture Differences
Cache behavior isn't portable. NVIDIA's L1 shares silicon with shared memory (you configure the split). AMD threw in Infinity Cache as an extra layer. Mobile GPUs have smaller caches but lean on tile-based rendering to compensate. What flies on desktop might crawl on mobile, and vice versa.
Quick Wins: Copy-Paste Patterns
Stuff you can copy into your shaders right now:
// Pack Occlusion, Roughness, Metallic into one texture

// BEFORE: 3 texture fetches (wasteful)
float ao    = texture(uAO, uv).r;     // Only uses R channel
float rough = texture(uRough, uv).r;  // Only uses R channel
float metal = texture(uMetal, uv).r;  // Only uses R channel

// AFTER: 1 texture fetch instead of 3
vec3 orm = texture(uORM, uv).rgb;
float ao    = orm.r;
float rough = orm.g;
float metal = orm.b;
// When doing blur/downsample, use higher mip levels
// This saves bandwidth since you're averaging anyway

// Standard blur (fetches full-res texels)
vec4 blur = vec4(0.0);
for (int i = 0; i < 9; i++) {
    blur += texture(uTexture, uv + offsets[i]) * weights[i];
}

// Cache-friendly blur (uses lower mip = fewer cache lines)
vec4 blur = vec4(0.0);
float mipBias = 1.0;  // Use mip level 1 (half res)
for (int i = 0; i < 9; i++) {
    blur += texture(uTexture, uv + offsets[i], mipBias) * weights[i];
}
Key Takeaways
- Cache hits are ~15x faster than VRAM fetches (30 vs 500 cycles). Coherent access patterns matter.
- Mipmaps are a cache optimization, not just a visual quality thing.
- Pack related data into fewer textures. One RGBA fetch beats four R fetches.
- Avoid dependent texture reads—each one serializes memory access.
- Don't manually "prefetch" textures. The compiler and hardware scheduler handle latency hiding.
- Profile with Nsight/RGP. Cache hit rates reveal problems that frame timing won't.
Sources & Further Reading
- NVIDIA Ada GPU Architecture Tuning Guide — Official L1/L2 cache sizes, shared memory configuration
- NVIDIA Ampere GPU Architecture Tuning Guide — A100 cache hierarchy details
- Chips and Cheese: Measuring GPU Memory Latency — Empirical latency measurements for NVIDIA and AMD
- Chips and Cheese: Microbenchmarking NVIDIA's RTX 4090 — Ada Lovelace cache architecture analysis
- NVIDIA Developer Blog: Using Shared Memory in CUDA — Bank conflicts and shared memory optimization
- Fabian Giesen: Texture Tiling and Swizzling — Morton order and GPU texture memory layout
- RasterGrid: Understanding GPU Caches — Cache line sizes and memory hierarchy overview