The most common error message in machine learning isn't a Python traceback or a shape mismatch. It's CUDA out of memory. Your model is too big, your batch size is too large, or your intermediate activations don't fit in your GPU's VRAM. An RTX 4090 has 24 GB. A 70B parameter model in half precision needs ~140 GB. The math doesn't work, and throwing money at bigger GPUs only postpones the problem — an H100 with 80 GB still can't fit the largest models in a single card.
But what if you could transparently extend GPU memory using system RAM or even NVMe storage? Tools like DeepSpeed's ZeRO-Infinity and NVIDIA's GPUDirect Storage do exactly this — exploiting the memory hierarchy (VRAM → system RAM → SSD) to run workloads that shouldn't fit on your hardware. The performance trade-offs are real but surprisingly manageable for many use cases. Understanding how this works requires understanding the GPU memory hierarchy itself.
Why GPU Memory Is Different From CPU Memory
CPU and GPU memory serve fundamentally different access patterns. CPU workloads are latency-sensitive — a single thread needs a piece of data and blocks until it arrives. CPU caches are designed to minimize latency for random access patterns.
GPU workloads are throughput-sensitive — thousands of threads each need their data, and the GPU can switch between threads to hide latency. GPU memory (HBM on datacenter GPUs, GDDR on consumer cards) is designed for bandwidth: delivering vast amounts of data per second, even if any individual access takes longer than a CPU cache hit.
Memory bandwidth comparison (approximate):

- RTX 4090 GDDR6X: ~1,000 GB/s
- H100 HBM3: ~3,350 GB/s
- DDR5 system RAM (dual-channel): ~50 GB/s
- PCIe 5.0 x16: ~64 GB/s (theoretical max)
- NVMe SSD (PCIe 4.0): ~7 GB/s

The bandwidth cliff between VRAM and system RAM is roughly 20x; between VRAM and NVMe it's roughly 140x. This is why naive offloading to system RAM kills performance — you're trying to feed a 1,000 GB/s appetite through a 50 GB/s straw.
This bandwidth gap is why simply making GPU memory 'virtual' — paging data to system RAM like the CPU does with disk — doesn't work naively. A CPU tolerates page faults with maybe a 10x slowdown. A GPU hitting system RAM instead of VRAM sees a 20x bandwidth reduction, which for bandwidth-bound workloads (most ML inference) means a 20x slowdown.
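The arithmetic behind that slowdown is easy to check. Here's a sketch that estimates decode throughput for a bandwidth-bound model — in the bandwidth-bound regime every weight must be read once per generated token, so tokens/sec is roughly bandwidth divided by model size (a simplification that ignores KV-cache traffic and compute time; figures from the comparison above):

```python
def tokens_per_second(model_gb: float, bandwidth_gbps: float) -> float:
    """Upper bound on decode speed when every weight is read once per token."""
    return bandwidth_gbps / model_gb

MODEL_GB = 40  # a 4-bit quantized 70B model, roughly

for tier, bw in [("VRAM (RTX 4090)", 1000),
                 ("System RAM (DDR5)", 50),
                 ("NVMe SSD", 7)]:
    print(f"{tier}: ~{tokens_per_second(MODEL_GB, bw):.2f} tokens/sec")
```

The same 20x and 140x ratios fall straight out: ~25 tokens/sec from VRAM becomes ~1.25 from system RAM and under 0.2 from NVMe if the data actually has to stream from those tiers on every token.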
How VRAM Offloading Actually Works
The trick isn't to treat system RAM as slow VRAM. It's to prefetch data from system RAM into VRAM before the GPU needs it, hiding the latency behind computation. This is the key insight behind every practical VRAM extension technique.
During neural network inference, the computation is sequential through layers. While the GPU processes layer 5, it knows layer 6 is next. A smart offloading system can start transferring layer 6's weights from system RAM to VRAM while layer 5 is computing. If the computation takes longer than the transfer (often the case for large layers), the transfer is completely hidden — the GPU never stalls.
# Conceptual overlap of compute and transfer
# (simplified pseudocode)
def inference_with_offloading(model, input_data):
    # Only 2 layers fit in VRAM at a time; the rest live in system RAM
    for i, layer in enumerate(model.layers):
        # Start async transfer of the NEXT layer while computing the current one
        if i + 1 < len(model.layers):
            async_transfer_to_gpu(model.layers[i + 1])

        # Compute on the current layer (GPU is busy, transfer happens in parallel)
        output = layer.forward(input_data)

        # Evict the current layer from VRAM (it's done)
        transfer_to_ram(layer)

        # Wait for the next layer's transfer to complete (usually already done)
        sync_transfer()

        input_data = output
    return output
This pipeline approach works well for inference because the computation graph is predictable — you know exactly which weights are needed next. Training is harder because backward passes need activations from the forward pass, creating more complex data movement patterns.
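The pipelining argument can be made quantitative with a toy timing model (the per-layer numbers are hypothetical): with prefetching, each step costs max(compute, transfer) instead of compute + transfer, so transfers are free whenever compute is the longer leg.

```python
def pipeline_time(compute_ms, transfer_ms):
    """Total time when each layer's transfer overlaps the previous layer's compute.

    The first layer's transfer can't be hidden; after that, each step costs
    max(compute of current layer, transfer of next layer).
    """
    total = transfer_ms[0]  # must load layer 0 before any compute starts
    for i, c in enumerate(compute_ms):
        nxt = transfer_ms[i + 1] if i + 1 < len(transfer_ms) else 0.0
        total += max(c, nxt)
    return total

def serial_time(compute_ms, transfer_ms):
    """Total time with no overlap: load each layer, then compute on it."""
    return sum(compute_ms) + sum(transfer_ms)

# Hypothetical: 8 layers, 30 ms compute each, 20 ms to transfer each
compute = [30.0] * 8
transfer = [20.0] * 8
print(pipeline_time(compute, transfer))  # 260.0 ms: only the first load is exposed
print(serial_time(compute, transfer))    # 400.0 ms: every transfer stalls the GPU
```

When compute per layer exceeds transfer per layer, the pipeline runs at essentially full speed; flip the ratio (fast compute, slow transfers) and the pipeline degrades to transfer-bound, which is exactly the memory-bound-workload caveat above.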
The Approaches in Practice
CUDA Unified Memory
NVIDIA's CUDA Unified Memory creates a single address space spanning both GPU VRAM and system RAM. The CUDA runtime automatically migrates pages between them based on access patterns. When the GPU accesses a page in system RAM, it triggers a page fault and the page is migrated to VRAM.
The advantage: it's transparent to the application. Your CUDA code doesn't need to manage data placement. The disadvantage: page faults are expensive, and the runtime's migration heuristics don't always match the application's access pattern. For predictable workloads like neural network inference, explicit management outperforms automatic migration.
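Why explicit management wins for a predictable workload can be seen with a toy page-cache simulation (hypothetical page counts; a real Unified Memory runtime also does some hardware prefetching, so this exaggerates the gap — but the shape of the result holds):

```python
from collections import OrderedDict

def count_faults(accesses, cache_pages, prefetch_ahead=0):
    """Count page faults under an LRU page cache, optionally prefetching
    the next N pages whenever a page is touched."""
    cache = OrderedDict()
    faults = 0
    for page in accesses:
        if page not in cache:
            faults += 1
        # Cache the page plus any prefetched successors, evicting LRU entries
        for p in range(page, page + 1 + prefetch_ahead):
            cache[p] = True
            cache.move_to_end(p)
            while len(cache) > cache_pages:
                cache.popitem(last=False)
    return faults

# Sequential sweep over 1,000 pages with room for 100 in "VRAM"
seq = list(range(1000))
print(count_faults(seq, 100))                    # 1000: every access faults
print(count_faults(seq, 100, prefetch_ahead=8))  # 1: prefetch stays ahead of the sweep
```

A fault-driven system pays the migration cost on every page of a sequential sweep; a system that knows the access pattern pays it almost never. Layer-by-layer inference is exactly this kind of sweep, which is why the explicit approaches below outperform automatic migration.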
Layer-by-Layer Offloading
Tools like Hugging Face Accelerate, DeepSpeed ZeRO-Inference, and llama.cpp (via its --n-gpu-layers option, with weights memory-mapped from disk) implement explicit layer offloading. They keep only the active layers in VRAM and stream the rest from system RAM or disk. The model doesn't need to fit entirely in VRAM — it only needs to fit one or two layers at a time.
This is how running 70B models on consumer hardware actually works. A 4-bit quantized 70B model needs ~40 GB total, but any single layer only needs ~0.5 GB (about 1.75 GB if kept at FP16). With 24 GB of VRAM and prefetching, you can run the model with modest performance impact — maybe 30-50% slower than fitting entirely in VRAM.
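A quick sanity check of those numbers (the 80-layer depth is illustrative, roughly Llama-70B-class; real architectures vary): what matters for offloading is the per-layer size and the prefetch window, not the total.

```python
def per_layer_gb(params_b: float, bytes_per_param: float, num_layers: int) -> float:
    """Approximate weight size of one transformer layer, in GB."""
    return params_b * bytes_per_param / num_layers

PARAMS_B = 70     # parameters, in billions
NUM_LAYERS = 80   # illustrative Llama-70B-class depth

for name, bpp in [("FP16", 2.0), ("INT4", 0.5)]:
    total_gb = PARAMS_B * bpp
    layer_gb = per_layer_gb(PARAMS_B, bpp, NUM_LAYERS)
    print(f"{name}: {total_gb:.0f} GB total, ~{layer_gb:.2f} GB per layer")
# FP16: 140 GB total, ~1.75 GB per layer
# INT4: 35 GB total, ~0.44 GB per layer
```

With a two-layer window (current layer plus the prefetched next one), the weight working set is ~1-3.5 GB depending on precision — comfortably inside a 24 GB card even after the KV cache and activations take their share.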
NVMe as Extended Memory
The most aggressive approach uses NVMe SSDs as a third tier of GPU memory. The bandwidth is terrible compared to VRAM (~7 GB/s vs ~1,000 GB/s), but the capacity is essentially unlimited. A 4 TB NVMe drive costs $200 and can store dozens of large models simultaneously.
Projects like DeepSpeed's ZeRO-Infinity implement this with intelligent prefetching to minimize the impact of the bandwidth limitation, and NVIDIA's GPUDirect Storage lets data move between NVMe and GPU memory without bouncing through system RAM. For inference workloads where the GPU spends significant time computing (not just moving data), the NVMe latency can be fully hidden by computation overlap.
The performance depends heavily on the workload. Compute-bound operations (large matrix multiplications) hide transfer latency well. Memory-bound operations (attention mechanisms with large context) don't. In practice, NVMe offloading works best for batch inference of large models with small batch sizes — exactly the use case for local LLM inference.
Apple's Unified Memory Advantage
Apple Silicon's unified memory architecture takes a different approach entirely: eliminate the VRAM/RAM distinction. The CPU and GPU share the same physical memory pool. There's no 'offloading' because there's no separation — the GPU accesses the same memory the CPU uses.
This doesn't eliminate the bandwidth problem — M-series memory bandwidth (roughly 400-550 GB/s on recent Max-class chips) is lower than a dedicated GPU's — but it eliminates the PCIe bottleneck that makes discrete GPU offloading slow. A Mac with 128 GB of unified memory can fit a 70B model entirely in GPU-accessible memory without any offloading overhead.
The trade-off: lower peak throughput for workloads that fit entirely in VRAM on a dedicated GPU, but dramatically better performance for workloads that don't fit. For large model inference, where the model exceeds discrete GPU VRAM, Apple Silicon is often faster than a discrete GPU with offloading despite having less raw compute.
What This Means for Developers
If you're building applications that use GPUs — ML inference, graphics, scientific computing — the VRAM constraint affects your architecture decisions in concrete ways.
- Know your working set size. Profile your GPU memory usage. Not peak allocation — the working set at any point in time. If your peak is 48 GB but no single operation needs more than 8 GB of active data, offloading will work well. If a single operation genuinely needs 48 GB simultaneously, you need a bigger GPU.
- Choose your offloading strategy based on access pattern. Sequential access (layer-by-layer inference) works great with prefetching. Random access (attention over large contexts) doesn't. Know which pattern your workload follows.
- Quantization is usually cheaper than offloading. Reducing your model from FP16 to INT4 cuts memory by 4x with modest quality impact. Offloading preserves quality exactly but pays for it in latency on every forward pass. Do quantization first, offloading second.
- Batch size is your tuning knob. Larger batches need more memory but amortize overhead better. Smaller batches need less memory but process fewer items per second. When you're near the VRAM limit, reducing batch size is the simplest fix.
- Monitor memory fragmentation. CUDA memory allocation can fragment VRAM over time, especially with variable-length inputs. You might have 8 GB free in total but no single contiguous 2 GB block. PyTorch's torch.cuda.memory_stats() exposes allocator statistics, and torch.cuda.empty_cache() can help, though it's not a cure-all.
GPU memory will always be the bottleneck for large-scale compute workloads. Models grow faster than VRAM does. But the tooling for managing that bottleneck — unified memory, intelligent offloading, multi-tier caching — is getting good enough that 'doesn't fit in VRAM' is no longer a hard barrier. It's a performance trade-off, and increasingly, a manageable one.