VRAM
Definition
Video RAM: the memory on a GPU. During AI inference it holds the model weights, along with the activations and the KV cache.
Why It Matters
VRAM is the binding constraint on which models you can host. The model's weights, the activations during a forward pass, and the KV cache for the current context all have to fit in VRAM at the same time. If they do not, the allocation fails with an out-of-memory (OOM) error and the request is aborted; by default there is no graceful fallback.
Key Points
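- Memory bandwidth matters more than FLOPS for inference at low batch sizes, because each generated token requires reading the full set of weights from VRAM. The H100 SXM5 has 3.35 TB/s of HBM3 bandwidth.
- KV cache size per sequence: 2 × n_layers × n_kv_heads × head_dim × context_length × 2 bytes (FP16); the leading 2 counts keys and values. A 70B model with 80 layers, 8 KV heads, and a head dimension of 128 (a Llama-style configuration, assumed here) needs ~65 GB at 200K tokens, or ~32 GB at 100K tokens, for the KV cache alone.
- NVLink connects multiple GPUs for tensor parallelism. Weights and KV cache are sharded across the linked GPUs, so the effective VRAM pool is roughly the sum of their VRAM, minus per-GPU overhead such as activations and communication buffers.
- A 7B model in FP16 needs ~14 GB for weights (7 billion parameters × 2 bytes). Add ~2–4 GB of activation memory and ~2 GB of KV cache per 32K context tokens to get the total VRAM requirement. The KV cache figure assumes grouped-query attention with few KV heads; models with more KV heads need several times that.
- Gradient checkpointing during training trades compute for memory: it recomputes activations on the backward pass instead of storing them, which can reduce peak VRAM by 40–70%.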
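The KV cache formula in the list above can be checked with a few lines of Python. This is a minimal sketch, not a profiler: the function name `kv_cache_bytes` is made up for illustration, and the 70B configuration (80 layers, 8 KV heads, head dimension 128) is an assumed Llama-style layout, so other 70B models may differ.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_length, bytes_per_value=2):
    """KV cache size in bytes for a single sequence.

    The leading 2 counts the two cached tensors (keys and values);
    bytes_per_value=2 corresponds to FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_value


# Assumed 70B-class configuration with grouped-query attention (hypothetical values).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context_length=1)
at_200k = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context_length=200_000)

print(f"per token:   {per_token / 1024:.0f} KiB")  # 320 KiB
print(f"200K tokens: {at_200k / 1e9:.1f} GB")      # 65.5 GB
```

Under these assumptions the cache costs 320 KiB per token. Halving the context, or storing the cache in 8 bits instead of 16, halves the total.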
Example
A 7B model in FP16 needs ~14 GB just for weights; add ~2–4 GB for activations and ~2 GB per 32K of context for the KV cache, for about 18–20 GB at a 32K context. A 24 GB consumer GPU (RTX 4090) runs a 4-bit quantised 13B (~6.5 GB of weights) comfortably. The 80 GB on an H100 fits a 4-bit quantised 70B (~35 GB of weights).
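The arithmetic in this example can be written as a rough budget calculator. The sketch below uses the figures from the text as defaults (3 GB of activations as the midpoint of 2–4 GB, and 2 GB of KV cache per 32K tokens). The function names are hypothetical, and the weight estimate ignores quantisation metadata such as scales and zero points, so real footprints run somewhat higher.

```python
def weights_gb(params_billion, bits_per_weight):
    """Weight footprint in GB (1 GB = 1e9 bytes), ignoring quantisation metadata."""
    return params_billion * bits_per_weight / 8


def total_vram_gb(params_billion, bits_per_weight, context_tokens,
                  activations_gb=3.0, kv_gb_per_32k=2.0):
    """Weights + activations + KV cache, using the rule-of-thumb figures above."""
    kv_gb = kv_gb_per_32k * context_tokens / 32_768
    return weights_gb(params_billion, bits_per_weight) + activations_gb + kv_gb


estimate_7b = total_vram_gb(7, 16, context_tokens=32_768)
print(f"7B FP16 at 32K context: ~{estimate_7b:.0f} GB")        # ~19 GB
print(f"13B 4-bit weights:      ~{weights_gb(13, 4):.1f} GB")  # ~6.5 GB
print(f"70B 4-bit weights:      ~{weights_gb(70, 4):.0f} GB")  # ~35 GB
```

The KV cache default only applies to the 7B case; a larger model needs its own per-token figure, which is why the 13B and 70B lines report weights only.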
Common Misconception
Misconception: if the weights fit, the model fits. In practice, VRAM usage spikes during warm-up and batch-processing peaks. A model that fits in 24 GB of VRAM with a typical input can still hit OOM on the first long-context request. Leave at least 20% of VRAM free beyond the model-weights footprint to absorb KV cache and activation peaks.
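A quick way to see where a long-context request tips over is to compute how many tokens fit in the VRAM left after weights and activations. This sketch reuses the rule-of-thumb figures from the example above. Both helper names are hypothetical, and the 20% check takes one reading of the guideline: the share of total VRAM still free once the weights are loaded.

```python
def max_context_tokens(vram_gb, weights_gb, activations_gb, kv_gb_per_32k=2.0):
    """Largest context whose KV cache fits in the VRAM left after weights and activations."""
    free_gb = vram_gb - weights_gb - activations_gb
    if free_gb <= 0:
        return 0
    return int(free_gb / kv_gb_per_32k * 32_768)


def has_headroom(vram_gb, weights_gb, min_fraction=0.20):
    """True if at least min_fraction of total VRAM stays free after loading the weights."""
    return (vram_gb - weights_gb) / vram_gb >= min_fraction


# 7B FP16 (14 GB of weights, ~3 GB of activations) on a 24 GB card.
limit = max_context_tokens(vram_gb=24, weights_gb=14, activations_gb=3)
print(f"context limit: ~{limit} tokens")
print("20% headroom:", has_headroom(vram_gb=24, weights_gb=14))
```

With these numbers the card has headroom for typical inputs, yet a single request beyond roughly 114K tokens would still exhaust VRAM, which is the failure mode described above.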
Related Terms
- GPU (Graphics Processing Unit): Specialized hardware that runs AI models much faster than CPUs. NVIDIA A100, H100, etc.
- Parameter: A trainable weight in an AI model. Larger models have more parameters (7B, 70B, 400B).
- Quantization: A technique to compress AI models (e.g., from 16-bit to 4-bit) so they use less memory while largely maintaining quality.
VRAM on Rewind.ai
The model picker on Rewind.ai hides the VRAM math from you: we route requests to GPU pools sized for each model. But the same principle explains why a 70B model costs more per token than a 7B: it ties up more VRAM.
Quick Facts
| Term | VRAM |
| Related | GPU (Graphics Processing Unit), Parameter, Quantization |