VRAM

Definition

Video RAM (VRAM) is the dedicated memory on a GPU. During AI inference it holds the model weights, the activations, and the KV cache.

Why It Matters

VRAM is the binding constraint on which models you can host. The model's weights, the activations during a forward pass, and the KV cache for the current context all have to fit in VRAM at the same time. If they don't, the allocation fails with an out-of-memory (OOM) error and the request fails, typically with no graceful fallback.

Key Points

  • Memory bandwidth matters more than FLOPS for inference at low batch sizes, because generating each token requires reading every weight from VRAM. The H100 SXM5 has 3.35 TB/s of HBM3 bandwidth.
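  • KV cache size per sequence: 2 (keys and values) × n_layers × n_kv_heads × head_dim × context_length × 2 bytes (FP16). A 70B model with grouped-query attention (for example 80 layers, 8 KV heads, head_dim 128) needs ~66 GB for the KV cache alone at 200K tokens, or ~33 GB with an 8-bit cache.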
  • NVLink bridges allow multi-GPU tensor parallelism. Each GPU holds a shard of the weights, so the effective VRAM pool is roughly the sum of all linked GPUs' VRAM, minus per-GPU overhead.
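  • A 7B model in FP16 needs ~14 GB for weights (7 billion parameters × 2 bytes). Add ~2–4 GB of activation memory and ~2 GB of KV cache per 32K context tokens to get the total VRAM requirement. The KV cache figure assumes grouped-query attention; full multi-head attention needs several times more.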
  • Gradient checkpointing during training trades compute for memory: it recomputes activations on the backward pass instead of storing them, which can reduce peak VRAM by 40–70 %.
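The KV cache formula and the bandwidth point above reduce to simple arithmetic. The following is a minimal sketch, not a sizing tool: the function names are hypothetical, the 70B configuration (80 layers, 8 KV heads, head_dim 128) is an assumed grouped-query-attention layout, and the decode ceiling assumes every weight is read once per generated token at batch size 1.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_length, bytes_per_value=2):
    """KV cache size for one sequence.

    The leading 2 counts keys and values; bytes_per_value=2 corresponds to FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_value


def max_decode_tokens_per_second(weight_bytes, bandwidth_bytes_per_s):
    """Rough upper bound on batch-1 decode speed.

    Assumes each generated token requires one full read of the weights,
    and ignores KV cache reads and compute time.
    """
    return bandwidth_bytes_per_s / weight_bytes


# Assumed 70B-class configuration (illustrative values, not taken from a spec sheet).
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_length=200_000)
print(f"KV cache at 200K tokens: {size / 1e9:.1f} GB")

# 7B model in FP16 (14 GB of weights) on a GPU with 3.35 TB/s of memory bandwidth.
ceiling = max_decode_tokens_per_second(14e9, 3.35e12)
print(f"Decode ceiling: {ceiling:.0f} tokens/s")
```

The result is sensitive to the assumptions: a model with more KV heads, or a cache stored at a different precision, scales the KV cache figure proportionally.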

Example

A 7B model in FP16 needs ~14 GB just for weights; add ~2–4 GB for activations and ~2 GB per 32K of context for the KV cache. A 24 GB consumer GPU (RTX 4090) runs a 4-bit quantised 13B comfortably (~6.5 GB of weights). The 80 GB on an H100 fits a 70B quantised to 4 bits (~35 GB of weights).
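The arithmetic in this example can be sketched as a small fit check. The names below are hypothetical, the default activation and KV cache allowances are the upper and per-32K figures quoted above, and the bytes-per-parameter table ignores the small overhead that quantisation formats add for scales and zero points.

```python
# Approximate storage cost per parameter (assumption: no quantisation metadata overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(n_params_billion, precision):
    """Weight footprint in GB: billions of parameters times bytes per parameter."""
    return n_params_billion * BYTES_PER_PARAM[precision]


def fits(n_params_billion, precision, vram_gb, activations_gb=4.0, kv_cache_gb=2.0):
    """True if weights, activations, and KV cache together fit in vram_gb."""
    total = weight_gb(n_params_billion, precision) + activations_gb + kv_cache_gb
    return total <= vram_gb


for params, precision, vram in [(7, "fp16", 24), (13, "int4", 24), (13, "fp16", 24),
                                (70, "int4", 80), (70, "fp16", 80)]:
    verdict = "fits" if fits(params, precision, vram) else "does not fit"
    print(f"{params}B {precision} on {vram} GB: {verdict}")
```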

Common Misconception

The misconception: if the weights fit, the model fits. In practice, VRAM usage spikes during warm-up and at batch-processing peaks. A model that fits in 24 GB of VRAM with a typical input can still OOM on the first long-context request. Leave at least 20 % VRAM headroom above the model-weights footprint to absorb KV cache and activation peaks.
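A rough way to see where a long-context request tips over is to compute how many context tokens the VRAM left after weights and activations can hold. This is a sketch under stated assumptions: the function names are hypothetical, and the default of 2 GB of KV cache per 32K tokens is the rule of thumb quoted in this entry, not a measured value.

```python
def max_context_tokens(vram_gb, weights_gb, activations_gb, kv_gb_per_32k=2.0):
    """Context tokens whose KV cache fits in the VRAM left after weights and activations."""
    free_gb = vram_gb - weights_gb - activations_gb
    if free_gb <= 0:
        return 0
    return int(free_gb / kv_gb_per_32k * 32_768)


def request_fits(context_tokens, vram_gb, weights_gb, activations_gb, kv_gb_per_32k=2.0):
    """True if a request with this many context tokens stays within the budget."""
    limit = max_context_tokens(vram_gb, weights_gb, activations_gb, kv_gb_per_32k)
    return context_tokens <= limit


# 7B FP16 model (14 GB weights, 4 GB activation allowance) on a 24 GB GPU.
limit = max_context_tokens(vram_gb=24, weights_gb=14, activations_gb=4)
print(f"Context budget: {limit} tokens")
print("8K request fits:", request_fits(8_000, 24, 14, 4))
print("128K request fits:", request_fits(128_000, 24, 14, 4))
```

Under these assumptions a typical 8K request fits while a 128K request does not, which is the pattern described above.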

Related Terms

  • GPU (Graphics Processing Unit): Specialized hardware that runs AI models much faster than CPUs. NVIDIA A100, H100, etc.
  • Parameter: A trainable weight in an AI model. Larger models have more parameters (7B, 70B, 400B).
  • Quantization: A technique to compress AI models (e.g., from 16-bit to 4-bit) so they use less memory while maintaining quality.

VRAM on Rewind.ai

The model picker on Rewind.ai hides the VRAM math from you: we route requests to GPU pools sized for each model. But the same principle explains why a 70B model costs more per token than a 7B: it ties up more VRAM.

