Quantization
Definition
A technique that compresses an AI model by storing its weights at lower numeric precision (e.g., 4-bit instead of 16-bit), so it uses less memory while keeping most of its quality.
Why It Matters
Quantisation is how a 70B-parameter model fits on a single 48 GB GPU instead of needing a multi-GPU node. Dropping each weight from a 16-bit float to a 4-bit integer shrinks the model's memory footprint by roughly 4× (closer to 3.5× once per-block scale factors are counted), usually with less than 5 % quality loss. That's the difference between "runnable locally" and "cloud-only."
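To make "dropping each weight from 16-bit floats to 4-bit integers" concrete, here is a minimal sketch of symmetric round-to-nearest quantisation. It is not the scheme used by any particular format: the function names, the single shared scale, and the sample weights are all assumptions chosen for illustration.

```python
def quantize_int4(weights):
    """Map floats to integers in [-7, 7] using one shared scale (illustrative only)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 4-bit range used here.
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers and the scale."""
    return [v * scale for v in q]


# Hypothetical weights, picked only to show the round trip.
weights = [0.12, -0.53, 0.98, -0.05, 0.31]
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print("stored integers:", q)
print("scale:", scale)
print("largest round-trip error:", max_err)
```

Only the small integers and the scale need to be stored. The round-trip error is the quality cost, and it is bounded by half a quantisation step.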
Key Points
- Common precision formats: FP16 (2 bytes/param), BF16 (2 bytes), INT8 (1 byte), Q4_K_M (~0.56 bytes), Q3_K_S (~0.38 bytes).
- GPTQ: post-training quantisation calibrated on a small dataset to minimise output deviation. Quality roughly matches GGUF Q4_K_M.
- AWQ (Activation-aware Weight Quantization, 2023): protects the ~1 % of weights tied to the largest activations; often beats GPTQ at the same bit-width.
- Quality loss at Q4_K_M: typically < 3 % on MMLU vs. full FP16. At Q3_K_S it rises to 5–8 %.
- Apple M-series chips have native INT4/INT8 support via Metal, so quantised models typically run faster and cooler on Apple Silicon than FP16 equivalents.
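The fractional bytes-per-parameter figures above come from storing a shared scale factor alongside each block of low-bit weights. The sketch below uses a deliberately simplified model, one 16-bit scale per block of 32 weights, which is an assumption for illustration; the real K-quant layouts are more elaborate.

```python
def effective_bits_per_weight(weight_bits, block_size, scale_bits=16):
    """Bits per weight once the per-block scale is amortised over the block."""
    return weight_bits + scale_bits / block_size


def bytes_per_param(bits_per_weight):
    """Convert bits per weight to bytes per parameter."""
    return bits_per_weight / 8


# Simplified 4-bit layout: 32 weights share one 16-bit scale (assumed values).
bits = effective_bits_per_weight(weight_bits=4, block_size=32)
print(bits)                   # 4.5
print(bytes_per_param(bits))  # 0.5625
```

Under these assumptions a "4-bit" format costs 4.5 bits per weight, or about 0.56 bytes per parameter.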
Example
A 70B model in FP16 needs ~140 GB of VRAM for its weights. Quantised to Q4_K_M (~4.5 bits per weight) it fits in ~40 GB: one 80 GB A100 instead of two. Same model, same architecture; only the precision of the stored weights changes.
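The arithmetic behind these figures can be sketched as below. The function name is hypothetical, and the estimate covers the weights only; the KV cache and activations need additional memory on top.

```python
def weight_footprint_gb(num_params, bits_per_weight):
    """Approximate size of the stored weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_footprint_gb(70e9, 16)   # 16 bits per weight
q4_gb = weight_footprint_gb(70e9, 4.5)    # ~4.5 bits per weight, as quoted above

print(f"FP16:   {fp16_gb:.1f} GB")  # FP16:   140.0 GB
print(f"~4.5 b: {q4_gb:.1f} GB")    # ~4.5 b: 39.4 GB
print(f"ratio:  {fp16_gb / q4_gb:.2f}x")
```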
Common Misconception
Quantisation does not speed up all hardware equally. On NVIDIA GPUs it primarily saves VRAM and memory bandwidth: INT8 via TensorRT-LLM can speed up inference, while INT4 is mainly a memory-footprint win. The larger compute speedup appears on CPUs and Apple Silicon with native integer arithmetic.
Related Terms
- Parameter: A trainable weight in an AI model. Larger models have more parameters (7B, 70B, 400B).
- VRAM: Video RAM, the memory on a GPU used to store AI model weights during inference.
- Inference: The process of running an AI model to generate a response. When you send a message to ChatGPT, the model performs inference.
Quantization on Rewind.ai
Rewind.ai's self-hosted lineup uses quantised weights where the quality trade-off is small. You'll see a Q4 / Q8 tag on some models in the picker; that's the precision tier.