Quantization

Definition

A technique that compresses AI models by storing their weights at lower numeric precision (e.g., 4-bit instead of 16-bit) so they use less memory while retaining most of their quality.

Why It Matters

Quantisation is how a 30B-parameter model fits on a single 24 GB GPU instead of needing a multi-GPU node. By dropping each weight from 16-bit floats to roughly 4-bit integers, the weights shrink from ~60 GB to ~17 GB, a 3.5–4× reduction, with usually <5 % quality loss. That's the difference between "runnable locally" and "cloud-only."
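
To make "dropping each weight to 4-bit integers" concrete, here is a minimal sketch of symmetric round-to-nearest quantisation with a single shared scale. It is an illustration only: the function names and sample weights are made up, and real formats such as GPTQ, AWQ or the GGUF K-quants use more elaborate schemes.

```python
def quantize_absmax(weights, bits=4):
    """Symmetric round-to-nearest quantisation with one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for signed 4-bit codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Clamp so rounding can never leave the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.70, -0.42, 0.11, 0.02]     # made-up example values
codes, scale = quantize_absmax(weights)  # codes -> [7, -4, 1, 0]
restored = dequantize(codes, scale)
errors = [abs(w - r) for w, r in zip(weights, restored)]
```

Only the small integer codes and the one scale need to be stored; the rounding error per weight is at most half the scale, which is where the quality loss comes from.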

Key Points

  • Common precision formats: FP16 (2 bytes/param), BF16 (2 bytes), INT8 (1 byte), Q4_K_M (~0.56 bytes), Q3_K_S (~0.38 bytes).
  • GPTQ: post-training quantisation calibrated on a small dataset to minimise output deviation. At 4 bits, quality roughly matches GGUF Q4_K_M.
  • AWQ (Activation-aware Weight Quantization, 2023): protects the ~1 % of weights that matter most, identified by activation magnitude, and often beats GPTQ at the same bit-width.
  • Quality loss at Q4_K_M: typically < 3 % on MMLU vs. full FP16. At Q3_K_S it rises to 5–8 %.
  • Apple M-series chips have native INT4/INT8 support via Metal, so quantised models run faster and cooler on Apple Silicon than their FP16 equivalents.
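
One reason a "4-bit" format is listed at ~0.56 bytes per parameter rather than exactly 0.5 is that weights are usually quantised in small groups, and each group stores extra metadata such as a scale. The sketch below assumes a simplified, hypothetical layout (4-bit codes in groups of 32 sharing one 16-bit scale); it is not the actual Q4_K_M layout, which is more involved.

```python
def effective_bits_per_weight(payload_bits, group_size, metadata_bits):
    """Payload bits plus per-group metadata amortised over the group."""
    return payload_bits + metadata_bits / group_size


# Hypothetical layout: 4-bit codes, groups of 32, one 16-bit scale per group.
bpw = effective_bits_per_weight(payload_bits=4, group_size=32, metadata_bits=16)
bytes_per_param = bpw / 8    # 4.5 bits -> 0.5625 bytes
```

Smaller groups track the weights more closely but add more metadata per weight, so group size is a trade-off between quality and footprint.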

Example

A 70B model in FP16 needs ~140 GB of VRAM for its weights alone. Quantised to Q4_K_M (~4.5 bits per weight) it fits in ~40 GB: a single 80 GB A100 instead of two or three. Same model, same architecture; only the precision of the stored weights changes.
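
The arithmetic behind these figures can be reproduced in a few lines. This is a rough sketch that counts weights only; KV cache, activations and runtime overhead need additional headroom. The function name is made up for illustration.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(70e9, 16)    # 140.0
q4_gb = weight_memory_gb(70e9, 4.5)     # 39.375
reduction = fp16_gb / q4_gb             # about 3.6x
```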

Common Misconception

Quantisation does not speed up all hardware equally. On NVIDIA GPUs it primarily saves VRAM and memory bandwidth. The real compute speedup appears on CPUs and Apple Silicon with native integer arithmetic. On NVIDIA, INT8 via TensorRT-LLM speeds inference; INT4 is mainly a memory footprint win.
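
The memory-bandwidth effect can be estimated with a common rule of thumb: when single-stream generation is memory-bound, each new token requires reading roughly all of the weights once, so tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that assumption to a 7B model; the bandwidth value is a hypothetical round number, not a measurement of any particular device.

```python
def max_decode_tokens_per_s(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on single-stream decode speed when memory-bound.

    Assumes every weight is read once per generated token and ignores
    KV-cache traffic, compute time and kernel overheads.
    """
    weight_gb = n_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_s / weight_gb


BANDWIDTH_GB_S = 1000.0   # hypothetical device, not a measured figure

fp16_rate = max_decode_tokens_per_s(7e9, 16, BANDWIDTH_GB_S)
q4_rate = max_decode_tokens_per_s(7e9, 4.5, BANDWIDTH_GB_S)
speedup = q4_rate / fp16_rate    # equals 16 / 4.5, about 3.6x
```

This is a ceiling, not a prediction: real throughput also depends on how efficiently the hardware and kernels unpack the low-bit weights.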

Related Terms

  • Parameter: A trainable weight in an AI model. Larger models have more parameters (7B, 70B, 400B).
  • VRAM: Video RAM, the memory on a GPU used to store AI model weights during inference.
  • Inference: The process of running an AI model to generate a response. When you send a message to ChatGPT, the model performs inference.

Quantization on Rewind.ai

Rewind.ai's self-hosted lineup uses quantised weights where the quality trade-off is small. You'll see a Q4 / Q8 tag on some models in the picker; that's the precision tier.


