Inference
Definition
The process of running an AI model to generate a response. When you send a message to ChatGPT, the model performs inference.
Why It Matters
Training is a one-time investment; inference is the running cost. Every chat reply, image generation, or transcription is one inference. Inference economics decide whether a tool is free, $0.01 per call, or pay-as-you-go.
Key Points
- Two cost centres: TTFT (Time To First Token, latency) and TPS (Tokens Per Second, throughput). Optimising one often trades off against the other.
- Batching multiple requests together is the primary throughput lever, a batch of 8 shares the model-weight memory transfer cost across all 8 outputs.
- Speculative decoding uses a small draft model to propose token sequences that the large model verifies in parallel, typically 2–3× speedup on average output.
- vLLM's PagedAttention (2023) reduces GPU memory fragmentation, enabling ~24× higher throughput vs. naive KV-cache management.
- Output token generation requires a full forward pass per token; input processing is parallelisable, that is why output tokens cost more than input tokens on every provider.
Example
Inference latency for a chat model is roughly "time-to-first-token + time-per-token × output length." A 7B model on an A100 produces ~100 tokens/sec; a GPT-4-class model produces ~50 tokens/sec. Image diffusion is measured in seconds rather than tokens.
Common Misconception
Output token cost is not the same as input token cost. Most providers charge output tokens at 2–5× the input rate. A prompt with a 10,000-token document and a 50-word answer is cheap; a 10-word prompt generating a 5,000-token report is expensive.
Related Terms
- GPU (Graphics Processing Unit)Specialized hardware that runs AI models much faster than CPUs. NVIDIA A100, H100, etc.
- VRAMVideo RAM, the memory on a GPU used to store AI model weights during inference.
- TokenThe basic unit of text processing in AI models. Roughly 1 token = 4 characters of English text. Used for billing and context limits.
Inference on Rewind.ai
All token costs you see on Rewind.ai are inference costs, the model weights are static, we're paying for GPU-seconds to run them. Heavier models (Claude, GPT-4o) cost more tokens per call than self-hosted Qwen for the same input length.
Explore the ToolsQuick Facts
| Term | Inference |
| Related | GPU (Graphics Processing Unit), VRAM, Token |