Inference

Definition

The process of running a trained AI model on new input to generate a response. When you send a message to ChatGPT, the model performs inference to produce the reply.

Why It Matters

Training is largely an up-front investment; inference is the recurring cost. Every chat reply, image generation, or transcription is an inference request. Inference economics decide whether a tool can be offered free, at a flat price such as $0.01 per call, or on pay-as-you-go billing.

Key Points

  • Two key performance metrics: TTFT (Time To First Token), which measures latency, and TPS (Tokens Per Second), which measures throughput. Optimising one often comes at the expense of the other.
  • Batching multiple requests together is the primary throughput lever: a batch of 8 shares the cost of reading the model weights from memory across all 8 outputs.
  • Speculative decoding uses a small draft model to propose token sequences that the large model verifies in parallel, typically giving a 2–3× speedup in output generation.
  • vLLM's PagedAttention (2023) reduces GPU memory fragmentation in the KV cache; the vLLM team reported up to 24× higher throughput than Hugging Face Transformers.
  • Output token generation requires a full forward pass per token, while input processing is parallelisable. That is why most providers charge more for output tokens than for input tokens.
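
The batching point above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes decoding is limited purely by memory bandwidth, so each decode step reads all model weights once regardless of batch size; it ignores KV-cache traffic and compute limits, which cap the gain in practice. The function names and the hardware figures (14 GB of weights for a 7B model in 16-bit precision, 2 TB/s of memory bandwidth) are illustrative assumptions, not measurements.

```python
def decode_step_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time for one decode step if it is purely memory-bandwidth bound.

    Assumption: every step streams all model weights from GPU memory once,
    and nothing else (KV cache, compute) contributes to the step time.
    """
    if weight_bytes <= 0 or bandwidth_bytes_per_s <= 0:
        raise ValueError("weight_bytes and bandwidth_bytes_per_s must be positive")
    return weight_bytes / bandwidth_bytes_per_s


def aggregate_tokens_per_second(batch_size: int, weight_bytes: float,
                                bandwidth_bytes_per_s: float) -> float:
    """Total tokens/sec across the batch: one new token per request per step."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return batch_size / decode_step_seconds(weight_bytes, bandwidth_bytes_per_s)


# Hypothetical figures, chosen only for illustration.
WEIGHT_BYTES = 14e9   # assumed: 7B parameters at 2 bytes each
BANDWIDTH = 2e12      # assumed: about 2 TB/s of GPU memory bandwidth

if __name__ == "__main__":
    for batch_size in (1, 8):
        tps = aggregate_tokens_per_second(batch_size, WEIGHT_BYTES, BANDWIDTH)
        print(f"batch={batch_size}: ~{tps:.0f} tokens/sec in total")
```

Under these assumptions the step time is identical for both batch sizes, so total throughput scales with the batch while each individual request still receives one token per step.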

Example

Inference latency for a chat model is roughly "time-to-first-token + time-per-token × output length." As rough orders of magnitude, a 7B model on an A100 produces ~100 tokens/sec and a GPT-4-class model produces ~50 tokens/sec; actual figures vary with hardware, quantisation, and load. Image diffusion latency is measured in seconds per image rather than in tokens.
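
The latency rule of thumb above can be written as a small helper. This is a minimal sketch: the function name and the 0.5-second time-to-first-token are assumptions made for illustration, and the tokens/sec values are the approximate figures quoted in this entry.

```python
def estimate_latency_seconds(ttft_s: float, tokens_per_second: float,
                             output_tokens: int) -> float:
    """Approximate end-to-end latency for one chat response.

    latency = time-to-first-token + time-per-token * output length,
    where time-per-token is 1 / tokens_per_second.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    if ttft_s < 0 or output_tokens < 0:
        raise ValueError("ttft_s and output_tokens must be non-negative")
    return ttft_s + output_tokens / tokens_per_second


if __name__ == "__main__":
    ASSUMED_TTFT_S = 0.5  # hypothetical value, not a measurement
    # Approximate decode speeds quoted in the entry above.
    for label, tps in (("7B model", 100.0), ("GPT-4-class model", 50.0)):
        latency = estimate_latency_seconds(ASSUMED_TTFT_S, tps, output_tokens=500)
        print(f"{label}: ~{latency:.1f} s for a 500-token reply")
```

For long replies the per-token term dominates, which is why halving tokens/sec roughly doubles the total wait.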

Common Misconception

Input and output tokens do not cost the same. Most providers charge output tokens at 2–5× the input rate. A prompt with a 10,000-token document and a 50-word answer is relatively cheap for its size; a 10-word prompt generating a 5,000-token report is relatively expensive.
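
The two requests described above can be compared with a simple cost calculation. The prices below are hypothetical ($1 per million input tokens and $4 per million output tokens, a 4× ratio within the 2–5× range mentioned), and the word-to-token conversions (50 words as about 70 tokens, 10 words as about 15 tokens) are rough assumptions; real provider rates and tokenizers differ.

```python
TOKENS_PER_MILLION = 1_000_000


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request given per-million-token prices in USD."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / TOKENS_PER_MILLION


# Hypothetical prices, not any provider's actual rates.
INPUT_PRICE_PER_M = 1.00
OUTPUT_PRICE_PER_M = 4.00

# Assumed token counts for the two requests described in the text.
long_prompt_short_answer = request_cost_usd(10_000, 70, INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M)
short_prompt_long_answer = request_cost_usd(15, 5_000, INPUT_PRICE_PER_M, OUTPUT_PRICE_PER_M)

if __name__ == "__main__":
    print(f"10,000-token document, short answer: ${long_prompt_short_answer:.5f}")
    print(f"Short prompt, 5,000-token report:    ${short_prompt_long_answer:.5f}")
```

With these assumed prices the second request costs roughly twice as much as the first, even though its prompt is hundreds of times shorter.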

Related Terms

  • GPU (Graphics Processing Unit): Specialized hardware that runs AI models much faster than CPUs. NVIDIA A100, H100, etc.
  • VRAM: Video RAM, the memory on a GPU used to store AI model weights during inference.
  • Token: The basic unit of text processing in AI models. Roughly 1 token = 4 characters of English text. Used for billing and context limits.
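
The "1 token = 4 characters" rule of thumb in the Token entry can be turned into a quick estimator. This is only a sketch for English text under that stated ratio; the function name is hypothetical, and a real tokenizer will give different counts, especially for code or other languages.

```python
import math

# Rule of thumb from the Token entry above; an approximation for English only.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token count: character length divided by 4, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


if __name__ == "__main__":
    sample = "Inference is the process of running a trained model."
    print(f"{len(sample)} characters is roughly {estimate_tokens(sample)} tokens")
```

An estimate like this is useful for budgeting against context limits or daily token allowances before sending a request, not for exact billing.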

Inference on Rewind.ai

All token costs you see on Rewind.ai are inference costs: the model weights are static, so we're paying for GPU-seconds to run them. Heavier models (Claude, GPT-4o) cost more tokens per call than self-hosted Qwen for the same input length.

Quick Facts

Term: Inference
Related: GPU (Graphics Processing Unit), VRAM, Token

FAQ

Inference on Rewind.ai is a free AI tool. There's no charge and no sign up needed to start.

Yes. You get 2,500 free tokens per day to use Inference and every other tool on Rewind.ai. A free account raises that to 5,000 tokens/day. You can buy more starting at $1.

Inference runs open-source AI models on our GPU servers. Send your request and the result comes back in seconds.

No. You can use Inference right away without signing up. A free account doubles your daily usage to 5,000 tokens and saves your history.

Anonymous users get 2,500 tokens/day. Free accounts get 5,000 tokens/day. Tokens reset every 24 hours. Each generation costs ~100-5,000 tokens depending on the operation.

Your data is processed on our servers and isn't stored permanently unless you choose to save it. We don't sell or share it.

Yes. Content from Inference is yours to use for personal or commercial work. The AI models we run are commercially licensed.

Inference matches the quality of paid services because it runs the latest open-source AI models. The difference is you don't pay per use.

Inference runs open-source AI models including Qwen 2.5, FLUX and Whisper. We update to newer models as they ship.

Yes. Inference works in any mobile browser, and the layout adapts to your screen size.

Sign up for a free account to get 5,000 tokens/day, double the anonymous limit. Or buy token packs starting at $5 for 200,000 tokens. See /pricing/ for all options.

Yes. After you generate content, you can download it, copy it, or share it via a unique link. Signed-in users can also view their generation history.
