Attention Mechanism
Definition
A technique that allows AI models to focus on relevant parts of the input when generating output.
Why It Matters
Before attention, sequence models processed input one token at a time and lost track of long-range relationships. Attention lets every output token look back at the entire input simultaneously and weight which positions matter. That is what lets a transformer keep track of a referent across a 100K-token document.
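The weighting described above can be sketched in a few lines. This is a minimal, unoptimized illustration of scaled dot-product attention using only the Python standard library; the function names and the toy vectors are assumptions made for the example, not taken from any particular model or library.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key position.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # non-negative, sums to 1
        # The output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: three positions with made-up 2-d vectors (illustrative only).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[-2.0, 4.0]]  # most similar to the second key

outputs, weights = attention(queries, keys, values)
```

Here `weights[0]` holds one weight per input position, and the largest weight falls on the second position because its key is the best match for the query. Real implementations do the same arithmetic as batched matrix multiplications on a GPU.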
Key Points
- Attention's cost scales as O(n²) in sequence length: doubling the context window quadruples the compute for that layer, and the memory too when the full score matrix is stored.
- Multi-head attention runs the operation multiple times in parallel with different projections; 32–128 heads is typical in large models.
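- Flash Attention (2022) reorders the computation to be I/O-aware: it processes the score matrix in tiles that fit in fast on-chip memory and never writes the full n×n matrix to VRAM. The output is the same, attention memory grows linearly instead of quadratically with sequence length, and 100K+ contexts become practical.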
- The KV cache stores the key and value vectors of previous tokens so they are not recomputed at every generation step; a 200K-token context on a 70B model can require 40–80 GB of VRAM for the cache alone.
- Self-attention (queries, keys, and values all come from the same sequence) and cross-attention (queries from the decoder, keys and values from the encoder) are the two main variants.
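The quadratic-scaling and KV-cache figures in the list above can be checked with back-of-envelope arithmetic. The sketch below assumes a hypothetical 70B-class configuration (80 layers, 8 key/value heads via grouped-query attention, head dimension 128, 16-bit values); these numbers are illustrative assumptions, and actual models differ.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    # Two tensors (keys and values) per layer, each holding one
    # head_dim-sized vector per KV head per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

def score_matrix_entries(tokens):
    # One query-key score for every ordered pair of positions.
    return tokens * tokens

# Hypothetical configuration, assumed for illustration only.
cache = kv_cache_bytes(tokens=200_000, layers=80, kv_heads=8, head_dim=128)
cache_gb = cache / 1e9

# Doubling the sequence length from 100K to 200K tokens.
ratio = score_matrix_entries(200_000) / score_matrix_entries(100_000)
```

Under these assumptions `cache_gb` comes to about 65.5, which sits inside the 40–80 GB range, and `ratio` is 4.0, matching the "doubling quadruples" rule. A model with more key/value heads or wider values would land well above that range.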
Example
When translating "the dog chased the cat because it was scared," attention ties "it" to "cat" rather than "dog" by learning which input position the pronoun refers to instead of guessing from word order.
Common Misconception
Attention does not mean the model reads tokens in sequential order. All positions are computed in parallel; the quadratic cost comes from comparing every pair of tokens, not from recurrence or sequential processing.
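A small sketch can make the point concrete: each row of attention scores depends only on its own query and the full set of keys, so the rows can be computed in any order (or all at once) with the same result. The helper name and the toy vectors below are assumptions made for the illustration.

```python
import math

def score_row(q, keys):
    # Scores for one query depend only on that query and the keys,
    # never on the result for any other position.
    d = len(q)
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]

# Made-up 2-d vectors for four positions (illustrative only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]

# Compute the rows first-to-last, then last-to-first.
forward = {i: score_row(x[i], x) for i in range(len(x))}
backward = {i: score_row(x[i], x) for i in reversed(range(len(x)))}

# Total number of query-key comparisons across all rows.
pair_count = sum(len(row) for row in forward.values())
```

`forward` and `backward` contain identical scores, which is why hardware can evaluate every row in parallel. `pair_count` equals the square of the sequence length, which is where the quadratic cost comes from.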
Related Terms
- Transformer: The neural network architecture behind modern AI models. Introduced in the 2017 paper "Attention Is All You Need."
- LLM (Large Language Model): A neural network trained on massive text datasets that can generate, understand and manipulate human language. Examples: GPT-4, Qwen, Claude.
- Context Window: The maximum amount of text an AI model can process at once, measured in tokens. GPT-4o has a 128K-token context window.
Attention Mechanism on Rewind.ai
Every chat model on Rewind.ai is a transformer that uses attention. The 100K–200K-token context windows you see in the model picker are bounded mostly by attention's quadratic memory cost.
Quick Facts
| Term | Attention Mechanism |
| Related | Transformer, LLM (Large Language Model), Context Window |