Transformer
Definition
The neural network architecture behind most modern AI models, introduced in the 2017 paper "Attention Is All You Need."
Why It Matters
The transformer architecture (2017) is the engine inside virtually every modern LLM, many recent diffusion models, and most speech and vision models. It replaced LSTMs and CNNs for sequence tasks because attention processes all positions of a sequence in parallel, whereas recurrence must step through them one at a time. That makes transformers far easier to scale on modern accelerators, and at a comparable parameter count they have generally outperformed LSTMs.
Key Points
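- Core components: multi-head self-attention, a position-wise feed-forward network, layer normalisation, and residual connections. These building blocks date from the 2017 paper, although details such as normalisation placement and activation functions have since been refined.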
- Encoder-only (BERT, RoBERTa): strong at classification and embedding. Decoder-only (GPT, Llama, Qwen): strong at generation. Encoder-decoder (T5, BART): strong at translation and summarisation.
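- Positional encoding is necessary because self-attention on its own has no notion of token order: permuting the input tokens simply permutes the outputs. RoPE (Rotary Position Embeddings) is the most widely used scheme in current LLMs; it encodes relative position and, combined with frequency-scaling techniques, can be extended to context lengths beyond those seen during training.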
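- FlashAttention-2 (2023): an exact attention implementation that uses tiling and kernel fusion to avoid materialising the full n×n attention matrix, so memory grows linearly with sequence length. It is roughly 2× faster than the original FlashAttention and makes long-context training far more practical.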
- Architecture variants since 2017: sparse attention, linear attention, and state-space models (Mamba) have all been proposed to reduce the O(n²) cost of attention, but none has yet been universally adopted at frontier scale.
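The points above about self-attention, token order, and the O(n²) cost can be illustrated with a minimal sketch. It assumes a single head with identity projections (Q = K = V = input), which real layers replace with learned weight matrices; the function names `softmax` and `self_attention` are illustrative and not taken from any library.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, causal=False):
    """Single-head self-attention over n token vectors of length d.

    Assumption: Q = K = V = x (no learned projections), for brevity.
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # One row of the n x n score matrix; n rows of n scores is the O(n^2) cost.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(n)
        ]
        if causal:
            # Decoder-style mask: position i cannot attend to later positions.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Output i is a weighted average of all (visible) token vectors.
        out.append([sum(w * x[j][k] for j, w in enumerate(weights)) for k in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
plain = self_attention(tokens)
shuffled = self_attention([tokens[p] for p in perm])

# Without positional information, reordering the input only reorders the output rows.
order_blind = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(shuffled[i], plain[p])
)
print(order_blind)
```

With `causal=True` the first position can only attend to itself, which is the masking that decoder-only models use for generation; encoder-only models leave the mask off.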
Example
GPT, Claude, Gemini, Llama, Qwen, Mistral, DeepSeek, FLUX, Stable Diffusion 3, and Whisper are all transformers, differing mainly in size, training data, and decoder/encoder layout. The core architecture has stayed remarkably stable since 2017.
Common Misconception
"Transformer" names an architecture, not a product or capability level. Saying a model is a transformer describes its structural blueprint, not its size, training data, or quality. Two transformers with the same parameter count but trained on different data and objectives can have very different capabilities.
Related Terms
- Attention Mechanism: A technique that allows AI models to focus on relevant parts of the input when generating output.
- LLM (Large Language Model): A neural network trained on massive text datasets that can generate, understand and manipulate human language. Examples: GPT-4, Qwen, Claude.
- Parameter: A trainable weight in an AI model. Larger models have more parameters (7B, 70B, 400B).
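To make parameter counts like those above more concrete, here is a rough back-of-the-envelope estimate. It assumes the original 2017 layer design (four d_model × d_model attention matrices and a feed-forward network four times as wide as the model) and ignores embeddings, biases, and normalisation weights. The function name `approx_block_params` is hypothetical, and models that use grouped-query attention or gated feed-forward layers will come out differently.

```python
def approx_block_params(n_layers, d_model, ffn_mult=4):
    """Rough parameter count of the transformer blocks only (an approximation)."""
    attn = 4 * d_model * d_model              # W_Q, W_K, W_V and the output projection
    ffn = 2 * d_model * (ffn_mult * d_model)  # up-projection and down-projection
    return n_layers * (attn + ffn)

# Publicly reported GPT-3 configuration: 96 layers, model width 12288.
estimate = approx_block_params(n_layers=96, d_model=12288)
print(f"{estimate / 1e9:.0f}B")
```

Under these assumptions the estimate simplifies to about 12 × n_layers × d_model², which is why doubling the width adds far more parameters than doubling the depth.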
Transformer on Rewind.ai
Every model on Rewind.ai is a transformer. The differences you see in the picker (context window, parameter count, speed) are different ways of scaling the same underlying architecture.
Quick Facts
| Term | Transformer |
| Related | Attention Mechanism, LLM (Large Language Model), Parameter |