RAG (Retrieval-Augmented Generation)
Definition
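A technique in which the system retrieves relevant documents and adds them to the prompt before the model generates a response, so the answer is grounded in source material rather than in training data alone.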
Why It Matters
LLMs only know what they were trained on, and their context windows are finite. RAG is the standard workaround: index your documents into a vector database at ingest time, retrieve the top-K relevant chunks per query, and prepend them to the prompt. The model gets fresh, source-grounded context without retraining.
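The "prepend them to the prompt" step can be sketched as below. This is a minimal illustration, not any particular product's implementation: the function name `build_augmented_prompt`, the instruction wording, and the sample chunks are all assumptions made for the example.

```python
def build_augmented_prompt(question: str, chunks: list[str]) -> str:
    """Prepend numbered source chunks to the user's question (hypothetical helper)."""
    # Number each chunk so the answer can point back to a specific source.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


# Sample chunks stand in for whatever the retrieval step returned.
prompt = build_augmented_prompt(
    "How do I reset the device?",
    [
        "Hold the power button for 10 seconds to reset.",
        "The warranty lasts two years.",
    ],
)
print(prompt)
```

Numbering the chunks is a design choice that makes it easier to show the user which source an answer came from.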
Key Points
- RAG pipeline steps: chunk → embed → index (vector store) → query → retrieve top-K → augment prompt → generate.
- Common chunk sizes: 256–512 tokens with 20 % overlap. Larger chunks carry more context per result; smaller chunks improve retrieval precision.
- Vector stores: Pinecone, Weaviate, Qdrant, Chroma (local), and pgvector (Postgres). All support approximate nearest-neighbour (ANN) search.
- Reranking: a cross-encoder model scores each retrieved chunk against the query before the chunks are sent to the LLM. This typically improves answer accuracy by 10–20 % over retrieval alone.
- Hybrid retrieval (dense vector + sparse BM25 keyword) consistently outperforms either alone for real-world document Q&A.
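The chunk, embed, index, and retrieve steps listed above can be sketched in a few lines, under loud assumptions: a bag-of-words `Counter` stands in for a real embedding model, a plain Python list stands in for the vector store, and an exhaustive cosine-similarity scan stands in for ANN search. All function names here are hypothetical.

```python
import math
from collections import Counter


def chunk_tokens(tokens: list[str], size: int = 256, overlap: float = 0.2) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing ~`overlap` of its tokens with the previous one."""
    step = max(1, size - int(size * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real pipeline would call an embedding model here."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_top_k(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Exhaustive scan over the index; a vector store would use ANN search instead."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Chunking: 10 tokens, windows of 5 with 20 % overlap (1 shared token).
windows = chunk_tokens([f"t{i}" for i in range(10)], size=5, overlap=0.2)

# Indexing and retrieval over three tiny sample "chunks".
docs = [
    "reset the device by holding the power button",
    "the warranty covers two years",
    "charge the battery with the supplied cable",
]
index = [(doc, embed(doc)) for doc in docs]
top = retrieve_top_k("how to reset the device", index, k=1)
print(windows)
print(top)
```

The retrieved chunks would then be passed to the augment and generate steps. Reranking and hybrid BM25 retrieval are left out to keep the sketch short.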
Example
Upload a 200-page product manual to a chat. Without RAG, you would paste 50 pages at a time and hit context limits. With RAG, the system retrieves the 5 most relevant paragraphs for each question and answers from those. The retrieved text fits any context window, and the source chunks stay visible so you can double-check the answer.
Common Misconception
RAG is not a magic accuracy fix. Poorly chunked, low-quality, or outdated source documents produce low-quality retrievals regardless of embedding model or vector store choice. The quality of your ingestion pipeline (chunking strategy, metadata extraction, deduplication) determines RAG quality at least as much as the retrieval mechanism.
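As one small illustration of the ingestion-side work mentioned above, here is a sketch of deduplication that drops chunks whose text is identical after case and whitespace normalization. The helper name `dedupe_chunks` and the normalization rule are assumptions for the example. Real pipelines often also need near-duplicate detection, which this does not attempt.

```python
import hashlib


def dedupe_chunks(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each chunk, comparing on normalized text (hypothetical helper)."""
    seen: set[str] = set()
    unique: list[str] = []
    for chunk in chunks:
        # Normalize case and collapse whitespace so trivial variants hash the same.
        normalized = " ".join(chunk.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


cleaned = dedupe_chunks([
    "Hold the button.",
    "hold  the BUTTON.",
    "Warranty: two years.",
])
print(cleaned)
```

Without a step like this, repeated boilerplate in the source documents can occupy several of the top-K slots with the same text.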
Related Terms
- Embedding: A numerical representation of text, images, or other data that AI models can process and compare.
- Context Window: The maximum amount of text an AI model can process at once, measured in tokens. GPT-4o, for example, has a 128K-token context window.
- Hallucination: When an AI model generates false or fabricated information that sounds confident and plausible.
RAG (Retrieval-Augmented Generation) on Rewind.ai
The file-upload feature in chat is RAG. Upload a PDF, ask questions, get answers grounded in the document with the relevant chunks visible. The same primitive powers the search tool's citations.
Quick Facts
| Term | RAG (Retrieval-Augmented Generation) |
| Related | Embedding, Context Window, Hallucination |