Retrieval-augmented generation
RAG (rhymes with bag)
Fetching relevant data and feeding it to an LLM so the response is grounded in real, current information instead of training data alone.
RAG is a pattern where you fetch relevant documents first, then pass them to an LLM along with the user's question. The model generates its answer based on the retrieved content, not just its training data. This solves two problems: the model stays current (training data has a cutoff), and the model stays accurate (it answers from your actual documents, not its memory, reducing hallucinations).
The typical RAG pipeline works like this. A user asks a question. Your system converts the question into an embedding. It searches a vector database for the most similar documents. It passes those documents plus the question to the LLM. The LLM generates an answer grounded in the retrieved content.
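Those steps can be sketched in a few lines of Python. Everything here is a stand-in: the "embedding" is a toy bag-of-words counter and the "vector database" is an in-memory list, where a real pipeline would call an embedding model and a vector store, then send the assembled prompt to an LLM.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy embedding: word counts. A real pipeline would call an
    # embedding model and get back a dense vector instead.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Similarity between two embeddings; higher means more related.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stand-in for a vector database: each document stored with its embedding.
docs = [
    "Row Level Security restricts which rows each user can read or write.",
    "Edge Functions run server-side code close to your users.",
]
index = [(d, embed(d)) for d in docs]

def retrieve(question, k=1):
    # Search step: rank documents by similarity to the question.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

def build_prompt(question):
    # The retrieved documents plus the question become the LLM prompt;
    # the model's answer is grounded in this context, not its memory.
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Calling `build_prompt("How do I set up Row Level Security?")` retrieves the Row Level Security document, not the Edge Functions one, and hands both context and question to the model in a single prompt.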
Every company building an AI-powered support bot, documentation assistant, or internal knowledge tool is using some version of RAG. Vercel's AI SDK has RAG built in. LangChain and LlamaIndex are frameworks specifically designed for RAG pipelines. If your documentation is well-structured, RAG makes your product easier for AI to recommend accurately.
Examples
An AI-powered documentation assistant.
Supabase built an AI assistant that answers questions about their platform. When a developer asks "How do I set up Row Level Security?", the system retrieves the relevant docs pages, passes them to the LLM, and generates a step-by-step answer with links to the original documentation.
A sales team knowledge base.
A sales rep asks the internal AI: "What is our competitive positioning against Datadog?" RAG retrieves the latest battle card, recent win/loss reports, and pricing comparisons. The LLM synthesizes a concise answer from current internal documents, not outdated training data.
A customer support chatbot with current data.
Without RAG, the bot only knows what was in its training data. With RAG, it retrieves the customer's account info, recent support tickets, and current product docs before answering. The response is specific, current, and grounded in real data.
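A minimal sketch of that grounding step, with all names and data hypothetical: the retrieved account record, tickets, and doc snippets are simply assembled into the prompt at request time, before the model is called.

```python
def build_support_prompt(question, account, tickets, doc_snippets):
    # Everything below comes from live systems at request time,
    # not from the model's training data.
    context_parts = [
        "Account: " + ", ".join(f"{k}={v}" for k, v in account.items()),
        "Recent tickets:\n" + "\n".join(f"- {t}" for t in tickets),
        "Docs:\n" + "\n".join(f"- {d}" for d in doc_snippets),
    ]
    return (
        "Answer the customer's question using only the context below.\n\n"
        + "\n\n".join(context_parts)
        + f"\n\nQuestion: {question}"
    )

# Hypothetical request-time data for illustration.
prompt = build_support_prompt(
    "Why was I charged twice this month?",
    account={"plan": "Pro", "renewal": "2024-06-01"},
    tickets=["#4312: duplicate invoice reported"],
    doc_snippets=["Billing retries can create a pending duplicate "
                  "charge that is voided within 3 days."],
)
```

The resulting prompt contains the customer's actual plan, their open ticket, and the current billing doc, so the model's answer is specific to this customer rather than a generic guess.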
Frequently asked questions
When should I use RAG versus fine-tuning?
Use RAG when you need current, factual answers from specific documents. Use fine-tuning when you need the model to learn a new style, format, or domain-specific behavior. RAG is for knowledge. Fine-tuning is for skill. Many production systems use both.
Related terms
Embeddings. Numerical representations of text that capture semantic meaning. Two similar sentences produce similar numbers, enabling AI-powered search.
Vector database. A database optimized for storing and searching embeddings. The backbone of every RAG pipeline and semantic search system.
Large language model (LLM). A neural network trained on massive text data to generate and understand language. The technology behind ChatGPT, Claude, and Gemini.
Context window. The maximum amount of text an LLM can process in a single request. Measured in tokens. Bigger windows handle more information at once.
Hallucination. When an AI model generates confident but factually incorrect output. It sounds right. It reads well. It is wrong.

Want the complete playbook?
Picks and Shovels is the definitive guide to developer marketing. Amazon #1 bestseller with practical strategies from 30 years of marketing to developers.