RAG done right: the patterns that survive production
The pipeline
Production RAG isn't a single component, it's a six-stage pipeline:
[Query]
↓
[Query rewrite / expansion (optional)]
↓
[Hybrid retrieval: vector + BM25, metadata-filtered]
↓
[Rerank top-k with cross-encoder]
↓
[Construct context window with chunks + provenance]
↓
[Generation model (Claude / GPT) with citation rendering]
↓
[Response with inline citations + eval-grade logging]
Each stage matters. Skip one and the system breaks in a different way.
Stage 1: query understanding
Don't pass the raw user query to retrieval. Pre-process:
- Decontextualize: in multi-turn chat, the current question often references prior turns. Rewrite the query to be self-contained.
- Expand: generate 2-3 paraphrases. Retrieve over the expanded set, dedupe results.
- Classify: short queries vs complex, factual vs comparative. Route accordingly.
For single-turn knowledge bases this stage is optional. For chat systems it doubles retrieval recall.
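The expand-and-dedupe step can be sketched roughly as follows. This assumes the paraphrases have already been generated and searched upstream. The `Hit` shape and the `dedupeHits` name are hypothetical, chosen for illustration. A chunk returned by several paraphrases keeps its best score.

```typescript
// Hypothetical result shape; field names are assumptions, not from any library.
interface Hit {
  chunkId: string;
  score: number;
}

// Merge the result lists from each paraphrase, keeping the best score per chunk.
function dedupeHits(resultSets: Hit[][]): Hit[] {
  const best = new Map<string, Hit>();
  for (const hits of resultSets) {
    for (const hit of hits) {
      const seen = best.get(hit.chunkId);
      if (seen === undefined || hit.score > seen.score) {
        best.set(hit.chunkId, hit);
      }
    }
  }
  // Highest score first.
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}

// Made-up results for the original query and one paraphrase.
const mergedHits = dedupeHits([
  [{ chunkId: "a", score: 0.9 }, { chunkId: "b", score: 0.4 }],
  [{ chunkId: "b", score: 0.7 }, { chunkId: "c", score: 0.5 }],
]);
```

Keeping the maximum score is one choice among several; summing scores would instead reward chunks that match many paraphrases.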
Stage 2: hybrid retrieval
Pure vector search hits the same wall every time: it finds "semantically similar but actually irrelevant" chunks. Pure BM25 misses paraphrases.
Hybrid retrieval blends both:
// pgvector and postgres stand in for your own search clients
const vectorResults = await pgvector.search(queryEmbedding, { k: 20 });
const bm25Results = await postgres.search(queryText, { k: 20 });
const alpha = 0.7; // 70% vector, 30% BM25; tune per use case
const merged = blendScores(vectorResults, bm25Results, alpha);
const filtered = filterByMetadata(merged, queryContext);
filterByMetadata is the unsung hero: if the query mentions "Q1 2026," filter to chunks from Q1 2026. If the user is in tenant X, filter to tenant X's documents. Metadata filters dramatically improve precision.
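One way a `blendScores` helper could work, as a sketch: min-max normalize each result list so cosine similarities and BM25 scores land on the same scale, then take the weighted sum. The `ScoredChunk` shape and the normalization choice are assumptions, since the snippet above does not specify either. Reciprocal rank fusion is a common alternative.

```typescript
// Hypothetical result shape for illustration.
interface ScoredChunk {
  id: string;
  score: number;
}

// Rescale scores to [0, 1]; if all scores are equal, every chunk maps to 1.
function normalize(results: ScoredChunk[]): Map<string, number> {
  const scores = results.map((r) => r.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const normalized = new Map<string, number>();
  for (const r of results) {
    normalized.set(r.id, max === min ? 1 : (r.score - min) / (max - min));
  }
  return normalized;
}

// alpha weights the vector score; (1 - alpha) weights BM25.
// A chunk missing from one list contributes 0 for that list.
function blendScores(vector: ScoredChunk[], bm25: ScoredChunk[], alpha: number): ScoredChunk[] {
  const v = normalize(vector);
  const b = normalize(bm25);
  const ids = new Set<string>(vector.map((r) => r.id).concat(bm25.map((r) => r.id)));
  const blended: ScoredChunk[] = [];
  ids.forEach((id) => {
    const score = alpha * (v.get(id) ?? 0) + (1 - alpha) * (b.get(id) ?? 0);
    blended.push({ id, score });
  });
  return blended.sort((x, y) => y.score - x.score);
}

// Made-up scores: cosine similarities on the left, raw BM25 on the right.
const blended = blendScores(
  [{ id: "a", score: 0.8 }, { id: "b", score: 0.6 }, { id: "c", score: 0.4 }],
  [{ id: "b", score: 12 }, { id: "d", score: 4 }],
  0.7,
);
```

Note that min-max normalization gives the lowest-scoring chunk in each list a 0, so chunk "b", which appears in both lists, ends up just behind the top vector hit "a".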
Stage 3: reranking
Vector similarity ≠ relevance. The reranker is what closes that gap.
const reranked = await cohereRerank({
  query: queryText,
  // rerank the metadata-filtered candidates from stage 2, not the raw merge
  documents: filtered.slice(0, 30).map(r => r.text),
  topN: 5,
});
Use Cohere rerank-v3 or a small cross-encoder fine-tuned on your domain. Either way, you go from "top 30 similar chunks" to "top 5 actually-relevant chunks."
In our pipelines, adding a reranker improves end-to-end answer quality by 10-30% on most evals.
Stage 4: context construction
How you pack the top-k chunks into the context window matters more than people think.
Best practices:
- Include provenance with every chunk. "Source: docs/billing.md, section 3.2." The model uses this to cite.
- Order by relevance, not by source. Most relevant chunk first.
- Stay below 50-80k tokens of context even when you can do more. The model attends to the start and end more reliably than the middle.
- Deduplicate near-identical chunks. If three sources say the same thing, one chunk + "also confirmed by X, Y."
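The practices above can be sketched in one packing function. Everything here is an assumption for illustration: the `RankedChunk` shape, the word-set Jaccard check as a crude stand-in for real near-duplicate detection, and the rough estimate of four characters per token.

```typescript
interface RankedChunk {
  text: string;
  source: string; // provenance label shown to the model
  relevance: number;
}

// Word-set Jaccard similarity between two chunk texts.
function jaccard(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const wb = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  let shared = 0;
  wa.forEach((w) => {
    if (wb.has(w)) shared += 1;
  });
  const union = wa.size + wb.size - shared;
  return union === 0 ? 0 : shared / union;
}

function buildContext(chunks: RankedChunk[], maxTokens: number, dupThreshold = 0.8): string {
  // Most relevant first.
  const sorted = chunks.slice().sort((x, y) => y.relevance - x.relevance);
  const kept: { chunk: RankedChunk; alsoIn: string[] }[] = [];
  let used = 0;
  for (const chunk of sorted) {
    // Fold near-duplicates into the chunk already kept.
    const dup = kept.find((k) => jaccard(k.chunk.text, chunk.text) >= dupThreshold);
    if (dup) {
      dup.alsoIn.push(chunk.source);
      continue;
    }
    const cost = Math.ceil(chunk.text.length / 4); // rough token estimate
    if (used + cost > maxTokens) break; // stop at the first chunk that does not fit
    used += cost;
    kept.push({ chunk, alsoIn: [] });
  }
  return kept
    .map((k, i) => {
      const also = k.alsoIn.length > 0 ? ` (also confirmed by ${k.alsoIn.join(", ")})` : "";
      return `[${i + 1}] Source: ${k.chunk.source}${also}\n${k.chunk.text}`;
    })
    .join("\n\n");
}

// Made-up chunks; the second is a near-duplicate of the first.
const context = buildContext(
  [
    { text: "pro plan includes priority support", source: "docs/a.md", relevance: 0.9 },
    { text: "Pro plan includes priority support", source: "docs/b.md", relevance: 0.8 },
    { text: "invoices are issued monthly", source: "docs/c.md", relevance: 0.5 },
  ],
  1000,
);
```

In a real system you would likely count tokens with the generation model's tokenizer and compare embeddings rather than word sets.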
Stage 5: generation with citation rendering
Instruct the model to cite. Validate the citations.
System prompt fragment:
Answer the question using only the provided sources. Cite sources
inline as [1], [2] etc. matching the numbers in the SOURCES section.
If the sources don't contain the answer, say so — don't make
something up.
SOURCES:
[1] docs/billing.md — "Customers on the Pro plan..."
[2] docs/refunds.md — "Refunds are processed within..."
[3] api/spec.yaml — "POST /v1/refunds..."
After generation, validate: every [N] citation maps to a source you actually provided. Hallucinated citations are filtered out before render.
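A minimal sketch of that validation step, assuming citations appear as `[N]` markers and sources are numbered from 1 to `sourceCount`. The `validateCitations` name and return shape are hypothetical.

```typescript
// Strips citation markers that do not map to a provided source
// and reports which numbers were removed.
function validateCitations(
  answer: string,
  sourceCount: number,
): { text: string; invalid: number[] } {
  const invalid: number[] = [];
  const text = answer.replace(/\s?\[(\d+)\]/g, (match: string, digits: string) => {
    const n = Number(digits);
    if (n >= 1 && n <= sourceCount) return match; // keep valid markers untouched
    invalid.push(n);
    return ""; // drop the marker and the space before it
  });
  return { text, invalid };
}

// Made-up answer: three sources were provided, so [7] cannot be valid.
const checked = validateCitations(
  "Refunds go through the API [3]. Pro users get priority [7].",
  3,
);
// checked.text    → "Refunds go through the API [3]. Pro users get priority."
// checked.invalid → [7]
```

This only confirms that each marker points at a source you supplied. Whether the cited source actually supports the claim is a separate check, better suited to the offline evals.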
Stage 6: observability + evals
Every query, every retrieved chunk, every generated answer, every user feedback signal goes to Langfuse (or equivalent).
Per-trace fields:
- Query (and rewritten query if applicable)
- Retrieved chunk IDs + scores + ranks
- Reranker scores
- Final context window content
- Model output + citations
- User feedback (thumbs up/down, escalation)
- End-to-end latency, cost
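As a sketch, the per-trace fields above could be captured in a record like the one below. The field names are hypothetical and are not Langfuse's schema; map them onto whatever your observability tool expects.

```typescript
// Hypothetical trace record mirroring the per-trace fields listed above.
interface RagTrace {
  query: string;
  rewrittenQuery?: string; // only set when query rewriting ran
  retrieved: { chunkId: string; score: number; rank: number; rerankScore?: number }[];
  contextWindow: string;
  output: string;
  citations: number[];
  feedback?: "up" | "down" | "escalated";
  latencyMs: number;
  costUsd: number;
}

// Made-up values for illustration only.
const exampleTrace: RagTrace = {
  query: "how do refunds work?",
  rewrittenQuery: "How are refunds processed for Pro plan customers?",
  retrieved: [
    { chunkId: "chunk-1", score: 0.82, rank: 1, rerankScore: 0.95 },
    { chunkId: "chunk-2", score: 0.79, rank: 2, rerankScore: 0.4 },
  ],
  contextWindow: "[1] Source: docs/refunds.md\n...",
  output: "Refunds are handled through the refunds endpoint [1].",
  citations: [1],
  latencyMs: 1840,
  costUsd: 0.012,
};
```

Storing chunk IDs rather than full chunk text keeps traces small, as long as the chunks themselves are versioned somewhere you can look them up.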
Plus an eval suite that runs offline:
- 50-200 representative queries with reference answers
- LLM-as-judge grader scoring each answer
- Run on every prompt / model / chunking change
- Surfaces regressions before they ship
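The regression check at the end of that list might look like this in its simplest form: compare each query's graded score against a stored baseline and flag drops beyond a tolerance. The `EvalResult` shape, the 0-to-1 score scale, and the 0.05 default are assumptions.

```typescript
// Hypothetical eval output: one graded score per query, from 0 to 1.
interface EvalResult {
  queryId: string;
  score: number;
}

// Returns the IDs of queries whose score dropped by more than `tolerance`.
// Queries with no baseline entry are skipped.
function findRegressions(
  baseline: EvalResult[],
  candidate: EvalResult[],
  tolerance = 0.05,
): string[] {
  const before = new Map<string, number>();
  for (const r of baseline) {
    before.set(r.queryId, r.score);
  }
  return candidate
    .filter((r) => {
      const prev = before.get(r.queryId);
      return prev !== undefined && r.score < prev - tolerance;
    })
    .map((r) => r.queryId);
}

// Made-up scores: q2 drops from 0.8 to 0.6, q3 is new.
const regressions = findRegressions(
  [{ queryId: "q1", score: 0.9 }, { queryId: "q2", score: 0.8 }],
  [{ queryId: "q1", score: 0.9 }, { queryId: "q2", score: 0.6 }, { queryId: "q3", score: 0.1 }],
);
```

A tolerance is useful because LLM-graded scores are noisy; averaging the grader over several runs is another way to reduce false alarms.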
Common failure modes
Stale index
The vector store has 6-month-old chunks because the ingestion webhook broke. Nobody noticed because queries still return something. Fix: monitor ingestion lag from day one; alert on stale chunks.
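A sketch of the lag check, assuming you record a last-successful-ingestion timestamp per source. The `SourceStatus` shape and the 24-hour threshold in the example are hypothetical.

```typescript
// Hypothetical per-source ingestion status.
interface SourceStatus {
  source: string;
  lastIngestedAt: Date;
}

// Returns the sources whose last successful ingestion is older than maxLagHours.
function findStaleSources(statuses: SourceStatus[], now: Date, maxLagHours: number): string[] {
  const msPerHour = 60 * 60 * 1000;
  return statuses
    .filter((s) => (now.getTime() - s.lastIngestedAt.getTime()) / msPerHour > maxLagHours)
    .map((s) => s.source);
}

// Made-up timestamps: "docs" is 48 hours behind, "wiki" is 6 hours behind.
const stale = findStaleSources(
  [
    { source: "docs", lastIngestedAt: new Date("2026-01-01T00:00:00Z") },
    { source: "wiki", lastIngestedAt: new Date("2026-01-02T18:00:00Z") },
  ],
  new Date("2026-01-03T00:00:00Z"),
  24,
);
```

Run a check like this on a schedule and alert on a non-empty result, so a silent webhook failure surfaces within hours rather than months.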
Chunk size mismatch
Same chunk size for everything: 400 tokens. But your codebase has 2000-line files and your FAQ has 50-word entries. Fix: chunking strategy per document type.
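A per-type chunking strategy can start as a small config table plus a generic splitter. The document types, sizes, and overlaps below are illustrative assumptions to tune, and the splitter counts words rather than tokens to stay dependency-free.

```typescript
type DocType = "code" | "faq" | "prose";

// Illustrative sizes in words; these numbers are assumptions, not recommendations.
const chunkConfig: Record<DocType, { size: number; overlap: number }> = {
  code: { size: 300, overlap: 50 },
  faq: { size: 80, overlap: 0 },
  prose: { size: 200, overlap: 30 },
};

// Sliding-window split by words, with the window and overlap chosen per document type.
function chunkDocument(text: string, docType: DocType): string[] {
  const { size, overlap } = chunkConfig[docType];
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

// A synthetic 200-word document.
const sampleText = Array.from({ length: 200 }, (_, i) => `w${i}`).join(" ");
const faqChunks = chunkDocument(sampleText, "faq");
const proseChunks = chunkDocument(sampleText, "prose");
```

For real code files, splitting on function or class boundaries usually works better than a fixed word window; the config table is where that per-type logic would hook in.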
No metadata filtering
The HR knowledge base also gets queries about engineering. The agent helpfully retrieves engineering docs. Fix: tag chunks with source + topic; filter by query context.
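A minimal sketch of such a filter, assuming each chunk carries `tenant` and `topic` tags and the query context names the tenant and an optional topic allow-list. These field names are hypothetical.

```typescript
// Hypothetical chunk tags and query context.
interface TaggedChunk {
  id: string;
  tenant: string;
  topic: string;
}

interface QueryContext {
  tenant: string;
  topics?: string[]; // if omitted, any topic is allowed
}

// Keep only chunks from the caller's tenant and, when given, the allowed topics.
function filterByMetadata(chunks: TaggedChunk[], ctx: QueryContext): TaggedChunk[] {
  return chunks.filter(
    (c) =>
      c.tenant === ctx.tenant &&
      (ctx.topics === undefined || ctx.topics.indexOf(c.topic) !== -1),
  );
}

// Made-up chunks: an HR query should not see engineering docs or another tenant.
const hrOnly = filterByMetadata(
  [
    { id: "1", tenant: "acme", topic: "hr" },
    { id: "2", tenant: "acme", topic: "engineering" },
    { id: "3", tenant: "other", topic: "hr" },
  ],
  { tenant: "acme", topics: ["hr"] },
);
```

Filtering after retrieval, as here, can leave you with too few results. Where the store supports it, pushing the same conditions into the search query itself is usually the safer design, especially for tenant isolation.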
Hallucinated citations
The model generates [3] referring to a source you never gave it. Fix: validate every citation post-generation; strip the unverified ones.
Single-source bias
The top chunk by similarity dominates the answer; other relevant sources are ignored. Fix: encourage the model to synthesize across sources; check for citation diversity in evals.
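One rough way to check citation diversity in evals is the share of provided sources that the answer cites at least once. This metric definition is an assumption for illustration, not a standard.

```typescript
// Fraction of provided sources (numbered 1..sourceCount) cited at least once.
function citationDiversity(answer: string, sourceCount: number): number {
  if (sourceCount === 0) return 0;
  const cited = new Set<number>();
  const pattern = /\[(\d+)\]/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(answer)) !== null) {
    const n = Number(m[1]);
    if (n >= 1 && n <= sourceCount) cited.add(n); // ignore out-of-range markers
  }
  return cited.size / sourceCount;
}

// Made-up answer citing two of four sources.
const diversity = citationDiversity("A [1]. B [1]. C [2].", 4);
// diversity → 0.5
```

A low value is not always a problem, since some questions really are answered by one source. Track it as a trend across the eval set rather than as a per-answer pass/fail.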
Lost in the middle
Chunks placed in the middle of a long context window get under-attended. Fix: keep context tight (under 50k tokens where you can) and put the most relevant chunks at the start and end.
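A sketch of the start-and-end placement, assuming the input is already ranked best-first. The `reorderForEdges` name is hypothetical. Odd ranks fill from the front and even ranks from the back, so the weakest chunks land in the middle.

```typescript
// Interleave a best-first ranking so the strongest items sit at both edges.
function reorderForEdges<T>(rankedBestFirst: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  rankedBestFirst.forEach((item, i) => {
    if (i % 2 === 0) {
      front.push(item); // ranks 1, 3, 5, ... go to the front half
    } else {
      back.unshift(item); // ranks 2, 4, ... build the back half from the end
    }
  });
  return front.concat(back);
}

// Ranks 1..5, where 1 is the most relevant.
const placed = reorderForEdges([1, 2, 3, 4, 5]);
// placed → [1, 3, 5, 4, 2]
```

If you number sources for citation, assign the numbers after reordering so the `[N]` markers still match the SOURCES section the model sees.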
What you should measure
| Metric | What it tells you |
|---|---|
| Retrieval recall@k | Did the right chunk make the top-k? |
| Retrieval precision@k | Of the top-k, how many are actually relevant? |
| Answer accuracy (LLM-graded) | End-to-end quality |
| Citation coverage | % of claims with a valid citation |
| Hallucinated citation rate | % of citations pointing to a source you never provided; should be <1% |
| User CSAT (thumbs) | Online quality signal from real users |
| Escalation rate | How often the user gave up |
| Per-query cost | Budget reality |
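The two retrieval metrics in the table are simple to compute once you have labeled relevant chunk IDs per query. A minimal sketch, assuming `retrieved` is ordered best-first with unique IDs:

```typescript
// Share of the relevant chunks that appear in the top k results.
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = retrieved.slice(0, k);
  const hits = relevant.filter((id) => topK.indexOf(id) !== -1).length;
  return hits / relevant.length;
}

// Share of the top k results that are relevant.
// If fewer than k results came back, the denominator is the number returned.
function precisionAtK(retrieved: string[], relevant: string[], k: number): number {
  const topK = retrieved.slice(0, k);
  if (topK.length === 0) return 0;
  const hits = topK.filter((id) => relevant.indexOf(id) !== -1).length;
  return hits / topK.length;
}

// Made-up labels: "z" is relevant but was never retrieved.
const retrievedIds = ["a", "b", "c", "d", "e"];
const relevantIds = ["b", "e", "z"];
const recall5 = recallAtK(retrievedIds, relevantIds, 5); // 2 of 3 relevant found
const precision5 = precisionAtK(retrievedIds, relevantIds, 5); // 2 of 5 results relevant
```

Measuring both before and after the reranker shows whether a quality problem comes from retrieval missing the chunk or from the reranker dropping it.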
Stack we tend to reach for
| Layer | Default |
|---|---|
| Embeddings | OpenAI text-embedding-3-large or Cohere embed-multilingual-v3 |
| Vector store | pgvector (if you have Postgres) / Pinecone / Vectorize |
| BM25 | Postgres full-text search or OpenSearch |
| Reranker | Cohere rerank-v3 |
| Generation | Claude Sonnet 4.6 (default) |
| Observability | Langfuse |
| Ingestion | Webhooks + scheduled fallback sync |
For a deeper look at the agent architecture above retrieval, see How AI agents actually work. For why chatbots fail without proper RAG, see Why your AI chatbot fails.
If you're building a RAG system and want a feasibility review, drop us a note.