
RAG done right: the patterns that survive production

Updated May 21, 2026 · 5 min read

The pipeline

Production RAG isn't a single component; it's a six-stage pipeline:

[Query]
  ↓
[Query rewrite / expansion (optional)]
  ↓
[Hybrid retrieval: vector + BM25, metadata-filtered]
  ↓
[Rerank top-k with cross-encoder]
  ↓
[Construct context window with chunks + provenance]
  ↓
[Generation model (Claude / GPT) with citation rendering]
  ↓
[Response with inline citations + eval-grade logging]

Each stage matters. Skip one and the system breaks in a different way.

Stage 1: query understanding

Don't pass the raw user query to retrieval. Pre-process:

  • Decontextualize: in multi-turn chat, the current question often references prior turns. Rewrite the query to be self-contained.
  • Expand: generate 2-3 paraphrases. Retrieve over the expanded set, dedupe results.
  • Classify: short vs. complex queries, factual vs. comparative. Route each class accordingly.

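For single-turn knowledge bases this stage is optional. For chat systems it can as much as double retrieval recall, because follow-up questions that lean on earlier turns retrieve poorly until they are rewritten.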
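The expand-and-dedupe step can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: the `Hit` shape and the `mergeExpandedResults` name are hypothetical, and it assumes each paraphrase has already been run through retrieval and returned a scored list.

```typescript
// Hypothetical shape for a retrieved chunk; adapt to your retriever's output.
interface Hit {
  chunkId: string;
  score: number; // higher is better
}

// Merge the result lists from several query paraphrases, keeping each
// chunk once with its best score, then sort by score descending.
function mergeExpandedResults(resultLists: Hit[][]): Hit[] {
  const best = new Map<string, Hit>();
  for (const hits of resultLists) {
    for (const hit of hits) {
      const seen = best.get(hit.chunkId);
      if (seen === undefined || hit.score > seen.score) {
        best.set(hit.chunkId, hit);
      }
    }
  }
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}
```

Keeping the best score per chunk is one reasonable choice; summing scores across paraphrases would instead reward chunks that several paraphrases agree on.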

Stage 2: hybrid retrieval

Pure vector search hits the same wall every time: it finds "semantically similar but actually irrelevant" chunks. Pure BM25 misses paraphrases.

Hybrid retrieval blends both:

const K = 20;      // candidates to pull from each retriever
const ALPHA = 0.7; // 70% vector, 30% BM25; tune per use case

const vectorResults = await pgvector.search(queryEmbedding, K);
const bm25Results = await postgres.search(queryText, K);
const merged = blendScores(vectorResults, bm25Results, ALPHA);
const filtered = filterByMetadata(merged, queryContext);

filterByMetadata is the unsung hero: if the query mentions "Q1 2026," filter to chunks from Q1 2026. If the user is in tenant X, filter to tenant X's documents. Metadata filters dramatically improve precision.
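One way the score blending could work, as a sketch under assumptions: vector and BM25 scores live on different scales, so this version min-max normalizes each list before taking the weighted sum. The `Scored` shape and the normalization choice are assumptions, not something the pipeline above prescribes; reciprocal rank fusion is a common alternative.

```typescript
interface Scored {
  id: string;
  score: number;
}

// Min-max normalize scores to [0, 1] so the two retrievers are comparable.
function normalize(results: Scored[]): Map<string, number> {
  const out = new Map<string, number>();
  if (results.length === 0) return out;
  const scores = results.map((r) => r.score);
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min;
  for (const r of results) {
    // If every score is identical, treat them all as 1.
    out.set(r.id, range === 0 ? 1 : (r.score - min) / range);
  }
  return out;
}

// alpha weights the vector score; (1 - alpha) weights BM25.
// A chunk missing from one list contributes 0 for that retriever.
function blendScores(vector: Scored[], bm25: Scored[], alpha: number): Scored[] {
  const v = normalize(vector);
  const b = normalize(bm25);
  const ids = new Set<string>(Array.from(v.keys()).concat(Array.from(b.keys())));
  return Array.from(ids)
    .map((id) => ({
      id,
      score: alpha * (v.get(id) ?? 0) + (1 - alpha) * (b.get(id) ?? 0),
    }))
    .sort((x, y) => y.score - x.score);
}
```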

Stage 3: reranking

Vector similarity ≠ relevance. The reranker is what closes that gap.

// Rerank the metadata-filtered candidates, not the unfiltered merge.
const reranked = await cohereRerank({
  query: queryText,
  documents: filtered.slice(0, 30).map(r => r.text),
  topN: 5,
});

Use Cohere rerank-v3 or a small cross-encoder fine-tuned on your domain. Either way, you go from "top 30 similar chunks" to "top 5 actually-relevant chunks."

In our pipelines, adding a reranker improves end-to-end answer quality by 10-30% on most evals.
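Since the reranker only receives chunk text, its output has to be joined back to the chunk records before context construction. The sketch below assumes the reranker returns index-plus-score pairs referring to positions in the submitted list; the `RerankResult` field names and `applyRerank` are hypothetical, so check them against whatever your reranker actually returns.

```typescript
interface Chunk {
  id: string;
  text: string;
}

// Assumed response shape: position in the submitted documents + a score.
interface RerankResult {
  index: number;
  relevanceScore: number;
}

type RankedChunk = Chunk & { relevanceScore: number };

// Map reranker results back onto the candidate chunks and keep the top N.
function applyRerank(candidates: Chunk[], results: RerankResult[], topN: number): RankedChunk[] {
  const sorted = results.slice().sort((a, b) => b.relevanceScore - a.relevanceScore);
  const out: RankedChunk[] = [];
  for (const r of sorted) {
    if (out.length >= topN) break;
    const chunk = candidates[r.index];
    if (chunk === undefined) continue; // ignore out-of-range indices
    out.push({ id: chunk.id, text: chunk.text, relevanceScore: r.relevanceScore });
  }
  return out;
}
```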

Stage 4: context construction

How you pack the top-k chunks into the context window matters more than people think.

Best practices:

  • Include provenance with every chunk. "Source: docs/billing.md, section 3.2." The model uses this to cite.
  • Order by relevance, not by source. Most relevant chunk first.
  • Cap context at roughly 50-80k tokens even when the window allows more. Models attend to the start and end of a long context more reliably than the middle.
  • Deduplicate near-identical chunks. If three sources say the same thing, keep one chunk and note "also confirmed by X, Y."
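The practices above could be combined in a small context builder like the one below. It is a sketch under stated assumptions: chunks arrive already sorted most-relevant first, "near-identical" is approximated by an exact match after case and whitespace normalization, and the four-characters-per-token estimate is a crude stand-in for a real tokenizer. All names here are hypothetical.

```typescript
interface SourcedChunk {
  source: string; // e.g. "docs/billing.md, section 3.2"
  text: string;
}

interface Entry {
  sources: string[];
  text: string;
}

// Very rough token estimate (about 4 characters per token).
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Build a numbered SOURCES block: dedupe, attach provenance, respect a budget.
function buildContext(chunks: SourcedChunk[], maxTokens: number): string {
  const byText = new Map<string, Entry>();
  const entries: Entry[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const key = chunk.text.trim().toLowerCase().replace(/\s+/g, " ");
    const existing = byText.get(key);
    if (existing !== undefined) {
      existing.sources.push(chunk.source); // duplicate: record the extra source only
      continue;
    }
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxTokens) break; // budget reached; drop less relevant chunks
    used += cost;
    const entry: Entry = { sources: [chunk.source], text: chunk.text };
    byText.set(key, entry);
    entries.push(entry);
  }
  return entries
    .map((e, i) => {
      const also = e.sources.length > 1 ? ` (also confirmed by ${e.sources.slice(1).join(", ")})` : "";
      return `[${i + 1}] Source: ${e.sources[0]}${also}\n${e.text}`;
    })
    .join("\n\n");
}
```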

Stage 5: generation with citation rendering

Instruct the model to cite. Validate the citations.

System prompt fragment:

Answer the question using only the provided sources. Cite sources
inline as [1], [2] etc. matching the numbers in the SOURCES section.
If the sources don't contain the answer, say so — don't make
something up.

SOURCES:
[1] docs/billing.md — "Customers on the Pro plan..."
[2] docs/refunds.md — "Refunds are processed within..."
[3] api/spec.yaml — "POST /v1/refunds..."

After generation, validate: every [N] citation maps to a source you actually provided. Hallucinated citations are filtered out before render.
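A minimal sketch of that validation step, assuming citations always appear as bracketed integers like [2] and that sources are numbered 1 through `sourceCount`. The function name and return shape are hypothetical.

```typescript
// Strip [N] markers that don't correspond to a provided source (1..sourceCount)
// and report which numbers were removed so they can be logged.
function validateCitations(
  answer: string,
  sourceCount: number,
): { text: string; invalid: number[] } {
  const invalid: number[] = [];
  const text = answer.replace(/\s?\[(\d+)\]/g, (match: string, num: string) => {
    const n = Number(num);
    if (n >= 1 && n <= sourceCount) return match; // valid: keep as written
    invalid.push(n);
    return ""; // hallucinated: drop the marker and its leading space
  });
  return { text, invalid };
}
```

Logging the `invalid` list alongside the trace makes the hallucinated citation rate easy to compute later.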

Stage 6: observability + evals

Every query, every retrieved chunk, every generated answer, every user feedback signal goes to Langfuse (or equivalent).

Per-trace fields:

  • Query (and rewritten query if applicable)
  • Retrieved chunk IDs + scores + ranks
  • Reranker scores
  • Final context window content
  • Model output + citations
  • User feedback (thumbs up/down, escalation)
  • End-to-end latency, cost

Plus an eval suite that runs offline:

  • 50-200 representative queries with reference answers
  • LLM-as-judge grader scoring each answer
  • Run on every prompt / model / chunking change
  • Surfaces regressions before they ship
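Surfacing regressions comes down to comparing judge scores between a baseline run and a candidate run. The sketch below assumes scores are normalized to the range 0 to 1 and keyed by a query ID; the `EvalScore` shape, `findRegressions`, and the 0.05 tolerance are all illustrative assumptions.

```typescript
interface EvalScore {
  queryId: string;
  score: number; // LLM-as-judge score, assumed to be in [0, 1]
}

// Return the IDs of queries whose score dropped by more than `tolerance`
// relative to the baseline. Queries absent from the baseline are skipped.
function findRegressions(
  baseline: EvalScore[],
  candidate: EvalScore[],
  tolerance: number = 0.05,
): string[] {
  const before = new Map<string, number>();
  for (const s of baseline) before.set(s.queryId, s.score);

  const regressed: string[] = [];
  for (const s of candidate) {
    const prev = before.get(s.queryId);
    if (prev !== undefined && s.score < prev - tolerance) {
      regressed.push(s.queryId);
    }
  }
  return regressed;
}
```

A CI job could fail the build whenever this list is non-empty, which keeps a prompt or chunking change from shipping with a silent quality drop.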

Common failure modes

Stale index

The vector store has 6-month-old chunks because the ingestion webhook broke. Nobody noticed because queries still return something. Fix: monitor ingestion lag from day one; alert on stale chunks.
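A staleness check can be as small as comparing the newest ingestion timestamp against a lag threshold. This sketch assumes the newest chunk's ingestion time can be queried from the store; the function name and the threshold are hypothetical.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// True when the most recently ingested chunk is older than the allowed lag.
function isIndexStale(newestIngestedAt: Date, now: Date, maxLagHours: number): boolean {
  const lagHours = (now.getTime() - newestIngestedAt.getTime()) / MS_PER_HOUR;
  return lagHours > maxLagHours;
}
```

Run on a schedule, a check like this turns a silently broken webhook into an alert within hours instead of months. It only makes sense for sources that change regularly; a rarely updated corpus needs a check against the source system's own modification times instead.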

Chunk size mismatch

Same chunk size for everything: 400 tokens. But your codebase has 2000-line files and your FAQ has 50-word entries. Fix: chunking strategy per document type.
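Per-type chunking can start as a lookup table of rules. The sketch below is illustrative only: the document types, the word counts, and the word-based sliding window are assumptions (words stand in for tokens, and real code chunking would more likely split on function or class boundaries).

```typescript
type DocType = "code" | "faq" | "prose";

interface ChunkingRule {
  maxWords: number;
  overlapWords: number;
}

// Illustrative values only; tune per corpus.
const CHUNKING_RULES: Record<DocType, ChunkingRule> = {
  code: { maxWords: 300, overlapWords: 30 },
  faq: { maxWords: 80, overlapWords: 0 },
  prose: { maxWords: 200, overlapWords: 20 },
};

// Split text into overlapping word windows using the rule for its type.
function chunkDocument(text: string, docType: DocType): string[] {
  const { maxWords, overlapWords } = CHUNKING_RULES[docType];
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = Math.max(1, maxWords - overlapWords);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxWords).join(" "));
    if (start + maxWords >= words.length) break; // last window reached the end
  }
  return chunks;
}
```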

No metadata filtering

The HR knowledge base also gets queries about engineering. The agent helpfully retrieves engineering docs. Fix: tag chunks with source + topic; filter by query context.
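A metadata filter along these lines might look like the following sketch. The `tenant` and `topic` tags and the `QueryContext` shape are assumptions; how the topic list is derived from the query (classifier, routing rules, UI state) is left open.

```typescript
interface TaggedChunk {
  id: string;
  tenant: string;
  topic: string;
}

interface QueryContext {
  tenant: string;
  topics?: string[]; // when omitted, no topic restriction is applied
}

// Keep only chunks from the caller's tenant and, if given, the allowed topics.
function filterByMetadata(chunks: TaggedChunk[], ctx: QueryContext): TaggedChunk[] {
  return chunks.filter(
    (c) =>
      c.tenant === ctx.tenant &&
      (ctx.topics === undefined || ctx.topics.indexOf(c.topic) !== -1),
  );
}
```

For tenant isolation specifically, pushing the filter into the retrieval query itself is safer than filtering afterwards, since post-filtering can leave too few candidates and relies on application code for access control.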

Hallucinated citations

The model generates [3] referring to a source you never gave it. Fix: validate every citation post-generation; strip the unverified ones.

Single-source bias

The top chunk by similarity dominates the answer; other relevant sources are ignored. Fix: encourage the model to synthesize across sources; check for citation diversity in evals.
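Citation diversity can be tracked with something as simple as counting the distinct sources an answer cites. This is a rough sketch that assumes the bracketed-integer citation format; the function name is hypothetical.

```typescript
// Count how many distinct source numbers an answer cites, e.g. "[1][2]" -> 2.
function distinctCitationCount(answer: string): number {
  const seen = new Set<number>();
  const pattern = /\[(\d+)\]/g;
  let m: RegExpExecArray | null = pattern.exec(answer);
  while (m !== null) {
    const raw = m[1];
    if (raw !== undefined) seen.add(Number(raw));
    m = pattern.exec(answer);
  }
  return seen.size;
}
```

An eval could flag answers that cite a single source even though several relevant chunks were in the context.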

Lost in the middle

Chunks placed in the middle of a long context window get under-attended. Fix: keep context tight (<50k tokens) and place the most relevant chunks at the start and end.
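One possible way to get the most relevant chunks to both ends is to alternate them between the front and the back of the list, so the least relevant land in the middle. This is a sketch of that idea, assuming the input is sorted most-relevant first; the function name is hypothetical.

```typescript
// Reorder a relevance-sorted list so rank 1 is first, rank 2 is last,
// rank 3 is second, rank 4 is second to last, and so on.
function orderForLongContext<T>(ranked: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  ranked.forEach((item, i) => {
    if (i % 2 === 0) {
      front.push(item);
    } else {
      back.unshift(item);
    }
  });
  return front.concat(back);
}
```

For ranks 1 through 5 this yields the order 1, 3, 5, 4, 2: the top chunk still leads, the runner-up closes, and the weakest chunk sits in the middle.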

What you should measure

Metric                        | What it tells you
------------------------------|----------------------------------------------
Retrieval recall@k            | Did the right chunk make the top-k?
Retrieval precision@k         | Of the top-k, how many are actually relevant?
Answer accuracy (LLM-graded)  | End-to-end quality
Citation coverage             | % of claims with a valid citation
Hallucinated citation rate    | Should be <1%
User CSAT (thumbs)            | Online signal
Escalation rate               | When the user gave up
Per-query cost                | Budget reality
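The two retrieval metrics are straightforward to compute once each eval query has a labeled set of relevant chunk IDs. A minimal sketch, assuming retrieved IDs are unique and ordered best-first (function names are illustrative):

```typescript
// Fraction of the relevant chunks that appear in the top k results.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return hits / relevant.size;
}

// Fraction of the top k results that are relevant. If fewer than k results
// came back, this divides by the number actually returned rather than by k.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  if (topK.length === 0) return 0;
  return topK.filter((id) => relevant.has(id)).length / topK.length;
}
```

With retrieved IDs a, b, c, d, relevant IDs a, c, e, and k = 3, both metrics come out to 2/3: two of the three relevant chunks were found, and two of the top three results were relevant.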

Stack we tend to reach for

Layer          | Default
---------------|--------------------------------------------------------------
Embeddings     | OpenAI text-embedding-3-large or Cohere embed-multilingual-v3
Vector store   | pgvector (if you have Postgres) / Pinecone / Vectorize
BM25           | Postgres full-text search or OpenSearch
Reranker       | Cohere rerank-v3
Generation     | Claude Sonnet 4.6 (default)
Observability  | Langfuse
Ingestion      | Webhooks + scheduled fallback sync

For a deeper look at the agent architecture above retrieval, see "How AI agents actually work." For why chatbots fail without proper RAG, see "Why your AI chatbot fails."

If you're building a RAG system and want a feasibility review, drop us a note.
