Vector search vs BM25 vs hybrid?

Vector search wins on semantic matches (paraphrases, synonyms). BM25 wins on keyword precision (exact terms, codes, names). Hybrid wins in production: you take the union of both result sets and blend the scores. Most production RAG pipelines we ship are hybrid by default.

What chunk size should I use?

It depends on the document type. Conversational docs: 200-400 tokens with 50-token overlap. Technical docs (with code or tables): 400-800 tokens. Legal / contract: chunk by clause or section, not by token count. The right answer comes from testing chunk sizes on your eval set and picking the one that maximises retrieval recall.

Do I need a reranker?

Almost always yes. Vector search returns the top-k by similarity, but similarity ≠ relevance. A cross-encoder reranker (Cohere rerank or a small LLM) re-orders the top-k by relevance to the query and dramatically improves precision. Without reranking, your retrieval surfaces lots of 'similar but not actually helpful' chunks.

Should I fine-tune the embedding model?

Rarely. OpenAI text-embedding-3-large or Cohere embed-multilingual-v3 work well out of the box for most use cases. Fine-tuning helps when your domain has heavy jargon the general models don't handle well. Even then, try keyword filtering first.

RAG done right: the patterns that survive production

May 7, 2026 · updated May 21, 2026 · 5 min read

The pipeline

Production RAG isn't a single component; it's a six-stage pipeline:

[Query]
  ↓
[Query rewrite / expansion (optional)]
  ↓
[Hybrid retrieval: vector + BM25, metadata-filtered]
  ↓
[Rerank top-k with cross-encoder]
  ↓
[Construct context window with chunks + provenance]
  ↓
[Generation model (Claude / GPT) with citation rendering]
  ↓
[Response with inline citations + eval-grade logging]

Each stage matters. Skip one and the system breaks in a different way.

Stage 1: query understanding

Don't pass the raw user query to retrieval. Pre-process:

Decontextualize: in multi-turn chat, the current question often references prior turns. Rewrite the query to be self-contained.
Expand: generate 2-3 paraphrases. Retrieve over the expanded set, dedupe results.
Classify: short queries vs complex, factual vs comparative. Route accordingly.

For single-turn knowledge bases this stage is optional. For chat systems it doubles retrieval recall.
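
The expand-and-dedupe step can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the `Hit` shape, the `Retriever` type, and `retrieveExpanded` are hypothetical names, and the retriever is synchronous only to keep the example short (a real one would be async).

```typescript
// Hypothetical shapes for illustration; not a real library API.
interface Hit {
  chunkId: string;
  score: number; // higher is better
}

// Any function that maps a query string to scored hits.
type Retriever = (query: string) => Hit[];

// Retrieve over the original query plus its paraphrases, then dedupe
// by chunk ID, keeping the best score seen for each chunk.
function retrieveExpanded(
  query: string,
  paraphrases: string[],
  retrieve: Retriever,
  k: number,
): Hit[] {
  const best = new Map<string, Hit>();
  for (const q of [query, ...paraphrases]) {
    for (const hit of retrieve(q)) {
      const seen = best.get(hit.chunkId);
      if (seen === undefined || hit.score > seen.score) {
        best.set(hit.chunkId, hit);
      }
    }
  }
  return Array.from(best.values())
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Keeping the maximum score per chunk is one choice among several; summing scores across paraphrases would instead favor chunks that every paraphrase agrees on.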

Stage 2: hybrid retrieval

Pure vector search hits the same wall every time: it finds "semantically similar but actually irrelevant" chunks. Pure BM25 misses paraphrases.

Hybrid retrieval blends both:

// pgvector.search, postgres.search, blendScores and filterByMetadata are assumed to be your own app-level helpers
const vectorResults = await pgvector.search(queryEmbedding, { k: 20 });
const bm25Results = await postgres.search(queryText, { k: 20 });
// alpha = 0.7 means 70% vector, 30% BM25; tune per use case
const merged = blendScores(vectorResults, bm25Results, 0.7);
const filtered = filterByMetadata(merged, queryContext);

filterByMetadata is the unsung hero: if the query mentions "Q1 2026," filter to chunks from Q1 2026. If the user is in tenant X, filter to tenant X's documents. Metadata filters dramatically improve precision.
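
For readers wondering what a score-blending helper might look like, here is one possible sketch. It assumes min-max normalization per result list followed by a weighted sum, which is only one of several reasonable choices (reciprocal rank fusion is a common alternative). The `Scored` shape and the body of `blendScores` are assumptions for illustration, not the implementation behind the snippet above.

```typescript
interface Scored {
  id: string;
  text: string;
  score: number; // raw retriever score, higher is better
}

// Min-max normalize into [0, 1] so vector similarity and BM25 scores,
// which live on different scales, can be combined.
function normalize(results: Scored[]): Map<string, Scored> {
  const scores = results.map((r) => r.score);
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min;
  const out = new Map<string, Scored>();
  for (const r of results) {
    // If every score is identical, treat them all as a full match.
    const score = range === 0 ? 1 : (r.score - min) / range;
    out.set(r.id, { id: r.id, text: r.text, score });
  }
  return out;
}

// Weighted union: alpha weights the vector score, (1 - alpha) the BM25
// score. A chunk found by only one retriever gets 0 from the other.
function blendScores(vector: Scored[], bm25: Scored[], alpha: number): Scored[] {
  const v = normalize(vector);
  const b = normalize(bm25);
  const ids = new Set<string>(Array.from(v.keys()).concat(Array.from(b.keys())));
  const merged: Scored[] = [];
  ids.forEach((id) => {
    const hit = v.get(id) ?? b.get(id);
    if (hit === undefined) return;
    const vScore = v.get(id)?.score ?? 0;
    const bScore = b.get(id)?.score ?? 0;
    merged.push({ id, text: hit.text, score: alpha * vScore + (1 - alpha) * bScore });
  });
  return merged.sort((x, y) => y.score - x.score);
}
```

One caveat of min-max normalization: the lowest-scoring hit in each list is mapped to 0, the same as a chunk that list never returned.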

Stage 3: reranking

Vector similarity ≠ relevance. The reranker is what closes that gap.

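// cohereRerank is assumed to be a thin wrapper around the rerank API.
// Rerank the metadata-filtered candidates from stage 2.
const reranked = await cohereRerank({
  query: queryText,
  documents: filtered.slice(0, 30).map(r => r.text),
  topN: 5,
});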

Use Cohere rerank-v3 or a small cross-encoder fine-tuned on your domain. Either way, you go from "top 30 similar chunks" to "top 5 actually-relevant chunks."

In our pipelines, adding a reranker improves end-to-end answer quality by 10-30% on most evals.

Stage 4: context construction

How you pack the top-k chunks into the context window matters more than people think.

Best practices:

Include provenance with every chunk. "Source: docs/billing.md, section 3.2." The model uses this to cite.
Order by relevance, not by source. Most relevant chunk first.
Stay below 50-80k tokens of context even when you can do more. The model attends to the start and end more reliably than the middle.
Deduplicate near-identical chunks. If three sources say the same thing, keep one chunk and add "also confirmed by X, Y."
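
The practices above can be sketched as a small packing function. Treat this as an illustration under stated assumptions: `Chunk`, `buildContext`, the 4-characters-per-token estimate, and the whitespace/case fingerprint are all hypothetical. The fingerprint only catches copies that are identical after normalization; real near-duplicate detection would need something stronger, such as embedding similarity.

```typescript
interface Chunk {
  text: string;
  source: string; // e.g. "docs/billing.md, section 3.2"
  relevance: number; // reranker score, higher is better
}

interface Entry {
  chunk: Chunk;
  alsoIn: string[]; // other sources that carried the same text
}

// Crude token estimate (about 4 characters per token); swap in a real tokenizer.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

// Normalize whitespace and case so trivially different copies compare equal.
const fingerprint = (s: string): string =>
  s.toLowerCase().replace(/\s+/g, " ").trim();

function buildContext(chunks: Chunk[], maxTokens: number): string {
  const ordered = chunks.slice().sort((a, b) => b.relevance - a.relevance);
  const kept: Entry[] = [];
  const byPrint = new Map<string, Entry>();
  let used = 0;
  for (const chunk of ordered) {
    const fp = fingerprint(chunk.text);
    const prev = byPrint.get(fp);
    if (prev !== undefined) {
      prev.alsoIn.push(chunk.source); // duplicate: record the source only
      continue;
    }
    const cost = estimateTokens(chunk.text);
    if (used + cost > maxTokens) continue; // over budget: skip this chunk
    used += cost;
    const entry: Entry = { chunk, alsoIn: [] };
    byPrint.set(fp, entry);
    kept.push(entry);
  }
  return kept
    .map(({ chunk, alsoIn }, i) => {
      const confirmed = alsoIn.length > 0 ? ` (also confirmed by ${alsoIn.join(", ")})` : "";
      return `[${i + 1}] Source: ${chunk.source}${confirmed}\n${chunk.text}`;
    })
    .join("\n\n");
}
```

The `[N]` numbering in the output is chosen so it can double as the citation index the generation prompt refers to.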

Stage 5: generation with citation rendering

Instruct the model to cite. Validate the citations.

System prompt fragment:

Answer the question using only the provided sources. Cite sources
inline as [1], [2] etc. matching the numbers in the SOURCES section.
If the sources don't contain the answer, say so — don't make
something up.

SOURCES:
[1] docs/billing.md — "Customers on the Pro plan..."
[2] docs/refunds.md — "Refunds are processed within..."
[3] api/spec.yaml — "POST /v1/refunds..."

After generation, validate: every [N] citation maps to a source you actually provided. Hallucinated citations are filtered out before render.
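
A minimal sketch of that validation step, assuming citations appear as `[N]` and sources are numbered from 1 to `sourceCount`. The function name and return shape are hypothetical. Note that it only checks that the cited number exists, not that the source actually supports the claim.

```typescript
interface CitationCheck {
  cleaned: string; // answer with invalid citations stripped
  valid: number[]; // distinct citation numbers that map to a provided source
  invalid: number[]; // distinct citation numbers that do not
}

function validateCitations(answer: string, sourceCount: number): CitationCheck {
  const valid = new Set<number>();
  const invalid = new Set<number>();
  // Match "[N]" plus one optional leading space, so that removing a
  // bad citation does not leave a stray gap before punctuation.
  const cleaned = answer.replace(/\s?\[(\d+)\]/g, (match: string, num: string) => {
    const n = Number(num);
    if (n >= 1 && n <= sourceCount) {
      valid.add(n);
      return match; // keep valid citations untouched
    }
    invalid.add(n);
    return ""; // strip citations to sources that were never provided
  });
  const asc = (a: number, b: number): number => a - b;
  return {
    cleaned,
    valid: Array.from(valid).sort(asc),
    invalid: Array.from(invalid).sort(asc),
  };
}
```

Logging the `invalid` list per answer gives you the raw data for a hallucinated citation rate.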

Stage 6: observability + evals

Every query, every retrieved chunk, every generated answer, every user feedback signal goes to Langfuse (or equivalent).

Per-trace fields:

Query (and rewritten query if applicable)
Retrieved chunk IDs + scores + ranks
Reranker scores
Final context window content
Model output + citations
User feedback (thumbs up/down, escalation)
End-to-end latency, cost
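
One way to pin those fields down is a typed trace record. The shape below is a hypothetical sketch: the field names are assumptions and do not correspond to Langfuse's schema or any other vendor's.

```typescript
interface RetrievedChunk {
  chunkId: string;
  retrievalScore: number;
  rank: number; // 1-based position after retrieval
  rerankScore?: number; // absent if the chunk never reached the reranker
}

interface RagTrace {
  query: string;
  rewrittenQuery?: string; // only set when query rewriting ran
  retrieved: RetrievedChunk[];
  contextWindow: string; // the exact text sent to the model
  output: string;
  citations: number[];
  feedback?: "thumbs_up" | "thumbs_down" | "escalated";
  latencyMs: number;
  costUsd: number;
}

// Serialize one trace as a single JSON line for whatever sink you use.
function toLogLine(trace: RagTrace): string {
  return JSON.stringify(trace);
}
```

Typing the record up front makes it harder to forget a field when a new pipeline stage is added.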

Plus an eval suite that runs offline:

50-200 representative queries with reference answers
LLM-as-judge grader scoring each answer
Run on every prompt / model / chunking change
Surfaces regressions before they ship
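
A bare-bones version of such a suite might look like the sketch below. Everything here is an assumption for illustration: the `Answerer` and `Grader` types stand in for your pipeline and your LLM-as-judge call, both are synchronous only for brevity, and the pass mark and tolerance are arbitrary defaults.

```typescript
interface EvalCase {
  query: string;
  reference: string;
}

type Answerer = (query: string) => string; // stand-in for the full RAG pipeline
type Grader = (answer: string, reference: string) => number; // score in [0, 1]

interface EvalReport {
  meanScore: number;
  failures: EvalCase[]; // cases scoring below the pass mark
}

function runEvals(
  cases: EvalCase[],
  answer: Answerer,
  grade: Grader,
  passMark: number = 0.5,
): EvalReport {
  if (cases.length === 0) return { meanScore: 0, failures: [] };
  let total = 0;
  const failures: EvalCase[] = [];
  for (const c of cases) {
    const score = grade(answer(c.query), c.reference);
    total += score;
    if (score < passMark) failures.push(c);
  }
  return { meanScore: total / cases.length, failures };
}

// Flag a change when the mean score drops by more than the tolerance.
function isRegression(baselineMean: number, candidateMean: number, tolerance: number = 0.02): boolean {
  return candidateMean < baselineMean - tolerance;
}
```

Running `runEvals` on both the current and the candidate configuration, then comparing with `isRegression`, is enough to gate a prompt or chunking change in CI.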

Common failure modes

Stale index

The vector store has 6-month-old chunks because the ingestion webhook broke. Nobody noticed because queries still return something. Fix: monitor ingestion lag from day one; alert on stale chunks.
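
A staleness check can be as simple as comparing each source's last successful ingestion time against a threshold. The sketch below assumes you record a `lastIngestedAt` timestamp per source; the names and the threshold are hypothetical.

```typescript
interface SourceSync {
  source: string; // e.g. a connector or repository name
  lastIngestedAt: Date; // last successful ingestion for this source
}

// Return the sources whose last successful ingestion is older than maxLagHours.
function findStaleSources(syncs: SourceSync[], now: Date, maxLagHours: number): string[] {
  const maxLagMs = maxLagHours * 60 * 60 * 1000;
  return syncs
    .filter((s) => now.getTime() - s.lastIngestedAt.getTime() > maxLagMs)
    .map((s) => s.source);
}
```

Run it on a schedule and alert whenever the returned list is non-empty, so a broken webhook is noticed within hours rather than months.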

Chunk size mismatch

Same chunk size for everything: 400 tokens. But your codebase has 2000-line files and your FAQ has 50-word entries. Fix: chunking strategy per document type.
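
As a rough sketch of per-type chunking: sliding windows with overlap for conversational and technical documents, and clause-level splitting for legal text. The window sizes are illustrative picks, whitespace-separated words stand in for model tokens, and the blank-line clause separator is an assumption about how the legal text is laid out.

```typescript
type DocType = "conversational" | "technical" | "legal";

// Illustrative window sizes in whitespace-separated words, a rough
// stand-in for model tokens. Tune these on your own eval set.
const windowPolicy: Record<"conversational" | "technical", { size: number; overlap: number }> = {
  conversational: { size: 300, overlap: 50 },
  technical: { size: 600, overlap: 50 },
};

// Sliding window: each chunk shares `overlap` tokens with the previous one.
function chunkWindow(tokens: string[], size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("expected size > 0 and 0 <= overlap < size");
  }
  const chunks: string[] = [];
  for (let start = 0; start < tokens.length; start += size - overlap) {
    chunks.push(tokens.slice(start, start + size).join(" "));
    if (start + size >= tokens.length) break; // this window reached the end
  }
  return chunks;
}

function chunkDocument(text: string, docType: DocType): string[] {
  if (docType === "legal") {
    // Assumes clauses are separated by blank lines.
    return text.split(/\n\s*\n/).map((c) => c.trim()).filter((c) => c.length > 0);
  }
  const { size, overlap } = windowPolicy[docType];
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  return chunkWindow(tokens, size, overlap);
}
```

For code, a structure-aware splitter (by function or class) will usually beat any fixed window; the windowed path here is only the fallback.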

No metadata filtering

The HR knowledge base also gets queries about engineering. The agent helpfully retrieves engineering docs. Fix: tag chunks with source + topic; filter by query context.
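
A metadata filter along these lines might look like the sketch below. The tag names (`tenant`, `topic`, `period`) and the `QueryContext` shape are assumptions for illustration; tenant is treated as a hard isolation boundary, while topic and period apply only when the query context provides them.

```typescript
interface ChunkMeta {
  tenant: string;
  topic: string; // e.g. "hr", "engineering"
  period?: string; // e.g. "2026-Q1"
}

interface QueryContext {
  tenant: string;
  topics?: string[]; // allowed topics, if the query was classified
  period?: string; // time period mentioned in the query, if any
}

function filterByMetadata<T extends { meta: ChunkMeta }>(chunks: T[], ctx: QueryContext): T[] {
  return chunks.filter(({ meta }) => {
    if (meta.tenant !== ctx.tenant) return false; // never cross tenants
    if (ctx.topics !== undefined && ctx.topics.indexOf(meta.topic) === -1) return false;
    if (ctx.period !== undefined && meta.period !== ctx.period) return false;
    return true;
  });
}
```

In production the same predicates are usually better pushed down into the retrieval query itself, so that filtered-out chunks never consume top-k slots.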

Hallucinated citations

The model generates [3] referring to a source you never gave it. Fix: validate every citation post-generation; strip the unverified ones.

Single-source bias

The top chunk by similarity dominates the answer; other relevant sources are ignored. Fix: encourage the model to synthesize across sources; check for citation diversity in evals.

Lost in the middle

Chunks placed in the middle of a long context window get under-attended. Fix: keep the context tight (<50k tokens) and put the most relevant chunks at the start and end.
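
One simple way to get the most relevant chunks to the edges is to alternate them between the front and the back of the list. The helper below is a hypothetical sketch that assumes its input is already sorted best-first; whether it beats plain relevance ordering is something to check on your own evals.

```typescript
// Given items ranked best-first, place them alternately at the front and
// the back so the strongest items sit at the edges and the weakest in
// the middle of the context window.
function edgeOrder<T>(rankedBestFirst: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  rankedBestFirst.forEach((item, i) => {
    if (i % 2 === 0) {
      front.push(item); // ranks 1, 3, 5, ... fill in from the start
    } else {
      back.unshift(item); // ranks 2, 4, 6, ... fill in from the end
    }
  });
  return front.concat(back);
}
```

For ranks 1 to 5 this yields the order 1, 3, 5, 4, 2: the best chunk opens the context, the second-best closes it, and the weakest lands in the middle.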

What you should measure

Metric	What it tells you
Retrieval recall@k	Did the right chunk make the top-k?
Retrieval precision@k	Of the top-k, how many are actually relevant?
Answer accuracy (LLM-graded)	End-to-end quality
Citation coverage	% of claims with a valid citation
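Hallucinated citation rate	% of citations pointing at a source that was never provided (target: <1%)
User CSAT (thumbs)	Online signal of user satisfaction
Escalation rate	How often the user gave up and escalated
Per-query cost	Whether the pipeline fits the budget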
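
The two retrieval metrics at the top of the table are straightforward to compute once each eval query is labeled with its relevant chunk IDs. A minimal sketch, assuming retrieved IDs are unique and ordered best-first (function names are illustrative):

```typescript
// Count how many of the top-k retrieved IDs are labeled relevant.
function hitsAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  return retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
}

// recall@k: share of all relevant chunks that made it into the top-k.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (relevant.size === 0) return 0;
  return hitsAtK(retrieved, relevant, k) / relevant.size;
}

// precision@k: share of the top-k slots filled by relevant chunks.
function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  if (k <= 0) return 0;
  return hitsAtK(retrieved, relevant, k) / k;
}
```

Averaging `recallAtK` over the eval set for each candidate chunk size is also how you would pick the chunk size that maximises retrieval recall.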

Stack we tend to reach for

Layer	Default
Embeddings	OpenAI text-embedding-3-large or Cohere embed-multilingual-v3
Vector store	pgvector (if you have Postgres) / Pinecone / Vectorize
BM25	Postgres full-text search or OpenSearch
Reranker	Cohere rerank-v3
Generation	Claude Sonnet 4.6 (default)
Observability	Langfuse
Ingestion	Webhooks + scheduled fallback sync

For a deeper look at the agent architecture above retrieval, see How AI agents actually work. For why chatbots fail without proper RAG, see Why your AI chatbot fails.

If you're building a RAG system and want a feasibility review, drop us a note.
