Question 1

Why does our existing chatbot suck?

Accepted Answer

Usually one of four reasons. (1) No retrieval: it's just a generic LLM with no access to your knowledge. (2) Bad retrieval: chunks are too big or too small, and there is no reranking or metadata filtering. (3) No evals: every prompt change is a coin flip. (4) No escalation: the bot tries to handle conversations it shouldn't, which frustrates users. We fix the four root causes; we don't 'tune the prompt' and hope.
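Chunk size is the easiest part of "bad retrieval" to make concrete. Below is a minimal sketch of fixed-size chunking with overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. The function name `chunk_words` and the default sizes are illustrative assumptions, not recommended values; production pipelines often split on headings or sentences instead of raw word counts.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk. Sizes are hypothetical defaults for illustration."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Tuning `size` and `overlap` is only meaningful if an eval set measures the effect on retrieval quality; otherwise it is another coin flip.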

Question 2

What's the difference between RAG and fine-tuning?

Accepted Answer

RAG (retrieval-augmented generation) pulls relevant chunks from your knowledge base at query time and gives them to the LLM as context. Fine-tuning bakes knowledge into the model weights. For 95% of conversational use cases, RAG is the right answer — it's cheaper, your knowledge stays current as docs change, and you can audit exactly which sources the answer came from. Fine-tuning is for style/format adaptation, not for teaching new facts.
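To make the query-time flow concrete, here is a toy sketch of RAG: score chunks against the question, keep the best ones, and paste them into the prompt as context. The bag-of-words cosine score is a stand-in assumption for a real embedding model and vector store, and `CHUNKS`, `retrieve`, and `build_prompt` are hypothetical names for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge-base chunks; a real system would load these from a vector store.
CHUNKS = [
    {"id": "refunds-1", "text": "Refunds are issued within 14 days of purchase."},
    {"id": "shipping-1", "text": "Standard shipping takes 3 to 5 business days."},
    {"id": "hours-1", "text": "Support is available on weekdays from 9 to 17."},
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[dict], top_k: int = 2) -> list[dict]:
    """Return up to top_k chunks with a non-zero similarity to the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c["text"]))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_k] if score > 0]

def build_prompt(query: str, chunks: list[dict]) -> str:
    """Assemble the context the LLM sees; source ids make answers auditable."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
```

Nothing here touches model weights: updating a chunk in `CHUNKS` changes the next answer immediately, which is the property the RAG versus fine-tuning comparison rests on.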

Question 3

How do we keep the agent's knowledge fresh?

Accepted Answer

Two paths. (1) Push-based: when a doc changes in your CMS / Notion / SharePoint, a webhook re-ingests the relevant chunks. (2) Pull-based: a scheduled sync re-crawls the knowledge base every N hours. We default to push where the source supports webhooks and pull otherwise, with a daily fallback sync to catch anything missed.
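One way to let push and pull share the same ingestion path is to key everything on a content hash, so a webhook and a scheduled sync can both call one function and unchanged documents are skipped. This is a sketch under assumptions: the `Index` class, the event shape (`doc_id`, `text`), and the in-memory storage are all hypothetical, and the `reingested` list stands in for the real chunk, embed, and upsert step.

```python
import hashlib

class Index:
    """In-memory stand-in for the real vector index (hypothetical)."""
    def __init__(self) -> None:
        self.hashes: dict[str, str] = {}   # doc_id -> hash of last ingested content
        self.reingested: list[str] = []    # log of docs that were re-embedded

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def ingest_if_changed(index: Index, doc_id: str, text: str) -> bool:
    """Re-ingest a document only when its content actually changed."""
    digest = content_hash(text)
    if index.hashes.get(doc_id) == digest:
        return False
    index.hashes[doc_id] = digest
    index.reingested.append(doc_id)  # placeholder for chunk + embed + upsert
    return True

def on_webhook(index: Index, event: dict) -> bool:
    """Push path: the source system notifies us about one changed document."""
    return ingest_if_changed(index, event["doc_id"], event["text"])

def scheduled_sync(index: Index, source: dict[str, str]) -> list[str]:
    """Pull path: walk the whole source, catch missed updates, drop deleted docs."""
    changed = [doc_id for doc_id, text in source.items()
               if ingest_if_changed(index, doc_id, text)]
    for doc_id in set(index.hashes) - set(source):
        del index.hashes[doc_id]  # the document no longer exists upstream
    return changed
```

Because the fallback sync is idempotent, running it daily costs little when nothing changed and repairs the index when a webhook was dropped.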

Question 4

Can the agent cite its sources?

Accepted Answer

Yes — and it should. We render inline citations linking to the source document and the exact passage. Citations build user trust and let admins audit answers. They also serve as the hook for evals: was the cited passage actually relevant to the question?
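A minimal sketch of the citation plumbing, assuming the model is prompted to emit source ids in square brackets such as `[refunds-1]`. The marker format, the field names (`title`, `url`, `passage`), and both function names are assumptions for illustration. The validation step flags any cited id that was never retrieved, which is a cheap guard against invented sources.

```python
import re

def cited_ids(answer: str) -> list[str]:
    """Extract source ids written as [some-id] in the answer text."""
    return re.findall(r"\[([\w-]+)\]", answer)

def validate_citations(answer: str, sources: dict[str, dict]) -> dict:
    """Check that the answer cites something and cites only retrieved sources."""
    ids = cited_ids(answer)
    unknown = [i for i in ids if i not in sources]
    return {"cited": ids, "unknown": unknown, "ok": bool(ids) and not unknown}

def render_sources(answer: str, sources: dict[str, dict]) -> str:
    """Append a source list linking each citation to its document and passage."""
    lines = [answer, "", "Sources:"]
    for source_id in dict.fromkeys(cited_ids(answer)):  # dedupe, keep order
        if source_id in sources:
            s = sources[source_id]
            lines.append(f'[{source_id}] {s["title"]} ({s["url"]}): "{s["passage"]}"')
    return "\n".join(lines)
```

The same `cited` list can feed an eval that asks whether each cited passage was actually relevant to the question.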

Question 5

How do we measure if the agent is good?

Accepted Answer

Three layers. (1) Offline evals: a fixed set of questions and 'good' answers we run on every prompt change. (2) Online metrics: deflection rate, escalation rate, user satisfaction (thumbs up/down), per-question latency and cost. (3) Sampled human review: roughly 1% of real conversations get human-graded weekly. Without all three, the agent will silently drift.
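The first two layers can be sketched in a few lines. This is an illustration under assumptions, not a definitive harness: the `agent` callable is assumed to return a dict with `answer` and `sources`, the keyword check is a crude stand-in for a real grader, the 0.9 gate is an arbitrary example threshold, and deflection rate is assumed here to mean the share of conversations resolved without escalation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EvalCase:
    question: str
    must_contain: list[str]   # facts a good answer has to mention
    expected_source: str      # id of the document that should be cited

def run_offline_evals(agent: Callable[[str], dict], cases: list[EvalCase],
                      threshold: float = 0.9) -> dict:
    """Run the fixed question set; intended to gate every prompt change."""
    failures = []
    for case in cases:
        result = agent(case.question)
        answer = result["answer"].lower()
        has_facts = all(k.lower() in answer for k in case.must_contain)
        has_source = case.expected_source in result["sources"]
        if not (has_facts and has_source):
            failures.append(case.question)
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failures": failures,
            "passed_gate": pass_rate >= threshold}

def online_metrics(conversations: list[dict]) -> dict:
    """Aggregate production logs into escalation, deflection and satisfaction."""
    total = len(conversations)
    if total == 0:
        return {"escalation_rate": None, "deflection_rate": None, "satisfaction": None}
    escalated = sum(1 for c in conversations if c["escalated"])
    rated = [c["thumbs_up"] for c in conversations if c.get("thumbs_up") is not None]
    satisfaction: Optional[float] = sum(rated) / len(rated) if rated else None
    return {"escalation_rate": escalated / total,
            "deflection_rate": (total - escalated) / total,
            "satisfaction": satisfaction}
```

Wiring `run_offline_evals` into CI means a prompt change that drops the pass rate below the gate fails before it ships.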

Question 6

Where does the agent stop and a human take over?

Accepted Answer

At explicit escalation triggers: low retrieval confidence, repeated user misunderstanding, explicit 'speak to a person' intent, and sensitive topics (legal, billing disputes, complaints). When a trigger fires, the agent does a warm transfer with the conversation context attached, so the human doesn't restart from scratch.
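The triggers translate naturally into an ordered rule check that runs on every turn. The sketch below is illustrative: the thresholds (`min_score`, `max_clarifications`), the phrase list, the topic labels, and the payload shape are all assumptions, and a real system would likely use an intent classifier instead of substring matching.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels and phrases; tune to your own taxonomy.
SENSITIVE_TOPICS = {"legal", "billing_dispute", "complaint"}
HUMAN_PHRASES = ("speak to a person", "talk to a human", "real person")

@dataclass
class Turn:
    user_message: str
    retrieval_score: float        # best similarity score from retrieval, 0..1
    topic: str = "general"
    clarification_count: int = 0  # times the bot had to ask the user to rephrase

def escalation_reason(turn: Turn, min_score: float = 0.35,
                      max_clarifications: int = 2) -> Optional[str]:
    """Return why this turn should go to a human, or None to let the bot answer."""
    text = turn.user_message.lower()
    if any(phrase in text for phrase in HUMAN_PHRASES):
        return "user_requested_human"
    if turn.topic in SENSITIVE_TOPICS:
        return "sensitive_topic"
    if turn.clarification_count >= max_clarifications:
        return "repeated_misunderstanding"
    if turn.retrieval_score < min_score:
        return "low_retrieval_confidence"
    return None

def handoff_payload(reason: str, transcript: list[str]) -> dict:
    """Warm transfer: ship the reason and full transcript to the human agent."""
    return {"reason": reason, "transcript": list(transcript)}
```

Checking the explicit request first means a user who asks for a person is never kept in the bot because retrieval happened to score well.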

Layer	Default
Orchestration	LangGraph (when multi-step) or plain SDK (single-shot)
Vector store	pgvector (if you have Postgres) / Pinecone (managed) / Vectorize (Cloudflare)
Embedding model	OpenAI text-embedding-3-large or Cohere embed-multilingual-v3
Reranker	Cohere rerank-v3 or a small cross-encoder
Reasoning model	Claude Sonnet 4.6 (default), GPT-4o (latency-sensitive)
Ingestion	Per-source webhooks + scheduled fallback sync
Observability	Langfuse for traces and evals, Sentry for errors
UI	Next.js + streaming + Server Components

Scope	Typical investment
Discovery + knowledge audit (1 week)	€4,000–6,000
Internal Q&A agent (4–6 weeks)	€25,000–40,000
Customer support agent with CRM + escalation (6–10 weeks)	€40,000–80,000
Multi-domain enterprise agent (10–16 weeks)	€80,000–150,000
Ongoing retainer	from €1,500/month

Conversational Agent

What a conversational agent actually is

The four reasons your current chatbot is bad

1. No retrieval

2. Bad retrieval

3. No evals

4. No escalation

Anatomy of a working pipeline

Stack we tend to reach for

What good evals look like

Cost and timeline

Common failure modes we've seen

Where it pairs

Frequently asked questions
