Conversational Agent
Internal or customer chat grounded in your knowledge base with citations and escalation
What a conversational agent actually is
You have knowledge — documentation, runbooks, policy manuals, product specs, customer FAQs, internal wiki — and you have people who need to find answers in that knowledge. A conversational agent sits between them. It takes a question in natural language, retrieves the relevant chunks, generates an answer grounded in those chunks, and shows the citations.
It is not:
- A generic LLM chatbot with no access to your data (which would make things up).
- A search bar with a different UI.
- A keyword-matching FAQ tree.
It is a system with a clear job: answer questions from your knowledge, cite the sources, and hand off to a human when it shouldn't be answering.
The four reasons your current chatbot is bad
Most production chatbots we audit fail for one of four reasons:
1. No retrieval
The bot is just an LLM with a system prompt. It has no access to your knowledge. So it hallucinates plausible-sounding garbage. Fix: add proper RAG with vector search over your actual docs.
2. Bad retrieval
The bot has retrieval but the chunks are wrong-sized, there's no reranking, no metadata filtering, no hybrid search. Top-3 chunks are irrelevant. Fix: tune chunk sizes per document type, add BM25 hybrid search, rerank with a smaller model, filter chunks by metadata when the query implies it.
3. No evals
Every prompt change ships and someone hopes. The bot got better at the question they personally tested and worse at three others. Fix: build an eval set of 50–200 representative questions with reference answers; run it in CI; surface scores on a dashboard.
4. No escalation
The bot tries to handle questions it shouldn't (billing disputes, legal advice, complex troubleshooting) and infuriates users. Fix: define explicit escalation intents; warm transfer with conversation context; never make users repeat themselves.
Our Why your AI chatbot fails post walks through each failure mode in detail.
Anatomy of a working pipeline
[User question]
↓
[Query rewrite (optional, for multi-turn)]
↓
[Hybrid retrieval: vector + BM25, metadata-filtered]
↓
[Rerank (cross-encoder or smaller LLM)]
↓ → top-k chunks with provenance
[Answer generation (Claude / GPT) with chunks in context]
↓
[Citation rendering + escalation classifier]
↓
[Response with inline citations]
Plus the observability layer — every query, every retrieved chunk, every generated token, every user feedback signal — flowing into Langfuse for evaluation and debugging.
Stack we tend to reach for
| Layer | Default |
|---|---|
| Orchestration | LangGraph (when multi-step) or plain SDK (single-shot) |
| Vector store | pgvector (if you have Postgres) / Pinecone (managed) / Vectorize (Cloudflare) |
| Embedding model | OpenAI text-embedding-3-large or Cohere embed-multilingual-v3 |
| Reranker | Cohere rerank-v3 or a small cross-encoder |
| Reasoning model | Claude Sonnet 4.6 (default), GPT-4o (latency-sensitive) |
| Ingestion | Per-source webhooks + scheduled fallback sync |
| Observability | Langfuse for traces and evals, Sentry for errors |
| UI | Next.js + streaming + Server Components |
For the deeper RAG patterns we use, see our RAG patterns that survive production post.
What good evals look like
Three layers, always:
- Offline eval set: 50–200 representative questions, each with a reference answer or rubric. Run on every prompt/model change. Track scores over time.
- Online metrics: deflection rate (questions answered without escalation), CSAT (thumbs up/down), per-question cost, latency p50/p95.
- Sampled human review: ~1% of real conversations weekly, graded for accuracy and tone.
Without all three, the agent will silently drift as docs change and models update.
Cost and timeline
| Scope | Typical investment |
|---|---|
| Discovery + knowledge audit (1 week) | €4,000–6,000 |
| Internal Q&A agent (4–6 weeks) | €25,000–40,000 |
| Customer support agent with CRM + escalation (6–10 weeks) | €40,000–80,000 |
| Multi-domain enterprise agent (10–16 weeks) | €80,000–150,000 |
| Ongoing retainer | from €1,500/month |
LLM pass-through cost typically €0.001–€0.01 per query depending on context size.
Common failure modes we've seen
- Stale knowledge base. Webhook from CMS broke six months ago, nobody noticed. We add monitoring on ingestion lag from day one.
- Hallucinated citations. The model fabricates a doc reference. We validate every cited URL exists in the retrieval result before rendering.
- Cross-domain leak. An internal HR agent answers questions about engineering source code because everything's in one vector store. We separate stores and add metadata filtering.
- Prompt injection. Users try to override the system prompt ("ignore previous instructions"). We add prompt-injection detection and refuse to follow user-provided system-level instructions.
Where it pairs
Conversational agents commonly chain with:
- Document processing agents when the user asks about specific documents the conversational agent can fetch and analyse on the fly.
- Workflow orchestrators when the user wants the agent to do something — create a ticket, schedule a follow-up — not just answer.
- Voice agents when the same knowledge needs to be served on the phone.
If you have a knowledge surface that customers or employees are struggling to navigate, drop us a note. We respond within one business day.
Frequently asked questions
Related
AI Agents Development
Custom agents that read documents, hold conversations, take phone calls, and execute multi-step workflows — wired into the systems you already run.
Mastery
AI-powered learning platform on Google Generative AI
RAG done right: the patterns that survive production
Production RAG is engineering, not magic. The patterns that survive: hybrid retrieval (vector + BM25), rerank top-k with a cross-encoder, metadata filtering, source dating, citation rendering, sampled human review. Without these, your retrieval is good in the demo and broken in production.
Why your AI chatbot fails (and what to fix)
Most chatbots that fail in production fail for one of six reasons: no retrieval, bad retrieval, no evals, no escalation, no observability, no scope. Tuning the prompt won't fix any of them. The fix is engineering — and the engineering is well-understood by now.
ChatGPT API vs Claude API vs Gemini: which to pick (2026)
Claude Sonnet 4.6/4.7 is our default for production agents — most reliable tool calling, best structured output, strong reasoning. GPT-4o wins for voice (Realtime is best-in-class) and the largest ecosystem. Gemini 2.5/2.0 wins for long-context, vision-heavy document work, and cost-sensitive volume workloads. Pick per task; abstract behind a provider interface.
Want to scope a conversational agent project?
Tell us the workflow. We'll come back within one business day with a clear next step.