The core problem

Imagine you’ve built a chatbot. The system is running, the first demos look promising, stakeholders are excited. Three months after launch, you check the usage statistics, and they’re sobering. People hardly use the bot. And when they do, it’s mostly for trivial questions.

What happened?

The problem isn’t that the bot gives bad answers. The problem is that nobody trusts it. A chatbot you don’t trust doesn’t save time – it wastes time. Users have to double-check every answer, verify it elsewhere, look things up again. At that point, it’s easier to just search the documents directly.

Trust isn’t a “nice to have”; it’s fundamental for adoption. And trust isn’t created by big promises or shiny screenshots. It’s created by demonstrable, measurable quality – through evaluation.

The client has to know what they want

When I talk to clients about chatbots, I often hear: “We want the bot to give good answers.”

Sounds reasonable, but as a requirement, it’s far too vague. What does “good” even mean?

  • Should the bot prefer a short but correct answer, or a detailed one that’s 95% accurate?
  • Is it allowed to say “I don’t know”, or should it always try to answer?
  • What tone is desired: formal and factual, or friendly and personal?
  • How should it handle conflicting information in the source documents?
  • How detailed should answers be: a concise summary or the full information?

These questions may sound trivial, but the answers define whether a chatbot is “good” or not. Often, clients don’t really know what they need until they see bad examples. That’s why we need human evaluations.

Human evaluations: truly understanding the client

Let’s get practical. The requirements are clarified, and the scope is defined. Now it’s about understanding what “good” means in practice. But how do we find that out?

The answer: manually, at first.

I know, in a world of AI, LLM-as-a-judge, and automated metrics, that sounds a bit old-fashioned. But you can’t build an automated evaluation if you don’t know what you actually want to evaluate. And you only find that out by having real people assess real answers.

Building an evaluation set and defining quality dimensions

First, you need a set of representative questions, 50 to 200, ideally real user questions. Not the simple demo examples, but everyday questions:

  • Common standard questions: “How can I log in?”, “Where is the emergency department?”
  • Edge cases: “When do I have to enter my vacation days as an employee of the administration?”, “Give me a good pizza recipe”
  • Ambiguous questions: “How can I sign in?”
  • Questions that can’t be answered from the documents at all: “Is Bern better than Basel?”, “Who is Margaret Thatcher?”
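
To make this concrete, here is a minimal sketch of how such an evaluation set could be stored, using the question categories from the list above. The schema and field names are just an illustration, not a fixed format:

```python
# A hypothetical evaluation set: representative questions plus a rough category.
# Keep it small and version it with the project, so everyone evaluates the same set.
EVAL_SET = [
    {"id": "q001", "category": "standard",     "question": "How can I log in?"},
    {"id": "q002", "category": "standard",     "question": "Where is the emergency department?"},
    {"id": "q003", "category": "edge_case",    "question": "Give me a good pizza recipe"},
    {"id": "q004", "category": "ambiguous",    "question": "How can I sign in?"},
    {"id": "q005", "category": "unanswerable", "question": "Is Bern better than Basel?"},
]
```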

For each question, an answer is generated. In parallel, the core team (and ideally other stakeholders) defines the evaluation dimensions, since not every dimension is equally important in every context. Typically, we use:

  • Correctness: Is the information accurate? Is there even a clear right/wrong?
  • Completeness: Is the information complete? Are important aspects missing?
  • Tone: Does the tone match what we want? (Textmate can be very helpful here, by the way.)
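
Writing the dimensions down explicitly, with one guiding question each, makes it much easier to align on them later. A small sketch; the wording is an example, not a fixed rubric:

```python
# Hypothetical rubric: each dimension gets one guiding question that evaluators
# answer with "good" / "not good", plus a short written justification.
RUBRIC = {
    "correctness":  "Is the information accurate? Is there even a clear right/wrong?",
    "completeness": "Is the information complete, or are important aspects missing?",
    "tone":         "Does the tone match what we want, e.g. formal vs. friendly?",
}
```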

Evaluating

Then comes the tedious part: several people assess each question–answer pair along the defined dimensions.

  1. Good/not good: For each dimension, they decide whether the answer is “good” or not.
  2. Justification: Every rating has to be justified. That may feel time-consuming, but it’s crucial – it’s the only way to build a shared understanding of what “good” means.
  3. Blind evaluation: Evaluators shouldn’t see one another’s scores. If results differ widely, the criteria are probably too vague.
  4. Discussion: Where there are discrepancies, a joint discussion helps. These conversations are often the most valuable part of the whole process.

After 50–100 evaluated examples, you get a clear picture of the initial situation – and usually also of what needs to happen next.
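
Here is a sketch of how the ratings could be collected and how the discrepancies for the discussion step can be flagged automatically. The field names and the simple agreement check are assumptions, not a prescribed format:

```python
from collections import defaultdict

# Hypothetical ratings: one entry per evaluator, question and dimension.
# Every rating carries a justification, as described above.
ratings = [
    {"question_id": "q001", "evaluator": "anna", "dimension": "correctness",
     "good": True,  "justification": "Matches the current login instructions."},
    {"question_id": "q001", "evaluator": "ben",  "dimension": "correctness",
     "good": False, "justification": "Points to the old login page, which no longer exists."},
]

def find_disagreements(ratings):
    """Return the (question, dimension) pairs where the blind ratings differ."""
    votes = defaultdict(set)
    for r in ratings:
        votes[(r["question_id"], r["dimension"])].add(r["good"])
    return [item for item, verdicts in votes.items() if len(verdicts) > 1]

# These pairs are the candidates for the joint discussion.
print(find_disagreements(ratings))  # [('q001', 'correctness')]
```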

Scaling with tools

Manual evaluation doesn’t scale well: evaluating 100 questions by hand is doable. 1,000 is painful. 10,000 in continuous monitoring? Impossible.

That’s where tools come in.

LLM-as-a-Judge: how it works

The idea is simple: an LLM evaluates the chatbot’s answers based on defined criteria. In very simplified terms, it needs:

  1. The question
  2. Your system’s answer
  3. The gold standard (what the ideal answer should look like)

The evaluator then outputs a conclusion and justification.
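
A minimal sketch of what such a judge could look like. The prompt wording, the JSON output format, and the call_llm helper are placeholders, not the API of any specific provider:

```python
import json

JUDGE_PROMPT = """You are evaluating a chatbot answer.

Question: {question}
Chatbot answer: {answer}
Gold-standard answer: {gold}

For the dimension "{dimension}", answer with JSON:
{{"verdict": "good" or "not good", "justification": "<one or two sentences>"}}"""

def call_llm(prompt: str) -> str:
    """Placeholder for your actual model client; should return the judge's raw JSON reply."""
    raise NotImplementedError("wire this up to the LLM provider you use")

def judge(question: str, answer: str, gold: str, dimension: str) -> dict:
    """Ask the judge LLM for a verdict and a justification on one dimension."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer=answer, gold=gold, dimension=dimension
    )
    return json.loads(call_llm(prompt))
```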

The biggest risk: you replace one problem (evaluating the chatbot) with another (evaluating the evaluator). That’s why the automated evaluation must be calibrated. To do this, we typically take 50–100 manually evaluated examples and have them assessed by the LLM as well. If the results align, the judge is working reliably.
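
Calibration itself can stay simple: run the judge over the manually evaluated examples and compare its verdicts with the human labels, for instance via a plain agreement rate. A sketch; what counts as “aligned” is something you have to decide per project:

```python
def agreement_rate(human_labels: list[bool], judge_labels: list[bool]) -> float:
    """Share of examples on which the LLM judge agrees with the human evaluators."""
    assert len(human_labels) == len(judge_labels), "labels must cover the same examples"
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

# Example: the judge matches the human verdict on 4 out of 5 examples -> 0.8.
print(agreement_rate([True, True, False, True, False],
                     [True, True, False, True, True]))
```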

After that, continuous improvement begins; more on that another time. At the end of these iterations comes the big moment: go-live.

Go-live and continuous monitoring

We recommend starting the go-live quietly at first, without a big announcement. That way, the chatbot can be further improved in the first days based on real user questions.

But the work isn’t done once it’s live: ongoing evaluation is crucial. Metrics that are particularly helpful include:

  • Share of unanswered questions
  • Groundedness (i.e., is the bot hallucinating, or are the facts based on the sources?)
  • Human spot checks, especially where evaluations are weak or groundedness is low
  • And last but not least: user feedback.

With clear, easy-to-understand metrics, even a chatbot handling 10,000 or more questions can be reliably monitored – without having to review every single one.
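
As a sketch, these metrics can be computed directly from the chatbot’s logs. The log fields and the groundedness score below are assumptions about what your pipeline records, not a standard schema:

```python
# Hypothetical log entries as a monitoring pipeline might record them.
logs = [
    {"question": "How can I log in?", "answered": True, "groundedness": 0.95, "feedback": "up"},
    {"question": "Who is Margaret Thatcher?", "answered": False, "groundedness": None, "feedback": None},
]

def monitoring_summary(logs: list[dict]) -> dict:
    """Aggregate unanswered share, average groundedness and spot-check candidates."""
    unanswered = sum(1 for entry in logs if not entry["answered"]) / len(logs)
    scores = [entry["groundedness"] for entry in logs if entry["groundedness"] is not None]
    return {
        "share_unanswered": unanswered,
        "avg_groundedness": sum(scores) / len(scores) if scores else None,
        # Low-groundedness answers are good candidates for human spot checks.
        "spot_check_candidates": [entry["question"] for entry in logs
                                  if entry["groundedness"] is not None and entry["groundedness"] < 0.7],
    }

print(monitoring_summary(logs))
```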

Evaluation isn’t a “nice-to-have”

The difference between a successful chatbot and a failed one doesn’t lie in the best embedding model, the latest LLM or the cleverest retrieval algorithm.
It lies in the willingness to invest time in evaluation.

In human evaluations.
In automated monitoring.
In continuous improvement.

That’s the only way to build trust – the foundation for adoption.