The core problem
Imagine you've built a chatbot. The system is running, the first demos look promising, stakeholders are excited. Three months after launch, you check the usage statistics, and they're sobering. People hardly use the bot. And when they do, it's mostly for trivial questions.
What happened?
The problem isn't that the bot gives bad answers. The problem is that nobody trusts it. A chatbot you don't trust doesn't save time; it wastes time. Users have to double-check every answer, verify it elsewhere, look things up again. At that point, it's easier to just search the documents directly.
Trust isn't a "nice to have"; it's fundamental for adoption. And trust isn't created by big promises or shiny screenshots. It's created by demonstrable, measurable quality: through evaluation.
The client has to know what they want
When I talk to clients about chatbots, I often hear: "We want the bot to give good answers."
Sounds reasonable, but as a requirement, it's far too vague. What does "good" even mean?
- Should the bot prefer a short but correct answer, or a detailed one that's 95% accurate?
- Is it allowed to say "I don't know", or should it always try to answer?
- What tone is desired: formal and factual, or friendly and personal?
- How should it handle conflicting information in the source documents?
- How detailed should answers be: a concise summary or the full information?
These questions may sound trivial, but the answers define whether a chatbot is "good" or not. Often, clients don't really know what they need until they see bad examples. That's why we need human evaluations.
Human evaluations: truly understanding the client
Let's get practical. The requirements are clarified, and the scope is defined. Now it's about understanding what "good" means in practice. But how do we find that out?
The answer: manually, at first.
I know, in a world of AI, LLMs as a judge, and automated metrics, that sounds a bit old-fashioned. But you can't build an automated evaluation if you don't know what you actually want to evaluate. And you only find that out by having real people assess real answers.
Building an evaluation set and defining quality dimensions
First, you need a set of 50 to 200 representative questions, ideally real user questions. Not the simple demo examples, but everyday questions (a structural sketch follows the list):
- Common standard questions: "How can I log in?", "Where is the emergency department?"
- Edge cases: "When do I have to enter my vacation days as an employee of the administration?", "Give me a good pizza recipe"
- Ambiguous questions: "How can I sign in?"
- Questions that can't be answered from the documents at all: "Is Bern better than Basel?", "Who is Margaret Thatcher?"
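As a rough illustration, such an evaluation set can live in a very simple structure. The sketch below assumes plain Python; the field names and category labels are illustrative, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    question: str
    category: str                          # e.g. "standard", "edge_case", "ambiguous", "out_of_scope"
    reference_answer: str | None = None    # optional gold answer, if one exists

# A handful of items drawn from the examples above.
eval_set = [
    EvalItem("How can I log in?", "standard"),
    EvalItem("Give me a good pizza recipe", "edge_case"),
    EvalItem("How can I sign in?", "ambiguous"),
    EvalItem("Who is Margaret Thatcher?", "out_of_scope"),
]
```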
For each question, an answer is generated. In parallel, the core team (and ideally other stakeholders) defines the evaluation dimensions, since not every dimension is equally important in every context.
Typically, we use the following dimensions (a minimal rubric sketch follows the list):
- Correctness: Is the information accurate? Is there even a clear right/wrong?
- Completeness: Is the information complete? Are important aspects missing?
- Tone: Does the tone match what we want? (Textmate can be very helpful here, by the way.)
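Writing the chosen dimensions down as an explicit rubric helps both human evaluators and, later, the automated judge. The dictionary below only restates the dimensions above; the variable name is illustrative.

```python
# Evaluation rubric: one guiding question per dimension (illustrative sketch).
RUBRIC = {
    "correctness": "Is the information accurate? Is there a clear right/wrong at all?",
    "completeness": "Is the information complete? Are important aspects missing?",
    "tone": "Does the tone match what we want?",
}
```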
Evaluating
Then comes the tedious part: several people assess each question-answer pair along the defined dimensions.
- Good/not good: For each dimension, they decide whether the answer is "good" or not.
- Justification: Every rating has to be justified. That may feel time-consuming, but it's crucial: it's the only way to build a shared understanding of what "good" means.
- Blind evaluation: Evaluators shouldn't see one another's scores. If results differ widely, the criteria are probably too vague.
- Discussion: Where there are discrepancies, a joint discussion helps. These conversations are often the most valuable part of the whole process.
After 50 to 100 evaluated examples, you get a clear picture of the initial situation, and usually also of what needs to happen next.
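To make this concrete, here is a minimal sketch of how ratings can be recorded and how evaluator agreement can be checked. It assumes the binary good/not-good scheme described above and uses simple pairwise agreement rates rather than a formal statistic such as Cohen's kappa; all names are illustrative.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Rating:
    question_id: str
    evaluator: str
    scores: dict        # dimension -> True ("good") / False ("not good")
    justification: str  # mandatory: why the rating was given

def agreement_rate(ratings: list[Rating], dimension: str) -> float:
    """Share of evaluator pairs that agree on one dimension, across all questions."""
    by_question: dict[str, list[Rating]] = {}
    for r in ratings:
        by_question.setdefault(r.question_id, []).append(r)

    agree, total = 0, 0
    for rs in by_question.values():
        for a, b in combinations(rs, 2):
            total += 1
            agree += a.scores[dimension] == b.scores[dimension]
    return agree / total if total else 0.0

# A low agreement rate is usually a sign that the criteria are still too vague.
```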
Scaling with tools
Manual evaluation doesn't scale well: evaluating 100 questions by hand is doable. 1,000 is painful. 10,000 in continuous monitoring? Impossible.
That's where tools come in.
LLM-as-a-Judge: how it works
The idea is simple: an LLM evaluates the chatbot's answers based on defined criteria. In very simplified terms, it needs:
- The question
- Your system's answer
- The gold standard (what the ideal answer should look like)
The evaluator then outputs a conclusion and justification.
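A minimal judge can be little more than a prompt that packages these three inputs and asks for a verdict per dimension plus a justification. The sketch below is deliberately generic: `call_llm` is a hypothetical placeholder for whatever model API you use, and the prompt wording is only an assumption.

```python
import json

JUDGE_PROMPT = """You are evaluating a chatbot answer.

Question:
{question}

Chatbot answer:
{answer}

Gold-standard answer:
{gold}

For each dimension (correctness, completeness, tone), reply with "good" or
"not good" and give a short justification. Answer as JSON:
{{"correctness": ..., "completeness": ..., "tone": ..., "justification": ...}}"""

def judge(question: str, answer: str, gold: str) -> dict:
    prompt = JUDGE_PROMPT.format(question=question, answer=answer, gold=gold)
    raw = call_llm(prompt)   # hypothetical: your model/API call goes here
    return json.loads(raw)   # assumes the model returns valid JSON
```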
The biggest risk: you replace one problem (evaluating the chatbot) with another (evaluating the evaluator). That's why the automated evaluation must be calibrated. To do this, we typically take 50 to 100 manually evaluated examples and have them assessed by the LLM as well. If the results align, the judge is working reliably.
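Calibration then boils down to comparing the judge's verdicts with the human ones on those examples. A rough sketch, reusing the structures from above; the acceptance threshold is an assumption, not a fixed rule.

```python
def calibrate(judged: dict[str, dict], human: dict[str, dict], dimension: str) -> float:
    """Share of questions where the LLM judge and the human verdict agree on one dimension."""
    matches = [
        judged[qid][dimension] == human[qid][dimension]
        for qid in human
        if qid in judged
    ]
    return sum(matches) / len(matches) if matches else 0.0

# Illustrative threshold: only trust the judge once agreement is high enough,
# e.g. above ~0.9 on the 50 to 100 manually evaluated examples.
```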
After that, continuous improvement begins; more on that another time. At the end of these iterations comes the big moment: go-live.
Go-live and continuous monitoring
We recommend starting the go-live quietly at first, without a big announcement. That way, the chatbot can be further improved in the first days based on real user questions.
But the work isn't done once it's live: ongoing evaluation is crucial. Metrics that are particularly helpful include the following (a minimal monitoring sketch follows the list):
- Share of unanswered questions
- Groundedness (i.e., is the bot hallucinating, or are its answers grounded in the sources?)
- Human spot checks, especially where evaluations are weak or groundedness is low
- And last but not least: user feedback.
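A monitoring report along these lines can be aggregated from logged interactions. The sketch assumes each interaction records whether the bot answered, a groundedness verdict (e.g. from the LLM judge), and optional user feedback; the field names and the `interactions` structure are illustrative.

```python
def monitoring_report(interactions: list[dict]) -> dict:
    """Aggregate simple health metrics over logged chatbot interactions.

    Assumed fields per interaction:
      "answered": bool        - did the bot give an answer at all?
      "grounded": bool        - judged as grounded in the sources?
      "feedback": int | None  - optional user feedback (+1 / -1)
    """
    n = len(interactions)
    if n == 0:
        return {}
    unanswered = sum(not i["answered"] for i in interactions) / n
    grounded = sum(bool(i.get("grounded")) for i in interactions) / n
    feedback = [i["feedback"] for i in interactions if i.get("feedback") is not None]
    return {
        "share_unanswered": unanswered,
        "share_grounded": grounded,
        "avg_feedback": sum(feedback) / len(feedback) if feedback else None,
        # Interactions flagged here are good candidates for human spot checks.
        "spot_check_candidates": [i for i in interactions if not i.get("grounded")],
    }
```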
With clear, easy-to-understand metrics, even a chatbot handling 10,000 or more questions can be reliably monitored, without having to review every single one.
Evaluation is not a "nice-to-have"
The difference between a successful chatbot and a failed one doesn't lie in the best embedding model, the latest LLM, or the cleverest retrieval algorithm.
It lies in the willingness to invest time in evaluation.
In human evaluations.
In automated monitoring.
In continuous improvement.
That's the only way to build trust: the foundation for adoption.