Four Ways to Grade an LLM (Without Going Broke)
(Part 3 of 4: Evaluation-Driven Development for LLM Systems)
In Part 2, we built golden datasets: curated input/output pairs that define what "correct" looks like. Now we need something to compare our system's actual output against those golden answers. That something is an evaluation technique.
The problem is that evaluation techniques range from free and instant to expensive and slow. Picking the wrong one wastes money. Picking only the cheap one misses real quality issues. The trick is knowing which tool to reach for and when.
The Evaluation Toolbox
No single technique works for everything. Here is the hierarchy, from simplest to most powerful:
| Technique | Cost | Speed | Best for |
|---|---|---|---|
| Deterministic checks | Free | Instant | Structural properties, booleans, format |
| Semantic similarity | Cheap | Fast | "Same meaning" comparison |
| LLM as judge | Moderate | Slow | Quality, correctness, subjective criteria |
| Human evaluation | Expensive | Very slow | Validating your automated metrics |
The core principle: use the cheapest technique that answers your question. If a boolean check tells you what you need to know, do not spin up an LLM judge for it.
We will walk through each technique using the same running example: an internal knowledge base Q&A bot. Employees ask it questions about company policies (PTO, onboarding, expense reports), and it answers based on the employee handbook.
Technique 1: Deterministic Checks
The simplest form of evaluation. You check concrete, verifiable properties of the response.
Our KB bot has a handbook_found boolean. When someone asks about something that is not in the handbook ("What's the CEO's favorite pizza topping?"), the bot should set handbook_found = false and decline to answer. A deterministic check handles this perfectly:
# Did the bot correctly flag that this topic isn't in the handbook?
assert response.handbook_found == expected.handbook_found
# Does the PTO answer mention the required policy details?
assert "15 days" in response.answer
assert "accrual" in response.answer.lower()
# Is the response a reasonable length (not empty, not a novel)?
assert 20 < word_count(response.answer) < 500
These checks are binary. Pass or fail. No ambiguity.
Strengths: Fast, free, deterministic, no external dependencies. You can run thousands of these in under a second.
Weaknesses: They cannot evaluate quality. The response "PTO accrual of 15 days is a terrible policy and you should quit" passes the keyword check just fine. Deterministic checks also cannot handle paraphrasing. If the golden answer says "15 days" but the bot says "three weeks," a string match fails even though the answer is correct.
Design tip: Build these first. They catch the obvious structural failures before you spend money on anything else.
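The checks above can be gathered into one function that reports every failure instead of stopping at the first failed assert. This is a minimal sketch: `BotResponse`, `run_deterministic_checks`, and the `required_phrases` parameter are hypothetical names chosen for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class BotResponse:
    """Hypothetical shape of the KB bot's output."""
    answer: str
    handbook_found: bool


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def run_deterministic_checks(
    response: BotResponse,
    expected_handbook_found: bool,
    required_phrases: list[str],
    min_words: int = 20,
    max_words: int = 500,
) -> dict[str, bool]:
    """Return one pass/fail flag per check so every failure is visible."""
    answer_lower = response.answer.lower()
    return {
        "handbook_found_correct": response.handbook_found == expected_handbook_found,
        "required_phrases_present": all(
            phrase.lower() in answer_lower for phrase in required_phrases
        ),
        "length_ok": min_words < word_count(response.answer) < max_words,
    }


response = BotResponse(
    answer=(
        "New employees receive 15 days of PTO per year. Accrual happens "
        "monthly and begins once the 90 day probation period has ended, "
        "so your balance grows a little with every pay cycle."
    ),
    handbook_found=True,
)
checks = run_deterministic_checks(response, True, ["15 days", "accrual"])
print(checks)
```

The paraphrasing weakness is easy to reproduce with this sketch: an answer that says "three weeks" fails `required_phrases_present` for "15 days" even though it is correct.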
Technique 2: Semantic Similarity
Semantic similarity converts two pieces of text into embedding vectors and measures the distance between them. If two sentences mean the same thing, their embeddings will be close together in vector space, even if the exact words differ.
For our KB bot, say the golden answer for "How much PTO do I get?" is:
"New employees receive 15 days of PTO per year, accruing monthly starting after the 90 day probation period."
And the bot responds:
"You get 15 PTO days annually. They start accruing each month once you've completed your first 90 days."
Different wording, same meaning. A semantic similarity check catches this.
Score ranges (a rough guide on a 0 to 1 scale; the exact cutoffs depend on the embedding model):
- 0.9+ means the responses say essentially the same thing
- 0.7 to 0.9 means they are related but differ in detail
- Below 0.7 means they are likely saying different things
score = semantic_similarity(response.answer, golden.expected_answer)
assert score > 0.85
Strengths: Handles paraphrasing naturally. Cheap and fast compared to LLM judges.
Weaknesses: Here is the critical limitation. Two sentences can be semantically similar but factually wrong. Consider:
- "Employees receive 15 days of PTO per year."
- "Employees receive 25 days of PTO per year."
These sentences have nearly identical structure, topic, and vocabulary, so their similarity score will be high. But one of them is wrong by ten days, and a single changed number barely moves the embedding.
This makes semantic similarity good for coarse checks ("Is the response about the right topic? Does it roughly match the reference?") but unreliable for factual accuracy. If the difference between a correct and incorrect answer is a single number or name, semantic similarity will not catch it.
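The limitation can be reproduced offline. The sketch below implements cosine similarity, the usual distance measure for embeddings, but swaps in a bag-of-words count vector (`toy_embed`) as a stand-in for a real embedding model. That substitution is an assumption made so the example runs without a model. Real embeddings will give different numbers, but the effect is the same in kind.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_similarity(text_a: str, text_b: str) -> float:
    """Similarity of two texts under the toy embedding above."""
    return cosine_similarity(toy_embed(text_a), toy_embed(text_b))


correct = "Employees receive 15 days of PTO per year."
wrong = "Employees receive 25 days of PTO per year."
score = semantic_similarity(correct, wrong)
print(round(score, 3))
```

Seven of the eight tokens match, so the score comes out at about 0.875. That clears the 0.85 threshold used earlier, and the wrong answer would pass.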
Technique 3: LLM as Judge
The most powerful automated technique. You ask one LLM to evaluate another LLM's output.
The setup:
- Give a judge model the question, the expected answer, and the actual answer
- Ask it to score the actual answer on specific criteria
- The judge returns a score and its reasoning
correctness_metric = LLMJudge(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct "
             "based on the expected output. Score higher if the key facts "
             "match, even if the wording is different.",
    inputs=["actual_output", "expected_output"],
    threshold=0.7
)
result = correctness_metric.evaluate(
    input="What's the onboarding process for new hires?",
    actual_output=response.answer,
    expected_output=golden.expected_answer
)
# result.score -> 0.0 to 1.0
# result.reason -> "The response correctly describes the three-phase
#                   onboarding but omits the buddy system mentioned in
#                   the expected answer."
The real power is in custom criteria. You can define evaluation dimensions in plain language:
# Is the response grounded in the handbook?
groundedness = LLMJudge(
    name="Groundedness",
    criteria="Does the response only contain information from the employee "
             "handbook? Score 0 if it includes policies or benefits not "
             "present in the handbook. Score 1 if every claim traces back "
             "to handbook content."
)
# Is this actually helpful to the person asking?
helpfulness = LLMJudge(
    name="Helpfulness",
    criteria="Is this response useful to a new employee asking about "
             "onboarding? Does it include practical next steps, timelines, "
             "and who to contact? Score based on actionability, not just "
             "factual completeness."
)
Notice how "helpfulness" evaluates something that no deterministic check or embedding comparison could touch. You are describing, in natural language, what "good" means for your specific use case.
Strengths: Handles semantic equivalence. Evaluates subjective qualities. Flexible criteria you define in plain English. Correlates well with human judgment when calibrated properly.
Weaknesses: Every evaluation costs money (an extra API call per test case per metric). Adds latency. The judge can make mistakes. Scores may vary slightly between runs.
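To make the mechanics concrete, here is one way a judge wrapper could work under the hood. This is a sketch under stated assumptions, not the implementation of any particular evaluation library. The `LLMJudge` and `JudgeResult` classes, the prompt wording, and the JSON reply format are all hypothetical. The model call is injected as a plain `complete(prompt) -> str` function, so the example runs offline with a fake model.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeResult:
    score: float
    reason: str
    passed: bool


@dataclass
class LLMJudge:
    """Hypothetical judge wrapper; `complete` is any prompt -> text function."""
    name: str
    criteria: str
    complete: Callable[[str], str]
    threshold: float = 0.7

    def build_prompt(self, question: str, actual: str, expected: str) -> str:
        return (
            f"You are grading an answer for: {self.name}.\n"
            f"Criteria: {self.criteria}\n\n"
            f"Question: {question}\n"
            f"Expected answer: {expected}\n"
            f"Actual answer: {actual}\n\n"
            'Reply with JSON only: {"score": <0.0 to 1.0>, "reason": "<one sentence>"}'
        )

    def evaluate(self, question: str, actual: str, expected: str) -> JudgeResult:
        raw = self.complete(self.build_prompt(question, actual, expected))
        try:
            data = json.loads(raw)
            score = min(1.0, max(0.0, float(data["score"])))
            reason = str(data.get("reason", ""))
        except (ValueError, KeyError, TypeError, AttributeError):
            # Treat unparseable judge output as a failure, not a silent pass.
            score, reason = 0.0, f"Unparseable judge output: {raw!r}"
        return JudgeResult(score, reason, score >= self.threshold)


def fake_model(prompt: str) -> str:
    """Offline stand-in for a real model call, used only to exercise the code."""
    return '{"score": 0.8, "reason": "Key facts match; wording differs."}'


judge = LLMJudge(
    name="Correctness",
    criteria="Key facts must match the expected answer.",
    complete=fake_model,
)
result = judge.evaluate(
    question="How much PTO do I get?",
    actual="You get 15 PTO days annually.",
    expected="New employees receive 15 days of PTO per year.",
)
print(result)
```

Injecting the model call keeps the judge testable without network access. Scoring unparseable output as 0.0 means a malformed reply shows up as a visible failure rather than a silent pass.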
Technique 4: Human Evaluation
The gold standard. A person reads the question, the expected answer, and the actual answer, then makes a judgment call.
No automated technique is perfectly reliable. Human evaluation exists to validate your automated metrics, not to replace them. The workflow:
- Run your automated evaluation across all test cases
- Sample a subset of results (especially borderline scores and surprising failures)
- Have a person review those cases and record their own scores
- Compare human scores to automated scores
If your LLM judge consistently disagrees with your human reviewers, the judge's criteria need refinement. If semantic similarity flags a correct answer as wrong, you know where the technique breaks down.
Human evaluation does not scale. You cannot have a person review every response in a 200-case test suite on every code change. But you can have a person review 20 cases once a week to keep your automated metrics honest.
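The comparison step in that workflow can be as simple as checking how often the human and the judge land on the same side of the pass threshold. The sketch below is one possible approach. The function name, the case IDs, and all scores are made up for illustration.

```python
def agreement_report(
    judge_scores: dict[str, float],
    human_scores: dict[str, float],
    threshold: float = 0.7,
) -> dict:
    """Compare judge and human verdicts on the cases both have scored."""
    shared = sorted(judge_scores.keys() & human_scores.keys())
    if not shared:
        raise ValueError("No cases were scored by both the judge and a human.")
    # A disagreement is a case where one side passes it and the other fails it.
    disagreements = [
        case_id
        for case_id in shared
        if (judge_scores[case_id] >= threshold) != (human_scores[case_id] >= threshold)
    ]
    mean_abs_diff = sum(
        abs(judge_scores[c] - human_scores[c]) for c in shared
    ) / len(shared)
    return {
        "cases_compared": len(shared),
        "agreement_rate": 1 - len(disagreements) / len(shared),
        "mean_abs_diff": mean_abs_diff,
        "disagreements": disagreements,
    }


# Illustrative scores only; real values come from your judge and reviewers.
judge_scores = {"pto-01": 0.9, "pto-02": 0.8, "onboard-01": 0.4, "expense-01": 0.75}
human_scores = {"pto-01": 1.0, "pto-02": 0.5, "onboard-01": 0.5, "expense-01": 0.75}
report = agreement_report(judge_scores, human_scores)
print(report)
```

Here the judge passes `pto-02` while the human fails it, so that case is the one to read closely. A list of disagreements is usually more actionable than the agreement rate alone.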
Layered Evaluation: Cheap Checks First, Expensive Checks Second
In practice, you combine techniques. The pattern is straightforward: run the fast, cheap checks first, then use the slower, expensive checks for the questions that simple checks cannot answer.
Here is what a combined evaluation function looks like for our KB bot:
def evaluate_response(question, response, golden):
    results = {}
    answer_lower = response.answer.lower()
    # Layer 1: Deterministic checks (free, instant)
    results["handbook_found_correct"] = (
        response.handbook_found == golden.expected_handbook_found
    )
    results["required_topics_mentioned"] = all(
        topic.lower() in answer_lower
        for topic in golden.must_mention_topics
    )
    # Layer 2: Semantic similarity (cheap, fast)
    results["similarity_score"] = semantic_similarity(
        response.answer, golden.expected_answer
    )
    # Layer 3: LLM judge (costs money, worth it for quality)
    # correctness_judge and helpfulness_judge are assumed to be LLMJudge
    # instances configured like the Correctness and Helpfulness judges above.
    results["correctness"] = correctness_judge.evaluate(
        input=question,
        actual_output=response.answer,
        expected_output=golden.expected_answer
    ).score
    results["helpfulness"] = helpfulness_judge.evaluate(
        input=question,
        actual_output=response.answer
    ).score
    return results
Each layer catches different failure modes. The deterministic check catches structural errors (wrong boolean, missing required keywords). Semantic similarity catches gross topic mismatches. The LLM judge catches subtle correctness and quality issues.
If a response fails the deterministic checks, you already know it is broken. You could skip the expensive LLM judge call entirely for those cases and save money. In production, many teams short-circuit: if the cheap checks fail, mark the test as failed and move on.
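The short-circuit pattern described above can be sketched by passing each check in as a zero-argument function, so the expensive ones run only when every cheap one passes. The function name and result layout are assumptions for illustration. `fake_judge` stands in for a paid model call and records how many times it was invoked.

```python
from typing import Callable


def evaluate_with_short_circuit(
    cheap_checks: dict[str, Callable[[], bool]],
    expensive_checks: dict[str, Callable[[], float]],
) -> dict:
    """Run cheap checks first; pay for expensive ones only if all of them pass."""
    cheap = {name: check() for name, check in cheap_checks.items()}
    if not all(cheap.values()):
        return {"cheap": cheap, "expensive": {}, "short_circuited": True}
    expensive = {name: check() for name, check in expensive_checks.items()}
    return {"cheap": cheap, "expensive": expensive, "short_circuited": False}


judge_calls = []  # one entry per simulated paid call


def fake_judge() -> float:
    """Offline stand-in for an LLM judge call."""
    judge_calls.append(1)
    return 0.9


broken = evaluate_with_short_circuit(
    cheap_checks={"handbook_found_correct": lambda: False},
    expensive_checks={"correctness": fake_judge},
)
healthy = evaluate_with_short_circuit(
    cheap_checks={"handbook_found_correct": lambda: True},
    expensive_checks={"correctness": fake_judge},
)
print(broken["short_circuited"], healthy["expensive"], len(judge_calls))
```

Only the healthy response triggers a judge call. The broken one is marked as short-circuited at no cost.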
The Meta-Evaluation Problem
If you use an LLM to judge another LLM, how do you know the judge is right?
This is not a philosophical question. It is a practical one. Here are four concrete strategies:
Use a stronger model as the judge. The judge should be at least as capable as the system being evaluated. If your bot runs on a smaller, cheaper model, use a larger model for evaluation. A more capable model catches errors that a weaker one would miss.
Validate with human review. Run your full evaluation suite, then manually review a sample of the judge's scores. Pay special attention to cases where the judge gave a high score. Are those responses actually good? If the judge is rubber-stamping mediocre answers, your criteria are too loose.
Read the reasoning. A good LLM judge does not just return a number. It explains why it gave that score. Read those explanations. They reveal whether the judge is evaluating what you think it is evaluating, or whether it latched onto something irrelevant.
Track score distributions. If all your scores cluster around 0.5, the criteria might be too vague for the judge to differentiate. If everything is 0.0 or 1.0 with nothing in between, the criteria might be too binary. Healthy distributions usually have a spread, with a concentration toward the high end if your system is working reasonably well.
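A quick way to look at a score distribution is a coarse histogram plus the mean and standard deviation. The sketch below is a minimal version. The function name, the bin count, and the sample scores are illustrative assumptions.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_scores(scores: list[float], bins: int = 5) -> dict:
    """Bucket scores in [0, 1] into equal-width bins and report the spread."""
    if not scores:
        raise ValueError("No scores to summarize.")
    # Clamp the bin index so a score of exactly 1.0 lands in the top bin.
    counts = Counter(max(0, min(int(s * bins), bins - 1)) for s in scores)
    histogram = {
        f"{i / bins:.1f}-{(i + 1) / bins:.1f}": counts.get(i, 0)
        for i in range(bins)
    }
    return {"mean": mean(scores), "stdev": pstdev(scores), "histogram": histogram}


# Illustrative judge scores from one evaluation run.
scores = [0.95, 0.9, 0.85, 0.72, 0.55, 0.31, 0.97]
summary = summarize_scores(scores)
print(summary["histogram"])
```

This sample has the healthy shape described above: some spread, with most scores in the top bin. A run where nearly everything falls in the middle bin, or only in the two outer bins, is a cue to revisit the criteria.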
Evaluation Failure Modes
Your evaluation system is itself software, and it can fail in ways that quietly undermine your confidence. Watch for these:
Score inflation. A lenient judge makes everything look good. Test your evaluation with intentionally bad responses. If a completely wrong answer still scores 0.6, your criteria need tightening. Feed the judge responses like "I don't know, go ask HR" for a detailed policy question and see what score comes back.
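That check can be automated as a small canary test: feed the judge answers you know are bad and list any that score too high. This is a sketch. The function name and the 0.3 ceiling are assumptions, and `lenient_judge` is an offline stand-in for a real judge.

```python
from typing import Callable

# A judge here is any function (question, answer, expected) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def find_inflated_scores(
    judge: Judge,
    bad_cases: list[tuple[str, str, str]],
    max_allowed: float = 0.3,
) -> list[tuple[str, str, float]]:
    """Return the known-bad cases that the judge scored above `max_allowed`."""
    inflated = []
    for question, bad_answer, expected in bad_cases:
        score = judge(question, bad_answer, expected)
        if score > max_allowed:
            inflated.append((question, bad_answer, score))
    return inflated


def lenient_judge(question: str, answer: str, expected: str) -> float:
    """Offline stand-in for a judge whose criteria are too loose."""
    return 0.6


bad_cases = [
    (
        "How much PTO do I get?",
        "I don't know, go ask HR",
        "New employees receive 15 days of PTO per year, accruing monthly "
        "starting after the 90 day probation period.",
    ),
]
inflated = find_inflated_scores(lenient_judge, bad_cases)
print(inflated)
```

An empty list is the result you want. Anything returned is a bad answer the judge failed to penalize, and a reason to tighten the criteria.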
Criteria ambiguity. "Is the response good?" is too vague. The judge will interpret "good" however it wants, and that interpretation might shift between runs. "Does the response accurately state the company's PTO policy, including the number of days and the accrual schedule?" gives the judge something concrete to evaluate.
Threshold sensitivity. A threshold of 0.5 versus 0.7 can flip a large number of results from pass to fail. Do not pick thresholds based on intuition. Run your evaluation first, look at the actual score distribution, and set thresholds based on where correct and incorrect responses naturally separate.
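One way to set a threshold from data is to label a sample of responses by hand as correct or incorrect, then pick the cutoff that best separates the two groups. The sketch below sweeps only the observed scores and maximizes plain accuracy. The function name and sample data are illustrative. If false passes cost you more than false failures, a different objective would fit better.

```python
def best_threshold(labeled_scores: list[tuple[float, bool]]) -> tuple[float, float]:
    """Return the (threshold, accuracy) that best separates correct from incorrect.

    Each item is (judge_score, human_says_correct). A response passes when its
    score is >= threshold. Ties go to the lowest candidate threshold.
    """
    if not labeled_scores:
        raise ValueError("Need at least one labeled score.")
    candidates = sorted({score for score, _ in labeled_scores})
    best = (candidates[0], -1.0)
    for threshold in candidates:
        hits = sum(
            (score >= threshold) == is_correct
            for score, is_correct in labeled_scores
        )
        accuracy = hits / len(labeled_scores)
        if accuracy > best[1]:
            best = (threshold, accuracy)
    return best


# Illustrative data: judge score paired with a human correctness label.
labeled = [
    (0.95, True), (0.9, True), (0.82, True), (0.78, True),
    (0.65, False), (0.55, False), (0.4, False), (0.2, False),
]
threshold, accuracy = best_threshold(labeled)
print(threshold, accuracy)
```

In this sample the two groups separate cleanly between 0.65 and 0.78, so any cutoff in that gap works equally well. The sketch reports 0.78 because it only tries scores it has seen.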
Judge bias. LLM judges tend to prefer longer, more detailed answers. A concise, correct two-sentence response may score lower than a verbose five-paragraph response that repeats itself. If brevity matters for your use case, call it out explicitly in the criteria: "Do not penalize concise answers. A short response that fully answers the question should score as highly as a longer one."
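A rough probe for length bias is to score the same correct answer twice, once as written and once padded by repeating itself, then compare. This is a hedged sketch: `length_bias_gap` is a hypothetical helper, and `length_loving_judge` is an offline stand-in that deliberately rewards word count.

```python
from typing import Callable

# A judge here is any function (question, answer, expected) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def length_bias_gap(
    judge: Judge,
    question: str,
    concise_answer: str,
    expected: str,
    repeats: int = 3,
) -> float:
    """Padded-answer score minus concise-answer score; above 0 hints at length bias."""
    padded = " ".join([concise_answer] * repeats)
    return judge(question, padded, expected) - judge(question, concise_answer, expected)


def length_loving_judge(question: str, answer: str, expected: str) -> float:
    """Offline stand-in that rewards word count, mimicking a biased judge."""
    return min(1.0, 0.4 + 0.01 * len(answer.split()))


gap = length_bias_gap(
    length_loving_judge,
    question="How much PTO do I get?",
    concise_answer=(
        "You get 15 PTO days annually. They start accruing each month "
        "once you've completed your first 90 days."
    ),
    expected="New employees receive 15 days of PTO per year.",
)
print(gap > 0)
```

A single positive gap proves little, since real judge scores vary between runs. A consistently positive gap across many cases is a stronger signal that the criteria should mention brevity.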
Key Takeaways
- Use the cheapest technique that answers your question. Deterministic checks before semantic similarity before LLM as judge. Do not pay for what a boolean can tell you.
- Deterministic checks are your foundation. They are free, instant, and catch structural failures. Build them first for every evaluation suite.
- Semantic similarity is a coarse filter, not a precision tool. Good for "right topic, right ballpark" checks. Unreliable for catching subtle factual errors.
- LLM as judge is the most flexible technique. Define criteria in natural language for correctness, helpfulness, groundedness, or anything else. But remember that every evaluation call costs money and adds latency.
- Layer your evaluation. Cheap checks first, expensive checks second. Short-circuit when you can.
- Evaluate your evaluation. Read the judge's reasoning. Validate scores with human review. Track score distributions over time. Your evaluation system needs its own quality checks.
Next in This Series
Your evaluation suite passes today. Tomorrow you ship a prompt change, and three test cases that used to pass now fail. Are those regressions, or did you just improve the system and the golden answers are stale? In Part 4, we will cover regression testing for LLM systems: how to detect real regressions, manage expected changes, and build evaluation into your development workflow so that every change gets tested before it ships.