Four Ways to Grade an LLM (Without Going Broke)

Your evaluation technique should match the question you're asking, not your ambition.

(Part 3 of 4: Evaluation-Driven Development for LLM Systems)

In Part 2, we built golden datasets: curated input/output pairs that define what "correct" looks like. Now we need something to compare our system's actual output against those golden answers. That something is an evaluation technique.

The problem is that evaluation techniques range from free and instant to expensive and slow. Picking the wrong one wastes money. Picking only the cheap one misses real quality issues. The trick is knowing which tool to reach for and when.

The Evaluation Toolbox

No single technique works for everything. Here is the hierarchy, from simplest to most powerful:

Technique            | Cost      | Speed     | Best for
---------------------|-----------|-----------|------------------------------------------
Deterministic checks | Free      | Instant   | Structural properties, booleans, format
Semantic similarity  | Cheap     | Fast      | "Same meaning" comparison
LLM as judge         | Moderate  | Slow      | Quality, correctness, subjective criteria
Human evaluation     | Expensive | Very slow | Validating your automated metrics

The core principle: use the cheapest technique that answers your question. If a boolean check tells you what you need to know, do not spin up an LLM judge for it.

We will walk through each technique using the same running example: an internal knowledge base Q&A bot. Employees ask it questions about company policies (PTO, onboarding, expense reports), and it answers based on the employee handbook.

Technique 1: Deterministic Checks

The simplest form of evaluation. You check concrete, verifiable properties of the response.

Our KB bot has a handbook_found boolean. When someone asks about something that is not in the handbook ("What's the CEO's favorite pizza topping?"), the bot should set handbook_found = false and decline to answer. A deterministic check handles this perfectly:

```python
def word_count(text: str) -> int:
    return len(text.split())

# Did the bot correctly flag that this topic isn't in the handbook?
assert response.handbook_found == expected.handbook_found

# Does the PTO answer mention the required policy details?
assert "15 days" in response.answer
assert "accrual" in response.answer.lower()

# Is the response a reasonable length (not empty, not a novel)?
assert 20 < word_count(response.answer) < 500
```

These checks are binary. Pass or fail. No ambiguity.

Strengths: Fast, free, deterministic, no external dependencies. You can run thousands of these in under a second.

Weaknesses: They cannot evaluate quality. The response "PTO accrual of 15 days is a terrible policy and you should quit" passes the keyword check just fine. Deterministic checks also cannot handle paraphrasing. If the golden answer says "15 days" but the bot says "three weeks," a string match fails even though the answer is correct.

Design tip: Build these first. They catch the obvious structural failures before you spend money on anything else.
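In practice, "build these first" means a cheap pre-pass that runs every structural check over the whole golden set before any paid evaluation. A minimal sketch, where `cases`, `ask_bot`, and the field names are hypothetical stand-ins for your golden dataset and your bot's entry point:

```python
def run_deterministic_checks(cases, ask_bot):
    # Run the cheap structural checks over every golden case, collecting
    # failures instead of stopping at the first assert, so a single run
    # reports everything that is structurally broken.
    failures = []
    for case in cases:
        response = ask_bot(case["question"])
        if response["handbook_found"] != case["handbook_found"]:
            failures.append((case["question"], "handbook_found mismatch"))
        for keyword in case.get("required_keywords", []):
            if keyword.lower() not in response["answer"].lower():
                failures.append((case["question"], f"missing keyword: {keyword}"))
        n_words = len(response["answer"].split())
        if not 20 < n_words < 500:
            failures.append((case["question"], f"bad length: {n_words} words"))
    return failures
```

Only the cases that survive this pass are worth spending embedding or judge tokens on.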

Technique 2: Semantic Similarity

Semantic similarity converts two pieces of text into embedding vectors and measures the distance between them. If two sentences mean the same thing, their embeddings will be close together in vector space, even if the exact words differ.

For our KB bot, say the golden answer for "How much PTO do I get?" is:

"New employees receive 15 days of PTO per year, accruing monthly starting after the 90-day probation period."

And the bot responds:

"You get 15 PTO days annually. They start accruing each month once you've completed your first 90 days."

Different wording, same meaning. A semantic similarity check catches this.

Score ranges (0 to 1):

  • 0.9+ means the responses say essentially the same thing
  • 0.7 to 0.9 means they are related but differ in detail
  • Below 0.7 means they are likely saying different things

```python
score = semantic_similarity(response.answer, golden.expected_answer)
assert score > 0.85
```
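The `semantic_similarity` call above is left abstract on purpose, but under the hood it is usually just cosine similarity between embedding vectors. A minimal sketch, assuming an `embed()` function supplied by whatever embeddings provider you use (a placeholder here, not a specific API):

```python
import math

def cosine_similarity(vec_a, vec_b):
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction, 0.0 means orthogonal (unrelated in embedding space).
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

def semantic_similarity(text_a, text_b, embed):
    # `embed` is any sentence-embedding function (e.g. a call to your
    # embeddings provider); it is a placeholder, not a real library API.
    return cosine_similarity(embed(text_a), embed(text_b))
```

The quality of the score depends entirely on the embedding model; the math itself is this simple.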

Strengths: Handles paraphrasing naturally. Cheap and fast compared to LLM judges.

Weaknesses: Here is the critical limitation. Two sentences can be semantically similar but factually wrong. Consider:

  • "Employees receive 15 days of PTO per year."
  • "Employees receive 25 days of PTO per year."

These sentences have nearly identical structure, topic, and vocabulary. Their similarity score will be high. But one of them is wrong by ten days. The factual error is subtle in embedding space.

This makes semantic similarity good for coarse checks ("Is the response about the right topic? Does it roughly match the reference?") but unreliable for factual accuracy. If the difference between a correct and incorrect answer is a single number or name, semantic similarity will not catch it.
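One cheap mitigation is to pair the similarity score with a deterministic guard on the specifics. A minimal sketch (the `numbers_match` helper is hypothetical, not from any library): extract every number from the golden answer and require it to also appear in the response.

```python
import re

def numbers_match(response: str, golden: str) -> bool:
    # Guard for the failure mode above: every number in the golden answer
    # must also appear somewhere in the response. It will not catch
    # rephrasings like "three weeks" for "15 days", but it is free and it
    # flags exactly the single-digit factual drift that embeddings miss.
    golden_numbers = set(re.findall(r"\d+(?:\.\d+)?", golden))
    response_numbers = set(re.findall(r"\d+(?:\.\d+)?", response))
    return golden_numbers <= response_numbers

numbers_match("You get 15 PTO days annually.",
              "Employees receive 15 days of PTO per year.")   # True
numbers_match("Employees receive 25 days of PTO per year.",
              "Employees receive 15 days of PTO per year.")   # False
```

Run it alongside the similarity score: a high score plus a failed number check is precisely the "similar but wrong" case this section warns about.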
