Your Golden Dataset Is Worth More Than Your Prompts
In Part 1, we talked about why traditional testing breaks down for LLM systems. The outputs are non-deterministic, correctness is subjective, and assertEqual can't tell you whether a response is actually good.
So if exact string matching won't work, what do you measure against?
You measure against a golden dataset.
What is a golden dataset?
A golden dataset is a collection of input/output pairs where you have defined what a correct response looks like. Each entry contains a question your system should handle, along with criteria that describe what a good answer includes (and what it should not include).
"Golden" means these are your reference answers. The standard you measure everything against.
Here is what a golden test case looks like for an internal knowledge base Q&A bot:
{
  "id": "hp_001",
  "question": "What is our PTO policy for new employees?",
  "expected_answer": "New employees receive 15 days of PTO in their first year, accruing at 1.25 days per month.",
  "expected_found_in_handbook": true,
  "must_mention": ["15 days", "first year", "accrual"],
  "category": "happy_path"
}
That expected_answer is not the only acceptable response. It is a reference point. Your evaluation checks whether the actual response captures the same facts, not whether it matches character for character.
This is the critical shift from traditional testing. You are defining what good looks like, not what the exact string should be.
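In code, that shift can be as simple as checking for required facts instead of asserting string equality. Here is a minimal sketch of a must_mention check against the golden case above — the bot_response text and the check_must_mention helper are invented for illustration; a real evaluation would usually layer semantic similarity or an LLM judge on top:

```python
# Hypothetical sketch: grading one response against a golden test case.
# Field names (must_mention, expected_answer) follow the example above.

golden_case = {
    "id": "hp_001",
    "question": "What is our PTO policy for new employees?",
    "expected_answer": "New employees receive 15 days of PTO in their first "
                       "year, accruing at 1.25 days per month.",
    "must_mention": ["15 days", "first year", "accrual"],
}

def check_must_mention(response: str, case: dict) -> list:
    """Return the required phrases missing from the response (case-insensitive)."""
    text = response.lower()
    return [phrase for phrase in case["must_mention"]
            if phrase.lower() not in text]

# An invented response with different wording than expected_answer —
# it still passes, because the facts are present.
bot_response = ("New hires get 15 days of PTO during their first year; "
                "accrual happens at 1.25 days per month.")
print(check_must_mention(bot_response, golden_case))  # → []
```

Note that the response above would fail a character-for-character comparison with expected_answer, yet it contains every required fact — which is exactly the point.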
Why it is your most valuable asset
Prompts are easy to rewrite. Model versions change. Retrieval strategies get swapped out. But your golden dataset captures something harder to replace: your organization's definition of what "correct" means for your specific use case.
When a new engineer joins and asks "what should this bot say about our expense policy?", the golden dataset has the answer. When you swap from one model to another, the golden dataset tells you if the new model is better or worse. When someone changes the system prompt, the golden dataset catches the regressions.
Prompts are instructions. The golden dataset is the specification.