Your Golden Dataset Is Worth More Than Your Prompts
In Part 1, we talked about why traditional testing breaks down for LLM systems. The outputs are non-deterministic, correctness is subjective, and assertEqual can't tell you whether a response is actually good.
So if exact string matching won't work, what do you measure against?
You measure against a golden dataset.
What is a golden dataset?
A golden dataset is a collection of input/output pairs where you have defined what a correct response looks like. Each entry contains a question your system should handle, along with criteria that describe what a good answer includes (and what it should not include).
"Golden" means these are your reference answers. The standard you measure everything against.
Here is what a golden test case looks like for an internal knowledge base Q&A bot:
{
  "id": "hp_001",
  "question": "What is our PTO policy for new employees?",
  "expected_answer": "New employees receive 15 days of PTO in their first year, accruing at 1.25 days per month.",
  "expected_found_in_handbook": true,
  "must_mention": ["15 days", "first year", "accrual"],
  "category": "happy_path"
}
That expected_answer is not the only acceptable response. It is a reference point. Your evaluation checks whether the actual response captures the same facts, not whether it matches character for character.
This is the critical shift from traditional testing. You are defining what good looks like, not what the exact string should be.
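In code, that shift can be as simple as checking for required facts instead of asserting string equality. Here is a minimal sketch of a must_mention check against the golden case above — the bot_response text and the check_must_mention helper are invented for illustration; a real evaluation would usually layer semantic similarity or an LLM judge on top:

```python
# Hypothetical sketch: grading one response against a golden test case.
# Field names (must_mention, expected_answer) follow the example above.

golden_case = {
    "id": "hp_001",
    "question": "What is our PTO policy for new employees?",
    "expected_answer": "New employees receive 15 days of PTO in their first "
                       "year, accruing at 1.25 days per month.",
    "must_mention": ["15 days", "first year", "accrual"],
}

def check_must_mention(response: str, case: dict) -> list:
    """Return the required phrases missing from the response (case-insensitive)."""
    text = response.lower()
    return [phrase for phrase in case["must_mention"]
            if phrase.lower() not in text]

# An invented response with different wording than expected_answer —
# it still passes, because the facts are present.
bot_response = ("New hires get 15 days of PTO during their first year; "
                "accrual happens at 1.25 days per month.")
print(check_must_mention(bot_response, golden_case))  # → []
```

Note that the response above would fail a character-for-character comparison with expected_answer, yet it contains every required fact — which is exactly the point.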
Why it is your most valuable asset
Prompts are easy to rewrite. Model versions change. Retrieval strategies get swapped out. But your golden dataset captures something harder to replace: your organization's definition of what "correct" means for your specific use case.
When a new engineer joins and asks "what should this bot say about our expense policy?", the golden dataset has the answer. When you swap from one model to another, the golden dataset tells you if the new model is better or worse. When someone changes the system prompt, the golden dataset catches the regressions.
Prompts are instructions. The golden dataset is the specification.