Ship Prompts Like Software: Regression Testing for LLMs

Because "it seemed fine when I tested it" is not a deployment strategy. Part 4 of 4: Evaluation-Driven Development for LLM Systems

This is Part 4 of a four-part series on Evaluation-Driven Development for LLM systems. We have covered why traditional testing fails, how to build golden datasets, and automated evaluation techniques. This post builds on all three. If you haven't read the earlier posts, start there.


You rewrote your system prompt. The answers look crisper, more concise, more professional. You tried a handful of questions and everything checked out. You deployed on Friday afternoon.

By Monday, the support tickets start rolling in. New hires can't get onboarding answers from the internal Q&A bot. The prompt change that improved benefits questions quietly destroyed onboarding guidance. Nobody noticed because nobody tested onboarding after making a change to benefits.

This is a regression. In traditional software, a regression is when something that used to work breaks after a change. The same thing happens with LLM systems, but it is much harder to detect. There is no compiler error. No stack trace. Just a subtle shift in quality that slips past manual spot checks.

Regression testing for LLMs means running your full evaluation suite every time you make a change. Not just testing the thing you changed. Testing everything.

The naive approach vs. the systematic approach

Most teams iterate on prompts like this:

1. Change the prompt
2. Try a few questions manually
3. Think "yeah, that looks better"
4. Deploy

This is how regressions happen. You are sampling a tiny slice of your input space and assuming it represents the whole picture. It does not.

The systematic approach looks different:

1. Run your full evaluation suite against the current prompt
2. Record scores by category
3. Modify the prompt
4. Run the same evaluation suite against the new prompt
5. Compare scores side by side, category by category
6. Only deploy if the new prompt is better overall

In Python (with golden_set, run_assistant, and evaluate standing in for your own dataset, pipeline, and scoring function):

from collections import defaultdict

def score_prompt(prompt):
    """Run every golden-set case against one prompt version."""
    scores = {}
    for case in golden_set:
        response = run_assistant(case.question, prompt=prompt)
        score = evaluate(response, case.expected_answer)
        scores[case.id] = {"category": case.category, "score": score}
    return scores

v1_scores = score_prompt(PROMPT_V1)
v2_scores = score_prompt(PROMPT_V2)

def category_averages(scores):
    """Average the per-case scores within each category."""
    by_category = defaultdict(list)
    for entry in scores.values():
        by_category[entry["category"]].append(entry["score"])
    return {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}

# Compare by category, not just overall
v1_avg = category_averages(v1_scores)
v2_avg = category_averages(v2_scores)
for category in v1_avg:
    print(category, v1_avg[category], v2_avg[category],
          v2_avg[category] - v1_avg[category])

The key detail: you compare by category, not just overall. Overall averages hide regressions.

A regression in action

Say you maintain an internal knowledge base Q&A bot. It answers employee questions about benefits, onboarding, IT policies, and time off. You rewrite the system prompt to be more concise, cutting it from 500 tokens down to 200. The answers feel tighter. You test a few benefits questions and they look great.

Here is what the full eval suite reveals:

Category        | Prompt v1  | Prompt v2  | Change
----------------|------------|------------|-------
Benefits        |    90%     |    95%     |  +5%
Onboarding      |    85%     |    70%     | -15%
IT Policies     |    88%     |    87%     |  -1%
Time Off        |    82%     |    83%     |  +1%
----------------|------------|------------|-------
Overall         |    86%     |    84%     |  -2%

The overall average only dipped 2 points. Easy to dismiss. But onboarding dropped 15 points. That is a serious regression, and it is completely invisible if you only look at the aggregate number.

What happened? The original prompt had specific instructions about walking new employees through multi-step onboarding processes. The concise rewrite stripped those instructions out. Benefits answers improved because the shorter prompt gave the model more room to work with. Onboarding answers collapsed because the model lost critical context.

Without categorized evaluation, you would have shipped this. The handful of benefits questions you tested manually would have looked like an improvement.
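Here is that table reproduced in a few lines of Python, using the same hypothetical per-category averages. The 5-point regression threshold is an arbitrary choice for illustration, not a recommendation:

```python
# Hypothetical per-category averages matching the table above.
v1 = {"benefits": 0.90, "onboarding": 0.85, "it_policies": 0.88, "time_off": 0.82}
v2 = {"benefits": 0.95, "onboarding": 0.70, "it_policies": 0.87, "time_off": 0.83}

# The aggregate view: a small, easily dismissed dip.
overall_change = sum(v2.values()) / len(v2) - sum(v1.values()) / len(v1)
print(f"overall: {overall_change:+.1%}")

# Per-category deltas expose the regression the aggregate hides.
for category in v1:
    delta = v2[category] - v1[category]
    flag = "  <- REGRESSION" if delta <= -0.05 else ""
    print(f"{category}: {delta:+.0%}{flag}")
```

The overall delta stays within rounding noise while the onboarding line gets flagged, which is exactly the failure mode a single aggregate number cannot show you.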

The four comparison outcomes

When you compare two prompt versions across your full eval suite, you will see one of four patterns.

New prompt is better everywhere. This is rare. When it happens, ship it with confidence. You found a genuine improvement.

New prompt is worse everywhere. Easy decision. Revert and try a different approach.

Better on some categories, worse on others. This is the most common outcome, and the hardest to navigate. You need to make a judgment call. Which categories matter most? In the Q&A bot example, a 5% gain on benefits is probably not worth a 15% loss on onboarding. But if the worse category is an edge case that affects very few users, the tradeoff might be acceptable. This is where knowing your users matters more than knowing your metrics.

Results are roughly the same. The change did not make a meaningful difference. In this case, the simpler prompt wins. Less complexity means fewer things to maintain, fewer things to break, and easier debugging when something does go wrong.
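The four outcomes above can be mechanically classified once you have per-category deltas. A sketch, where the 2-point tolerance below which a change counts as noise is an assumption you should tune for your own eval suite:

```python
def classify(deltas, tolerance=0.02):
    """Classify a prompt comparison into one of the four outcomes.

    `deltas` maps category -> (new_score - old_score). Changes within
    `tolerance` of zero are treated as noise rather than real movement.
    """
    better = [cat for cat, d in deltas.items() if d > tolerance]
    worse = [cat for cat, d in deltas.items() if d < -tolerance]
    if better and not worse:
        return "better everywhere: ship"
    if worse and not better:
        return "worse everywhere: revert"
    if better and worse:
        return "mixed: weigh tradeoffs by category importance"
    return "roughly the same: keep the simpler prompt"

print(classify({"benefits": 0.05, "onboarding": -0.15}))  # mixed outcome
```

Only the first two outcomes are automatable decisions; the mixed case still needs the human judgment call described above.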

What to track over time

Each evaluation run produces data. Do not throw it away after the comparison. Track it over time.

Category scores, not just overall averages. An overall score of 85% could mean uniformly solid performance, or it could mean 95% on easy questions and 50% on hard ones. Break it down. Track each category independently. This is where you spot trends before they become problems.

Score trends across versions. Are things getting better over time? Flat? Slowly degrading? A downward trend means your changes are introducing more regressions than improvements. That is a signal to slow down and investigate rather than keep iterating.

Per-case history. Some test cases fail over and over, regardless of prompt changes. These persistent weak spots usually indicate a fundamental limitation of your approach (missing context in the knowledge base, ambiguous questions that need clarification) rather than something you can fix with prompt tweaking.

Failure patterns. Are failures scattered randomly or clustered in one category? Clustered failures point to a specific, fixable problem. Random failures suggest something more systemic, like insufficient context retrieval or a model that struggles with your domain.
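One lightweight way to keep all four kinds of history is to append one JSON record per eval run to a log file. A sketch, where the file location and record shape are assumptions, not a prescribed format:

```python
import json
import time
from pathlib import Path

HISTORY = Path("results/history.jsonl")  # hypothetical location

def record_run(prompt_version, category_scores):
    """Append one JSON line per eval run so history survives the comparison."""
    HISTORY.parent.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "prompt_version": prompt_version,
        "scores": category_scores,  # {"onboarding": 0.85, ...}
    }
    with HISTORY.open("a") as f:
        f.write(json.dumps(record) + "\n")

def score_trend(category):
    """Return one category's score across all recorded runs, oldest first."""
    runs = [json.loads(line) for line in HISTORY.read_text().splitlines()]
    return [run["scores"].get(category) for run in runs]
```

Plotting `score_trend("onboarding")` over a few weeks of runs is usually enough to spot a slow downward drift long before any single comparison flags it.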

Three maturity levels for regression testing

You do not need a fully automated pipeline on day one. Start where you are and grow into more automation as your system matures.

Level 1: Manual but consistent

Run evals before every change. Keep a log of results.

# Before changing the prompt
run_eval_suite > results/baseline_2026_02_20.json

# After changing the prompt
run_eval_suite > results/new_prompt_2026_02_20.json

# Compare
compare_results results/baseline_2026_02_20.json results/new_prompt_2026_02_20.json

This works for solo developers and small teams. The discipline matters more than the tooling. If you run evals consistently before every change and log the results, you will catch most regressions. The failure mode here is forgetting to run the eval, or skipping it because you are "just making a small change."
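The compare_results step does not need to be fancy. A minimal version, assuming each results file maps case IDs to records with a category and a score (the shape produced by the eval loop earlier in this post):

```python
import json
import sys
from collections import defaultdict

def load_category_averages(path):
    """Average per-category scores from one eval results file."""
    with open(path) as f:
        results = json.load(f)  # {case_id: {"category": ..., "score": ...}}
    by_category = defaultdict(list)
    for entry in results.values():
        by_category[entry["category"]].append(entry["score"])
    return {cat: sum(vals) / len(vals) for cat, vals in by_category.items()}

def compare_results(baseline_path, candidate_path):
    """Print a side-by-side, per-category comparison of two eval runs."""
    baseline = load_category_averages(baseline_path)
    candidate = load_category_averages(candidate_path)
    for category in sorted(baseline):
        new = candidate.get(category, 0.0)
        delta = new - baseline[category]
        print(f"{category:15s} {baseline[category]:.0%} -> {new:.0%} ({delta:+.0%})")

if __name__ == "__main__" and len(sys.argv) == 3:
    compare_results(sys.argv[1], sys.argv[2])
```

Twenty-odd lines like these are enough to stay at Level 1 indefinitely; the hard part is running them every single time.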

Level 2: CI/CD integration

Run evals automatically on every pull request that touches prompts or application code.

# In your CI pipeline config (GitHub Actions-style)
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/**'

jobs:
  evaluate:
    steps:
      - run: run_eval_suite
      - run: post_results_to_pr_comment
      - run: fail_if_regression_detected

Every PR that modifies a prompt triggers a full evaluation. Results appear as a comment in the PR. Reviewers see exactly how the change affects quality across every category. No one has to remember to run evals because the system does it for them.

This is where most teams should aim to be. It catches regressions before they reach production and makes evaluation results part of the code review conversation.
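A minimal sketch of what could sit behind the fail_if_regression_detected step, assuming the pipeline has already reduced both runs to per-category averages. The 5-point threshold is an arbitrary starting point, not a recommendation:

```python
import sys

def fail_if_regression(baseline, candidate, threshold=0.05):
    """Exit nonzero if any category dropped by more than `threshold`.

    `baseline` and `candidate` map category -> average score. A nonzero
    exit code is what makes the CI job (and therefore the PR check) fail.
    """
    regressions = {}
    for cat, old_score in baseline.items():
        delta = candidate.get(cat, 0.0) - old_score
        if delta < -threshold:
            regressions[cat] = delta
    if regressions:
        for cat, delta in regressions.items():
            print(f"REGRESSION in {cat}: {delta:+.0%}")
        sys.exit(1)
    print("no regressions detected")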

Level 3: Continuous monitoring

Run evals on a schedule against your production system. This catches a class of problems that pre-deployment testing cannot: changes from the model provider.

Model providers update their models without warning. A prompt that worked last month might behave differently today because the underlying model shifted. Scheduled evals (daily, weekly) against production detect this drift before users report it.

This level also catches issues from changes to your knowledge base, external API behavior, or any other dependency that your pre-deployment tests might not cover.

A note on A/B testing in production

Regression testing happens before deployment. A/B testing happens after. The concept is familiar from web development: serve two versions to different users and compare outcomes.

For LLM systems, A/B testing is harder than it sounds. Sample sizes per variant tend to be small. Outcomes are subjective. User satisfaction depends on context that is difficult to capture in a metric. Practical A/B testing for LLMs usually combines automated evaluation on logged responses with explicit user feedback (thumbs up, thumbs down, or similar signals).

This is an advanced topic. Get pre-deployment regression testing right first. It catches the majority of problems and does not require production traffic splitting infrastructure.

Key Takeaways

  • A regression in an LLM system is a silent quality drop. There is no stack trace, no failing build. The only way to detect it is to test everything after every change.
  • Compare prompt versions with data, not gut feeling. Run the full eval suite for both versions and compare scores by category.
  • Overall averages hide regressions. A 2% overall dip can mask a 15% collapse in a single category. Always break scores down by category.
  • Four outcomes when comparing prompts: better everywhere (ship), worse everywhere (revert), mixed results (weigh tradeoffs by category importance), or roughly the same (simpler prompt wins).
  • Track category scores, score trends, per-case history, and failure patterns over time. This historical data is what turns evaluation from a one-time check into a continuous quality signal.
  • Start with manual but consistent evaluation. Automate into CI/CD when you are ready. Add production monitoring when your system is mature enough to warrant it.

Series Recap

This four-part series has covered a complete lifecycle for building LLM systems you can trust.

In Part 1, we established why traditional software testing breaks down for LLMs. The outputs are nondeterministic, correctness is subjective, and "it worked when I tried it" tells you almost nothing about real-world performance. Evaluation-driven development is the alternative: measure first, then change, then measure again.

In Part 2, we built the foundation. Golden datasets give you a stable, categorized set of test cases to measure against. Starting with 10 to 15 cases across multiple categories (happy path, edge cases, adversarial inputs, questions outside scope) gives you enough signal to make informed decisions.

In Part 3, we filled the evaluation toolbox. Deterministic checks for structural properties. Semantic similarity for meaning comparison. LLM as judge for nuanced quality assessment. Each technique has a cost, a speed, and a sweet spot. Use the cheapest one that answers your question.

In Part 4 (this post), we closed the loop with regression testing. Run your full suite before and after every change. Compare by category, not just overall. Track results over time. Automate when you can.

The full cycle looks like this:

Build golden set
    -> Measure baseline
    -> Make a change
    -> Run full eval suite
    -> Compare results by category
    -> Deploy (if better) or revert (if worse)
    -> Add new failures to the golden set
    -> Repeat

Every iteration through this loop makes your system more robust. The golden set grows. The eval suite catches more edge cases. The historical data gives you a clearer picture of what is working and what is not.

The teams that ship reliable LLM systems are not the ones with the cleverest prompts. They are the ones that measure, compare, and iterate with discipline. Evaluation-driven development is that discipline.
