Melt Your GPU on Evals, Not Vibes

A grid of green, yellow, and red dots forming a microarray pattern

The Bug That Convinced Me My Tests Were Lying

If you ship a chatbot — internal, customer-facing, agentic, RAG-flavored, whatever — you probably have something you call an "eval." If I had to bet, it looks like a spreadsheet of canned prompts, a single run per prompt, a human eyeballing the output, and a vibe-driven "yeah that looks fine, ship it." That was my setup too. I called it a regression suite. It was a vibe check in a trench coat.

Here's the story of the bug that forced me to stop pretending, and the eval scaffolding that came out of it. The specifics are about my local LLM fleet on a melting RTX 3090, but the lessons translate one-for-one to any chatbot running against any model, hosted or local.

I've been running a fleet of local LLM agents for a while — five bots on a single RTX 3090, talking to customers and helping me run my business. They live in a Nix flake. They're reproducible. They have a regression suite. I felt fancy.

Then last week, one of the bots ("basecamp") confidently told me it couldn't read meeting transcripts. Which it definitely can. Which it had been doing fine for weeks.

My first instinct was: the prompt got too long, it's being truncated somewhere. The rendered AGENT_INSTRUCTIONS.md was 21,257 bytes. I had a vague memory of a 12K limit in OpenClaw. Hypothesis formed, gut confident, time to compress.

I was wrong about basically all of it. The investigation that followed turned into the eval scaffolding I should have had from day one.

The Wrong Diagnosis That Led to the Right Fix

I asked Claude to compress the prompt. It did one better — it actually went and read the OpenClaw source from the Nix store first to verify my hypothesis. The "12K limit" I remembered turned out to be gateway.webchat.chatHistoryMaxChars, which is for chat history API responses, not the system prompt. The actual context file limit was 2MB. I was off by three orders of magnitude.

The compression still happened, because the prompt was bloated. We folded redundant warnings, dropped a code block with wrong syntax, and compacted the shared helper modules (memory vault, lead tracking, heartbeat, cron) that four of the bots inherited. Basecamp went from 21,257 to 14,074 bytes — a 34% drop. The other bots picked up 3–10% savings as a side effect because they shared the helpers.

But then I had a new problem: how do I know I didn't break anything subtle?

I already had bot-test — a regression suite that runs canned conversations and checks for expected substrings. The trouble was that at temperature 0.3 with a 27B Q4 model, the tests flaked unpredictably. A test would pass three times and fail the fourth. I'd been ignoring that flakiness for months and treating any green run as "good enough."

That's the moment it clicked: a single-run test against a stochastic model isn't a test, it's a vibe check.

A Prompt Is Not a String

Here's the reframe that changed how I think about agent quality:

Your prompt isn't a string. It's a function. The inputs are at least:

The prompt itself
The context length (how much of the window is already full when the prompt is read)
The temperature
The top_k
The top_p
The model weights

A single-run test pins all six of those to one specific point in the configuration space and reports "pass" or "fail." That tells you almost nothing about how the prompt actually behaves across the operational envelope your bot will see in production.

The fix isn't "run more tests." The fix is to sweep the grid — run the same test across the Cartesian product of those dimensions and look at the pass rate as a surface, not a point.

What I Built: bot-test-sweep

📝 Took about a day of iterative work with Claude Code. Roughly 470 lines of Python wrapping the existing test harness.

The thing I wanted to type was this:

bot-test-sweep --temperatures 0.3,0.7,1.0 --top-ks 20,40 \
               --models-similar basecamp --category meetings

And then have it grid-sweep across every combination, store results durably, and tell me which prompt sections were robust and which silently degraded. Here's what shook out of that.

Design Decision 1: Production Parity By Default, Not By Configuration

This is the bug that scared me most.

bot-test was hardcoded to num_ctx=49152 and temperature=0.3. The production bots ran at num_ctx=49152 (correct, per-model Nix config) but temperature=0.7 (the Modelfile default). Tests said "correct." Production behaved differently. The test was passing on configurations the bot would never actually see.

If you have a test suite for your agents right now, stop reading and go check what sampling parameters it uses. I'll wait.

The fix was to delete every hardcoded value and replace it with a query:

num_ctx comes from nix eval on the rendered bot JSON config.
temperature, top_k, top_p come from Ollama's /api/show endpoint, which returns the Modelfile defaults.
User overrides happen via CLI flags only. One source of truth per value.

Quick aside — this is the part where I want to give Nix some credit. Because my whole stack is declared in a flake, nix eval against the rendered config gives me exactly the settings that would be (or already are) deployed. Not a snapshot. Not a copy in a different file. The actual value, read from the actual source of truth, computed the same way the running system computes it. Thanks Nix. (And related: AI Coding Tools Are Better at Nix Than Me — the reason I can stomach maintaining all this declarative config in the first place.)

If you're not on Nix, the equivalent is "your tests should read your real config file, not a fixture that drifts." Same principle, fewer angle brackets.

The banner now prints, every single run:

Context window: 49152 tokens (from nix — matches live)
Sampling: temperature=0.7 top_k=20 top_p=0.95 (from Modelfile — matches live)

If parity ever breaks — say someone changes the Nix config but not the Modelfile — the banner changes to something like (from Ollama context_length — architectural max, no nix match) and you know immediately that you're testing fiction.

Design Decision 2: Lost-in-the-Middle Padding

The most interesting axis turned out to be context fill percentage.

The --levels flag takes a list of percentages (default 0,25,50,75,90). At each level, the sweep synthesizes that fraction of the context window as plausible prior chat — mostly heartbeat exchanges, because that's what a long-running OpenClaw session actually looks like — and prepends it before the test message.

At 0% padding, the system prompt sits right next to the user message. At 90%, the system prompt is buried deep and the model has to attend to it through a lot of intervening text.

Tests that hold their pass rate across the sweep are robust. Tests that degrade are sensitive to context pressure, and they tell you exactly which sections of your prompt the model stops attending to under load. That's the kind of finding that single-point tests literally cannot surface.

Design Decision 3: Model Discovery, Not Model Configuration

I didn't want to maintain a list of "models similar to my production model" by hand. Models churn. Quants change. New variants ship every week.

So --models-similar basecamp queries Ollama's /api/tags, parses the details.parameter_size strings ("26.9B" → 26.9 billion), and filters within ±30% of the baseline. Pointed at my 27B production model, it auto-selects roughly 15 candidates in the 19B–35B range. No manual catalog. --list-models dumps the catalog if I want to pick by hand.

This matters because the cheapest model-portability test is "does my prompt still pass on the next quantization of this model" — and that test is annoying enough to set up by hand that most people skip it.

Design Decision 4: JSONL Crash Recovery

A 30-cell × 5-run sweep takes one to two hours on this hardware. Losing that to a kill -9 halfway through is unacceptable.

The fix is twenty lines of Python:

def append_result(path, cell, p, f, w, s, extra=None):
    rec = {"cell": {...}, "result": {"pass": p, ...}, "timestamp": ...}
    with open(path, "a") as fh:
        fh.write(json.dumps(rec) + "\n")
        fh.flush()
        os.fsync(fh.fileno())

One JSONL file per sweep configuration. Append-only. Flush and fsync after every cell. The cache key is derived from (model, bot_test args, runs), so re-running the same config resumes from the same file but a different config creates a different file. If the script dies mid-cell, the only data lost is a partial last line, which the reader silently skips.

Crash recovery isn't a nice-to-have for long sweeps. It's table stakes. The cost is trivial. The cost of not having it is "I lost the last 90 minutes, I'm going to bed."

Design Decision 5: Read-Only Dry-Run

--dry-run resolves the entire config (proves the Nix lookup works, proves the Modelfile parse works, proves the grid is what you expect) and exits without making a single Ollama call or touching the cache file.

I made the safety boundary explicit in the code:

# CACHE SAFETY: this branch MUST NOT call append_result
if args.dry_run:
    print_resolved_config()
    print_grid()
    sys.exit(0)

And there's a unit-style assertion in the parser that fails the build if anyone adds a write to the dry-run block. The whole thing is maybe ten lines. It prevents the catastrophic mistake of a "dry-run" silently overwriting two hours of real results.

Before you kick off a 30-cell × 5-run × 2-hour sweep, you want to know the grid is what you intended and the script can resolve live config. A dry-run that exits in one second is the difference between "I confirmed parity" and "I crossed my fingers."

Design Decision 6: Progress Output That Isn't Theater

Per-cell timing with a rolling-average ETA. No spinners. No carriage returns. Just flush=True on each print so you can tail the output:

[ 12/30] model=qwen3.5:27b pct=25 T=0.7 k=20: running … — ETA 31m04s
[ 12/30] model=qwen3.5:27b pct=25 T=0.7 k=20: done in 1m47s — PASS=5 FAIL=0 (100%)

The ETA is computed only from cells that actually ran, not from cache hits. So when you resume a sweep that already has 15 cells cached, you don't see a misleadingly tiny ETA after the first 15 instant skips.

How AI Made This Possible (And the Pattern Worth Stealing)

I want to be honest: I would not have built this scaffolding in a day on my own. Maybe in a week. Probably I would have given up after the third "this is overengineered for a side project" thought.

What actually happened is that I reported a symptom and Claude and I followed the seams. Each step revealed the next gap:

I report "basecamp can't read transcripts."
Claude reads the OpenClaw source from the Nix store, finds the actual relevant constants, and tells me my 12K hypothesis is wrong with citations.
Claude compresses the prompt. We ship it.
I say "make sure we didn't lose quality." Claude proposes statistical multi-run testing.
I say "what about long context?" Claude proposes padding stress tests.
I say "what's the actual production config?" Claude traces the Nix eval paths and queries /api/show, finds that test temperature 0.3 ≠ production 0.7.
I say "can we sweep models too?" Claude adds /api/tags discovery with similarity filtering.
I say "what if it crashes mid-sweep?" Claude adds the JSONL cache.
I say "give me progress." Claude adds per-cell timing and ETA.

The eval scaffolding emerged by following the seams, not by pre-designing it. Every step was a five-to-fifteen-minute increment. None of them felt like a big bet.

The patterns that actually worked, in case you want to steal them:

AI verifies, doesn't trust. When I claimed the 12K limit, Claude didn't just compress to make me happy. It grepped the source and corrected me. The compression still happened, but for the right reason. If your agent never pushes back on your hypotheses, you've trained it to be agreeable, not useful.
AI flags divergences from production loudly. The "test temperature 0.3 vs prod 0.7" discovery would have invalidated months of green test runs silently. Claude found it because it traced the actual call path. I'd never have looked.
AI audits its own safety boundaries. When I said "make sure dry-run doesn't overwrite the cache," Claude ran an actual self-test (assert "append_result" not in dry_block) on the script before declaring done. That's the kind of paranoid check you want around irreversible operations.
AI keeps a parallel investigation channel open. While I ran tests, Claude was simultaneously analyzing prompt sections by byte count, finding that the GSuite section was 8.6K of bloat, and comparing helper module usage across bots. I didn't ask. It just kept the second thread alive.

The Adoptable Patterns (Even If Your Chatbot Runs on a Hosted API)

None of this is local-LLM-specific. If you're shipping a customer support bot on GPT-4o-mini, an internal copilot on Claude, a RAG agent on Gemini — every one of these patterns applies. The hosted-model version of "melt your GPU on evals" is "burn through your API budget on evals" and the math is the same: it's cheaper than the regression you'd otherwise ship.

1. Hardcoded values in tests are technical debt waiting to drift. Whatever your prod config lives in — a YAML file, Vercel env vars, a LangSmith deployment, a Modelfile — the test harness should read it, not duplicate it. The first time prod temperature changes from 0.7 to 1.0 and tests keep happily passing at 0.3, you'll wish you'd done this.

2. A regression test that doesn't match production sampling is worse than useless. It gives you false confidence, which is more dangerous than no test at all. If your eval runs at temperature 0 for determinism but users hit temperature 0.8, you are not testing your chatbot. You are testing a different chatbot that happens to share a prompt.

3. Grid sweeps catch what single-point tests miss. The single most important axis for chatbot evals is conversation length. Real users have 30-message conversations. Your eval probably runs one-shot prompts. Inject realistic prior chat before the test message and rerun — that's where "lost in the middle" failures live, and they are the ones that pissed-off customers will write tickets about.

4. Crash recovery for long sweeps is table stakes. JSONL plus flush plus fsync per cell is twenty lines of Python and saves you hours of recompute or hundreds of dollars of API calls. Whichever currency you're paying in.

5. Read-only dry-runs prevent expensive mistakes. Before kicking off a multi-hour or multi-thousand-dollar sweep, verify the grid is what you intended. A dry-run that exits in one second pays for itself the first time it catches a typo'd model name that would have burned a weekend on the wrong target.

6. Multi-model sweeps tell you what's portable. Test your prompt against the next size up, the next size down, and the next quantization or version of your current model. If it only passes on one specific model, you've coupled your product to a SKU. Every "we have to upgrade to GPT-5" or "Anthropic deprecated the model we're on" panic is this problem.

7. Progress output should be informative, not theatrical. flush=True on each print is enough. No spinners. ETA computed from cells that actually ran, not cached hits, so a resumed sweep doesn't lie to you.

What the Sweep Will Actually Catch (Results Pending)

I'm writing this with the sweep still cooking. The full grid — every model in the similarity band, every padding level, every sampling combination, five runs per cell — is going to take roughly 48 hours of continuous GPU time on the 3090. Which is genuinely the point: the reason most teams don't do this is that nobody wants to melt their GPU for two days to learn that their prompt is fine. But the reason you should is that two days of compute is cheaper than one customer-facing regression that nobody caught.

I'll update this post with the heatmap once the sweep finishes. My priors going in:

The compressed prompt will pass at 0% padding (it has to, that's already verified).
Something will degrade above 75% padding. That something will be a section I thought was load-bearing but actually wasn't being attended to under context pressure.
At least one of the ±30% similarity models will fail a test the production model passes. That tells me which prompt behaviors are model-portable and which are accidentally Qwen-shaped.

The interesting answer isn't "did it pass." The interesting answer is the shape of where it fails. That's what a grid gives you that a vibe check can't.

[Results section to be added once the sweep completes.]

Try This Before Your Next Prompt Change ⚡

Three things. Takes 20 minutes if you have a test suite already.

Audit one hardcoded value in your test suite right now. Find one sampling parameter, one context size, one model version that's pinned in tests but configurable in production. Make tests read it from the same place production does. Confirm they still pass.
Run your existing tests three times instead of once. If you're on a stochastic model, you'll find at least one test that flakes. That test is lying to you. Either fix it or downgrade its signal in your head.
Inject 10K tokens of synthetic prior chat before your test message and run it again. If anything changes, you have a context-pressure problem you didn't know about. Now you have something to fix.

You don't need a 470-line sweep harness to start. You need to stop pretending your prompt is a string with one behavior. It's a function. Sample the surface.

Single-point tests give single-point confidence. Grids give you a map. Same effort over time, very different outcomes.

Header image by National Cancer Institute on Unsplash

Content on this blog was created using human and AI-assisted workflows described here. Original ideas and editorial decisions by Justin Quaintance.