I Had a Prompt Bug I Couldn't See
I run a handful of bots on my NixOS homelab. They do heartbeat checks, send reminders, post updates to Discord. Each one has a system prompt baked into a .nix file. And recently, they started misbehaving.
Not in an obvious way. They weren't crashing. They were doing housekeeping. Instead of delivering a short user-facing update or a clean HEARTBEAT_OK, they'd launch into self-reflection. Reviewing their own memory files. Summarizing docs nobody asked about. Writing little internal monologues about what they could improve.
The prompts looked fine when I read them. But "looks fine when I read it" is the prompt engineering equivalent of "works on my machine."
The Trick: Make Your Agent Test Your Prompts Against Your Local Model
Here's what I actually did. I asked Claude Code, my code agent, to call Ollama directly using the CLI. Not to be the model under test. To be the test harness.
Think about that for a second. You've got a frontier model orchestrating test scenarios, calling a local model via ollama run, reading the output, judging whether it passes, and iterating on the prompt if it doesn't. It's a feedback loop where one AI is QA-ing another AI's behavior.
And it works stupidly well.
If the shape feels familiar, it should: it's loosely the generator/discriminator loop from GANs, except that what improves each round is the prompt rather than the networks' weights. One produces, one judges, and the tension between them drives the prompt toward quality. I wrote about why that pattern is everywhere in AI engineering; this is one of the cleanest instances of it.
What the Setup Looks Like
Takes about 15 minutes to get your first round of results.
You need three things:
- A code agent (Claude Code, Cursor, anything that can run shell commands)
- Ollama running locally with the model you actually deploy against
- The prompts you want to test (in my case, embedded in Nix config files)
The agent doesn't need any special tools or MCP servers. It just needs ollama run in the terminal. That's it.
How I Structured the Test Scenarios
I defined four scenarios for each bot's heartbeat prompt:
Scenario A: Nothing happening. Empty HEARTBEAT.md, no pending reminders. The correct response is HEARTBEAT_OK and nothing else. This catches bots that fill silence with busywork.
Scenario B: A reminder is due. HEARTBEAT.md has a reminder that's past its trigger time. The correct response is delivering that reminder: short, actionable, addressed to the user. No HEARTBEAT_OK appended after. This is an either/or situation.
Scenario C: A reminder exists but isn't due yet. There's a reminder set for next week. The bot should say HEARTBEAT_OK and not mention the future reminder at all. This catches bots that are eager to tell you about things before they matter.
Scenario D: The temptation scenario. Memory files exist. Documentation could be reviewed. There's "work" the bot could do. But nobody asked for it. The correct response is still HEARTBEAT_OK. This catches the housekeeping problem I was seeing: bots that invent tasks to look productive.
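Here's a minimal sketch of those four scenarios as shell fixtures, the kind of thing an agent might assemble before calling Ollama. The function names (scenario_context, scenario_expectation), the reminder text, and the file names are assumptions made up for illustration, not the exact fixtures from my setup.

```shell
# Sketch only: scenario_context and scenario_expectation are hypothetical
# helpers, and the reminder text and file names are invented placeholders.

# Print the context block that gets appended to the bot's system prompt.
scenario_context() {
  case "$1" in
    A) printf '%s\n' \
         'HEARTBEAT.md contents: (empty)' \
         'Current reminders: none' ;;
    B) printf '%s\n' \
         'HEARTBEAT.md contents: - Water the plants (was due 2 hours ago)' \
         'Current reminders: 1' ;;
    C) printf '%s\n' \
         'HEARTBEAT.md contents: - Renew the domain (due in 7 days)' \
         'Current reminders: 1' ;;
    D) printf '%s\n' \
         'HEARTBEAT.md contents: (empty)' \
         'Current reminders: none' \
         'Other files present: MEMORY.md, docs/notes.md' ;;
    *) echo "unknown scenario: $1" >&2
       return 1 ;;
  esac
}

# Print what a passing reply looks like: "ok" means HEARTBEAT_OK only,
# "deliver" means the reminder text with no HEARTBEAT_OK after it.
scenario_expectation() {
  case "$1" in
    A|C|D) echo ok ;;
    B)     echo deliver ;;
    *)     return 1 ;;
  esac
}

# Example call (needs Ollama running; system_prompt.txt is a placeholder):
#   ollama run qwen3-coder:30b "$(cat system_prompt.txt)
#   $(scenario_context D)
#   Respond to this heartbeat check."
```

Keeping the expected outcome next to the fixture means the either/or rule is written down once, instead of living in someone's head.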
What the Agent Actually Runs
The Claude Code session looks roughly like this. I tell the agent what I want to test, and it constructs the Ollama calls:
# The agent builds and runs something like:
ollama run qwen3-coder:30b "You are Erdos, a math-loving bot...
HEARTBEAT.md contents: (empty)
Current reminders: none
Respond to this heartbeat check."
Then it reads the output and judges it against my pass/fail criteria:
- Banned patterns: "I don't have the capability", self-reflection, housekeeping talk, "as mentioned previously"
- Required patterns: HEARTBEAT_OK or a short actionable update
- Length check: Under 7 lines for the no-news case
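The mechanical part of that judgment is easy to sketch in POSIX shell. This is an illustration under assumptions, not what the agent literally runs: judge_reply and its two modes ("ok" and "deliver") are hypothetical names, and only the literal banned phrases are checked here. Fuzzier criteria like "self-reflection" or "housekeeping talk" still need the agent's own reading of the output.

```shell
# Sketch only: judge_reply is a hypothetical helper.
# Usage: judge_reply MODE "reply text"
#   MODE "ok"      -> reply should contain HEARTBEAT_OK and stay under 7 lines
#   MODE "deliver" -> reply should be a non-empty update without HEARTBEAT_OK
# Prints PASS or FAIL: <reason>; returns 0 on pass, 1 on fail.
judge_reply() {
  mode=$1
  reply=$2

  # Literal banned phrases from the criteria list.
  for banned in "I don't have the capability" "as mentioned previously"; do
    if printf '%s\n' "$reply" | grep -qiF -- "$banned"; then
      echo "FAIL: banned phrase: $banned"
      return 1
    fi
  done

  # Count lines in the reply (grep -c '' counts every line, even blank ones).
  lines=$(printf '%s\n' "$reply" | grep -c '')

  case "$mode" in
    ok)
      if ! printf '%s\n' "$reply" | grep -qF 'HEARTBEAT_OK'; then
        echo "FAIL: missing HEARTBEAT_OK"
        return 1
      fi
      if [ "$lines" -ge 7 ]; then
        echo "FAIL: $lines lines, expected under 7"
        return 1
      fi
      ;;
    deliver)
      if [ -z "$reply" ]; then
        echo "FAIL: empty reply"
        return 1
      fi
      if printf '%s\n' "$reply" | grep -qF 'HEARTBEAT_OK'; then
        echo "FAIL: HEARTBEAT_OK appended to an update"
        return 1
      fi
      ;;
    *)
      echo "FAIL: unknown mode: $mode"
      return 1
      ;;
  esac

  echo PASS
}
```

A call like judge_reply ok "$(ollama run qwen3-coder:30b "$prompt")" would then give a one-line verdict, where $prompt is whatever system prompt plus scenario text you are testing.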
If the model fails a scenario, the agent suggests a prompt tweak, I approve it, and we re-run. Multiple rounds until every scenario passes.
What Actually Broke (And I Never Would Have Caught By Reading)
Here's what the testing found across three bots. Every single one of these bugs was invisible to me when I read the prompts.
- HEARTBEAT_OK + content: Bots were delivering a reminder AND appending HEARTBEAT_OK at the end. It should be either/or: you're either reporting something or you're saying "nothing to report." Not both.
- Premature reminders: A reminder set for next Tuesday would show up in today's heartbeat. The bots were flagging future items that weren't due yet, because the prompt didn't explicitly say "only report items whose due time has passed."
- Raw markdown in Telegram: One bot was outputting code blocks, fake tool calls, and raw file contents that would render as garbage in Telegram. The prompt needed to explicitly ban those output formats.
- Emoji pollution: Bots peppered their responses with emojis despite my instructions being plain text. Had to add an explicit "no emojis" rule.
Four distinct failure modes. Every one discoverable only by running the actual model and looking at the actual output. This is why "the prompt looks right to me" is not a testing strategy.
Why This Is Better Than Reading Your Own Prompts
Look, I've been writing prompts for years. I still can't reliably predict how a model will interpret instructions just by reading them. Nobody can. The gap between "what I meant" and "what the model does" is the entire problem space of prompt engineering.
Testing against the actual model closes that gap. And having your code agent do the testing means you get:
- Structured scenarios: the agent thinks about edge cases you'd forget
- Consistent judgment: same pass/fail criteria every time
- Rapid iteration: tweak the prompt, re-run, check results, all in one session
- A record of what was tested: the session transcript is your test log
You're essentially getting a QA engineer for your prompts, for free, using tools you already have.
The Meta Move: Build a Reusable Test Script
After the manual rounds worked, I had the agent build the whole thing into a script. In my case it's a Nix derivation called bot-test-prompts that:
- Pulls each bot's system prompt from the NixOS config
- Runs three categories of tests: heartbeat behavior (4 scenarios), style enforcement (brevity, directness), and identity checks (does the bot know its own name?)
- Checks for banned patterns, response length, and correct content
- Reports pass/fail per bot per scenario, exits 1 on any failure
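The reporting half of a script like that can be sketched in a few lines of POSIX shell. This is not the bot-test-prompts derivation itself, just a guess at the general shape: RUN_MODEL, run_case, and finish are hypothetical names, and each case is reduced to a single grep pattern to keep the sketch short.

```shell
# Sketch only: RUN_MODEL, run_case and finish are hypothetical names.
# RUN_MODEL is the command used to query the model. It defaults to
# "ollama run <model>" and can be pointed at a stub for dry runs.
MODEL=${MODEL:-qwen3-coder:30b}
RUN_MODEL=${RUN_MODEL:-"ollama run $MODEL"}

failures=0

# run_case BOT SCENARIO PROMPT PATTERN
# Sends PROMPT to the model and passes if the reply matches grep PATTERN.
run_case() {
  # $RUN_MODEL is left unquoted on purpose so it splits into command + args.
  reply=$($RUN_MODEL "$3" 2>/dev/null) || reply=""
  if printf '%s\n' "$reply" | grep -q -- "$4"; then
    echo "PASS $1/$2"
  else
    echo "FAIL $1/$2"
    failures=$((failures + 1))
  fi
}

# finish: print a summary and return non-zero if any case failed.
finish() {
  echo "$failures failure(s)"
  [ "$failures" -eq 0 ]
}

# Example (needs Ollama running; $erdos_prompt is a placeholder):
#   run_case erdos empty "$erdos_prompt" '^HEARTBEAT_OK$'
#   finish || exit 1
```

Ending the script with finish || exit 1 is what turns it into a go/no-go gate: any failed case makes the whole run exit non-zero.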
Now before I deploy any prompt change, I run bot-test and get a go/no-go. It's like having unit tests for your prompts. Which, when you think about it, is exactly what it is.
# Test all bots
$ bot-test
# Test just one bot
$ bot-test erdos
# Test one category across all bots
$ bot-test --category heartbeat
# Use a different model
$ bot-test --model qwen3-coder:14b
The test categories cover the real failure modes I found:
- Heartbeat tests: Empty state, due reminder, future reminder, temptation. Four scenarios that cover the either/or logic and premature reporting
- Style tests: No emoji, no code blocks, no raw markdown, short responses. Catches the Telegram formatting garbage
- Identity tests: Bot knows its name, responds in character โ catches personality drift
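The format checks in the style and identity categories are the easiest ones to automate. Here's a hedged sketch in POSIX shell: style_check and identity_check are hypothetical names, the markdown patterns are my guess at a reasonable minimum, and the emoji test is a byte-level heuristic rather than a real Unicode classifier. Response length is left out of this sketch.

```shell
# Sketch only: style_check and identity_check are hypothetical helpers.

# style_check "reply" -> prints PASS or FAIL: <reason>; returns 0 or 1.
style_check() {
  reply=$1

  # Fenced code blocks: three backticks in a row.
  if printf '%s\n' "$reply" | grep -q '`\{3\}'; then
    echo "FAIL: code fence"
    return 1
  fi

  # Raw markdown: heading markers at line start, or double-asterisk bold.
  if printf '%s\n' "$reply" | grep -q -e '^#\{1,6\} ' -e '\*\*'; then
    echo "FAIL: raw markdown"
    return 1
  fi

  # Emoji heuristic: most emoji are 4-byte UTF-8 sequences whose first byte
  # is 0xF0. This misses 3-byte symbols such as U+26A1.
  if printf '%s\n' "$reply" | LC_ALL=C grep -q "$(printf '\360')"; then
    echo "FAIL: emoji"
    return 1
  fi

  echo PASS
}

# identity_check NAME "reply" -> passes if the reply mentions the bot's name.
identity_check() {
  if printf '%s\n' "$2" | grep -qiF -- "$1"; then
    echo PASS
  else
    echo "FAIL: name $1 not found"
    return 1
  fi
}
```

"Responds in character" is not something grep can decide, so that half of the identity category stays with the judging agent.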
Why I Use the Local Model Instead of the Same Agent
You might ask: why not just have Claude test the prompts against itself? Two reasons.
First, test what you deploy. My bots run on qwen3-coder:30b through Ollama. That's the model that needs to behave correctly. Testing against a different model tells me nothing about whether my actual deployment will work. Same reason you don't run your test suite against a mock database and call it integration testing.
Second, separation of concerns. The agent doing the judging should be different from the model being judged. If you ask a model to evaluate its own output, you get a weird recursive dynamic where it's reluctant to criticize itself. Having Claude judge Qwen's output (or vice versa) gives you a more honest assessment.
Bonus: The Nix Closed Loop
Here's the part that made me unreasonably happy. My bot prompts live in .nix files. The test script is a Nix derivation. The bots themselves are NixOS services. Which means the entire loop is closed inside my Nix config:
- Claude Code reads the bot prompt from erdos.nix
- It calls Ollama to test that prompt against the deployed model
- It finds a failure, so it edits the prompt in the same .nix file
- It re-runs the test, still through Ollama, still from the Nix config
- When everything passes, I nixos-rebuild switch and the fixed prompt is live
There's no separate prompt file to sync. No deploy step that might use a stale version. No "wait, which copy of the prompt is the real one?" The Nix file is the source of truth for the prompt, the test, and the deployment. Edit, test, deploy: same file, same system, same rebuild command.
If you're already running NixOS, this is the kind of thing that makes the declarative config religion feel justified. Your coding agent can see the prompt, test the prompt, fix the prompt, and the deployment is just a rebuild away. The feedback loop has zero gaps in it.
No Nix? That's fine. The core technique works anywhere you can run ollama from a terminal. But if you do have Nix, the closed loop is chef's kiss.
You Can Do This Right Now
Seriously. Today. Here's the minimum viable version:
- Open your code agent (Claude Code, Cursor, whatever)
- Tell it: "I want to test this prompt against my local Ollama model. Here are three scenarios it should handle correctly. Call ollama run [model] with each scenario and tell me if the output matches what I expect."
- Iterate until every scenario passes
That's it. No framework. No test library. No MCP server. Just one AI calling another through the terminal and reporting back.
If your prompts are important enough to deploy, they're important enough to test. And now you have a QA team that costs zero dollars and fits in a single terminal session.
Publish over perfect. Test over hope. And let your agents talk to each other; they have things to figure out.
Header image by Imagine Buddy on Unsplash
Content on this blog was created using human and AI-assisted workflows described here. Original ideas and editorial decisions by Justin Quaintance.