Make Your AI Quiz You. That's the Refactor for Cognitive Debt.

[Header image: a student sitting on chairs in front of a chalkboard in an empty classroom]

I Just Shipped 400 Lines I Don't Really Understand

Look, I've been pairing with agents long enough that I have a routine. Open a session. Describe the problem. Let the model spin up a plan. Push back on the parts that smell wrong. Approve the rest. Watch it write code. Run the tests. Ship.

The feature works. The tests pass. The PR gets merged.

And then, three weeks later, somebody asks me why the retry logic does the weird thing with exponential backoff capped at the half-hour mark, and I open my own PR and stare at it like a tourist reading a plaque in a language I almost speak.

I shipped this. I approved every line. I cannot, in this moment, defend it.

Cognitive Debt Is the Same Shape as Tech Debt

You know what tech debt is. Code that runs but is fragile. It works today; it will hurt you later. You took a shortcut and the bill comes due eventually.

Cognitive debt is the same shape, but the fragile thing is you. You shipped code without absorbing why it works. The understanding never landed. The model held it for the length of the session and then the context window closed and the understanding evaporated with it. The code is in main. The understanding is not in your head.

You get away with this for a while. Tests pass. Reviews go smoothly. Then the production incident hits at 2am and you realize you're debugging code you wrote like it's somebody else's. Because it kind of is.

MIT's Kosmyna and colleagues ran a 2025 study, Your Brain on ChatGPT, in which participants wore EEG caps while writing essays with and without an LLM. The LLM group showed measurably weaker neural engagement, and when asked to recall what they'd just "written," many couldn't quote their own work. Lee et al. at Microsoft and Carnegie Mellon (The Impact of Generative AI on Critical Thinking, 2025) surveyed knowledge workers and found that those who trust AI output more report doing less critical evaluation of it. None of this is shocking. If you outsource the thinking, the thinking does not happen in your head.

That's the debt. It accumulates quietly and you don't notice until the bill arrives.

Refactoring Pays Down Tech Debt. Questions Pay Down Cognitive Debt.

Here's the parallel I want to lean on, because it's less a metaphor than the same mechanism in two different places.

Tech debt is what you have when the code works but isn't legible. You pay it down by refactoring. Refactoring doesn't add features. The behavior of the code doesn't change. What changes is that the next person who reads it can actually understand it. Same output, less hidden cost.

Cognitive debt is what you have when the code works but you aren't legible to yourself. You pay it down by asking and answering questions about what just got shipped. The questions don't change the code. The behavior of the code doesn't change. What changes is that the next time you open the file, you actually understand it. Same output, less hidden cost.

One process for the codebase, one process for the person. Refactoring is to the repo what a quiz is to your head. Skip either one and the debt keeps compounding.

So the move I make at the end of an agent session is simple: before I merge anything I spent more than thirty minutes on with an agent, I ask the agent to quiz me on it. Not "explain it to me." Quiz me. Make me retrieve the answer from my own head. If I can't, I read the code, figure out why, and try again until I can. That's the refactor — for the person, not the code.

Four Flavors of Question (and One Flavor to Avoid)

Most quizzes the AI generates by default are garbage, because the AI's default move is "what did I just tell you?" That's a memory test, not an understanding test. You can pass it with thirty-second recall and still have no clue how to use the concept tomorrow.

The questions that actually pay down the debt fall into four flavors. I lifted this taxonomy from a working note in my own NixOS config (justin-nix) where I'd noticed the same pattern landing well in a totally different context — Claude quizzing me on the ARM target triple for a Kobo cross-compile.

Decoding — given an artifact (a config snippet, a function signature, a flag), what does each piece tell you? Tests vocabulary.
Counterfactual — if the input or state changed in some specific way, would the code still work? Why? Tests constraints.
Mechanism — describe the physical/causal process by which something happens, step by step. Tests whether you have a model or just labels.
Synthesis — given a novel requirement, what would you need to verify before acting? Tests application.
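One way to keep yourself honest about coverage is to tag each question with its flavor and check which flavors a quiz never touches. This is a minimal sketch, not anything from my actual notes: `Flavor`, `missing_flavors`, and the sample questions are hypothetical names made up for illustration.

```python
from enum import Enum


class Flavor(Enum):
    DECODING = "decoding"              # what does each piece of an artifact tell you?
    COUNTERFACTUAL = "counterfactual"  # would it still work if X changed?
    MECHANISM = "mechanism"            # how does it happen, step by step?
    SYNTHESIS = "synthesis"            # what would you verify before acting on something new?


def missing_flavors(quiz):
    """Return the flavors a quiz never exercises, in declaration order.

    `quiz` is a list of (question_text, set_of_flavors) pairs.
    """
    covered = set()
    for _question, flavors in quiz:
        covered |= set(flavors)
    return [flavor for flavor in Flavor if flavor not in covered]


# A lopsided quiz: every question is a decoding question.
lopsided_quiz = [
    ("What does the timeout flag mean?", {Flavor.DECODING}),
    ("What does each field in this config snippet tell you?", {Flavor.DECODING}),
]

print([flavor.value for flavor in missing_flavors(lopsided_quiz)])
# prints ['counterfactual', 'mechanism', 'synthesis']
```

A quiz that comes back with anything in that list is probably leaning on the easy questions.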

A bad question is one that asks you to repeat the explanation back. A good question makes you reach for the answer the code implies but doesn't spell out.

The Prompt I Actually Use

📝 Takes about ten minutes per session. Less than the time you'd spend on the same investigation three weeks from now when you've forgotten everything.

At the end of a session, before I close the tab, I paste something like this:

You just helped me ship [thing]. Don't summarize. Don't explain.

Quiz me with 4-6 questions about the code we just wrote. Cover
these flavors, not just recall:

  - decoding: what does this artifact / signature / flag mean
  - counterfactual: what would break if X changed
  - mechanism: walk me through how Y actually happens
  - synthesis: a teammate wants Z — where would you put it?

Skip questions where the answer is just "what I told you."

After each answer, confirm what I got right (and deepen it with
one extra detail). For what I missed, point to the load-bearing
thing I missed, don't just say "wrong." If I'm hand-waving, call
it.

Then I answer. Out loud or typed into the chat, whichever. The point is that the words come out of my head first, not the model's.
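If you end up pasting the prompt often enough, it can be templated. Here's a rough sketch under some assumptions: `build_quiz_prompt` and its parameters are hypothetical names, the wording follows the prompt above, and nothing here talks to a model. It only builds the text you would paste.

```python
QUIZ_TEMPLATE = """\
You just helped me ship {thing}. Don't summarize. Don't explain.

Quiz me with {low}-{high} questions about the code we just wrote. Cover
these flavors, not just recall:

  - decoding: what does this artifact / signature / flag mean
  - counterfactual: what would break if X changed
  - mechanism: walk me through how Y actually happens
  - synthesis: a teammate wants Z. Where would you put it?

Skip questions where the answer is just "what I told you."

After each answer, confirm what I got right (and deepen it with
one extra detail). For what I missed, point to the load-bearing
thing I missed, don't just say "wrong." If I'm hand-waving, call
it.
"""


def build_quiz_prompt(thing: str, low: int = 4, high: int = 6) -> str:
    """Fill in the quiz prompt for one session.

    `thing` is a short description of what was shipped; `low` and `high`
    bound the number of questions requested.
    """
    if not thing.strip():
        raise ValueError("describe what you shipped")
    if not 1 <= low <= high:
        raise ValueError("need 1 <= low <= high")
    return QUIZ_TEMPLATE.format(thing=thing.strip(), low=low, high=high)


print(build_quiz_prompt("the rate-limited webhook fan-out"))
```

The defaults match the 4-6 questions I ask for; change them if your sessions run longer.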

What a Good Quiz Looks Like

Here's the shape of a good one. Picture a session where you and the agent just built a rate-limited webhook fan-out. Each question is tagged with its flavor:

Q1 (decoding) — Look at the buffer declaration. What's the data
    structure, and what does each type parameter tell you about
    how we're using it?

Q2 (counterfactual) — We capped concurrency at 8. What happens
    at 9? What changes on the receiving side if we bump that to
    16?

Q3 (mechanism) — There's a try/except around the dispatch that
    catches one specific exception class and re-raises everything
    else. Walk me through what happens, end to end, when that
    specific class fires.

Q4 (decoding + mechanism) — If the receiver returns a 429 with a
    Retry-After header, what does our code do, and what does it
    deliberately NOT do?

Q5 (synthesis) — A teammate wants to add a "priority" lane that
    skips the queue. Where would you put it, and what would you
    have to change about the backpressure logic?

This is where the quiz earns its keep. The decoding questions go down easy; then Q3 exposes that you know what the exception is for but can't trace the path it takes; Q5 sends you back to re-read the dispatcher. By the end you understand code you supposedly already wrote. The code didn't change. You did.
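To make the questions concrete, here is roughly what the code under that quiz might look like. This is purely an illustrative sketch: the only details taken from the quiz are the concurrency cap of 8, a bounded buffer, a try/except that catches one exception class, and the 429 with Retry-After. Everything else is an assumption, including the names `RateLimited`, `deliver`, and `fan_out`, the queue size, the retry budget, and the injected `send` callable standing in for a real HTTP client.

```python
import asyncio

MAX_CONCURRENCY = 8   # the cap Q2 asks about
QUEUE_SIZE = 100      # assumed value; what matters is that the buffer is bounded
MAX_ATTEMPTS = 3      # assumed retry budget


class RateLimited(Exception):
    """Hypothetical error: the receiver answered 429 with a Retry-After value."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


async def deliver(event, send):
    """Send one event, waiting out 429s; any other exception propagates."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            await send(event)
            return
        except RateLimited as exc:
            if attempt == MAX_ATTEMPTS:
                raise  # out of attempts: the caller records the failure
            # Wait as long as the receiver asked. Note what this does NOT do:
            # it does not free the worker while it sleeps.
            await asyncio.sleep(exc.retry_after)


async def fan_out(events, send, concurrency=MAX_CONCURRENCY, queue_size=QUEUE_SIZE):
    """Deliver every event with at most `concurrency` sends in flight."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    queue = asyncio.Queue(maxsize=queue_size)  # bounded FIFO buffer
    delivered, failed = [], []

    async def worker():
        while True:
            event = await queue.get()
            try:
                await deliver(event, send)
                delivered.append(event)
            except Exception as exc:
                failed.append((event, exc))  # record it and keep the worker alive
            finally:
                queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(concurrency)]
    for event in events:
        await queue.put(event)  # blocks while the buffer is full: backpressure
    await queue.join()          # wait until every queued event has been handled
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return delivered, failed


async def _demo():
    async def send(event):  # stand-in for a real HTTP call
        await asyncio.sleep(0)

    delivered, failed = await fan_out(range(20), send)
    print(len(delivered), "delivered,", len(failed), "failed")


asyncio.run(_demo())
```

Even in a sketch this small, Q2 and Q5 have real answers: the ninth event waits in the queue until a worker frees up, and a priority lane has to get past a bounded FIFO that was never built to be skipped.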

How the Agent Should Grade You

Grading is where most AI quizzes go off the rails. The model wants to be encouraging. Encouragement is not what you need here.

The right behavior, from my working notes, is three moves:

Confirm what's right, then deepen. If you got the answer, the agent should add one piece of context you didn't mention: the answer was correct, and here's the next layer.
Fill in what's missed, without condescension. No "great try!" theater. Just: here's the part you didn't get, here's why it matters.
Point to the load-bearing detail, not pass/fail. "You said the queue is FIFO; the load-bearing detail is that it's bounded. Unbounded changes the whole backpressure story." That's worth ten "incorrect"s.

If your agent is being too nice, tell it to stop. The model holds the code; it knows where you bluffed.

When Not to Quiz Yourself

This isn't free. It's a tax you levy on the end of a session, and there are sessions where it's the wrong call.

You're executing, not learning. Mid-incident, mid-deploy, on-call at 2am — ship the fix, quiz yourself later (or don't).
The change is trivial. If the session was "rename this variable" or "add a log line," a quiz is friction without payoff.
You already know the domain cold. Quizzing yourself on the part of the codebase you wrote three years ago is theater. Save it for the unfamiliar.
There's no follow-up. If this code will never be touched again, the debt doesn't really exist. Don't pay down debt on dead code.

The pattern fits best when the session touched a real concept you'll need again, the code has more than one load-bearing detail that could be missed, and you're going to have to live with the result.
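The whole decision can be written down as a small predicate. This is a sketch of the rules above, not a tool I actually run: `Session`, its field names, and `should_quiz` are hypothetical, and the thirty-minute threshold is the one I use before merging.

```python
from dataclasses import dataclass


@dataclass
class Session:
    minutes_with_agent: int
    mid_incident: bool = False          # executing, not learning
    trivial_change: bool = False        # "rename this variable", "add a log line"
    domain_known_cold: bool = False     # code you already know inside out
    will_be_touched_again: bool = True  # False for dead or throwaway code


def should_quiz(session: Session) -> bool:
    """Apply the four exclusions first, then the thirty-minute threshold."""
    if session.mid_incident or session.trivial_change or session.domain_known_cold:
        return False
    if not session.will_be_touched_again:
        return False
    return session.minutes_with_agent > 30


print(should_quiz(Session(minutes_with_agent=90)))                     # prints True
print(should_quiz(Session(minutes_with_agent=90, mid_incident=True)))  # prints False
```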

Why This Actually Works

The learning-science people figured this out a long time before LLMs existed. Roediger and Karpicke's 2006 paper in Psychological Science, Test-Enhanced Learning, showed that being tested on material produces dramatically better long-term retention than re-reading the same material. Retrieval practice beats review. Pulling the answer out of your head is the part that builds the wiring; recognizing the answer on a page is not.

Most AI sessions are pure review. The model writes, you read, you nod, you ship. Your brain never has to retrieve anything. Quizzing flips that. Now your brain has to produce, and the gap between what you can produce and what's in the code is exactly the debt you accumulated during the session.

You can also frame this as a Feynman technique with a sparring partner. Feynman's old trick was: explain the thing in plain words to somebody who doesn't know it. If you can't, you don't understand it. The agent is a tireless somebody. It also knows where you're bluffing, because it has the code.

The Bigger Move

The doomer take on AI coding is that we're all going to forget how to program. The cope take is that we don't need to understand the code anymore. Both are wrong in the same way: they treat understanding as a side effect of typing the code, and they treat AI as either the thief of that side effect or the replacement for it.

Understanding isn't a side effect of typing. It's a side effect of retrieving. You can absolutely build understanding while letting an agent do the typing, but you have to do the retrieving on purpose, because nothing about the workflow forces it on you anymore.

So make it a step. Ship the code. Pay down the cognitive debt before you close the tab. Treat the quiz the same way you treat the tests: not optional, not later, part of done.

Try It After Your Next Session ⚡

This takes ten minutes. Do it before you forget what the session was about.

Right before you merge, paste the quiz prompt above into the same chat.
Answer out loud or in the chat. No re-reading the code first.
For every question you bluff or miss, go find the answer in the code and write a one-line comment in the file explaining it for next-you.
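For the last step, here's the kind of comment I mean, using the capped backoff from the opening as a stand-in. It's a hypothetical sketch: `backoff_delay`, `base_seconds`, and the one-second base are assumptions, and only the half-hour cap comes from the story.

```python
MAX_DELAY_SECONDS = 30 * 60  # the half-hour cap


def backoff_delay(attempt: int, base_seconds: float = 1.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Why the cap: the delay doubles per attempt, so uncapped, attempt 12
    # alone would already wait 4096 s (over an hour) with a 1 s base.
    return min(base_seconds * 2 ** attempt, MAX_DELAY_SECONDS)


print(backoff_delay(3))   # prints 8.0
print(backoff_delay(12))  # prints 1800
```

One comment like that, written right after you had to go look it up, is what saves the three-weeks-later stare.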

You'll feel dumb for about five minutes. Then you'll realize you actually understand the thing you just shipped, possibly for the first time. That feeling is the debt clearing.

Good enough beats perfect. Half a quiz beats no quiz. The point is to retrieve something from your own head before the session context disappears forever.

Header photo by Shubham Sharan on Unsplash.

Content on this blog was created using human and AI-assisted workflows described here. Original ideas and editorial decisions by Justin Quaintance.