Recap: Stop Managing Tools, Express Outcomes — The Vibe-Ops Talk at Upper Bound 2026

A pilot's hand reaching across the lit instrument panel of a small aircraft cockpit at night, city lights glowing below

Part of my Upper Bound 2026 series: write-ups of the talks I caught at Amii's AI conference in Edmonton that I wanted to carry home. I sat in on this one at the NetApp Stage on Wednesday, May 20, and it hit close to home: it's fundamentally about the same thing I do every day when I pair with a coding agent, just described at the operations layer instead of the code layer.

The Talk: "Vibe-Ops: The Missing Half of Software DevOps in the Age of AI"

Johnny Chen and Tony Tam gave this one at NetApp Stage 4 on May 20, 12:30 to 13:30. The premise, stripped down: DevOps gave us pipelines for shipping code — but what happens to operations when the thing doing the operating is an AI? "Vibe-Ops" is their name for the emerging discipline of running infrastructure and software through intent rather than explicit commands. You say what outcome you want; the AI figures out the steps.

The line they kept returning to — the one I underlined in my notes:

"Stop managing tools — express desired outcomes."

That's a deceptively small sentence. If you take it seriously, it reframes the entire operator/SRE/platform-engineer job description.

Who They Are

Johnny Chen and Tony Tam presented the talk together; beyond their names I'm not going to invent bios. What the session made clear: they're practitioners, not just theorists. The patterns they laid out had the feel of things learned in production, not sketched on a whiteboard.

The Intent-Driven Execution Funnel

The backbone of the whole talk is a four-stage funnel that maps how a human thought becomes a system action:

graph LR
    subgraph "The Execution Funnel"
    A[Human Intent] -->|Natural Language| B[Prompts]
    B -->|Structured Input| C[AI Standardization]
    C -->|Resolves Ambiguity| D[System Execution]
    D -->|Infrastructure Action| E((Outcome))
    end

Human Intent — the natural-language outcome you want ("deploy the new version with zero downtime")
Prompts — the structured ask handed to the AI
AI Standardization — the model interprets intent, normalizes it into something executable, and resolves ambiguity
System Execution — the actual infrastructure action

What I find useful about this framing: it names where most of the problems live. The gap between Human Intent and Prompts is where you lose precision. The gap between Prompts and AI Standardization is where you lose predictability. Name the stages and you can actually debug which stage broke when something goes wrong — rather than blaming "the AI" in the abstract.
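To make "debug which stage broke" concrete, here is a minimal sketch of my own, not something shown in the talk. It runs a value through the four stages and tags any failure with the stage that raised it. Every name in it (`Stage`, `StageError`, `run_funnel`, and the four stand-in step functions) is hypothetical, and `standardize` is a hard-coded placeholder for what would really be a model call.

```python
from enum import Enum
from typing import Any, Callable


class Stage(Enum):
    INTENT = "Human Intent"
    PROMPT = "Prompts"
    STANDARDIZATION = "AI Standardization"
    EXECUTION = "System Execution"


class StageError(Exception):
    """An error tagged with the funnel stage where it occurred."""

    def __init__(self, stage: Stage, cause: Exception):
        super().__init__(f"{stage.value} failed: {cause}")
        self.stage = stage
        self.cause = cause


def run_funnel(intent: str, steps: list[tuple[Stage, Callable[[Any], Any]]]) -> Any:
    """Pass a value through each stage in order, tagging any failure."""
    value: Any = intent
    for stage, step in steps:
        try:
            value = step(value)
        except Exception as exc:
            raise StageError(stage, exc) from exc
    return value


# Hypothetical stand-ins for each stage (assumptions, for illustration only).
def check_intent(intent: str) -> str:
    if not intent.strip():
        raise ValueError("empty intent")
    return intent.strip()


def to_prompt(intent: str) -> str:
    return f"Goal: {intent}"


def standardize(prompt: str) -> dict:
    # Placeholder for the model call that turns a prompt into a structured plan.
    if "deploy" not in prompt.lower():
        raise ValueError("could not map the prompt to a known action")
    return {"action": "deploy", "strategy": "rolling"}


def execute(plan: dict) -> str:
    return f"executed {plan['action']} ({plan['strategy']})"


FUNNEL = [
    (Stage.INTENT, check_intent),
    (Stage.PROMPT, to_prompt),
    (Stage.STANDARDIZATION, standardize),
    (Stage.EXECUTION, execute),
]
```

Calling `run_funnel("make it faster", FUNNEL)` raises a `StageError` whose `stage` is `Stage.STANDARDIZATION`, which is a more useful bug report than "the AI broke".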

The funnel has a clear research ancestor in the ReAct loop from Yao et al. (ICLR 2023, arXiv:2210.03629), which showed that interleaving reasoning traces with action steps helps language models work through multi-step tasks. The funnel reads as ReAct generalized to the ops layer: reason about intent, act on the system, observe the result.

The Maturity Ladder

Chen and Tam laid out three levels of how teams actually use AI in operations, in ascending order of sophistication:

graph LR
    subgraph "The Maturity Ladder"
    L1["Level 1: Passive Assistant<br/>Human decides + executes"] --> L2["Level 2: Workflows<br/>Predefined sequences"]
    L2 --> L3["Level 3: Agents<br/>Dynamic routing + self-correction"]
    end

Level 1 — Passive Assistant. You ask, it answers. The human still decides and executes every step. Think Copilot for runbooks.
Level 2 — Workflows. AI executes predefined sequences. More powerful, but brittle: the workflows are hard-coded, and anything outside the happy path breaks. High maintenance overhead.
Level 3 — Agents. Dynamic routing, self-correction, and the ability to handle novel situations. The AI chooses the path, not just follows it.

The jump from Level 2 to Level 3 is where most of the hype lives — and most of the real risk. Level 2 is predictable enough to audit; Level 3 is where transparency starts to thin out. They weren't cheerleading uncritically for Level 3. The honest tradeoff, the way I captured it: faster deployment + lower onboarding + fewer syntax errors, versus reduced transparency + higher debugging complexity.
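The difference between Level 2 and Level 3 is easier to see side by side. The sketch below is my own illustration under loud assumptions: `run_workflow`, `run_agent`, the three step functions, and the `choose` policy are all hypothetical, and `choose` is a hard-coded function standing in for a model that would pick the next step.

```python
from typing import Callable, Optional

State = dict
Step = Callable[[State], State]


def run_workflow(state: State, steps: list[Step]) -> State:
    """Level 2: a fixed sequence. Any step that raises ends the run."""
    for step in steps:
        state = step(state)
    return state


def run_agent(
    state: State,
    choose: Callable[[State], Optional[Step]],
    max_steps: int = 10,
) -> tuple[State, list[str]]:
    """Level 3: a policy picks each step from the current state."""
    trace: list[str] = []
    for _ in range(max_steps):
        step = choose(state)
        if step is None:  # the policy decided the work is done
            return state, trace
        trace.append(step.__name__)
        try:
            state = step(state)
        except Exception as exc:
            # Self-correction: record the failure so the policy can react to it.
            state = {**state, "last_error": str(exc)}
    raise RuntimeError("step budget exhausted")


# Hypothetical steps (assumptions, for illustration only).
def build(s: State) -> State:
    return {**s, "built": True}


def deploy(s: State) -> State:
    if s.get("disk_full"):
        raise RuntimeError("disk full")
    return {**s, "deployed": True}


def free_disk(s: State) -> State:
    return {**s, "disk_full": False, "last_error": None}


def choose(s: State) -> Optional[Step]:
    # Stand-in for a model deciding what to do next.
    if s.get("last_error") == "disk full":
        return free_disk
    if not s.get("built"):
        return build
    if not s.get("deployed"):
        return deploy
    return None
```

Starting from `{"disk_full": True}`, the fixed workflow `[build, deploy]` simply raises, while the agent recovers by taking the path `build, deploy, free_disk, deploy`. That path only exists in the trace after the fact, which is the transparency cost in miniature.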

I've felt this in miniature. A Level 2 agent session where I hand off a clearly scoped task is something I can review step by step. A Level 3 agent that self-routes and self-corrects can do remarkable things, and it's also the session where I'll sometimes scroll back through the history and think: I'm not entirely sure how we got here.

Two Multi-Agent Patterns

The second half of the talk was about what happens when you have multiple agents coordinating. They presented two distinct patterns, and the contrast between them is worth holding onto.

Pattern A: Review and Critique

A generator produces an artifact. A critic evaluates it. The two iterate — refinement loops — until the artifact crosses a quality threshold and gets approved.

graph LR
    subgraph "Review & Critique Loop"
    G[Generator] -->|Artifact| C[Critic]
    C -->|Approval| A[Approved Artifact]
    C -.->|Feedback| G
    end

This pattern is grounded in solid research. Madaan et al.'s Self-Refine (NeurIPS 2023, arXiv:2303.17651) showed that iterative self-feedback substantially improves LLM outputs across tasks. Shinn et al.'s Reflexion (NeurIPS 2023, arXiv:2303.11366) demonstrated something adjacent but sharper: agents can learn from verbal reflection on past failures, adjusting behavior across episodes without gradient updates. Both are essentially the Review & Critique loop running in different forms.

The practical value in ops: you can use this pattern to validate a proposed config change, infrastructure plan, or deployment strategy before it touches a real system. Generator writes the runbook; critic pokes holes in it.
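The loop itself is small. Here is a minimal sketch, assuming a generator that accepts the critic's last feedback and a critic that returns `None` to approve. The function names, the round limit, and the toy runbook are my assumptions, not details from the talk, and both roles are plain functions standing in for model calls.

```python
from typing import Callable, Optional


def review_and_critique(
    task: str,
    generate: Callable[[str, Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Iterate generator and critic until the critic approves.

    Returns the approved artifact and the number of rounds it took.
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        artifact = generate(task, feedback)
        feedback = critique(artifact)
        if feedback is None:  # None means the critic approved
            return artifact, round_no
    raise RuntimeError(f"not approved after {max_rounds} rounds: {feedback}")


# Hypothetical generator and critic (assumptions, for illustration only).
def draft_runbook(task: str, feedback: Optional[str]) -> str:
    steps = [f"1. {task}"]
    if feedback and "rollback" in feedback:
        steps.append("2. rollback on failed health check")
    return "\n".join(steps)


def require_rollback(artifact: str) -> Optional[str]:
    if "rollback" not in artifact:
        return "add a rollback step"
    return None
```

With these stand-ins, `review_and_critique("deploy new version", draft_runbook, require_rollback)` is approved on the second round. The hard cap on rounds matters: without it, a generator and critic that never converge would loop forever.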

Pattern B: The Swarm

This one was a proper swarm — a lot of specialized agents (not the tidy handful you might picture), coordinated by a Dispatcher. The part that stuck with me was how much they talked to each other: dense agent-to-agent chatter, not just each one reporting up to a hub. Each specialist goes deep on its own domain; the Dispatcher routes work and arbitrates between them.

graph LR
    subgraph "The Swarm Architecture"
    S1[Specialist A] <--> D[Dispatcher]
    D <--> S2[Specialist B]
    D <--> S3[Specialist C]
    S1 -.-> S2 -.-> S3
    style D fill:#1a4d3a,stroke:#00d886
    end

This is the pattern that scales, and also the one that requires the most architectural discipline. Wu et al.'s AutoGen (arXiv:2308.08155, 2023) is the closest research articulation of this shape: multi-agent conversation as the primitive, with agents calling on each other to complete tasks. Anthropic's engineering write-up "How we built our multi-agent research system" (June 2025, anthropic.com/engineering) documents a similar orchestrator/subagent split in practice. Their orchestrator is the Dispatcher in different clothes, routing subtasks to specialized subagents and assembling the result.
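A bare-bones sketch of the Dispatcher and specialist shape, with both communication paths from the diagram: routing through the hub and direct peer-to-peer calls. The `Specialist` and `Dispatcher` classes and their methods are hypothetical names of mine, each specialist's `handle` is a stub where a real agent would do its work, and arbitration between specialists is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Specialist:
    domain: str
    # Peers are wired up by the dispatcher; excluded from repr and comparison
    # because specialists reference each other.
    peers: dict[str, "Specialist"] = field(
        default_factory=dict, repr=False, compare=False
    )

    def handle(self, task: str) -> str:
        # Stub: a real specialist would call a model or a tool here.
        return f"[{self.domain}] {task}"

    def consult(self, domain: str, question: str) -> str:
        """Ask a peer directly, without going through the dispatcher."""
        return self.peers[domain].handle(question)


class Dispatcher:
    def __init__(self, specialists: list[Specialist]):
        self.by_domain = {s.domain: s for s in specialists}
        # Give every specialist a direct line to every other specialist.
        for s in specialists:
            s.peers = {d: p for d, p in self.by_domain.items() if d != s.domain}

    def dispatch(self, domain: str, task: str) -> str:
        """Route a task to the specialist that owns the domain."""
        if domain not in self.by_domain:
            raise KeyError(f"no specialist for domain {domain!r}")
        return self.by_domain[domain].handle(task)
```

Even in this toy version the cost of the pattern is visible: every specialist holds a reference to every other one, so the number of possible conversations grows quadratically with the size of the swarm. That is where the architectural discipline comes in.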

The deeper ancestor here is Marvin Minsky's The Society of Mind (1986): intelligence emerging from many simple, specialized agents rather than one monolithic reasoner. Chen and Tam's swarm is Minsky's thesis finally running at the infrastructure layer.

The Research Base

The patterns in this talk aren't novelties — they're converging ideas from different research threads:

The intent→action loop lineage: Yao et al., "ReAct: Synergizing Reasoning and Acting in Language Models" (arXiv:2210.03629, ICLR 2023). The theoretical floor under the execution funnel.
Review & Critique — iterative refinement: Madaan et al., "Self-Refine: Iterative Refinement with Self-Feedback" (arXiv:2303.17651, NeurIPS 2023).
Review & Critique — verbal reinforcement: Shinn et al., "Reflexion: Language Agents with Verbal Reinforcement Learning" (arXiv:2303.11366, NeurIPS 2023).
The Swarm / multi-agent conversation: Wu et al., "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation" (arXiv:2308.08155, 2023).
Dispatcher + specialists in production: Anthropic, "How we built our multi-agent research system" (June 2025).
The philosophical ancestor: Marvin Minsky, The Society of Mind (1986) — intelligence as an emergent property of specialized, cooperating agents.

Why It Stuck With Me

I've spent a lot of this year thinking about what it means to pair well with a coding agent. Most of my mental model has been about the prompting side — how I express tasks, how I break things up, when to let the agent run and when to interrupt. This talk named what I've been building toward without having a word for it: I'm already doing Vibe-Ops, just at the code level. I express an outcome; the agent figures out the sequence.

Three things I'm taking home:

Name the stage that broke. When an agent session goes sideways, the funnel gives you a vocabulary: was it Intent, Prompt, Standardization, or Execution? A vague "the AI messed up" is not debuggable; a named stage is.
The tradeoffs are real and not resolved. Level 3 is genuinely more capable and genuinely harder to audit. Chen and Tam didn't paper over this. Anyone who tells you full autonomy is strictly better than oversight hasn't run it in production.
The Review & Critique pattern is underused. I already run something like it informally — ask the agent to review its own output, or spin up a second conversation to critique the first. Making that explicit and structural is the upgrade.

The line worth stealing for anyone building with or on AI systems: stop managing tools, express desired outcomes. That shift in posture — from command-giver to intent-expresser — is still a skill, and it's the skill that will actually compound.

More in the Upper Bound 2026 series: Bayesian optimization and experiment design, IT/OT security and the Purdue model, and more sessions worth unpacking.

Header photo by Chris Leipelt on Unsplash.

Content on this blog was created using human and AI-assisted workflows described here. Original ideas and editorial decisions by Justin Quaintance.