What Changes About Planning When AI Does the Building


AI can write code faster than any developer who has ever lived. This is no longer a controversial claim — it's the observable output of every team running Claude, Cursor, or Copilot in production. The 2025 DORA report quantified it: teams with high AI adoption merged 98% more pull requests. Individual output jumped 21%.

And yet organizational delivery metrics stayed flat.

The DORA researchers call this the "AI Productivity Paradox." More code ships, but the product doesn't move faster. The gains from generation get eaten by a verification tax — more time reviewing, more time debugging integration failures, more time untangling the plausible-looking code that turned out to be subtly wrong.

The bottleneck has shifted. When building is cheap, the quality of your plan becomes the binding constraint. Not because AI is bad at coding — it's often quite good — but because it's bad at catching its own misalignments. A human developer who writes a service that doesn't quite match the API contract will often feel the friction before they commit. An AI will produce something that compiles, passes its unit tests, and looks reasonable in review. You find out it's wrong when you wire everything together.

The way you define and scope work has to change to account for this.

The good ticket (always true, now mandatory)

A lesson from early in my career, learned by watching good PMs: a well-defined ticket has a visible, measurable outcome. If the person doing the work can't observe whether they succeeded, the ticket is poorly scoped.

This was always true. But human developers compensated. They had institutional knowledge, hallway conversations, a sense of how the system felt when something was off. They could absorb ambiguity because they carried context the ticket didn't.

AI carries no context the ticket doesn't give it. Every ambiguity in the spec becomes a decision the model makes silently, with high confidence, and no signal that it guessed.

Why layer-by-layer decomposition breaks down

The instinct most teams follow: one ticket for the data layer, one for the service layer, one for the API, one for the UI. Each layer gets built and reviewed in isolation. It's clean. It's how most teams have always broken down work.

The problem: nothing is testable until everything is assembled.

We learned this the hard way building a git integration feature. The work was decomposed by layer — a repository module, a service, an MCP tool, a UI component. Each piece looked correct in isolation. Each one passed its unit tests. We reviewed them individually and trusted the seams because there was no way to validate the full path without wiring them together. When we finally did, the integration didn't work. Every layer was internally correct. The assumptions between them had drifted in ways that were invisible until the whole thing ran end-to-end.

AI amplifies this failure mode. It produces code that is internally consistent and externally plausible. The longer the gap between "code written" and "code verified against real behavior," the more expensive the fix. With a human developer writing one layer per day, the gap might be a week. With AI writing one layer per hour, you can accumulate a week's worth of integration debt before lunch.

Vertical slices instead

Each ticket should be the narrowest possible end-to-end slice of functionality. Not "build the repository layer" but "implement the single-file commit path from UI click to git push."

The litmus test: can you demo or manually test this ticket's output in isolation? If the only proof of completion is "the unit tests pass," it's scoped wrong. Unit tests verify assumptions. Running the actual path verifies reality.

Sometimes this produces a bigger PR than you'd normally like. A vertical slice that touches the database, the service, and the API in one ticket feels messy by the standards of layer-by-layer decomposition. But a large, testable PR beats three small PRs that you can only trust in combination. If one of them carries a wrong assumption, you can't tell which until you assemble all three.

The integration test as a contract

Before writing any implementation tickets, write a decoupled integration test that defines the expected behavior of the feature. Not a unit test — an integration test that exercises the real path and asserts on the observable outcome.

The test starts red. Each vertical slice should bring it closer to green. You get an objective measure of progress that doesn't depend on reading code and deciding whether it "looks right."

Writing it first is a planning decision, not an implementation detail: it forces you to think about observable outcomes before any code exists. What does the user see? What does the API return? What state changes in the database? These questions are easy to punt on during planning. They become expensive when three layers of AI-generated code are already built on different answers.
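As a minimal sketch of what such a contract test might look like for the commit feature described above (all names here are hypothetical, and the git layer is replaced with an in-memory fake so the example is self-contained; a real version would drive the actual API and a real repository):

```python
# Hypothetical end-to-end contract for a "commit a single file" feature.
# The test asserts on the observable outcome of the whole path, not on
# the internals of any single layer.

class FakeRepo:
    """Stands in for the git layer: records what gets pushed."""
    def __init__(self):
        self.pushed = []

    def commit_and_push(self, path, content, message):
        self.pushed.append({"path": path, "content": content, "message": message})


def commit_single_file(repo, path, content, message):
    """The vertical slice under test: UI click -> service -> git push."""
    if not path:
        raise ValueError("path is required")
    repo.commit_and_push(path, content, message)
    return {"status": "pushed", "path": path}


def test_single_file_commit_path():
    repo = FakeRepo()
    result = commit_single_file(repo, "README.md", "hello", "docs: add readme")
    # Observable outcomes: what the caller sees, and what reached the remote.
    assert result["status"] == "pushed"
    assert repo.pushed[0]["path"] == "README.md"
    assert repo.pushed[0]["message"] == "docs: add readme"


test_single_file_commit_path()
```

On day one, `commit_single_file` doesn't exist and the test is red. Each vertical slice replaces a fake with the real thing and moves it toward green.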

It also opens the door to automation. Addy Osmani documented this in his LLM coding workflow — feeding test failures back to the AI in a loop, letting it iterate against an objective measure rather than relying on the developer to read every line of generated code and decide if it's right. The test becomes a feedback signal the AI can actually use.
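The shape of that loop can be sketched in a few lines. This is a structural sketch, not a real agent integration: `ask_ai_for_fix` is a placeholder for however you hand the failure output to your coding tool, and the test command shown is illustrative.

```python
import subprocess
import sys


def run_tests(command):
    """Run the test command; return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def ask_ai_for_fix(failure_output):
    """Placeholder for the real step: hand the failure text to the coding
    agent and apply the patch it proposes."""
    pass


def fix_loop(test_command, max_iterations=5):
    """Iterate until the contract test is green or the budget runs out."""
    for _ in range(max_iterations):
        green, output = run_tests(test_command)
        if green:
            return True   # objective success: the contract test passes
        ask_ai_for_fix(output)  # the failure output is the feedback signal
    return False


# A real run would use something like:
#   fix_loop(["pytest", "tests/integration", "-x"])
fix_loop([sys.executable, "-c", "pass"])  # stand-in for a green suite
```

The point of the bounded iteration count is that the loop stops and escalates to a human rather than thrashing; the test's pass/fail verdict, not a human reading diffs, decides when it's done.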

Tickets define intent. A standards file defines how.

There's a separation that most teams don't make explicit until they start working with AI: the difference between what a ticket should specify and what should live somewhere else entirely.

Tickets should specify what and why. Build the commit path. Support single-file commits only — multi-file comes later. Edge case: handle the case where the file was deleted between the UI click and the commit. AI can even help draft these edge cases during ticket creation, though the developer needs to review and fill gaps.
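Put together as a ticket, that might look something like this (contents illustrative, drawn from the example above):

```markdown
## Implement single-file commit path (UI click to git push)

**What:** A user selects one file and commits it; the commit is pushed
to the remote and visible in the repository history.

**Why:** First vertical slice of the git integration; proves the full
path before multi-file support is attempted.

**Out of scope:** Multi-file commits (follow-up ticket).

**Edge cases:**
- File deleted between the UI click and the commit: show an error, do not push.

**Done when:** The end-to-end contract test for this path is green and
the flow can be demoed from the UI.
```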

Implementation patterns — naming conventions, error handling strategies, how you structure services, which libraries you use — belong in a persistent file the AI reads on every task. Not repeated across tickets. Not assumed from context. Codified once, applied everywhere.

For Claude Code, this file is called CLAUDE.md. Other tools have similar mechanisms. The point isn't the file format — it's the mental model. Teach the tool your standards once. Then write tickets that focus purely on intent. The AI combines the two. If your standards are clear and your tickets are well-scoped, the gap between "what you wanted" and "what the AI built" shrinks considerably.
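A sketch of what such a standards file might contain (the specific rules here are invented for illustration; the point is the shape, not the choices):

```markdown
# CLAUDE.md: project standards (read on every task)

## Code style
- TypeScript strict mode; no `any`.
- Services live under `src/services/`, one public method per use case.

## Error handling
- Never swallow errors; wrap external failures in our `AppError` type
  with the original error as the cause.

## Testing
- Every vertical slice ships with an integration test that exercises
  the real path and asserts on observable outcomes.

## Libraries
- Validation: zod. Do not add new dependencies without a ticket.
```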

Addy Osmani put it well: "The AI is not the bottleneck — your specification is." That's been our experience too. The teams struggling with AI-generated code quality almost always have a planning problem, not a model problem.

What actually changed

Planning for AI-assisted development isn't about planning less. It's about planning for a builder that's fast, capable, and confident — but has no intuition about whether the pieces fit together.

Human developers have always masked planning gaps with experience and judgment. AI doesn't mask anything. It executes what you describe, with impressive speed and zero awareness of what you meant. Every ambiguity in your plan becomes a silent fork in the code, and you won't know which path the AI chose until you test the whole thing.

The discipline is straightforward: scope for observability, test against real behavior early, and keep your standards in one place. None of this is revolutionary. Most of it is just good engineering practice that was always correct but survivable to skip. AI removes the survivability.

Glover Labs builds AI systems that modernize legacy enterprise software. Our planning and verification practices come from shipping production modernization across Fortune 500 codebases. Book a technical demo →