AI-Native Teams: Building for the New Bottleneck

Fifty-five percent of companies that cut headcount to fund AI adoption now say they regret it. That's the finding from Orgvue and Forrester — and the reason is consistent across the survey data. The AI couldn't do what they thought it could. Not because the models were weak, but because the organizational structure around them couldn't absorb the output.

The 2025 DORA report — nearly 5,000 technology professionals surveyed — puts numbers on why. Ninety percent of respondents now use AI at work. Over 80% report feeling more productive. AI adoption does improve delivery throughput. It also increases delivery instability. The central finding: AI acts as an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones. The greatest returns come not from the tools but from "a strategic focus on the underlying organizational system: the quality of the internal platform, the clarity of workflows, and the alignment of teams."

The expensive work has shifted. Writing code is cheap now. Specifying what to build, reviewing what got built, and maintaining coherence as features accumulate — these are the bottleneck. They're human skills, and they haven't gotten cheaper the way implementation has.

The pod

Small cross-functional teams organized around product areas rather than technical layers. Two to three engineers, a designer who can prototype with AI, a PM who can evaluate output. Each person operates as a generalist across the stack during prototyping and contributes specialist depth during hardening.

None of this is a novel organizational idea. Spotify squads, two-pizza teams, stream-aligned teams in Team Topologies — the shape keeps getting reinvented because it keeps being correct. What AI changes is the economics that make it viable at smaller scale. A three-person team couldn't realistically span backend, frontend, infrastructure, and design without AI filling the gaps between each person's depth. Now it can. The arguments against small cross-functional teams have historically been about coverage — you can't put a backend engineer, a frontend engineer, a designer, a DevOps person, and a QA engineer on every pod without making the pods too large to coordinate. AI collapses that coverage problem by making each person effective across layers they don't specialize in.

The team shape isn't new. The minimum viable team size to make it work is.

Role generalism vs. domain generalism

Kent Beck has noted that AI makes him "90% as good" at things he was "0% good at." That observation is precise and important — and the missing 10% is where bugs live.

METR ran a controlled trial with experienced open-source developers on codebases they'd maintained for five or more years. The result: AI tools made them 19% slower on average. The developers themselves believed they were 20% faster. That perception gap is the most important finding in recent AI productivity research.

What it reveals is a distinction that matters for team structure: AI makes role generalism nearly free but fails silently at domain generalism. A backend engineer can write good frontend code with AI on any codebase, because the knowledge required — React patterns, CSS layout, state management — is general, and AI has strong coverage of it. Working on an unfamiliar subsystem within the same codebase is different. AI produces code that's internally consistent but misaligned with the implicit knowledge embedded in the existing system. The middleware is structured that way for a reason. The data model encodes assumptions that aren't documented. The patterns evolved through production incidents that the git history doesn't fully explain.

The pod structure works partly because the same people own a product area over time, accumulating the domain knowledge that makes AI effective rather than dangerous. Rotation through unfamiliar territory is where the METR slowdown lives.

Two modes, one team

The same pod might maintain an existing system on Monday and prototype a new feature on Wednesday. The team doesn't change between these, but how it uses AI does.

Prototyping is generalist and AI-led. Whoever picks up the ticket builds the whole thing end-to-end — spawning a git worktree, handing the agent a spec, iterating until the functional output matches expectations. The cost of getting it wrong is low because the prototype is disposable. If the approach doesn't work, you throw the worktree away after investing hours instead of weeks. Having an agent build a rough version and seeing where it struggles is more informative than speculating about complexity in a planning meeting.

Productionizing is specialist and human-led. The engineer with security instincts reviews auth flows. The person with data modeling depth reviews the schema. The architect evaluates whether the new code fits the broader system. The prototype becomes a starting point for tighter, production-ready implementation.

The transition between modes is where things most commonly go wrong. The temptation is to ship prototype code as-is under deadline pressure — treating "it works" as "it's done." Prototype code that reaches production without a hardening pass accumulates quietly: auth flows that miss edge cases, schemas that won't survive the next feature, error handling that swallows failures. The pod needs a shared understanding that prototype-to-production is a discrete step, not a gradual polish.
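
The "swallows failures" pattern is concrete enough to sketch. A minimal TypeScript illustration, with hypothetical function names: the prototype version returns a default on any failure, which is fine for a demo and quietly dangerous in production.

```typescript
// Illustrative only -- hypothetical names, not code from any real project.

// Prototype version: bad input is swallowed and becomes indistinguishable
// from an empty config. The demo works; the failure disappears.
function parseConfigPrototype(raw: string): Record<string, unknown> {
  try {
    return JSON.parse(raw);
  } catch {
    return {}; // failure vanishes here
  }
}

// Hardened version: fails loudly with context, so bad input surfaces in
// review or CI instead of as mysterious behavior in production.
function parseConfigHardened(raw: string): Record<string, unknown> {
  try {
    return JSON.parse(raw);
  } catch (e) {
    throw new Error(`config is not valid JSON: ${(e as Error).message}`);
  }
}
```

Both versions pass the happy-path test. Only a deliberate hardening pass asks what the catch block should actually do.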

The specialization taxonomy

During prototyping, everyone operates as a generalist. During productionizing, the team needs depth in specific areas where AI output is most likely to be subtly wrong and where mistakes are expensive to find later. These specializations should be distributed across the people on the pod, not concentrated in one person.

Security. Auth flows, data access patterns, secrets handling, input validation. AI generates code that works but routinely misses authorization edge cases or introduces injection vectors that look correct in review.

Data modeling and persistence. Schema design, migration safety, query performance, how the data model holds up as features accumulate. AI optimizes for the current feature without considering how the schema evolves over time.

System architecture. Service boundaries, API contracts, dependency management, keeping the codebase coherent as the rate of AI-driven change increases. Without this, the system accumulates structural problems that don't show up in any individual PR but degrade the whole thing — inconsistent patterns, duplicated logic that diverges, data models that conflict.

Observability. Logging, monitoring, alerting, on-call readiness. If nobody on the team thinks about what happens when the feature breaks at 2am, it shows in production.

Accessibility. Screen reader compatibility, keyboard navigation, ARIA patterns, contrast. AI generates inaccessible UI by default unless explicitly constrained.

Verification and test design. Test architecture, coverage strategy, dependency-aware testing, integration test design. The quality of the team's testing infrastructure determines how much the team can trust AI-generated output.

Nobody covers all of these, and most people have depth in more than one. The goal is that across the pod, the critical concerns are represented — and the team knows where the gaps are and can make conscious decisions about when to accept risk and when to bring in outside review. These specializations also don't require ten years of experience. An engineer who has spent a year focused on writing thorough integration tests brings real verification depth, even if they're early in their career. The specialization is the habit of attention — consistently thinking about what can go wrong in a specific domain — not the number of years spent doing it.
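
The security entry is the easiest to make concrete. A minimal TypeScript sketch, with a hypothetical in-memory store standing in for a real database: both handlers look plausible in review, and only one checks authorization.

```typescript
// Hypothetical example of the authorization gap described above.
type Doc = { id: string; ownerId: string; body: string };

// In-memory stand-in for a real data store.
const docs = new Map<string, Doc>([
  ["d1", { id: "d1", ownerId: "alice", body: "alice's notes" }],
]);

// AI-generated version: a userId is present, so it looks authenticated,
// but ownership is never checked. Every demo passes, because demos
// request your own documents.
function getDocUnsafe(_userId: string, docId: string): Doc | undefined {
  return docs.get(docId);
}

// Hardened version: the ownership check a security-focused reviewer
// would insist on before this reaches production.
function getDocSafe(userId: string, docId: string): Doc | undefined {
  const doc = docs.get(docId);
  if (!doc || doc.ownerId !== userId) return undefined;
  return doc;
}
```

The unsafe version only fails when someone requests a document they don't own — exactly the case no happy-path test exercises, which is why it needs a human with the security habit of attention.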

What this looked like in practice

We built the Archaeologist — a legacy codebase analysis and modernization platform — with three engineers over three months. The output: 289 commits, 908 files, roughly 133,000 lines of code across a Kotlin/Spring backend, a React/TypeScript frontend, a Node.js code generation agent, cloud infrastructure (AWS and Azure), and CI/CD pipelines.

Each engineer had clear depth but routinely worked across the full stack. Engineer A owned pipeline architecture, spec generation, and verification infrastructure — but also shipped frontend features and wrote COBOL parsers. Engineer B owned the frontend, design system, and CI/CD — but also wrote backend API endpoints and cloud storage integrations. Engineer C owned the code generation engine, GitHub integration, and cloud infrastructure — but also built frontend dashboards and handled webhook infrastructure end-to-end.

The generalist/specialist oscillation emerged naturally. During prototyping, whoever picked up a ticket built the whole thing with AI handling unfamiliar layers. During hardening, the person with relevant depth reviewed that work. When Engineer C built the GitHub webhook system, they owned the Kotlin endpoint, the TypeScript agent that consumed it, and the React UI that surfaced the status — no handoff, no integration meeting, no "my API contract doesn't match your frontend expectations" bugs.

Where the gaps showed. Security review was inconsistent — we caught auth issues when they were obvious but likely missed subtler data access pattern problems because nobody carried security as their primary lens. Accessibility wasn't addressed systematically. Observability was minimal until production issues forced it. The other failure mode was domain context: as the system grew, tickets in unfamiliar subsystems produced code that worked but didn't fit surrounding patterns. Engineer B could write backend pipeline code with AI, but the result sometimes conflicted with conventions Engineer A had established. The METR finding in miniature — AI makes you effective across roles but not across domains.

What to hire for

AI has made generalist capability the baseline. A backend engineer who can produce reasonable frontend code with AI assistance is table stakes. What differentiates is the specialization someone brings on top of that baseline — and whether it fills a gap the team currently has.

Framework-specific knowledge matters less than it used to because AI flattens that learning curve. The judgment to evaluate whether AI output is correct in a specific domain is harder to develop and harder to replace.

The measurement trap

Measuring AI-native teams is unsolved, and the approaches that seem obvious tend to fail.

Meta built an internal leaderboard called "Claudeonomics" that ranked employees by AI token consumption, awarding titles like "Token Legend" and "Session Immortal." Employees gamed it by running agents on idle tasks to climb the rankings — 60 trillion tokens burned in 30 days. Shopify added AI tool adoption to performance reviews, which incentivizes generating code rather than shipping quality software. Jellyfish's data shows PR throughput increased 113% with AI adoption, but DORA simultaneously found that AI increases delivery instability. Volume metrics without quality metrics create perverse incentives.

The direction that seems more promising is measuring decisions rather than output. Specification quality — how often an engineer's tickets produce good AI output without multiple rounds of correction — is a leading indicator that the upstream planning is solid. Review quality matters more than it used to because the volume of code going through review has increased dramatically. DORA found code review is the number one use of AI among respondents, at 68%. And outcome metrics — cycle time, change failure rate, time to recovery — remain the most reliable foundation because they measure what matters to the business rather than what's easy to count.
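
Outcome metrics also have the advantage of being cheap to compute. A sketch of one of them, change failure rate, over a hypothetical deployment-record shape — adapt the fields to whatever your pipeline actually emits.

```typescript
// Hypothetical record shape: one entry per production deployment, with
// `failed` meaning the change needed remediation (rollback, hotfix, incident).
type Deploy = { id: string; failed: boolean };

// Change failure rate: the fraction of deployments that caused a failure.
// Tracked alongside throughput, it catches the instability that raw
// PR-volume metrics hide.
function changeFailureRate(deploys: Deploy[]): number {
  if (deploys.length === 0) return 0;
  const failures = deploys.filter((d) => d.failed).length;
  return failures / deploys.length;
}
```

A team whose PR count doubles while its change failure rate climbs is amplifying dysfunction, not productivity — which is the DORA finding restated as arithmetic.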

The honest position is that the industry is still experimenting. What does seem clear: measuring AI usage volume sends the wrong signal.

Glover Labs builds AI systems that modernize legacy enterprise software. Our team structure and engineering practices are built for the new bottleneck — specification, verification, and architectural coherence at AI speed. Book a technical demo →