Pointing Claude Code at Your COBOL Will Probably Fail (And What Has to Happen First)

The Glover Team
In February, Anthropic published a "playbook" detailing how Claude Code could be used for COBOL modernization. Within days, IBM lost 13% of its market cap in a single session, roughly $30 billion, its worst day since 2000.
The market's reaction was directionally right. AI will reshape how enterprises modernize legacy code. But it was also premature, and for a reason that matters more than the stock price: Claude Code is a powerful execution engine. It is not, on its own, a modernization strategy. The distinction sounds pedantic until you've watched an agent hallucinate its way through a 37-million-line codebase, stumbling as it tries to make sense of a system it can only see in fragments.
We know, because we've done it.
Luckily The Copilot Wasn't On A Plane
Before Glover Labs existed, our founding team was inside one of the largest financial technology companies in the world, running core banking and payments infrastructure for institutions on every continent. The codebase was 37 million lines of COBOL. On day one, we pointed GitHub Copilot and Cursor at it.
The tools didn't just underperform. They failed completely. Not because the models were bad — they were impressive on isolated snippets. The problem was that no model, no matter how capable, could infer from raw code alone what the system actually did. The business logic was spread across thousands of modules, documented in the heads of engineers who'd been there for decades, encoded in operational workarounds that had never been written down. The code was an artifact of the system. It wasn't the system.
That experience shaped everything we've built since. And it's the same experience that every enterprise team pointing Claude Code at their legacy estate is about to have — or is already having and doesn't yet realize.
The Agentic Death Spiral
There's a pattern emerging in enterprise AI deployments that we've heard called the agentic death spiral. It looks like this.
You point a capable agent at a legacy codebase. The agent ingests the code, builds a representation of it, and begins generating outputs — migration specs, refactored modules, test suites, documentation. The outputs look good. They're syntactically correct. They pass surface-level review. Leadership sees fast progress and greenlights the next phase.
Then reality catches up. The agent mapped dependencies by reading import statements, but the actual execution flow routes through a batch scheduler that lives outside the codebase entirely. The agent translated a business rule faithfully, but the rule had been superseded by an operational workaround three years ago — one that exists only in a runbook that nobody digitized. The agent decomposed a monolith into services, but one of the boundaries it chose splits a transaction that must be atomic under the bank's regulatory framework.
Each error compounds the next. The agent doesn't know what it doesn't know, so it builds confidently on top of flawed assumptions. By the time a human catches the problem, the agent has generated thousands of lines of code downstream of the original mistake. Rolling back is expensive. Not rolling back is worse.
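The dependency-mapping failure above can be made concrete. The sketch below is purely illustrative: the module names, the scheduler entry, and the graph shape are invented, not drawn from any real system. It shows why a dependency graph built only from source-level references understates real coupling when an external batch scheduler sequences modules that never reference each other in code.

```python
# Hypothetical sketch: why source-only dependency mapping understates coupling.
# All module names and dependencies here are invented for illustration.

# What an agent sees by scanning source: explicit CALL/COPY relationships.
static_deps = {
    "POSTING": {"LEDGER"},
    "LEDGER": set(),
    "SETTLEMENT": set(),  # no source-level link to POSTING at all
}

# What actually couples the modules: a batch scheduler that runs SETTLEMENT
# only after POSTING completes. It lives outside the codebase, so no amount
# of source scanning will surface it.
scheduler_deps = {
    "SETTLEMENT": {"POSTING"},
}

def effective_deps(module):
    """Union of source-visible and operational (scheduler) dependencies."""
    return static_deps.get(module, set()) | scheduler_deps.get(module, set())

# The agent's view says SETTLEMENT is free-standing and safe to migrate alone.
# The operational view says it cannot be separated from POSTING.
print(static_deps["SETTLEMENT"])     # prints: set()
print(effective_deps("SETTLEMENT"))  # prints: {'POSTING'}
```

An agent working from `static_deps` alone would confidently decompose `SETTLEMENT` as an independent service, which is exactly the flawed assumption everything downstream then compounds.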
Google's 2025 DORA report put hard numbers on this: a 90% increase in AI adoption correlated with a 9% climb in bug rates, a 91% increase in code review time, and a 154% increase in pull request size. Those numbers are from general codebases. For legacy systems — where context is sparser, dependencies are more tangled, and the gap between what's written and what's real is widest — the amplification effect is worse.
As Thoughtworks noted in their Claude Code reality check: "The problem with large COBOL systems is rarely that the code is unreadable, but rather an issue of scale and cognitive load. Treating this as if we're deploying the tool and letting it run with a prompt is a deeply naive view of execution."
The Bottleneck Was Never Capability
The IBM stock drop illuminated one thing and obscured another. The bottleneck in legacy modernization was never model capability. Claude's ability to read, analyze, and translate COBOL is genuine. Code Metal just raised $125 million at a $1.25 billion valuation on formally verified AI code translation. The market agrees: agents can do real modernization work.
Capability without context, though, is just fast, confident failure.
The bottleneck is the input. An agent pointed at raw COBOL has no structured representation of the system's architecture, its business rules, its data flows, its operational dependencies, its compliance constraints. It's inferring all of that from source code alone — like trying to understand a city by reading its zoning ordinances without ever walking its streets.
Anyone who's run a real modernization program knows this: the first 60-70% of the effort goes into understanding the existing system. Not changing it. Understanding it. Mapping the dependencies. Sitting with the engineer who knows why that batch job runs at 2 AM on the third Thursday of each month. Documenting the business rules that were encoded in JCL forty years ago and never revisited.
AI agents skip this step. Not because they're incapable of doing the analysis, but because nobody builds the context layer for them to analyze against. They get pointed at code and told to modernize. What they needed was a structured specification of the system — a map — that the code is merely one part of.
Context Is the Product
At Glover Labs, this is the product we build. Not the agent. The context layer the agent needs to operate on.
We call it the Living Spec — a persistent, bidirectional system of record that maps the as-is legacy state to the target end-state. It's built by running AI agents against every source of institutional knowledge — code, yes, but also UIs, documentation, support tickets, operational logs, database schemas, and subject-matter expert knowledge. The agents don't translate code. They build understanding. They construct the context layer that any execution tool — Claude Code included — needs before it can do real modernization work.
The Living Spec isn't documentation. Documentation is static, goes stale, and nobody reads it. The Spec is a live, queryable, versioned structure that updates as new understanding is developed and as modernization decisions are made. When Claude Code generates a migration for a module, it can consult the Spec to know that this module touches a regulated transaction boundary. When it decomposes a service, it can check the Spec to understand that the batch scheduler dependency means certain processes can't be separated.
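To make "consulting the Spec" concrete, here is a minimal sketch of what such a lookup could look like. The schema, field names, and rules below are assumptions invented for illustration; they are not Glover Labs' actual data model. The point is only that the constraints live in a queryable structure the execution agent checks before it generates anything.

```python
# Hypothetical sketch of an agent consulting a spec before generating a
# migration. Schema and field names are invented for illustration only.

LIVING_SPEC = {
    "PAYMENT-POST": {
        "regulated_boundary": True,           # touches an atomic, regulated txn
        "coupled_via_scheduler": ["EOD-SETTLE"],
        "superseded_rules": ["fee-calc-v1"],  # replaced by an operational workaround
    },
}

def migration_constraints(module):
    """Return the constraints an execution agent must respect for a module."""
    entry = LIVING_SPEC.get(module)
    if entry is None:
        return ["unknown module: build understanding before migrating"]
    constraints = []
    if entry["regulated_boundary"]:
        constraints.append("do not split this transaction across services")
    for partner in entry["coupled_via_scheduler"]:
        constraints.append(f"must deploy atomically with {partner}")
    for rule in entry["superseded_rules"]:
        constraints.append(f"do not translate superseded rule {rule}")
    return constraints

for c in migration_constraints("PAYMENT-POST"):
    print("-", c)
```

The design point is the `None` branch: when the Spec has no entry, the agent's answer is "go build understanding," not "proceed confidently," which is the inversion of the death spiral described above.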
Without the Spec, you get the death spiral. With it, you get an agent that actually knows what it's working with.
What the IBM Drop Actually Means
The market read the Anthropic announcement as "AI can now replace IBM's COBOL modernization business." That reading is right on direction and wrong on timeline.
AI agents will do the execution work — code translation, test generation, dependency mapping — at a fraction of the cost and time of human consultants. IBM's highest-margin consulting work is under threat. Code Metal's valuation confirms this isn't speculative.
But execution without structured context is the agentic death spiral by another name. IBM's response — that "decades of hardware-software integration cannot be replicated by moving code" — is self-serving, but it's also correct. The institutional knowledge embedded in these systems can't be extracted by reading the code alone. Until someone builds the context layer that captures that knowledge and makes it available to agents, the agents will keep producing confident, fast, and wrong results.
The question for enterprise teams isn't which agent to use. It's who builds the context layer. The model matters far less than what the model knows when it starts working.
That's what we build at Glover Labs. And after watching every AI coding tool on the market fail against a 37-million-line COBOL codebase from the inside, we're not guessing about what's missing.
Book a demo to see how the Living Spec changes the calculus.
