Are We Using AI Efficiently in Software Development? The Most Common Measurement Mistakes at a Glance
Martin Zoeller
Nobody likes waiting. Nobody likes missed deadlines. Nobody likes frustration on the team. And nobody likes working inefficiently. Ever since Claude Code began bringing agentic software engineering into the mainstream in late 2025, the demands for effective AI usage in engineering teams have grown louder. In my circles, the topic comes up every single day.
At the same time (at least as I perceive it), attempts to measure the effectiveness of AI agents at software companies are on the rise, and they rely on methods that are not only demonstrably unsuitable but actively work against the real goal: more efficient development processes.
This article gives you a focused overview of the three stumbling blocks I run into most often, along with a preview of the solution that has proven itself again and again.
1. AI Leaderboards Don’t Measure AI Effectiveness
In April 2026, Meta (formerly Facebook) made headlines with its “Claudeonomics” leaderboard, which ranked more than 85,000 employees by their token consumption and handed out titles like “Token Legend” or “Session Immortal” to the top performers. Put simply, the result was that engineers kept internal agents running around the clock, burned through a massive amount of tokens, and produced no notable results beyond a spot in the leaderboard’s top 250. A short time later, the leaderboard was labeled a “fun project” and shut down.
A month later, it emerged that Uber had already used up its AI budget for all of 2026. At the same time, the COO admitted he couldn’t establish any connection between the increased spending and the features actually delivered.
Why Is That?
Token consumption measures AI adoption: it shows how intensively an engineer uses AI agents in their daily work—not how effectively. (And, accordingly, how much AI usage costs per engineer.)
It also reveals which engineers are refusing to use AI agents. This is where a conversation can be illuminating: What factors are blocking you from using AI? Is the output not good enough? Are the agents too slow? Do you never even reach the point where you can implement something? (Want to know how to have such a conversation productively? I’ll provide a guide in a future blog post.)
The most important fact to understand about these measurements: an engineer can burn through several million tokens in a single day without creating any value whatsoever for your company. Agentic software engineering is no cure for “busy work” (and can even invite it—that’s the “vibe” in “vibe coding”).
2. Lines of Code Remains the Classic Among Bad Metrics
I still remember a time when the works council of a mid-sized company was presented with the mother of all metrics: How many lines of code were added last quarter? In the age of AI agents, this classic is enjoying a genuine revival: How many lines of code did an engineer generate with the help of Claude Code or Codex?
More lines of code means more new features, right? Probably yes, but “more features” does not indicate that the work was effective, nor that the software does what customers actually need.
Two Examples:
- When the backend team shuts down an older service that’s barely used anymore, it has one fewer responsibility: the service no longer needs to be maintained, and the occasional feature requests from the two customers who used it fall silent. At the same time, this step completely destroys the lines-of-code metric, since the service probably had tens of thousands to hundreds of thousands of lines of code that vanish all at once.
- A frontend team migrates its tech stack from an older dependency to a new one. The new dependency is faster, more secure, easier to integrate—and can be configured and used with 20% of the lines of code. It makes life easier for the engineers on the frontend team, and new features come together a bit more smoothly. At the same time, the project “loses” maybe 8,000 lines of code (the lines that had wired up the old dependency), and the metric drops accordingly.
In both cases, engineers are relieved of work; after these measures, the gears are a bit better oiled. If you now measure effectiveness by lines of code, both moves were “bad”—and it’s precisely this contradiction that makes lines of code unsuitable for measuring speed or effectiveness.
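To make the contradiction concrete, here is a minimal sketch of a naive lines-of-code report applied to the two examples above. All numbers are illustrative assumptions, not measurements: 50,000 lines stands in for the retired service, and the migration assumes 10,000 lines of old wiring replaced by 2,000 lines (20%), which gives the loss of roughly 8,000 lines. The names `Change` and `net_lines` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One piece of work, described only by its effect on the line count."""
    description: str
    lines_added: int
    lines_removed: int


def net_lines(change: Change) -> int:
    """Net change in lines of code: the number a naive LOC report would show."""
    return change.lines_added - change.lines_removed


# Illustrative numbers only (assumptions, not measurements).
changes = [
    Change("Retire barely used backend service", lines_added=0, lines_removed=50_000),
    Change("Migrate frontend to new dependency", lines_added=2_000, lines_removed=10_000),
]

for change in changes:
    print(f"{change.description}: {net_lines(change):+,} lines")

total = sum(net_lines(c) for c in changes)
print(f"Total: {total:+,} lines")
```

Both entries come out negative even though both changes reduced the team's workload, so a report built on this number would flag the quarter's most useful cleanup work as its worst result.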
As an aside: AI leaderboards and the usage metrics in the dashboards of Claude Code or OpenAI typically measure token consumption, not lines of code. That’s an important distinction, because the two can move in opposite directions: an agent’s long reasoning process consumes more tokens and produces fewer lines of code, while little “reasoning” produces more code of (probably) lower quality. Nonetheless, both metrics (token consumption and lines of code) are unsuitable for measuring the efficiency of a development team that uses AI agents.
3. Goodhart’s Law
What Is Goodhart’s Law?
When a measure becomes a target, it ceases to be a good measure.
What Does That Mean in Practice?
Here’s an example: A team works in sprints and assigns story points to every task. At the end of each sprint, someone (a Scrum Master, for instance) puts together an analysis of the completed story points: “This sprint we completed 42 story points! Last sprint it was only 33. That means we were faster this sprint!” Based on this perception, the team members agree that “story points” are a suitable metric for “velocity.” Naturally, management wants optimal (or increased) velocity, and so “more story points” becomes the goal. Goodhart’s Law predicts that the team will then learn to inflate story points to keep the “velocity metric” high, at the very latest once a “bad sprint” has been explained by fewer completed story points.
Why Is That?
To put it very simply, after a bad sprint the team in the example above has two options:
- Assign more story points per task in order to “get more done” again.
- Analyze the underlying causes of the weaker sprint, form hypotheses about how to fix those causes, try out the solutions, critically examine the results, …
The second option requires enormous discipline and a shared commitment. The first is trivial. It’s human nature to take the path of least resistance. (Important note: The reasons for this are considerably more complex than outlined here, and the example above is just that: an example. For the purposes of this article, however, it’s sufficient.)
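How trivial the first option (assigning more story points per task) really is can be shown in a few lines. The following is a minimal sketch with made-up task names and estimates; `velocity` is a hypothetical helper that does what a typical sprint report does, namely sum the story points of completed tasks.

```python
def velocity(estimates: dict[str, int], completed: list[str]) -> int:
    """Sum of story points for the completed tasks: the usual 'velocity' number."""
    return sum(estimates[task] for task in completed)


# The same four hypothetical tasks are completed in both scenarios.
completed_tasks = ["login-form", "export-csv", "fix-timeout", "update-docs"]

# Scenario A: the team's original estimates (illustrative values).
honest_estimates = {
    "login-form": 5,
    "export-csv": 8,
    "fix-timeout": 3,
    "update-docs": 2,
}

# Scenario B: every task simply gets two extra points.
inflated_estimates = {task: points + 2 for task, points in honest_estimates.items()}

honest_velocity = velocity(honest_estimates, completed_tasks)
inflated_velocity = velocity(inflated_estimates, completed_tasks)

print(f"Tasks completed:   {len(completed_tasks)} in both scenarios")
print(f"Honest velocity:   {honest_velocity} story points")
print(f"Inflated velocity: {inflated_velocity} story points")
```

The amount of work delivered is identical in both scenarios. Only the reported number changes, and nothing in the number itself reveals that.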
How Does Goodhart’s Law Affect Our Goal? How Do We Keep Our Measurements from Losing Their Significance the Moment We Declare Them the Goal?
Goodhart’s Law gives us two very important insights:
- If token consumption or lines of code were ever good metrics (they weren’t), they stop being good ones the moment you declare them the goal, directly or indirectly. A CEO who is irritated by an engineer’s low token consumption because he wants more speed in feature development can, at best, get the engineer to consume more tokens in the future, but not to develop features any faster.
- As soon as a team finds a metric that actually measures its efficiency, that metric must not be declared the goal, neither directly nor indirectly.
How Do We Keep a Meaningful Metric from Falling Victim to Goodhart’s Law?
The following two measures reduce the likelihood that a team will, consciously or unconsciously, optimize a metric instead of the outcome it measures:
- Don’t turn it into a target: A metric stays honest as long as it’s a team-internal diagnostic tool. If an executive knows the numbers and knows what they mean, they’ll ask about the numbers. With that, the metric becomes a target within the team, directly or indirectly, and suddenly we’re back to the works council being presented with “lines of code” or velocity in the quarterly report. And when that happens, even scrum.org offers 13 ways to manipulate that number (which, by the way, I find quietly amusing).
- Find more than one good metric: With a single number (and thus a single metric), the reading is simple: if it falls, that’s probably bad; if it rises, that’s probably good. That’s the velocity example from above, and it is exactly what lets us manipulate the number so the output looks better than it is. But what if I have two, three, or four numbers that interact with one another? Asked in the abstract: What does it mean when values 2 and 3 rise, value 4 stays the same, and value 1 falls? Is that good or bad?
Once we’ve found several metrics by which to assess efficiency, we lose the incentive to optimize any single one of them. Instead, a more comprehensive overall picture emerges—one we have to keep an eye on.
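The abstract question about four interacting values can be sketched as follows. This is only an illustration under stated assumptions: the metrics are deliberately left unnamed ("value 1" to "value 4"), the numbers are invented, and `direction` is a hypothetical helper. The sketch reports a trend per value and intentionally computes no combined score.

```python
def direction(previous: float, current: float) -> str:
    """Describe how a single value moved between two measurement periods."""
    if current > previous:
        return "rises"
    if current < previous:
        return "falls"
    return "stays the same"


# Invented numbers for two consecutive periods (assumptions, not real data).
previous = {"value 1": 10.0, "value 2": 4.0, "value 3": 7.0, "value 4": 3.0}
current = {"value 1": 8.0, "value 2": 5.0, "value 3": 9.0, "value 4": 3.0}

trend = {name: direction(previous[name], current[name]) for name in previous}

for name, movement in trend.items():
    print(f"{name}: {movement}")
```

There is no single number here that could be pushed up to make the result look good. Whether the combination is good or bad has to be worked out by the team from what the values mean together.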
So: do we just measure lines of code and token consumption? No, but nice try.
In Part 2 of this series, we’ll look at the metrics your team can truly use to measure how effectively it works. We’ll learn what each of the values means, how you can measure it, and how you can ultimately answer the question: Does AI really make us faster in development, and is the investment worth it?
Don’t want to wait for the next article and would rather get direct help implementing the right metrics in your team? Then let’s have a no-obligation conversation.