Guide: How Do I Properly Measure My Engineering Team’s Performance in the Age of AI Coding Agents?
Martin Zoeller
In the first part of the series, we looked at the most common pitfalls in measuring the performance of a software development team. To measure performance correctly, it helps to know how not to do it, and above all why: why lines of code is the classic bad metric, for example, or how Goodhart’s Law can render even good metrics useless.
With this background, today we can look at how your team can actually measure performance in a meaningful way—and with as little effort as possible. As a big fan of “Just Do What Works,” today I’ll show you the most effective set of metrics, which have proven themselves as the de-facto industry standard, and I’ll explain why they remain relevant even in the age of Agentic Software Engineering.
Important: “Performance” here combines the speed and reliability of the system. The system is your development team, or the way you build and release software. It is a measurement of neither the productivity nor the effectiveness of the work.
What Are the DORA Metrics?
Google publishes the DORA report once a year. The 2025 edition draws on just under 5,000 survey respondents plus more than 100 hours of qualitative interviews. At its core, it captures five metrics that provide insights into a team’s Throughput and Instability.
The report gives us a vocabulary and a benchmark context that let us talk about delivery performance in an informed way.
Important: DORA deliberately provides no basis for measuring the business value of the software, the product quality, or the well-being of the team.
Why Is DORA the Industry Standard?
DORA is the only measurement basis grounded in over a decade of empirical research involving tens of thousands of teams. The metrics we’ll discuss below are deliberately kept lean, so that no team wanting to start measuring has to rebuild its stack. This means DORA metrics are accessible to every team and thereby enable a conversation about measures for improving a product team’s performance.
The Metrics at a Glance
1. Deployment Frequency (Throughput)
What?
How often code is successfully released to production. The more often, the better.
How Do We Measure It?
The simplest version is a table in which your team logs every release to production with its date. If you release whenever you create a tag in Git, for example, you can count the tags per week.
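As a rough sketch of the table approach, the following Python snippet counts releases per calendar week. The release_log list and the function name deployments_per_week are made up for illustration; in practice the dates would come from your own release table or from your Git tag timestamps.

```python
from collections import Counter
from datetime import date


def deployments_per_week(deploy_dates: list[date]) -> dict[tuple[int, int], int]:
    """Count production deployments per ISO calendar week, keyed by (year, week)."""
    counts: Counter = Counter()
    for day in deploy_dates:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(sorted(counts.items()))


# Hypothetical release log: one date per production release.
release_log = [
    date(2025, 3, 3),
    date(2025, 3, 5),
    date(2025, 3, 6),
    date(2025, 3, 12),
]

weekly = deployments_per_week(release_log)
for (year, week), count in weekly.items():
    print(f"{year}-W{week:02d}: {count} deployment(s)")
```

Note that weeks without any release simply don’t show up in the result; when you look at the trend, they count as zero.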
Benchmark Figures from the Report
22.7% of respondents manage one or more deployments per day, 23.9% only once a month or less often.
Why Are More Frequent Deployments Better?
More frequent deployments mean a smaller batch of changes. These are easier to test and verify, which lowers the risk of errors per release. If an error does occur, the smaller change set makes it quicker to locate and makes the previous state of the software easier to restore.
How Do We Improve Deployment Frequency?
Three examples:
- Keep pull requests small: Smaller pull requests allow for easier review and a leaner QA process, and are therefore ready for release sooner.
- Automate deployments: The less manual effort a deployment requires, the easier and more often you can “press the button.” The initial investment in automation pays off quickly, since it frees up capacity for the actual product development and can often take a blocker off the calendar.
- Use feature flags: Feature flags let you release code that isn’t yet visible to the customer, so that fewer changes can truly block a release.
2. Change Lead Time (Throughput)
What?
Time from commit in Git to successful release to production. The shorter, the better.
How Do We Measure It?
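In the simplest version, you subtract the timestamp of a feature’s first commit from the timestamp of the release to production, or you measure the time between the merge of a pull request and the release. Whichever starting point you choose, apply it consistently so that the values stay comparable over time.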
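The subtraction can be sketched in a few lines of Python. The changes list is invented sample data, and summarizing with the median is my own assumption rather than something the report prescribes; I use it here because a single change that sat around for weeks would otherwise dominate an average.

```python
from datetime import datetime, timedelta
from statistics import median


def change_lead_times(changes: list[tuple[datetime, datetime]]) -> list[timedelta]:
    """Return release time minus start time for every (start, release) pair."""
    return [released - started for started, released in changes]


# Hypothetical sample data: (first commit or merge, release to production).
changes = [
    (datetime(2025, 3, 3, 9, 0), datetime(2025, 3, 3, 15, 0)),   # 6 hours
    (datetime(2025, 3, 4, 10, 0), datetime(2025, 3, 6, 10, 0)),  # 48 hours
    (datetime(2025, 3, 5, 8, 0), datetime(2025, 3, 6, 8, 0)),    # 24 hours
]

lead_times = change_lead_times(changes)
median_lead_time = median(lead_times)
print(f"Median change lead time: {median_lead_time}")
```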
Benchmark Figures from the Report
Only 9.4% of all respondents manage less than an hour. Alarmingly, 43.5% take longer than a week.
Why Is a Low Change Lead Time Better?
The less time that passes between building a change and releasing it, the fresher the engineer’s memory of their work, and the easier it is to fix an error. What’s more, a short interval between development and release fosters a fast learning culture and iterative improvement—two values anchored in the principles of agile software development.
How Do We Improve Change Lead Time?
Three examples:
- Keep review processes lean: As a team, define which comments in a pull request must be addressed immediately and which may be reworked later. Strictly separate changes to the software’s behavior from structural changes, so that the latter can be moved forward with only a minimal review.
- Speed up and streamline pipelines: When every push takes half an hour to build, it causes congestion. Changes move more slowly from development to release, testers wait for the latest version, and pipeline failures on a ticket that was worked on half an hour ago will very likely trigger a context switch.
- Distribute review responsibility: A single senior who has to watch over every change is perhaps the biggest bottleneck in a product development team. Here it pays to ask questions like “Can no one else really do this?”, “Can AI support us with this?”, or “Do we really have to review purely structural changes?”
3. Failed Deployment Recovery Time (Throughput)
What?
Time to restore a working state after a failed release. The shorter, the better.
How Do We Measure It?
Tools like Opsgenie provide these values automatically. If your team doesn’t have incident tooling yet, you can measure the time between the start of the incident and its resolution. What’s important here is that only incidents directly caused by a deployment count.
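If you track incidents by hand, a minimal Python sketch could look like this. The Incident record and its field names are hypothetical and not taken from any incident tool; the caused_by_deployment flag implements the rule that only deployment-related incidents count. The median is again my own choice of summary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Incident:
    started: datetime
    resolved: datetime
    caused_by_deployment: bool  # only these incidents count toward the metric


def recovery_times(incidents: list[Incident]) -> list[timedelta]:
    """Return the duration of every incident that a deployment caused."""
    return [i.resolved - i.started for i in incidents if i.caused_by_deployment]


# Hypothetical incident log.
incidents = [
    Incident(datetime(2025, 3, 3, 10, 0), datetime(2025, 3, 3, 10, 30), True),  # 30 minutes
    Incident(datetime(2025, 3, 7, 14, 0), datetime(2025, 3, 7, 16, 0), True),   # 2 hours
    Incident(datetime(2025, 3, 9, 2, 0), datetime(2025, 3, 9, 8, 0), False),    # not deployment-related
]

times = recovery_times(incidents)
median_recovery = median(times)
print(f"Median failed deployment recovery time: {median_recovery}")
```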
Benchmark Figures from the Report
21.3% of respondents recover within an hour; more than half (56.5%) take more than a day.
Why Is a Short Recovery Time Better?
The faster an incident is resolved, the less economic damage it causes. I know of situations where every hour the system was down caused damage in the six-figure euro range. Handling incidents is something you can practice, and it takes away teams’ fear of deploying (yes, that exists), which is why the DORA team counts this value toward Throughput rather than Instability.
How Do We Improve Failed Deployment Recovery Time?
Three examples:
- Build in rollback mechanisms: When the only way out is to actively fix the error, an engineer has to do cognitively demanding work under time pressure—work that often takes more time than in day-to-day business and can also cause follow-on errors. If you can undo a deployment at the push of a button, that lets your team fix the issue cleanly and under less stress.
- Build in and improve monitoring and alerting: What if we have an incident and no one knows about it? What if the support hotline is running hot and the product owner suddenly thinks: “Wait a minute…” Unexpected errors in production must trigger a notification.
- Document recovery mechanisms: What do I do if the database goes offline? How can I reproduce a customer’s error with minimal effort? Who has the necessary permissions to roll back a deployment? “Sven is on vacation” is a sentence no engineer wants to hear during an incident.
4. Change Fail Rate (Instability)
What?
Share of releases that immediately cause an error in production, making a rollback or a hotfix necessary. The lower, the better.
How Do We Measure It?
Classify every deployment after the fact as either “clean” or “caused incident.” The number of “caused incident” deployments divided by the total number of deployments in a given period gives you the value.
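Expressed as a small Python sketch, with the two labels from above as plain strings and an invented list of ten deployments, the calculation is the number of failed deployments divided by all deployments in the period:

```python
def change_fail_rate(outcomes: list[str]) -> float:
    """Share of deployments labeled 'caused incident' among all deployments."""
    if not outcomes:
        raise ValueError("need at least one deployment to compute a rate")
    failed = sum(1 for outcome in outcomes if outcome == "caused incident")
    return failed / len(outcomes)


# Hypothetical classification of ten deployments in one period.
outcomes = ["clean"] * 8 + ["caused incident"] * 2

rate = change_fail_rate(outcomes)
print(f"Change Fail Rate: {rate:.0%}")
```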
Benchmark Figures from the Report
Only 8.5% achieve a 0-2% failure rate, and 39.5% have a failure rate above 16%.
Why Is a Lower Fail Rate Better?
A high Change Fail Rate means a deployment carries more risk. That amplifies the fear mentioned under metric 3. A low Change Fail Rate points to good QA mechanisms and clean work, and it means less friction from hotfixes and more trust in the product on the customer’s side.
How Do We Improve the Change Fail Rate?
Three examples:
- Maintain and use a staging environment: An environment that is as similar to production as possible can reveal errors “in the real world” that don’t show up in development environments. The important thing is that such an environment doesn’t just exist but is also actively used for testing in order to be effective.
- Smaller releases: Purely statistically, a large release is more likely to contain an error that causes rework. Smaller releases simply contain fewer sources of error and have the same advantages as smaller pull requests.
- Automated tests in the pipeline: The more errors that can be ruled out automatically, the better. Important: the automation and the stability of tests interact with each other, especially in so-called end-to-end tests (E2E). How do I find the right level of test automation? In short: by experimenting.
5. Deployment Rework Rate (Instability, New Since 2024)
What?
Share of releases that are unplanned and necessary to fix a production problem. Not necessarily just hotfixes, but rework in general. The lower, the better.
How Do We Measure It?
As with the previous metric, we classify deployments as planned/feature and unplanned/fix. The ratio of unplanned deployments to all deployments gives you the value. Important: for a deployment to be classified as unplanned/fix, it must be a direct consequence of a preceding deployment.
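One way to make the “direct consequence” rule explicit is to record which earlier deployment a fix repairs. The following Python sketch assumes a hypothetical Deployment record with a fixes field; both names are mine and not part of the DORA definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    deploy_id: str
    # ID of the earlier deployment this one repairs; None means planned/feature.
    fixes: Optional[str] = None


def rework_rate(deployments: list[Deployment]) -> float:
    """Share of unplanned/fix deployments among all deployments."""
    if not deployments:
        raise ValueError("need at least one deployment to compute a rate")
    unplanned = sum(1 for d in deployments if d.fixes is not None)
    return unplanned / len(deployments)


# Hypothetical deployment log: d4 exists only to repair d3.
deployments = [
    Deployment("d1"),
    Deployment("d2"),
    Deployment("d3"),
    Deployment("d4", fixes="d3"),
    Deployment("d5"),
]

rate = rework_rate(deployments)
print(f"Deployment Rework Rate: {rate:.0%}")
```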
Benchmark Figures from the Report
Only 7.3% come in under 2%. 26.1%, that is, more than a quarter of all respondents, report a rate of 8-16%, which means a considerable share of their deployments is unplanned rework.
Why Is a Low Deployment Rework Rate Better?
If you spend a lot of time on rework, you have an unstable system. Because this rework isn’t necessarily urgent and doesn’t happen under stress, it isn’t as visible as the effects of the Change Fail Rate metric, but it reduces real productivity in a subtle way. A lower Deployment Rework Rate means that topics are wrapped up more cleanly and that a lot of time can be invested in the real advancement of the product.
How Do We Improve the Deployment Rework Rate?
Three examples:
- Establish or sharpen a Definition of Done: When is a change complete and ready to be released? Surprisingly often, teams can’t answer this question clearly, which means missing puzzle pieces don’t surface during development but perhaps only at the customer’s end.
- Watch whether AI-generated code drives up the Rework Rate: The metric was newly introduced in 2024, and early figures indicate that code generated by an AI agent drives this value up significantly. So if AI usage on your team is still low, it’s worth starting to measure as early as possible, in order to then find out whether you’re using AI agents effectively.
- Analyze errors instead of just fixing them: “How did this happen?” is a question we don’t often ask ourselves in the hectic day-to-day. We fix symptoms as part of our daily work but rarely ask what the underlying cause was. “Should this have been caught earlier?” and “Were the requirements unclear?” are just two of the interesting questions. What matters is that your team has internalized a culture in which such questions are not understood as assigning blame, but as an impulse for getting better.
How Can I Automate the Measurements?
The good news: in principle, this works very well. The bad news: unlike the significance of the metrics themselves, the concrete recommendation depends on the tool stack you use.
In short:
- If you use GitLab Ultimate for both your issue tracking and your entire development pipeline, GitLab gives you the first four metrics, including a dashboard, out of the box (Analyze → Analytics Dashboards → DORA metrics dashboard).
- If you use GitLab in one of the other tiers, or GitHub, or an additional tool for issue tracking (Jira, Linear, etc.), your best bet is clearly Apache DevLake as an open-source solution. If you want less setup and administration effort, SaaS solutions like CodePulse or DX are available.
- Alternatively, it’s worth looking at the classic Google Four Keys, even though the linked repository has since been archived.
My recommendation: measure the DORA metrics. Either look into automating the measurements right away, or start lean with manual tracking and put the automation setup on your roadmap with a firm, near-term date.
Important: the new metric, Deployment Rework Rate, is unfortunately not yet—or not fully—supported by the various tools (as of June 2026).
How Long Do I Need to Measure, and What Do I Do with the Results?
The DORA metrics only become meaningful when they are collected continuously over a period of three to six months. Shorter or one-off measurements suffer from the usual problem of small samples: a single data point says little on its own and may be an outlier. Seasonality plays a role as well: a coat-check management software is used less in hot summers than in icy winters, and accordingly experiences fewer incidents during the warm season.
The best thing to do with the results is what the DORA team itself does: review them regularly and reassess. Until 2024, the report sorted teams into performance levels ranging from “Elite” to “Low”; since 2025, it has used a cluster analysis instead, and teams find themselves in named groups like “Harmonious High Achievers” or “Legacy Bottleneck.”
What does that mean? The metrics provide a basis for discussion. They provide benchmark values you can compare yourself against. They provide pointers to weaknesses in your own delivery performance and impulses for possible improvements.
For you and your team, that means: you no longer have to guess whether a major change—such as introducing AI agents into your development process—improves or worsens your performance. Gut feeling turns into solid figures. You can talk about those, and the effectiveness of countermeasures, too, can be derived from the same figures a few months later.
What About Goodhart’s Law?
The metrics don’t stand on their own: a high Deployment Frequency is no indication of high performance if the Change Fail Rate is also very high (as a reminder: performance = speed + reliability). It simply makes no sense to declare a single metric the goal. Accordingly, Goodhart’s Law doesn’t apply here.
You could now get really clever and say: “Well, then I’ll just declare the group of DORA metrics the goal—so now Goodhart’s Law applies to the group, right?” In short: no, because you can manipulate a single metric, but if all your DORA values improve over time, you can assume that your team’s performance has actually risen and that it hasn’t managed to game the entire benchmark.
Are the DORA Metrics Still Relevant in the Age of Claude and Codex?
More than ever. Because we can now generate (mostly) working code by talking into a microphone for 60 seconds and then pressing “Enter,” far more code is produced than before. Output rises. With the right system, your team can ensure that the AI agents’ code, too, goes to production in small iterations and small pull requests. Deployments may go up. “We’re getting more done” is the logical gut feeling.
The DORA metrics combine both kinds of measurement: those that can support or refute the “we’re getting more done” feeling (how often do we release?) and those that tell us what effect the new workflows have on stability.
In other words: if your engineering team uses AI agents, you have to measure the DORA metrics. They are the most solid basis for discussion you have.
Are the DORA Metrics the Only Values I Should Measure?
The DORA metrics are your first, very important step toward shedding light on the matter. They give you helpful pointers about your team’s performance. What they don’t give you is an overview of how your team works together and what concrete effects that has.
In Part 3 of the series, we’ll look at how you can extend the DORA metrics with an appropriate framework to gain an overall picture of the performance and of essential aspects like the satisfaction and communication of your engineering team.