Guide: Measuring AI Agent Productivity with the SPACE Framework

June 29, 2026 · 10 min read

Martin Zoeller

In part one of the series, we learned how not to measure the performance of a product development team, so that we can avoid common mistakes when assessing a development process. In part two, we were introduced to the DORA metrics, which let us capture a team’s delivery performance without falling victim to Goodhart’s Law.

With these metrics, we can already assess the results of our work quite well and answer two central questions:

How much throughput do we have on the team—that is, how often do we release changes?
How stable are these releases, and how much rework do we have?

In the age of agentic coding, many teams measure—if they measure at all—only the metrics behind question one and neglect the metrics behind question two.

In part three of the series, we expand the two questions above with a third, to get a far more comprehensive overall picture:

What do the inner workings of the engine that produces these releases look like?

To do this, we draw in part on the SPACE framework.

What Is the SPACE Framework?

The core idea of the SPACE framework is that productivity is multidimensional and cannot be captured by a single metric. It was published in 2021 under the title “The SPACE of Developer Productivity” by a group of authors from, among others, Microsoft, GitHub, and the University of Victoria.

For our purposes, the SPACE framework’s main advantage is that it is complementary to the DORA metrics, not competing with them. By collecting the SPACE metrics, we broaden our overall picture. We can:

collect additional metrics that eliminate blind spots, and
infer the causes of the symptoms we noticed while measuring the DORA metrics.

SPACE is an acronym for the five dimensions the framework covers, which this article refers to as metrics for simplicity:

Satisfaction & Well-being
Performance
Activity
Communication & Collaboration
Efficiency & Flow

Before we look at the metrics in detail, three important notes upfront:

Not every metric is as easy to measure as some of the DORA metrics.
Not every metric is equally relevant for our purposes.
As with DORA, it’s important to collect these metrics at the team level, not at the level of an individual engineer.

S: Satisfaction & Well-being

What Are We Measuring?

How satisfied, engaged, and healthy (i.e., not burned out) are developers when it comes to their day-to-day work, their tools, and the company and team culture?

How Do We Measure It?

Satisfaction can’t be measured directly. The simplest approach is a regular anonymous survey, for example a Net Promoter Score (NPS) question such as “How likely is it that I would recommend my team to others?”, whose answers you collect and evaluate.

Start with a quarterly survey and gather feedback on the frequency. Maybe a survey every six months makes more sense for your team. Round out the quantitative and qualitative survey results with a look at your team’s turnover, and you have a solid impression to build on.
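To make the survey evaluation concrete, here is a minimal sketch of how an NPS value could be computed from the answers. It assumes the common NPS convention (a 0 to 10 scale, where 9 and 10 count as promoters and 0 to 6 as detractors); the function name and the sample answers are made up for illustration.

```python
def net_promoter_score(ratings: list[int]) -> float:
    """Return the NPS (-100 to 100): share of promoters minus share of detractors."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)   # ratings 9-10
    detractors = sum(1 for r in ratings if r <= 6)  # ratings 0-6
    return 100.0 * (promoters - detractors) / len(ratings)


# Made-up answers from one anonymous quarterly survey.
quarterly_ratings = [10, 9, 8, 7, 6, 9, 3, 10]
print(net_promoter_score(quarterly_ratings))  # 25.0
```

A single value like this says little on its own. What matters is how it moves from survey to survey, and what the free-text answers say about the reasons.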

How Important Is It?

High turnover hurts your team’s productivity and ultimately your company. On top of that, we spend a great deal of our time at work, so on a purely human level alone it matters to create an environment that people enjoy being in and that enables them to be the best version of themselves.

Beyond that, Satisfaction & Well-being is probably the biggest blind spot of the DORA metrics, and perhaps even the biggest blind spot of all the developments that have taken place in software development in recent years.

Anyone who puts in the effort here can easily stand out as a good employer.

How Can I Improve Satisfaction & Well-being?

Six examples, depending on your qualitative data:

Reduce the meeting load and context switching.
Distribute on-call load and responsibility fairly.
Define shared goals and priorities.
Reduce micromanagement and empower engineers to manage themselves.
Improve the developer experience and reduce toolchain friction.
Create a safe space for feedback and growth, e.g., through regular 1-on-1s.

P: Performance

What Are We Measuring?

What impact does the code have on quality, reliability, customers, and the company?

How Do We Measure It?

Once again, measurement is only partially possible, or possible via proxies. Indicators of low performance include, for example:

a high Change Fail Rate (see DORA metrics),
a high share of bug tickets per iteration (e.g., sprint),
declining or low customer satisfaction,
low usage of new features.

Here it’s important to define the proxies you can measure without significant effort, and to come up with a simple methodology for how you use these measurements to draw conclusions about performance. Never define just one proxy—better three or four.
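What such a simple methodology could look like, as a rough sketch: collect a few raw counts per iteration and check each proxy against a limit your team has agreed on. The class, the function names, and both thresholds below are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass


@dataclass
class IterationSnapshot:
    """Raw counts for one iteration (e.g., one sprint)."""
    tickets_done: int
    bug_tickets: int
    deployments: int
    failed_deployments: int


def ratio(part: int, total: int) -> float:
    """Return part/total, or 0.0 when there is nothing to divide by."""
    return part / total if total else 0.0


def warning_signs(
    s: IterationSnapshot,
    max_bug_share: float = 0.30,         # hypothetical threshold
    max_change_fail_rate: float = 0.15,  # hypothetical threshold
) -> list[str]:
    """List the proxies that exceed their agreed limit in this iteration."""
    signs = []
    if ratio(s.bug_tickets, s.tickets_done) > max_bug_share:
        signs.append("high bug share")
    if ratio(s.failed_deployments, s.deployments) > max_change_fail_rate:
        signs.append("high change fail rate")
    return signs


sprint = IterationSnapshot(tickets_done=20, bug_tickets=8, deployments=10, failed_deployments=1)
print(warning_signs(sprint))  # ['high bug share']
```

Customer satisfaction and feature usage could be added as further checks in the same way, as soon as you have a reliable source for them.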

How Important Is It?

Because the cost of change feels like it has dropped thanks to agentic coding, the danger of “busy work” has risen sharply: the tools are (still) cheap, and some engineers or teams fall into something close to an AI frenzy. A look at performance lets you draw conclusions about whether the changes are really moving your company forward, or whether they only come about because it’s so easy. Even in 2026, the rule holds: the best code is the code you don’t write (or, well, generate).

In short: very important.

How Can We Improve Performance?

Three examples, depending on your choice of proxies:

Keep releases smaller to reduce the risk of errors.
Conduct user interviews and build in surveys to better capture needs.
Use A/B tests or fake-door tests to gauge demand for features.

A: Activity

What Are We Measuring?

How high is the volume of your engineers’ visible actions (commits, pull requests, completed tickets, and the like)?

How Do We Measure It?

For example, you can track the number of completed tickets or the number of pull requests merged into main. This metric is solely about the volume of work done (which is also what makes it so dangerous).

How Important Is It?

Activity is the metric that has lost the most significance because of agentic coding: anyone who uses AI agents substantially increases the volume of their actions. The stability of the results can stay the same or even decline, and the value of the changes to the customer likewise can’t be discerned from this metric.

Since the metric is easy to collect (via Jira or GitHub), you can add it to the mix. In 2026, it carries meaning, if any at all, only when it suddenly drops sharply.
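If you only want to be alerted to such a sudden drop, a small check is enough. The following sketch compares the latest week with the median of the weeks before it. The function name and the threshold of 50% are assumptions for illustration, not a recommendation.

```python
from statistics import median


def dropped_sharply(weekly_merged_prs: list[int], threshold: float = 0.5) -> bool:
    """True if the latest week is below `threshold` times the median of the earlier weeks."""
    if len(weekly_merged_prs) < 2:
        return False  # not enough history to compare against
    *history, latest = weekly_merged_prs
    return latest < threshold * median(history)


# Made-up weekly counts of pull requests merged into main.
print(dropped_sharply([20, 22, 19, 21, 8]))   # True
print(dropped_sharply([20, 22, 19, 21, 18]))  # False
```

Note that the check deliberately ignores increases: more activity is not treated as a signal in either direction.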

How Can We Increase Activity?

When you integrate AI agents into your team’s development process (which you should), the volume of your engineers’ visible actions increases automatically. That’s enough. As we already learned in part one, declaring this metric a goal is strongly discouraged: at that point, it falls victim to Goodhart’s Law.

C: Communication & Collaboration

What Are We Measuring?

How good is the flow of information within the team, and how well do teams coordinate with respect to knowledge sharing, review culture, and onboarding?

How Do We Measure It?

A lot happens in private channels, so direct measurement is hard. Here, too, there are a few proxies you can use, e.g.:

How long do pull requests stay open?
How many reviewers are there per pull request?
How much time passes before a new team member can merge a change for the first time?

In addition, communication and collaboration can be captured through regular surveys, with items like “I can quickly find the right point of contact.”
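The three pull request proxies above can be derived from data that most code hosting platforms export. Here is a minimal sketch, assuming you already have the timestamps and reviewer names at hand; the `PullRequest` class, the function names, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime
from statistics import mean, median


@dataclass
class PullRequest:
    opened_at: datetime
    merged_at: datetime
    reviewers: list[str]


def median_hours_open(prs: list[PullRequest]) -> float:
    """Median time between opening and merging a pull request, in hours."""
    return median((pr.merged_at - pr.opened_at).total_seconds() / 3600 for pr in prs)


def mean_reviewers(prs: list[PullRequest]) -> float:
    """Average number of reviewers per pull request."""
    return mean(len(pr.reviewers) for pr in prs)


def days_to_first_merge(joined_on: date, first_merge_on: date) -> int:
    """Days between a new team member's start date and their first merged change."""
    return (first_merge_on - joined_on).days


# Made-up sample data.
prs = [
    PullRequest(datetime(2026, 6, 1, 9), datetime(2026, 6, 1, 11), ["ana"]),
    PullRequest(datetime(2026, 6, 2, 9), datetime(2026, 6, 2, 19), ["ana", "ben"]),
    PullRequest(datetime(2026, 6, 3, 9), datetime(2026, 6, 4, 15), ["ana", "ben", "cy"]),
]
print(median_hours_open(prs))                                   # 10.0
print(mean_reviewers(prs))                                      # 2
print(days_to_first_merge(date(2026, 6, 1), date(2026, 6, 9)))  # 8
```

The median is used for the open time because a single long-running pull request would otherwise distort the picture.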

How Important Is It?

Information flow and knowledge transfer determine, like almost no other mechanism, exactly what your team builds. Unclear requirements lead to the wrong features being built. Missing information increases the risk of bad decisions. Missing points of contact mean that ideas or problems never get raised.

How Can We Improve Communication & Collaboration?

Here are a few examples, which again depend on the proxies you choose:

Distributing the review load improves knowledge transfer.
Pairing or mob sessions spread knowledge and lead to better decisions.
Efficient, up-to-date documentation reduces dependencies and fosters independent work.
A well-maintained onboarding checklist enables new team members to become productive faster.
A dedicated onboarding buddy for every new team member lowers the barrier to asking questions and fosters integration into the team.

E: Efficiency & Flow

What Are We Measuring?

How undisturbed is engineers’ progress (with respect to interruptions, wait times, and handoffs)?

How Do We Measure It?

Efficiency and flow can be measured via Change Lead Time (DORA), wait times between workflow stages, or the number of handoffs per member. It’s also worth taking a look at developers’ calendars: How many blocks of uninterrupted focus time are there? How long are these blocks on average?

Here, too, you can back up these impressions with survey items like “I had enough uninterrupted focus time for my tasks.”
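The calendar questions can be answered with a few lines of code once you have the meetings of a working day as start and end times. The sketch below assumes that all meetings lie within working hours and that a focus block has to last at least two hours; both the function name and that minimum length are assumptions for illustration.

```python
from datetime import datetime, timedelta


def focus_blocks(
    day_start: datetime,
    day_end: datetime,
    meetings: list[tuple[datetime, datetime]],
    min_length: timedelta = timedelta(hours=2),  # hypothetical minimum
) -> list[timedelta]:
    """Return the lengths of all meeting-free gaps that last at least `min_length`."""
    blocks = []
    cursor = day_start  # end of the last meeting seen so far
    for start, end in sorted(meetings):
        if start - cursor >= min_length:
            blocks.append(start - cursor)
        cursor = max(cursor, end)  # also handles overlapping meetings
    if day_end - cursor >= min_length:
        blocks.append(day_end - cursor)
    return blocks


# Made-up calendar for one working day.
day = datetime(2026, 6, 29)
meetings = [
    (day.replace(hour=10), day.replace(hour=10, minute=30)),
    (day.replace(hour=13), day.replace(hour=14)),
]
blocks = focus_blocks(day.replace(hour=9), day.replace(hour=17), meetings)
print(len(blocks))                             # 2
print(sum(blocks, timedelta()) / len(blocks))  # 2:45:00
```

In this example, the hour before the first meeting does not count as a focus block because it is shorter than the assumed minimum.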

How Important Is It?

Context switches are very expensive, and packed calendars or focus blocks that are too short massively reduce a developer’s efficiency.

In the age of agentic coding, however, there’s a new challenge: where a developer a few years ago could still get into a kind of “flow state” in which they could solve hard problems and push things forward very effectively, today Claude Code or Codex often takes over. These new workflows come with wait times we didn’t have before.

Where you could previously help with focus blocks in the calendar and meeting-free days, the responsibility for the flow state now shifts onto the individual person. I increasingly see developers switching topics (and with that, context) during the wait times, or watching short-form videos.

Honestly, as of June 2026, this problem is not solved, and one can speculate that this phenomenon—much like burnout, for example—will become one of the unsolved problems of our industry.

How Can We Improve Efficiency & Flow?

Apart from the individual level I just touched on briefly, at the team level you can try the following approaches:

Encourage fixed focus blocks by, for example, requiring that meetings always take place in the morning.
Meeting-free days are often hard to implement, but at least worth a try.
Faster build pipelines reduce the risk of unnecessary wait times.

On a personal level, it can be worthwhile to push back against heavy automation and parallelization, and instead see whether the performance gains from AI agents are already noticeable even without multitasking.

Which of the SPACE Metrics Matter Most in 2026?

As the deep dive above shows, you can’t collect the SPACE metrics as directly as the DORA metrics. The additional effort is correspondingly higher. Important upfront: I don’t know a single team that collects all five DORA metrics and all five SPACE metrics.

The authors themselves recommend capturing metrics from at least three of the five dimensions. In the age of agentic coding, my firm recommendation is to regularly collect and evaluate the following three:

Satisfaction & Well-being
Communication & Collaboration
Performance

You’ll find that Efficiency & Flow emerges in part from the qualitative feedback on Satisfaction & Well-being and from your observations in Communication & Collaboration. I consider Activity highly dangerous in 2026, and by far the least meaningful.

The only case in which measuring activity would become valuable is if it dropped sharply and unexpectedly. But even then, it leaves the question of “why” open, which is why at that point, at the latest, you’d have to fall back on the other metrics from SPACE and DORA.

Conclusion

With the end of this series, you now have all the tools you need to measure both the delivery performance and the productivity of your team and its development processes. You now know the dangers that arise from wrong or too few metrics, and you know how to get around Goodhart’s Law so that your measurements stay meaningful.

With this knowledge, you have an important edge over your competition: as soon as you carry out regular measurements, you can assess how changes in your team or your company play out, and give well-founded answers to the questions that are on everyone’s mind in 2026:

Are AI agents making us faster? Is our software becoming unstable? How are my employees doing with the changes? Are the agents really worth their cost?

What’s the Best Way to Continue?

All that remains is to put in place the foundational structures you need to collect the values:

Figure out which DORA metrics you can read from your existing infrastructure, and, as a complement, take a look at Apache DevLake, for example.
Develop a form for an employee survey to collect the most important SPACE metrics on a quarterly basis at first.
Gather the qualitative and quantitative data in one place, e.g., a dashboard, so that you can see and assess everything at a glance.
Wait until you have enough data; measure for at least three months before you draw any conclusions from the results.
Continuously improve yourself and your team. Be willing to experiment. Be open to change.

If this series helped you, I’d be delighted if you sent the article link to your colleagues. If you need help with the implementation, write to me at hi@martinzoeller.com or book a slot for a no-obligation conversation.
