AI Observability in Developer Workflows: What the Latest Research Actually Shows

Published on 14 May 2026 by Zoia Baletska

AI has quietly settled into everyday development work. It shows up in code reviews, pull requests, documentation, and all the small decisions that keep things moving. What’s still missing, in many cases, is a clear understanding of what it’s actually doing there.
A recent paper, AI Observability for Developer Productivity Tools: Bridging Cost Awareness and Code Quality, takes a closer look at that gap. Rather than treating AI as something you either “use” or “don’t use,” it looks at how to track it properly—cost, behaviour, and output included—and how that data fits into development workflows.
The Problem: Plenty of Usage, Very Little Clarity
Teams today often rely on several AI tools at once. Some are embedded in IDEs, some in review systems, and others run as standalone tools. Each of them generates data, but it rarely ends up in one place. That creates a familiar situation: you know AI is being used, but not how much it costs overall, how often it’s triggered, or how reliable the outputs really are. There’s no clean way to connect usage with outcomes.
The paper frames this as an observability problem. Not in the traditional infrastructure sense, but in the broader sense of understanding how a system behaves under real conditions.
What the Paper Builds
To explore that idea, the authors built a system that combines two things that are usually separate:
- a developer-focused dashboard that tracks work like pull requests and AI-assisted reviews
- an observability layer that captures model usage, tokens, and costs
Instead of running side by side, both feed into the same data model. That shared layer is what makes it possible to look at AI activity in context, rather than in isolation.
The implementation itself is fairly straightforward—FastAPI on the backend, a simple frontend, and integrations with several AI providers—but the structure matters more than the stack.
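
To make that shared layer concrete, here is a minimal sketch of what a common ingestion endpoint could look like, assuming FastAPI and Pydantic. The route, field names, and in-memory store are illustrative assumptions, not the paper’s actual schema:

```python
# A minimal sketch of a shared telemetry model and ingestion endpoint.
# Route, field names, and the in-memory store are illustrative
# assumptions, not the schema from the paper.
from datetime import datetime, timezone

from fastapi import FastAPI
from pydantic import BaseModel, Field


class AIUsageEvent(BaseModel):
    tool: str                        # e.g. "ide-assistant", "review-bot"
    model: str                       # e.g. "gpt-4o"
    prompt_tokens: int
    completion_tokens: int
    pull_request: str | None = None  # ties AI activity to development work
    timestamp: datetime = Field(default_factory=lambda: datetime.now(timezone.utc))


app = FastAPI()
EVENTS: list[AIUsageEvent] = []      # stand-in for a real database


@app.post("/telemetry/ai-usage")
def record_usage(event: AIUsageEvent) -> dict:
    """Accept a usage event from any AI tool and store it in one place."""
    EVENTS.append(event)
    return {"stored": True, "total_events": len(EVENTS)}
```

The point of the sketch is the single event shape: once every tool, whether it lives in an IDE or a review pipeline, reports into the same model, usage and cost can be queried alongside development activity.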
Seven Patterns That Keep Showing Up
One of the more useful parts of the paper is a set of patterns that came out of the implementation. They’re not abstract ideas; they reflect what had to be built to make the system work:
- pulling real token usage directly from provider APIs
- maintaining a pricing registry for different models (sketched below)
- defining a shared telemetry format
- aggregating costs into dashboards
- validating model outputs before they’re used
- generating summaries with LLMs
- producing reports from collected data
Taken together, these patterns sketch out what “AI observability” looks like in practice. Not a single tool, but a combination of pieces that fill in different parts of the picture.
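
Two of these patterns, the pricing registry and per-call cost calculation, are simple enough to sketch. The prices below are placeholders rather than real provider rates, and, as the paper notes later, a real registry has to be maintained by hand:

```python
# A sketch of a pricing registry and cost calculation, assuming
# per-million-token pricing. The figures are placeholders, not
# real provider rates.
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_mtok: float   # USD per 1M prompt tokens
    output_per_mtok: float  # USD per 1M completion tokens


# Placeholder entries; a real registry needs manual upkeep.
PRICING: dict[str, ModelPrice] = {
    "gpt-4o": ModelPrice(2.50, 10.00),
    "claude-sonnet": ModelPrice(3.00, 15.00),
}


def call_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Turn raw token counts into a dollar cost for one call."""
    price = PRICING[model]
    return (prompt_tokens * price.input_per_mtok
            + completion_tokens * price.output_per_mtok) / 1_000_000
```

Under these placeholder prices, `call_cost("gpt-4o", 1200, 300)` comes to $0.006; the dashboards described in the paper are, in essence, aggregations of numbers like that one.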
What They Actually Ran
This wasn’t limited to a small demo. The system was used with real development activity over several months, including AI-assisted reviews and general usage across tools.
Data came from multiple places:
- direct API integrations
- CLI logs from local tools
- manual entries where automation wasn’t possible
That mix is worth noting. Even with a working system, not everything can be captured automatically.
What Came Out of It
A few results stand out, mostly because they put numbers behind things that are often guessed.
Token tracking turned out to be highly reliable when taken directly from provider APIs. The reported usage matched billing data, which means cost calculations could be treated as accurate rather than approximate.
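
Pulling provider-reported usage is nearly a one-liner with most clients. With the OpenAI Python client, for example, it looks roughly like this; the paper integrates several providers, so take this as one instance of the pattern:

```python
# Reading token usage straight from the provider's response rather
# than estimating it with a local tokenizer. OpenAI client shown as
# one example; other providers report usage similarly.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this diff."}],
)

# These counts come from the provider itself, which is why they
# reconcile with billing instead of being an approximation.
usage = response.usage
print(usage.prompt_tokens, usage.completion_tokens, usage.total_tokens)
```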
At the same time, output quality told a different story. A noticeable share of responses—29%—needed some form of cleanup before they could be used. That doesn’t make AI ineffective, but it does change how its contribution should be interpreted.
On the performance side, the system handled queries and dashboard updates quickly enough to be usable in day-to-day work. More importantly, it reduced the effort needed to understand usage patterns. Instead of piecing things together manually, the data was already there.
Another point the paper makes, indirectly, is that no single data source is enough. APIs cover part of the picture, logs cover another, and some gaps still require manual input.
Where the Approach Falls Short
The paper doesn’t gloss over the rough edges.
Some tools don’t expose their usage data at all, which makes automated tracking impossible. In those cases, costs have to be entered manually.
Validation pipelines work well when outputs follow a structured format, but become less reliable with free-form text.
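
A minimal sketch of why that is, assuming the model is asked to return JSON matching a known schema (the schema here is hypothetical): structured output either parses or it doesn’t, while free-form text offers nothing equivalent to check against.

```python
# Sketch: validating structured model output against a schema.
# The field names are illustrative assumptions.
from pydantic import BaseModel, ValidationError


class ReviewSummary(BaseModel):
    verdict: str       # e.g. "approve" or "request-changes"
    issues: list[str]


def validate_output(raw: str) -> ReviewSummary | None:
    """Accept the response only if it parses into the expected shape."""
    try:
        return ReviewSummary.model_validate_json(raw)
    except ValidationError:
        return None  # route to cleanup instead of using it blindly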
Pricing data isn’t automatically updated, so it has to be maintained over time. And anything based on parsing logs depends on formats that can change without notice.
None of these issues breaks the system, but they do limit how complete the picture can be.
What This Means for Teams Using Agile Analytics
The paper makes one thing clear: AI is no longer a side tool in development. It’s part of how work gets done, and it behaves like any other component in your delivery system—it produces output, consumes resources, and introduces variability.
That has consequences.
Most teams today track delivery performance through familiar signals—cycle time, throughput, and deployment frequency. Those still matter, but they don’t tell you much about how AI is shaping the work behind them. You might see faster reviews or shorter feedback loops, but without visibility into AI usage, it’s hard to tell whether that improvement is real, where it comes from, or what it costs.
The approach described in the paper points in a different direction. By tying AI activity—tokens, cost, outputs—directly to development workflows, it becomes possible to look at things in context. Not just how fast something moved, but what contributed to that speed, and whether it was worth it.
That opens up more grounded questions:
- Are AI-assisted reviews actually reducing effort, or just shifting it into cleanup?
- Which parts of the workflow benefit from AI, and which ones don’t?
- How does AI usage affect cost at the level of a team, a feature, or a sprint? (See the sketch below.)
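
The last question becomes answerable once each usage event is tagged with workflow metadata at the point of capture. A hypothetical aggregation over events like the ones a shared telemetry layer would collect:

```python
# Hypothetical: rolling per-call costs up to team/sprint level.
# Assumes each event was tagged with team and sprint when recorded.
from collections import defaultdict

events = [
    {"team": "payments", "sprint": "2026-S3", "cost_usd": 0.042},
    {"team": "payments", "sprint": "2026-S3", "cost_usd": 0.011},
    {"team": "platform", "sprint": "2026-S3", "cost_usd": 0.097},
]

cost_by_team_sprint: dict[tuple[str, str], float] = defaultdict(float)
for e in events:
    cost_by_team_sprint[(e["team"], e["sprint"])] += e["cost_usd"]

for (team, sprint), total in sorted(cost_by_team_sprint.items()):
    print(f"{team} / {sprint}: ${total:.2f}")
```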
This is where observability starts to matter in a practical sense, not as another dashboard, but as a way to connect signals that are currently disconnected.
For teams working with Agile Analytics, the implication is fairly direct. If AI is influencing delivery, then it belongs in the same analytical space as everything else—alongside DevEx signals, flow metrics, and operational data. Otherwise, you’re only seeing part of the system.
The paper doesn’t present a finished solution, and it doesn’t try to. What it shows is that once you start measuring AI usage properly, a lot of assumptions become testable. Cost stops being an estimate. Output quality stops being anecdotal. Patterns start to emerge. And that changes how decisions get made.