Measuring Agent Effectiveness

We’ve been working on an agent to help site reliability engineers identify errors and suggest fixes, basically, a kind of debugging copilot that doesn’t sleep or rest. In our previous blog, we explored how we moved beyond a chess-playing agent to tackle real-world systems. Now, we’re pushing things further: can an AI agent help keep real systems running?

To find out, we focused on a widely-used deployment stack; AWS, Kubernetes, GitHub, and Slack and used Google’s well-known microservices-demo application as our testing ground. It gave us a realistic environment to see whether the agent could step up when things go wrong.

Before we dive into how we measure that, it’s worth quickly explaining what we mean by agent. The simplest way to think about an agent is this: unlike a traditional workflow, which follows a fixed set of steps, an agent decides what to do based on the tools it has access to. It’s not told how to solve a problem, it figures that out on its own.

Which led us to the next question: how do we measure that? What does it mean for an agent to be effective?

This blog is about how we’re evaluating the effectiveness and usefulness of agents, what we’re testing for, and how we’re deciding whether the agent is actually doing its job.

What Makes an Agent Effective?

At a high level, an effective agent does three things: it spots the right problem, picks the right tool to solve it, and suggests a fix that actually works. But in practice, it’s a bit messier. Sometimes the agent gets close but misses the mark, or it suggests a fix that solves a different issue.

The most effective agents don’t just land technically correct answers, they make it easy to follow their suggestions and fix the issue without hassle. They save you time by reducing manual troubleshooting, provide clear, accurate advice, and offer solutions you can actually act on.

That’s why we think of effectiveness as two parts:

Effectiveness: Can the agent actually solve the problem?
Usability: Is the fix clear, concise, and actionable without unnecessary complexity?

It’s not enough to be technically right, the real goal is helping someone get unstuck faster than they would on their own.

What’s Out There Today?

There’s been a lot of progress recently on evaluating agents, with frameworks like Confident AI, AWS’s agent evaluation repo, and MCP Evals focusing mostly on tool use, task success, and reasoning steps. They’re great for checking if the agent called the right tools and completed the workflow, but we found they don’t always measure what matters most day-to-day: how much manual work the agent actually saves, and whether the solution it suggests is clear and immediately usable.

That’s where our approach is different, we focused not just on whether the agent was technically right, but also on how much faster and easier it made the debugging process, tracking things like minutes saved, how much follow-up was needed, and whether the output was clear and concise.

The Evaluation Plan: Measuring Agent Effectiveness

We put together a small test set of real-world errors, things like “can’t add item to cart” or “payment failed” and used a simple table to score the agent on each one.

Below is an example of what we’re tracking:

Error	Error Type	Identified?	Correct Solution Suggested?
A short description of the error	Whether it’s an application or system-level issue	Did the agent correctly identify the root cause?	Did it suggest a fix that would actually solve the problem?

From this, we can measure:

Error identification rate – how often does it correctly spot the issue?
Solution accuracy – out of the ones it spotted, how often did it suggest a fix that worked?

Other things we can also look such as:

Did it choose the right tool from our internal toolset (The MCP server).
Produced code that actually runs
Gave output that was concise, not just correct

What About Usability?

Even if an agent gets the answer right, it doesn’t help if the output is confusing or bloated.

So we added another lens: how usable is this suggestion?

Is it concise and traceable?	Is it actionable?	Time saved vs manual debugging	Any follow-up needed?
Is the response concise and easy to trace back to the issue? Did it include any irrelevant details or bloated output?	Is it actionable, can I take it and do something with it straight away?	How much time did it save compared to manual debugging?	Was any follow-up needed, or was the fix self-contained?

What We Found

Effectiveness

To get a sense of how well the agent is performing, we ran it against a handful of real errors taken from production-like scenarios. We wanted to know:

Did it spot the root cause of the problem?
Did it suggest a fix that actually works?

Each row in the table below is a real example we tested. It includes the type of error, whether the agent identified it correctly, and whether it suggested the right solution.

	Error	Error Type	Identified?	Correct Solution Suggested?
1	An exception block was added to the Cart Service to throw an error when someone tries to add loafers to the cart.	Application	✅	✅
2	In the Payment Service, there was incorrect logic in an if statement—it should only throw an exception when credit card details are invalid, such as during order placement.	Application	❌	❌
3	The Currency Service had no conversion rate for GBP. This caused a check to throw an exception due to a zero or negative conversion rate when switching to GBP.	Application	✅	✅
4	Product Catalog Service – a negative price was set for ducks, which should not be allowed.	Application	✅	✅
5	A misconfiguration in the Kubernetes manifest allowed the Checkout Service to request more memory than the node could handle, causing the node to crash.	System	❌	❌

Usability

Even if the agent gave the right solution, the suggestion also needs to be clear, actionable, and ideally, the goal is to save time. So for each test case, we also looked at how easy it was to follow the response and whether it actually helped.

Here’s what we tracked:

Error	Is it concise and traceable?	Is it actionable?	Time saved vs manual debugging	Any follow-up needed?
1 – Cart Service (loafers)	Yes – it pinpointed the exact code block that triggered the error and identified the file it came from.	Yes – showed the code and gave a sensible recommendation.	~15 minutes saved	No
2 – Payment Service	No – there was no response from the agent.	N/A	None	Yes – manual investigation needed
3 – Currency Service (GBP)	Yes – correctly identified the root cause.	Yes – clearly explained that the GBP conversion rate was missing.	~15 minutes saved	No
4 – Product Catalog (ducks)	Yes – spotted that the duck price in product.json wasn’t positive.	Yes – clearly said we need to set a valid price in the right file.	~15 minutes saved	No
5 – Node crashed system error	Yes – It suggests several recommended actions for further investigation.	Yes – Although it does not provide a solution that would immediately fix the crashed node, it suggested action to take for further investigation.	None	Yes

Results

From what we've seen, the agent does well with application errors, as long as it has access to the right logs. For example, it missed the Payment Service error in our second test case because the pod had already restarted, and the agent didn't check logs from the terminated pod. This was due to the system setup and prompt configuration.

System-level errors are trickier. In our test case, the agent noticed the node wasn't ready but failed to identify the root cause: a pod requesting more memory than the node could handle. This limitation comes from our MCP server lacking a tool to surface node health status and resource metrics.

So overall, the agent's performance depends heavily on three key factors: system configuration, available tools, and prompt structure. Can your MCP server provide the necessary data? Are your prompts specific enough to be useful, yet general enough to work across services? Does the agent select the right tools? These factors directly impact the agent's effectiveness.

One challenge we found is crafting prompts that are both general enough to work across different issues and detailed enough to be useful. This may require careful prompt engineering, or even developing tailored prompts for different error types.

Could this evaluation framework complement existing tools and help improve prompt engineering while measuring effectiveness and usefulness?

Trust In Your Agent In Production

Another major factor to consider when using an agent in production is trust. If you’re going to rely on an agent during debugging, you need to trust it. That doesn’t just mean being right once, it needs to be consistently helpful, and never get in your way. The moment it suggests the wrong fix and creates more work, that trust starts to break down.

Personally, I’d need a lot of confidence in the system before letting it restart a deployment, let alone touch the codebase, build Docker images, or push to a registry. I don’t think most SREs would be comfortable giving an agent too much control. Imagine it scales down all your nodes to “fix” an error, and ends up taking production offline for hours. That’s a disaster.

That said, I do see huge value in having agents that run diagnostics based on logs, especially in large systems with hundreds of microservices, and new engineers joining all the time. Just having something spot issues early can save a lot of time. It’s like having an extra set of eyes to make sure things are running as expected.

Of course, there’s a cost to all this. Ingesting millions of logs into an LLM isn’t free, financially or technically. You’ve got to think about the compute cost, the data pipeline, and whether the value you’re getting out of it justifies the resources. In practice, you’ll probably want to be smart about what’s ingested, maybe just recent logs, or logs around specific incidents to reduce the cost if you are making API calls to LLM services.

Another thing to keep in mind: your agent is only as good as your existing monitoring and alerting. What if agents could help improve your observability? Spot patterns that keep being missed, and suggest better alerts or log lines? That kind of feedback loop could be really powerful.

That’s why we’re not just measuring accuracy, but also clarity, usefulness, and how often it hallucinates. An agent that’s technically correct but confusing is almost as bad as no agent at all.

So before putting one in production, we need to think carefully:

What permissions should it have?
What happens if something goes wrong?
How much does it cost to run, and is it worth it?

What’s Next?

You might have an agent that solves your problem exactly how you want it, but is it still worth it if it ends up costing more than the value it brings? Autonomy isn’t cheap, and in some cases, the wrong setup could burn through your budget fast.

Our next blog post will dive into that: how expensive is autonomy really, and could the wrong agent setup actually bankrupt you?