AI agents are moving fast, but can they handle the real world? After building a chess-playing agent to test the basics (read more about that here), we wanted to see what would happen if we pushed things further. The question: can an AI agent help keep real systems running?
Why Move Beyond Chess?
Our first project, the Fuzzy Labs Chess Agent, gave us a tidy little sandbox. A place to test agent behaviour in a predictable environment. Clear rules and outputs with no messy surprises. But real-world systems aren’t chessboards. They’re noisy, unpredictable, and full of edge cases.
So we decided to take the next step: drop an agent into a much harder problem space — site reliability engineering (SRE). Our CTO, Matt Squire, was particularly keen to eradicate some of the stress he went through as an on-call software engineer in a prior life, being woken in the night by error logs that had no real impact on the state of the application.
Ultimately, we built the Agentic SRE as a kind of stress test: could an agent observe real logs, interpret alerts, and suggest meaningful responses? Could it even take limited actions in a production environment?
How It Worked
We reused the same foundation: the Model Context Protocol (MCP), an open standard for building agentic systems. But this time, instead of logging into a chess server, the agent would interact with live infrastructure.
We gave it tools to:
- Read logs and metrics from Kubernetes (a sketch of this tool follows the list)
- Pull files from GitHub
- Post updates in Slack
- Suggest remediation steps based on real-time input
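To make this concrete, here's a minimal sketch of what the Kubernetes log tool could look like, assuming the official MCP Python SDK (FastMCP) and the Kubernetes Python client. The tool name, namespace, and defaults are illustrative, not our actual implementation.

```python
# Minimal sketch of an MCP tool that reads Kubernetes pod logs.
# Assumes the MCP Python SDK (FastMCP) and the kubernetes client;
# names and defaults are illustrative, not our production code.
from kubernetes import client, config
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sre-tools")

@mcp.tool()
def read_pod_logs(pod: str, namespace: str = "default", tail_lines: int = 200) -> str:
    """Return the most recent log lines for a pod, for the agent to interpret."""
    config.load_kube_config()  # an in-cluster deployment would use load_incluster_config()
    v1 = client.CoreV1Api()
    return v1.read_namespaced_pod_log(name=pod, namespace=namespace, tail_lines=tail_lines)

if __name__ == "__main__":
    mcp.run()  # exposes the tool over MCP so the agent can call it
```

The agent decides when to call a tool like this during an incident, and the returned log text flows back into its context as evidence for its reasoning.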
We also started peeling back some of the layers around tools like Claude Desktop, aiming to move toward a fully transparent, abstraction-free agent. Our production journey tracks how we’re evolving from a desktop toy to a real-world deployment inside a Kubernetes cluster on AWS.
Our Focus: Effectiveness, Security, and Cost
As the project has evolved, three themes have emerged as core research areas:
1. Effectiveness
We care deeply about whether the agent’s reasoning is correct, useful, and comprehensible. Can it surface relevant information during an incident? Can it propose actionable steps? Can a human operator trust it, or at least understand its logic?
We’re experimenting with evaluation methods for agentic effectiveness, beyond just model accuracy or latency.
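As an example of the direction we're exploring, here's a minimal sketch of rubric-style checks on a single agent run. The `IncidentTrace` structure and the criteria are hypothetical, chosen to illustrate the idea of human-auditable effectiveness checks rather than a finished framework.

```python
# Hypothetical sketch of rubric-style effectiveness checks on one agent run.
# The IncidentTrace fields and criteria are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class IncidentTrace:
    alert: str                   # the alert that triggered the agent
    cited_log_lines: list[str]   # log evidence the agent referenced
    proposed_actions: list[str]  # remediation steps the agent suggested
    relevant_log_lines: list[str] = field(default_factory=list)  # ground truth from a human reviewer

def evaluate(trace: IncidentTrace) -> dict[str, bool]:
    """Score one agent run on simple, human-auditable criteria."""
    return {
        # Did the agent cite at least one log line a human reviewer marked as relevant?
        "surfaced_relevant_evidence": any(
            line in trace.cited_log_lines for line in trace.relevant_log_lines
        ),
        # Did it propose at least one concrete next step?
        "proposed_an_action": bool(trace.proposed_actions),
    }
```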
2. Security
Any agent that can observe or take action in production must be built with security at its core. We're exploring agent permissions, sandboxed execution, human-in-the-loop controls, and the boundaries between suggestion and action.
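One pattern we keep coming back to here is a hard boundary between suggestion and action: a mutating step only runs once a human explicitly approves it. The sketch below is illustrative; the function names and the approval flow are assumptions, not our final design.

```python
# Hypothetical human-in-the-loop gate: the agent may *suggest* a remediation,
# but a mutating action only executes after explicit human sign-off.
from typing import Callable

def require_approval(action_name: str, ask_human: Callable[[str], bool]):
    """Wrap a mutating action so it only runs once a human approves it."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            prompt = f"Agent wants to run '{action_name}' with {args}, {kwargs}. Approve?"
            if not ask_human(prompt):  # e.g. a Slack message the on-call engineer responds to
                return f"'{action_name}' was suggested but not approved; no action taken."
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_approval("restart_deployment", ask_human=lambda p: input(p + " [y/N] ") == "y")
def restart_deployment(name: str, namespace: str) -> str:
    # The actual restart call (e.g. via the Kubernetes API) would go here.
    return f"restarted {name} in {namespace}"
```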
3. Cost
Long-context models, tool use, and high inference rates come at a cost, both computationally and financially. We’re investigating how to balance intelligence with efficiency, and where simplification makes sense.
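To give a feel for the trade-off, a back-of-the-envelope cost model is easy to sketch. The token counts and per-token prices below are made-up placeholders, not measured figures or quoted rates.

```python
# Back-of-the-envelope cost model for one agentic incident run.
# All numbers are illustrative placeholders, not measured or quoted prices.
def cost_per_incident(input_tokens: int, output_tokens: int,
                      price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    return (input_tokens / 1e6) * price_in_per_mtok + (output_tokens / 1e6) * price_out_per_mtok

# Example: a long-context run that pulls in logs, metrics and repo files.
incident_cost = cost_per_incident(
    input_tokens=150_000,     # logs + metrics + GitHub files + conversation history
    output_tokens=4_000,      # analysis and suggested remediation
    price_in_per_mtok=3.0,    # $ per million input tokens (placeholder)
    price_out_per_mtok=15.0,  # $ per million output tokens (placeholder)
)
print(f"${incident_cost:.2f} per incident")  # ≈ $0.51 with these placeholder numbers
```

Multiply a figure like that by the number of alerts a noisy system fires per day and it's clear why knowing when a smaller model, a shorter context, or no agent at all will do is part of the research.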
Where We Go From Here
We’re not trying to replace human SREs. This isn’t about automation for its own sake. It’s about learning what agents can realistically do today, where they struggle, and how we can build toward safer, more effective systems.
Some of the big questions we’re thinking about now:
- What does “good” agent behaviour look like in the chaos of real incidents?
- How do we evaluate usefulness, not just correctness?
- What does safe decision-making look like when the stakes are high?
This is still early research. The Agentic SRE isn’t a product, it’s an experiment. But so far, the results are promising.
Follow the next series of blogs, which dive deeper into the implementation and the three key research areas.