AI agents are moving fast, but can they handle the real world? After building a chess-playing agent to test the basics (read more about that here), we wanted to see what would happen if we pushed things further. The question: can an AI agent help keep real systems running?
Why Move Beyond Chess?
Our first project, the Fuzzy Labs Chess Agent, gave us a tidy little sandbox. A place to test agent behaviour in a predictable environment. Clear rules and outputs with no messy surprises. But real-world systems aren’t chessboards. They’re noisy, unpredictable, and full of edge cases.
So we decided to take the next step: drop an agent into a much harder problem space — site reliability engineering (SRE). Our CTO, Matt Squire, was particularly keen to eradicate some of the stress he went through as an on-call software engineer in a prior life, being woken in the night by error logs that had no real impact on the state of the application.
Ultimately, we built the Agentic SRE as a kind of stress test: could an agent observe real logs, interpret alerts, and suggest meaningful responses? Could it even take limited actions in a production environment?
How It Worked
We reused the same foundation: the Model Context Protocol (MCP), an open standard for building agentic systems. But this time, instead of logging into a chess server, the agent would interact with live infrastructure.
We gave it tools to:
- Read logs and metrics from Kubernetes (a sketch of this tool follows the list)
- Pull files from GitHub
- Post updates in Slack
- Suggest remediation steps based on real-time input
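To make this concrete, here's a minimal sketch of what the Kubernetes log tool could look like, assuming the official MCP Python SDK (FastMCP) and the Kubernetes Python client. The tool name, namespace, and defaults are illustrative, not our actual implementation.

```python
# Minimal sketch of an MCP tool that reads Kubernetes pod logs.
# Assumes the MCP Python SDK (FastMCP) and the kubernetes client;
# names and defaults are illustrative, not our production code.
from kubernetes import client, config
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("sre-tools")

@mcp.tool()
def read_pod_logs(pod: str, namespace: str = "default", tail_lines: int = 200) -> str:
    """Return the most recent log lines for a pod, for the agent to interpret."""
    config.load_kube_config()  # an in-cluster deployment would use load_incluster_config()
    v1 = client.CoreV1Api()
    return v1.read_namespaced_pod_log(name=pod, namespace=namespace, tail_lines=tail_lines)

if __name__ == "__main__":
    mcp.run()  # exposes the tool over MCP so the agent can call it
```

The agent decides when to call a tool like this during an incident, and the returned log text flows back into its context as evidence for its reasoning.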
We also started peeling back some of the layers around tools like Claude Desktop, aiming to move toward a fully transparent, abstraction-free agent. Our production journey tracks how we’re evolving from a desktop toy to a real-world deployment inside a Kubernetes cluster on AWS.
Our Focus: Effectiveness, Security, and Cost
As the project has evolved, three themes have emerged as core research areas:
1. Effectiveness
We care deeply about whether the agent’s reasoning is correct, useful, and comprehensible. Can it surface relevant information during an incident? Can it propose actionable steps? Can a human operator trust it, or at least understand its logic?
We’re experimenting with evaluation methods for agentic effectiveness, beyond just model accuracy or latency.
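As an example of the direction we're exploring, here's a minimal sketch of rubric-style checks on a single agent run. The `IncidentTrace` structure and the criteria are hypothetical, chosen to illustrate the idea of human-auditable effectiveness checks rather than a finished framework.

```python
# Hypothetical sketch of rubric-style effectiveness checks on one agent run.
# The IncidentTrace fields and criteria are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class IncidentTrace:
    alert: str                   # the alert that triggered the agent
    cited_log_lines: list[str]   # log evidence the agent referenced
    proposed_actions: list[str]  # remediation steps the agent suggested
    relevant_log_lines: list[str] = field(default_factory=list)  # ground truth from a human reviewer

def evaluate(trace: IncidentTrace) -> dict[str, bool]:
    """Score one agent run on simple, human-auditable criteria."""
    return {
        # Did the agent cite at least one log line a human reviewer marked as relevant?
        "surfaced_relevant_evidence": any(
            line in trace.cited_log_lines for line in trace.relevant_log_lines
        ),
        # Did it propose at least one concrete next step?
        "proposed_an_action": bool(trace.proposed_actions),
    }
```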
2. Security
Any agent that can observe or take action in production must be built with security at its core. We're exploring agent permissions, sandboxed execution, human-in-the-loop controls, and the boundaries between suggestion and action.
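One pattern we keep coming back to here is a hard boundary between suggestion and action: a mutating step only runs once a human explicitly approves it. The sketch below is illustrative; the function names and the approval flow are assumptions, not our final design.

```python
# Hypothetical human-in-the-loop gate: the agent may *suggest* a remediation,
# but a mutating action only executes after explicit human sign-off.
from typing import Callable

def require_approval(action_name: str, ask_human: Callable[[str], bool]):
    """Wrap a mutating action so it only runs once a human approves it."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            prompt = f"Agent wants to run '{action_name}' with {args}, {kwargs}. Approve?"
            if not ask_human(prompt):  # e.g. a Slack message the on-call engineer responds to
                return f"'{action_name}' was suggested but not approved; no action taken."
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@require_approval("restart_deployment", ask_human=lambda p: input(p + " [y/N] ") == "y")
def restart_deployment(name: str, namespace: str) -> str:
    # The actual restart call (e.g. via the Kubernetes API) would go here.
    return f"restarted {name} in {namespace}"
```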
3. Cost
Long-context models, tool use, and high inference rates come at a cost, both computationally and financially. We’re investigating how to balance intelligence with efficiency, and where simplification makes sense.
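To give a feel for the trade-off, a back-of-the-envelope cost model is easy to sketch. The token counts and per-token prices below are made-up placeholders, not measured figures or quoted rates.

```python
# Back-of-the-envelope cost model for one agentic incident run.
# All numbers are illustrative placeholders, not measured or quoted prices.
def cost_per_incident(input_tokens: int, output_tokens: int,
                      price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    return (input_tokens / 1e6) * price_in_per_mtok + (output_tokens / 1e6) * price_out_per_mtok

# Example: a long-context run that pulls in logs, metrics and repo files.
incident_cost = cost_per_incident(
    input_tokens=150_000,     # logs + metrics + GitHub files + conversation history
    output_tokens=4_000,      # analysis and suggested remediation
    price_in_per_mtok=3.0,    # $ per million input tokens (placeholder)
    price_out_per_mtok=15.0,  # $ per million output tokens (placeholder)
)
print(f"${incident_cost:.2f} per incident")  # ≈ $0.51 with these placeholder numbers
```

Multiply a figure like that by the number of alerts a noisy system fires per day and it's clear why knowing when a smaller model, a shorter context, or no agent at all will do is part of the research.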
Where We Go From Here
We’re not trying to replace human SREs. This isn’t about automation for its own sake. It’s about learning what agents can realistically do today, where they struggle, and how we can build toward safer, more effective systems.
Some of the big questions we’re thinking about now:
- What does “good” agent behaviour look like in the chaos of real incidents?
- How do we evaluate usefulness, not just correctness?
- What does safe decision-making look like when the stakes are high?
This is still early research. The Agentic SRE isn’t a product, it’s an experiment. But so far, the results are promising.
Follow the next series of blogs, which dive deeper into the implementation and the three key research areas.