Case study

Use case

Testing the Trusted: Helping uncover critical flaws in AI Security tools

How do we know the AI tools we trust to test models are doing their job?

Company

Plexal connects startups and scale-ups with government and industry. Since 2017 they’ve supported more than 1,200 companies, created 9,400+ jobs, and added £731m to the UK economy. Much of their work happens in areas where security is non-negotiable.

https://www.plexal.com/
Headquarters
London, UK
Industry
Innovation, consultancy

AI is finding its way into every corner of industry, and the threats are keeping pace. In 2024, thousands of malicious machine-learning models were uploaded to Hugging Face, the platform where researchers and engineers share models. Many of these models appeared routine in name and description, yet some concealed techniques that could execute code on load, leak data, or sidestep safety rules.

The platforms themselves are not the problem. The risk lies in the speed at which new exploits appear, and in whether the tools designed to spot them are actually up to the job. That raises an obvious question: how do we know the tools we trust to test models are doing their job?

The challenge

The list of tools that claim to protect models from attacks is growing fast. We wanted to know how effective they actually are in practice – not just what the documentation says.

Plexal asked us to test a few of the leading open-source options to see whether they caught what mattered, and if not, where the gaps were.

Key risks in scope included:

  • Deserialisation exploits: code hidden inside a model file that runs when the file is loaded (see the sketch after this list).
  • LLM jailbreaks: carefully crafted prompts that try to push a model past its own safety rules.
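
To make the first risk concrete, here is a minimal sketch of a deserialisation exploit: a pickled object's __reduce__ hook tells the unpickler to call an arbitrary function the moment the file is loaded. The file name, class name and command are purely illustrative.

```python
import os
import pickle


class MaliciousPayload:
    """Illustrative only: a pickled object whose __reduce__ hook
    makes the unpickler run a command as soon as the file is loaded."""

    def __reduce__(self):
        # The unpickler will call os.system("echo pwned") during loading,
        # before any application code ever inspects the "model".
        return (os.system, ("echo pwned",))


# Writing the "model" file looks completely routine...
with open("innocent_model.pkl", "wb") as f:
    pickle.dump(MaliciousPayload(), f)

# ...but simply loading it executes the hidden command.
with open("innocent_model.pkl", "rb") as f:
    pickle.load(f)  # never do this with an untrusted file
```

This is exactly the class of behaviour that file scanners such as ModelScan and Fickling aim to flag before a model is ever loaded.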

The three tools we tested, all open source and widely used across the AI security community, were:

ModelScan:

  • A tool for checking your machine learning models before they reach production. It scans models across different formats for unsafe or malicious code, helping you avoid deploying something that could cause harm. It’s open source (developed by Protect AI) and free to use, though, like most community tools, it depends on active maintenance to stay current.
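
As a rough illustration of where such a scan could sit in a pipeline, the sketch below shells out to the ModelScan CLI. The -p/--path flag follows the project's documented usage at the time of writing, and the exit-code handling is our own assumption, so check the current README before relying on either.

```python
import subprocess

# Hedged sketch: run the ModelScan CLI over a model file before deployment.
# The "-p" flag reflects the documented usage ("modelscan -p <path>"), but
# flag names and exit codes may change between releases - verify first.
result = subprocess.run(
    ["modelscan", "-p", "suspect_model.pkl"],
    capture_output=True,
    text=True,
)

print(result.stdout)

# Assumption: a non-zero exit code means findings (or an error), so we fail
# closed rather than shipping an unscanned model.
if result.returncode != 0:
    raise SystemExit("ModelScan reported issues - do not deploy this model.")
```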

Fickling:

  • A pickle file scanner and decompiler. Fickling is easy to integrate into AI/ML environments to catch malicious pickle files that could compromise ML models or the hosting infrastructure. Fully open source and maintained by Trail of Bits.
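
A minimal sketch of that kind of integration is shown below. It assumes the top-level fickling.is_likely_safe and fickling.load helpers described in the project's README for recent releases; treat the exact names as assumptions and confirm them against the version you install.

```python
import fickling

# Hedged sketch: assumes recent fickling releases expose top-level
# is_likely_safe() and load() helpers (per the project README); confirm the
# exact API against the installed version before relying on it.
if fickling.is_likely_safe("downloaded_model.pkl"):
    # Only deserialise once the static analysis finds nothing suspicious.
    with open("downloaded_model.pkl", "rb") as f:
        model = fickling.load(f)  # drop-in style replacement for pickle.load
else:
    raise SystemExit("Pickle file looks unsafe - refusing to load it.")
```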

Guardrails AI:

  • A framework for adding safety checks around large language models. Guardrails lets you define what “good” output looks like – for example, blocking harmful responses or enforcing a specific structure – and wraps those rules around your model automatically. It’s open source under the Apache licence, though there are managed and enterprise versions built on top of the same framework. It’s designed for customisation and observability, so you can see exactly how and when it steps in.
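
Conceptually, a guardrail wraps every model call in a set of output checks. The sketch below is a generic illustration of that pattern rather than Guardrails AI's actual API; the validator, guarded_call and fake_llm names are hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical illustration of the guardrail pattern (not Guardrails AI's API):
# every validator inspects the model's output and reports whether it passes.

def no_banned_phrases(text: str) -> bool:
    """Toy check: block outputs containing obviously disallowed content."""
    banned = ["ignore previous instructions", "here is how to build a weapon"]
    return not any(phrase in text.lower() for phrase in banned)


def guarded_call(llm_call: Callable[[str], str], prompt: str,
                 validators: list[Callable[[str], bool]]) -> str:
    """Run the model, then apply every validator before returning the output."""
    output = llm_call(prompt)
    for validate in validators:
        if not validate(output):
            # Fail closed: refuse rather than return output that breaks the
            # rules we have defined for "good" responses.
            return "Sorry, I can't help with that."
    return output


# Usage with a stand-in model function (a real setup would call an actual LLM).
def fake_llm(prompt: str) -> str:
    return "Here is a safe, on-topic answer about model scanning."


print(guarded_call(fake_llm, "Tell me about model scanning.", [no_banned_phrases]))
```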

What happened next? Introducing the AI Security Sandbox

Together we built the AI Security Sandbox: a modular framework for checking AI security tools under realistic conditions. The aim was a system Plexal could run again and again to build evidence they could trust.

The AI Security Sandbox works by:

  • translating tool promises into testable claims,
  • generating realistic and adversarial attacks with a simulation engine,
  • running those attacks through a test harness that logs behaviour and results (sketched below),
  • and keeping everything modular and containerised with Docker, so new tools or threats can be added with minor changes.
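
As a rough picture of how those pieces fit together, here is a minimal sketch of the harness idea. The Claim and Attack structures, the demo scanner and the pass criteria are our own illustrative assumptions, not the Sandbox's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attack:
    """A simulated adversarial input, e.g. a malicious pickle or a jailbreak prompt."""
    name: str
    category: str   # "deserialisation" or "jailbreak"
    payload: str


@dataclass
class Claim:
    """A tool promise translated into a testable statement."""
    tool: str
    category: str                     # which class of attacks the claim covers
    detect: Callable[[Attack], bool]  # wraps the tool under test; True = caught


def run_harness(claims: list[Claim], attacks: list[Attack]) -> list[dict]:
    """Run every in-scope attack against every claim and log the outcome."""
    results = []
    for claim in claims:
        for attack in (a for a in attacks if a.category == claim.category):
            caught = claim.detect(attack)
            results.append({"tool": claim.tool, "attack": attack.name, "caught": caught})
            print(f"[{claim.tool}] {attack.name}: {'caught' if caught else 'MISSED'}")
    return results


# Toy usage: a dummy "scanner" claim and a single simulated attack.
demo_claim = Claim("demo-scanner", "deserialisation",
                   detect=lambda attack: "os.system" in attack.payload)
demo_attack = Attack("hidden-exec", "deserialisation", payload="os.system('echo pwned')")
run_harness([demo_claim], [demo_attack])
```

In a setup like this, each detect function would wrap the tool under test inside its own Docker container, so adding a new tool or threat is largely a matter of registering another Claim or Attack.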

What we found

Model file scanners

  • ModelScan: ⚠️ missed a previously unknown deserialisation exploit (since patched), but it met all other claims.
  • Fickling:  ✅ caught all exploits that it claimed to.

LLM guardrails

  • Guardrails AI: ⚠️ detected only 54.55% of jailbreak attempts in our tests. It stopped simple prompt injections but faltered on more complex adversarial prompts.

What’s a jailbreak? A crafted prompt designed to trick a large language model into ignoring its safety rules.

Why it matters

Our work with Plexal made the risks tangible. Breaking each tool into testable claims and simulating realistic adversarial behaviour turned blind trust into confidence grounded in evidence.

True security, though, is bigger than any single tool. It comes from confidence in the whole system: how models are trained and stored, the dependencies they rely on, and the environments they run in.

This project showed that trust must be earned through simulation, transparency and repeatable testing.

Beyond the project

The Sandbox was built so it could be reused and extended. Plexal now has the option to run it with new tools or new types of threat as they appear.

The takeaway

AI security tools are nascent. They provide a defence against the threats they’re built to catch, but the landscape is slippery and changes fast – yesterday’s green tick doesn’t guarantee today’s assurance.

The AI Security Sandbox gives Plexal confidence in those defences. It lets them prove their controls work for the risks that matter now, and keep proving it as new threats emerge.

For Plexal, operating in domains where the cost of failure is high, that confidence is what matters most – not new software, not new process, but proof that their security stands up when it counts.

Trust grounded in evidence, and a way to keep earning it.