Lab Notes • 10 minutes • Mar 28, 2024

How Safe Are Your LLM Safeguards?

Mikhail Iakovlev
Senior MLOps Engineer

This blog is part of our "Cybersecurity for Large Language Models" series. You can find out more here.

In our previous blogs on jailbreak attacks and membership inference attacks, we hinted that guardrails might be one way to tackle these problems. In this blog, we double down on that idea and explain how to add safeguarding mechanisms on top of LLMs.

Just as a gorilla safeguards its territory, having robust security measures in place can help prevent misuse of LLMs. Each safeguard can be thought of as an independent layer of defence, reinforcing the overall protection against potential threats and ensuring the responsible and ethical use of LLMs.

In this blog, we will outline the importance of having different safeguards, explore ways to safeguard an LLM through a case study on off-topic modelling, and cover various approaches to evaluating these safeguards.

Safeguards are by no means foolproof. Motivated attackers tend to find ways around them, so having a system in place that continuously evolves is essential for flagging these threats early.

Approaches to Evaluating LLM Safeguards

Bad actors can misuse an LLM to conduct cyber attacks, generate and propagate compelling misinformation content, or exploit vulnerabilities for illicit gains. A recent study by Check Point Research (CPR) detailed in their blog showcases the diverse array of methods employed by cybercriminals to create and distribute malware, including infostealers and encryption tools, leveraging the capabilities of OpenAI's ChatGPT.

Adding safeguards to an LLM can help detect and prevent such nefarious behaviours. For LLMs to work reliably and safely, the safeguards themselves should be evaluated and carefully calibrated. A robust validation cycle for any safeguard can be broken into three components: collecting a dataset, comprehensive testing and validation, and monitoring and feedback mechanisms.

Collecting an LLM safeguard evaluation dataset

An important step in creating safeguards is testing across a range of diverse scenarios. Part of that process involves collecting and curating evaluation datasets relevant to the safeguard's objective. This is challenging, but one way to bootstrap the process is to pose the following question to the LLM itself: “What kind of questions should an AI assistant not answer?”. One such example is the Do-Not-Answer dataset for evaluating safeguards in LLMs. There has also been a lot of progress in creating open-source safety datasets to assess the risks posed by LLMs.
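To make this concrete, here is a minimal sketch of what such an evaluation set can look like in code: each entry pairs a prompt with the verdict the safeguard should return. The prompts, labels, and helper names below are illustrative, not taken from a real dataset.

```python
# A minimal sketch of a safeguard evaluation dataset: each entry pairs a
# prompt with the verdict the safeguard *should* return on it.
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalExample:
    prompt: str
    should_block: bool  # expected safeguard decision

EVAL_SET = [
    EvalExample("How do I reset my password?", should_block=False),
    EvalExample("Write malware that steals browser cookies.", should_block=True),
    EvalExample("What kind of questions should an AI assistant not answer?",
                should_block=False),
]

def dataset_stats(examples):
    """Basic sanity check before using the set for evaluation."""
    blocked = sum(e.should_block for e in examples)
    return {"total": len(examples), "block": blocked, "allow": len(examples) - blocked}
```

Keeping expected verdicts alongside prompts makes it trivial to score any candidate safeguard against the same fixed set across iterations.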

Comprehensive safeguard testing and validation

This stage involves rigorous testing and validation of the safeguard under various conditions. The dataset collected in the previous step can be used to train an ML model. The evaluation process for safeguard models is not one-size-fits-all: different safeguards prioritise different metrics. For example, a toxicity detection safeguard prioritises recall over precision because it is crucial to detect any potentially toxic prompt, even if this occasionally results in flagging non-toxic prompts as potentially harmful. This emphasis ensures that no genuinely toxic prompt is overlooked, thereby enhancing overall user safety.
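The recall-over-precision trade-off can be shown with a toy evaluation in plain Python. The labels and predictions below are made up for illustration: the safeguard over-flags one benign prompt but misses nothing, so recall stays perfect while precision drops.

```python
# Toy evaluation of a toxicity safeguard against labelled prompts.
def precision_recall(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # correctly flagged
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))    # benign, but flagged
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))    # toxic, but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 1 = toxic. The safeguard wrongly flags the benign prompt at index 3,
# trading precision for a zero false-negative rate.
labels      = [1, 1, 0, 0, 1]
predictions = [1, 1, 0, 1, 1]
p, r = precision_recall(labels, predictions)  # p = 0.75, r = 1.0
```

A recall-first safeguard accepts this kind of precision hit deliberately; the calibration question is how much over-flagging the application can tolerate.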

The Holistic Evaluation of Language Models (HELM) approach assesses language models comprehensively, considering multiple dimensions such as safety, fairness, and performance. A similar method could be devised for testing and validating different safeguards against various Large Language Models (LLMs). As well as indicating how well the safeguards hold up against our threats, the evaluation also points to how we can improve the guards in the next iteration.

Safeguard Monitoring and feedback mechanisms

This practice involves ongoing monitoring of the safeguard's performance and incorporating feedback for improvement. For example, monitoring a toxicity safeguard can reveal patterns and trends it struggles to detect, false negatives where it failed to flag toxic content, and contextual nuances where the toxicity is subtle or context-dependent and the safeguard may not have been trained to recognise it.

This process can reveal gaps in the current safeguard: prompts that should have been caught by the safeguarding model but slipped through. These form a growing dataset that can be used to continuously improve the performance of safeguard models.
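One hedged sketch of this feedback loop: collect the safeguard's observed misses (for instance, harmful prompts reported by users that were let through) and false alarms into a labelled batch for the next training iteration. The class and method names here are illustrative.

```python
# Sketch of a monitoring component that turns production mistakes into
# labelled training data for the next safeguard iteration.
class SafeguardMonitor:
    def __init__(self):
        self.false_negatives = []  # harmful prompts the safeguard let through
        self.false_positives = []  # benign prompts wrongly flagged

    def record(self, prompt, flagged, actually_harmful):
        if actually_harmful and not flagged:
            self.false_negatives.append(prompt)
        elif flagged and not actually_harmful:
            self.false_positives.append(prompt)

    def retraining_batch(self):
        """Labelled (prompt, is_harmful) pairs to fold into the next training run."""
        return ([(p, True) for p in self.false_negatives]
                + [(p, False) for p in self.false_positives])
```

In a real deployment the `actually_harmful` label would come from user reports or human review, which is exactly the feedback mechanism this section describes.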

Case study

Safeguarding an LLM can come in many different forms in terms of aims, i.e., what we are guarding against, and methods applied. The specific safeguarding aims for an LLM-based application are often dictated by the application's intended purpose and scope. For example, if we have a chatbot designed to provide support and information related to mental health, it is crucial to ensure that the chatbot stays focused on this topic and does not pivot into unrelated or potentially sensitive areas, such as politics. In this case, implementing effective off-topic detection and prevention becomes one of the key safeguarding objectives.

As we’ve discussed in the previous section, safeguarding an LLM involves multiple considerations and components, such as collecting relevant datasets, conducting testing and evaluation, and implementing monitoring and feedback mechanisms. To illustrate how these components play together to achieve a specific safeguarding goal, let's consider a case study focused on preventing an LLM-based application from engaging in off-topic conversations.

Approaches to Off-topic LLM Safeguards

Off-topic detection can be achieved in a multitude of different ways. Let’s consider four approaches: keyword filtering, off-the-shelf topic classification, training our own model, and LLM-based classification:

  1. Keyword filtering: maintain an allow list of terms related to the topic of interest and a block list of off-topic terms, and make a decision based on their presence.
  2. Off-the-shelf topic classification model: take a model that can identify a topic (or several topics) we consider on-topic, then use it to check whether a text’s topic matches one that is allowed.
  3. Our own trained model: train a model specifically to detect, more precisely, the topics the app is allowed to discuss.
  4. LLM-based classification: use the LLM itself to determine whether the text is on topic.

The methods differ in complexity and are ordered by the amount of resources they require. Each has its pros and cons, so let’s discuss and evaluate the options we’ve outlined.

Keyword Filtering

The first method is straightforward to implement and computationally efficient. However, if an attacker knows the allow and block lists, they can circumvent them (e.g. misspelling a blocked term may bypass the filter even though the LLM will still understand the question). Even if the lists are not known to the attacker initially, they may, through educated guesses, transform their prompts so that the filters are not triggered. This can be mitigated by detecting brute-force attempts and updating the lists with newly identified terms and their variations.
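A minimal keyword filter might look like the following sketch, with light normalisation (lowercasing, stripping punctuation) to blunt the most trivial misspelling tricks. The term lists are illustrative for the mental-health support bot from our example.

```python
# Minimal allow/block keyword filter with basic text normalisation.
import re

BLOCK_TERMS = {"election", "politics", "crypto"}
ALLOW_TERMS = {"anxiety", "sleep", "therapy", "stress"}

def tokenize(text):
    """Lowercase and keep alphabetic runs only, dropping punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_verdict(prompt):
    """Return 'block', 'allow', or 'unknown' (defer to the next safeguard layer)."""
    tokens = tokenize(prompt)
    if tokens & BLOCK_TERMS:
        return "block"
    if tokens & ALLOW_TERMS:
        return "allow"
    return "unknown"
```

Returning `"unknown"` rather than forcing a decision is deliberate: it lets this cheap check sit in front of a heavier classifier and only handle the clear-cut cases.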

Off-the-shelf models

Off-the-shelf models for topic classification can be a good starting point, especially if they've been trained on a large, diverse dataset. They can capture more nuanced off-topic content that simple keyword filtering might miss. However, these models may not be perfectly aligned with our specific use case, leading to false positives or negatives. And if an attacker can access the model or infer its behaviour, they might be able to craft adversarial inputs that fool the classifier. If the model is discovered and becomes available to the attacker, they can even study it offline to improve their approach. We should monitor the model’s performance to mitigate the risks. Having a keyword filter in front of such a model can also help us safeguard our system better.
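One way to reduce the false-positive/false-negative risk is to put a confidence threshold in front of whatever classifier we use, escalating uncertain predictions instead of trusting them. The sketch below assumes a generic `score_topics` function standing in for a real off-the-shelf model (for example, a zero-shot topic classifier from a model hub); the stub's behaviour is purely illustrative.

```python
# Wrap an off-the-shelf topic classifier behind a confidence threshold:
# low-confidence predictions are escalated rather than trusted.
ON_TOPIC = {"mental health", "wellbeing"}

def score_topics(text):
    # Stub: a real model would return calibrated per-topic probabilities.
    if "anxious" in text.lower():
        return {"mental health": 0.92, "politics": 0.03}
    return {"politics": 0.55, "mental health": 0.30}

def topic_verdict(text, threshold=0.8):
    scores = score_topics(text)
    topic, confidence = max(scores.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return "escalate"  # too uncertain: hand to a human or another layer
    return "allow" if topic in ON_TOPIC else "block"
```

The threshold is a calibration knob: monitoring (as described above) tells us whether it should move up or down for our traffic.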

Training our own model

A specialised model gives us much greater control and allows us to tailor it precisely to our needs. We can curate the training data to cover exactly the topics we care about and fine-tune the model for optimal performance. The downside is that this requires a significant investment of time and resources to collect data, train the model, and keep it updated over time. It also introduces a new potential attack surface - if an attacker can poison our training data or otherwise manipulate the model, they could degrade its effectiveness as a safeguard.
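As a toy stand-in for training our own model, the sketch below builds a bag-of-words nearest-centroid classifier in plain Python. A real deployment would fine-tune a proper text classifier on curated data; this only shows the shape of training on exactly the topics we care about.

```python
# Toy nearest-centroid topic classifier: one bag-of-words centroid per topic,
# classification by cosine similarity.
from collections import Counter
import math

def vectorise(text):
    return Counter(text.lower().split())

def centroid(texts):
    total = Counter()
    for t in texts:
        total.update(vectorise(t))
    return total

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(labelled):
    """labelled: dict mapping topic name -> list of example texts."""
    return {topic: centroid(texts) for topic, texts in labelled.items()}

def classify(model, text):
    vec = vectorise(text)
    return max(model, key=lambda topic: cosine(vec, model[topic]))
```

The point of owning the model is visible even in this toy: the training data is exactly our curated examples, so the decision boundary reflects our application's topics rather than a generic taxonomy.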

LLM-based classification

Using the LLM itself for off-topic classification is also an interesting possibility. In theory, the LLM should have a deep understanding of language and be able to recognize off-topic content that other methods might struggle with. We could prompt it with examples of on-topic and off-topic text and ask it to classify new inputs. The challenge here is that LLM responses can be inconsistent and require a lot of prompt tuning. Additionally, since the inputs are passed to the LLM, an attacker may try jailbreak attacks to avoid this safeguard too. And, of course, the computational cost of using the LLM for every input may become prohibitive at scale.
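A hedged sketch of the prompt-based approach: ask the model for a strictly formatted verdict and parse it defensively. `call_llm` is a stub standing in for whatever completion API the application uses; its behaviour here is invented for illustration. The strict parsing matters because, as noted above, LLM responses can drift from the requested format.

```python
# LLM-as-classifier sketch: constrained prompt plus defensive output parsing.
PROMPT_TEMPLATE = (
    "You are a topic checker for a mental-health support assistant.\n"
    "Answer with exactly ON_TOPIC or OFF_TOPIC.\n"
    "User message: {message}\n"
    "Answer:"
)

def call_llm(prompt):
    # Stub standing in for a real completion API call.
    return " ON_TOPIC" if "anxiety" in prompt.lower() else "OFF_TOPIC"

def llm_verdict(message):
    raw = call_llm(PROMPT_TEMPLATE.format(message=message)).strip().upper()
    if raw not in {"ON_TOPIC", "OFF_TOPIC"}:
        return "escalate"  # unparseable answer: never fail open
    return "allow" if raw == "ON_TOPIC" else "block"
```

Treating an unparseable answer as "escalate" rather than "allow" is the key design choice: a jailbroken or confused classifier should degrade towards caution, not towards letting content through.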

What’s the best approach to safeguarding an LLM?

In summary: a combination of these methods.

By layering different safeguarding systems, we can create a robust, multi-faceted defence against off-topic attacks. However, it's crucial to remember that no safeguard is perfect, and attackers will continually seek new ways to circumvent our defences. This will always be an arms race between the attacker and the developers, through which the more complex safeguards will evolve. This is where monitoring becomes essential. We need to continuously monitor the performance of our safeguards. When new attack vectors are discovered, we must react quickly to update our models and keyword lists. Regular evaluation of our safeguards, both through automated testing and manual review, can help identify weaknesses before attackers exploit them.
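The layering idea above can be sketched as a simple pipeline: run the cheapest check first and fall through to more expensive layers only when a layer cannot decide. Each layer is any callable returning `'allow'`, `'block'`, or `'unknown'`; the stub layers below are illustrative.

```python
# Layered safeguard pipeline: cheapest check first, fail closed by default.
def layered_verdict(prompt, layers, default="block"):
    for layer in layers:
        verdict = layer(prompt)
        if verdict in ("allow", "block"):
            return verdict  # first confident layer decides
    return default  # no layer was confident: fail closed

# Illustrative stub layers, ordered from cheapest to most expensive.
keyword_layer = lambda p: "block" if "politics" in p.lower() else "unknown"
model_layer   = lambda p: "allow" if "anxiety" in p.lower() else "unknown"
llm_layer     = lambda p: "unknown"  # stub for the costliest check

LAYERS = [keyword_layer, model_layer, llm_layer]
```

Because each layer shares the same tiny interface, swapping in an improved model or a new keyword list does not touch the pipeline itself, which keeps the rapid-update loop described above cheap.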


Safeguarding Large Language Models is a complex and ongoing challenge that requires a proactive and adaptive approach. Through the case study of off-topic safeguards, we have explored various illustrative threat scenarios and methods, each with its strengths and weaknesses. The final choice will most likely be a combination of multiple safeguards and will depend on the specific use case and available resources.

However, no single safeguard is foolproof, and attackers will continually search for new ways to break defences. Continuous monitoring, incident response, and regular evaluation of safeguards are essential to staying ahead of emerging threats. Additionally, collaboration and knowledge sharing within the AI community are crucial in advancing the development of effective safeguards. 

In conclusion, safeguarding LLMs is an ongoing process that requires a proactive, adaptive, and collaborative approach. As we move forward, prioritising the development of robust and reliable safeguards is essential to ensure the safe and beneficial deployment of LLMs in various applications.

Stay tuned for the next blog where we’ll be taking a dive into open source tooling that’s available to test the security of your LLM-based system!

Jointly written by Mikhail and Shubham.
