Lab Notes • 10 minutes • Oct 05, 2023

Guardrails for Large Language Models

Misha Iakovlev
MLOps Engineer

This blog is part of our MindGPT series. You can find out more about MindGPT here.

We are continuing our journey building MindGPT, a specialised large language model that can answer questions and summarise mental health information using the knowledge from two sources: the Mind and NHS websites. Our goal is to provide an accessible and transparent view of our current progress for anyone interested in following along. The GitHub repository containing all of our code is available to view at any time.

In this post, we will talk about AI safety, specifically LLM Guardrails, as well as applying these concepts to MindGPT itself.

What are Guardrails for LLMs and why do you need them?

As the name suggests, guardrails are configured boundaries within which an LLM is allowed to operate, ensuring that our application behaves as we expect it to. These boundaries boil down to two aims:

  • We want to ensure that a system behaves in a certain way. For example, it should follow a specific answer structure, include a specific set of details, or ask follow-up questions when necessary.
  • We want to ensure that a system does NOT behave in certain ways. For example, it should not include harmful information, answer off-topic questions, or ask irrelevant questions unprompted.

Essentially, you can think of guardrails as being put in place so that an LLM-based system behaves more predictably and reliably.

Applying guardrails on top of LLMs is a broad topic and there are still many open questions, so we can't give a general answer in a single blog post. However, we can draw a picture of what it may look like. In a nutshell:

  1. We take the user’s input or the model’s response (depending on the aim).
  2. We check whether it conforms to our expectations.
  3. If it’s not what we expect, we either try to fix it automatically (sometimes that’s possible), or tell the user what went wrong and how it can be fixed manually (if there’s such an option).
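
To make that loop concrete, here is a minimal sketch in Python. The <code>check()</code>, <code>fix()</code> and <code>llm()</code> helpers are hypothetical placeholders rather than functions from any particular library:

<pre><code>def guarded_response(user_input: str) -> str:
    # 1. Take the user's input (the same idea applies to the model's response).
    # 2. Check whether it conforms to our expectations.
    problem = check(user_input)  # hypothetical validator; returns None if all is well
    if problem is None:
        return llm(user_input)

    # 3. Try to fix the input automatically where possible...
    fixed = fix(user_input, problem)  # hypothetical auto-repair step
    if fixed is not None:
        return llm(fixed)

    # ...otherwise tell the user what went wrong and how to fix it manually.
    return f"Sorry, your request couldn't be processed: {problem}"</code></pre>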

These kinds of guardrails can be implemented by developers themselves, and there are currently at least two libraries that can help with that:

  1. NeMo Guardrails
  2. Guardrails.ai

NeMo Guardrails is an open-source toolkit created and maintained by NVIDIA. It uses text embeddings to match real user inputs and model outputs against the sample conversation flows defined in its configuration, and to determine what it should do with them. By its nature, NeMo requires that all (or most) of the logic for the LLM-based application is described in its configuration.
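
As a rough sketch (not code from MindGPT itself), wiring NeMo up typically means loading a directory of Colang and YAML configuration and letting the resulting rails object sit in front of the LLM. Something along these lines, where the <code>./nemo_config</code> directory is assumed to exist and contain the flow definitions:

<pre><code># A rough sketch of setting up NeMo Guardrails; the "./nemo_config" directory
# (containing the Colang flows and a YAML model config) is assumed to exist.
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./nemo_config")
rails = LLMRails(config)

# The rails object sits in front of the LLM: user messages are matched against
# the conversation flows defined in the configuration before a reply is produced.
response = rails.generate(messages=[
    {"role": "user", "content": "What are symptoms of depression?"}
])
print(response["content"])</code></pre>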

On the other hand, there’s Guardrails.ai (also open source) by Shreya Rajpal and contributors, which works rather differently. It uses an LLM to match a user's input or a model's output to a schema defined by the developers. This way we can check what information is contained within, and decide what to do with it.

Guardrails and MindGPT

MindGPT is a good example of where there’s a need for guardrails to be in place. It’s a system where users can ask questions about sensitive topics and we want to prevent the underlying model from responding with bad or harmful advice. One way that this undesirable advice might be generated is through the user accidentally (or purposely, but that’s a different blog altogether) asking questions that are off-topic.

The MindGPT conversational system has been engineered with mental health questions in mind, and it can retrieve relevant pieces of information from its knowledge base to answer them. However, when provided with an off-topic question, the fetched context will be irrelevant (even though it may be the most ‘similar’, see our embedding blog for more detail), hence the answer will not be meaningful at best, and at worst, contain harmful information.
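
To illustrate why, here is a toy version of that retrieval step. The <code>embed()</code> function is a stand-in for whatever embedding model is used; the point is simply that the nearest neighbour is returned even when every similarity score is low:

<pre><code>import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_context(question: str, documents: list[str]) -> str:
    # embed() stands in for the embedding model and is not defined here.
    question_vec = embed(question)
    scores = [cosine_similarity(question_vec, embed(doc)) for doc in documents]
    # The highest-scoring document is returned even if every score is poor,
    # which is exactly what happens with an off-topic question.
    return documents[int(np.argmax(scores))]</code></pre>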

Let’s explore how the model underpinning MindGPT responds to a set of off-topic questions. I’ve selected three questions across a couple of topics to try out:

  • What is your favourite football team: Manchester United or Manchester City?
  • What’s your favourite season?
  • How to make pasta?

Ideally, we want MindGPT to reject answering these questions. However, here are the results:

Our model clearly showing a preference for a particular football team!
And a preference for the warmer months!
It's also got you covered if you fancy eating some pasta, apparently!

It's clear from the output above that the LLM we are using knows pasta recipes -- useful in another setting, perhaps. It also shows a preference for certain things, for example, "summer" and "Manchester United". While these might be funny, they demonstrate what happens when the retrieved context is completely irrelevant: we have nothing about the weather or football in our knowledge base.

Let's see how we can prevent this by adding guardrails using Guardrails.ai.

Adding guardrails to an LLM with Guardrails.ai

Currently, the MindGPT system passes user questions directly to the underlying LLM without any checks. What we want to do is put an off-topic check between the system call to the LLM and the LLM itself, to avoid what we've seen previously. If this check were to fail, then ideally, the output should be something constant, such as "MindGPT is a chatbot that provides information about mental health. It cannot talk about [topic]".
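
In other words, the check sits in front of the model call. Roughly, and with a hypothetical <code>check_topic()</code> helper and an <code>answer_with_llm()</code> function standing in for the existing MindGPT pipeline, the logic looks like this:

<pre><code>REFUSAL_TEMPLATE = (
    "MindGPT is a chatbot that provides information about mental health. "
    "It cannot talk about {topic}."
)

def respond(question: str) -> str:
    # check_topic() is a stand-in for the guardrail call described below; it
    # returns a dict such as {"topic": "Cooking", "is_mental_health": False}.
    result = check_topic(question)
    if not result["is_mental_health"]:
        # Off-topic: return the constant refusal rather than calling the LLM.
        return REFUSAL_TEMPLATE.format(topic=result["topic"])
    # On-topic: proceed with the normal retrieval + LLM call.
    return answer_with_llm(question)</code></pre>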

Guardrails.ai enables us to do this, and it requires two components to run: an output schema (which the LLM's response must conform to) and a prompt. As we just want to determine the topic of the user's question, the prompt is quite straightforward:

<pre><code>guardrails_prompt = """
Given the user’s question, determine its topic, and decide whether it's on topic of mental health or not.

User asked: ${user_question}

${gr.json_suffix_without_examples}
"""</code></pre>

<code>gr.json_suffix_without_examples</code> is a special prompt fragment that Guardrails.ai appends to tell the LLM being queried to return JSON that conforms to the schema. The schema itself is also fairly simple and takes the form of a Python class:

<pre><code>from pydantic import BaseModel, Field
import guardrails as gd

class OffTopicModel(BaseModel):
    # Fields the LLM is asked to fill in when classifying the user's question.
    topic: str = Field(description="Topic of the question")
    is_mental_health: bool = Field(
        description="Is the question related to mental health? Set False if and only if "
        "the question is related to something other than mental health, such as sports, politics or weather."
    )

guard = gd.Guard.from_pydantic(output_class=OffTopicModel, prompt=guardrails_prompt)</code></pre>

Within the schema we provide the types and descriptions, but for more complex tasks there is a variety of additional functionality that can be added, such as validators and actions to take when guards fail.
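
The <code>check_guard()</code> helper used below isn't shown in full here, but one possible shape for it, assuming an OpenAI-style completion function is passed to the guard (the exact call signature depends on the Guardrails.ai version), is:

<pre><code>import openai

def check_guard(question: str) -> dict:
    # The guard injects the schema instructions into the prompt, calls the LLM,
    # and validates that the returned JSON matches the OffTopicModel schema.
    _, validated_output = guard(
        openai.ChatCompletion.create,  # any OpenAI-style callable should work here
        prompt_params={"user_question": question},
        model="gpt-3.5-turbo",
        max_tokens=256,
        temperature=0.0,
    )
    return validated_output</code></pre>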

We're now at a point where we can check if the example questions are caught by the guardrails:

<pre><code>>>> check_guard("How to make pasta?")
{'topic': 'Cooking', 'is_mental_health': False}
>>> check_guard("What is your favourite season?")
{'topic': 'weather', 'is_mental_health': False}
>>> check_guard("What is your favourite football team: Manchester United or Manchester City?")
{'topic': 'Sports', 'is_mental_health': False}
>>> check_guard("What are symptoms of depression?")
{'topic': 'Symptoms of Depression', 'is_mental_health': True}</code></pre>

We can see that Guardrails.ai successfully determines the topic of each question and detects that three of the four examples are not about mental health. So, with the <code>check_guard()</code> functionality integrated into the system, we can see what the results now look like:

Once the guardrails are in place, we can see that the model now doesn't have a preference

Now the model is prevented from answering questions that are off-topic, producing instead a constant output when the topic detected isn’t mental health.

Guardrails play a vital role in ensuring that LLM-based applications behave predictably and safely. They help guide desired behaviour and prevent undesirable actions, ultimately enhancing the user experience and safety of AI-powered systems. Guardrails.ai, with its schema-based approach, is a promising tool for achieving these goals, as demonstrated in the case of MindGPT. As AI continues to evolve, the implementation of effective guardrails will remain a critical aspect of responsible AI development. We’ve only scratched the surface here, and in future posts we plan to explore the topic of LLM guardrails in more depth.

What's next?

We’ve shown here how you can apply guardrails on top of an LLM to prevent it from answering questions that are off-topic. In our next blog, we’ll delve into the two tools mentioned here, NeMo and Guardrails.ai, in a bit more detail, comparing how they work internally and when you might favour one over the other.
