How Someone Can Steal Your Large Language Model

This blog is part of our “Cybersecurity for Large Language Models” series. You can find out more here.

On the savannah, being the biggest and the most powerful can be a great tactic for warding off predators and rivals. Adult African elephants face little danger from lions or hyenas, but their prominence and their majestic tusks make them an attractive target for poachers. In the same way, if you launch a popular LLM with useful capabilities, you can expect to attract bandits hoping to take that capacity for themselves.

In this blog, we will look at how model extraction attacks can be designed to allow attackers to do just that, and what measures we can put in place to prevent it.

While this blog can be read independently, it’s part of a series covering Cybersecurity for Large Language Models. If you’re interested in learning more, then do check out this blog post which provides an overview of the whole series.

What is a Model Extraction Attack?

In a model extraction attack, an adversary uses their ability to interact with your model to construct training data that allows them to replicate its behaviour in a model of their own.

This requires surprisingly little information: the attacker doesn’t need to know the architecture of your model, its size or any of its hyperparameters. All that is needed is the ability to query the model.

Once the attacker has this copycat model, they can harm you in one of two ways.

The first is to serve this model in direct competition with your own. Having invested less time and money into the development process, they may find themselves with a competitive edge.

Secondly, they may use their local version of the model to start constructing more sophisticated attacks, probing it for potential vulnerabilities and exploits, and knowing that such attacks will likely transfer to your model. In this second case, the fact that the attacker is developing these exploits offline means that they are impossible to detect until they are ready to launch against the target model.

How do model extraction attacks work on Large Language Models?

Model extraction attacks apply to all kinds of machine learning models, but the scale and versatility of LLMs make them particularly attractive targets. To understand how these attacks work, we are going to look at a specific example developed by Birch et al. from Lancaster University and Mindgard, which they refer to as a “Model Leeching” attack.

In this attack, an adversary wishes to construct their own model which is capable of reading texts and extracting information from that context to answer user questions. To do this, they send thousands of queries to a target LLM, chosen for its ability to extract information from text in this way, in order to get example outputs for given question/context pairs.

Each query to the LLM follows a standard prompt template, which gives instructions to the target model about what its output is expected to look like. The prompt template given by Birch et al. is as follows:

[Figure: An example prompt template for a model extraction attack]

Here SQuAD refers to the particular dataset used in the paper, with each question being paired with a context which may or may not contain the information required to answer the question.

To give a concrete example of how this works, let’s take a specific question and context and feed it into ChatGPT.

[Figure: A demonstration of a model extraction attack on GPT-3.5]

Here, I’ve taken an example of a question and context from the SQuAD dataset to show that GPT-3.5 is very capable of following the instructions and extracting information from the text. To ensure that the answer is indeed taken from the text rather than from ChatGPT’s existing knowledge, I’ve altered the text slightly (in a way that some Whovians may consider sacrilege).

In a real attack, we would write a script to query the API with thousands of such requests, slowly gaining a bank of knowledge about how text can be extracted from these contexts, structured in an easily processable way.
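To make the shape of such a script concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `PROMPT_TEMPLATE` is a hypothetical stand-in and not the template from Birch et al., and `query_model` is a placeholder for whatever client call reaches the target model. It only shows the loop of formatting a prompt, sending it, and keeping the replies that parse as the requested JSON.

```python
import json
from typing import Callable, Iterable, Optional

# Hypothetical template for illustration only; NOT the one used by Birch et al.
PROMPT_TEMPLATE = (
    "Answer the question using only the context. "
    'Reply with JSON of the form {{"answer": "<text>"}}. '
    "If the context does not contain the answer, use an empty string.\n"
    "Context: {context}\n"
    "Question: {question}"
)


def build_prompt(question: str, context: str) -> str:
    """Fill the template with one question/context pair."""
    return PROMPT_TEMPLATE.format(context=context, question=question)


def parse_answer(raw: str) -> Optional[str]:
    """Return the answer string, or None if the reply is not the JSON we asked for."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    answer = payload.get("answer")
    return answer if isinstance(answer, str) else None


def collect_labels(
    pairs: Iterable[tuple[str, str]],
    query_model: Callable[[str], str],  # placeholder for the real API call
) -> list[dict]:
    """Query the target once per pair and keep only well-formed replies."""
    records = []
    for question, context in pairs:
        answer = parse_answer(query_model(build_prompt(question, context)))
        if answer is not None:
            records.append(
                {"question": question, "context": context, "answer": answer}
            )
    return records
```

Asking for JSON is what makes the responses "easily processable": malformed replies can be dropped automatically, without a human reading them.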

From this bank of examples, the plan would be to train a model that copies the capabilities of the target model. The construction of this model can be an outcome in its own right, but it can also be used to construct adversarial examples against the target model, on the assumption that attacks that work on the copycat model will likely also succeed on the original target.
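As a rough illustration of the step between collecting responses and training, the sketch below turns one collected record into an extractive question-answering example with a character offset, in the SQuAD-style `answers` layout that many question-answering training pipelines expect. The record fields (`question`, `context`, `answer`) are assumed names, and this is not a description of how the paper prepared its data.

```python
def to_training_example(record: dict) -> dict:
    """Map one collected record to a SQuAD-style example.

    Assumes `record` has string fields "question", "context" and "answer"
    (hypothetical names chosen for this sketch).
    """
    answer, context = record["answer"], record["context"]
    # An empty answer means the target judged the question unanswerable.
    start = context.find(answer) if answer else -1
    if start == -1:
        # Unanswerable, or the target paraphrased instead of quoting the text:
        # store the example without an answer span.
        answers = {"text": [], "answer_start": []}
    else:
        answers = {"text": [answer], "answer_start": [start]}
    return {
        "question": record["question"],
        "context": context,
        "answers": answers,
    }
```

Examples where the answer cannot be located in the context are kept as "no answer" cases here; an attacker could equally choose to discard them.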

In this case, we are extracting ChatGPT’s reading comprehension ability, but one could imagine other attacks in a similar vein: we could systematically extract knowledge from a specialised in-context model, or gather data to reproduce the behaviour of a model fine-tuned to produce text in a particular style or topic area.

Preventing and detecting model extraction attacks

So, what can we do about model extraction attacks?

Two aspects of this attack give it a distinctive fingerprint: the format of the queries and the volume of queries required. Both suggest strategies for mitigation. Starting with volume, one option is simply to rate limit users so that no single user can gather enough information from the model to construct a viable copycat.
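A per-key rate limit can be sketched in a few lines. The class below is a minimal in-memory sliding-window limiter; the class name, its parameters, and the idea of keying on an API key are assumptions for illustration, and a production deployment would normally rely on a shared store or an API gateway feature instead.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most `max_requests` per `window_seconds` for each API key."""

    def __init__(
        self,
        max_requests: int,
        window_seconds: float,
        clock: Callable[[], float] = time.monotonic,  # injectable for testing
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._history: dict[str, deque] = defaultdict(deque)

    def allow(self, api_key: str) -> bool:
        """Record the request and return True if it is within the limit."""
        now = self.clock()
        timestamps = self._history[api_key]
        # Forget requests that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True
```

For example, `SlidingWindowLimiter(max_requests=1000, window_seconds=86400)` would cap each key at a thousand requests a day. The right threshold depends on how many examples a useful copycat would need, which the defender can only estimate.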

However, rate limits and account limits can be circumvented by creating multiple accounts, API keys or IP addresses. For this reason, we may also wish to look at ways of filtering out these kinds of queries. Depending on how discerning we want to be about what to filter, we may only block queries that match the given template, or block requests for JSON-formatted output, or use some other custom rule about what kinds of queries raise red flags.
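A rule-based filter of this kind might look like the sketch below. Both regular expressions are illustrative assumptions: the first matches a generic "Context: … Question: …" layout rather than the exact template from the paper, and the second looks for an instruction to reply in JSON.

```python
import re

# Illustrative patterns only; a real deployment would tune these to its traffic.
KNOWN_TEMPLATE = re.compile(
    r"context:\s*.+?question:\s*.+", re.IGNORECASE | re.DOTALL
)
JSON_REQUEST = re.compile(
    r"\b(reply|respond|answer|output)\b[^.\n]*\bjson\b", re.IGNORECASE
)


def red_flags(prompt: str) -> list[str]:
    """Return the names of the heuristic rules that this prompt triggers."""
    flags = []
    if KNOWN_TEMPLATE.search(prompt):
        flags.append("matches_known_template")
    if JSON_REQUEST.search(prompt):
        flags.append("requests_json_output")
    return flags
```

Rules like these will also catch legitimate users, since asking for JSON output is common in ordinary applications. It may therefore be safer to treat a flag as a signal to log and review than as grounds for blocking outright.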

However, here we are entering a game of cat-and-mouse. While it is easy to see how we might prevent an attack which falls into this specific template, a determined attacker will adapt their queries to become harder to detect and to not fall into known templates. For this reason, it is important to ensure that you have logging and monitoring capabilities to detect changes in the quality and quantity of traffic to your deployed model, and ways of investigating what might be the cause of those changes.
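As a sketch of what such monitoring could check, the two functions below look at quantity and at one crude proxy for quality. `volume_alert` compares today's request count for a key against its recent history, and `dominant_prefix_share` measures how much of a batch of prompts starts with the same text, which is what templated traffic tends to look like. The function names, the three-standard-deviation threshold and the 40-character prefix are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def volume_alert(history: list[int], today: int, threshold: float = 3.0) -> bool:
    """True if today's count is more than `threshold` standard deviations
    above the mean of the historical daily counts."""
    if len(history) < 2:
        return False  # not enough data to define a baseline
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        return today > baseline
    return (today - baseline) / spread > threshold


def dominant_prefix_share(prompts: list[str], prefix_len: int = 40) -> float:
    """Fraction of prompts that share the single most common opening text."""
    if not prompts:
        return 0.0
    counts = Counter(p[:prefix_len] for p in prompts)
    return counts.most_common(1)[0][1] / len(prompts)
```

Neither signal proves an attack on its own: a product launch raises volume, and an ordinary application with a fixed system prompt produces a high prefix share. They are starting points for investigation rather than verdicts.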

Conclusion

Given the cost of training LLMs and the sheer amount of knowledge contained in a trained model, they are attractive targets for model extraction attacks. As these models grow in complexity, we should expect to see more attempts to extract knowledge from them in an adversarial way. As such, it is important when deploying models to have a sufficient understanding of the risks, and to have tools in place to monitor for and mitigate such attacks.

In the next blog, we’ll stay within the realms of attacks on LLMs, focusing on jailbreak attacks.