This post is part of our MindGPT blog. You can find out more about MindGPT here.
This post is one of a series following the development of MindGPT, a chatbot service sitting atop a specialised LLM for summarising mental health information from two of the UK’s leading resources: the NHS and Mind websites. Today we’ll cover the motivation behind monitoring ML models in general and how we’ve done it for MindGPT. If you’ve been following along, you’ll know that we’re big believers in open-source here at Fuzzy Labs, so anyone interested can get their hands on the source code here and have a play around. So with that in mind, let’s press on.
In the same way as you periodically wander around at home to check on your house plants and water anything that seems to be wilting, machine learning models in production need monitoring. Monitoring, in the context of machine learning models, is (in a nutshell) ensuring that the data being fed to the model and the outputs the model generates remain reliable and of high quality over time, as the model is exposed to new data. Since LLMs are large and complex, monitoring them can be challenging.
When monitoring an LLM, one of the first decisions we have to make is which metrics we want to track. In the coming section we’ll cover some of the choices we’ve made and the motivation behind them. It is worth noting, however, that given the rapid advance of the field, there’s really no one-size-fits-all approach to monitoring LLMs, much less ML models in general.
Integrating monitoring into MindGPT
For this initial pass, we’ve elected to monitor two things: response quality and embedding drift. Maybe it’s clear why we want to monitor those things, but in case it’s not, let’s run through them.
Since users will engage in conversation with MindGPT, it’s important that the responses make sense. The answers need to be not just technically accurate, but also clear and easy to understand. After all, MindGPT is not designed to give advice regarding the treatment of mental health problems, but to help people better understand mental health issues by providing an entry point to a dense dataset. So, avoiding complicated medical jargon is important – if you already know all of that stuff, MindGPT probably isn’t going to help you. Clearly, then, some measure of the readability of answers provided by MindGPT would be a really useful thing to monitor, and will help us keep MindGPT accessible to all. The method we use to calculate the readability of a response is the Flesch–Kincaid readability test. Its score is easy to interpret: a score above 60 is considered plain English, and can be easily understood by 13- to 15-year-old native English speakers.
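To make that concrete, here’s a minimal sketch of the Flesch reading-ease calculation. This is an illustrative, self-contained implementation rather than our production code: the syllable counter is a rough heuristic, and in practice you’d likely reach for an established library instead.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels, treating a
    # trailing silent 'e' as non-syllabic. Real libraries do better.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith(("le", "ee")):
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    # Flesch reading ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    # Scores above 60 are generally considered plain English.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Short, common words score high; long, jargon-heavy sentences pull the score down, which is exactly the signal we want for spotting responses that drift into medical jargon.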
We're reading data from relatively static sources: information pages on mental health websites. There's still potential for this information to change, for example when new details are added about existing mental health conditions. A change in the data could break assumptions made by the model, so it's important that we detect these changes. By monitoring embedding drift, we keep an eye on how the model's understanding of the data shifts. A significant change in embeddings is likely a signal that the new data is substantially different from the old, and that we might be due for a re-tuning, a re-training, or even a whole new model. There are many ways to measure embedding drift; to start with, we have decided to go with Euclidean distance. It’s a familiar method, widely used in machine learning, that is easy to implement and can be used to plot embedding drift over time.
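As a sketch of the idea (hypothetical function names, plain Python rather than our actual implementation), comparing a new set of embeddings against a trusted reference might look like this:

```python
import math

def embedding_drift(reference, current):
    """Euclidean distance between the centroids of two embedding sets.

    reference, current: lists of equal-length embedding vectors.
    A distance of 0 means the average embedding hasn't moved at all.
    """
    dims = len(reference[0])
    # Average each set of vectors into a single centroid...
    ref_centroid = [sum(vec[i] for vec in reference) / len(reference) for i in range(dims)]
    cur_centroid = [sum(vec[i] for vec in current) / len(current) for i in range(dims)]
    # ...then measure how far the centroid has moved.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ref_centroid, cur_centroid)))
```

Logging this one number each time the dataset is re-scraped is enough to plot drift over time and to alert on sudden jumps.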
We now know what we want to monitor and why, so it’s finally time to get into the design and architecture behind the decisions we’ve made. Below you can see a bird’s-eye view of our monitoring mechanism.
Our monitoring implementation is modular. This means we can easily modify the service to monitor a new metric in the future, and that we can deploy each component to Kubernetes as a separate service. An advantage of deploying each component as a separate service is that we avoid taking down the metric database each time we make a change to one of the components, as it lives in a different Kubernetes pod.
As you can see in the architecture diagram above, each time the user poses a question to MindGPT, the question and the response make a quick detour to the monitoring service. Since the questions and replies are potentially sensitive in nature, it’s important that we emphasise here that none of the questions or responses from the model are stored anywhere. Instead, everything is processed in-memory, and as soon as we’re out of scope of the metric service, the question and the response are lost forever – but we have a numerical record of how well the model performed for the request.
The metric service is a Flask server with four routes. The first route computes readability using the Flesch–Kincaid readability test. That is, each query to MindGPT results in a response that is generated and returned to the user, and also passed to our metric service, where it is assessed for readability; the resulting score is then deposited in our Postgres metric database.
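Stripped of the Flask plumbing, the logic of that route amounts to something like the following sketch (with hypothetical names; the in-memory list stands in for the Postgres table). The key point is that only the numeric score is ever written to storage:

```python
metric_store = []  # stand-in for the Postgres readability table

def handle_readability_request(question: str, response: str, score_fn) -> float:
    # The question/response pair only exists within this function's scope:
    # we compute the metric in-memory and persist the number alone.
    score = score_fn(response)
    metric_store.append(score)
    return score
```

Because the conversational text goes out of scope as soon as the handler returns, nothing sensitive survives the request.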
The second route stores embedding drift data. For any maths enthusiasts, our choice of drift measure is Euclidean distance, which is a common default, though it could really be any measure of distance. Every time we scrape a new dataset, we compare the average embeddings of the new data with those of our trusted reference dataset, so any differences are represented by our distance metric.
The last two routes act as gateways to our metric database, giving us seamless access to our monitoring metrics. (Feel free to check out our repository for the metric service implementation to learn more about how monitoring works.)
Let’s see how monitoring works in practice. First, we need to start our metric service; we can do this locally using docker-compose, or run it on a Kubernetes cluster provisioned using matcha. Once the metric service container is running, we should be able to curl it and get a friendly “hello” message.
From now on, every time we ask a question on Streamlit, the response will be automatically sent to our metric service as shown above. The service will first compute a readability score for the response, then store it in our Postgres metric database.
From the image above, we can see that there are three tables in our metric database, along with the data stored in the readability table. Notice that it contains two rows of data, each with a readability score. This is because MindGPT provides two responses: one generated using data from Mind to inform the model, and the other using the NHS dataset. This approach is known as “in-context learning”.
What about embedding drift? As mentioned before, embedding drift is computed whenever the embedding pipeline is run. So let’s do that now.
When the pipeline finishes running, we should expect to see the embedding drift data in our metric database.
As expected, we see two entries in our database, since we have two datasets used for in-context learning. Since our data hasn’t changed, we also see that our Euclidean distance is 0, as the vector embeddings haven’t changed. You can also see here that the actual conversational elements passed by the user and by MindGPT are not stored as part of the database.

So what now? Whether we want to create a live dashboard with Grafana or some other tool, or create an alert which sounds when significant embedding drift has occurred, having these routes to our monitoring database makes life much easier. Our plan is to use the readability score as a way of garnering feedback from the user: for example, if we detect that it's below a threshold, we can ask the user if we may save the response for diagnostic purposes. Similarly, there are data improvements we can make in the backend to improve analysis, such as segmenting the metrics on the data. Finally, and tangentially related to what's discussed in this blog, is the collection of user feedback and ratings, so that we can build a dataset for fine-tuning.
In this blog post we’ve looked at what monitoring is, why and how we do it, and, specifically, what we need to consider for a model like MindGPT. In our next blog, we’ll explore all things related to prompt engineering. Hopefully you found this read enjoyable and learned something useful. Stay tuned for our next blog post!