MindGPT: Data Version Control

This post is part of our “MindGPT: DataOps for Large Language Models” blog. You can find out more here.

The topic today, as the title suggests, is data version control. We’ll take a look at why you might want to version data, which packages are out there to help you do it and how we’ve chosen to do it in MindGPT.

This post is one of a series shadowing the development of MindGPT, so that anyone interested can follow along as we develop a specialized LLM for summarising mental health information from two of the UK’s most popular resources: the NHS and Mind websites. We believe in open-source software here at Fuzzy Labs, and most of the code that we write can be seen on our GitHub. MindGPT is no different. We want users to be able to see everything, and to rely upon and trust our code, so feel free to go and take a look at the repository any time.

We have written about open-source data version control before. If you want a shorter article with less MindGPT flavour, go check it out on our DataOps blog. If you’ve stumbled across this blog post without having seen any of the others in the series, feel free to go back and read them in order, starting with the first in the series. If you just want to read about data version control then stick around – each instalment of the MindGPT blog is a self-contained glance at a topic through an LLM lens.

What is data version control?

Almost all data changes over time. Less than a century ago, medical practitioners thought inducing shock or comas using hypothermia was a reasonable treatment for mental illness. Fast forward to modern medicine, and there are incredibly strict standards of care based on empirical evidence. A drastic comparison to be sure, but it illustrates the point – having up-to-date, well-informed data when operating in a sensitive space, such as mental wellness, is essential.

Data version control lets you attach tags to specific versions of the data used to train an LLM (or any other ML model) and switch between those versions. You might do so to respond to changes in public and specialist sentiment, to fall back to an older version if bad data somehow makes its way to production, to ensure reproducibility, or for any of a plethora of other reasons.

The first question anyone from a software development background might ask is, “well, git is pretty good for this version control stuff, why can’t I just use that?” The answer is that you usually will still be using git, but only to track files of the size git is designed for. Your data version control package links those small files to your much larger data files. Git is designed for versioning smaller files than those used to store data in most ML applications. By working this way, you avoid cluttering up your git repository with huge data files without losing the ability to collaborate and share data easily, as long as your co-workers have the same data version control package installed as you.

Data version control for MindGPT

When choosing a tool for data version control for a project such as MindGPT, there are a few things we should consider. We’ll look at each of them in a short subsection.

How easy is the tool to use?

MindGPT is open-source and anyone can contribute, so a popular tool that experienced data scientists can pick up and use with minimal effort would be ideal. For those without much experience, it would help if the tool had a gentle learning curve.

How much data do we have?

We’re scraping two popular, but not enormous, websites. Each time we scrape the sites we generate data on the order of megabytes, so we don’t need to be overly concerned if each new version of the data is stored separately. There are other tools more suited to lake-sized data, so if we were working with a vast quantity of data we should choose a tool which avoids making copies.

How easy is it to insert into an existing workflow?

We’re using git version control for our code, so some compatibility with git would make changing data versions straightforward.

Choosing and implementing data version control for MindGPT

Taking the above into account, we chose DVC (the package, not the concept) as the most suitable candidate.

DVC works by generating a hash any time new data is added. You can configure a remote storage bucket for DVC to interact with, and as long as your collaborators have access to that bucket, they can pull or check out any version of the data they please. The hash is essentially a path within the storage bucket, and it is kept in a `.dvc` file in your data directory. You add the `.dvc` file to git and use git to keep track of it; the file is less than 10 lines long, so it poses no issue to git. You can then use `git tag` to attach labels that mean something to humans, push your newly versioned data to your remote DVC storage bucket, and push the new versions of the `.dvc` files to git. If you ever want to go back to an old version of the data, you simply check out the tag using git, then run `dvc pull`. This is, in a nutshell, how the data versioning works for MindGPT, so let’s go over it in the context of the existing pipelines.
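To make the "hash as a path" idea concrete, here is a minimal sketch of content addressing in plain Python. It is not DVC's implementation: the helper names (`content_hash`, `storage_path`, `make_pointer`), the pointer fields, the example filename and the two-character directory split are all assumptions chosen for illustration, and DVC's real file format and bucket layout may differ between versions.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the MD5 hex digest of the data (an illustrative choice of hash)."""
    return hashlib.md5(data).hexdigest()


def storage_path(digest: str) -> str:
    """Map a digest to a path inside a hypothetical storage bucket."""
    return f"{digest[:2]}/{digest[2:]}"


def make_pointer(filename: str, data: bytes) -> dict:
    """Build the small record that would be committed to git instead of the data."""
    return {"path": filename, "md5": content_hash(data), "size": len(data)}


# Two made-up versions of the same (hypothetical) data file.
raw_v1 = b"text,url\nfirst scrape,https://example.org\n"
raw_v2 = b"text,url\nsecond scrape,https://example.org\n"

pointer_v1 = make_pointer("nhs_data_raw.csv", raw_v1)
pointer_v2 = make_pointer("nhs_data_raw.csv", raw_v2)

# Different contents give different hashes, and so different bucket paths.
print(storage_path(pointer_v1["md5"]))
print(storage_path(pointer_v2["md5"]))
```

Only the small pointer record changes in git when the data changes, which is why the repository stays light however large the data files become.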

Every time the scraping pipeline is run, the data from the two websites is stored in two separate `.csv` files. Those `.csv` files are then added to DVC, and a commit is created in the background. This can be achieved by calling the `version_new_data()` function in the MindGPT `utils` submodule.

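<pre><code>from typing import List

# add_csv_files_to_dvc and add_and_commit_dvc_files_to_git are helper
# functions defined elsewhere in MindGPT; they are not shown here.


def version_new_data(filename_roots: List[str]) -> None:
    """Version new data in the data/ directory.

    Args:
        filename_roots: Roots of the filenames for the data to be versioned.
    """
    csv_files = [f"{filename}.csv" for filename in filename_roots]
    dvc_files = [f"{filename}.csv.dvc" for filename in filename_roots]

    add_csv_files_to_dvc(csv_files)  # track the data files with DVC
    add_and_commit_dvc_files_to_git(dvc_files)  # track the .dvc files with git
</code></pre>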
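The two helpers called above are not shown in this post. As a rough guess at the commands such helpers would need to issue, and not the MindGPT implementation, the sketch below builds the `dvc add`, `git add` and `git commit` invocations for a list of filename roots. The function name `build_versioning_commands`, the example filename roots and the commit message are all hypothetical.

```python
from typing import List


def build_versioning_commands(
    filename_roots: List[str], message: str = "Version new data"
) -> List[List[str]]:
    """Return the argument lists needed to version each CSV with DVC and git."""
    csv_files = [f"{root}.csv" for root in filename_roots]
    dvc_files = [f"{root}.csv.dvc" for root in filename_roots]
    return [
        ["dvc", "add", *csv_files],        # DVC writes one .dvc file per CSV
        ["git", "add", *dvc_files],        # stage only the small .dvc files
        ["git", "commit", "-m", message],  # record the new data version
    ]


# Print the commands rather than running them.
for command in build_versioning_commands(["nhs_data_raw", "mind_data_raw"]):
    print(" ".join(command))
```

Each argument list could be handed to `subprocess.run(command, cwd=..., check=True)` from the directory holding the data, so that a failing step stops the pipeline instead of passing silently.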

You can then tag your new data versions and push them to git and your storage bucket using the functionality below.

<pre><code>import subprocess as sp  # assumed: `sp` is an alias for the subprocess module

# PROJECT_ROOT_DIR is defined elsewhere in MindGPT.


def push_data() -> None:
    """Push the current data to the bucket."""
    sp.run("dvc push", shell=True, cwd=PROJECT_ROOT_DIR)


def push_and_tag_dvc_changes_to_git(tag: str) -> None:
    """Push the current commits to remote, then tag the commit and push the tag.

    Args:
        tag: Used to form the data tag.
    """
    sp.run("git push", shell=True, cwd=PROJECT_ROOT_DIR)
    sp.run(f"git tag {tag}", shell=True, cwd=PROJECT_ROOT_DIR)
    sp.run(f"git push origin {tag}", shell=True, cwd=PROJECT_ROOT_DIR)
</code></pre>
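Because the functions above interpolate `tag` into a shell string, a tag containing spaces or shell metacharacters would not behave as intended. One possible hardening, sketched here as an assumption rather than anything MindGPT does, is to validate the tag and pass arguments as lists so that no shell is involved. The name `build_tag_commands`, the default remote name and the allowed-character pattern are hypothetical.

```python
import re
from typing import List

# Hypothetical policy: start with a letter or digit, then letters, digits,
# dots, underscores or hyphens only.
TAG_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def build_tag_commands(tag: str, remote: str = "origin") -> List[List[str]]:
    """Return the push-and-tag commands as argument lists, rejecting odd tags."""
    if not TAG_PATTERN.fullmatch(tag):
        raise ValueError(f"Unsafe or invalid tag: {tag!r}")
    return [
        ["dvc", "push"],               # upload the data to the storage bucket
        ["git", "push"],               # push the commits holding the .dvc files
        ["git", "tag", tag],           # label the current commit
        ["git", "push", remote, tag],  # publish the tag
    ]


for command in build_tag_commands("raw-v1.0"):
    print(" ".join(command))
```

Passing each list to `subprocess.run(command, check=True)` without `shell=True` means the tag is always treated as a single argument.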

That’s it! The data has been versioned and pushed to a remote storage bucket, and a git tag exists for the latest version of your data. You can repeat these steps as many times as needed during your workflow. For MindGPT, we perform them once after scraping the raw data, then again after the data is cleaned and validated.

This means that improvements to the data preparation pipeline don’t require new data to be scraped, while the newly prepared data is still versioned and stored alongside the corresponding raw data using the DVC package. It also means that if you want to skip the data scraping and/or the data preparation pipelines, you can! Check out one of your data tags, run `dvc pull`, and proceed with previously prepared data. The tagged versions of the data pushed to the data version control storage bucket are our central source of truth: we accept that data is reliable and accurate if it made it that far.
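Restoring an earlier data version can be sketched as two commands run in order. The function below is an illustration, not part of MindGPT: the names `restore_data_version` and `run_command` and the tag `raw-v1.0` are hypothetical, and the runner is injectable so the sequence can be inspected without touching a real repository.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], None]


def run_command(command: Sequence[str]) -> None:
    """Default runner: execute the command and raise if it fails."""
    subprocess.run(list(command), check=True)


def restore_data_version(tag: str, runner: Runner = run_command) -> List[List[str]]:
    """Check out the .dvc files at `tag`, then fetch the matching data."""
    commands = [
        ["git", "checkout", tag],  # move the .dvc files to the tagged version
        ["dvc", "pull"],           # download the data those files refer to
    ]
    for command in commands:
        runner(command)
    return commands


# Dry run: record the commands instead of executing them.
recorded: List[List[str]] = []
restore_data_version("raw-v1.0", runner=recorded.append)
print(recorded)
```

Note that `git checkout <tag>` moves the whole working tree to the tagged commit and leaves git in a detached HEAD state, so you may prefer to restore only the `.dvc` files if you want to keep working on current code.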

When it comes to training and deploying an LLM, cloud resources are your best friend. However, provisioning and setting up your data version control storage bucket, and other necessary resources, can quickly become a headache. To streamline the process, we created matcha. Matcha is a command line tool you can use to provision remote ML resources quickly – in one step you can have a full MLOps stack up and running. Matcha is open-source and still in alpha release, but several of us here use it daily to provision Azure resources.

As a result of the work on MindGPT, we added a feature to matcha that also creates a storage bucket specifically for data version control. Feel free to check it out, or you can swing by our GitHub page. Your feedback would be really helpful, and hopefully matcha can return the favour by streamlining your resource deployment for you.

What’s next?

In this blog post we’ve covered why data version control is important, what we need to consider when choosing a tool, and how and when we’ve versioned our data for MindGPT. In the next one, we’ll discuss all things related to monitoring LLMs.
