Data Version Control
Intro
Data is everything in machine learning. That’s true whether you’re building a model to recommend products or to recognise cat pictures. Managing data is complex and time-consuming. Datasets are often very large, which makes storing and sharing them across a team difficult. Data also evolves over time, with many people potentially contributing changes at any moment, and each change having an effect on model performance.
As you may guess, data version control is all about tracking changes to a dataset. But there’s more to it than that: where these tools shine is in enabling people to share and collaborate on data, in much the same way as tools like Git enable code collaboration.
A good data version control solution provides:
- A central location where data is stored.
- A history of changes to datasets, so that any historical version can be reproduced whenever required.
- The ability to easily share datasets with others and enable collaboration.
- An intuitive workflow that supports the data scientist in their work.
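The core mechanism behind the first two points is usually content-addressed storage: the large file lives in a cache keyed by its hash, while a tiny pointer file recording that hash is what actually gets versioned in Git. Here is a minimal sketch of that idea; the `snapshot` and `restore` helpers and the `.meta` pointer format are illustrative inventions, not any particular tool’s API.

```python
import hashlib
import json
import shutil
from pathlib import Path

def snapshot(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small
    pointer file (safe to commit to Git) recording which version it is."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():  # identical versions are stored only once
        shutil.copy2(data_file, cached)
    pointer = data_file.with_suffix(data_file.suffix + ".meta")
    pointer.write_text(json.dumps({"sha256": digest}))
    return pointer

def restore(pointer: Path, cache_dir: Path, target: Path) -> None:
    """Recreate the exact dataset version named by a pointer file."""
    digest = json.loads(pointer.read_text())["sha256"]
    shutil.copy2(cache_dir / digest, target)
```

Because the pointer file is small text, Git handles it happily; checking out an old commit gives you the old pointer, from which any historical dataset version can be reproduced.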
Do I need it?
In our view, this is something that every machine learning project needs. Given how important data is, going without version control is setting yourself up for failure. Without the right tools, you must figure out not only how to handle evolving data, but also how to store that data and make it available to team members; data version control tools handle all of this for you.
Can’t I just use Git?
If you’re familiar with Git or similar source control tools, you might wonder why you need a special tool for data version control. The main reason is size: Git is designed around small text files, and the large, often binary datasets typical of machine learning are too big for it to handle efficiently. Furthermore, Git is not designed with data science workflows in mind, while specialist tools are.
What are the options?
When comparing data version control tools, we need to consider a few things:
- How and where data is stored. Because we want to share data, it needs to live in a central location that everybody can access. Often this comes in the form of cloud storage, such as an Amazon S3 bucket or Google Cloud Storage.
- What kind of data formats are supported. Some tools are generalists, supporting all kinds of data, while others specialise in one kind, for instance only tabular data, or only images.
- Scalability. If you work with big data then you may need a tool that’s designed for that scale, while if size isn’t a concern then you have more options to choose from. Big data tools come with a complexity cost that isn’t worth paying unless your datasets are in excess of a terabyte in size.
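On the first point above, the central location typically behaves like a simple blob store: teammates push dataset versions they have cached locally and pull ones they are missing. A minimal sketch, using a local directory to stand in for cloud storage such as S3; the `push` and `pull` names are illustrative, not any specific tool’s commands.

```python
import shutil
from pathlib import Path

def push(local_cache: Path, remote: Path) -> None:
    """Upload any cached dataset versions the shared remote lacks."""
    remote.mkdir(parents=True, exist_ok=True)
    for blob in local_cache.iterdir():
        if not (remote / blob.name).exists():
            shutil.copy2(blob, remote / blob.name)

def pull(remote: Path, local_cache: Path) -> None:
    """Download versions teammates pushed that we lack locally."""
    local_cache.mkdir(parents=True, exist_ok=True)
    for blob in remote.iterdir():
        if not (local_cache / blob.name).exists():
            shutil.copy2(blob, local_cache / blob.name)
```

Because each version is stored under its content hash, pushes and pulls only transfer what the other side is missing, which is what makes sharing multi-gigabyte datasets tractable.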