Data is everything in machine learning. That’s true whether you’re building a model to recommend products or to recognise cat pictures. But managing data is complex and time-consuming. Datasets are often very large, which makes them difficult to store and share with a team. Data also evolves over time: many people may contribute changes at any moment, and each change can affect model performance.
As you may guess, data version control is all about tracking changes to a dataset. But there’s more to it than that: where these tools shine is in enabling people to share and collaborate on data, in much the same way as tools like Git enable code collaboration.
A good data version control solution provides:
- A central location where data is stored.
- A history of changes to datasets, so that any historical version can be reproduced whenever required.
- The ability to easily share datasets with others and enable collaboration.
- An intuitive workflow that supports the data scientist in their work.
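
To make the first two points concrete, here is a minimal sketch of the idea in Python. It is not how any particular tool works. The class name `TinyDataStore`, its methods, and the on-disk layout (a `blobs` directory plus a `history.json` log) are all hypothetical, chosen only for illustration. Each committed dataset is stored under the SHA-256 hash of its contents, and every commit is appended to a history log, so any earlier version can be read back exactly.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class TinyDataStore:
    """Toy, hypothetical data store: one central directory,
    content-addressed blobs, and an append-only history log."""

    def __init__(self, root):
        self.root = Path(root)
        self.blobs = self.root / "blobs"        # central location for data
        self.log = self.root / "history.json"   # ordered list of commits
        self.blobs.mkdir(parents=True, exist_ok=True)
        if not self.log.exists():
            self.log.write_text("[]")

    def commit(self, data: bytes, message: str) -> str:
        """Store a dataset snapshot and record it in the history."""
        version = hashlib.sha256(data).hexdigest()
        blob = self.blobs / version
        if not blob.exists():  # identical content is stored only once
            blob.write_bytes(data)
        history = self.history()
        history.append({"version": version, "message": message})
        self.log.write_text(json.dumps(history, indent=2))
        return version

    def history(self) -> list:
        """Return all commits, oldest first."""
        return json.loads(self.log.read_text())

    def checkout(self, version: str) -> bytes:
        """Reproduce the exact bytes of a historical version."""
        return (self.blobs / version).read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = TinyDataStore(tmp)
        v1 = store.commit(b"id,label\n1,cat\n", "initial dataset")
        v2 = store.commit(b"id,label\n1,cat\n2,dog\n", "add a dog example")
        for entry in store.history():
            print(entry["version"][:8], entry["message"])
        # The first version is still recoverable after the second commit.
        print(store.checkout(v1).decode())
```

A real solution has to go much further than this sketch: it needs to handle datasets too large to hold in memory, let several people commit at once, and share the store over a network. The core idea of immutable snapshots plus a recorded history is what makes reproducing an old version possible.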