It’s in the name: ‘data science’. As with any science, creating machine learning models requires a degree of experimentation, some educated guesswork about what will and won’t work, and thorough documentation to record your experiments and their outcomes.
If you want to create the best model possible, it makes sense that you’ll need to run a number of experiments before you arrive at that model: trying out different techniques, gathering better data, and tweaking certain parameters. Even after that initial model is built, you are more than likely going to need to recreate (retrain) it in the future as things change: the data, user behaviour, the features themselves. This retraining will require even more experiments.
Assuming one data scientist can only run one experiment at a time, can you speed up model development by having more data scientists working on experiments concurrently?
Do you need more cooks, or just a sharper knife?
Imagine you have a team of N data scientists, each working to develop the best possible model. Each of them is taking data, writing code, tweaking parameters, and building models, recording how effective each one is. In theory, N data scientists should be able to develop N times as many models, or develop models N times faster, than a single data scientist. In practice this is often not the case, due to various inefficiencies and duplication of effort.
Without the right tools and workflow, data scientists are unable to collaborate and share resources efficiently. Specific problems include:
- No easy way to share datasets.
- No way to create different versions of datasets.
- The same problem fixed in multiple different ways, because code isn’t shared and versioned efficiently.
- No method for centrally recording each experiment.
- No mechanism for sharing experiments and learnings.
The primary tool that helps resolve most of these issues is a centralised experiment tracker. We've written a guide on the various options, but fundamentally they all do the same thing: they capture all of the inputs that went into building an experimental model (the version of the data, the version of the code, and the parameters used to build the model), as well as the model's output and accuracy.
By implementing the experiment tracker and making a small modification to the data scientists’ training code, you can automatically capture every experiment centrally, allowing everyone to see what everyone else is doing, collaborate more effectively, and develop models more quickly.
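The core idea can be sketched with a toy in-memory tracker. Real tools (MLflow, Weights & Biases, and others covered in our guide) persist these records on a central server with a UI, but the record they keep per run is essentially the one below; all names here are illustrative:

```python
# Toy sketch of what an experiment tracker records per run.
# Real trackers store these centrally so the whole team can query them.
class ExperimentTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, data_version, code_version, params, metrics):
        """Record everything needed to reproduce and compare one experiment."""
        run = {
            "data_version": data_version,  # e.g. a DVC hash of the dataset
            "code_version": code_version,  # e.g. the git commit of the training code
            "params": params,              # hyperparameters used for this run
            "metrics": metrics,            # resulting accuracy, loss, etc.
        }
        self.runs.append(run)
        return run

    def best_run(self, metric):
        """Any team member can find the strongest experiment so far."""
        return max(self.runs, key=lambda r: r["metrics"][metric])


tracker = ExperimentTracker()
tracker.log_run("data-v1", "abc123", {"lr": 0.1}, {"accuracy": 0.81})
tracker.log_run("data-v2", "def456", {"lr": 0.01}, {"accuracy": 0.87})
print(tracker.best_run("accuracy")["data_version"])  # → data-v2
```

Because every run carries its data version, code version, and parameters, any result another data scientist reports can be reproduced rather than rediscovered.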
The experiment tracker needs versioned inputs, though, so you would also need to implement a data versioning tool (DVC being our open source tool of choice), which integrates nicely with git and ensures that training code and data are versioned together.
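A typical DVC workflow looks roughly like this (a sketch; the file names are illustrative and the exact commands depend on your setup):

```shell
# Track a dataset with DVC: the data itself goes to DVC-managed storage,
# while a small .dvc pointer file is created for git to track.
dvc add data/train.csv

# Commit the pointer file alongside the training code, so this one git
# commit pins both the code version and the data version for the experiment.
git add data/train.csv.dvc .gitignore train.py
git commit -m "Retrain on cleaned training data"
```

The git commit hash then serves as the single identifier the experiment tracker records for both code and data.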
Ignore the problem at your peril
Developing new or updated models can be painfully slow. From a business perspective this means that new ML/AI features aren’t getting deployed as quickly as you want, and replacing a model that isn’t performing correctly also takes too long. Ultimately, the users of your service aren’t experiencing the features you want them to, and/or your product is delivering suboptimal results.
In addition to the impact on your customers, your data scientists may become frustrated and demotivated by how long it takes to deploy models. Morale may drop, and you risk your best data scientists leaving. It also affects your ability to recruit, as the best data scientists are discerning about working only for organisations with efficient tools and processes in place.
Centralised Experiment Trackers
So a centralised experiment tracker may well be the smart solution for your business. Many hands don’t always make light work. You may not need more people; instead, create an environment that helps your current data scientists work together more effectively, and without frustration.