Shifting Sands

How to build AI models on a solid foundation, when the underlying data is constantly moving.

Getting your model into production is a great feeling. All that effort experimenting with data, tweaking parameters and writing code has paid off in a model that is as good as you can make it. Give yourself a pat on the back!

But don’t pat too hard. Change is coming.

In normal software development, you can anticipate the scenarios that require you to update your code: for example, when a) you want new features or b) there’s a bug in the existing code. The same conditions apply to machine learning, but you must also be mindful of the ever-changing nature of data.

We train machine learning models from data, which we refer to as training data. When we want to ask questions of our model, we pass new data into it and expect a valuable answer. But real-world data changes over time and drifts away from the data the model was trained on. As a result, our models become less and less effective and need retraining on new data.

Picture this.

Let’s make this a bit more tangible. Imagine a company running a successful digital art marketplace. Their business relies on users buying art that they like. Therefore, if the marketplace can present users with pieces that are more to their taste, it can hope for more sales, more money, world peace etc.

To do this they build a machine learning system known as a recommendation engine. This is made up of two different types of models - one model that groups together users based on what content they’ve engaged with in the past and another model that groups together artwork that is similar.

These models combine to create effective recommendations from their current data set. This data includes:

Demographic data about their users - age, gender, location, interests.
Data about which artwork those users have interacted with - clicked on, favourited, purchased.
Visual data about each piece of art - style, colours, format.
Textual data about the art - description, tags.
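
To give a feel for how two such models might combine, here is a minimal sketch in plain Python. It is a simplification rather than the marketplace’s actual system: it scores overlap between sets instead of training real grouping models, and every user, artwork, tag and function name in it is hypothetical.

```python
# Toy data; every name and value here is hypothetical.
interactions = {
    "ana": {"sunset", "harbour"},
    "ben": {"sunset", "harbour", "neon_city"},
    "cat": {"portrait_1"},
}
art_tags = {
    "sunset": {"landscape", "warm"},
    "harbour": {"landscape", "boats"},
    "neon_city": {"urban", "night"},
    "portrait_1": {"portrait", "warm"},
    "meadow": {"landscape", "warm"},
}


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, top_n=3):
    """Rank unseen artwork by combining user similarity and artwork similarity."""
    seen = interactions[user]
    scores = {}
    # Model 1: pieces engaged with by users whose history overlaps this user's.
    for other, other_seen in interactions.items():
        if other == user:
            continue
        weight = jaccard(seen, other_seen)
        for art in other_seen - seen:
            scores[art] = scores.get(art, 0.0) + weight
    # Model 2: pieces whose tags resemble something the user already engaged with.
    for art, tags in art_tags.items():
        if art in seen:
            continue
        best = max((jaccard(tags, art_tags[s]) for s in seen), default=0.0)
        scores[art] = scores.get(art, 0.0) + best
    # Highest score first; ties broken alphabetically so the order is stable.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [art for art, score in ranked[:top_n] if score > 0]


print(recommend("ana"))
```

Notice that both inputs, the interaction history and the tags, are exactly the kind of data that keeps moving, which is why the scores go stale if nothing is refreshed.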

All of this data changes constantly. The marketplace gains new users and new art is created. It’s also likely that previous recommendations have led users to interact with pieces they would not otherwise have seen, potentially changing their tastes. As a result, the models need to be retrained regularly to stay effective on this new data set.

When the first model was built, the team will have used a set of tools that allowed them to collaborate with other people working on the data and to track the results of all their experiments. This is all good stuff, but quite a bit of human interaction is still required to run these experiments and to train and deploy the models.

Humans are creative and thoughtful but error-prone. It’s easy to forget how a specific step in the process was run, or even forget to run a step at all!

The solution

So how do you build a new model quickly and efficiently to deal with this changing landscape?

The key is pipelines. In simple terms, you take each step in the model training process and record it as code. This way, not only can you repeat those steps consistently whenever you need to, but you can also be sure that nothing gets forgotten.

Pipelines automate pulling in new data.
Pipelines automate transforming data into the shape you need.
Pipelines can train a new model.
Pipelines can deploy the model to production.

The pipeline tool directs other tools to do what’s necessary to build and deploy your models. First, it fetches data from a database. Next, it prepares that data for training, for example by filtering out things you don’t need. It then runs a training script and produces a model, which is saved, packaged and, in the final step, deployed using a model serving framework.
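
The steps above can be sketched in plain Python. This is an illustration of the idea of recording each step as code, not the API of any particular pipeline tool: the function names, the sample rows and the dictionary standing in for a serving framework are all assumptions made for the example.

```python
import json
from collections import Counter


def fetch_data():
    """Step 1: stand-in for a database query; returns hypothetical interaction rows."""
    return [
        {"user": "ana", "art": "sunset", "action": "purchased"},
        {"user": "ben", "art": "sunset", "action": "clicked"},
        {"user": "ben", "art": "harbour", "action": "purchased"},
        {"user": "cat", "art": "sunset", "action": "purchased"},
    ]


def prepare_data(rows):
    """Step 2: filter out what training does not need (here, anything but purchases)."""
    return [row["art"] for row in rows if row["action"] == "purchased"]


def train_model(purchased_art):
    """Step 3: 'train' a trivial popularity model: purchase counts per artwork."""
    return dict(Counter(purchased_art))


def package_model(model):
    """Step 4: serialise the model so it can be stored and shipped."""
    return json.dumps(model, sort_keys=True)


def deploy_model(packaged, serving):
    """Step 5: hand the packaged model to a serving layer (a plain dict here)."""
    serving["current_model"] = json.loads(packaged)
    return serving


def run_pipeline(serving):
    """Run every step in a fixed order, so none can be skipped or forgotten."""
    rows = fetch_data()
    prepared = prepare_data(rows)
    model = train_model(prepared)
    packaged = package_model(model)
    return deploy_model(packaged, serving)


serving = run_pipeline({})
print(serving["current_model"])
```

Because the whole process lives in `run_pipeline`, retraining on fresh data means calling one function again rather than remembering five manual steps. A real pipeline tool adds scheduling, tracking and integrations on top of this basic shape.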

There are a few good open source options for pipeline tools, but ZenML is particularly useful. What makes ZenML special is its simple, intuitive pipelines and its integrations with a wide range of additional MLOps tools. This empowers you to pick the right tool for each individual step and join them all up into a single pipeline.

Without getting too technical, another important ZenML concept is the pipeline orchestration environment. This is where the action happens - where the pipeline actually runs. For this, Kubeflow is a robust and popular choice for scalable model training.

For model deployment and serving there are again a number of options, but in this scenario a tool called KFServing (since renamed KServe) would be perfect. It was developed as part of the Kubeflow project, making it a natural fit in your tech stack for a project of this nature.

Act Fast

An outdated model is going to give unsuitable recommendations to your users. In the best case, this will result in a short-term reduction in sales. In the worst case, your users get annoyed at inappropriate recommendations, lose faith in the platform and go elsewhere to buy their art.

On the flip side, a really good recommendation is going to drive more engagement from your users and increase sales. This in turn will encourage more artists to publish work on your site, as they believe that you’re going to give them the best chance of selling what they’ve created. More art = more things to sell = more revenue.

Pipelines

As a saying often attributed to Mark Twain goes, “It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.”

It’s very dangerous to assume that the data on which an original model is based is a constant. Data is constantly changing and shifting, so it’s important that you allow for this from the outset.
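
One way to allow for this is to have the pipeline check how far recent data has moved from the training data before deciding whether to retrain. The sketch below is one possible approach under stated assumptions, not a prescribed method: it compares the mix of art styles using total variation distance, and the sample data, the function names and the 0.2 threshold are all hypothetical.

```python
from collections import Counter


def distribution(values):
    """Share of each category in a list of values (empty list gives an empty dict)."""
    if not values:
        return {}
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def total_variation(old, new):
    """Half the summed absolute difference between two distributions (0.0 to 1.0)."""
    p, q = distribution(old), distribution(new)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def needs_retraining(old, new, threshold=0.2):
    """Flag a retrain when the category mix has shifted past the chosen threshold."""
    return total_variation(old, new) > threshold


# Hypothetical style labels: what the model was trained on vs. what users see now.
training_styles = ["landscape"] * 6 + ["portrait"] * 4
recent_styles = ["landscape"] * 3 + ["portrait"] * 3 + ["abstract"] * 4

print(needs_retraining(training_styles, recent_styles))
```

In this toy data a style the model has never seen, abstract, now makes up a large share of recent pieces, so the check flags a retrain. The right measure and threshold depend on your own data, so treat both as starting points to tune.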

A machine learning pipeline tool is at the heart of any effective MLOps toolset. It’s the thing that orchestrates the tools we use for data management, model training and deployment.

A well-thought-out pipeline tool allows data scientists to do what they are best at. It gives them more time to experiment and develop better models, and ultimately gives users the best possible experience.

