We all know how important data is in machine learning. But we usually don't work from raw data. In reality, we pick out the most useful attributes from our data, and create new attributes using existing ones. These are called features, and features are what we actually use to train models.
Imagine we want to predict how likely a customer is to buy a product. Our data might tell us each customer’s date of birth, exact income, and location. From there we might calculate their age, and turn their exact income into an income bracket (e.g. £31,200 becomes £30,000-40,000).
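As a rough illustration of those two derived features, here is a minimal sketch in plain Python. The function names, the £10,000 bracket width and the sample customer are all assumptions made for the example, not something prescribed by any particular tool.

```python
from datetime import date


def age_in_years(date_of_birth: date, today: date) -> int:
    """Whole years elapsed between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)


def income_bracket(income: float, width: int = 10_000) -> str:
    """Label the bracket that contains income, using brackets of a fixed width."""
    lower = int(income // width) * width
    return f"£{lower:,}-{lower + width:,}"


# Hypothetical raw record for a single customer.
customer = {"date_of_birth": date(1990, 6, 15), "income": 31_200}

# The features we would actually hand to a model.
features = {
    "age": age_in_years(customer["date_of_birth"], today=date(2024, 1, 1)),
    "income_bracket": income_bracket(customer["income"]),
}
print(features)
```

Passing `today` in explicitly, rather than reading the clock inside the function, keeps the calculation repeatable, which matters once the same feature has to be recomputed later.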
As a data science team builds more models, and increasingly complex ones, keeping track of features becomes challenging enough to justify a dedicated tool for the job. In this article we'll look at what feature stores do, and when it's appropriate to use them. We’ll be talking in particular about Feast, a popular open source feature store, but much of what we say here applies to other feature stores too.
Feast, which confusingly stands for Feature Store, manages and serves machine learning features in production. It’s completely open source, easy to set up, and comes with commercial backing from Tecton, the primary contributor to Feast.
So what problems can Feast solve for us? Let’s look at the earlier example, predicting whether a customer will buy a product.
A database for features
We need access to the data so we can experiment, train and monitor the model. There will be new data all the time, as new customers appear, so we need to keep track of these changes as well.
Feast gives you a central place to store and share features, as well as track their history. Data version control tools have a similar role (see our guide to data version control), but those tools only work with raw data, not features.
It’s important to know that Feast itself doesn’t include a database. Instead, it connects to external storage, which is used to persist features. This way you can pick a backend that suits your existing stack, with out-of-the-box support for BigQuery, DynamoDB, Redis, and Spark among others.
Features are accessed through Feast’s Python library, which provides an easy interface to read, write and manipulate features during model training and serving.
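To give a feel for that interface, the sketch below reads the latest feature values for one customer at serving time. It assumes a Feast feature repository has already been set up and populated; the feature view name `customer_features`, the entity key `customer_id` and the helper `fetch_customer_features` are hypothetical names chosen for this example.

```python
def fetch_customer_features(store, customer_id):
    """Read the latest feature values for one customer.

    `store` is expected to be a feast.FeatureStore, for example one created
    with FeatureStore(repo_path="."). The feature view "customer_features"
    and the entity key "customer_id" are hypothetical and would need to be
    defined in your own feature repository.
    """
    response = store.get_online_features(
        features=[
            "customer_features:age",
            "customer_features:income_bracket",
        ],
        entity_rows=[{"customer_id": customer_id}],
    )
    # to_dict() maps each name to a list holding one value per entity row.
    values = response.to_dict()
    # We asked for a single customer, so take the first value of each list.
    return {name: column[0] for name, column in values.items()}
```

For training, the same store object offers `get_historical_features`, which takes a dataframe of entities and timestamps and returns the matching feature values.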
A tool for feature engineering
The difference between raw data and features is subtle but important. It becomes more obvious when we think about features that are calculated. For instance, if our data has a date of birth column, but our model actually needs age, then we calculate and save age as a feature.
Feature engineering, i.e. transforming raw data so that it can be used to train a good model, is a big part of machine learning, and Feast helps us keep track of those training features in one place.
Feast isn’t only about training; it’s just as useful for model serving and monitoring. Imagine we want a prediction for a particular customer. Our model wants to know the customer’s age (among other things), but age is a feature that we calculated using date of birth, so we need to prepare the inputs for the model in the same way that we did during training.
This is a common MLOps problem. Model servers and model monitoring systems both need to know about the features that are used during training. Feature stores give training, serving and monitoring a single, shared source of features.
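The underlying idea can be sketched without any feature store at all: define the transformation once, and have both the training pipeline and the serving path call it. This is only an illustration of the principle, not how Feast implements it; `build_model_inputs` and the sample records are hypothetical.

```python
from datetime import date


def build_model_inputs(raw: dict, today: date) -> dict:
    """Turn one raw customer record into the inputs the model expects."""
    dob = raw["date_of_birth"]
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return {"age": today.year - dob.year - (0 if had_birthday else 1)}


# Hypothetical raw records, keyed by customer id.
raw_records = {
    1001: {"date_of_birth": date(1990, 6, 15)},
    1002: {"date_of_birth": date(1975, 12, 31)},
}
as_of = date(2024, 1, 1)

# Training: transform every historical record with the shared function.
training_rows = {
    customer_id: build_model_inputs(raw, today=as_of)
    for customer_id, raw in raw_records.items()
}

# Serving: a single request goes through exactly the same function.
serving_row = build_model_inputs(raw_records[1001], today=as_of)

# Both paths agree because neither re-implements the logic.
print(serving_row == training_rows[1001])
```

A feature store takes this a step further by storing the computed values, so the serving path can look them up instead of recomputing them.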
Do I need a feature store?
Setting up a feature store like Feast adds some overhead to your project. The good news is that Feast itself doesn’t need to be deployed anywhere; you just need to add a Python library to your project. But Feast does need to use an external database, such as BigQuery, DynamoDB or Redis, to store its features, and that’s the main overhead to take into account.
Not everybody needs a feature store, and certainly in the early stages of a model’s development it adds some complexity. However, there are a number of scenarios where one will benefit your project.
A pre-existing data lake or data warehouse often becomes unwieldy; just retrieving or describing data becomes a difficult task. Development teams spend time writing complex code for retrieving and transforming data, and that code ends up repeated in different places. A feature store provides a centralised, accessible location for features, and a standardised way to access them.
This centralised store can be managed, modified and accessed in production. A central store provides a better data workflow in scenarios where data is constantly changing, such as in fraud detection, or recommendation engines. We can collaborate on our data as it appears in real time, instead of fetching new data from a database each time it changes.
If we’re producing multiple models from a single dataset, it’s likely that different models will re-use certain features. For instance, in a natural language dataset there’s a good chance that teams are regularly producing features like word frequency tables from the same raw data. Feature stores help us to share features among different teams and for different models. This means we can perform transformations on our data in one place and use those values across our project, saving time and reducing code repetition.
There are infrastructure, time and maintenance costs to setting up a feature store, so if your goal is to quickly build a model for research or to establish feasibility, then you probably don’t need a feature store yet. You might still want simpler data management options, like DVC.
As your data and features grow and become more complex, a feature store starts to make more sense. Additionally, once you start building multiple models, or have multiple teams using the same dataset for different purposes, a feature store becomes an asset for collaboration.
Finally, you might wonder if it makes sense to use a data version control tool like DVC at the same time as Feast or another feature store. It usually doesn’t make sense to have both, and you’d be inviting even more complexity if you did. Feast already does data versioning, except it does it on the level of features rather than raw data.
Feature stores are an important part of your machine learning infrastructure, and they’re especially useful when you’re dealing with complex datasets across multiple teams. Feast is one of the most mature open source feature stores available, and it can integrate with a lot of common databases, both for storing features and ingesting data.
There are a few factors to consider when deciding whether a feature store is right for you:
Are you just starting out with an experimental model? If so, a feature store may be too much overhead for now.
Does new data appear often? If you have new data coming in all the time, then feature stores can help you manage this, by giving you a central place to calculate and store features. Streaming data is an area where Feast shines in particular.
Are there multiple models trained from the same dataset? There’s a good chance that some features are being re-used in multiple models. Having a feature store avoids duplication by making it easy to share features.