Whenever we train and deploy a machine learning model, we want to make sure that the model performs well in production.
Models need monitoring because things happen in the real world that we can't account for during training. The most obvious example of this is when the real-world data drifts from the training data, or when we encounter outliers. We use monitoring to make decisions, like when to re-train or when to get new data.
There are a few specific things to monitor for:
- Data drift: data in the real world starts to differ from the training data, indicating a need to re-train with new data.
- Outliers: inputs outside the range of our expectations, for which the model may produce nonsense outputs.
- Bias: the model is biased towards certain predictions in a way that we didn't realise during training. We may need to re-train with better data.
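As a rough illustration of the outlier case, the sketch below flags inputs that fall outside the range seen in training. The function names (`fit_bounds`, `is_outlier`), the example feature, and the 1.5×IQR fence are assumptions made for this example; neither tool discussed here prescribes this particular rule.

```python
import statistics

def fit_bounds(training_values, k=1.5):
    """Derive lower/upper fences from training data using the IQR rule."""
    q1, _, q3 = statistics.quantiles(training_values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, bounds):
    """Return True if a production input falls outside the training fences."""
    lower, upper = bounds
    return value < lower or value > upper

# Hypothetical feature: customer age as seen during training
training_ages = [22, 25, 31, 34, 38, 41, 45, 52, 58, 63]
bounds = fit_bounds(training_ages)

print(is_outlier(40, bounds))   # False: a typical input
print(is_outlier(140, bounds))  # True: far outside anything seen in training
```

A real outlier detector would usually look at several features together, but the principle is the same: learn what "normal" looks like from the training data, then compare each production input against it.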
At a high level there are two kinds of model monitoring:
- Offline analysis: collect a snapshot of inputs and outputs from the production model and perform a one-time analysis. This is a manual process but it can be useful if there's a one-off question we need to answer about a model, for instance whether it exhibits a specific bias.
- Real-time monitoring: we have a monitoring service that collects live data from the model, performs analysis in real time and alerts us about specific model performance problems when they arise.
In this blog post, we will focus mainly on real-time monitoring, comparing the approach taken by two tools, Evidently and Seldon's Alibi Detect.
There are a few things that we use to evaluate and compare these tools:
- Compatibility: which model serving tools and ML frameworks does it work with?
- Integrations: can it be integrated into existing application monitoring infrastructure?
- Capabilities: what aspects of model monitoring can it handle well? What data types are supported, and which statistical tests are available? Here, a statistical test is the algorithm used to assess things like drift or bias. There are many of these, but we won't delve into their details in this article.
Evidently is a Python library available under the Apache 2.0 license. This is a tool devoted to making model monitoring simple. Evidently is platform-agnostic, so it works with any model serving setup and any machine learning framework.
Out of the box, Evidently comes with some basic dashboards telling us about:
- Input data drift,
- Model output drift (aka target drift),
- Model and data quality.
These dashboards make it easy to get started with Evidently right away, but for robust production monitoring we can take advantage of another feature: integration with Prometheus and Grafana. This combination is already a widely adopted solution for application monitoring and dashboarding, and by combining model metrics with application metrics we can build dashboards that give real-time insights into the entire software stack.
The simplest way to use Evidently is for offline analysis. You'll need some reference data, which is a sample of the data that was used to train the model, along with some current data, which you've collected from the production model. As an example, we can perform data drift analysis in a Jupyter Notebook:
<pre><code>from evidently.dashboard import Dashboard
from evidently.dashboard.tabs import DataDriftTab

# reference_data and current_data are pandas DataFrames with the same columns
drift_dashboard = Dashboard(tabs=[DataDriftTab(verbose_level=1)])
drift_dashboard.calculate(reference_data, current_data, column_mapping=None)
drift_dashboard.show()  # render the dashboard inline in the notebook
</code></pre>
This generates the Data Drift dashboard, which:
- Shows the distribution of both the reference and real-world data,
- Tells us the similarity between the reference and real-world data,
- Informs us whether drift was detected.
In order to determine that data has drifted, Evidently compares the reference and real-world distribution using a statistical test. These tests may be familiar to many data scientists: to detect drift, a test called Kolmogorov-Smirnov is used for numerical features, and a Chi-Squared test for categorical features. It's worth noting that these tests are suited to tabular data, but not to other kinds of data, like images.
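To give a feel for what the Kolmogorov-Smirnov test measures, here is a minimal standard-library sketch for a single numerical feature. It is not Evidently's implementation: the function names are made up for this example, and the decision rule uses the textbook large-sample critical value rather than an exact p-value. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used.

```python
import math

def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        # Advance past every value equal to x in both samples (handles ties)
        while i < n and ref[i] == x:
            i += 1
        while j < m and cur[j] == x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def drift_detected(reference, current, alpha=0.05):
    """Compare the statistic with the large-sample critical value for alpha."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))  # about 1.358 for alpha=0.05
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical

# Hypothetical feature values: production data shifted upwards by 50
reference = list(range(100))
shifted = [x + 50 for x in reference]

print(drift_detected(reference, reference))  # False: identical distributions
print(drift_detected(reference, shifted))    # True: half the mass has moved
```

The statistic is simply the maximum vertical distance between the two cumulative distributions, which is why the test needs no assumptions about the shape of the data.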
Though it's easy to get this up and running, offline analysis isn't very useful for production monitoring. For that, we need real-time monitoring.
Evidently provides an example set-up for real-time monitoring on GitHub. Essentially, a monitoring service sits alongside your deployed model. It collects the inputs and outputs of your model and calculates metrics for things like drift (using the same statistical tests that are available for offline analysis).
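The core loop of such a monitoring service can be sketched in a few lines. The `DriftMonitor` class below is hypothetical and is not taken from Evidently's example: it keeps a sliding window of recent inputs for one feature and, once the window is full, scores it against the reference data. The score used here (the shift of the window mean, measured in reference standard deviations) is a deliberately simple stand-in for a proper statistical test.

```python
from collections import deque
import statistics

class DriftMonitor:
    """Keeps a sliding window of recent inputs and scores it against reference data."""

    def __init__(self, reference, window_size=50, threshold=0.5):
        self.ref_mean = statistics.mean(reference)
        self.ref_std = statistics.stdev(reference)
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        """Record one model input; return a drift score once the window is full."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return None  # not enough data yet
        # Stand-in metric: shift of the window mean, in reference standard deviations
        return abs(statistics.mean(self.window) - self.ref_mean) / self.ref_std

    def is_drifting(self, score):
        """A score above the threshold is treated as drift."""
        return score is not None and score > self.threshold

monitor = DriftMonitor(reference=list(range(100)), window_size=10)
scores = [monitor.observe(v) for v in range(45, 55)]   # traffic similar to training
later = [monitor.observe(v) for v in range(90, 100)]   # traffic after a shift

print(monitor.is_drifting(scores[-1]))  # False
print(monitor.is_drifting(later[-1]))   # True
```

In a real service, `observe` would be called from the request path (or from a log consumer), and the resulting score would be published as a metric rather than printed.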
The provided example includes:
- A Flask app acting as the Monitoring Service.
- A monitoring configuration file, to define what kind of monitoring to perform, and which features are of interest.
- Configuration for Prometheus and Grafana.
- A docker-compose configuration to tie all of these together.
Typically, you'll want to deploy one monitoring service per model. Using the provided example, we were able to create a simple monitoring service for Data Drift, requiring only small modifications to the code and configuration.
Finally, the provided example demonstrates using Evidently with Prometheus and Grafana. As we mentioned earlier, this is a really powerful feature, because it means you can combine your model metrics with application metrics from other components in your software stack, and build unified dashboards.
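To make the Prometheus side less abstract: Prometheus scrapes metrics over HTTP in a simple text format, so a monitoring service only needs to publish its drift scores in that format for Grafana to be able to chart them. The sketch below renders per-feature drift p-values as a gauge. The function name, metric name, and feature names are invented for this example, and a real service would normally rely on a client library such as `prometheus_client` instead of building the text by hand.

```python
def render_drift_metrics(p_values, metric="model_feature_drift_p_value"):
    """Render per-feature drift p-values in the Prometheus text exposition format."""
    lines = [
        f"# HELP {metric} p-value of the drift test per feature",
        f"# TYPE {metric} gauge",
    ]
    for feature, p in p_values.items():
        # One sample per feature, distinguished by a "feature" label
        lines.append(f'{metric}{{feature="{feature}"}} {p}')
    return "\n".join(lines) + "\n"

# Hypothetical output of a drift calculation
output = render_drift_metrics({"age": 0.03, "income": 0.41})
print(output)
```

Because the model metrics end up in the same Prometheus instance as the application metrics, a single Grafana dashboard can show, say, request latency next to feature drift.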
[Image: a simple example of a Grafana dashboard built from Evidently metrics]
Alibi Detect is a Python library developed by Seldon and available under the Apache 2.0 license. This library aims to be a "go-to library for outlier, adversarial and drift detection in Python".
Alibi Detect is built around detectors. A detector addresses a specific kind of model monitoring use-case, much like Evidently's monitoring service concept. Alibi Detect includes a number of pre-made detectors:
- Outlier detectors,
- Drift detectors,
- Adversarial detectors.
These work with a variety of data types:
- Tabular (with numerical and categorical features),
- Time series,
- Images,
- Text.
In comparison with Evidently, there is a richer set of data types and statistical tests available in Alibi. However, it's important to know it's not platform-agnostic; Alibi is designed to be used together with Seldon Core. In contrast, Evidently aims to work anywhere.
In a similar way to Evidently, Alibi can be used directly in Python scripts or notebooks to do offline analysis. The usage and features look very similar to Evidently's, so we haven't included a code sample here. Onwards to real-time monitoring!
Real-time monitoring is where Alibi Detect really shines. The implementation is mature and relatively easy to set up. However, since it is tied strongly to Seldon Core, it only makes sense to use Alibi if you also use Seldon for model serving.
In case you're not familiar with it, here is a quick explanation of Seldon Core: it's a model serving framework built on top of Kubernetes. Seldon makes it very easy to deploy a model; what Alibi adds is the ability to deploy a detector alongside the model. The detector is just like Evidently's monitoring service, in that it observes model inputs and outputs, and calculates metrics that are relevant to model performance.
Unlike Evidently, Alibi comes with a set of pre-trained detectors which cover common monitoring tasks. You can either use these, or build your own, depending on your needs.
Both tools are under development and we expect to see a lot of new features in the future.
It's true that Alibi is further along than Evidently in terms of features. But Evidently aims to provide a solid approach to monitoring that works on any platform, and that's a more ambitious goal. We think Evidently will catch up quickly feature-wise.
Depending on the task, we can provide some suggestions as to which tool is more suitable.
For offline analysis:
- Alibi has infrastructure overheads that are difficult to justify for offline analysis if you're not already using Seldon for other things.
- If you want feature-rich reports with only a few lines of code, and you work with numerical and/or categorical tabular data, Evidently has you covered.
- If your data is not tabular, or you need to use a wider range of statistical tests, there are other options out there, but you might consider implementing what you need in Evidently and becoming a contributor.
For real-time monitoring:
- Alibi Detect works best in combination with Seldon Core. This makes it a great choice if you already use Seldon Core. Otherwise, it brings a lot of infrastructure overhead.
- Alibi Detect is also a better choice if you have non-tabular data, need outlier detection, or need a statistical test that is not available in Evidently. We anticipate Evidently catching up quickly here.
- Evidently is a flexible general-purpose solution that works everywhere, making it a great all-round choice outside of the Seldon ecosystem.