In this blog post, we compare two libraries, Great Expectations and Deepchecks, that address the issue of data validation. Data validation is the process of identifying anomalies in datasets prior to training models. These anomalies include:
data type mismatch
missing or unexpected features
data domain violation
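To make the three anomaly types concrete, here is a minimal plain-Python sketch. It is not taken from either library; the schema, the column names, and the `find_anomalies` helper are all assumptions made up for illustration.

```python
# Hypothetical schema: expected column name -> (expected type, domain check).
EXPECTED_SCHEMA = {
    "passenger_count": (int, lambda v: 1 <= v <= 6),
    "payment_type": (str, lambda v: v in {"cash", "card"}),
}


def find_anomalies(rows, schema=EXPECTED_SCHEMA):
    """Return a list of (row_index, column, anomaly_kind) tuples."""
    anomalies = []
    for i, row in enumerate(rows):
        # Missing or unexpected features: compare column names with the schema.
        for col in schema.keys() - row.keys():
            anomalies.append((i, col, "missing feature"))
        for col in row.keys() - schema.keys():
            anomalies.append((i, col, "unexpected feature"))
        for col, (expected_type, in_domain) in schema.items():
            if col not in row:
                continue
            value = row[col]
            if not isinstance(value, expected_type):
                anomalies.append((i, col, "data type mismatch"))
            elif not in_domain(value):
                anomalies.append((i, col, "data domain violation"))
    return anomalies


rows = [
    {"passenger_count": 2, "payment_type": "cash"},    # clean row
    {"passenger_count": "2", "payment_type": "card"},  # string instead of int
    {"passenger_count": 9, "payment_type": "card"},    # outside the range 1..6
    {"passenger_count": 1, "tip": 3.5},                # missing and unexpected columns
]
anomalies = find_anomalies(rows)
```

Each row after the first triggers one kind of anomaly, and the last row triggers both a missing and an unexpected feature.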
Our Awesome Open Source MLOps repository provides a comprehensive list of all open source tools that can be used for data validation. If you think some tools are missing from the list, please open a pull request.
The following 3 criteria are used to compare the two libraries:
Data Format : This criterion checks whether all data formats are supported. Data formats include images, tables, text, etc., covering both structured and unstructured data.
Customisability : This criterion checks whether the libraries can be extended for use cases that are not supported by the built-in techniques.
ML Workflow : This criterion compares where each library sits in the end-to-end ML workflow. For data validation, the libraries are helpful after a clean dataset is created and after the train and test splits are created, to catch various data anomalies.
Great Expectations is a Python library available under the Apache 2.0 license. It is used to validate, document, and profile data in order to maintain data quality.
Great Expectations provides 3 key features, each of which we’ll detail below.
Expectations act as tests for data validation. For example, an Expectation on the column “passenger_count” can require that all values in that column are between 1 and 6 (inclusive); validation then returns a success or failure result accordingly.
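The “passenger_count” check described above can be sketched in plain Python. This is not the Great Expectations API: the library ships a built-in Expectation named `expect_column_values_to_be_between` for this purpose, whereas the `expect_values_between` function and the shape of its result dictionary below are simplified assumptions for illustration.

```python
def expect_values_between(rows, column, min_value, max_value):
    """Succeed only if every value in `column` lies in [min_value, max_value]."""
    unexpected = [
        row[column]
        for row in rows
        if not (min_value <= row[column] <= max_value)
    ]
    # Simplified result: a success flag plus the offending values.
    return {"success": not unexpected, "unexpected_values": unexpected}


trips = [{"passenger_count": n} for n in (1, 2, 6, 3)]
result_ok = expect_values_between(trips, "passenger_count", 1, 6)

# Add two rows that violate the inclusive range 1..6.
trips_bad = trips + [{"passenger_count": 0}, {"passenger_count": 7}]
result_bad = expect_values_between(trips_bad, "passenger_count", 1, 6)
```

The boundary values 1 and 6 pass because the range is inclusive, while 0 and 7 are reported as unexpected.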
Expectation Suites are collections of Expectations. Great Expectations uses Expectations or Expectation Suites to validate the data.
Automated data profiling
This feature automatically generates basic statistics and suites of Expectations based on the observed data.
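As a rough illustration of the idea behind profiling, the sketch below derives a small suite of range expectations from the minimum and maximum observed in each column. It is a plain-Python approximation under simplifying assumptions (numeric columns only, hypothetical column names and helper functions), not how Great Expectations implements its profilers.

```python
def profile_numeric_columns(rows):
    """Build one range expectation per column from the observed min and max."""
    suite = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        suite.append(
            {"column": column, "min_value": min(values), "max_value": max(values)}
        )
    return suite


def validate(rows, suite):
    """Return True only if every row satisfies every expectation in the suite."""
    return all(
        exp["min_value"] <= row[exp["column"]] <= exp["max_value"]
        for exp in suite
        for row in rows
    )


observed = [
    {"passenger_count": 1, "trip_distance": 0.5},
    {"passenger_count": 4, "trip_distance": 12.0},
    {"passenger_count": 6, "trip_distance": 3.2},
]
suite = profile_numeric_columns(observed)

# A later batch with a passenger count never seen during profiling.
new_batch = [{"passenger_count": 8, "trip_distance": 2.0}]
batch_is_valid = validate(new_batch, suite)
```

A profiled suite only encodes what was present in the observed data, so generated expectations are best treated as a starting point to review rather than as final rules.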
Great Expectations generates an HTML document, called a Data Doc, that contains both the Expectation Suites and the data validation results. The screenshot below shows an example Data Doc, which is a human-readable report of data quality.
Deepchecks is a Python library available under the GNU Affero General Public License. This library provides ways to test and validate both data and models.
Deepchecks offers 3 checks in different phases of the ML workflow, each of which we’ll detail below.
Deepchecks provides built-in data integrity checks for tabular datasets, such as feature-label correlation, conflicting labels, identifier-label correlation, class imbalance, data duplicates, data type mismatch, data domain violation, and outlier detection, amongst others. Data integrity checks can also be performed on vision datasets.
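Three of the integrity checks listed above (data duplicates, conflicting labels, and class imbalance) are simple enough to sketch in plain Python. The functions below are illustrative assumptions about what such checks compute, not Deepchecks code or its API, and the sample data is made up.

```python
from collections import Counter, defaultdict


def data_duplicates(samples):
    """Fraction of samples that repeat an earlier (features, label) pair."""
    counts = Counter(samples)
    return sum(c - 1 for c in counts.values()) / len(samples)


def conflicting_labels(samples):
    """Feature tuples that appear with more than one distinct label."""
    labels_by_features = defaultdict(set)
    for features, label in samples:
        labels_by_features[features].add(label)
    return {f for f, labels in labels_by_features.items() if len(labels) > 1}


def class_imbalance(samples):
    """Ratio of the rarest class count to the most common class count."""
    counts = Counter(label for _, label in samples)
    return min(counts.values()) / max(counts.values())


# Each sample is (features, label); features are hashable tuples.
samples = [
    ((1.0, "a"), 0),
    ((1.0, "a"), 0),  # exact duplicate of the previous sample
    ((2.0, "b"), 0),
    ((2.0, "b"), 1),  # same features as the previous sample, different label
    ((3.0, "c"), 0),
]
duplicate_ratio = data_duplicates(samples)
conflicts = conflicting_labels(samples)
imbalance = class_imbalance(samples)
```

In practice each check would be paired with a threshold (for example, a maximum tolerated duplicate ratio) to turn the computed value into a pass or fail result.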
The train-test validation suite compares the distributions of the train and test datasets. Additional checks include detecting new labels in the test dataset, dataset size comparison, feature and label drift between the train and test datasets, data leakage, etc. This suite also runs checks for vision tasks such as image segmentation, object detection, and image classification.
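A few of these train-test checks can be sketched with the standard library alone. The helpers below are hypothetical and only approximate the ideas (new labels, size comparison, and a simple leakage signal based on rows shared between the two splits); they are not the Deepchecks implementation.

```python
def new_labels(train_labels, test_labels):
    """Labels that occur in the test set but never in the training set."""
    return set(test_labels) - set(train_labels)


def size_ratio(train_rows, test_rows):
    """Test set size relative to training set size."""
    return len(test_rows) / len(train_rows)


def leaked_samples(train_rows, test_rows):
    """Test rows that also appear verbatim in the training set."""
    seen = set(train_rows)
    return [row for row in test_rows if row in seen]


# Made-up feature rows and labels for illustration.
train = [(5.1, 3.5), (4.9, 3.0), (6.2, 2.9), (5.9, 3.0)]
train_y = ["setosa", "setosa", "versicolor", "versicolor"]
test = [(4.9, 3.0), (6.5, 3.0)]
test_y = ["setosa", "virginica"]

unseen = new_labels(train_y, test_y)
ratio = size_ratio(train, test)
leaked = leaked_samples(train, test)
```

Here the test split contains a label the model never saw during training and one row copied from the train split, both of which would make evaluation results misleading.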
Model Performance Evaluation
The Model Performance Evaluation suite provides checks against a trained model. These checks include a confusion matrix report, model inference time, a comparison of model performance on the train and test datasets, and overfitting detection, amongst others. This suite can also run checks for vision tasks such as image segmentation, object detection, and image classification.
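The sketch below shows, in plain Python, what a confusion matrix and a train-versus-test performance comparison boil down to. The function names and the toy predictions are assumptions for illustration; Deepchecks computes these from an actual trained model and offers many more metrics.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def overfitting_gap(train_true, train_pred, test_true, test_pred):
    """Train accuracy minus test accuracy; a large positive gap hints at overfitting."""
    return accuracy(train_true, train_pred) - accuracy(test_true, test_pred)


train_true = [0, 0, 1, 1]
train_pred = [0, 0, 1, 1]  # every training prediction is correct
test_true = [0, 0, 1, 1]
test_pred = [0, 1, 1, 0]   # half of the test predictions are wrong

matrix = confusion_matrix(test_true, test_pred)
gap = overfitting_gap(train_true, train_pred, test_true, test_pred)
```

A model that scores perfectly on the train split but much worse on the test split, as in this toy data, is the pattern an overfitting check is meant to flag.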
Hands on Tutorial
We have created a Colab notebook that walks through the concepts introduced above for both libraries, using Python code and sample datasets.
In the Great Expectations section, we introduce 3 concepts in the different sections of the tutorial:
The “Automatic Profiling using BasicSuiteBuilderProfiler” section introduces Automatic Data Profiling and Data Docs.
The “Automatic Profiling using PandasProfiling” section introduces a different way of doing Automatic Data Profiling, using the PandasProfiling library.
In the Deepchecks section, we perform 3 checks for both tabular and vision datasets:
Data Integrity : The deepchecks integrity suite is relevant any time you have data that you wish to validate: whether it’s on a fresh batch of data, or right before splitting it or using it for training.
Train-Test Validation : The deepchecks train-test validation suite is relevant any time you wish to validate two data subsets. For example, comparing distributions across different train-test splits (e.g. before training a model or when splitting data for cross-validation) or comparing a new data batch to previous data batches.
Model Performance Evaluation : The deepchecks model evaluation suite is relevant any time you wish to evaluate your model. For example, thorough analysis of the model’s performance before deploying it or evaluation of a proposed model during the model selection and optimization stage.
Both Great Expectations and Deepchecks are able to provide data validation checks against all the anomalies described above.
Data Format : Does the library work with all data formats?
Great Expectations provides validation for tabular datasets.
Deepchecks, in addition to tabular datasets, provides validation for vision tasks such as image classification, object detection, semantic segmentation, etc.
Customisation : Is the library customisable?
Great Expectations supports writing custom Expectations. A detailed list of built-in and custom expectations can be found here.
Deepchecks also provides the ability to write custom checks, but for computer vision tasks only the PyTorch framework is supported. A detailed list of built-in checks for tabular datasets can be found here, and the list for vision tasks can be found here.
Great Expectations provides validation only for data. This means it is useful at the start of the ML lifecycle, before the datasets are passed on for training.
Deepchecks provides validation for both data and models. This means data integrity checks can be performed before passing datasets to the model, train-test validation checks after the datasets are split, and model performance evaluation checks once a trained model is available.
To summarise, both libraries have advantages and disadvantages. Deepchecks particularly stands out because it can be used for both data and model validation, covering more phases of the ML workflow than Great Expectations. Deepchecks also provides validation support for image-related tasks in addition to tabular datasets. Great Expectations provides data validation across a wide range of integrations, e.g. MySQL, Prefect, Databricks, Spark, BigQuery, etc. If the data source is one of the supported integrations, our suggestion would be to use Great Expectations. If the data is in another format, e.g. images, then Deepchecks is a great choice. Deepchecks is also a great choice if we want both data validation and model validation, which provides more rigorous tests throughout the ML workflow.