Managing Python Dependencies 😅


At FuzzyLabs, Python is our bread and butter, and we always try to find the best tooling to help us deliver complex MLOps projects. One key strength of Python is its vast ecosystem of third-party libraries, which can save developers time and effort by providing pre-built code for common tasks. However, using so many libraries makes it challenging to manage dependencies and verify each installed dependency is compatible with the rest. Python dependency management is a frequent source of pain, especially so for MLOps practitioners, who need to deal with a lot of different data science libraries in their projects.

Effective Python dependency management is crucial for a number of reasons. First and foremost, it helps to ensure that a project runs smoothly and without errors. Without it, conflicts can arise between libraries, causing crashes or unexpected behavior. While software engineers are familiar with the importance of dependency management, data scientists may not fully understand the significance of managing dependencies or know which tools to use. In this blog, we will talk about the different tools that we have tried for managing Python dependencies across various MLOps projects and why we love Poetry (not to be confused with literature). The Python community is divided over which tools to use for managing dependencies, a sentiment best echoed by the xkcd comic below.

From our experience working across a large number of MLOps projects, we've come up with some criteria for what the best tool should provide:

Usability: Ease of use and good documentation
Virtual environments: Ease of running code inside virtual environments
Python versions: Installing and switching between different Python versions
Dependency management: Adding dependencies should be straightforward and dependency conflicts should be resolved appropriately
Reproducibility: Reproducible environments (able to reproduce the same virtual environment with exactly the same dependencies, deterministically, on the same platform)
Collaboration: Virtual environments should work seamlessly across various platforms, e.g. macOS, Linux, Windows

Let’s understand each of the criteria and why we think it is important.

Usability: It’s a simple first requirement, but the tool should be easy to use - if there’s a usage barrier, then the complexities of dealing with dependencies are increased tenfold. Tightly coupled with this is documentation. The tool should have great documentation so that we understand exactly what it’s doing, how it’s doing it, and how to diagnose problems easily, if and when they occur.
Virtual environments: Virtual environments can be thought of as isolated bubbles that will not interfere with other Python projects and their Python package dependencies. These isolated spaces contain project-specific dependencies, required by each project separately. The tool should enable us to easily run code in isolation. This criterion assesses a tool for how easily it enables this isolation - is it a multi-step process or a single command? Is it clear and obvious as to how the tool does this?
Consider a scenario where one MLOps project requires “pandas==1.4.0” – a specific version of the Pandas library – while another project requires “pandas==1.5.0”. A single global installation cannot serve both projects. Virtual environments solve this problem of dependency conflict. We can easily create and maintain separate virtual environments for each project.
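To make the isolation idea above concrete, here is a minimal sketch that uses Python's standard-library `venv` module to create two separate environments, one per project. The environment names (`project_a_env`, `project_b_env`) and the helper functions are our own illustration, and the environments are created without pip so the sketch runs quickly and offline.

```python
import tempfile
import venv
from pathlib import Path


def create_env(parent: Path, name: str) -> Path:
    """Create a bare virtual environment (no pip bootstrap) and return its path."""
    env_dir = parent / name
    # with_pip=False keeps the sketch fast and offline; real projects usually want pip.
    venv.create(env_dir, with_pip=False)
    return env_dir


def is_virtual_env(env_dir: Path) -> bool:
    """A venv is marked by a pyvenv.cfg file at its root."""
    return (env_dir / "pyvenv.cfg").is_file()


with tempfile.TemporaryDirectory() as tmp:
    env_a = create_env(Path(tmp), "project_a_env")  # would hold pandas==1.4.0
    env_b = create_env(Path(tmp), "project_b_env")  # would hold pandas==1.5.0
    both_created = is_virtual_env(env_a) and is_virtual_env(env_b)
    print(both_created, env_a.name, env_b.name)
```

Each directory has its own site-packages folder, so installing a different Pandas version into each one would not cause a clash.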
Python versions: It should be easy to switch between different Python versions. This includes installing the Python version, creating a virtual environment and installing all the dependencies required for the project. This also verifies that our project is compatible across various Python versions.
Dependency management: Dependency management is the process of documenting the required libraries for your project. Every MLOps project stands tall on the shoulders of various third-party libraries. These libraries can in turn make use of other libraries and so on. Each project contains a document that provides a list of libraries and the corresponding version required to run that project.
For example, the libraries required by an example MLOps project are listed in the file shown below. The tool should install the dependencies mentioned and also provide a solution if any dependency conflicts arise while installing any of the libraries.
<pre><code>timm==0.6.12
numpy==1.19.5
pandas==1.1.5
mlflow==1.30.0
</code></pre>
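As a rough illustration of what a tool has to do with such a file, the sketch below parses the pinned lines and compares them with a set of installed versions. It only understands simple `name==version` lines, and the `installed` mapping is made up for the example; real resolvers handle far richer specifiers.

```python
from typing import Dict, Optional, Tuple

REQUIREMENTS = """\
timm==0.6.12
numpy==1.19.5
pandas==1.1.5
mlflow==1.30.0
"""


def parse_pins(requirements_text: str) -> Dict[str, str]:
    """Parse 'name==version' lines, ignoring blanks and comments."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(
    pins: Dict[str, str], installed: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


pins = parse_pins(REQUIREMENTS)
# Hypothetical state of an environment: pandas is too new and mlflow is missing.
installed = {"timm": "0.6.12", "numpy": "1.19.5", "pandas": "1.5.0"}
mismatches = find_mismatches(pins, installed)
print(mismatches)
```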
Reproducibility: Reproducibility in this context means that it should be easy for anyone to create the same virtual environment, with the same dependencies installed, on the same platform. Many tools provide one file that acts as a source of truth for all the dependencies that are required by the project. This file can be shared and we can easily reproduce an environment. The process of reproducing environments should be deterministic. It should produce the same environment and version-specific dependencies over multiple runs.
The reproducibility of virtual environments is critical, especially considering our collaboration with clients. Our deliverables must be readily shareable and replicable. Failing to meet this fundamental requirement would render our collaborative efforts futile.
Collaboration: For ease of collaboration, we expect the reproducibility of the environment to be platform agnostic. Naturally, we will have multiple developers working on the same project and we want to ensure that, regardless of the platform (macOS/Linux/Windows), everyone can reproduce the same environment with the same dependencies.

Consider a scenario where, for a hypothetical MLOps project, one developer is working on macOS and their colleagues are all on Linux. If, for example, the macOS developer adds a dependency to the project, then after the updated code has been pushed and pulled, the tool should be able to install the same version of the dependency on each Linux machine.

Here's the summary of our experience with various combinations of tools:

*works conditionally

venv + pip

venv (virtual environment) and pip (package installer for Python) are the most popular tools for managing dependencies due to their ease of use. Both ship with most Python installations: venv has been part of the standard library since Python 3.3, and pip has been bundled since Python 3.4.

We use a combination of venv and pip whereby venv manages virtual environments and pip is used to install dependencies and handle conflicts for Python packages.

Usability: Python provides detailed and example-rich documentation on both venv and pip. It provides a guide on how virtual environments work and provides an API for third-party virtual environment creators. pip provides an excellent user guide on how to install, list, search, and uninstall Python packages.
Virtual environments: Virtual environments using venv are created using the “python3 -m venv my_env” command. This command creates a folder named “my_env”. This folder contains scripts that are used to activate the virtual environment. When a project-specific dependency is installed, it gets stored in this folder.
Python versions: Installing and switching between different Python versions is not an easy task using this combination. We have to install different versions of Python globally and maintain different virtual environments for each version.
Dependency management: pip’s dependency resolver is used to handle dependency conflicts that may arise. Documentation on dependency resolution provides insights on how pip handles dependency conflicts. It also contains recommendations on best practices to specify requirements to avoid conflicts.
Reproducible environment: pip provides a way to lock virtual environments to make environments reproducible on the same platforms.

The command below creates a “requirements-freeze.txt” file that can be used to reproduce the same virtual environment, with the same dependencies, on a similar platform to the one that created the file.
<pre><code>pip freeze > requirements-freeze.txt</code></pre>

It is possible to separate dependencies into groups such as “dev” or “tests” by creating separate requirements files for each group. Separating dependencies provides a neat way to organise the dependencies for a Python project. For example, we can have a “docs” group that contains dependencies required to build the documentation or a “tests” group containing dependencies required to run tests. We can choose to install only the group of dependencies required for a specific purpose.
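To give a feel for what the freeze step captures, here is a small sketch that lists every distribution installed in the current environment as `name==version` lines, using the standard-library `importlib.metadata` module. It is an approximation for illustration only: the `freeze` function is our own helper, and pip's real output also covers cases such as editable installs that this sketch ignores.

```python
from importlib import metadata
from typing import List


def freeze() -> List[str]:
    """Return sorted 'name==version' lines for the current environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


lines = freeze()
print("\n".join(lines))
```

Running this inside two different virtual environments and comparing the output is a quick way to see whether they really contain the same packages.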

Collaboration: The “pip freeze” approach pins the various Python dependencies but also adds platform-specific dependencies to the requirements file (for example, when run on Windows, it adds the pywin32 package as a dependency). It also makes it harder to upgrade dependencies, as upgrading one package might conflict with the pinned version of another package that it depends on.

There are additional tools, such as pip-tools (which provides the pip-compile command) or pip-reqs, that we can use to solve the above issue and keep pinned dependencies fresh. But this involves even more tools to manage! 😣

Verdict 🧑‍⚖️

Comparing this tool against the criteria:

Usability ✅
Virtual environments ✅
Python versions ❌
Dependency management ✅
Reproducible environments ✅
Collaboration ❌

Conda + Conda-lock

Conda is a cross-platform, open-source package and environment management system. It is easy to use and makes it simple to switch between different virtual environments and Python versions. Additionally, it simplifies the installation of dependencies across various virtual environments that are running different Python versions.

Conda-lock creates a lightweight lockfile for Conda environments. These lock files can be used to create reproducible Conda environments with the same Python dependencies on any platform.

We use a combination of Conda and Conda-lock where Conda manages Python versions and virtual environments and Conda-lock creates cross-platform compatible environments.

Usability: Both Conda and Conda-lock are easy to use and provide rich documentation on how to use these tools. Conda documentation provides an in-depth tutorial on how to get started along with cheat sheets for the most important information about using Conda. The Conda-lock document shows how to use the tool for different source formats and various flags that can be used to configure this tool.
Virtual environments: Conda takes a similar approach to that of venv when creating virtual environments. A Conda environment is a directory that contains a specific collection of Conda packages that you have installed. Instead of creating a separate directory inside the project, it creates an environment directory inside the “miniconda3/envs” or “anaconda3/envs” folder. This folder contains directories of all the environments managed by Conda. This setup makes it easy to reuse the same environment for different projects as well.
Python versions: Conda provides an easy way to configure the Python version of a virtual environment. Passing a “python=&lt;version&gt;” package specification to the “conda create” command configures the environment with a specific version of Python when it is created. The example command below creates a virtual environment named “myfancy_env” with Python 3.10 installed inside it.

<pre><code>conda create -n myfancy_env python=3.10</code></pre>
Dependency management: Conda implements its own resolver to determine exactly which versions of packages to install so that dependency conflicts are avoided. The default resolver in Conda is slow, i.e., installing packages while resolving dependency conflicts takes time. In addition to the default resolver, Conda provides libmamba, an alternative, faster resolver that makes Conda run faster.
Reproducible environment: After installing all required dependencies, Conda enables us to export all dependencies to an “environment.yml” file that can be shared to reproduce the exact Conda environment, including the name of the environment.
To reproduce the same environment on a similar platform to the one that created the “environment.yml” file, we can use the command “conda env create -f environment.yml”.
The “conda env export” command also exports platform-specific dependencies to the “environment.yml” file, so it is not helpful if the developers are running projects across multiple platforms.

Collaboration: To make “environment.yml” compatible across various platforms, we used the Conda-lock tool. Using the “environment.yml” produced by Conda, Conda-lock creates a “lock file”, which contains a set of URLs to download for all the packages specified in the “environment.yml” file. This lock file can be shared and used to recreate the same environment on any platform.
<pre><code># install conda-lock inside the virtual environment
pip install conda-lock
# generate a multi-platform lockfile
conda-lock -f environment.yml -p osx-64 -p linux-64</code></pre>

You can easily test the above using an example “environment.yml” file provided here.

Verdict 🧑‍⚖️

Comparing this tool against the criteria for the best tool:

Ease of use and good documentation ✅
Virtual environments ✅
Python versions ✅
Dependency management ✅
Reproducible environments ✅
Collaboration ❌ ✅ (works only conditionally)

Poetry + Pyenv

Poetry is an open source tool for Python dependency management and packaging. It is one of the most feature-rich dependency management tools for Python. Poetry creates a lockfile which ensures reproducible environments across various platforms.

Pyenv is an open source tool for managing all Python versions. It provides an easy way to switch between multiple versions of Python.

We use a combination of Pyenv and Poetry whereby Pyenv manages Python versions and Poetry manages Python dependencies.

Ease of use and good documentation: Poetry has fantastic documentation which covers everything you might want to know and the pyenv documentation consists of an extensive project readme. Installing these two libraries is tricky and getting it right is important. For example, Pyenv requires dependencies that need to be installed based on your platform. Both these libraries are straightforward and easy to use once the installation and the initial configuration process is figured out.
Virtual environments: Poetry has support for creating virtual environments. The documentation states that “Poetry makes project environment isolation one of its core features.” By default, Poetry will use the same Python version that was used during Poetry’s installation to create a virtual environment. Poetry can easily be configured to use a specific Python version, installed with Pyenv, when creating a virtual environment. Similar to Conda, Poetry stores virtual environments inside a separate folder so that we can reuse the same environment for multiple projects.
Python versions: Pyenv is used to install and maintain different Python versions. The process is as simple as the following two lines. The first command installs the desired Python version. The second command activates the newly installed version of Python for the current project. Poetry can then make use of this Python version to create a virtual environment.
<pre><code>pyenv install 3.9.8
# Activate Python 3.9 for the current project
pyenv local 3.9.8</code></pre>
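Running `pyenv local` records the chosen version in a `.python-version` file in the project folder. The sketch below shows one way a project script could check that the running interpreter matches that pin. The helper names are our own, and the sketch assumes a plain numeric CPython version such as `3.9.8`.

```python
import tempfile
from pathlib import Path
from typing import Optional, Sequence


def read_pinned_version(project_dir: Path) -> Optional[str]:
    """Return the first version listed in .python-version, or None if absent."""
    version_file = project_dir / ".python-version"
    if not version_file.is_file():
        return None
    entries = version_file.read_text().split()
    return entries[0] if entries else None


def matches_pin(pinned: str, running: Sequence[int]) -> bool:
    """Compare a numeric pin such as '3.9.8' with a version tuple, prefix-wise."""
    parts = tuple(int(p) for p in pinned.split("."))
    return tuple(running[: len(parts)]) == parts


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    # Simulate what `pyenv local 3.9.8` leaves behind in the project folder.
    (project / ".python-version").write_text("3.9.8\n")
    pinned = read_pinned_version(project)

print(pinned, matches_pin(pinned, (3, 9, 8)), matches_pin(pinned, (3, 10, 0)))
```

In a real project you would pass `sys.version_info[:3]` as the running version rather than a hard-coded tuple.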

Dependency management: Poetry implements its own exhaustive dependency resolver which, as per the documentation, will always find a solution if it exists. Interestingly, others have found that Poetry can take quite a long time to resolve dependencies.
Reproducible environment: One of the most important files for working with Poetry is the “pyproject.toml” file. All the configuration and dependencies for a Python project are specified in this file. Poetry resolves all dependencies listed in the “pyproject.toml” file and downloads the latest versions that satisfy the specified constraints. All the packages and their exact versions are then recorded in the “poetry.lock” file, which pins the specific versions of all the packages required by the project.

Poetry provides a way to organise project dependencies by groups. Adding dependencies to separate groups, such as docs and tests, creates a separate section for each group in “pyproject.toml”. For example, the pytest and mkdocs packages can be added to the “test” and “docs” groups using the commands “poetry add pytest -G test” and “poetry add mkdocs -G docs”.

An example “pyproject.toml” file looks like the following:
<pre><code>[tool.poetry]
name = "my-project"
version = "0.1.0"
description = ""
authors = ["Your Name <you@example.com>"]
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.10, <3.11"
pandas = "1.5.2"
sqlalchemy = "<=1.4.41"
zenml = {version = "0.35.0", extras = ["server"]}

[tool.poetry.group.test.dependencies]
pytest = "^6.0.0"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"</code></pre>
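The `pytest = "^6.0.0"` line above uses a caret constraint, which allows updates that do not change the left-most non-zero version component. As a rough illustration, the sketch below expands a caret constraint into its lower (inclusive) and upper (exclusive) bounds. The `caret_bounds` helper is our own, not part of Poetry, and it only handles full MAJOR.MINOR.PATCH versions.

```python
from typing import Tuple


def caret_bounds(constraint: str) -> Tuple[str, str]:
    """Return (inclusive lower, exclusive upper) for a constraint like '^6.0.0'."""
    if not constraint.startswith("^"):
        raise ValueError("expected a caret constraint such as '^6.0.0'")
    parts = [int(p) for p in constraint[1:].split(".")]
    if len(parts) != 3:
        raise ValueError("this sketch only handles MAJOR.MINOR.PATCH")
    # Index of the left-most non-zero component; fall back to the patch slot.
    bump = next((i for i, p in enumerate(parts) if p != 0), 2)
    upper = parts[:bump] + [parts[bump] + 1] + [0] * (2 - bump)
    lower = ".".join(map(str, parts))
    return lower, ".".join(map(str, upper))


for example in ("^6.0.0", "^0.2.3", "^0.0.3"):
    low, high = caret_bounds(example)
    print(f"{example} means >={low},<{high}")
```

So the pytest entry accepts any 6.x release, while the exact pins for pandas and zenml accept only the listed version.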

Collaboration: To ensure all developers are using the same versions of each package, we can commit this “poetry.lock” file to version control. Anyone cloning the project can run the following install command to install the dependencies and create the same environment, regardless of the platform.

<pre><code>poetry install</code></pre>

Poetry uses the exact versions listed in “poetry.lock” to ensure that the package versions are consistent for everyone working on the project.

So far, we have tested and are very happy with using this workflow across various MLOps projects. It works straight out of the box 😀!

Verdict 🧑‍⚖️

Comparing this tool against the criteria for the best tool:

Usability ✅
Virtual environments ✅
Python versions ✅
Dependency management ✅
Reproducible environments ✅
Collaboration ✅

In addition, Poetry also provides a way to package and upload the Python libraries to PyPI using “poetry build” and “poetry publish” commands.

Conclusion

We have explored in depth three combinations of tools, venv + pip, Conda + Conda-lock and Pyenv + Poetry, that we have used on various MLOps projects. We also compared each combination against our criteria for the best tool and provided a verdict on how it fared.

The combination of Pyenv and Poetry has provided us with an excellent workflow, allowing us to stay focused and unburdened by dependency management. We are delighted with its effectiveness. Of course, everyone has their own preferred combination of tools for working on Python projects. The Python ecosystem contains many other combinations, such as Pipenv + Pyenv, Hatch or PDM, that can also meet the outlined criteria.

Let us know what combination of tools your team uses for managing Python dependencies.

