This is the final post in our series for ZenML’s Month of MLOps competition. The challenge was to build a creative ML application using ZenML. At Fuzzy Labs, we came up with a fun and challenging idea: deploying our ML application on an edge device, specifically the Nvidia Jetson Nano. What makes edge deployment interesting is that resources are constrained, so we have to make the most of what is available.
Our application is built around an ML model trained to play Dobble. If you are wondering what Dobble is, an explainer video by the most experienced Dobble player on our team should answer that question. In the first post of the series, we covered the data science: creating, preparing and packaging the dataset, and our motivation for choosing an SSD model. In the second post, we dived deeper into the ZenML pipelines that take us from data to a trained object detection model, ready for deployment. ZenML helped streamline the journey from model development to deployment. In this last part of the series, we outline the steps we took to play Dobble on the edge device.
This is the final blog in a three-part series:
The data science: how our model works and how we trained it.
The pipeline: a deep dive into the ZenML pipeline that ties everything together, going from data input to model deployment.
The edge deployment: how we play Dobble on the Jetson Nano.
Accelerate the model with TensorRT
In the previous post, we created an MLOps pipeline using ZenML that trains an object detection model. This model learns to recognise all the symbols on a Dobble card. The final output of the pipeline is the trained PyTorch model exported to ONNX (Open Neural Network Exchange). We convert the PyTorch model to ONNX for the downstream task of running inference on the Jetson using TensorRT. TensorRT is a high-performance inference library developed by Nvidia that can provide up to a 4x speed-up compared to running the PyTorch model directly. The Nvidia TensorRT documentation describes three ways of converting a model to a TensorRT engine. The first is to use the TF-TRT library for TensorFlow models. The second, which we use in our workflow, is automatic conversion from an ONNX model. The third is to construct the network manually using the TensorRT API.
According to the TensorRT docs, ONNX conversion is generally the most performant way of automatically converting a model to a TensorRT engine.
Given an image of a Dobble card as input, the ONNX object detection model outputs a set of bounding boxes and predicted labels. The cherry-picked example below shows the prediction for one input image: the image on the left is the original Dobble card, the image in the middle is the ground truth with bounding boxes for each symbol, and the image on the right shows the model's predictions. Surprised 😯!
In a real game of Dobble there are two cards in play. Since the model was trained on images of individual cards, we need a way to separate them. Performance would be worse on scenes with multiple cards on a background, because the background interferes and leads the model to mispredict. To separate the cards, we use standard OpenCV techniques such as thresholding, morphological operations and contour finding to get a crop of each individual card. The example below shows two cropped images identified from the original image.
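The idea behind the card separation step can be illustrated without any dependencies. The sketch below is a simplified stand-in, not the OpenCV code from the application: it thresholds a grayscale grid and flood-fills connected bright regions to get one bounding box per card, which is roughly the job that `cv2.threshold`, `cv2.findContours` and `cv2.boundingRect` do together. It assumes the cards are brighter than the background, it replaces the morphological clean-up with a crude minimum-area filter, and the names `find_card_boxes`, `threshold` and `min_area` are hypothetical.

```python
from collections import deque


def find_card_boxes(gray, threshold=128, min_area=4):
    """Return one (left, top, right, bottom) box per bright region.

    `gray` is a list of rows of 0-255 intensities. Pixels at or above
    `threshold` count as "card"; regions smaller than `min_area` pixels
    are treated as noise and dropped.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []

    for y in range(height):
        for x in range(width):
            if seen[y][x] or gray[y][x] < threshold:
                continue
            # Breadth-first flood fill over 4-connected bright pixels.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            area = 0
            while queue:
                cx, cy = queue.popleft()
                area += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and gray[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if area >= min_area:
                boxes.append((left, top, right, bottom))
    return boxes
```

Each returned box can then be used to crop one card out of the frame before it is passed to the model. On real camera frames a library such as OpenCV is a much better fit, since a pure Python loop over every pixel is slow.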
Each card crop can then be passed to the model to get predictions (bounding boxes and labels). A big shout-out to the jetson-inference library, which did most of the heavy lifting for us and reduced the inference code on the Jetson to just two lines: one function call that parses the ONNX model and creates a TensorRT engine, and another that runs inference with that engine and returns the detections.
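As a rough sketch of what those two calls can look like with jetson-inference's Python bindings: the file paths, the helper names (`detectnet_argv`, `load_detector`, `detect_symbols`) and the layer names `input_0`, `scores` and `boxes` are assumptions here, the latter borrowed from the library's SSD ONNX examples, so they may differ from our exported model. Older releases of the library expose the same class as `jetson.inference.detectNet`.

```python
def detectnet_argv(model_path, labels_path):
    """Build the command-line style arguments detectNet uses to load an ONNX model."""
    return [
        f"--model={model_path}",
        f"--labels={labels_path}",
        # Assumed layer names, taken from jetson-inference's SSD ONNX examples.
        "--input-blob=input_0",
        "--output-cvg=scores",
        "--output-bbox=boxes",
    ]


def load_detector(model_path, labels_path, threshold=0.5):
    # Imported lazily so this module can still be read and tested off-device.
    from jetson_inference import detectNet

    # Call one: parse the ONNX model and build the TensorRT engine.
    return detectNet(argv=detectnet_argv(model_path, labels_path), threshold=threshold)


def detect_symbols(net, image):
    """Run inference and return plain dicts instead of detection objects."""
    # Call two: run the TensorRT engine on the image.
    detections = net.Detect(image)
    return [
        {
            "label": net.GetClassDesc(d.ClassID),
            "confidence": d.Confidence,
            "box": (d.Left, d.Top, d.Right, d.Bottom),
        }
        for d in detections
    ]
```

Converting the detections to plain dictionaries is only a convenience for later steps; the detection objects can equally be used directly.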
Next, we write a simple match-and-compare function that finds the matching symbol among all the symbols identified on the cropped cards, and draws bounding boxes around it as output. Voilà, we won the round 🎉!
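A minimal sketch of such a matching function is shown here, assuming each card's predictions arrive as a list of dictionaries with `label`, `confidence` and `box` keys. That format and the name `find_matching_symbol` are illustrative assumptions rather than the exact code in our repo. Since any two Dobble cards share exactly one symbol, the match is the label present on every card; if noisy predictions produce more than one candidate, the sketch keeps the one with the highest total confidence.

```python
def find_matching_symbol(cards):
    """Find the symbol shared by every card.

    `cards` holds one list of detections per cropped card. Returns
    (label, [box on card 1, box on card 2, ...]) or None if no label
    appears on all cards.
    """
    if len(cards) < 2:
        return None

    # Keep only the most confident detection of each label on each card.
    per_card = []
    for detections in cards:
        best = {}
        for det in detections:
            current = best.get(det["label"])
            if current is None or det["confidence"] > current["confidence"]:
                best[det["label"]] = det
        per_card.append(best)

    # Labels that were detected on every card.
    common = set(per_card[0]).intersection(*per_card[1:])
    if not common:
        return None

    # Break ties between candidates using the summed confidence.
    label = max(
        sorted(common),
        key=lambda name: sum(best[name]["confidence"] for best in per_card),
    )
    return label, [best[label]["box"] for best in per_card]
```

The returned boxes are in the coordinates of each card crop, so they need to be offset by the crop position before being drawn on the original frame.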
The diagram above illustrates the components of the inference application and the steps taken to get the final output: a bounding box around the matched symbol. Having stitched these components together, it was time to deploy and test the application on the Jetson Nano.
Serving models on the Jetson Nano
Our initial approach for fetching the ONNX model for deployment was to use ZenServer, short for ZenML server. All pipeline runs get pushed to ZenServer, which keeps track of them and their related metadata in a database. This makes it easy to fetch specific steps from any pipeline run along with their output artifacts. In our workflow, we query the pipeline runs to fetch the output artifact of the ONNX export step, which in this case is the ONNX model. We wrote a client that connects to the ZenML server and fetches the latest ONNX model on the Jetson itself, updating the model used for inference there. But this approach failed, because we were unable to install the ZenML package on the Jetson. ZenML does not have a release for ARM, the architecture used by the Jetson Nano, and we could not build it from source either. The issue comes from ml-metadata, one of ZenML's dependencies, which is not available on ARM. So we had to fetch models in a different way.
We used an MLflow server as a replacement for ZenServer. MLflow was already part of our stack as the experiment tracker in the training pipeline. It was deployed using the awesome MLOps recipes, which automate the creation of the cloud resources required for running ZenML pipelines. In addition to tracking parameters and metrics, we had also logged the final trained PyTorch and ONNX models as artifacts. All we needed to change was the logic for fetching the ONNX model, so that it came from the MLflow server instead of the ZenML server. Authenticating against this server from the Jetson involves setting two environment variables, “MLFLOW_TRACKING_USERNAME” and “MLFLOW_TRACKING_PASSWORD”, and passing the MLflow server URL as the “tracking_uri” parameter of the MLflow client. The client is then used to download the logged artifact, in this case the ONNX model, for a particular “run_id”. The run_id is a unique identifier for each logged experiment run.
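The fetching logic can be sketched roughly as follows. The helper names, the `models` destination directory and the `artifact_path` argument are placeholders rather than values from our setup; the environment variables and the `tracking_uri` parameter are the ones described above.

```python
import os


def configure_mlflow_auth(username, password, env=None):
    """Set the credentials MLflow reads for HTTP basic authentication."""
    env = os.environ if env is None else env
    env["MLFLOW_TRACKING_USERNAME"] = username
    env["MLFLOW_TRACKING_PASSWORD"] = password
    return env


def fetch_onnx_model(tracking_uri, run_id, artifact_path, dst_dir="models"):
    """Download one logged artifact for `run_id` and return its local path."""
    # Imported lazily so the helper above works without MLflow installed.
    from mlflow.tracking import MlflowClient

    os.makedirs(dst_dir, exist_ok=True)
    client = MlflowClient(tracking_uri=tracking_uri)
    # artifact_path is relative to the run's artifact root (placeholder here).
    return client.download_artifacts(run_id, artifact_path, dst_dir)
```

The credentials need to be set before the client makes its first request. Recent MLflow releases also offer `mlflow.artifacts.download_artifacts`, which may be preferable depending on the installed version.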
All of this logic, both fetching the model (detailed above) and running the inference application with the fetched model (detailed in the previous section), is packaged together into a Docker image on the Jetson. This way, the process for updating the model also lives on the Jetson itself. The input to the application can be either a camera device or a recorded video of the game. The application runs inference on the input and draws a bounding box around the matching symbols on the Dobble cards as output. A shout-out to the jetson-utils library, which implements the logic for initialising the camera or input video, pre-processing frames from either source for the model, and creating an output video stream.
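A stripped-down version of the capture, detect and render loop might look like the sketch below. It assumes the `videoSource` and `videoOutput` classes from jetson-utils and any detector object with a `Detect` method, such as jetson-inference's `detectNet`; the function names and the `max_frames` parameter are hypothetical, and the card cropping and symbol matching steps are left out for brevity.

```python
def run_inference_loop(source, net, output, max_frames=None):
    """Capture frames, run detection and render until a stream closes.

    Returns the number of frames processed.
    """
    frames = 0
    while True:
        image = source.Capture()
        # Newer jetson-utils releases return None when a capture times out.
        if image is not None:
            net.Detect(image)  # detectNet overlays boxes on the image by default
            output.Render(image)
            frames += 1
        if max_frames is not None and frames >= max_frames:
            break
        # Checked after Capture(), since the stream is opened on first capture.
        if not source.IsStreaming() or not output.IsStreaming():
            break
    return frames


def main(input_uri, output_uri, net):
    # Imported lazily so the loop above can be exercised off-device.
    from jetson_utils import videoSource, videoOutput

    # input_uri can be a camera such as "/dev/video0" or a video file path;
    # output_uri can be a display such as "display://0" or a file to write.
    return run_inference_loop(videoSource(input_uri), net, videoOutput(output_uri))
```

Passing the source, detector and output in as arguments keeps the loop independent of the hardware, which makes it possible to try out with stand-in objects on a laptop before running it on the Jetson.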
In this article, we outlined the steps we took and the challenges we encountered in deploying ML models to the edge with the Nvidia Jetson Nano. And, finally, we’re ready to take on anybody at the game of Dobble, so long as we can cheat! If you want to play along, why not check out our repo on GitHub?