Real-time Object Detection using SSD MobileNet V2 on Video Streams

An easy workflow for implementing pre-trained object detection architectures on video streams

Odemakinde Elisha
Heartbeat

--

In this article, we’ll be learning the following:

  1. What object detection is
  2. Various TensorFlow models for object detection
  3. Implementing MobileNetV2 on video streams
  4. Conclusion
  5. References

What is Object Detection?

Object detection can be defined as a branch of computer vision that deals with the localization and identification of objects. Object localization and identification are two different tasks that are combined to achieve the single goal of object detection.

Object localization deals with specifying the location of an object in an image or a video stream, while object identification deals with assigning the object a specific label, class, or description. With computer vision, developers can flexibly build things like surveillance and tracking systems for enhanced security, real-time crop prediction, and real-time disease identification and tracking in human cells.

TensorFlow Model Zoo for Object Detection

The TensorFlow Model Zoo is a collection of pre-trained object detection architectures that have performed tremendously well on the COCO dataset. The model zoo can be found here. The model architectures include:

  1. CenterNet
  2. EfficientDet
  3. MobileNet
  4. RetinaNet
  5. R-CNN
  6. ExtremeNet

CenterNet (2019) is an object detection architecture based on a deep convolution neural network trained to detect each object as a triplet (rather than a pair) of keypoints, so as to improve both precision and recall. More information about this architecture can be found here.

EfficientDet (2019) is an object detection architecture built to scale up model efficiency in computer vision. This architecture achieves much better efficiency than prior architectures across a wide spectrum of resource constraints. More information about this architecture can be found here.

MobileNet (2017) is an efficient CNN architecture designed for mobile and embedded vision applications. It uses depth-wise separable convolutions to build lightweight deep neural networks and, when paired with a detection head such as SSD, serves as a fast object detector. More information about the architecture can be found here.

RetinaNet is an architecture developed by Facebook's research team in 2017. RetinaNet uses a Feature Pyramid Network (FPN) backbone on top of a feed-forward ResNet architecture to generate a rich, multi-scale convolutional feature pyramid. It is a one-stage detector (that is, a single network, unlike R-CNN, which is two-stage). More information about the architecture can be found here.

R-CNN (2014) is a two-stage object detection architecture. It is a region-based CNN that first generates regions of interest (the original R-CNN uses selective search; the later Faster R-CNN replaces this with a learned Region Proposal Network), then sends each region proposal down the pipeline for object classification and bounding-box regression. More information about the architecture can be found here.

ExtremeNet (2019) is a bottom-up object detection framework that detects each object via its four extreme points (top-most, left-most, bottom-most, right-most), predicting a multi-peak heatmap for each extreme point type per object category. More information about the architecture can be found here.

Let’s go ahead and try out one of these model architectures on a typical video stream.

Implementation of MobileNetV2 on video streams

The following steps will help us achieve our object detection goal:

  1. Install the TensorFlow Object Detection API.
  2. Download the model file from the TensorFlow model zoo.
  3. Set up the configuration file and model pipeline.
  4. Create a script to put them together.

Installing TensorFlow Object Detection API

To get this done, refer to this blog:

Downloading the model file from the TensorFlow model zoo

To download the network architecture, you can follow the process below (a short download sketch follows this list):

  1. Download the MobileNetV2 pre-trained model to your machine.
  2. Move it to the object detection folder.
  3. Create a main.py Python script to run the real-time program.
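
As a rough sketch, the pre-trained checkpoint can also be fetched and unpacked programmatically with tf.keras.utils.get_file. The model name and release date below are assumptions and should be confirmed against the model zoo page:

```python
import tensorflow as tf

# Assumed model name and release date -- check the TF2 detection zoo page for the exact values.
MODEL_DATE = '20200711'
MODEL_NAME = 'ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8'
BASE_URL = 'http://download.tensorflow.org/models/object_detection/tf2/'

# Download and untar the pre-trained checkpoint; get_file caches it under ~/.keras by default.
model_dir = tf.keras.utils.get_file(
    fname=MODEL_NAME,
    origin=BASE_URL + MODEL_DATE + '/' + MODEL_NAME + '.tar.gz',
    untar=True)
print('Model extracted to:', model_dir)
```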

Having installed the TensorFlow Object Detection API, the next step is to import all the required libraries; the code below illustrates this. Take note that we also need packages like SciPy and NumPy for numerical computation, and PIL and Matplotlib for image processing and visualization:
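
A minimal set of imports, assuming the TF2 Object Detection API is installed as the object_detection package, might look like this:

```python
import os

import cv2                                  # OpenCV: video capture, writing, and display
import numpy as np
import tensorflow as tf
from PIL import Image
import matplotlib.pyplot as plt             # optional: plotting still images

# Utilities from the TensorFlow Object Detection API
from object_detection.utils import config_util
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as viz_utils
from object_detection.builders import model_builder
```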

Having imported all needed libraries, the next step is to write a simple Python script that helps us load images or convert real-time video frames into NumPy arrays. The code below helps us to get this done efficiently:
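
A minimal version of such a helper might look like the following sketch; note that OpenCV frames are already NumPy arrays, just in BGR channel order:

```python
def load_image_into_numpy_array(path):
    """Load an image from disk into a (height, width, 3) uint8 NumPy array."""
    return np.array(Image.open(path).convert('RGB'))


def frame_to_numpy_array(frame_bgr):
    """Convert an OpenCV BGR video frame to the RGB array the detector expects."""
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
```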

Setting up the configuration file and model pipeline

Now that we can efficiently convert video frames into arrays, let’s go ahead and set up the configuration file and model pipeline. To do this, we can follow these steps:

  1. Identify the path to the pipeline config of our MobileNetV2 model. This configuration file defines the model architecture and its parameters.
  2. Specify the checkpoint file of the model to be used (model_dir).
  3. Initialize the detection model by building it from the pipeline config.
  4. Use TensorFlow to restore the model’s last checkpoint by specifying the checkpoint directory.

All of the above is completed in the gist below:
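
A sketch of these four steps, assuming model_dir points at the checkpoint extracted earlier, might look like this:

```python
# 1. Paths to the pipeline config and checkpoint directory of the extracted model.
pipeline_config = os.path.join(model_dir, 'pipeline.config')
checkpoint_dir = os.path.join(model_dir, 'checkpoint')

# 2. & 3. Build the detection model object from the pipeline config (inference mode).
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
detection_model = model_builder.build(
    model_config=configs['model'], is_training=False)

# 4. Restore the pre-trained weights from the last checkpoint.
ckpt = tf.compat.v2.train.Checkpoint(model=detection_model)
ckpt.restore(os.path.join(checkpoint_dir, 'ckpt-0')).expect_partial()
```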

Next, we need to wire the full pipeline together so that it detects objects in images and assigns labels from the pre-trained model. To do this, the get_model_detection_function in the gist below helps to:

  1. Pre-process the image.
  2. Assign a target label to each object in the image.
  3. Predict the probability of the target label for each object in the frame.

The script below helps us complete this process:
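
A typical implementation wraps the model's preprocess, predict, and postprocess calls in a tf.function; a sketch following that pattern:

```python
def get_model_detection_function(model):
    """Return a tf.function that runs detection on a batched image tensor."""

    @tf.function
    def detect_fn(image):
        image, shapes = model.preprocess(image)                   # resize + normalize
        prediction_dict = model.predict(image, shapes)             # raw network outputs
        detections = model.postprocess(prediction_dict, shapes)    # boxes, classes, scores
        return detections, prediction_dict, tf.reshape(shapes, [-1])

    return detect_fn


detect_fn = get_model_detection_function(detection_model)
```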

Last but not least, we need to initialize the label mapping. The label map contains the target labels of the pre-trained classes and is used to help the model specify the label name of every object identified in a frame. The gist below helps us specify the path to the label map and load all labels with their associated values:
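
A minimal sketch, reading the label map path from the pipeline config loaded earlier:

```python
# The pipeline config records where the COCO label map lives
# (typically mscoco_label_map.pbtxt inside the Object Detection API repo).
label_map_path = configs['eval_input_config'].label_map_path

# Map class ids (1..90) to human-readable names such as 'person' or 'car'.
category_index = label_map_util.create_category_index_from_labelmap(
    label_map_path, use_display_name=True)
```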

Now that all this is set, the next step is to initialize the video stream with OpenCV, and then initialize a video writer. The code below helps us to get this done:
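
A minimal sketch, assuming the default webcam (device index 0) and an XVID-encoded AVI output file:

```python
# Open the default webcam (replace 0 with a video file path to read from disk).
cap = cv2.VideoCapture(0)
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

# Write the annotated frames to disk as they are processed.
fourcc = cv2.VideoWriter_fourcc(*'XVID')
writer = cv2.VideoWriter('detections.avi', fourcc, 20.0,
                         (frame_width, frame_height))
```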

Putting it all together

Now that we have the video stream and the writer in place, the next step is to keep the video stream live and perform real-time object detection by looping through the frames captured from the video stream. As long as this keeps running, we can visually see the object detection results displayed on our screen.

Finally, once the stream goes off, the video writer converts all frames captured so far into a video (with the real-time object detection results). The code below helps us get this done from end to end.
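
A sketch of that end-to-end loop, assuming the objects defined above (cap, writer, detect_fn, and category_index), might look like this:

```python
label_id_offset = 1  # COCO label map ids start at 1, model class outputs start at 0

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break  # stream ended

    # OpenCV gives BGR frames; convert to RGB before feeding the detector.
    rgb_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    input_tensor = tf.convert_to_tensor(
        np.expand_dims(rgb_frame, axis=0), dtype=tf.float32)
    detections, _, _ = detect_fn(input_tensor)

    # Draw boxes, labels, and confidence scores onto the original frame.
    viz_utils.visualize_boxes_and_labels_on_image_array(
        frame,
        detections['detection_boxes'][0].numpy(),
        (detections['detection_classes'][0].numpy() + label_id_offset).astype(int),
        detections['detection_scores'][0].numpy(),
        category_index,
        use_normalized_coordinates=True,
        max_boxes_to_draw=100,
        min_score_thresh=0.5)

    cv2.imshow('Real-time object detection', frame)
    writer.write(frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
writer.release()
cv2.destroyAllWindows()
```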

Result and conclusion

The video above shows an active demonstration of everything we’ve been talking about. Though this was recorded in BGR format, you can always specify RGB while trying out your own real-time object detector with the MobileNetV2 architecture.

Lastly, in the video, it took a while before the architecture could identify people toward the back of the scene, as well as a few close by. This doesn’t mean the architecture isn’t capable of doing so. A likely reason is that the footage was recorded in BGR format, while the network expects RGB.

Nonetheless, the BGR recording isn’t, on its own, a sufficient explanation for the model missing people toward the back, because BGR frames can always be converted to RGB before being fed into the network for real-time prediction.

The performance of the model on unseen data (the video frames) is impressive because it maintained its COCO pre-trained accuracy on a video stream it was never trained on. To further enhance its performance on frames like these, we’d need to retrain the architecture on more data, particularly ground-truth data that represents the core problem we’re trying to solve.

I do hope you’ve learned a lot from this tutorial. If so, do share with friends and colleagues.

Thank you.

References

  1. https://github.com/tensorflow/models/tree/master/research/object_detection
  2. https://cocodataset.org/
  3. https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md
  4. https://arxiv.org/abs/1904.08189
  5. https://github.com/Duankaiwen/CenterNet
  6. https://arxiv.org/pdf/1911.09070.pdf
  7. https://arxiv.org/pdf/1704.04861.pdf
  8. https://keras.io/examples/vision/retinanet/#downloading-the-coco2017-dataset
  9. https://arxiv.org/abs/1708.02002
  10. https://github.com/facebookresearch/Detectron
  11. https://arxiv.org/abs/1311.2524
  12. https://arxiv.org/pdf/1311.2524.pdf
  13. https://www.reddit.com/r/MachineLearning/comments/e9nm6b/d_what_is_the_definition_of_onestage_vs_twostage/
  14. https://github.com/elishatofunmi/Computer-Vision/blob/master/static%20video%20object%20detection%20and%20tracking/real_time.py

