Miguel Méndez

Multi-Object Tracking by Detection: A Comprehensive Guide

Exploring the Advancements and Accuracy in Multi-Object Tracking Technologies


| Method | Year | Appearance features | Camera compensation | HOTA (MOT20) | Extra data |
|------------|------|---------------------|---------------------|--------------|------------|
| SORT | 2016 | | | | |
| DeepSORT | 2017 | ✅ | | | |
| ByteTrack | 2021 | | | 61.3 | |
| BoT-SORT | 2022 | ✅ | ✅ | 63.3 | |
| SMILEtrack | 2022 | ✅ | ✅ (?) | 63.4 | |

Introduction

Tracking by detection is an object tracking approach that first detects objects in each frame of a video and then associates the detections across frames. This process involves matching detections by analyzing their location, appearance, or motion characteristics. Tracking by detection has become the most popular method for addressing object tracking due to the rapid development of reliable object detectors.

This blog is my way of keeping up with the literature on tracking by detection methods. I plan to update it regularly with new information and resources I find interesting. I have included the SORT and DeepSORT papers in the list, despite their age, because they laid the groundwork for many of the techniques covered here.

SORT

SORT is a simple and effective method from 2016 that quickly became a standard in the field. The authors' main goal was to create the fastest possible tracker, relying on the quality of the object detector's predictions. Appearance features of the objects are not used; the system relies solely on bounding box position and size.

They employ two classical methods: a Kalman Filter to predict each track's motion, and the Hungarian algorithm to associate detections with tracks.

SORT architecture diagram

  1. The object detector returns bounding boxes for frame 0.
  2. At \(T=0\), a new track is created for each predicted bounding box.
  3. The Kalman Filter predicts a new position for each track.
  4. The object detector returns bounding boxes for frame 1.
  5. These bounding boxes are associated with the track positions predicted by the Kalman Filter.
  6. New tracks are created for unmatched bounding boxes.
  7. Unmatched tracks are terminated if they are not matched to any detection for \(T_{Lost}\) frames.
  8. Matched tracks and new tracks are passed to the next time step.
  9. Back to step 3.
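For intuition, here is a minimal sketch of the association step (steps 5-7 above), assuming axis-aligned `[x1, y1, x2, y2]` boxes and SciPy's Hungarian solver; the `iou_threshold` value is illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted_boxes, detections, iou_threshold=0.3):
    """Hungarian matching on a (1 - IoU) cost matrix."""
    cost = np.array([[1.0 - iou(t, d) for d in detections]
                     for t in predicted_boxes])
    rows, cols = linear_sum_assignment(cost)
    # Reject assignments whose overlap is below the IoU threshold
    matches = [(t, d) for t, d in zip(rows, cols)
               if cost[t, d] <= 1.0 - iou_threshold]
    unmatched_tracks = set(range(len(predicted_boxes))) - {t for t, _ in matches}
    unmatched_dets = set(range(len(detections))) - {d for _, d in matches}
    return matches, unmatched_tracks, unmatched_dets
```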

DeepSORT

DeepSORT is an extension of SORT that incorporates appearance features. A simple CNN extracts an appearance descriptor from each detected bounding box, which improves tracking, especially through occlusions: an object can be re-identified by appearance similarity after being occluded for a long period of time.

Each track maintains a gallery of its last \(n\) appearance descriptors, so the cosine distance can be computed between new detections and the stored descriptors. Track age, measured in frames since the last association, plays a crucial role in the association process: instead of a single-step association between predicted Kalman states and new measurements, DeepSORT adopts a cascade that prioritizes tracks with lower ages.

DeepSORT architecture diagram

There is a small modification to the Kalman Filter prediction step that is included in the code but not mentioned in the original paper. The matrices \(Q\) and \(R\) of the Kalman Filter were chosen in SORT to be time independent; in DeepSORT, however, they are chosen as functions of the scale of the bounding box. This may be because the scale is less likely to change over time than other state components, and it can also help compensate for changes in the camera's viewpoint.
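As a rough sketch of that idea, the noise matrices can be rebuilt at every step from the current bounding box height (the weight values below mirror the reference DeepSORT implementation, but treat them as assumptions of this sketch):

```python
import numpy as np

STD_POS = 1.0 / 20   # assumed position/size noise weight
STD_VEL = 1.0 / 160  # assumed velocity noise weight

def process_noise(h):
    """Q for the 8-dim state [x, y, a, h, vx, vy, va, vh]: every term
    except the aspect ratio `a` scales with the bounding box height."""
    std = [STD_POS * h, STD_POS * h, 1e-2, STD_POS * h,
           STD_VEL * h, STD_VEL * h, 1e-5, STD_VEL * h]
    return np.diag(np.square(std))

def measurement_noise(h):
    """R for the 4-dim measurement [x, y, a, h]."""
    std = [STD_POS * h, STD_POS * h, 1e-1, STD_POS * h]
    return np.diag(np.square(std))
```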

The cascade association step would look like this:

```python
# Cascade association: tracks matched more recently (lower age) get priority.
for track_age in range(1, maximum_age + 1):
    # Select tracks whose last successful association was `track_age` frames ago
    tracks_to_associate = get_tracks_with_age(tracks, track_age)
    # Match those tracks against the detections that are still available
    associate(tracks_to_associate, detections)
    # Matched detections are removed so older tracks cannot claim them
    remove_associated_detections(detections)
```

ByteTrack

ByteTrack is a recent object tracking algorithm that proposes a simple but effective optimization of the data association step. Most methods filter out detections with low confidence scores, since these are more likely to be false positives. However, discarding them can cause problems when tracking objects that are partially occluded or that undergo significant appearance changes.

ByteTrack addresses this problem by using all detections, regardless of their confidence score. The algorithm works in two steps:

  1. High-confidence detections: High-confidence detections are associated with tracks using intersection-over-union (IoU) or appearance features. Both approaches are evaluated in the results section of the paper.
  2. Low-confidence detections: Low-confidence detections are associated with tracks using only IoU. This is because low-confidence detections are more likely to be spurious or inaccurate, so it is important to be more conservative when associating them with tracks.
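A minimal sketch of this two-stage matching, reusing the IoU-based `associate` helper from the SORT sketch above (the threshold values are illustrative, not the paper's tuned ones):

```python
def byte_associate(track_boxes, det_boxes, det_scores,
                   high_thresh=0.6, low_thresh=0.1):
    """Two-stage BYTE-style association over one frame's detections."""
    high = [b for b, s in zip(det_boxes, det_scores) if s >= high_thresh]
    low = [b for b, s in zip(det_boxes, det_scores)
           if low_thresh <= s < high_thresh]

    # Stage 1: high-confidence detections vs. all predicted tracks
    # (the paper also evaluates an appearance-based variant here).
    matches, unmatched_tracks, unmatched_high = associate(track_boxes, high)

    # Stage 2: leftover tracks get a second chance against the
    # low-confidence detections, using IoU only.
    leftover_ids = sorted(unmatched_tracks)
    leftover = [track_boxes[t] for t in leftover_ids]
    matches_low, _, _ = associate(leftover, low)
    # Map stage-2 matches back to the original track indices.
    matches_low = [(leftover_ids[t], d) for t, d in matches_low]

    # New tracks are started only from unmatched high-confidence boxes;
    # unmatched low-confidence boxes are simply discarded.
    return matches, matches_low, unmatched_high
```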
ByteTrack architecture diagram

The ByteTrack algorithm has proven very effective and currently ranks among the top-performing methods on the MOT Challenge leaderboard.

BoT-SORT

I personally love the BoT-SORT paper. It builds upon ByteTrack and combines three different ideas that work very well together. These are:

  1. Kalman Filter update: SORT introduced a way of modelling the track state vector using a seven-tuple \(\mathbf{x} = [x_c,y_c,s,a,\dot{x}_c,\dot{y}_c,\dot{s}]^T\). BoT-SORT proposes to replace the scale and aspect ratio of the bounding box (\(s\), \(a\)) with the width and height (\(w\), \(h\)) to create an eight-tuple:

    \[\mathbf{x} = [x_c,y_c,w,h,\dot{x_c},\dot{y_c},\dot{w}, \dot{h}]^T\]

    They also choose the \(Q\), \(R\) matrices of the Kalman Filter as functions of the bounding box width and height. Recall that in DeepSORT, only the scale of the bounding box influenced the \(Q\), \(R\) matrices (see section 3.1 of the BoT-SORT paper for more details).

  2. Camera Motion Compensation: In dynamic camera situations, static objects can appear to move, and moving objects can appear static. The Kalman Filter does not take camera motion into account in its predictions, so BoT-SORT proposes to incorporate this knowledge. To do this, they use the global motion compensation (GMC) technique from the OpenCV Video Stabilization module. This technique extracts keypoints from consecutive frames and computes the homography matrix between the matching pairs. This matrix can then be used to transform the predicted bounding box from the coordinate system of frame \(k - 1\) to the coordinates of the next frame \(k\) (see section 3.2 of the BoT-SORT paper for a full formulation of how to incorporate the homography matrix into the prediction step). A rough sketch of this idea is shown after this list.

    Camera movement example

    The player is static on the pitch while throwing the ball, but their location in the image changes due to camera movement.

  3. IoU - ReID Fusion: BoT-SORT proposes a new way of solving the association step by combining motion and appearance information, as sketched after this list. The cost matrix elements are computed as follows:

    \[\hat{d}^{cos}_{i,j} = \begin{cases} 0.5 \cdot d^{cos}_{i,j}, & \left(d^{cos}_{i,j} < \theta_{emb}\right) \wedge \left(d^{iou}_{i,j} < \theta_{iou}\right)\\ 1, & \text{otherwise} \end{cases}\]

    \[C_{i,j} = \min\left(d^{iou}_{i,j}, \hat{d}^{cos}_{i,j}\right)\]

    The appearance distance is recomputed as shown in the first equation. The idea is to filter out pairs with a large IoU distance or a large appearance distance (two different thresholds are used here). Then, the cost matrix element is set to the minimum of the IoU distance and the new appearance distance. This method seems handcrafted, and the authors likely spent a significant amount of time evaluating different thresholds to arrive at this formulation; note that the thresholds are calibrated on the MOT17 validation set.
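Here is a rough sketch of the camera motion compensation idea from point 2. It uses sparse optical flow and an affine fit via OpenCV rather than BoT-SORT's exact GMC code, and the parameter values are illustrative:

```python
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray):
    """Estimate the global inter-frame transform from tracked corners."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200,
                                  qualityLevel=0.01, minDistance=10)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good_prev = pts[status.flatten() == 1]
    good_next = nxt[status.flatten() == 1]
    # Fit a rotation + translation + scale transform with RANSAC
    A, _ = cv2.estimateAffinePartial2D(good_prev, good_next,
                                       method=cv2.RANSAC)
    return A  # 2x3 matrix mapping frame k-1 coordinates to frame k

def warp_box(box, A):
    """Move a predicted [x1, y1, x2, y2] box into the new frame."""
    corners = np.array([[box[0], box[1]], [box[2], box[3]]], dtype=np.float32)
    warped = cv2.transform(corners[None], A)[0]
    return [warped[0, 0], warped[0, 1], warped[1, 0], warped[1, 1]]
```

Each track's Kalman prediction would be passed through `warp_box` before the association step, so that apparent motion caused by the camera does not break the IoU matching.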
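And a small sketch of the IoU-ReID fusion from point 3, operating on precomputed distance matrices; the threshold defaults are assumptions rather than the calibrated MOT17 values:

```python
import numpy as np

def fuse_iou_reid(d_iou, d_cos, theta_iou=0.5, theta_emb=0.25):
    """Fuse motion and appearance costs in the BoT-SORT style.

    d_iou, d_cos: (num_tracks, num_dets) matrices of IoU distance
    (1 - IoU) and cosine appearance distance.
    """
    # Keep (and halve) the appearance distance only for pairs that pass
    # both the appearance and proximity gates; reject all other pairs.
    plausible = (d_cos < theta_emb) & (d_iou < theta_iou)
    d_cos_hat = np.where(plausible, 0.5 * d_cos, 1.0)
    # The final cost is the elementwise minimum of the two distances.
    return np.minimum(d_iou, d_cos_hat)
```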

BoT-SORT architecture diagram

SMILEtrack

This method currently holds the State-of-the-Art (SOTA) title on the MOT17 and MOT20 datasets. It builds upon ByteTrack but throws in a handful of fresh ideas designed to give appearance features more importance.

I spent a couple of hours trying to understand the paper, but I have to admit it felt very confusing to me, so I went straight to the code. Things got even trickier there: I noticed quite a few things that didn't align with what is described in the paper. As a result, I opened an issue on the project's GitHub repository. I'll update this section once I hear back from the authors.

References