2.1. Semantic Instance Segmentation
There is a clear difference between semantic segmentation and semantic instance segmentation. In semantic segmentation, every pixel of an image receives a class label (e.g., person, car, bicycle), with no distinction between objects belonging to the same class. In semantic instance segmentation, on the other hand, objects of the same class are treated as individual instances. Semantic instance segmentation thus detects and localizes, at the pixel level, object instances in images. Even though impressive results have been reported for other segmentation techniques, semantic instance segmentation remains one of the biggest challenges in computer vision. There is a high interest in solving this task, as instance labelling provides additional information in comparison with semantic segmentation. Autonomous driving, medicine, assistive devices and surveillance are only a few of the applications for which semantic instance segmentation would provide highly valuable input.
Instance segmentation solutions can be classified into one-stage and two-stage approaches. Generally, the two-stage methods perform object detection first, followed by segmentation. Mask RCNN [3] extends Faster RCNN [4] by adding a parallel branch that predicts an object mask for the corresponding bounding box. Therefore, the loss function of Mask RCNN [3] combines the losses for bounding-box regression, class recognition and mask segmentation. In addition, to improve the accuracy of the segmentation, RoI Pooling is replaced by RoI Align. MaskLab [5] is also built on top of the Faster RCNN [4] object detector. It combines semantic and direction prediction to achieve foreground/background segmentation. Semantic segmentation assists the model in distinguishing between objects of different semantic classes, including the background. Direction prediction, which estimates each pixel's direction toward its corresponding instance center, allows the separation of instances of the same semantic class. Hypercolumn features are exploited for mask refinement, and deformable crop-and-resize operations are used to improve the object detection branch. Huang et al. [6] proposed a framework that scores the quality of instance segmentation masks: the score of an instance mask is penalized if its classification score is high but the mask itself is not good enough. PANet [7] enhances the information flow in proposal-based solutions for semantic instance segmentation. A bottom–up path augmentation is added to improve the extraction of features from the lower layers, and all feature levels are linked by adaptive feature pooling.
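For completeness, the multi-task loss that Mask RCNN [3] optimizes on each sampled region of interest can be written as the sum of these three terms:

```latex
% Multi-task loss of Mask RCNN on each sampled RoI [3]
L = L_{cls} + L_{box} + L_{mask}
```

where $L_{cls}$ is the classification loss, $L_{box}$ is the bounding-box regression loss and $L_{mask}$ is the average binary cross-entropy computed only on the mask of the ground-truth class; since each class has its own mask output, there is no competition between classes within the mask branch.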
Even though the solutions discussed above achieve state-of-the-art performance in terms of accuracy, the time they need for inference is rather high, which makes them unsuitable for integration into real-time systems.
One-stage methods usually perform detection and segmentation simultaneously. The authors of [8] use a fully convolutional network with two branches: one for estimating segment instances and the other for scoring them. Each output pixel is a classifier of relative positions of instances, which are further assembled using an instance assembling module. The first attempt at real-time instance segmentation was proposed in [9]. Instance segmentation is broken into two tasks: generating a set of prototype masks and predicting per-instance mask coefficients. The instance masks are produced by combining the prototypes with the mask coefficients. Furthermore, in [10], deformable convolutions are added to the backbone, and the prediction head is optimized by using multi-scale anchors with different aspect ratios for each FPN level. The same idea as in Mask Scoring RCNN [6] is used to assign scores to the predicted masks.
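To illustrate the prototype-based mask assembly of [9], the sketch below shows the combination step: a linear blend of the prototypes weighted by the per-instance coefficients, followed by a sigmoid. This is a simplified illustration under our own assumptions about array names and shapes, not the authors' implementation.

```python
import numpy as np

def assemble_masks(prototypes: np.ndarray, coefficients: np.ndarray) -> np.ndarray:
    """Combine prototype masks with per-instance mask coefficients.

    prototypes:   (H, W, K) array of K prototype masks.
    coefficients: (N, K) array of mask coefficients for N detected instances.
    Returns:      (N, H, W) array of instance mask probabilities.
    """
    # Linear combination of the K prototypes for every instance.
    masks = np.tensordot(coefficients, prototypes, axes=([1], [2]))  # (N, H, W)
    # A sigmoid squashes the combined response into per-pixel probabilities.
    return 1.0 / (1.0 + np.exp(-masks))

# Example: 32 prototypes at reduced resolution, 5 detected instances.
protos = np.random.randn(138, 138, 32)
coeffs = np.tanh(np.random.randn(5, 32))  # coefficients squashed to [-1, 1]
instance_masks = assemble_masks(protos, coeffs)
print(instance_masks.shape)  # (5, 138, 138)
```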
Transformer-based networks have been successfully applied to various computer vision tasks and have yielded impressive results. Mask DINO [11] extends DINO [12] by adding a new branch that performs mask prediction for panoptic, instance and semantic segmentation. The content query embeddings from DINO [12] are used to perform mask classification for all segmentation tasks. QueryInst [13] proposes a query-based end-to-end instance segmentation framework with parallel supervision on six dynamic mask heads. QueryInst exploits the intrinsic one-to-one correspondence in queries across different stages. ISTR [14] matches low-dimensional mask embeddings with the ground truth to compute the loss. In addition, it uses a recurrent refinement strategy to perform detection and segmentation simultaneously.
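For intuition, query-based segmenters of this family typically obtain one mask per query by taking the dot product between the query's content embedding and a per-pixel embedding map. The sketch below illustrates this mechanism only; the names and shapes are our own assumptions, and it is not the exact Mask DINO [11] head.

```python
import numpy as np

def query_masks(queries: np.ndarray, pixel_embed: np.ndarray) -> np.ndarray:
    """Dot-product mask prediction used by query-based segmenters.

    queries:     (Q, C) content query embeddings, one per candidate instance.
    pixel_embed: (C, H, W) per-pixel embedding map from the pixel decoder.
    Returns:     (Q, H, W) mask logits, one map per query.
    """
    # Each query acts as a dynamic 1x1 convolution over the pixel embeddings.
    return np.einsum('qc,chw->qhw', queries, pixel_embed)

# Example: 100 queries, 256-dimensional embeddings, a 200x300 feature map.
logits = query_masks(np.random.randn(100, 256), np.random.randn(256, 200, 300))
masks = 1.0 / (1.0 + np.exp(-logits)) > 0.5  # boolean instance masks
```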
Image semantic and semantic instance segmentation, object detection and tracking, video object detection and segmentation, and video semantic segmentation have been studied far more extensively than video instance segmentation. The latter task involves the detection, segmentation and tracking of objects in a video sequence. Yang et al. [15] were the first to tackle video instance segmentation. Their work builds on the state-of-the-art image instance segmentation method [3]: a new branch is added to Mask RCNN for tracking instances across video frames. The instances are stored in an external memory and matched with objects in later frames. A solution similar to [15] is proposed in [16]; it predicts a basis mask and a set of coefficients to improve the segmentation quality. It achieves a better execution time than [15], as it is built on [17]. Another online method for video instance segmentation is proposed by Li et al. [18]. Inter-frame correlations are encoded by using a bottom–up framework equipped with a temporal context fusion module. An instance-level correspondence across adjacent frames, called instance flow, is used for efficient and robust tracking among instances. A few works treat video instance segmentation in offline mode [19,20,21]. These methods typically model the temporal information.
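The memory-based matching described above can be sketched as a greedy assignment between the embeddings of the current detections and the embeddings stored in memory. This is our own heavily simplified illustration; the actual tracking head in [15] also takes detection confidence, bounding-box overlap and category consistency into account.

```python
import numpy as np

def match_instances(memory: np.ndarray, detections: np.ndarray,
                    threshold: float = 0.5) -> list:
    """Greedily assign current detections to instances stored in memory.

    memory:     (M, D) L2-normalized embeddings of previously seen instances.
    detections: (N, D) L2-normalized embeddings of the current detections.
    Returns:    a list of length N holding the matched memory index,
                or -1 when the detection starts a new instance track.
    """
    if memory.shape[0] == 0:  # empty memory: every detection is a new track
        return [-1] * detections.shape[0]
    similarity = detections @ memory.T  # (N, M) cosine similarities
    assignments, taken = [], set()
    for row in similarity:
        j = int(np.argmax(row))
        if row[j] >= threshold and j not in taken:
            taken.add(j)
            assignments.append(j)   # continue the existing track j
        else:
            assignments.append(-1)  # unmatched: register a new instance
    return assignments
```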
2.2. Dense Optical Flow
Optical flow represents the motion of objects between consecutive frames (Figure 1) and expresses the relative movement between the objects and the camera. There are two types of optical flow: sparse optical flow and dense optical flow. Sparse optical flow describes the flow vectors only for some object features (e.g., edges, corners), while dense optical flow computes the flow vectors of all pixels in the image, as pictured in Figure 2.
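Formally, most optical flow methods start from the brightness constancy assumption, i.e., a pixel preserves its intensity along its motion trajectory; a first-order Taylor expansion of this assumption yields the classical optical flow constraint equation:

```latex
% Brightness constancy between consecutive frames
I(x + u, y + v, t + 1) = I(x, y, t)
% First-order Taylor expansion: the optical flow constraint equation
I_x u + I_y v + I_t = 0
```

where $(u, v)$ is the flow vector of pixel $(x, y)$ and $I_x$, $I_y$, $I_t$ are the partial derivatives of the image intensity. Since this is a single equation in two unknowns, additional constraints (local patches or global smoothness) are required, which is precisely what the methods discussed below provide.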
The information provided by optical flow proves useful in a wide range of computer vision systems, as well as in other application domains (e.g., action recognition, video compression, robot and vehicle navigation, video surveillance, fluid flow analysis, etc.).
Horn and Schunck [22] and Lucas and Kanade [23] were the first authors to tackle the subject of optical flow, more than four decades ago. The patch-based approach of Lucas and Kanade [23] uses a Taylor series expansion of the displaced image function to obtain sub-pixel estimates. Horn and Schunck [22] proposed a regularization-based framework that simultaneously minimizes the intensity differences between corresponding pixels and a global smoothness term over all flow vectors. The authors of [24] combine the ideas of [22,23] into a single framework that uses a locally aggregated Hessian as the brightness constancy term. Various techniques that use a combination of global and local motion have been proposed [25,26,27,28].
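To make the two classical formulations concrete, Horn and Schunck [22] minimize a global energy that couples the optical flow constraint with a smoothness regularizer, while Lucas and Kanade [23] solve the constraint in the least-squares sense over a local patch:

```latex
% Horn--Schunck: global energy over the image domain; alpha weights smoothness
E(u, v) = \iint \left( I_x u + I_y v + I_t \right)^2
        + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \, dx \, dy

% Lucas--Kanade: least squares over a local patch \Omega around each pixel
(u, v) = \arg\min_{u, v} \sum_{(x, y) \in \Omega} \left( I_x u + I_y v + I_t \right)^2
```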
Generally, the solutions published over time have focused on improving the accuracy of optical flow estimation rather than achieving real-time operation [29,30,31,32,33,34]. Some authors use powerful hardware resources to obtain an acceptable runtime [35,36], while others make a compromise between accuracy and execution time [37]. An efficient patch-based correspondence search is proposed by Kroeger et al. in [38], which leads to a low computational time.
The fast evolution of deep neural networks has led to their use in multiple computer vision problems, including optical flow estimation. DeepFlow [39] uses dense sampling to retrieve quasi-dense correspondences, which are further optimized using a variational energy framework. In [40], a classical spatial pyramid is combined with deep learning to estimate large motions. PWC-Net [41] computes a feature pyramid from each frame, warps the CNN features of the second image toward the first, and then builds a cost volume from these two feature maps.
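The cost volume construction can be sketched as a normalized correlation between each feature vector of the first frame and the feature vectors within a small search range in the warped second frame. The code below is a simplified single-scale illustration under our own assumptions about names and shapes; PWC-Net [41] performs this step at every pyramid level.

```python
import numpy as np

def cost_volume(f1: np.ndarray, f2_warped: np.ndarray, d: int = 4) -> np.ndarray:
    """Correlation cost volume between two feature maps.

    f1, f2_warped: (C, H, W) feature maps (f2 already warped by the current flow).
    d:             search radius in pixels; (2d+1)^2 displacement hypotheses.
    Returns:       ((2d+1)**2, H, W) cost volume.
    """
    C, H, W = f1.shape
    pad = np.pad(f2_warped, ((0, 0), (d, d), (d, d)))  # zero-pad spatial borders
    costs = []
    for dy in range(2 * d + 1):
        for dx in range(2 * d + 1):
            shifted = pad[:, dy:dy + H, dx:dx + W]
            # Per-pixel dot product of feature vectors, normalized by C.
            costs.append((f1 * shifted).sum(axis=0) / C)
    return np.stack(costs)  # ((2d+1)**2, H, W)

cv = cost_volume(np.random.randn(128, 32, 48), np.random.randn(128, 32, 48))
print(cv.shape)  # (81, 32, 48)
```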