Article

Background Subtraction for Moving Object Detection in RGBD Data: A Survey

by Lucia Maddalena 1,* and Alfredo Petrosino 2
1 National Research Council, Institute for High-Performance Computing and Networking, 80131 Naples, Italy
2 Department of Science and Technology, University of Naples Parthenope, 80143 Naples, Italy
* Author to whom correspondence should be addressed.
Submission received: 16 April 2018 / Revised: 7 May 2018 / Accepted: 9 May 2018 / Published: 16 May 2018
(This article belongs to the Special Issue Detection of Moving Objects)

Abstract

This paper provides a focused survey of background subtraction for moving object detection, a building block for many computer vision applications and the first relevant step for subsequent recognition, classification, and activity analysis tasks. Since color information alone is not sufficient to deal with problems such as light switches or gradual local illumination changes, shadows cast by foreground objects, and color camouflage, additional information must be exploited to address these issues. Here we consider the synchronized depth information acquired by low-cost RGBD sensors, giving evidence of which issues it can solve, but also highlighting new challenges and design opportunities in several applications and research areas.

1. Introduction

Background modeling is a critical component of motion detection tasks and is essential for most modern video surveillance applications. Usually, color provides most of the information needed to detect the foreground and to solve the basic issues related to this task [1,2,3,4,5]. However, problems such as light switches or gradual local illumination changes, shadows cast by foreground objects, and color camouflage due to the similar color of foreground and background regions are still open. The recent broad availability of depth data (from stereo vision to off-the-shelf RGBD sensors, such as time-of-flight and structured light cameras) has opened new ways of dealing with the problem. Indeed, the dense depth data provided by RGBD cameras is very attractive for foreground/background segmentation in indoor environments (due to range camera limitations), since it does not suffer from the above-mentioned issues that affect color-based algorithms. Moreover, depth information is beneficial for detecting and reducing the effect of moved background objects.
On the other hand, using depth data alone poses several problems and cannot guarantee the required accuracy: (a) depth-based segmentation fails in case of depth camouflage, which occurs when foreground objects move very close to the modeled background; (b) object silhouettes are strongly affected by the high level of depth noise at object boundaries; (c) depth measurements are not available for all image pixels, due to multiple reflections, scattering on particular surfaces, or occlusions. All these issues arose in the several background modeling approaches based solely on depth proposed in [6,7,8,9,10], mainly as building blocks for people detection and tracking systems [11,12,13,14].
Therefore, many recent methods try to exploit the complementary nature of color and depth information acquired with RGBD sensors. Generally, these methods either extend to RGBD data well-known background models originally designed for color data [15,16], or model the scene background (and sometimes also the foreground) based on color and depth independently and then combine the results according to different criteria [17,18,19,20] (see Section 3).
Several reviews related to RGBD data have been presented recently. In [21], Cruz et al. provide one of the first surveys of academic and industrial research on Kinect and RGBD data, showing the basic principles needed to begin developing applications using Kinect. Greff et al. [8] present a comparison of background subtraction algorithms using depth cameras. In [22], Zhang unravels the intelligent technologies encoded in Kinect, such as sensor calibration, human skeletal tracking, and facial-expression tracking, and demonstrates a prototype system that employs multiple Kinects in an immersive teleconferencing application. In [23], Han et al. present a comprehensive review of recent Kinect-based computer vision algorithms and applications, giving insights into how researchers exploit and improve computer vision algorithms using Kinect. In [24], Camplani et al. survey multiple human tracking in RGBD data.
The present paper aims to provide a comprehensive review of methods that exploit RGBD data for moving object detection based on background subtraction. We do not review methods based only on RGB features, as that would require a dedicated survey of its own and much greater space; for RGB-only background subtraction, the reader is referred to the reviews presented in [1,2,3,5]. We provide a brief analysis of the main issues and a concise description of the existing literature. Moreover, we summarize the metrics commonly used for the evaluation of these methods and the datasets that are publicly available. Finally, we provide the most extensive comparison of the existing methods on some of these datasets.

2. RGBD Data and Related Issues for Background Subtraction

Color cameras are based on sensors such as CCD or CMOS, which provide a reliable representation of the scene with high-resolution images. Background subtraction using this kind of sensor often results in a precise separation between foreground and background, even though well-known scene background modeling challenges for moving object detection must be taken into account [25,26]:
  • Bootstrapping: The challenge is to learn a model of the scene background (to be adopted for background subtraction) even when the usual assumption of having a set of training frames empty of foreground objects fails.
  • Color Camouflage: When videos include foreground objects whose color is very close to that of the background, it is hard to provide a correct segmentation based only on color.
  • Illumination Changes: The challenge is to adapt the color background model to strong or mild illumination changes to achieve an accurate foreground detection.
  • Intermittent Motion: The issue is to detect foreground objects even if they stop moving (abandoned objects) or if they were initially stationary and then start moving (removed objects).
  • Moving Background: The challenge is to model not only the static background but also slight changes in the background that are not interesting for surveillance, such as waving trees in outdoor videos.
  • Color Shadows: The challenge is to discriminate foreground objects from the shadows they cast on the background, which apparently behave as moving objects.
Depth sensors provide partial geometrical information about the scene that can help solve some of the above problems. A depth image, storing for each pixel a depth value proportional to the estimated distance from the device to the corresponding point in the real world, can be obtained with different methods [27]:
  • Stereo vision [28]: This is a passive technique where depth is derived from the disparity between images captured by a camera pair. Stereo vision systems need to be well calibrated and can fail when the scene is not sufficiently textured. Moreover, algorithms for stereo reconstruction are often computationally expensive. Finally, stereo vision systems cannot work in low-light conditions. In this case, infrared (IR) lights can be added to the system, but then the color information is lost, which generates segmentation and matching difficulties.
  • Time-of-Flight (ToF) [29]: ToF cameras are active sensors that determine the per-pixel depth value by measuring the time taken by IR light to travel to the object and back to the camera. A ToF camera provides more accurate depth images than a stereo vision system, but it is very expensive and limited to low image resolution. The measured depth map can be noisy both spatially and temporally, and noise is content-dependent and hence, difficult to remove by traditional filtering methods.
  • Structured light [30]: A structured light sensor consists of an IR emitter and an IR camera. The emitter projects an IR speckle pattern onto the scene; the camera captures the reflected pattern and correlates it against a stored reference pattern on a plane, providing the depth values. Well known examples include the Microsoft Kinect version 1 (in the following simply named Kinect) and the Asus Xtion Pro Live. These sensors can acquire higher resolution images than a ToF camera at a lower price. The drawback is that depth information is not always well estimated at the object boundaries and for areas too far from/too close to the IR projector. Also, the noise in depth measurements increases quadratically with increasing distance from the sensor [31].
Even though depth data solves some of the previously highlighted background maintenance issues, being independent of scene color and illumination conditions, it suffers from several problems, regardless of which technology is used for its estimation. Indeed, as with color data, depth data suffers from bootstrapping, intermittent motion, and moving background. Moreover, challenges specific to depth data include [32,33]:
  • Depth Camouflage: When foreground objects are very close in depth to the background, the sensor gives the same depth data values for foreground and background, making it hard to provide a correct segmentation based only on depth.
  • Depth Shadows: Similar to the case of color, depth shadows are caused by foreground objects blocking the IR light emitted by the sensor from reaching the background.
  • Specular Materials: When the scene includes specular objects, IR rays from a single incoming direction are reflected back in a single outgoing direction, without causing the diffusion needed to obtain depth information.
  • Out of Sensor Range: When foreground or background objects are too close to/far from the sensor, the sensor is unable to measure depth, due to its minimum and maximum depth specifications.
In the last three cases, where depth cannot be measured at a given pixel, the sensor returns a special non-value code to indicate its inability to measure depth [32], resulting in an invalid depth value (shown as black pixels in the depth images reported in Figure 1). These invalid values must be suitably handled to exploit depth for background subtraction.
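A minimal sketch of this handling, in Python with NumPy, is given below; it assumes (as in raw Kinect output) that unmeasured pixels are encoded with the value 0, so other sensors may require a different invalid-value code.

    import numpy as np

    def split_valid_depth(depth, invalid_value=0):
        """Separate a raw depth frame into valid measurements and an invalid-pixel mask.

        Assumes the sensor encodes 'unable to measure' as invalid_value (0 here,
        as in raw Kinect output); adapt it to the device at hand.
        """
        invalid_mask = (depth == invalid_value)   # pixels with no depth measurement
        valid_depth = np.where(invalid_mask, np.nan, depth.astype(np.float32))
        return valid_depth, invalid_mask

    # Usage sketch: only valid pixels feed the depth background model, while
    # invalid pixels are deferred to the color model or to a dedicated hole model.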

3. Methods

In the last twenty years, several methods have been proposed for background subtraction exploiting depth data, as an alternative or complement to color data. A summary of background subtraction methods for RGBD videos is given in Table 1. Here, apart from the name of the authors and the related reference (column Authors and Ref.), we report (column Used data) whether they exploit only the depth information (D) or the complementary nature of color and depth information (RGBD). Moreover, we specify (column Depth data) how the considered depth data is acquired (Kinect, ToF cameras, stereo vision devices). Furthermore, we specify (column Model) the type of model adopted for the background, including Codebook [34], Frame difference, Kernel Density Estimation (KDE) [35], Mixture of Gaussians (MoG) [36], Robust Principal Components Analysis (RPCA) [37], Self-Organizing Background Subtraction (SOBS) [38], Single Gaussian [39], Thresholding, ViBe [40], and WiSARD weightless neural network [41]. Finally, we specify (column No. of models) if they extend to RGBD data well-known background models originally designed for color data (1 model) or model the scene background based on color and depth independently and then combine the results, on the basis of different criteria (2 models).
In the following, we provide a brief description of the reviewed methods, presented in chronological order. In case of research dealing with higher-level systems (e.g., teleconferencing, matting, fall detection, human tracking, gesture recognition, object detection), we limit our attention to background modeling and foreground detection.
Eveland et al. [6] present a method of statistical background modeling for stereo sequences based on the disparity images extracted from stereo pairs. The depth background is modeled by a single Gaussian, similarly to [39], but selective update prevents the incorporation of foreground objects into the background.
The method proposed by Gordon et al. [42] is an adaptation of the MoG algorithm to color and depth data obtained with a stereo device. Each background pixel is modeled as a mixture of four-dimensional Gaussian distributions: three components are the color data (the YUV color space components), and the fourth is the depth data. Color and depth features are considered independent, and the same updating strategy of the original MoG algorithm is used to update the distribution parameters. The authors propose a strategy where, for reliable depth data, depth-based decisions bias the color-based ones: if a reliable distribution match is found in the depth component, the color-based matching criterion is relaxed, thus reducing color camouflage errors. When the stereo matching algorithm is not reliable, the color-based matching criterion is made stricter to avoid problems such as shadows or local illumination changes.
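As an illustration of this fusion rule, the sketch below implements a per-pixel matching criterion in the spirit of [42]; the distance measures, thresholds, and scale factors are illustrative assumptions, not the published values.

    import numpy as np

    def match_background(color_dist, depth_dist, depth_reliable,
                         color_thresh=3.0, depth_thresh=2.5,
                         relax=1.5, tighten=0.7):
        """Hedged sketch of depth-biased color matching (not the published rule).

        color_dist / depth_dist: distances of the current pixel to the
        best-matching background Gaussian in the color and depth components.
        depth_reliable: boolean mask of pixels with reliable stereo depth.
        """
        depth_match = depth_reliable & (depth_dist < depth_thresh)
        # A reliable depth match relaxes the color criterion (reduces color
        # camouflage errors); unreliable depth tightens it (reduces errors
        # due to shadows and local illumination changes).
        effective_thresh = np.where(
            depth_match, color_thresh * relax,
            np.where(depth_reliable, color_thresh, color_thresh * tighten))
        return color_dist < effective_thresh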
Ivanov et al. [43] propose an approach based on stereo vision, which uses the disparity (estimated offline) to warp one image of the pair in the other one, thus creating a geometric background model. If the color and brightness between corresponding points do not match, the pixels either belong to a foreground object or to an occlusion shadow. The latter case can be further disambiguated using more than two camera views.
Harville et al. [16] propose a foreground segmentation method using the YUV color space with the additional depth values estimated by stereo cameras. They adopt four-dimensional MoG models, also modulating the background model learning rate based on scene activity and making color-based segmentation criteria dependent on depth observations.
Kolmogorov et al. [44] describe two algorithms for bi-layer segmentation fusing stereo and color/contrast information, focused on live background substitution for teleconferencing. To segment the foreground, this approach relies on stereo vision, assuming that people participating in the teleconference are close to the camera. Color information is used to cope with stereo occlusion and low-texture regions. The color/contrast model is composed of MoG models for the background and the foreground.
Crabb et al. [45] propose a method for background substitution, a regularly used effect in TV and video production. Thresholding of depth data coming from a ToF camera, using a user-defined threshold, is adopted to generate a trimap (consisting of background, foreground, and uncertain pixels). Alpha matting values for uncertain pixels, mainly in the borders of the segmented objects, are needed for a natural looking blending of those objects on a different background. They are obtained by cross-bilateral filtering based on color information.
Guomundsson et al. [11] tackle 3D multi-person tracking in smart rooms. They adopt a single Gaussian model of the range data from a two-modal camera rig (consisting of a ToF range camera and an additional higher-resolution grayscale camera) for background subtraction.
In [46], Wu et al. present an algorithm for bi-layer segmentation of natural videos in real time using a combination of infrared, color, and edge information. A prime application of this system is in telepresence, where there is a need to remove the background and replace it with a new one. For each frame, the IR image is used to pre-segment the color image using a simple thresholding technique. This pre-segmentation is adopted to initialize a pentamap, which is then used by a graph cuts algorithm to find the final foreground region.
The depth data provided by a ToF camera is used to generate 3D-TV contents by Frick et al. [7]. The MoG algorithm is applied to the depth data to obtain foreground regions, which are then excluded by median filtering to improve background depth map accuracy.
In [47], Leens et al. propose a multi-camera system that combines color and depth data, obtained with a low-resolution ToF camera, for video segmentation. The algorithm applies the ViBe algorithm independently to the color and the depth data. The obtained foreground masks are then combined with logical operations and post-processed with morphological operations.
MoG is also adopted in the algorithm proposed by Stormer et al. [48], where depth and infrared data captured by a ToF camera are combined to detect foreground objects. Two independent background models are built, and each pixel is classified as background or foreground only if the matching conditions of the two models agree. Very close or overlapping foreground objects are further separated using a depth gradient-based segmentation.
Wang et al. [49] propose TofCut, an algorithm that combines color and depth cues in a unified probabilistic fusion framework and a novel adaptive weighting scheme to control the influence of these two cues intelligently over time. Bilayer segmentation is formulated as a binary labeling problem, whose optimal solution is obtained by minimizing an energy function. The data term evaluates the likelihood of each pixel belonging to the foreground or the background. The contrast term encodes the assumption that segmentation boundaries tend to align with the edges of high contrast. Color and depth foreground and background pixels are modeled through MoGs and single Gaussians, respectively, and their weighting factors are adaptively adjusted based on the discriminative capabilities of their models. The algorithm is also used in an automatic matting system [82] to automatically generate foreground masks, and consequently trimaps, to guide alpha matting.
Dondi et al. [50] propose a matting method using the intensity map generated by ToF cameras. It first segments the distance map based on the corresponding values of the intensity map and then applies region growing to the filtered distance map to identify and label pixel clusters. A trimap is obtained by eroding the output to select the foreground, dilating it to delimit the background, and marking the remaining contour pixels as indeterminate. The obtained trimap is fed as input to a matting algorithm that refines the result.
Frick et al. [51] use a thresholding technique to separate the foreground from the background in multiple planes of the video volume, for the generation of 3D-TV contents. A posterior trimap-based refinement using hierarchical graph cuts segmentation is further adopted to reduce the artifacts produced by the depth noise.
Kawabe et al. [52] employ stereo cameras to extract pedestrians. Foreground regions are extracted by MoG-based background subtraction and shadow detection using the color data. Then the moving objects are extracted by thresholding the histogram of depth data, computed by stereo matching.
Mirante et al. [53] exploit the information captured by a multi-sensor system consisting of a stereo camera pair with a ToF range sensor. Motion, retrieved by color and depth frame difference, provides the initial ROI mask. The foreground mask is first extracted by region growing in the depth data, where seeds are obtained by the ROI, then refined based on color edges. Finally, a trimap is generated, where uncertain values are those along the foreground contours, and are classified based on color in the CIELab color space.
Rougier et al. [54] explore the Kinect sensor for the application of detecting falls in the elderly. For people detection, the system adopts a single Gaussian depth background model.
Schiller and Koch [55] propose an approach to video matting that combines color information with the depth provided by ToF cameras. Depth keying is adopted to segment moving objects based on depth information, comparing the current depth image with a depth background image (constructed by averaging several ToF-images). MoG is adopted to segment moving objects based on color information. The two segmentations are weighted using two types of reliability measure for depth measurements: the depth variance and the amplitude image of the ToF-camera. The weighted average of the color and depth segmentations is used as matting alpha value for blending foreground and background, while its thresholding (using a user-defined threshold) is used for evaluating moving object segmentation.
Stone and Skubic [56] use only the depth information provided by a Kinect device to extract the foreground. For each pixel, minimum and maximum depth values d_m and d_M are computed from a set of training images to form a background model. For a new frame, each pixel is compared against the background model, and those pixels that lie outside the range [d_m − 1, d_M + 1] are considered foreground.
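Since the model of [56] reduces to per-pixel minima and maxima, a compact NumPy sketch is given below; it assumes that invalid depth values have already been removed from the training stack, and uses a margin of 1 depth unit as in the description above (the actual unit depends on the sensor scaling).

    import numpy as np

    def build_range_model(training_depth):
        """Per-pixel minimum and maximum depth over a (T, H, W) stack of training frames."""
        return training_depth.min(axis=0), training_depth.max(axis=0)

    def detect_foreground(depth, d_min, d_max, margin=1.0):
        """A pixel is foreground if its depth falls outside [d_min - margin, d_max + margin]."""
        return (depth < d_min - margin) | (depth > d_max + margin)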
In [9], Han et al. present a human detection and tracking system for a smart environment application. Background subtraction is applied only on the depth images as frame-by-frame difference, assisted by a clustering algorithm that checks the depth continuity of pixels in the neighborhood of foreground pixels. Once the object has been located in the image, visual features are extracted from the RGB image and are then used for tracking the object in successive frames.
In the surveillance system based on the Kinect proposed by Clapés et al. [57], a per pixel background subtraction technique is presented. The authors propose a background model based on a four-dimensional Gaussian distribution (using color and depth features). Then, user and object candidate regions are detected and recognized using robust statistical approaches.
In the gesture recognition system presented by Mahbub et al. [10], the foreground objects are extracted by the depth data using frame difference.
Ottonelli et al. [59] refine the ViBe segmentation of the color data by adding to the resulting foreground mask a compensation factor computed from the color and depth data obtained by a stereo camera.
In the object detection system presented by Zhang et al. [60], background subtraction is achieved by single Gaussian modeling of the depth information provided by a Kinect sensor.
Fernandez-Sanchez et al. [58] adopt Codebook as background model and consider data captured by Kinect cameras. They analyze two approaches that differ in the depth integration method: the four-dimensional Codebook (CB4D) considers merely depth as a fourth channel of the background model, while the Depth-Extended Codebook (DECB) adds a joint RGBD fusion method directly into the model. They proved that the latter achieves better results than the former. In [15], the authors consider stereo disparity data, besides color. To get the best of color and depth features, they extend the DECB algorithm through a post-processing stage for mask fusion (DECB-LF), based on morphological reconstruction using the output of the color-based algorithm.
Braham et al. [61] adopt two background models for depth data, separating valid values (modeled by a single Gaussian model) and invalid values (holes). The Gaussian mean is updated to the maximum valid value, while the standard deviation follows a quadratic relationship with respect to the depth. This leads to a depth-dependent foreground/background threshold that enables the model to adapt to the non-uniform noise of range images automatically.
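A minimal sketch of such a depth-adaptive threshold is given below; the quadratic noise model mirrors the behavior recalled in Section 2 [31], but the coefficients and the factor k are illustrative assumptions, not the values adopted in [61].

    import numpy as np

    def depth_noise_std(depth, a=1e-3, b=0.0, c=0.5):
        """Illustrative quadratic noise model: sigma(z) = a*z^2 + b*z + c."""
        return a * depth ** 2 + b * depth + c

    def classify_depth(depth, bg_mean, k=3.0):
        """Foreground where the observation deviates from the background mean by
        more than k times the depth-dependent standard deviation."""
        return np.abs(depth - bg_mean) > k * depth_noise_std(bg_mean)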
In [17], Camplani and Salgado propose an approach, named CL_W, based on a combination of color and depth classifiers (CL_C and CL_D) and the adoption of the MoG model. The combination of classifiers is based on a weighted average that adaptively modifies the support of each classifier in the ensemble by considering the foreground detections in the previous frames and the depth and color edges. For each pixel, the support of each classifier to the final segmentation result is obtained by considering the global edge-closeness probability and the classification labels obtained in the previous frame. In [62], the authors improve their method, proposing a method named MoG-RegPRE that builds not only pixel-based but also region-based models from depth and color data, and fuses the models in a mixture-of-experts fashion to improve the final foreground detection performance.
Chattopadhyay et al. [63] adopt RGBD streams for recognizing gait patterns of individuals. To extract RGBD information of moving objects, they adopt the SOBS model for color background subtraction and use the obtained foreground masks to extract the depth information of people silhouettes from the registered depth frames.
In [18], Gallego and Pardás present a foreground segmentation system that combines color and depth information captured by a Kinect camera to perform a complete Bayesian segmentation between foreground and background classes. The system adopts a combination of spatial-color and spatial-depth region-based MoG models for the foreground, as well as two color and depth pixel-wise MoG models for the background, in a Logarithmic Opinion Pool decision framework used to combine the likelihoods of each model correctly. A post-processing step based on a trimap analysis is also proposed to correct the precision errors that the depth sensor introduces in the object contour.
The algorithm proposed by Giordano et al. in [64] explicitly models the scene background and foreground with a KDE approach in a quantized x-y-hue-saturation-depth space. Foreground segmentation is achieved by thresholding the log-likelihood ratio over the background and foreground probabilities.
Murgia et al. [65] propose an extension of the Codebook model. Similarly to CB4D [58], it includes depth as a fourth channel of the background model but also applies colorimetric invariants to modify the color aspect of the input images, to give them the aspect they would have under canonical illuminants.
In [66], Song et al. model grayscale color and depth values based on MoG. The combination of the two models is based on the product of the likelihoods of the two models.
Boucher et al. [67] initially exploit depth information to achieve a coarse segmentation, using middleware of the adopted ASUS Xtion camera. The obtained mask is refined in uncertain areas (mainly object contours) having high background/foreground contrast, locally modeling colors by their mean value.
Cinque et al. [68] adapt to Kinect data a matting method previously proposed for ToF data. It is based on Otsu thresholding of the depth map and region growing for labeling pixel clusters, assembled to create an alpha map. Edge improvement is obtained by logical OR of the current map with those of the previous four frames.
Huang et al. [69] propose a post-processing framework based on an initial segmentation obtained solely from depth data. Two post-processing steps are proposed: a foreground hole detection step and an object boundary refinement step. For foreground hole detection, they obtain two weak decisions based on the region color cue and the contour contrast cue, adaptively fused according to their corresponding reliability. For object boundary refinement, they apply a weighted fusion of a motion-probability-weighted temporal prior, color likelihood, and smoothness constraints. Therefore, besides handling challenges such as color camouflage, illumination variations, and shadows, the method maintains the spatial and temporal consistency of the obtained segmentation, a fundamental issue for the telepresence target application.
Javed et al. [70] propose the DEOR-PCA (Depth Extended Online RPCA) method for background subtraction using binocular cameras. It consists of four main stages: disparity estimation, background modeling, integration, and spatiotemporal constraints. Initially, the range information is obtained by applying disparity estimation algorithms to a set of stereo pairs. Then, OR-PCA is applied separately to the left color image and the related disparity image to model the background. The integration stage adds the low-rank and sparse components obtained via OR-PCA to recover the background model and foreground mask from each image. The reconstructed sparse matrix is then thresholded to get the binary foreground mask. Finally, spatiotemporal constraints are applied to remove from the foreground mask most of the noise due to depth information.
In [71], Nguyen et al. present a method where, as an initial offline step, noise is removed from depth data based on a noise model. Background subtraction is then solved by combining RGB and depth features, both modeled by MoG. The fundamental idea in their combination strategy is that when depth measurement is reliable, the segmentation is mainly based on depth information; otherwise, RGB is used as an alternative.
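This switching idea can be sketched as follows; the reliability mask and the fallback rule are simplifying assumptions rather than the exact combination strategy of [71].

    import numpy as np

    def fuse_masks(fg_depth, fg_color, depth_reliable):
        """Trust the depth-based decision where depth is reliable,
        fall back to the color-based decision elsewhere."""
        return np.where(depth_reliable, fg_depth, fg_color)

    # A reliability mask could, for instance, exclude invalid depth pixels and
    # pixels close to strong depth discontinuities (object boundaries).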
Sun et al. [72] propose a MoG model for color information and a single Gaussian model for depth, together with a color-depth consistency check mechanism driving the updating of the two models. However, experimental results aim at evaluating background estimation, rather than background subtraction.
Tian et al. [73] propose a depth-weighted group-wise PCA-based algorithm, named DG-PCA. The background/foreground separation problem is formulated as a weighted L2,1-norm PCA problem with depth-based group sparsity being introduced. Dynamic groups are first generated solely based on depth, and then an iterative solution using depth to define the weights in the L2,1-norm is developed. The method handles moving cameras through global motion compensation.
In [19], Huang et al. present a method where two separate color and depth background models are based on ViBe, and the two resulting foreground masks are fused by weighted averaging. The result is further adaptively refined, taking into account multi-cue information (color, depth, and edges) and spatiotemporal consistency (in the neighborhood of foreground pixels in the current and previous frames).
In [20], Liang et al. propose a method to segment foreground objects based on color and depth data independently, using an existing background subtraction method (in the experiments they choose MoG). They focus on refining the inaccurate results through supervised learning. They extract several features from the source color and depth data in the foreground areas. These features are fed to two independent classifiers (in the experiments they choose random forest [83]) to obtain a better foreground detection.
In [74], Palmero et al. propose a baseline algorithm for human body segmentation using color, depth, and thermal information. To reduce the spatial search space in subsequent steps, the preliminary step is background subtraction, achieved in the depth domain using MoG.
In the method proposed by Chacon et al. [75], named SAFBS (Self-Adapting Fuzzy Background Subtraction), background subtraction is based on two background models for color (in the HSV color space) and depth, providing an initial foreground segmentation by frame differencing. A fuzzy algorithm computes the membership value of each pixel to the background or foreground classes, based on the color and depth differences, as well as the depth similarity, between the current frame and the background. Temporal and spatial smoothing of the membership values is applied to reduce false alarms due to depth flickering and imprecise measurements around object contours, respectively. The classification result is then employed to update the two background models, using automatically computed learning rates.
De Gregorio and Giordano [76] adapt an existing background modeling method using the WiSARD weightless neural network (WNNs) [41] to the domain of RGBD videos. Color and depth video streams are synchronously but separately modeled by WNNs at each pixel, using a set of initial video frames for network training. In the detection phase, classification is interleaved with re-training on current colors whenever pixels are detected as belonging to the background. Finally, the obtained output masks are combined by an OR operator and post-processed by morphological filtering.
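The final fusion stage can be sketched as below; the 3 × 3 structuring element and the opening/closing pair stand in for whatever morphological post-processing is actually applied, and are assumptions of this sketch.

    import numpy as np
    from scipy.ndimage import binary_opening, binary_closing

    def combine_and_clean(fg_color, fg_depth, structure=None):
        """OR-combine the color-based and depth-based masks and apply simple
        morphological post-processing."""
        if structure is None:
            structure = np.ones((3, 3), dtype=bool)
        fused = fg_color | fg_depth
        fused = binary_opening(fused, structure=structure)   # drop isolated false positives
        fused = binary_closing(fused, structure=structure)   # fill small holes in objects
        return fused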
Javed et al. [77] investigate the performance of an online RPCA-based method, named SRPCA, for moving object detection using RGBD videos. The algorithm consists of three main stages: (i) detection of dynamic images to create an input dynamic sequence by discarding motionless video frames; (ii) computation of spatiotemporal graph Laplacians; and (iii) application of RPCA to incorporate the preceding two steps for the separation of background and foreground components. In the experiments, the algorithm is tested by using only intensity, only RGB, and RGBD features, leading to the surprising conclusion that best results are achieved using only intensity features.
The algorithm proposed by Maddalena and Petrosino [78], named RGBD-SOBS, is based on two background models for color and depth information, exploiting a self-organizing neural background model previously adopted for RGB videos [84]. The resulting color and depth detection masks are combined, not only to achieve the final results but also to better guide the selective model update procedure.
Minematsu et al. [79] propose an algorithm, named SCAD, based on a simple combination of the appearance (color) and depth information. The depth background is obtained using, for each pixel, its farthest depth value along the whole video, thus resulting in a batch algorithm. The likelihood of the appearance background is computed using texture-based and RGB-based background subtraction. To reduce false positives due to illumination changes, SCAD roughly detects foreground objects by using texture-based background subtraction. Then, it performs RGB-based background subtraction to improve the results of texture-based background subtraction. Finally, foreground masks are obtained using graph cuts to optimize an energy function which combines the two likelihoods of the background.
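The batch depth background of SCAD can be sketched in a few lines; here invalid depth values are assumed to be stored as NaN so that they do not contribute to the per-pixel maximum (a simplification, not necessarily the authors' implementation).

    import numpy as np

    def farthest_depth_background(depth_stack):
        """Per-pixel farthest (maximum) depth over the whole video, ignoring NaNs.

        depth_stack: (T, H, W) array of depth frames with invalid values set to NaN.
        """
        return np.nanmax(depth_stack, axis=0)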
Moyá-Alcover et al. [32] construct a scene background model using KDE with a three-dimensional Gaussian kernel. One of the dimensions models depth information, while the other two model normalized chromaticity coordinates. Missing depth data are modeled using a probabilistic strategy to distinguish pixels that belong to the background model from those that are due to foreground objects. Pixels that cannot be classified as background or foreground are placed in an undefined class. Two different implementations are obtained depending on whether undefined pixels are considered as background (GSM_UB) or foreground (GSM_UF), demonstrating their suitability for scenes where actions happen far from or close to the sensor, respectively.
Trabelsi et al. [80] propose the RGBD-KDE algorithm, also based on a scene background model using KDE, but using a two-dimensional Gaussian kernel. One of the dimensions models depth information, while the other models the intensity (average of RGB components). To reduce computational complexity, the Fast Gaussian Transform is adapted to the problem.
Zhou et al. [81] construct color and depth models based on ViBe and fuse the results in a weighting mechanism for the model update that relies on depth reliability.

4. Metrics

The usual way of evaluating the performance of background subtraction algorithms for moving object detection in videos is to compare, pixel-wise, the computed foreground masks with the corresponding ground truth (GT) foreground masks [26,85] and compute suitable metrics. Metrics frequently adopted for evaluating background subtraction methods in RGBD videos are summarized in Table 2. Here, we report their name (column Name), abbreviation (column Acronym), definition (column Computed as), and whether they should be minimized (↓) or maximized (↑) for more accurate results (column Better if). All these metrics are defined in terms of the total number of True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN) pixels in the whole video. Most of the metrics reported in Table 2 are frequently used for evaluating background subtraction methods in RGB videos [26,85]. The exception is SiΩ, specifically adopted to analyze the errors close to the object boundaries Ω, where depth data is usually very imprecise. In [17], Ω is defined as the image region made of pixels surrounding the ground truth object boundary and having a distance from it of at most 10 pixels.
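For reference, the sketch below computes the most common of these metrics from the video-wide TP, FP, FN, and TN counts; the formulas are the standard ones used in [26,85] (Table 2 reports the exact definitions), and undefined cases, e.g., sequences with no positives in the GT (see Section 6.4), are returned as None rather than silently set to zero.

    def bs_metrics(tp, fp, fn, tn):
        """Background subtraction metrics from video-wide pixel counts."""
        def safe_div(num, den):
            return num / den if den > 0 else None

        recall = safe_div(tp, tp + fn)            # also called True Positive Rate
        specificity = safe_div(tn, tn + fp)
        precision = safe_div(tp, tp + fp)
        fpr = safe_div(fp, fp + tn)               # False Positive Rate
        fnr = safe_div(fn, fn + tp)               # False Negative Rate
        pwc = safe_div(100.0 * (fp + fn), tp + fp + fn + tn)   # Percentage of Wrong Classifications
        if recall is None or precision is None or (recall + precision) == 0:
            f_measure = None
        else:
            f_measure = 2 * precision * recall / (precision + recall)
        return {"Recall": recall, "Specificity": specificity, "Precision": precision,
                "FPR": fpr, "FNR": fnr, "PWC": pwc, "F-measure": f_measure}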
Where more than one metric is considered, overall metrics to rank the accuracy of the compared methods are also proposed by some authors [17,32], based on the rankings achieved by the methods according to each of the metrics.

5. Datasets

Several RGBD datasets exist for different tasks, including object detection and tracking, object and scene recognition, human activity analysis, 3D-simultaneous localization and mapping (SLAM), and hand gesture recognition (e.g., see surveys in [24,86,87]). However, depending on the application they have been devised for, they can include single RGBD images instead of videos, or they can supply GTs in the form of bounding boxes, 3D geometries, camera trajectories, 6DOF poses, or dense multi-class labels, rather than GT foreground masks.
In Table 3, we summarize some publicly available RGBD datasets suitable for background subtraction that include videos and, where available, GT foreground masks. Specifically, we report their acronym, website, and reference publication (column Name & Refs.), the source of depth data (column Source), whether or not they also provide GT foreground masks (column GT masks), the number of videos they include (column No. of videos), some RGBD background subtraction methods adopting them for their evaluation (column Adopted by), and the main application they have been devised for (column Main application).
The GSM dataset [32] includes seven different sequences designed to test some of the main problems in scene modeling when both color and depth information are used: color camouflage, depth camouflage, color shadows, smooth and sudden illumination changes, and bootstrapping. Each sequence is provided with some hand-labeled GT foreground masks. All the sequences are also included in the SBM-RGBD dataset [33] and accompanied by 56 GT foreground masks.
The Kinect dataset [90] contains nine single person sequences, recorded with a Kinect camera, to show depth and color camouflage situations that are prone to errors in color-depth scenarios.
The MICA-FALL dataset [71] contains RGBD videos for the analysis of human activities, mainly fall detection. Two scenarios are considered for capturing activities that happen at the center field of view of one of the four Kinect sensors or at the cross-view of two or more sensors. Besides color and depth data, accelerometer information and the coordinates of 20 skeleton joints are provided for every frame.
The MULTIVISION dataset consists of two different sets of sequences for the objective evaluation of background subtraction algorithms based on depth information as well as color images. The first set (MULTIVISION Stereo [15]) consists of four sequences recorded by stereo cameras, combined with three different disparity estimation algorithms [103,104,105]. The sequences are devised to test color saturation, color and depth camouflage, color shadows, low lighting, flickering lights, and sudden illumination changes. The second set (MULTIVISION Kinect [58]) consists of four sequences recorded by a Kinect camera, devised to test out of sensor range depth data, color and depth camouflage, flickering lights, and sudden illumination changes. For all the sequences, some frames have been hand-segmented to provide GT foreground masks. The four MULTIVISION Kinect sequences are also included in the SBM-RGBD dataset [33] and accompanied by 294 GT foreground masks.
The Princeton Tracking Benchmark dataset [95] includes 100 videos covering many realistic cases, such as deformable objects, moving camera, different occlusion conditions, and a variety of clutter backgrounds. The GTs are manual annotations in the form of bounding-boxes drawn around the objects on each frame. One of the sequences (namely, sequence bear_front) is also included in the SBM-RGBD dataset [33] and accompanied by 15 GT foreground masks.
The RGB-D Object Detection dataset [17] includes four different sequences of indoor environments, acquired with a Kinect camera, that contain different demanding situations, such as color and depth camouflage or cast shadows. For each sequence, a hand-labeled ground truth is provided to test foreground/background segmentation algorithms. All the sequences, suitably subdivided and reorganized, are also included in the SBM-RGBD dataset [33] and accompanied by more than 1100 GT foreground masks.
The RGB-D People dataset [98] is devoted to evaluating people detection and tracking algorithms for robotics, interactive systems, and intelligent vehicles. It includes more than 3000 RGBD frames acquired in a university hall from three vertically mounted Kinect sensors. The data contains walking and standing persons seen from different orientations and with different levels of occlusions. Regarding the ground truth, all frames are annotated manually to contain bounding boxes in the 2D depth image space and the visibility status of subjects. Unfortunately, the GT foreground masks built and used in [62] are not available.
The SBM-RGBD dataset [33] is a publicly available benchmarking framework specifically designed to evaluate and compare scene background modeling methods for moving object detection in RGBD videos. It is the most extensive RGBD video dataset ever made for this specific purpose and also includes videos coming from other datasets, namely, GSM [32], MULTIVISION [58], the Princeton Tracking Benchmark [95], the RGB-D Object Detection dataset [17], and the UR Fall Detection Dataset [106,107]. The 33 videos acquired by Kinect cameras span seven categories, selected to include diverse scene background modeling challenges for moving object detection: illumination changes, color and depth camouflage, intermittent motion, out of sensor depth range, color and depth shadows, and bootstrapping. Depth images are already synchronized and registered with the corresponding color images by projecting the depth map onto the color image, providing a color-depth pixel correspondence. For each sequence, pixels that have no color-depth correspondence (due to the difference between the color and depth camera centers) are signaled in a binary Region-of-Interest (ROI) image and are excluded from the evaluation.
Other publicly available RGBD video datasets are worth mentioning, being equipped with pixel-wise GT foreground masks, which are devoted to specific applications. These include the BIWI RGBD-ID dataset [108,109] and the IPG dataset [110,111], targeted to people re-identification, and the VAP Trimodal People Segmentation Dataset [74,112], that contains videos captured by thermal, depth, and color sensors, devoted to human body segmentation.

6. Comparisons

Due to the public availability of data, GTs, and results obtained by existing background subtraction algorithms handling RGBD data, five of the RGBD datasets described in Section 5 have been adopted by several authors for benchmarking new algorithms for the problem. Here, we summarize and compare the published results.

6.1. Comparisons on the MULTIVISION Kinect Dataset

Performance comparisons on the MULTIVISION Kinect dataset are reported in Table 4. Here, values for the DECB and the CB4D algorithms by Fernandez et al. [58] and the Codebook algorithm using only color (CB) and only depth (CB-D) are those reported in [58]. Values for the RGBD-KDE algorithm by Trabelsi et al. [80] and the KDE algorithm using only color (C-KDE) and only depth (D-KDE) are those reported in [80]. Values for the RSBS (Random Sampling-based Background Subtraction) algorithm by Huang et al. [19] and the SAFBS algorithm by Chacon et al. [75] are those reported by their authors.
It can be observed that, in general, depth alone (i.e., CB-D and D-KDE) achieves better results than color alone (i.e., CB and C-KDE), being insensitive to illumination variations (e.g., in sequences ChairBox and Hallway) and color camouflage (e.g., in sequence Hallway). The clear exception is the case of depth camouflage, as in sequence Wall (see Figure 2). For all the sequences, the combined use of both types of information generally achieves comparable or better performance.

6.2. Comparisons on the MULTIVISION Stereo Dataset

Performance comparisons on the MULTIVISION Stereo dataset are reported in Table 5. Here, values for the DECB-LF algorithm by Fernandez et al. [15], the DECB and the CB4D algorithms by Fernandez et al. [58], and for the Codebook algorithm using only color (CB) and only depth (CB-D) are those reported in [15]. Values for the RGBD-KDE algorithm by Trabelsi et al. [80] and the KDE algorithm using only color (C-KDE) and only depth (D-KDE) are those reported in [80]. Values for the DEOR-PCA algorithm by Javed et al. [70] are those reported by the same authors.
It can be observed that, for all the videos, the combined use of both color and depth information (i.e., the DEOR-PCA, DECB, DECB-LF, and RGBD-KDE methods) achieves better results than color alone (i.e., the CB and C-KDE methods) or depth alone (i.e., the CB-D and D-KDE methods). Moreover, the difficulty in estimating and discriminating disparities in case of flickering lights (e.g., in the LCDScreen and LabDoor videos) and depth camouflage (e.g., in the Crossing video) leads depth-only methods to obtain worse results than color-only methods. Only for sequence Suitcase (see Figure 3), where the main issue is color camouflage, does depth alone achieve better results than color alone, due to the high accuracy of the estimated depth information.

6.3. Comparisons on the RGB-D Object Detection Dataset

Performance comparisons on the RGB-D Object Detection dataset are reported in Table 6. Here, values for the two weak color and depth classifiers (CL_C and CL_D) and the weighted color and depth classifier (CL_W) by Camplani and Salgado [17], the four-dimensional MoG model (MoG4D) by Gordon et al. [42], the combined RGB and depth ViBe model (ViBeRGB+D) by Leens et al. [47], and the combined RGB and depth MoG model (MoGRGB+D) by Stormer et al. [48] are those reported in [17]. Values for the RGBD-KDE algorithm by Trabelsi et al. [80] and the KDE algorithm using only color (C-KDE) and only depth (D-KDE) are those reported in [80]. Values for the AMDF (Adaptive Multi-Cue Decision Fusion) algorithm by Huang et al. [69], the RFBS (Refinement Framework for Background Subtraction) algorithm by Liang et al. [20], the EC-RGBD algorithm by Nguyen et al. [71], the enhanced classifier (MoG-RegPRE) by Camplani et al. [62], the GSM_UB and GSM_UF algorithms by Moyá et al. [32], and the SAFBS algorithm by Chacon et al. [75] are those reported by the related authors.
Good performance can be achieved for color camouflage (ColCamSeq) and shadows (ShSeq), as well as for sequence GenSeq (see Figure 4), which combines different issues (color shadows, color and depth camouflage, and noisy depth data). On the other hand, depth camouflage (DCamSeq) seems to be a problem for most of the methods using depth.

6.4. Comparisons on the GSM Dataset

Performance comparisons on the GSM dataset are reported in Table 7. Here, values for the GSM_UB and GSM_UF algorithms by Moyá et al. [32] are those reported on the dataset website. Values for the RGBD-KDE algorithm by Trabelsi et al. [80] and the KDE algorithm using only color (C-KDE) and only depth (D-KDE) are those reported in [80].
It can be observed that the compared methods based on the combination of color and depth information robustly deal with all the issues related to RGBD data: intermittent object motion (Sleeping-ds), illumination changes (TimeOfDay-ds and LightSwitch-ds), color camouflage (Cespatx-ds), depth camouflage (Despatx-ds, see Figure 5), color and depth shadows (Shadows-ds), and bootstrapping (Bootstraping-ds). It should be pointed out that, in the case of the TimeOfDay-ds and Ls-ds sequences, the performance analysis should be based on Specificity, FPR, FNR, and PWC, rather than on the other three metrics. Indeed, there are no foreground objects throughout these sequences, their purpose being to verify that no false positives are detected under varying illumination conditions. This leads to having no positive cases in the ground truths and, consequently, to undefined values of Precision, Recall, and F-measure. While for GSM_UB and GSM_UF the values in these undefined cases are set to zero, a different handling must have been adopted for the other compared methods.

6.5. Comparisons on the SBM-RGBD Dataset

Performance comparisons on the SBM-RGBD dataset are reported in Table 8 and Table 9. Here, values for the RGBD-SOBS and RGB-SOBS algorithms by Maddalena and Petrosino [78], the SRPCA algorithm by Javed et al. [77], the AvgM-D algorithm by Li and Wang [113], the Kim algorithm by Younghee Kim [114], the SCAD algorithm by Minematsu et al. [79], the cwisardH+ algorithm by De Gregorio and Giordano [76], and the MFCN algorithm by Zeng et al. [102], are those reported by the related authors. All the performance measures have been computed using the complete set of GTs and are available at [115].
It can be observed that the deep learning-based MFCN algorithm almost always achieves the best results in all the video categories, in terms of all the metrics. This is certainly made possible by the availability of such a wide dataset to train the network. Several conclusions can be drawn for each of the considered challenges by observing the remaining results. Bootstrapping can be a problem when using only color information, especially for selective background subtraction methods (e.g., RGB-SOBS), i.e., those that update the background model using only background information. Indeed, once a foreground object is erroneously included in the background model (e.g., due to inappropriate background initialization or to inaccurate segmentation of foreground objects), it is unlikely to be removed from the model, continuing to produce false negative results. The problem is even harder if some parts of the background are never revealed during the sequences, as happens in most of the videos of the Bootstrapping category. Indeed, in these cases, even the best performing background initialization methods [116,117] fail, and only alternative techniques (e.g., inpainting) can be adopted to recover missing data [118]. Nonetheless, depth information seems to be beneficial for addressing this challenge, as reported in Table 8, where accurate results are achieved by most of the methods that exploit depth information.
As expected, all the methods that exploit depth information achieve high accuracy in case of color camouflage and illumination changes. In the latter case, it should be pointed out that, since this video category includes the two TimeOfDay-ds and Ls-ds sequences of the GSM dataset (without any foreground object), the performance analysis should be based on Specificity, FPR, FNR, and PWC, rather than on the other three metrics (see Section 6.4).
Depth can also be beneficial for detecting and properly handling cases of intermittent motion. Indeed, foreground objects can be easily identified based on their depth, which is lower than that of the background, even when they remain stationary for long periods. Methods that explicitly exploit this characteristic succeed in handling cases of removed and abandoned objects, achieving high accuracy.
Overall, shadows do not seem to pose a strong challenge to most of the methods. Indeed, depth shadows due to moving objects cause some undefined depth values, generally close to the object contours, but these can be handled based on motion. Color shadows can be handled either exploiting depth information, that is insensitive to this challenge, or through color shadow detection techniques when only color information is taken into account.
Depth camouflage and out of range (see Figure 6) are among the most challenging issues, at least when information on color is disregarded or not properly combined with depth. Indeed, even though the accuracy of most of the methods is moderately high, several false negatives are produced.

6.6. Summary of the Findings and Open Issues

From the reported comparisons, it can be argued that, generally, most of the issues related to RGB data can be solved by accurate depth information, which is insensitive to scene color and illumination conditions (color camouflage, illumination changes, and color shadows) and provides geometric information about the scene (bootstrapping and intermittent motion). This does not hold in cases where depth measurements or estimates are not sufficiently accurate. However, the combined use of both color and depth information was shown to achieve better results than color alone or depth alone. Indeed, a clever combination of this information enables the exploitation of depth benefits while overcoming the issues arising from possible depth inaccuracies, by exploiting the complementary color information.
Open issues remain when depth and color information fail to be complementary. As an example, it has been shown that an object moving close to a wall can be detected based on its color, rather than its camouflaged depth. However, what if the object has the same color as the wall? Future research directions should certainly investigate these cases.

7. Conclusions and Future Research Directions

The paper provides a comprehensive review of methods which exploit RGBD data for moving object detection based on background subtraction, a building block for many computer vision applications. The main issues and the existing literature are briefly reviewed. Moreover, the metrics commonly used for the evaluation of these methods and the datasets that are publicly available are summarized. Finally, the most extensive comparison of the existing methods on some datasets is provided, which can serve as a reference for future methods aiming at overcoming the highlighted open issues.

Author Contributions

The authors contributed equally to this work.

Acknowledgments

L.M. acknowledges the INdAM Research group GNCS and the INTEROMICS Flagship Project funded by MIUR, Italy. A.P. acknowledges Project PLI 4.0 Horizon 2020-PON 2014-2020.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bouwmans, T. Traditional and recent approaches in background modeling for foreground detection: An overview. Comput. Sci. Rev. 2014, 11, 31–66. [Google Scholar] [CrossRef]
  2. Cuevas, C.; Martínez, R.; García, N. Detection of stationary foreground objects: A survey. Comput. Vis. Image Underst. 2016, 152, 41–57. [Google Scholar] [CrossRef]
  3. Shah, M.; Deng, J.D.; Woodford, B.J. Video background modeling: Recent approaches, issues and our proposed techniques. Mach. Vis. Appl. 2014, 25, 1105–1119. [Google Scholar] [CrossRef]
  4. Vaswani, N.; Bouwmans, T.; Javed, S.; Narayanamurthy, P. Robust PCA and Robust Subspace Tracking. arXiv, 2017; arXiv:1711.09492. [Google Scholar]
  5. Xu, Y.; Dong, J.; Zhang, B.; Xu, D. Background modeling methods in video analysis: A review and comparative evaluation. CAAI Trans. Intell. Technol. 2016, 1, 43–60. [Google Scholar] [CrossRef]
  6. Eveland, C.; Konolige, K.; Bolles, R.C. Background modeling for segmentation of video-rate stereo sequences. In Proceedings of the 1998 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, USA, 25 June 1998; pp. 266–271. [Google Scholar]
  7. Frick, A.; Kellner, F.; Bartczak, B.; Koch, R. Generation of 3D-TV LDV-content with Time-Of-Flight Camera. In Proceedings of the 2009 3DTV Conference: The True Vision-Capture, Transmission and Display of 3D Video, Potsdam, Germany, 4–6 May 2009; pp. 1–4. [Google Scholar] [CrossRef]
  8. Greff, K.; Brandão, A.; Krauß, S.; Stricker, D.; Clua, E. A Comparison between Background Subtraction Algorithms using a Consumer Depth Camera. In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP 2012), Rome, Italy, 24–26 February 2012; Volume 1, pp. 431–436. [Google Scholar]
  9. Han, J.; Pauwels, E.J.; de Zeeuw, P.M.; de With, P.H.N. Employing a RGB-D sensor for real-time tracking of humans across multiple re-entries in a smart environment. IEEE Trans. Consum. Electron. 2012, 58, 255–263. [Google Scholar] [CrossRef]
  10. Mahbub, U.; Imtiaz, H.; Roy, T.; Rahman, M.S.; Ahad, M.A.R. A template matching approach of one-shot-learning gesture recognition. Pattern Recognit. Lett. 2013, 34, 1780–1788. [Google Scholar] [CrossRef]
  11. Guomundsson, S.A.; Larsen, R.; Aanaes, H.; Pardas, M.; Casas, J.R. TOF imaging in Smart room environments towards improved people tracking. In Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, USA, 23–28 June 2008; pp. 1–6. [Google Scholar]
  12. Xia, L.; Chen, C.C.; Aggarwal, J.K. Human detection using depth information by Kinect. In Proceedings of the 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2011), Colorado Springs, CO, USA, 20–25 June 2011; pp. 15–22. [Google Scholar] [CrossRef]
  13. Almazan, E.J.; Jones, G.A. Tracking People across Multiple Non-overlapping RGB-D Sensors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 831–837. [Google Scholar]
  14. Galanakis, G.; Zabulis, X.; Koutlemanis, P.; Paparoulis, S.; Kouroumalis, V. Tracking Persons Using a Network of RGBD Cameras. In Proceedings of the 7th International Conference on PErvasive Technologies Related to Assistive Environments (PETRA ’14), Rhodes, Greece, 27–30 May 2014; ACM: New York, NY, USA, 2014; p. 63. [Google Scholar]
  15. Fernandez-Sanchez, E.J.; Rubio, L.; Diaz, J.; Ros, E. Background subtraction model based on color and depth cues. Mach. Vis. Appl. 2014, 25, 1211–1225. [Google Scholar] [CrossRef]
  16. Harville, M.; Gordon, G.; Woodfill, J. Foreground segmentation using adaptive mixture models in color and depth. In Proceedings of the IEEE Workshop on Detection and Recognition of Events in Video, Vancouver, BC, Canada, 8 July 2001; pp. 3–11. [Google Scholar] [CrossRef]
  17. Camplani, M.; Salgado, L. Background foreground segmentation with RGB-D Kinect data: An efficient combination of classifiers. J. Vis. Commun. Image Represent. 2014, 25, 122–136. [Google Scholar] [CrossRef]
  18. Gallego, J.; Pardás, M. Region based foreground segmentation combining color and depth sensors via logarithmic opinion pool decision. J. Vis. Commun. Image Represent. 2014, 25, 184–194. [Google Scholar] [CrossRef]
  19. Huang, J.; Wu, H.; Gong, Y.; Gao, D. Random sampling-based background subtraction with adaptive multi-cue fusion in RGBD videos. In Proceedings of the 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Datong, China, 15–17 October 2016; pp. 30–35. [Google Scholar] [CrossRef]
  20. Liang, Z.; Liu, X.; Liu, H.; Chen, W. A refinement framework for background subtraction based on color and depth data. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 271–275. [Google Scholar] [CrossRef]
  21. Cruz, L.; Lucio, D.; Velho, L. Kinect and RGBD Images: Challenges and Applications. In Proceedings of the 2012 25th SIBGRAPI Conference on Graphics, Patterns and Images Tutorials, Ouro Preto, Brazil, 22–25 August 2012; pp. 36–49. [Google Scholar]
  22. Zhang, Z. Microsoft Kinect Sensor and Its Effect. IEEE MultiMedia 2012, 19, 4–10. [Google Scholar] [CrossRef] [Green Version]
  23. Han, J.; Shao, L.; Xu, D.; Shotton, J. Enhanced Computer Vision With Microsoft Kinect Sensor: A Review. IEEE Trans. Cybern. 2013, 43, 1318–1334. [Google Scholar] [CrossRef] [PubMed]
  24. Camplani, M.; Paiement, A.; Mirmehdi, M.; Damen, D.; Hannuna, S.; Burghardt, T.; Tao, L. Multiple human tracking in RGB-depth data: A survey. IET Comput. Vis. 2017, 11, 265–285. [Google Scholar] [CrossRef]
  25. Toyama, K.; Krumm, J.; Brumitt, B.; Meyers, B. Wallflower: Principles and practice of background maintenance. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 1, pp. 255–261. [Google Scholar] [CrossRef]
  26. Goyette, N.; Jodoin, P.M.; Porikli, F.; Konrad, J.; Ishwar, P. A novel video dataset for change detection benchmarking. IEEE Trans. Image Process. 2014, 23, 4663–4679. [Google Scholar] [CrossRef] [PubMed]
  27. Zanuttigh, P.; Marin, G.; Dal Mutto, C.; Dominio, F.; Minto, L.; Cortelazzo, G.M. Time-of-Flight and Structured Light Depth Cameras; Springer International Publishing: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  28. Hu, X.; Mordohai, P. A Quantitative Evaluation of Confidence Measures for Stereo Vision. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 2121–2133. [Google Scholar] [PubMed]
  29. Kolb, A.; Barth, E.; Koch, R.; Larsen, R. Time-of-Flight Cameras in Computer Graphics. Comput. Graph. Forum 2010, 29, 141–159. [Google Scholar] [CrossRef]
  30. Daneshmand, M.; Helmi, A.; Avots, E.; Noroozi, F.; Alisinanoglu, F.; Arslan, H.S.; Gorbova, J.; Haamer, R.E.; Ozcinar, C.; Anbarjafari, G. 3D Scanning: A Comprehensive Survey. arXiv, 2018; arXiv:1801.08863. [Google Scholar]
  31. Khoshelham, K.; Elberink, S.O. Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications. Sensors 2012, 12, 1437–1454. [Google Scholar] [CrossRef] [PubMed]
  32. Moyà-Alcover, G.; Elgammal, A.; i Capó, A.J.; Varona, J. Modeling depth for nonparametric foreground segmentation using RGBD devices. Pattern Recognit. Lett. 2017, 96, 76–85. [Google Scholar] [CrossRef]
  33. Camplani, M.; Maddalena, L.; Moyá Alcover, G.; Petrosino, A.; Salgado, L. A Benchmarking Framework for Background Subtraction in RGBD Videos. In Proceedings of the New Trends in Image Analysis and Processing (ICIAP 2017), Catania, Italy, 11–15 September 2017; Battiato, S., Farinella, G.M., Leo, M., Gallo, G., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 219–229. [Google Scholar]
  34. Kim, K.; Chalidabhongse, T.H.; Harwood, D.; Davis, L. Real-time foreground-background segmentation using codebook model. Real-Time Imaging 2005, 11, 172–185. [Google Scholar] [CrossRef] [Green Version]
  35. Elgammal, A.M.; Harwood, D.; Davis, L.S. Non-parametric Model for Background Subtraction. In Proceedings of the 6th European Conference on Computer Vision, Dublin, Ireland, 26 June–1 July 2000; Springer: Berlin/Heidelberg, Germany, 2000; pp. 751–767. [Google Scholar]
  36. Stauffer, C.; Grimson, W.E.L. Adaptive background mixture models for real-time tracking. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Fort Collins, CO, USA, 23–25 June 1999; Volume 2, p. 252. [Google Scholar] [CrossRef]
  37. Candès, E.J.; Li, X.; Ma, Y.; Wright, J. Robust Principal Component Analysis? J. ACM 2011, 58, 11. [Google Scholar] [CrossRef]
  38. Maddalena, L.; Petrosino, A. A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications. IEEE Trans. Image Process. 2008, 17, 1168–1177. [Google Scholar] [CrossRef] [PubMed]
  39. Wren, C.R.; Azarbayejani, A.; Darrell, T.; Pentland, A.P. Pfinder: Real-time tracking of the human body. IEEE Trans. Pattern Anal. Mach. Intell. 1997, 19, 780–785. [Google Scholar] [CrossRef]
  40. Barnich, O.; Droogenbroeck, M.V. ViBE: A powerful random technique to estimate the background in video sequences. In Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 945–948. [Google Scholar] [CrossRef]
  41. Aleksander, I.; Thomas, W.; Bowden, P. WISARD·a radical step forward in image recognition. Sens. Rev. 1984, 4, 120–124. [Google Scholar] [CrossRef]
  42. Gordon, G.G.; Darrell, T.; Harville, M.; Woodfill, J. Background Estimation and Removal Based on Range and Color. In Proceedings of the 1999 Conference on Computer Vision and Pattern Recognition (CVPR ’99), Ft. Collins, CO, USA, 23–25 June 1999; pp. 2459–2464. [Google Scholar] [CrossRef]
  43. Ivanov, Y.; Bobick, A.; Liu, J. Fast Lighting Independent Background Subtraction. Int. J. Comput. Vis. 2000, 37, 199–207. [Google Scholar] [CrossRef]
  44. Kolmogorov, V.; Criminisi, A.; Blake, A.; Cross, G.; Rother, C. Bi-layer segmentation of binocular stereo video. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 2, p. 1186. [Google Scholar] [CrossRef]
  45. Crabb, R.; Tracey, C.; Puranik, A.; Davis, J. Real-time foreground segmentation via range and color imaging. In Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Anchorage, AK, USA, 23–28 June 2008; pp. 1–5. [Google Scholar] [CrossRef]
  46. Wu, Q.; Boulanger, P.; Bischof, W.F. Robust Real-Time Bi-Layer Video Segmentation Using Infrared Video. In Proceedings of the 2008 Canadian Conference on Computer and Robot Vision, Windsor, ON, Canada, 28–30 May 2008; pp. 87–94. [Google Scholar] [CrossRef]
  47. Leens, J.; Piérard, S.; Barnich, O.; Van Droogenbroeck, M.; Wagner, J.M. Combining Color, Depth, and Motion for Video Segmentation. In Proceedings of the Computer Vision Systems: 7th International Conference on Computer Vision Systems (ICVS 2009), Liège, Belgium, 13–15 October 2009; Fritz, M., Schiele, B., Piater, J.H., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 104–113. [Google Scholar] [CrossRef]
  48. Stormer, A.; Hofmann, M.; Rigoll, G. Depth gradient based segmentation of overlapping foreground objects in range images. In Proceedings of the 2010 13th International Conference on Information Fusion, Edinburgh, UK, 26–29 July 2010; pp. 1–4. [Google Scholar] [CrossRef]
  49. Wang, L.; Zhang, C.; Yang, R.; Zhang, C. TofCut: Towards Robust Real-time Foreground Extraction using Time-of-flight Camera. In Proceedings of the 3DPVT, Paris, France, 17–20 May 2010. [Google Scholar]
  50. Dondi, P.; Lombardi, L. Fast Real-time Segmentation and Tracking of Multiple Subjects by Time-of-Flight Camera—A New Approach for Real-time Multimedia Applications with 3D Camera Sensor. In Proceedings of the Sixth International Conference on Computer Vision Theory and Applications (VISAPP 2011), Vilamoura, Portugal, 5–7 March 2011; pp. 582–587. [Google Scholar]
  51. Frick, A.; Franke, M.; Koch, R. Time-Consistent Foreground Segmentation of Dynamic Content from Color and Depth Video. In Proceedings of the Pattern Recognition: 33rd DAGM Symposium, Frankfurt/Main, Germany, 31 August–2 September 2011; Mester, R., Felsberg, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 296–305. [Google Scholar] [CrossRef]
  52. Kawabe, M.; Tan, J.K.; Kim, H.; Ishikawa, S.; Morie, T. Extraction of individual pedestrians employing stereo camera images. In Proceedings of the 2011 11th International Conference on Control, Automation and Systems, Gyeonggi-do, Korea, 26–29 October 2011; pp. 1744–1747. [Google Scholar]
  53. Mirante, E.; Georgiev, M.; Gotchev, A. A fast image segmentation algorithm using color and depth map. In Proceedings of the 2011 3DTV Conference: The True Vision—Capture, Transmission and Display of 3D Video (3DTV-CON), Antalya, Turkey, 16–18 May 2011; pp. 1–4. [Google Scholar]
  54. Rougier, C.; Auvinet, E.; Rousseau, J.; Mignotte, M.; Meunier, J. Fall Detection from Depth Map Video Sequences. In Proceedings of the Toward Useful Services for Elderly and People with Disabilities: 9th International Conference on Smart Homes and Health Telematics (ICOST 2011), Montreal, QC, Canada, 20–22 June 2011; Abdulrazak, B., Giroux, S., Bouchard, B., Pigot, H., Mokhtari, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 121–128. [Google Scholar] [CrossRef]
  55. Schiller, I.; Koch, R. Improved Video Segmentation by Adaptive Combination of Depth Keying and Mixture-of-Gaussians. In Proceedings of the Image Analysis—17th Scandinavian Conference (SCIA 2011), Ystad, Sweden, 23–25 May 2011; pp. 59–68. [Google Scholar] [CrossRef]
  56. Stone, E.E.; Skubic, M. Evaluation of an inexpensive depth camera for passive in-home fall risk assessment. In Proceedings of the 2011 5th International Conference on Pervasive Computing Technologies for Healthcare (PervasiveHealth) and Workshops, Dublin, Ireland, 23–26 May 2011; pp. 71–77. [Google Scholar] [CrossRef]
  57. Clapés, A.; Reyes, M.; Escalera, S. Multi-modal user identification and object recognition surveillance system. Pattern Recognit. Lett. 2013, 34, 799–808. [Google Scholar] [CrossRef]
  58. Fernandez-Sanchez, E.J.; Diaz, J.; Ros, E. Background Subtraction Based on Color and Depth Using Active Sensors. Sensors 2013, 13, 8895–8915. [Google Scholar] [CrossRef] [PubMed]
  59. Ottonelli, S.; Spagnolo, P.; Mazzeo, P.L.; Leo, M. Improved video segmentation with color and depth using a stereo camera. In Proceedings of the 2013 IEEE International Conference on Industrial Technology (ICIT), Cape Town, South Africa, 25–28 February 2013; pp. 1134–1139. [Google Scholar]
  60. Zhang, X.; Wang, X.; Jia, Y. The visual internet of things system based on depth camera. In Proceedings of the Chinese Intelligent Automation Conference (CIAC 2013), Yangzhou, China, 23–25 August 2013. [Google Scholar]
  61. Braham, M.; Lejeune, A.; Droogenbroeck, M.V. A physically motivated pixel-based model for background subtraction in 3D images. In Proceedings of the 2014 International Conference on 3D Imaging (IC3D), Liege, Belgium, 9–10 December 2014; pp. 1–8. [Google Scholar] [CrossRef]
  62. Camplani, M.; del Blanco, C.R.; Salgado, L.; Jaureguizar, F.; García, N. Multi-sensor background subtraction by fusing multiple region-based probabilistic classifiers. Pattern Recognit. Lett. 2014, 50, 23–33. [Google Scholar] [CrossRef]
  63. Chattopadhyay, P.; Roy, A.; Sural, S.; Mukhopadhyay, J. Pose Depth Volume extraction from RGB-D streams for frontal gait recognition. J. Vis. Commun. Image Represent. 2014, 25, 53–63. [Google Scholar] [CrossRef]
  64. Giordano, D.; Palazzo, S.; Spampinato, C. Kernel Density Estimation Using Joint Spatial-Color-Depth Data for Background Modeling. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 4388–4393. [Google Scholar] [CrossRef]
  65. Murgia, J.; Meurie, C.; Ruichek, Y. An Improved Colorimetric Invariants and RGB-Depth-Based Codebook Model for Background Subtraction Using Kinect. In Proceedings of the Human-Inspired Computing and Its Applications, Tuxtla Gutiérrez, Mexico, 16–22 November 2014; Gelbukh, A., Espinoza, F.C., Galicia-Haro, S.N., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 380–392. [Google Scholar]
  66. Song, Y.M.; Noh, S.; Yu, J.; Park, C.W.; Lee, B.G. Background subtraction based on Gaussian mixture models using color and depth information. In Proceedings of the 2014 International Conference on Control, Automation and Information Sciences (ICCAIS 2014), Gwangju, South Korea, 2–5 December 2014; pp. 132–135. [Google Scholar] [CrossRef]
  67. Boucher, A.; Martinot, O.; Vincent, N. Depth Camera to Improve Segmenting People in Indoor Environments—Real Time RGB-Depth Video Segmentation. In Proceedings of the 10th International Conference on Computer Vision Theory and Applications, Berlin, Germany, 11–14 March 2015; Volume 3, pp. 55–62. [Google Scholar]
  68. Cinque, L.; Danani, A.; Dondi, P.; Lombardi, L. Real-Time Foreground Segmentation with Kinect Sensor. In Proceedings of the Image Analysis and Processing (ICIAP 2015), Genoa, Italy, 11–17 September 2015; Murino, V., Puppo, E., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 56–65. [Google Scholar]
  69. Huang, M.; Chen, Y.; Ji, W.; Miao, C. Accurate and Robust Moving-Object Segmentation for Telepresence Systems. ACM Trans. Intell. Syst. Technol. 2015, 6, 17. [Google Scholar] [CrossRef]
  70. Javed, S.; Bouwmans, T.; Jung, S.K. Depth extended online RPCA with spatiotemporal constraints for robust background subtraction. In Proceedings of the Korea-Japan Workshop on Frontiers of Computer Vision (FCV 2015), Mokpo, South Korea, 28–30 January 2015; pp. 1–6. [Google Scholar]
  71. Nguyen, V.T.; Vu, H.; Tran, T.H. An Efficient Combination of RGB and Depth for Background Subtraction. In Proceedings of the Some Current Advanced Researches on Information and Computer Science in Vietnam: Post-proceedings of The First NAFOSTED Conference on Information and Computer Science, Ha Noi, Vietnam, 13–14 March 2014; Dang, Q.A., Nguyen, X.H., Le, H.B., Nguyen, V.H., Bao, V.N.Q., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 49–63. [Google Scholar] [CrossRef]
  72. Sun, B.; Tillo, T.; Xu, M. Adaptive Model for Background Extraction Using Depth Map. In Proceedings of the Advances in Multimedia Information Processing (PCM 2015): 16th Pacific-Rim Conference on Multimedia, Gwangju, South Korea, 16–18 September 2015; Ho, Y.S., Sang, J., Ro, Y.M., Kim, J., Wu, F., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2015. Part II. pp. 419–427. [Google Scholar] [CrossRef]
  73. Tian, D.; Mansour, H.; Vetro, A. Depth-weighted group-wise principal component analysis for video foreground/background separation. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 3230–3234. [Google Scholar] [CrossRef]
  74. Palmero, C.; Clapés, A.; Bahnsen, C.; Møgelmose, A.; Moeslund, T.B.; Escalera, S. Multi-modal RGB–Depth–Thermal Human Body Segmentation. Int. J. Comput. Vis. 2016, 118, 217–239. [Google Scholar] [CrossRef]
  75. Chacon-Murguia, M.I.; Orozco-Rodríguez, H.E.; Ramirez-Quintana, J.A. Self-Adapting Fuzzy Model for Dynamic Object Detection Using RGB-D Information. IEEE Sens. J. 2017, 17, 7961–7970. [Google Scholar] [CrossRef]
  76. De Gregorio, M.; Giordano, M. WiSARD-based learning and classification of background in RGBD videos. In Proceedings of the New Trends in Image Analysis and Processing (ICIAP 2017), Catania, Italy, 11–15 September 2017; Battiato, S., Farinella, G.M., Leo, M., Gallo, G., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
  77. Javed, S.; Bouwmans, T.; Sultana, M.; Jung, S.K. Moving Object Detection on RGB-D Videos Using Graph Regularized Spatiotemporal RPCA. In Proceedings of the New Trends in Image Analysis and Processing (ICIAP 2017), Catania, Italy, 11–15 September 2017; Battiato, S., Farinella, G.M., Leo, M., Gallo, G., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 230–241. [Google Scholar]
  78. Maddalena, L.; Petrosino, A. Exploiting Color and Depth for Background Subtraction. In Proceedings of the New Trends in Image Analysis and Processing (ICIAP 2017), Catania, Italy, 11–15 September 2017; Battiato, S., Farinella, G.M., Leo, M., Gallo, G., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2017; pp. 254–265. [Google Scholar]
  79. Minematsu, T.; Shimada, A.; Uchiyama, H.; Taniguchi, R. Simple Combination of Appearance and Depth for Foreground Segmentation. In Proceedings of the New Trends in Image Analysis and Processing (ICIAP 2017), Catania, Italy, 11–15 September 2017; Battiato, S., Farinella, G.M., Leo, M., Gallo, G., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2017. [Google Scholar]
  80. Trabelsi, R.; Jabri, I.; Smach, F.; Bouallegue, A. Efficient and fast multi-modal foreground-background segmentation using RGBD data. Pattern Recognit. Lett. 2017, 97, 13–20. [Google Scholar] [CrossRef]
  81. Zhou, X.; Liu, X.; Jiang, A.; Yan, B.; Yang, C. Improving Video Segmentation by Fusing Depth Cues and the Visual Background Extractor (ViBe) Algorithm. Sensors 2017, 17, 1177. [Google Scholar] [CrossRef] [PubMed]
  82. Wang, L.; Gong, M.; Zhang, C.; Yang, R.; Zhang, C.; Yang, Y.H. Automatic Real-Time Video Matting Using Time-of-Flight Camera and Multichannel Poisson Equations. Int. J. Comput. Vis. 2012, 97, 104–121. [Google Scholar] [CrossRef]
  83. Dollár, P.; Zitnick, C.L. Fast Edge Detection Using Structured Forests. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1558–1570. [Google Scholar] [CrossRef] [PubMed]
  84. Maddalena, L.; Petrosino, A. The SOBS algorithm: What are the limits? In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Providence, RI, USA, 16–21 June 2012; pp. 21–26. [Google Scholar] [CrossRef]
  85. Goyette, N.; Jodoin, P.M.; Porikli, F.; Konrad, J.; Ishwar, P. Changedetection.net: A new change detection benchmark dataset. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Providence, RI, USA, 16–21 June 2012; pp. 1–8. [Google Scholar] [CrossRef]
  86. Firman, M. RGBD Datasets: Past, Present and Future. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 661–673. [Google Scholar]
  87. Cai, Z.; Han, J.; Liu, L.; Shao, L. RGB-D datasets using microsoft kinect or similar sensors: A survey. Multimedia Tools Appl. 2017, 76, 4313–4355. [Google Scholar] [CrossRef]
  88. GSM Dataset. Available online: http://gsm.uib.es/ (accessed on 15 May 2018).
  89. Kinect Database. Available online: https://imatge.upc.edu/web/resources/kinect-database-foreground-segmentation (accessed on 15 May 2018).
  90. Gallego, J. Parametric Region-Based Foreground Segmentation in Planar and Multi-View Sequences. Ph.D. Thesis, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain, 2013. [Google Scholar]
  91. MICA-FALL Dataset. Available online: http://mica.edu.vn/perso/Tran-Thi-Thanh-Hai/MFD.html (accessed on 15 May 2018).
  92. MULTIVISION Kinect Dataset. Available online: http://atcproyectos.ugr.es/mvision/index.php?option=com_content&view=article&id=45&Itemid=57 (accessed on 15 May 2018).
  93. MULTIVISION Stereo Dataset. Available online: http://atcproyectos.ugr.es/mvision/index.php?option=com_content&view=article&id=45&Itemid=57 (accessed on 15 May 2018).
  94. Princeton Tracking Benchmark Dataset. Available online: http://tracking.cs.princeton.edu/dataset.html (accessed on 15 May 2018).
  95. Song, S.; Xiao, J. Tracking Revisited Using RGBD Camera: Unified Benchmark and Baselines. In Proceedings of the 2013 IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 233–240. [Google Scholar]
  96. RGB-D Object Detection Dataset. Available online: http://eis.bristol.ac.uk/~mc13306/ (accessed on 15 May 2018).
  97. RGB-D People Dataset. Available online: http://www2.informatik.uni-freiburg.de/~spinello/RGBD-dataset.html (accessed on 15 May 2018).
  98. Spinello, L.; Arras, K.O. People detection in RGB-D data. In Proceedings of the 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 25–30 September 2011; pp. 3838–3843. [Google Scholar] [CrossRef]
  99. SBM-RGBD Dataset. Available online: http://rgbd2017.na.icar.cnr.it/SBM-RGBDdataset.html (accessed on 15 May 2018).
  100. Kim, Y. Kim Method. Unpublished work. 2017. [Google Scholar]
  101. Li, G.L.; Wang, X. AvgM-D. Unpublished work. 2017. [Google Scholar]
  102. Zeng, D.; Zhu, M. Background Subtraction Using Multiscale Fully Convolutional Network. IEEE Access 2018. [Google Scholar] [CrossRef]
  103. Ralli, J.; Díaz, J.; Ros, E. Spatial and temporal constraints in variational correspondence methods. Mach. Vis. Appl. 2013, 24, 275–287. [Google Scholar] [CrossRef]
  104. Hirschmuller, H. Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
  105. Tomasi, M.; Vanegas, M.; Barranco, F.; Daz, J.; Ros, E. Massive Parallel-Hardware Architecture for Multiscale Stereo, Optical Flow and Image-Structure Computation. IEEE Trans. Circuits Syst. Video Technol. 2012, 22, 282–294. [Google Scholar] [CrossRef]
  106. Kwolek, B.; Kepski, M. Human fall detection on embedded platform using depth maps and wireless accelerometer. Comput. Methods Prog. Biomed. 2014, 117, 489–501. [Google Scholar] [CrossRef] [PubMed]
  107. UR Fall Detection Dataset. Available online: http://fenix.univ.rzeszow.pl/~mkepski/ds/uf.html (accessed on 15 May 2018).
  108. BIWI RGBD-ID Dataset. Available online: http://robotics.dei.unipd.it/reid/index.php/8-dataset/2-overview-biwi (accessed on 15 May 2018).
  109. Munaro, M.; Fossati, A.; Basso, A.; Menegatti, E.; Van Gool, L. One-Shot Person Re-identification with a Consumer Depth Camera. In Person Re-Identification; Gong, S., Cristani, M., Yan, S., Loy, C.C., Eds.; Springer: London, UK, 2014; pp. 161–181. [Google Scholar]
  110. IPG Dataset. Available online: http://www.gpiv.upv.es/kinect_data/ (accessed on 15 May 2018).
  111. Albiol, A.; Albiol, A.; Oliver, J.; Mossi, J.M. Who is who at different cameras: people re-identification using depth cameras. IET Comput. Vis. 2012, 6, 378–387. [Google Scholar] [CrossRef]
  112. VAP Trimodal People Segmentation Dataset. Available online: http://www.vap.aau.dk/ (accessed on 15 May 2018).
  113. Li, G.L.; Wang, X. AvgM-D algorithm. Unpublished work. 2017. [Google Scholar]
  114. Kim, Y. Kim Algorithm. Unpublished work. 2017. [Google Scholar]
  115. SBM-RGBD Challenge Results. Available online: http://rgbd2017.na.icar.cnr.it/SBM-RGBDchallengeResults.html (accessed on 15 May 2018).
  116. Bouwmans, T.; Maddalena, L.; Petrosino, A. Scene background initialization: A taxonomy. Pattern Recognit. Lett. 2017, 96, 3–11. [Google Scholar] [CrossRef]
  117. Kajo, I.; Kamel, N.; Ruichek, Y.; Malik, A.S. SVD-Based Tensor-Completion Technique for Background Initialization. IEEE Trans. Image Process. 2018, 27, 3114–3126. [Google Scholar] [CrossRef]
  118. Maddalena, L.; Petrosino, A. Background Model Initialization for Static Cameras. In Background Modeling and Foreground Detection for Video Surveillance; Bouwmans, T., Porikli, F., Hoferlin, B., Vacavant, A., Eds.; Chapman & Hall/CRC: New York, NY, USA, 2014; pp. 3-1–3-6. [Google Scholar]
Figure 1. Background modeling issues related to depth data (highlighted by red ellipses). (a) Depth camouflage; (b) Depth shadows; (c) Specular materials; (d) Out of sensor range.
Figure 2. Wall video from the MULTIVISION Kinect dataset. (a) RGB image; (b) Depth image; (c) GT.
Figure 3. Suitcase video from the MULTIVISION Stereo dataset. (a) RGB image; Disparity estimated using: (b) Var [103]; (c) Phase [104]; and (d) SGBM [105]; (e) GT.
Figure 4. GenSeq video from the RGB-D Object Detection dataset. (a) RGB image; (b) Depth image; (c) GT.
Figure 5. Despatx-ds video from the GSM dataset. (a) RGB image; (b) Depth image; (c) GT.
Figure 6. MultiPeople2 video from the SBM-RGBD dataset (OutOfRange category). (a) RGB image; (b) Depth image; (c) GT.
Table 1. Summary of background subtraction methods for RGBD videos.
Authors & Ref. | Used Data | Depth Data | Model | No. of Models
Eveland et al. (1998) [6] | D | Stereo | Single Gaussian | 1
Gordon et al. (1999) [42] | RGBD | Stereo | MoG | 1
Ivanov et al. (2000) [43] | RGBD | Stereo | Geometric | 1
Harville et al. (2001) [16] | RGBD | Stereo | MoG | 1
Kolmogorov et al. (2005) [44] | RGBD | Stereo | MoG | 1
Crabb et al. (2008) [45] | RGBD | ToF | Thresholding | 2
Guomundsson et al. (2008) [11] | D | ToF | Single Gaussian | 1
Wu et al. (2008) [46] | D | IR | Thresholding | 1
Frick et al. (2009) [7] | D | ToF | MoG | 1
Leens et al. (2009) [47] | RGBD | ToF | ViBe | 2
Stormer et al. (2010) [48] | RGBD | ToF | MoG | 2
Wang et al. (2010) [49] | RGBD | ToF & Stereo | MoG + Single Gaussian | 2
Dondi et al. (2011) [50] | D | ToF | Thresholding | 1
Frick et al. (2011) [51] | RGBD | ToF | Thresholding | 1
Kawabe et al. (2011) [52] | RGBD | Stereo | MoG | 1
Mirante et al. (2011) [53] | RGBD | ToF | Frame diff. + region growing | 2
Rougier et al. (2011) [54] | D | Kinect | Single Gaussian | 1
Schiller and Koch (2011) [55] | RGBD | ToF | MoG + avg. | 2
Stone and Skubic (2011) [56] | D | Kinect | [d_m, d_M] | 1
Han et al. (2012) [9] | D | Kinect | Frame difference | 1
Clapés et al. (2013) [57] | RGBD | Kinect | Single Gaussian | 1
Fernandez-Sanchez et al. (2013) [58] | RGBD | Kinect | Codebook | 1
Mahbub et al. (2013) [10] | D | Kinect | Frame difference | 1
Ottonelli et al. (2013) [59] | RGBD | Stereo | ViBe | 2
Zhang et al. (2013) [60] | D | Kinect | Single Gaussian | 1
Braham et al. (2014) [61] | D | Kinect | Single Gaussian | 2
Camplani and Salgado (2014) [17] | RGBD | Kinect | MoG | 2
Camplani et al. (2014) [62] | RGBD | Kinect | MoG | 2
Chattopadhyay et al. (2014) [63] | RGBD | Kinect | SOBS | 2
Fernandez-Sanchez et al. (2014) [15] | RGBD | Stereo | Codebook | 2
Gallego and Pardás (2014) [18] | RGBD | Kinect | MoG | 2
Giordano et al. (2014) [64] | RGBD | Kinect | KDE | 1
Murgia et al. (2014) [65] | RGBD | Kinect | Codebook | 1
Song et al. (2014) [66] | RGBD | Kinect | MoG | 2
Boucher et al. (2015) [67] | RGBD | Asus | Mean | 1
Cinque et al. (2015) [68] | D | Kinect | Thresholding | 1
Huang et al. (2015) [69] | RGBD | Kinect | Thresholding | 1
Javed et al. (2015) [70] | RGBD | Stereo | RPCA | 2
Nguyen et al. (2015) [71] | RGBD | Kinect | MoG | 2
Sun et al. (2015) [72] | RGBD | Kinect | MoG + Single Gaussian | 2
Tian et al. (2015) [73] | RGBD | Kinect | RPCA | 1
Huang et al. (2016) [19] | RGBD | Kinect | ViBe | 2
Liang et al. (2016) [20] | RGBD | Kinect | MoG | 2
Palmero et al. (2016) [74] | D | Kinect | MoG | 1
Chacon et al. (2017) [75] | RGBD | Kinect | Fuzzy frame diff. | 2
De Gregorio and Giordano (2017) [76] | RGBD | Kinect | WiSARD | 2
Javed et al. (2017) [77] | RGBD, D | Kinect | RPCA | 1
Maddalena and Petrosino (2017) [78] | RGBD | Kinect | SOBS | 2
Minematsu et al. (2017) [79] | RGBD | Kinect | ViBe | 2
Moyà-Alcover et al. (2017) [32] | RGBD | Kinect | KDE | 1
Trabelsi et al. (2017) [80] | RGBD | Kinect & Stereo | KDE | 1
Zhou et al. (2017) [81] | RGBD | Kinect | ViBe | 1
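As a concrete illustration of the simplest model type recurring in Table 1 (a per-pixel single Gaussian on depth), the following sketch maintains a running Gaussian average of the depth at each pixel and classifies pixels by a k-sigma test, skipping invalid depth readings. It is a generic textbook formulation, not the implementation of any specific entry in the table; the class name and the default parameter values are assumptions made for the example.

import numpy as np

class SingleGaussianDepthBG:
    """Per-pixel single-Gaussian background model on depth frames (illustrative)."""

    def __init__(self, first_depth, alpha=0.02, k=2.5, min_std=0.01):
        # Initialize the model from the first depth frame.
        self.mean = first_depth.astype(np.float64)
        self.var = np.full_like(self.mean, min_std ** 2)
        self.alpha, self.k, self.min_std = alpha, k, min_std

    def apply(self, depth):
        # Pixels with no depth reading (0 in many sensors) carry no evidence.
        valid = depth > 0
        std = np.sqrt(self.var)
        # k-sigma test against the per-pixel Gaussian.
        fg = valid & (np.abs(depth - self.mean) > self.k * np.maximum(std, self.min_std))

        # Selective update: adapt the model only on valid background pixels.
        upd = valid & ~fg
        diff = depth - self.mean
        self.mean[upd] += self.alpha * diff[upd]
        self.var[upd] += self.alpha * (diff[upd] ** 2 - self.var[upd])
        return fg

Most of the two-model methods in Table 1 pair such a depth model with an analogous color model and fuse the two outputs, as sketched at the end of the previous section.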
Table 2. Metrics frequently adopted for evaluating background subtraction methods in RGBD videos.
Name | Acronym | Computed as | Better if
Similarity (or Jaccard index) | Si | TP/(TP + FP + FN) | higher
Similarity in Ω | Si_Ω | TP/(TP + FP + FN) in Ω | higher
Recall | Rec | TP/(TP + FN) | higher
Specificity | Sp | TN/(TN + FP) | higher
False Positive Rate | FPR | FP/(FP + TN) | lower
False Negative Rate | FNR | FN/(TP + FN) | lower
Percentage of Wrong Classifications | PWC | 100 × (FP + FN)/(TP + FN + FP + TN) | lower
Precision | Prec | TP/(TP + FP) | higher
F-Measure | F1 | (2 × Prec × Rec)/(Prec + Rec) | higher
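All the metrics in Table 2 derive from the per-pixel counts of true/false positives and negatives obtained by comparing a computed mask against the ground-truth mask. The helper below is an illustrative sketch (the function name and dictionary keys are choices made here, and degenerate cases with zero denominators are not handled); it computes the metrics for a single frame.

import numpy as np

def evaluate_masks(pred, gt):
    """Compute the metrics of Table 2 from two binary foreground masks.

    pred, gt : boolean arrays of the same shape (True = foreground).
    Returns a dict mapping metric acronym to value.
    """
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)

    rec = tp / (tp + fn)                       # Recall
    prec = tp / (tp + fp)                      # Precision
    return {
        "Si":   tp / (tp + fp + fn),           # Similarity (Jaccard index)
        "Rec":  rec,
        "Sp":   tn / (tn + fp),                # Specificity
        "FPR":  fp / (fp + tn),
        "FNR":  fn / (tp + fn),
        "PWC":  100.0 * (fp + fn) / (tp + fn + fp + tn),
        "Prec": prec,
        "F1":   2 * prec * rec / (prec + rec),
    }

Si_Ω is obtained by applying the same Jaccard computation after restricting both masks to the region of interest Ω.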
Table 3. Some publicly available RGBD datasets for background subtraction.
Name & Refs. | Source | GT Masks | No. of Videos | Adopted by | Main Application
GSM [32,88] | Kinect | Yes | 7 | [32,80] | Background subtraction
Kinect [89,90] | Kinect | No | 9 | [90] | Background subtraction
MICA-FALL [71,91] | Kinect | No | 240 | [71] | Analysis of human activities
MULTIVISION Kinect [58,92] | Kinect | Yes | 4 | [19,58,75,80] | Background subtraction
MULTIVISION Stereo [15,93] | Stereo | Yes | 4 | [15,70,80] | Background subtraction
Princeton Tracking Benchmark [94,95] | Kinect | No | 100 | [81] | Tracking
RGB-D Object Detection [17,96] | Kinect | Yes | 4 | [17,20,62,69,71,75,80] | Background subtraction
RGB-D People [97,98] | Kinect | No | 3 | [62] | People tracking
SBM-RGBD [33,99] | Kinect | Yes | 33 | [76,77,78,79,100,101,102] | Background subtraction
Table 4. Performance results of various background subtraction methods on the RGBD videos of the MULTIVISION Kinect dataset. F1 and σF1 are the mean and the standard deviation over four GT masks for each video. In boldface, the best results for each metric and each sequence.
Video | Method | F1 | σF1
ChairBox | CB | 0.847 | 0.057
ChairBox | C-KDE | 0.881 | 0.062
ChairBox | CB-D | 0.854 | 0.058
ChairBox | D-KDE | 0.933 | 0.047
ChairBox | DECB | 0.914 | 0.027
ChairBox | CB4D | 0.886 | 0.050
ChairBox | RSBS | 0.895 | -
ChairBox | RGBD-KDE | 0.962 | 0.032
ChairBox | SAFBS | 0.910 | -
Hallway | CB | 0.555 | 0.189
Hallway | C-KDE | 0.632 | 0.052
Hallway | CB-D | 0.770 | 0.097
Hallway | D-KDE | 0.923 | 0.072
Hallway | DECB | 0.783 | 0.187
Hallway | CB4D | 0.617 | 0.190
Hallway | RSBS | 0.843 | -
Hallway | RGBD-KDE | 0.873 | 0.061
Hallway | SAFBS | 0.745 | -
Shelves | CB | 0.699 | 0.192
Shelves | C-KDE | 0.885 | 0.012
Shelves | CB-D | 0.835 | 0.137
Shelves | D-KDE | 0.709 | 0.110
Shelves | DECB | 0.848 | 0.128
Shelves | CB4D | 0.711 | 0.205
Shelves | RSBS | 0.900 | -
Shelves | RGBD-KDE | 0.921 | 0.033
Shelves | SAFBS | 0.894 | -
Wall | CB | 0.843 | 0.108
Wall | C-KDE | 0.918 | 0.054
Wall | CB-D | 0.595 | 0.414
Wall | D-KDE | 0.665 | 0.178
Wall | DECB | 0.938 | 0.029
Wall | CB4D | 0.868 | 0.054
Wall | RSBS | - | -
Wall | RGBD-KDE | 0.886 | 0.009
Wall | SAFBS | 0.930 | -
Table 5. Performance results of various background subtraction methods on the RGBD videos of the MULTIVISION Stereo dataset, using depth from three disparity estimation algorithms (Var [103], Phase [104], and SGBM [105]) or using only color (RGB). F1 and σF1 are the mean and the standard deviation over four GT masks for each video. In boldface, the best results for each metric and each video.
Video | Method | Var F1 (σF1) | Phase F1 (σF1) | SGBM F1 (σF1) | RGB F1 (σF1)
Suitcase | CB | - | - | - | 0.520 (0.261)
Suitcase | C-KDE | - | - | - | 0.688 (0.087)
Suitcase | CB-D | 0.728 (0.106) | 0.409 (0.118) | 0.560 (0.381) | -
Suitcase | D-KDE | 0.755 (0.068) | 0.701 (0.102) | 0.783 (0.080) | -
Suitcase | DEOR-PCA | 0.826 (-) | 0.431 (-) | 0.413 (-) | -
Suitcase | DECB | 0.766 (0.089) | 0.499 (0.147) | 0.790 (0.105) | -
Suitcase | DECB-LF | 0.750 (0.116) | 0.724 (0.116) | 0.765 (0.143) | -
Suitcase | RGBD-KDE | 0.847 (0.084) | 0.769 (0.057) | 0.865 (0.091) | -
LCDScreen | CB | - | - | - | 0.745 (0.132)
LCDScreen | C-KDE | - | - | - | 0.723 (0.074)
LCDScreen | CB-D | 0.400 (0.226) | 0.535 (0.099) | 0.094 (0.115) | -
LCDScreen | D-KDE | 0.666 (0.172) | 0.600 (0.199) | 0.782 (0.090) | -
LCDScreen | DEOR-PCA | 0.764 (-) | 0.684 (-) | 0.668 (-) | -
LCDScreen | DECB | 0.784 (0.084) | 0.639 (0.032) | 0.820 (0.061) | -
LCDScreen | DECB-LF | 0.803 (0.074) | 0.691 (0.071) | 0.832 (0.075) | -
LCDScreen | RGBD-KDE | 0.904 (0.042) | 0.820 (0.064) | 0.911 (0.045) | -
Crossing | CB | - | - | - | 0.653 (0.112)
Crossing | C-KDE | - | - | - | 0.741 (0.203)
Crossing | CB-D | 0.278 (0.334) | 0.259 (0.294) | 0.387 (0.366) | -
Crossing | D-KDE | 0.479 (0.280) | 0.507 (0.122) | 0.538 (0.252) | -
Crossing | DEOR-PCA | 0.906 (-) | 0.620 (-) | 0.416 (-) | -
Crossing | DECB | 0.780 (0.082) | 0.636 (0.051) | 0.804 (0.048) | -
Crossing | DECB-LF | 0.791 (0.082) | 0.765 (0.111) | 0.851 (0.038) | -
Crossing | RGBD-KDE | 0.807 (0.089) | 0.821 (0.042) | 0.872 (0.011) | -
LabDoor | CB | - | - | - | 0.601 (0.176)
LabDoor | C-KDE | - | - | - | 0.527 (0.081)
LabDoor | CB-D | 0.303 (0.310) | 0.217 (0.283) | 0.201 (0.321) | -
LabDoor | D-KDE | 0.573 (0.144) | 0.499 (0.297) | 0.476 (0.220) | -
LabDoor | DEOR-PCA | 0.780 (-) | 0.547 (-) | 0.572 (-) | -
LabDoor | DECB | 0.548 (0.190) | 0.658 (0.137) | 0.661 (0.156) | -
LabDoor | DECB-LF | 0.674 (0.176) | 0.673 (0.140) | 0.691 (0.145) | -
LabDoor | RGBD-KDE | 0.614 (0.191) | 0.552 (0.099) | 0.759 (0.182) | -
Table 6. Performance results of various background subtraction methods on the RGBD videos of the RGB-D Object Detection dataset. In boldface, the best results for each metric and each sequence.
Video | Method | PWC | FNR | FPR | Si | Si_Ω
GenSeq | CL_C | 2.38 | 0.1638 | 0.0063 | 0.72 | 0.55
GenSeq | C-KDE | 2.43 | 0.1208 | 0.0088 | 0.81 | 0.66
GenSeq | CL_D | 2.06 | 0.0177 | 0.0209 | 0.78 | 0.42
GenSeq | D-KDE | 0.75 | 0.0098 | 0.0107 | 0.91 | 0.61
GenSeq | MoG4D | 1.93 | 0.0063 | 0.0209 | 0.79 | 0.45
GenSeq | ViBe_RGB+D | 12.39 | 0.0065 | 0.1385 | 0.44 | 0.12
GenSeq | MoG_RGB+D | 2.03 | 0.1701 | 0.0016 | 0.79 | 0.61
GenSeq | CL_W | 1.30 | 0.0149 | 0.0127 | 0.83 | 0.53
GenSeq | AMDF | 0.94 | 0.0956 | 0.0041 | - | -
GenSeq | RFBS | 0.52 | 0.0386 | 0.0045 | 0.92 | -
GenSeq | EC-RGBD | - | - | - | 0.87 | -
GenSeq | MoG-RegPRE | 0.85 | 0.0128 | 0.0079 | 0.88 | -
GenSeq | GSM_UB | 1.38 | 0.0104 | 0.0144 | 0.83 | 0.78
GenSeq | GSM_UF | 1.30 | 0.0408 | 0.0130 | 0.83 | 0.78
GenSeq | RGBD-KDE | 0.85 | 0.0115 | 0.0058 | 0.87 | 0.65
GenSeq | SAFBS | 0.60 | 0.0428 | 0.0050 | 0.92 | -
ColCamSeq | CL_C | 39.02 | 0.8227 | 0.0227 | 0.22 | 0.37
ColCamSeq | C-KDE | 16.54 | 0.5972 | 0.0269 | 0.51 | 0.40
ColCamSeq | CL_D | 2.47 | 0.0258 | 0.0238 | 0.91 | 0.78
ColCamSeq | D-KDE | 2.68 | 0.0125 | 0.0090 | 0.94 | 0.77
ColCamSeq | MoG4D | 3.49 | 0.0038 | 0.0613 | 0.91 | 0.81
ColCamSeq | ViBe_RGB+D | 6.94 | 0.0017 | 0.1269 | 0.81 | 0.74
ColCamSeq | MoG_RGB+D | 38.47 | 0.8287 | 0.0075 | 0.22 | 0.35
ColCamSeq | CL_W | 3.20 | 0.0352 | 0.0292 | 0.89 | 0.77
ColCamSeq | AMDF | 1.89 | 0.0387 | 0.0299 | - | -
ColCamSeq | EC-RGBD | - | - | - | 0.95 | -
ColCamSeq | GSM_UB | 2.30 | 0.0710 | 0.0321 | 0.90 | 0.52
ColCamSeq | GSM_UF | 2.20 | 0.0294 | 0.0436 | 0.92 | 0.53
ColCamSeq | RGBD-KDE | 2.72 | 0.0299 | 0.0155 | 0.83 | 0.65
DCamSeq | CL_C | 1.78 | 0.1560 | 0.0095 | 0.67 | 0.62
DCamSeq | C-KDE | 2.27 | 0.0689 | 0.0074 | 0.74 | 0.59
DCamSeq | CL_D | 3.38 | 0.4849 | 0.0064 | 0.40 | 0.39
DCamSeq | D-KDE | 3.47 | 0.3012 | 0.0155 | 0.52 | 0.54
DCamSeq | MoG4D | 2.11 | 0.1525 | 0.0131 | 0.61 | 0.61
DCamSeq | ViBe_RGB+D | 9.31 | 0.0548 | 0.0955 | 0.30 | 0.60
DCamSeq | MoG_RGB+D | 3.57 | 0.6087 | 0.0009 | 0.32 | 0.27
DCamSeq | CL_W | 2.46 | 0.3221 | 0.0066 | 0.55 | 0.51
DCamSeq | AMDF | 10.78 | 0.7686 | 0.0684 | - | -
DCamSeq | GSM_UB | 1.74 | 0.2045 | 0.0046 | 0.64 | 0.54
DCamSeq | GSM_UF | 1.65 | 0.2206 | 0.0061 | 0.65 | 0.55
DCamSeq | RGBD-KDE | 1.62 | 0.1000 | 0.0049 | 0.76 | 0.61
ShSeq | CL_C | 5.37 | 0.1820 | 0.0323 | 0.67 | 0.63
ShSeq | C-KDE | 3.90 | 0.0758 | 0.0259 | 0.74 | 0.58
ShSeq | CL_D | 0.98 | 0.0095 | 0.0098 | 0.93 | 0.67
ShSeq | D-KDE | 0.57 | 0.0044 | 0.0041 | 0.88 | 0.70
ShSeq | MoG4D | 3.94 | 0.0059 | 0.0450 | 0.77 | 0.66
ShSeq | ViBe_RGB+D | 7.15 | 0.0001 | 0.0834 | 0.66 | 0.54
ShSeq | MoG_RGB+D | 3.43 | 0.2351 | 0.0008 | 0.75 | 0.58
ShSeq | CL_W | 0.81 | 0.0160 | 0.0068 | 0.94 | 0.71
ShSeq | AMDF | 1.46 | 0.0872 | 0.0047 | - | -
ShSeq | RFBS | 0.47 | 0.0072 | 0.0045 | 0.96 | -
ShSeq | EC-RGBD | - | - | - | 0.91 | -
ShSeq | GSM_UB | 0.87 | 0.0098 | 0.0088 | 0.93 | 0.76
ShSeq | GSM_UF | 1.66 | 0.0014 | 0.0192 | 0.89 | 0.65
ShSeq | RGBD-KDE | 0.52 | 0.0122 | 0.0049 | 0.93 | 0.80
ShSeq | SAFBS | 0.65 | 0.0037 | 0.0070 | 0.95 | -
Table 7. Performance results of various background subtraction methods on the RGBD videos of the GSM dataset. In boldface, the best results for each metric and each sequence.
Video | Method | Rec | Sp | FPR | FNR | PWC | F1 | Prec
Sleeping-ds | C-KDE | 0.720 | 0.780 | - | - | - | 0.705 | 0.690
Sleeping-ds | D-KDE | 0.790 | 0.830 | - | - | - | 0.795 | 0.800
Sleeping-ds | GSM_UB | 0.810 | 0.980 | 0.020 | 0.190 | 10.390 | 0.890 | 0.980
Sleeping-ds | GSM_UF | 0.960 | 0.960 | 0.040 | 0.040 | 3.980 | 0.960 | 0.950
Sleeping-ds | RGBD-KDE | 0.890 | 0.880 | - | - | - | 0.900 | 0.910
TimeOfDay-ds | C-KDE | 0.150 | 0.480 | - | - | - | 0.233 | 0.520
TimeOfDay-ds | D-KDE | 0.370 | 0.610 | - | - | - | 0.469 | 0.640
TimeOfDay-ds | GSM_UB | 0.000 | 1.000 | 0.000 | 0.000 | 0.190 | 0.000 | 0.000
TimeOfDay-ds | GSM_UF | 0.000 | 1.000 | 0.000 | 0.000 | 0.310 | 0.000 | 0.000
TimeOfDay-ds | RGBD-KDE | 0.490 | 0.750 | - | - | - | 0.570 | 0.680
Cespatx-ds | C-KDE | 0.750 | 0.750 | - | - | - | 0.755 | 0.760
Cespatx-ds | D-KDE | 0.910 | 0.900 | - | - | - | 0.925 | 0.940
Cespatx-ds | GSM_UB | 0.960 | 0.990 | 0.010 | 0.040 | 2.890 | 0.970 | 0.990
Cespatx-ds | GSM_UF | 0.980 | 0.990 | 0.010 | 0.020 | 1.490 | 0.990 | 0.990
Cespatx-ds | RGBD-KDE | 0.910 | 0.940 | - | - | - | 0.939 | 0.970
Despatx-ds | C-KDE | 0.880 | 0.910 | - | - | - | 0.909 | 0.940
Despatx-ds | D-KDE | 0.720 | 0.700 | - | - | - | 0.739 | 0.760
Despatx-ds | GSM_UB | 0.940 | 0.990 | 0.010 | 0.060 | 3.390 | 0.970 | 0.990
Despatx-ds | GSM_UF | 0.970 | 1.000 | 0.000 | 0.030 | 0.000 | 0.980 | 0.990
Despatx-ds | RGBD-KDE | 0.910 | 0.940 | - | - | - | 0.920 | 0.930
Shadows-ds | C-KDE | 0.920 | 0.920 | - | - | - | 0.935 | 0.950
Shadows-ds | D-KDE | 0.980 | 0.990 | - | - | - | 0.985 | 0.990
Shadows-ds | GSM_UB | 0.960 | 1.000 | 0.000 | 0.040 | 1.810 | 0.980 | 1.000
Shadows-ds | GSM_UF | 0.980 | 1.000 | 0.000 | 0.020 | 1.040 | 0.990 | 0.990
Shadows-ds | RGBD-KDE | 1.000 | 0.990 | - | - | - | 1.000 | 1.000
LightSwitch-ds | GSM_UB | 0.000 | 1.000 | 0.000 | 0.000 | 0.110 | 0.000 | 0.000
LightSwitch-ds | GSM_UF | 0.000 | 1.000 | 0.000 | 0.000 | 0.340 | 0.000 | 0.000
Bootstraping-ds | C-KDE | 0.840 | 0.880 | - | - | - | 0.860 | 0.880
Bootstraping-ds | D-KDE | 0.870 | 0.940 | - | - | - | 0.908 | 0.950
Bootstraping-ds | GSM_UB | 0.740 | 1.000 | 0.000 | 0.260 | 6.940 | 0.850 | 0.980
Bootstraping-ds | GSM_UF | 0.850 | 0.990 | 0.010 | 0.150 | 3.910 | 0.910 | 0.980
Bootstraping-ds | RGBD-KDE | 0.910 | 0.980 | - | - | - | 0.953 | 1.000
Table 8. Average results of various background subtraction methods for each category of the SBM-RGBD dataset (Part 1). In boldface, the best results for each metric and each category.
Method | Rec | Sp | FPR | FNR | PWC | Prec | F1
Bootstrapping
RGBD-SOBS | 0.8842 | 0.9925 | 0.0075 | 0.1158 | 2.3270 | 0.9080 | 0.8917
RGB-SOBS | 0.8023 | 0.9814 | 0.0186 | 0.1977 | 4.4221 | 0.8165 | 0.8007
SRPCA | 0.7284 | 0.9914 | 0.0086 | 0.2716 | 3.7409 | 0.9164 | 0.8098
AvgM-D | 0.4587 | 0.9861 | 0.0139 | 0.5413 | 7.1960 | 0.6941 | 0.5350
Kim | 0.8805 | 0.9965 | 0.0035 | 0.1195 | 1.5227 | 0.9566 | 0.9169
SCAD | 0.8997 | 0.9940 | 0.0060 | 0.1003 | 1.8015 | 0.9319 | 0.9134
cwisardH+ | 0.5727 | 0.9616 | 0.0384 | 0.4273 | 8.1381 | 0.5787 | 0.5669
MFCN | 0.9866 | 0.9985 | 0.0015 | 0.0134 | 0.2286 | 0.9885 | 0.9876
ColorCamouflage
RGBD-SOBS | 0.9563 | 0.9927 | 0.0073 | 0.0437 | 1.2161 | 0.9434 | 0.9488
RGB-SOBS | 0.4310 | 0.9767 | 0.0233 | 0.5690 | 16.0404 | 0.8018 | 0.4864
SRPCA | 0.8476 | 0.9389 | 0.0611 | 0.1524 | 4.3124 | 0.8367 | 0.8329
AvgM-D | 0.9001 | 0.9793 | 0.0207 | 0.0999 | 2.0719 | 0.8096 | 0.8508
Kim | 0.9737 | 0.9927 | 0.0073 | 0.0263 | 0.7389 | 0.9754 | 0.9745
SCAD | 0.9875 | 0.9904 | 0.0096 | 0.0125 | 0.7037 | 0.9677 | 0.9775
cwisardH+ | 0.9533 | 0.9849 | 0.0151 | 0.0467 | 1.1931 | 0.9502 | 0.9510
MFCN | 0.9859 | 0.9977 | 0.0023 | 0.0141 | 0.4272 | 0.9893 | 0.9876
DepthCamouflage
RGBD-SOBS | 0.8401 | 0.9985 | 0.0015 | 0.1599 | 0.9778 | 0.9682 | 0.8936
RGB-SOBS | 0.9725 | 0.9856 | 0.0144 | 0.0275 | 1.5809 | 0.8354 | 0.8935
SRPCA | 0.8679 | 0.9778 | 0.0222 | 0.1321 | 2.9944 | 0.7850 | 0.8083
AvgM-D | 0.8368 | 0.9922 | 0.0078 | 0.1632 | 1.6943 | 0.8860 | 0.8538
Kim | 0.8702 | 0.9968 | 0.0032 | 0.1298 | 0.9820 | 0.9433 | 0.9009
SCAD | 0.9841 | 0.9963 | 0.0037 | 0.0159 | 0.4432 | 0.9447 | 0.9638
cwisardH+ | 0.6821 | 0.9949 | 0.0051 | 0.3179 | 2.4049 | 0.9016 | 0.7648
MFCN | 0.9870 | 0.9986 | 0.0014 | 0.0130 | 0.2134 | 0.9741 | 0.9804
Table 9. Average results of various background subtraction methods for each category of the SBM-RGBD dataset (Part 2). In boldface, the best results for each metric and each category.
Method | Rec | Sp | FPR | FNR | PWC | Prec | F1
IlluminationChanges
RGBD-SOBS | 0.4514 | 0.9955 | 0.0045 | 0.0486 | 0.9321 | 0.4737 | 0.4597
RGB-SOBS | 0.4366 | 0.9715 | 0.0285 | 0.0634 | 3.5022 | 0.4759 | 0.4527
SRPCA | 0.4795 | 0.9816 | 0.0184 | 0.0205 | 1.9171 | 0.4159 | 0.4454
AvgM-D | 0.3392 | 0.9858 | 0.0142 | 0.1608 | 3.0717 | 0.4188 | 0.3569
Kim | 0.4479 | 0.9935 | 0.0065 | 0.0521 | 1.1395 | 0.4587 | 0.4499
SCAD | 0.4699 | 0.9927 | 0.0073 | 0.0301 | 0.9715 | 0.4567 | 0.4610
cwisardH+ | 0.4707 | 0.9914 | 0.0086 | 0.0293 | 1.0754 | 0.4504 | 0.4581
MFCN | 0.4986 | 0.9987 | 0.0013 | 0.0014 | 0.1255 | 0.4912 | 0.4949
IntermittentMotion
RGBD-SOBS | 0.8921 | 0.9970 | 0.0030 | 0.1079 | 0.8648 | 0.9544 | 0.9202
RGB-SOBS | 0.9265 | 0.9028 | 0.0972 | 0.0735 | 9.3877 | 0.4054 | 0.5397
SRPCA | 0.8893 | 0.9629 | 0.0371 | 0.1107 | 3.7026 | 0.7208 | 0.7735
AvgM-D | 0.8976 | 0.9912 | 0.0088 | 0.1024 | 1.4603 | 0.9115 | 0.9027
Kim | 0.9418 | 0.9938 | 0.0062 | 0.0582 | 0.9213 | 0.9385 | 0.9390
SCAD | 0.9563 | 0.9914 | 0.0086 | 0.0437 | 0.8616 | 0.9243 | 0.9375
cwisardH+ | 0.8086 | 0.9558 | 0.0442 | 0.1914 | 5.0851 | 0.5984 | 0.6633
MFCN | 0.9906 | 0.9987 | 0.0013 | 0.0094 | 0.2466 | 0.9836 | 0.9870
OutOfRange
RGBD-SOBS | 0.9170 | 0.9975 | 0.0025 | 0.0830 | 0.5613 | 0.9362 | 0.9260
RGB-SOBS | 0.8902 | 0.9896 | 0.0104 | 0.1098 | 1.3610 | 0.8237 | 0.8527
SRPCA | 0.8785 | 0.9878 | 0.0122 | 0.1215 | 1.6100 | 0.7443 | 0.8011
AvgM-D | 0.6319 | 0.9860 | 0.0140 | 0.3681 | 2.7663 | 0.6360 | 0.6325
Kim | 0.9040 | 0.9961 | 0.0039 | 0.0960 | 0.8228 | 0.9216 | 0.9120
SCAD | 0.9286 | 0.9965 | 0.0035 | 0.0714 | 0.5711 | 0.9357 | 0.9309
cwisardH+ | 0.8959 | 0.9956 | 0.0044 | 0.1041 | 0.8731 | 0.9038 | 0.8987
MFCN | 0.9917 | 0.9982 | 0.0018 | 0.0083 | 0.2018 | 0.9613 | 0.9763
Shadows
RGBD-SOBS | 0.9323 | 0.9970 | 0.0030 | 0.0677 | 0.7001 | 0.9733 | 0.9500
RGB-SOBS | 0.9359 | 0.9881 | 0.0119 | 0.0641 | 1.5128 | 0.9140 | 0.9218
SRPCA | 0.7592 | 0.9768 | 0.0232 | 0.2408 | 4.0602 | 0.8128 | 0.7591
AvgM-D | 0.8812 | 0.9876 | 0.0124 | 0.1188 | 1.9330 | 0.8927 | 0.8784
Kim | 0.9270 | 0.9934 | 0.0066 | 0.0730 | 1.0771 | 0.9404 | 0.9314
SCAD | 0.9665 | 0.9910 | 0.0090 | 0.0335 | 1.0093 | 0.9276 | 0.9458
cwisardH+ | 0.9518 | 0.9877 | 0.0123 | 0.0482 | 1.3942 | 0.9062 | 0.9264
MFCN | 0.9893 | 0.9983 | 0.0017 | 0.0107 | 0.2178 | 0.9842 | 0.9867
