Article

Point-Plane SLAM Using Supposed Planes for Indoor Environments

Xiaoyu Zhang, Wei Wang, Xianyu Qi, Ziwei Liao and Ran Wei
1 Robotics Institute, Beihang University, Beijing 100191, China
2 Beijing Evolver Robotics Technology Co., Ltd., Beijing 100192, China
* Author to whom correspondence should be addressed.
Submission received: 30 July 2019 / Revised: 20 August 2019 / Accepted: 29 August 2019 / Published: 2 September 2019
(This article belongs to the Special Issue Mobile Robot Navigation)

Abstract: Simultaneous localization and mapping (SLAM) is a fundamental problem for various applications. For indoor environments, planes are predominant features that are less affected by measurement noise. In this paper, we propose a novel point-plane SLAM system using RGB-D cameras. First, we extract feature points from RGB images and planes from depth images. Plane correspondences in the global map are then found using their contours. Considering the limited size of real planes, we exploit constraints of plane edges. In general, a plane edge is an intersecting line of two perpendicular planes. Therefore, instead of line-based constraints, we calculate and generate supposed perpendicular planes from edge lines, resulting in more plane observations and constraints that reduce estimation errors. To exploit the orthogonal structure of indoor environments, we also add structural (parallel or perpendicular) constraints between planes. Finally, we construct a factor graph using all of these features, and the cost functions are minimized to estimate the camera poses and the global map. We test our proposed system on public RGB-D benchmarks, demonstrating robust and accurate pose estimation compared with other state-of-the-art SLAM systems.

1. Introduction

Simultaneous localization and mapping (SLAM) has developed quickly in recent years and has become a fundamental problem for various applications, including mobile robots and augmented and virtual reality. Various sensors can be used for SLAM, such as laser-range finders [1,2] and cameras [3]. Laser-range finders provide accurate information about the environment but are too expensive to be widely adopted. Cameras, which are much cheaper, can also provide abundant information. With the availability of cheap RGB-D cameras [4,5], the depth of the scene can also be measured easily, especially in indoor environments.
Most existing SLAM methods are based on a collection of points, using points to describe the scene and estimate the camera poses. Points can be described by simple mathematical expressions and applied in both indoor and outdoor environments, but point-based methods encounter various problems in practical applications, such as low-texture environments and changing lighting. Besides, correct data association is also a challenge for point-based SLAM to obtain reliable estimation results. Direct methods are based on image intensities, which can be affected by changing light or viewing angles. Feature-based methods generally search for corresponding points using descriptors, so their results depend on the reliability of feature detection and matching. The errors from point measurement noise and data association accumulate, especially in large scenes. These problems are hard to solve using only points.
For indoor environments, which are also common working scenes for mobile robots, there are many other high-level features, such as lines and planes. These high-level features can be extracted easily using RGB-D cameras and enable faster and more accurate data association. Planes, which are computed from many points, are more robust and accurate because they are less affected by measurement noise. Therefore, using these high-level features helps to improve the performance of SLAM. Indoor environments contain various man-made objects and structures with many parallel and perpendicular planes. Using such structural constraints also helps to achieve long-term association for planes, resulting in a smaller accumulated error. A plane is usually described mathematically as an infinite plane. However, the real planes in working environments have a limited size, contours and edges, and these features can also be exploited to add constraints for robust pose estimation.
In this paper, we propose a SLAM system using both points and planes to achieve robust and accurate estimation results. Using an RGB-D camera, we detect and match feature points in RGB images and generate point clouds from depth images to extract planes. Unlike other plane-based SLAM systems using infinite planes, we make use of plane edges. For indoor environments, a plane edge is generally an intersecting line of two perpendicular planes. To add the constraints of plane edges, we calculate and generate their perpendicular planes even when they are not seen. We also use the contour points of planes to achieve robust data association. Then we use all of these points and planes to estimate the camera poses and generate a map consisting of points and planes. Besides, we add parallel and perpendicular constraints for planes, which help reduce drift errors in indoor environments.
In summary, the contributions of our work are as follows:
  • We exploit plane edge constraints by generating supposed perpendicular planes from them.
  • We achieve robust data association for planes using their contour points.
  • We add perpendicular and parallel constraints for planes, which reduce drift errors in indoor environments.
  • We evaluate our proposed system on public datasets and achieve state-of-the-art performance while running in near real time.

2. Related Work

Many different SLAM algorithms have been proposed in recent years, and most of them formulate the SLAM problem as a nonlinear least-squares problem [6]. Point-based SLAM tracks features across frames and builds a global map consisting of points. ORB-SLAM [7] tracks ORB feature points and uses the reprojection error to estimate camera poses, while direct methods [8] track poses using intensity errors. Other point-based SLAM methods obtain a sparse map [9], a semi-dense map [10] or even a dense map [11,12]. However, all of these works may have difficulty with data association and work poorly in low-texture environments or large scenes.
In recent years, planes have also been exploited to improve the performance of SLAM algorithms. Some of the earliest works [13,14] add planes to the extended Kalman filter (EKF) state vector, which is computationally costly because of the growing size of the dense covariance matrix [15]; therefore, they are limited to small scenes. Gostar et al. [16] also discuss the transition model of plane features. Taguchi et al. [17] present a framework for registration combining points and planes. CPA-SLAM [18] proposes a novel formulation for tracking camera motion using global planes in an expectation-maximization (EM) framework. Although it uses soft labeling to reduce the effect of incorrect plane associations, the association can still be wrong in global optimization; our proposed hard data association method deals with these problems well. Kaess [19] introduces a minimal representation for infinite planes that is suitable for least-squares estimation without encountering singularities, and a fast dense planar SLAM algorithm is developed in [20]. Eigen-Factors [21] proposes a novel reformulation of plane estimation that optimizes trajectories without an explicit parametrization of planes. All of the works mentioned above use infinite planes with three degrees of freedom to simplify the representation of real planes, and they may work poorly in scenes where plane features are not plentiful. Considering the limited size of real planes, our work exploits the edges of planes to add more useful constraints, achieving more accurate and robust estimation. Besides, previous works ignore the structural constraints between planes, which are very useful for indoor environments.
In indoor environments, the spatial structure can be utilized to simplify pose estimation or to make it more robust and accurate. Yang et al. [22] propose a pop-up 3D plane model to generate plane landmark measurements in SLAM. Most indoor scenes are based on a Cartesian coordinate system, the so-called Manhattan World (MW) [23]. Zhou et al. [24] utilize mean-shift to track the dominant directions of the MW and achieve drift-free rotation by decoupling the estimation of rotation and translation. Other works [25,26,27] also exploit the planes of the MW to estimate drift-free rotation. These algorithms work well in some specific scenes, but they easily fail when the MW assumption does not hold. They inspire us to use parallel or perpendicular constraints instead of three dominant directions, which works in more indoor scenes.

3. Proposed Methods

3.1. System Overview

In this subsection, we provide an overview of our proposed point-plane SLAM using supposed planes from edges, which is shown in Figure 1. Like other modern SLAM systems, ours can be divided into two functional parts: (1) the front-end, i.e., the tracking part, which extracts and matches features for each newly captured frame and estimates the camera pose by minimizing the error function constituted by the tracked features in the map; (2) the back-end, i.e., the map management part, which estimates and optimizes the landmarks in the environment. Point-based SLAM is already well established, so we build our system on the publicly available RGB-D implementation of ORB-SLAM2 [7] and focus on exploiting planes.
Global Map. The global map consists of a set of keyframes and detected landmarks, including both points and planes. The keyframes contain the observed feature points and their descriptors, the observed planes and the supposed planes. For a point landmark, we store a list of observations and a representative descriptor. Besides its observations, a plane landmark also contains its contour points, edge lines and corresponding parallel or perpendicular planes, which will be explained in Section 3.3.
Tracking. The RGB-D camera (for example, Microsoft Kinect v2) provides RGB images and depth images. We extract ORB features [28] from RGB images and match them by descriptors. From depth images, we construct 3D point clouds and then extract planes. For indoor environments, to achieve more accurate and robust implementation, we also detect edges of the plane to calculate supposed planes and add constraints between parallel or perpendicular planes. With matched points and planes from the last frame or local map, the camera pose can be estimated.
Map Management. From keyframes, a local map consisting of planes and points is constructed and updated during local mapping. We also perform a global optimization after loop closing to construct a consistent global map. Loops are detected using a bag-of-words representation based on ORB features.

3.2. Preliminaries

We represent the pose of frame $k$ with respect to the world coordinate system $w$ by $T_{kw} \in SE(3)$, a rigid transformation that transforms a 3D point $P_w$ from the world to the camera coordinate system:
$$P_k = T_{kw} P_w$$ (1)
The point in Equation (1) is represented by homogeneous coordinates $P = (p_1, p_2, p_3, p_4)^\top \in \mathbb{P}^3$, and the corresponding Euclidean point is $p = (p_1/p_4, p_2/p_4, p_3/p_4)^\top \in \mathbb{R}^3$. When a 3D point $p$ is observed by the camera, there is a corresponding 2D pixel $u = (u, v)^\top$ in the image, where $u$ and $v$ define the position of the pixel. For aligned RGB and depth images, the same point is located at the same pixel position. The projection of $p$ onto the image is $u = \rho(p)$, and the back-projection is $p = \rho^{-1}(u)$.
We parametrize planes using the Hessian form $\pi = (n^\top, d)^\top$, where $n = (n_x, n_y, n_z)^\top$ is the unit vector representing the plane's orientation and $d$ is the distance of the plane from the origin [29]. A point $p$ lying on the plane $\pi$ satisfies:
$$n^\top p + d = 0$$ (2)
Similarly, a plane in the world frame can be transformed into the camera frame:
$$\pi_k = T_{kw}^{-\top} \pi_w$$ (3)
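To make these preliminaries concrete, the following is a minimal sketch (not the authors' code) of the point and plane representations and transformations above, using Eigen; the struct and function names are illustrative assumptions.

```cpp
#include <Eigen/Dense>

// Hessian form pi = (n, d): n is the unit normal and d the distance to the origin.
struct Plane {
  Eigen::Vector3d n;
  double d;

  // Left-hand side of Equation (2); zero when p lies on the plane.
  double signedDistance(const Eigen::Vector3d& p) const { return n.dot(p) + d; }

  // Plane as a homogeneous 4-vector (n, d).
  Eigen::Vector4d coeffs() const { return (Eigen::Vector4d() << n, d).finished(); }
};

// Equation (1): transform a point from the world to the camera coordinate system.
Eigen::Vector3d transformPoint(const Eigen::Isometry3d& T_kw, const Eigen::Vector3d& p_w) {
  return T_kw * p_w;
}

// Equation (3): transform a plane with the inverse transpose of the pose.
Plane transformPlane(const Eigen::Isometry3d& T_kw, const Plane& pi_w) {
  Eigen::Vector4d v = T_kw.matrix().inverse().transpose() * pi_w.coeffs();
  double s = v.head<3>().norm();  // keep the normal at unit length
  return Plane{v.head<3>() / s, v(3) / s};
}
```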

3.3. Plane Features

We construct a global map that consists of all plane features in the scenes. Every plane is segmented from organized point clouds generated from depth images. Considering the limited size of real planes, we also exploit constraints of plane edges. We calculate a supposed plane from the plane edge, which may also be observed by other frames. We try to match the plane with all planes in the map and use plane-to-plane constraints to estimate and refine the camera poses.

3.3.1. Plane Segmentation

The RGB-D camera provides RGB images and aligned depth images. In a depth image, each pixel stores the distance between the image plane and the corresponding object in the RGB image, so we can recover the 3D structure by back-projecting pixels with the camera model. We use the pinhole camera model [29] in our work:
$$\begin{bmatrix} x \\ y \\ z \end{bmatrix} = d \begin{bmatrix} f_x^{-1} & 0 & -c_x f_x^{-1} \\ 0 & f_y^{-1} & -c_y f_y^{-1} \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$ (4)
where $u = (u, v)^\top$ is a valid pixel in the depth image and $d$ is its depth value, $p = (x, y, z)^\top$ is the corresponding 3D point, $f_x$ and $f_y$ are the focal lengths of the camera, and $(c_x, c_y)$ is the coordinate of the camera center.
The point clouds generated from depth images are organized, i.e., they keep an image-like grid structure, which enables fast plane segmentation. We follow the work of [30], which segments organized RGB-D point clouds in near real time. In this process, we also obtain the contour of each segmented plane. The contour will be useful for calculating supposed planes and obtaining robust data association of planes.
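As an illustration of Equation (4) and the organized structure described above, here is a minimal sketch that back-projects a depth image into an organized point cloud. The 16-bit depth encoding and the depthScale value are assumptions (TUM-style data), and the function and type names are illustrative, not taken from the authors' code.

```cpp
#include <Eigen/Dense>
#include <cstdint>
#include <limits>
#include <vector>

struct Intrinsics { double fx, fy, cx, cy; };  // pinhole parameters

// Organized cloud: points[v * width + u] corresponds to pixel (u, v); invalid pixels are NaN.
std::vector<Eigen::Vector3d> backProject(const std::vector<uint16_t>& depth,
                                         int width, int height,
                                         const Intrinsics& K,
                                         double depthScale = 1.0 / 5000.0) {
  std::vector<Eigen::Vector3d> points(
      width * height, Eigen::Vector3d::Constant(std::numeric_limits<double>::quiet_NaN()));
  for (int v = 0; v < height; ++v) {
    for (int u = 0; u < width; ++u) {
      const double d = depth[v * width + u] * depthScale;  // depth in meters
      if (d <= 0.0) continue;                              // invalid measurement
      // Equation (4): p = d * K^{-1} * (u, v, 1)^T
      points[v * width + u] =
          Eigen::Vector3d((u - K.cx) * d / K.fx, (v - K.cy) * d / K.fy, d);
    }
  }
  return points;
}
```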

3.3.2. Supposed Plane Serving as Edge Constraint

As described above, a plane can be parametrized by the simple mathematical expression $\pi = (n^\top, d)^\top$. However, $\pi$ represents an infinite plane with three degrees of freedom, so it constrains the camera only along its normal direction; motion parallel to the plane remains unconstrained. Real planes in the scene are not infinite; they have boundaries and edge lines. Using only these four parameters and ignoring the edges therefore discards information about the real planes. Moreover, usually only a few planes can be observed in one frame, resulting in insufficient constraints for pose estimation. So we need to exploit more planes, or constraints from plane edges, to estimate camera poses in real scenes.
In indoor environments, most man-made objects or structures have regular shapes, especially those that contain large plane features. Therefore, it is easy to extract edge lines from these segmented planes. The edge of a plane can also be seen as an intersecting line with another plane, and these two planes are generally perpendicular to each other. To add constraints from plane edges, we calculate and generate a supposed perpendicular plane from every edge line, instead of adding line-based constraints directly. For every captured frame, we not only segment planes from the depth image, but also calculate such supposed planes if there are valid plane edges. Note that supposed planes may also be observed directly in other frames.
When segmenting planes from organized point clouds, we also acquire the contour of each segmented plane. We extract edge lines from the contour using RANSAC [31]. If the inliers are sufficient (more than 15 percent of the contour points in our experiments), the extracted line is considered valid. We then check the position of the line to discard lines lying on the image border, and we examine the points near each line to remove lines extracted from shadow borders. A valid edge line can be represented by
$$l = (p_l^\top, n_l^\top)^\top = (p_x, p_y, p_z, n_x, n_y, n_z)^\top$$ (5)
where $p_l$ is a point on the line and $n_l$ is the direction vector of the line.
The supposed perpendicular plane is calculated from the plane and its edge line. Given the plane $\pi_i = (n_i^\top, d_i)^\top$ and its edge line $l_j = (p_j^\top, n_j^\top)^\top$, the supposed plane is:
$$\pi_{\mathrm{supposed}} = \begin{bmatrix} n_{ij} \\ d_{ij} \end{bmatrix} = \begin{bmatrix} n_i \times n_j \\ -n_{ij}^\top p_j \end{bmatrix}$$ (6)
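The following minimal sketch (an illustration, not the authors' implementation) computes the supposed plane of Equation (6) from a segmented plane and one of its edge lines; the Plane struct matches the sketch in Section 3.2 and the Line struct is an assumed helper.

```cpp
#include <Eigen/Dense>

struct Plane { Eigen::Vector3d n; double d; };  // Hessian form, as in the earlier sketch

struct Line {
  Eigen::Vector3d p;    // a point on the line (p_l)
  Eigen::Vector3d dir;  // unit direction vector (n_l)
};

// Equation (6): the supposed plane is perpendicular to pi_i and contains the edge line l_j.
Plane supposedPlane(const Plane& pi_i, const Line& l_j) {
  const Eigen::Vector3d n_ij = pi_i.n.cross(l_j.dir).normalized();  // normal of the supposed plane
  const double d_ij = -n_ij.dot(l_j.p);                             // plane passes through p_l
  return Plane{n_ij, d_ij};
}
```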
The process of generating supposed planes is shown in Figure 2, using a frame from the sequence ‘freiburg3_cabinet’ of the Technical University of Munich (TUM) RGB-D dataset [32]. There is a cabinet in the image, and three planes of the cabinet are extracted. Five of the cabinet's planes could in principle be observed (all except the bottom plane), but at most three of them are visible in any single frame, although we can imagine the other two. To exploit more constraints from such common man-made objects, we suppose planes (the green planes in Figure 2c) from the plane edges and remove the duplicated ones. When the camera later observes the cabinet from the opposite direction, these supposed planes will be observed directly.
We treat supposed planes as ordinary planes, like those observed by the camera directly. Plane landmarks are also created from supposed planes and may be observed by other frames. Therefore, supposed planes increase the number of planes and add constraints derived from plane edges. Besides, supposed planes are perpendicular to the corresponding observed planes, and planes of different orientations provide richer constraints for accurate pose estimation.

3.3.3. Data Association

Fast and accurate data association is a prerequisite for robust and accurate pose estimation. Feature points can be associated using descriptors. For planes, previous works [18,19] find correspondences simply by the normal $n$ and the distance $d$. This method works in simple environments, but it also depends on an accurate pose estimate of the camera, which is often not available, especially when the current pose is predicted from the previous one. When the plane normal is noisy, the distance usually fluctuates considerably, leading to association failures.
We implement a novel data association algorithm for matching planes. First, we find the intersecting planes in the map. In indoor environments, intersecting planes are either corresponding planes with a small angle difference caused by noise, or different planes (usually perpendicular ones) with a large angle difference. Therefore, we only need to check the angle between intersecting planes to find correspondences, and the angle threshold can be relatively large (30° in our experiments).
To find intersecting planes in the global map, we calculate the distances from points on the observed planes to the plane landmarks $\pi_i = (n_i^\top, d_i)^\top$. To reduce computation, we only examine the points $p_j$ of the contour $C_m$ acquired during plane segmentation:
$$s = \min_{p_j \in C_m} \left| n_i^\top p_j + d_i \right|$$ (7)
If s is smaller than the distance threshold (0.1 m in our experiments), these two planes are considered intersecting planes. Then we examine the angle of these two planes to determine the corresponding plane. For every plane observation, we try to find a corresponding plane with the smallest s in the global map. If a plane observation fails to be associated, it is added to the global map as a new plane landmark.
Besides corresponding planes, we also exploit parallel and perpendicular planes. For every plane, we find a parallel plane if their intersection angle is small enough (10° in our experiments) but the distance s between them is large (larger than 0.1 m in our experiments). If there are several planes, we choose the plane having the smallest intersection angle. Similarly, we find a perpendicular plane if the angle between them is large enough (80° in our experiments).
Therefore, for every plane observation, we try to find one corresponding plane, one parallel plane and one perpendicular plane on the global map.
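A minimal sketch of the association procedure above is given below, with the thresholds taken from the text (0.1 m for the intersection test, 30° for matching). The contour is assumed to be already expressed in map coordinates using the predicted camera pose; Plane is the struct from the earlier sketch and the function names are illustrative.

```cpp
#include <Eigen/Dense>
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Plane { Eigen::Vector3d n; double d; };  // Hessian form, as in the earlier sketch

// Equation (7): smallest absolute distance from the observed plane's contour points to landmark pi_i.
double contourDistance(const Plane& pi_i, const std::vector<Eigen::Vector3d>& contour) {
  double s = std::numeric_limits<double>::max();
  for (const Eigen::Vector3d& p_j : contour)
    s = std::min(s, std::abs(pi_i.n.dot(p_j) + pi_i.d));
  return s;
}

// Return the index of the matched plane landmark, or -1 if the observation is new.
int associatePlane(const Plane& observation, const std::vector<Eigen::Vector3d>& contour,
                   const std::vector<Plane>& landmarks) {
  const double kDistThresh = 0.1;                              // m, intersection test
  const double kAngleThresh = 30.0 * std::acos(-1.0) / 180.0;  // rad, correspondence test
  int best = -1;
  double bestDist = std::numeric_limits<double>::max();
  for (int i = 0; i < static_cast<int>(landmarks.size()); ++i) {
    const double s = contourDistance(landmarks[i], contour);
    if (s > kDistThresh) continue;                             // not intersecting
    const double angle =
        std::acos(std::min(1.0, std::abs(observation.n.dot(landmarks[i].n))));
    if (angle > kAngleThresh) continue;                        // likely a different (e.g., perpendicular) plane
    if (s < bestDist) { bestDist = s; best = i; }
  }
  return best;
}
```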

3.4. Tracking Using Points and Planes

It is well known that the SLAM problem can be represented as a factor graph $G(V, E)$ [33]. The vertices $V$ represent the variables to estimate, such as camera poses and landmarks, and the edges $E$ between vertices represent the constraints. The factor graph enables an insightful visualization of the SLAM problem, and solving it amounts to constructing such a factor graph and minimizing the errors of all involved factors.

3.4.1. Factor Graph Containing Points and Planes

As mentioned before, we extract ORB features in the RGB image for the current frame and find matched point landmarks using ORB descriptors. A simple point-based SLAM can be represented as a factor graph in Figure 3a. The factors are constraints of the reprojection error:
$$f_z(p_w, T_{cw}) = \left\| u_m - \rho(T_{cw} P_w) \right\|_{\Sigma_z}$$ (8)
where $T_{cw} P_w$ is the 3D point in the camera coordinate system and $u_m$ is the corresponding pixel in the current frame. $\|x\|_{\Sigma}$ is the Mahalanobis norm, which equals $x^\top \Sigma^{-1} x$, and $\Sigma$ is the corresponding covariance matrix.
Likewise, planes can also be added to the factor graph as landmarks. As shown in Figure 3b, $m$ denotes a direct observation of a plane and $sp$ denotes that the plane is supposed in that frame. Therefore, $\pi_2$ is a supposed plane in frame $T_2$, but it is observed directly in frames $T_3$ and $T_4$.
Now we need to define the factor connecting the vertices $V(T)$ and $V(\pi)$. Notice that the Hessian form $\pi = (n^\top, d)^\top$ is an over-parameterization, because a 3D plane has only three degrees of freedom. The Hessian form therefore requires an extra constraint to keep the plane normal at unit length, adding computation to the optimization. To overcome this problem, we choose a minimal parameterization of planes for optimization, $\tau = (\phi, \psi, d)$, where $\phi$ and $\psi$ are the azimuth and elevation angles of the normal, respectively,
$$\tau = q(\pi) = (\phi, \psi, d) = \left( \arctan\frac{n_y}{n_x},\ \arcsin n_z,\ d \right)$$ (9)
The azimuth and elevation should be restricted to $(-\pi, \pi]$ to avoid the singularities of the minimal representation during optimization. Now we can define the factor for planes:
$$f_m(\pi_w, T_{cw}) = \left\| q(\pi_m) - q(T_{cw}^{-\top} \pi_w) \right\|_{\Sigma_m}$$ (10)
where $T_{cw}^{-\top} \pi_w$ is the 3D plane expressed in the camera coordinate system and $\pi_m$ is the corresponding plane observation in the current frame. Note that the covariance should be somewhat larger for supposed planes, such as $sp_1$ in Figure 3b.
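As a concrete illustration of Equations (9) and (10), the sketch below converts between the Hessian form and the minimal parameterization and evaluates the (unweighted) plane residual; it assumes the map plane has already been transformed into the camera frame, e.g., with the transformPlane sketch from Section 3.2, and the names are illustrative.

```cpp
#include <Eigen/Dense>
#include <cmath>

struct Plane { Eigen::Vector3d n; double d; };  // Hessian form, as in the earlier sketch

// Equation (9): tau = (phi, psi, d); atan2/asin are the robust forms of arctan/arcsin.
Eigen::Vector3d q(const Plane& pi) {
  return Eigen::Vector3d(std::atan2(pi.n.y(), pi.n.x()), std::asin(pi.n.z()), pi.d);
}

// Inverse mapping, used to update a plane landmark after an optimization step.
Plane fromTau(const Eigen::Vector3d& tau) {
  const double phi = tau(0), psi = tau(1);
  return Plane{Eigen::Vector3d(std::cos(psi) * std::cos(phi),
                               std::cos(psi) * std::sin(phi),
                               std::sin(psi)),
               tau(2)};
}

// Equation (10) without the Mahalanobis weighting: residual between the measured plane
// pi_m and the map plane already expressed in the camera frame, pi_c = T_cw^{-T} pi_w.
Eigen::Vector3d planeResidual(const Plane& pi_m, const Plane& pi_c) {
  Eigen::Vector3d e = q(pi_m) - q(pi_c);
  const double kPi = std::acos(-1.0);
  if (e(0) > kPi)   e(0) -= 2.0 * kPi;   // keep the azimuth difference in (-pi, pi]
  if (e(0) <= -kPi) e(0) += 2.0 * kPi;
  return e;
}
```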
Besides, we also add the structure factors $E(S)$ for indoor environments. Structure factors, i.e., parallel or perpendicular constraints between planes, add more edges between plane vertices, as shown in Figure 3c. The structure factors only constrain the plane normals. For parallel constraints:
$$f_{sp}(\pi_w, T_{cw}) = \left\| q_n(n_m) - q_n(R_{cw} n_w) \right\|_{\Sigma_{sp}}$$ (11)
where $q_n$ denotes the azimuth and elevation angles in Equation (9) and $n$ is the normal of the plane. $R_{cw}$ is the rotation of the current frame. Note that if the normal vectors point in opposite directions, we first rotate them to the same direction.
Similarly, the factor for perpendicular planes is:
$$f_{so}(\pi_w, T_{cw}) = \left\| q_n(R\, n_m) - q_n(R_{cw} n_w) \right\|_{\Sigma_{so}}$$ (12)
The only difference is an additional rotation matrix $R$ that rotates the normals to the same direction.
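A minimal sketch of the structural residuals in Equations (11) and (12) follows (again without the Mahalanobis weighting). How the extra rotation R_perp is built for a given perpendicular pair is an implementation choice not specified in the text, so it is passed in as a parameter here; the names are illustrative.

```cpp
#include <Eigen/Dense>
#include <cmath>

// q_n: azimuth and elevation of a unit normal, as in Equation (9).
Eigen::Vector2d qn(const Eigen::Vector3d& n) {
  return Eigen::Vector2d(std::atan2(n.y(), n.x()), std::asin(n.z()));
}

// Equation (11): parallel constraint on orientations only. n_m is the measured normal,
// n_w the normal of the parallel map plane, R_cw the rotation of the current frame.
// The normals are assumed to be pre-flipped so that they point in the same direction.
Eigen::Vector2d parallelResidual(const Eigen::Vector3d& n_m, const Eigen::Vector3d& n_w,
                                 const Eigen::Matrix3d& R_cw) {
  return qn(n_m) - qn(R_cw * n_w);
}

// Equation (12): perpendicular constraint; R_perp rotates the measured normal into the
// direction of the map-plane normal before comparing the angles.
Eigen::Vector2d perpendicularResidual(const Eigen::Vector3d& n_m, const Eigen::Vector3d& n_w,
                                      const Eigen::Matrix3d& R_cw, const Eigen::Matrix3d& R_perp) {
  return qn(R_perp * n_m) - qn(R_cw * n_w);
}
```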

3.4.2. Pose Estimation

For every newly captured frame, we extract feature points from the RGB image, and planes and supposed planes from the depth image. If the last frame was tracked successfully, we use a constant velocity model to predict the pose of the current frame and search for the point landmarks observed in the last frame. Corresponding planes, as well as parallel and perpendicular planes, are searched in the global map. Because there are not many planes in a scene, this simple search is efficient enough. With the matched points and planes, the current pose is optimized. We also search for more point correspondences in the local map, which contains several keyframes sharing point and plane landmarks with the current frame. The current pose is finally optimized with all points and planes in the local map.
With the tracked points and planes, the pose $T_{cw}$ can be computed by solving:
$$T_{cw} = \arg\min_{T_{cw}} \sum H_z(f_z) + \sum H_m(f_m) + \sum H_{sp}(f_{sp}) + \sum H_{so}(f_{so})$$ (13)
where $H_x$ is the Huber robust cost function and the $f$ are the factors described in Section 3.4.1, summed over all tracked features. We use the Levenberg–Marquardt method implemented in g2o [34] to solve this problem. The Jacobians of the components in Equation (13) are described in detail in Appendix A. To remove bad observations, we check the error value of every factor after optimization and delete those larger than a threshold.
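For reference, below is a minimal sketch of a Huber robust cost of the kind denoted H(.) in Equation (13): quadratic for small residuals and linear for large ones, so that bad observations do not dominate the optimization. In practice this role can be played by the robust kernels provided by g2o; the threshold delta is an assumed per-factor parameter.

```cpp
#include <cmath>

// Huber cost applied to a (whitened) residual magnitude r.
double huberCost(double r, double delta) {
  const double a = std::abs(r);
  return (a <= delta) ? 0.5 * r * r : delta * (a - 0.5 * delta);
}
```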

3.4.3. Keyframe Decision

Keyframes are used to construct the local and global maps; loop detection and relocalization are also performed on keyframes. We adopt the keyframe criteria of [7]. If a successfully tracked frame observes a new plane landmark or tracks less than 90% of the points of the last keyframe, it is labeled as a keyframe. If $N_f$ frames have passed since the last keyframe, a keyframe is also created. We set $N_f$ smaller than the frame rate of the RGB-D camera to ensure that at least one keyframe is inserted per second. Redundant keyframes are deleted later according to the number of landmarks they observe.
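A minimal sketch of this keyframe decision is shown below; the field and function names are illustrative and not taken from the authors' code.

```cpp
// Per-frame tracking statistics assumed to be available after pose estimation.
struct TrackingStats {
  bool trackedOk = false;           // the current frame was tracked successfully
  bool newPlaneObserved = false;    // the frame observes a plane landmark not seen before
  int pointsTrackedFromLastKF = 0;  // points of the last keyframe re-observed in this frame
  int pointsInLastKF = 1;           // points observed by the last keyframe
  int framesSinceLastKF = 0;
};

// N_f is chosen smaller than the camera frame rate, so at least one keyframe per second.
bool shouldInsertKeyframe(const TrackingStats& s, int N_f) {
  if (!s.trackedOk) return false;
  if (s.newPlaneObserved) return true;
  if (s.pointsTrackedFromLastKF < 0.9 * s.pointsInLastKF) return true;
  return s.framesSinceLastKF >= N_f;
}
```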

3.5. Map Management

Map management is the back-end part of the whole system. It constructs a consistent global map and the camera poses are further optimized.

3.5.1. Local Mapping

Every time the tracking part inserts a keyframe, new point landmarks are created by triangulation. A new plane landmark is also created if a new plane is observed or supposed. The local optimization optimizes the camera poses and landmarks on the local map. The local map contains a set of keyframes sharing landmarks with the currently processed keyframe, all of the landmarks observed by these keyframes and the parallel or perpendicular planes in the map. Other keyframes that also observe the landmarks in the local map are also included in the local optimization to provide sufficient constraints but remain fixed. The cost function is also created using all of the factors described in Section 3.4.1.
After local optimization, we delete redundant keyframes, i.e., those for which 90% of the observed points and all of the observed planes have been observed by other keyframes. Bad points and planes with large errors are also deleted.

3.5.2. Loop Closing and Global Optimization

We compute the bag-of-words representation of every keyframe to detect loops. The loop detection part is implemented using DBoW2 [35]. Once a loop is detected, we perform a pose graph optimization. All camera poses, points and planes in the global map are then refined in a global optimization, which uses the same factors as the local optimization but includes all vertices of the global map. The drift error is reduced after the global optimization.

4. Experiments

We evaluate our proposed SLAM system on two benchmarks, the TUM RGB-D dataset [32] and the ICL-NUIM dataset [36]:
  • The TUM RGB-D dataset is a well-known benchmark for evaluating vSLAM/VO systems. It contains a large set of image sequences recorded with an RGB-D camera. The RGB and depth images are captured at a 640 × 480 resolution at the video frame rate (30 Hz), and ground truth camera poses are provided. The dataset covers various indoor scenes, and we choose those having obvious plane landmarks to evaluate our SLAM system.
  • The ICL-NUIM dataset is a synthetic benchmark. The images are rendered from synthetically generated indoor environments, including living room and office scenes, which are suitable for our point-plane SLAM system. Like the TUM RGB-D dataset, it provides ground truth, which is convenient for evaluating the results.
For the implementation, our system augments the RGB-D variant of ORB-SLAM2 [7]. We rely on the underlying ORB-SLAM2 for point extraction and matching, and we also use its methods to maintain a local map and detect loops. The focus of our implementation is on plane extraction, supposed plane calculation, plane matching, and pose estimation using points and planes. We construct a global map consisting of points and planes, including supposed planes. All experiments run on a laptop with an Intel Core i7-7700HQ 2.80 GHz CPU and 16 GB RAM, without a GPU.
We compare our proposed SLAM system with other RGB-D SLAM systems. ORB-SLAM2 [7] is a state-of-the-art feature-point-based visual SLAM system with an RGB-D implementation. L-SLAM [25] is an RGB-D SLAM system using planes and MW constraints. Note that we test ORB-SLAM2 using the open-source code provided by the authors and take the results of L-SLAM directly from [25]. To demonstrate the benefit of supposed planes, we compare different versions of our proposed system: point-plane SLAM (PP), point-plane SLAM using structural constraints (PP+S), and point-plane SLAM additionally using supposed planes (PP+SS).

4.1. ICL-NUIM Dataset

The ICL-NUIM dataset contains several sequences from two kinds of scenes, office and living room, as shown in Figure 4. Four sequences are recorded in each scene. We run experiments on all of these sequences.
We use the root mean square error (RMSE) of the absolute trajectory error (ATE) to evaluate the performance of the different SLAM systems. The results are reported in Table 1, with the smallest error for every sequence marked in bold. The comparison of the RMSE is also shown in Figure 5.
We first add plane-based constraints only (PP). Because there are enough planes in these indoor sequences, these additional constraints already improve the accuracy of the camera poses. When only a few feature points can be tracked, as in sequence ‘living_room_1’, point-based SLAM (ORB-SLAM2) performs poorly, but the planes improve the pose estimation. The planes in these indoor scenes are usually parallel or perpendicular, so the structural constraints (PP+S) improve the accuracy further.
Besides, we also add supposed planes (PP+SS). Table 2 shows the average number of observed planes per frame. In some sequences, such as ‘living_room_0’ and ‘office_room_1’, the estimated trajectories are similar to those of PP+S because few supposed planes can be utilized: the plane features are generally extracted from the walls, or the planes of the furniture are too small to be extracted because the camera is far from them. For the other sequences, the supposed planes generated in the tracking process are shown in Figure 6. They are mostly generated from table edges in these scenes, and some of them may never be observed directly because of their small size. Supposed planes thus exploit more plane features in the scenes, increase the number of plane observations per frame and add more constraints. Therefore, our proposed method using supposed planes significantly reduces the estimation error, performing better than PP and PP+S.
Some comparisons of the estimated trajectories of our proposed methods are shown in detail in Figure 7. They are all close to the ground truth, but the trajectories of PP have larger drift errors. This also demonstrates that structural constraints help to reduce errors effectively, and supposed planes improve the estimation accuracy further.
L-SLAM utilizes MW constraints to obtain drift-free rotation estimates of the camera. Unlike our proposed method, it extracts the planes parallel to the Manhattan axes. The ‘office_room’ sequences suit this assumption, and L-SLAM performs best on them. However, our proposed method also works well and is applicable to more scenes, because many real scenes do not satisfy the MW assumption. We also calculate the average error weighted by the number of frames of each sequence, and the average error of our method is the smallest. Therefore, our proposed SLAM system improves considerably over the state-of-the-art point-based SLAM system and achieves more robust and accurate estimation in different scenes.
Some trajectories and reconstruction results of our proposed method are shown in Figure 8. Those large planes are extracted and associated accurately. The estimated trajectories are also close to the ground truth.

4.2. TUM RGB-D Dataset

We select several sequences containing enough plane features from the TUM RGB-D dataset to evaluate our proposed SLAM system. Because some sequences only have a few features, the original ORB-SLAM2 fails to initialize or track camera poses. Therefore, we modify some thresholds to make ORB-SLAM2 work normally, although the performance may be somewhat worse. We also calculate RMSE to compare the estimation results of different systems, as shown in Table 3. The comparison of the RMSE is also shown in Figure 9. Table 4 shows the average number of observed planes for every frame.
For ‘fr1/desk’ and ‘fr1/xyz’, abundant point landmarks ensure the good performance of ORB-SLAM2, and our proposed methods also work well; because only a few planes satisfy the structural constraints, our methods give similar results. For the other sequences, using planes reduces the drift error. Low-texture scenes such as ‘fr3/str_notex_far’ and ‘fr3/str_notex_near’ are a challenge for point-based SLAM. Besides, there is only one cabinet in the scene of ‘fr3/cabinet’, where ORB-SLAM2 drifts significantly and even fails to track poses over the entire sequence, while planes provide sufficient constraints. In every frame, the camera can observe at most three planes there, and when supposed planes are added, at least one supposed plane can be utilized per frame on average. The supposed planes also add constraints in the other sequences, as shown in Table 4.
As shown in Figure 10, the red planes are supposed planes generated in the tracking process. These supposed planes are associated with the corresponding real planes in the scene, which increases the accuracy of pose estimation. The comparison of the estimated trajectories for some sequences is shown in detail in Figure 11. Using planes ensures successful tracking over the entire sequence, structural constraints reduce the drift error, and the system adding more constraints from supposed planes achieves the best estimation results.
The trajectories and reconstruction results of our proposed method are also shown in Figure 12. It is clear that structural constraints and supposed planes improve the accuracy and robustness of the SLAM system.

4.3. Runtime Analysis

Table 5 shows the runtime analysis of our system on the TUM and ICL-NUIM datasets. All the testing code is implemented in C++. All experiments run on a laptop with an Intel Core i7-7700HQ 2.80 GHz CPU and 16 GB RAM, without a GPU, under Ubuntu 18.04.
The main difference of our system from other point-based SLAM systems is the addition of plane features, so we evaluate the runtime of these additional components. Segmenting planes in organized point clouds is very fast, taking about 6.8 ms per frame, and extracting edge lines and generating supposed planes takes about 7 ms. The runtime of the other components is almost the same as in point-based SLAM. Therefore, the additional runtime of our system is about 14 ms compared with point-based SLAM. Processing a frame takes about 38 ms in total, which is close to real time.

5. Conclusions

In this work, we propose a novel point-plane SLAM system for indoor environments. To reduce the drift error, we add structural constraints between parallel or perpendicular planes. Unlike other plane-based SLAM works, we exploit plane edges to add more reliable constraints: we calculate a supposed perpendicular plane from a plane and its edge line and treat it like any other plane observation. The camera poses are then estimated from all of these features. Our proposed algorithm is tested on public benchmarks and achieves more robust and accurate estimation results. Structural constraints add constraints between planes even when these planes cannot be observed in the same frame, and therefore reduce the drift error in indoor environments. Supposed planes increase the number of plane observations in one frame and add sufficient additional constraints to solve for accurate camera poses. Therefore, our proposed SLAM algorithm is well suited to indoor applications.
Currently, loop detection and relocalization are performed by using feature points. Future work will exploit planes for faster loop detection and relocalization. We will also consider adding constraints between points and planes to remove those bad landmarks.

Author Contributions

Conceptualization, X.Z. and W.W.; methodology, X.Z.; software, X.Z.; validation, X.Z., X.Q. and Z.L.; formal analysis, X.Z.; investigation, X.Z.; resources, W.W.; data curation, X.Z. and Z.L.; writing—original draft preparation, X.Z.; writing—review and editing, W.W. and X.Q.; visualization, Z.L.; supervision, W.W. and R.W.; project administration, X.Z.; funding acquisition, W.W. and R.W.

Funding

This work was supported by the Natural Science Foundation of China under grant number 91748101.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

As we have shown in the main paper, the problem of SLAM can be represented by a factor graph. To estimate the camera poses, we need to solve a nonlinear least-squares problem, such as Equation (13). All iterative methods for nonlinear optimization, whether Levenberg-Marquardt or Gauss-Newton, need Jacobians of the error functions to update the variables.
We use the Lie algebra $\phi \in \mathfrak{so}(3)$ to represent the orientation and $\xi \in \mathfrak{se}(3)$ to represent the pose during the optimization process. The rotation matrix $R \in SO(3)$ is obtained through the exponential map $R = \exp(\phi^\wedge)$, where $(\cdot)^\wedge$ is the operator producing an antisymmetric matrix. Similarly, the transformation matrix $T \in SE(3)$ is obtained through the exponential map $T = \exp(\xi^\wedge)$.
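As a small illustration of the notation, here is a minimal sketch of the hat operator used below; the exponential maps themselves would typically come from a library (e.g., Sophus), which is assumed rather than shown.

```cpp
#include <Eigen/Dense>

// (.)^ : maps phi in so(3) to the antisymmetric (skew-symmetric) matrix phi^,
// so that phi^ v = phi x v for any vector v.
Eigen::Matrix3d hat(const Eigen::Vector3d& phi) {
  Eigen::Matrix3d S;
  S <<      0.0, -phi.z(),  phi.y(),
        phi.z(),      0.0, -phi.x(),
       -phi.y(),  phi.x(),      0.0;
  return S;
}
```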
Using the perturbation scheme [37], the Jacobian of the reprojection error $e_z$ in Equation (8) is given by
$$\frac{\partial e_z}{\partial \delta\xi} = \frac{\partial e_z}{\partial P_c} \frac{\partial P_c}{\partial \delta\xi} = \frac{\partial e_z}{\partial P_c} \frac{\partial \exp(\xi_{cw}^\wedge) P_w}{\partial \delta\xi}$$
$$= -\begin{bmatrix} \frac{f_x}{z_c} & 0 & -\frac{f_x x_c}{z_c^2} \\ 0 & \frac{f_y}{z_c} & -\frac{f_y y_c}{z_c^2} \end{bmatrix} \begin{bmatrix} I & -p_c^\wedge \end{bmatrix}$$
$$= \begin{bmatrix} -\frac{f_x}{z_c} & 0 & \frac{f_x x_c}{z_c^2} & \frac{f_x x_c y_c}{z_c^2} & -f_x - \frac{f_x x_c^2}{z_c^2} & \frac{f_x y_c}{z_c} \\ 0 & -\frac{f_y}{z_c} & \frac{f_y y_c}{z_c^2} & f_y + \frac{f_y y_c^2}{z_c^2} & -\frac{f_y x_c y_c}{z_c^2} & -\frac{f_y x_c}{z_c} \end{bmatrix}$$
where $p_c = (x_c, y_c, z_c)^\top$ is the Euclidean point corresponding to $P_c$.
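For implementation purposes, here is a minimal sketch that evaluates the final 2x6 matrix above; it assumes the (translation, rotation) ordering of the pose perturbation used in the derivation, and the function name is illustrative.

```cpp
#include <Eigen/Dense>

// Jacobian of the reprojection error with respect to the pose perturbation,
// evaluated at the camera-frame point p_c = (x_c, y_c, z_c).
Eigen::Matrix<double, 2, 6> reprojectionJacobian(const Eigen::Vector3d& p_c,
                                                 double fx, double fy) {
  const double x = p_c.x(), y = p_c.y(), z = p_c.z(), z2 = z * z;
  Eigen::Matrix<double, 2, 6> J;
  J << -fx / z,     0.0, fx * x / z2, fx * x * y / z2,      -fx - fx * x * x / z2, fx * y / z,
           0.0, -fy / z, fy * y / z2, fy + fy * y * y / z2, -fy * x * y / z2,      -fy * x / z;
  return J;
}
```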
Similarly, the Jacobian of the error $e_m$ in Equation (10) is given by
$$\frac{\partial e_m}{\partial \delta\xi} = \frac{\partial e_m}{\partial \pi_c} \frac{\partial \pi_c}{\partial \delta\xi}$$
$$= \begin{bmatrix} -\frac{n_{cy}}{n_{cx}^2 + n_{cy}^2} & \frac{n_{cx}}{n_{cx}^2 + n_{cy}^2} & 0 & 0 \\ 0 & 0 & \frac{1}{\sqrt{1 - n_{cz}^2}} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0_{3 \times 3} & -n_c^\wedge \\ -n_c^\top & 0_{1 \times 3} \end{bmatrix}$$
$$= \begin{bmatrix} 0 & 0 & 0 & -\frac{n_{cx} n_{cz}}{n_{cx}^2 + n_{cy}^2} & -\frac{n_{cy} n_{cz}}{n_{cx}^2 + n_{cy}^2} & 1 \\ 0 & 0 & 0 & \frac{n_{cy}}{\sqrt{1 - n_{cz}^2}} & -\frac{n_{cx}}{\sqrt{1 - n_{cz}^2}} & 0 \\ -n_{cx} & -n_{cy} & -n_{cz} & 0 & 0 & 0 \end{bmatrix}$$
where $\pi_c = (n_{cx}, n_{cy}, n_{cz}, d_c)^\top$ is the plane represented in the camera coordinate system.
The structural constraints only constrain the orientation. The Jacobians of the errors $e_{sp}$ in Equation (11) and $e_{so}$ in Equation (12) are the same:
$$\frac{\partial e_s}{\partial \delta\phi} = \frac{\partial e_s}{\partial n_c} \frac{\partial n_c}{\partial \delta\phi} = \frac{\partial e_s}{\partial n_c} \frac{\partial \exp(\phi_{cw}^\wedge) n_w}{\partial \delta\phi}$$
$$= \begin{bmatrix} -\frac{n_{cy}}{n_{cx}^2 + n_{cy}^2} & \frac{n_{cx}}{n_{cx}^2 + n_{cy}^2} & 0 \\ 0 & 0 & \frac{1}{\sqrt{1 - n_{cz}^2}} \end{bmatrix} \left( -n_c^\wedge \right)$$
$$= \begin{bmatrix} -\frac{n_{cx} n_{cz}}{n_{cx}^2 + n_{cy}^2} & -\frac{n_{cy} n_{cz}}{n_{cx}^2 + n_{cy}^2} & 1 \\ \frac{n_{cy}}{\sqrt{1 - n_{cz}^2}} & -\frac{n_{cx}}{\sqrt{1 - n_{cz}^2}} & 0 \end{bmatrix}$$

References

1. Hess, W.; Kohler, D.; Rapp, H.; Andor, D. Real-time loop closure in 2D LIDAR SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1271–1278.
2. Li, J.; Zhong, R.; Hu, Q.; Ai, M. Feature-Based Laser Scan Matching and Its Application for Indoor Mapping. Sensors 2016, 16, 1265.
3. Klein, G.; Murray, D. Parallel Tracking and Mapping for Small AR Workspaces. In Proceedings of the IEEE and ACM International Symposium on Mixed and Augmented Reality, Nara, Japan, 13–16 November 2007; pp. 1–10.
4. Wang, R.; Wan, W.; Wang, Y.; Di, K. A New RGB-D SLAM Method with Moving Object Detection for Dynamic Indoor Scenes. Remote Sens. 2019, 11, 1143.
5. Meng, X.; Gao, W.; Hu, Z. Dense RGB-D SLAM with Multiple Cameras. Sensors 2018, 18, 2118.
6. Grisetti, G.; Kümmerle, R.; Stachniss, C.; Burgard, W. A tutorial on graph-based SLAM. IEEE Intell. Transp. Syst. Mag. 2011, 2, 31–43.
7. Mur-Artal, R.; Tardós, J.D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Robot. 2017, 33, 1255–1262.
8. Engel, J.; Schöps, T.; Cremers, D. LSD-SLAM: Large-scale direct monocular SLAM. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 834–849.
9. Engel, J.; Koltun, V.; Cremers, D. Direct Sparse Odometry. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 611–625.
10. Forster, C.; Pizzoli, M.; Scaramuzza, D. SVO: Fast semi-direct monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 15–22.
11. Newcombe, R.A.; Izadi, S.; Hilliges, O.; Molyneaux, D.; Kim, D.; Davison, A.J.; Kohi, P.; Shotton, J.; Hodges, S.; Fitzgibbon, A. KinectFusion: Real-time dense surface mapping and tracking. In Proceedings of the 2011 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), Basel, Switzerland, 26–29 October 2011; pp. 127–136.
12. Newcombe, R.A.; Lovegrove, S.J.; Davison, A.J. DTAM: Dense tracking and mapping in real-time. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2320–2327.
13. Weingarten, J.; Siegwart, R. EKF-based 3D SLAM for structured environment reconstruction. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Edmonton, AB, Canada, 2–6 August 2005; pp. 3834–3839.
14. Zureiki, A.; Devy, M. SLAM and data fusion from visual landmarks and 3D planes. IFAC Proc. Vol. 2008, 41, 14651–14656.
15. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics; The MIT Press: Cambridge, MA, USA, 2006; pp. 330–332.
16. Gostar, A.K.; Fu, C.; Chuah, W.; Hossain, M.I.; Tennakoon, R.; Bab-Hadiashar, A.; Hoseinnezhad, R. State Transition for Statistical SLAM Using Planar Features in 3D Point Clouds. Sensors 2019, 19, 1614.
17. Taguchi, Y.; Jian, Y.D.; Ramalingam, S.; Feng, C. Point-plane SLAM for hand-held 3D sensors. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 6–10 May 2013; pp. 5182–5189.
18. Ma, L.; Kerl, C.; Stückler, J.; Cremers, D. CPA-SLAM: Consistent plane-model alignment for direct RGB-D SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 1285–1291.
19. Kaess, M. Simultaneous localization and mapping with infinite planes. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Seattle, WA, USA, 25–30 May 2015.
20. Hsiao, M.; Westman, E.; Zhang, G.; Kaess, M. Keyframe-based dense planar SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017.
21. Ferrer, G. Eigen-Factors: Plane Estimation for Multi-Frame and Time-Continuous Point Cloud Alignment. Available online: http://sites.skoltech.ru/app/data/uploads/sites/50/2019/07/ferrer2019planes.pdf (accessed on 16 August 2019).
22. Yang, S.; Song, Y.; Kaess, M.; Scherer, S. Pop-up SLAM: Semantic monocular plane SLAM for low-texture environments. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016.
23. Coughlan, J.M.; Yuille, A.L. Manhattan World: Compass Direction from a Single Image by Bayesian Inference. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Kerkyra, Greece, 20–25 September 1999.
24. Zhou, Y.; Kneip, L.; Rodriguez, C.; Li, H. Divide and Conquer: Efficient Density-Based Tracking of 3D Sensors in Manhattan Worlds. In Proceedings of the Asian Conference on Computer Vision (ACCV), Taipei, Taiwan, 21–23 November 2016.
25. Kim, P.; Coltin, B.; Jin Kim, H. Linear RGB-D SLAM for planar environments. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 333–348.
26. Wang, L.; Wu, Z. RGB-D SLAM with Manhattan Frame Estimation Using Orientation Relevance. Sensors 2019, 19, 1050.
27. Guo, R.; Peng, K.; Fan, W.; Zhai, Y.; Liu, Y. RGB-D SLAM Using Point–Plane Constraints for Indoor Environments. Sensors 2019, 19, 2721.
28. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G.R. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2564–2571.
29. Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2003; pp. 65–68.
30. Trevor, A.J.B.; Gedikli, S.; Rusu, R.B.; Christensen, H.I. Efficient organized point cloud segmentation with connected components. In Proceedings of the 3rd Workshop on Semantic Perception Mapping and Exploration (SPME), Karlsruhe, Germany, 5 May 2013.
31. Fischler, M.A.; Bolles, R.C. Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Commun. ACM 1981, 24, 381–395.
32. Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; Cremers, D. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Portugal, 7–12 October 2012; pp. 573–580.
33. Dellaert, F.; Kaess, M. Factor Graphs for Robot Perception. Found. Trends Robot. 2017, 6, 1–139.
34. Grisetti, G.; Kümmerle, R.; Strasdat, H.; Konolige, K. g2o: A General Framework for (Hyper) Graph Optimization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; pp. 3607–3613.
35. Gálvez-López, D.; Tardós, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197.
36. Handa, A.; Whelan, T.; McDonald, J.; Davison, A.J. A benchmark for RGB-D visual odometry, 3D reconstruction and SLAM. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 1524–1531.
37. Barfoot, T. State Estimation for Robotics; Cambridge University Press: Cambridge, UK, 2017; pp. 238–245.
Figure 1. The overview of our proposed point-plane simultaneous localization and mapping (SLAM) using supposed planes from edges: the RGB-D camera provides RGB images and depth images as inputs of the whole SLAM system. The front-end tracks camera pose using matched points and planes. The back-end constructs and updates a consistent map consisting of keyframes, points and planes.
Figure 2. The process of generating supposed planes: (a) Cabinet. A frame from the sequence ‘freiburg3_cabinet’ of the TUM RGB-D dataset; there is only one cabinet in the image. (b) Planes extracted from the cabinet. Three planes of the cabinet are extracted; different colors denote different planes, and the red points are those belonging to valid plane edges. (c) Supposed planes. The green planes denote supposed planes generated from plane edges; other repeating supposed planes are removed.
Figure 3. Factor graphs for SLAM. Circles denote vertices and black squares denote factors. (a) A simple point-based SLAM. The blue circles denote the keyframe poses and the green circles are point landmarks. (b) A point-plane-based SLAM. Yellow circles denote the plane landmarks; sp denotes a supposed plane in the corresponding frame. (c) A point-plane-based SLAM with structural constraints. More planes are included in the local map to add parallel or perpendicular constraints.
Figure 4. Scenes contained in the ICL-NUIM dataset. (a) The living room contains a sofa, chairs, a table and some other common man-made objects. (b) The office room contains several tables, computers and pictures.
Figure 5. The comparison of the RMSE on the ICL-NUIM dataset.
Figure 6. Supposed planes generated in the tracking process. Red planes denote supposed planes.
Figure 7. Comparison of estimated trajectories. Different trajectories are plotted together.
Figure 8. Trajectories and reconstruction results. The first column is the point landmarks and the camera trajectories. The second column shows the plane landmarks. The third column shows the plane and point landmarks together. The fourth column is the comparison of the estimated trajectories and corresponding ground truth.
Figure 9. The comparison of the RMSE on TUM RGB-D dataset.
Figure 10. Supposed planes generated in the tracking process. Red planes denote supposed planes.
Figure 11. Comparison of estimated trajectories. Different trajectories are plotted together.
Figure 12. Trajectories and reconstruction results. The first column is the image showing the scene. The second column shows the point landmarks. The third column shows the plane and point landmarks together. The fourth column is the comparison of the estimated trajectories and corresponding ground truth.
Table 1. Evaluation results of translation absolute trajectory error (ATE) root mean squared error (RMSE) (unit: m) on ICL-NUIM dataset. PP, PP+S, PP+SS denote proposed point-plane SLAM, point-plane SLAM using structural constraints, point-plane SLAM using structural constraints and supposed planes, respectively. Bold numbers represent the best performances.
Sequence       | ORB-SLAM2 | L-SLAM   | PP       | PP+S     | PP+SS    | Frames
living_room_0  | 0.008578  | 0.012    | 0.008680 | 0.007982 | 0.008023 | 1509
living_room_1  | 0.200787  | 0.027    | 0.022113 | 0.013122 | 0.009793 | 966
living_room_2  | 0.031458  | 0.053    | 0.027399 | 0.024423 | 0.019249 | 881
living_room_3  | 0.016210  | 0.143    | 0.013832 | 0.013367 | 0.012473 | 1241
office_room_0  | 0.063775  | 0.02     | 0.038052 | 0.021606 | 0.019861 | 1508
office_room_1  | 0.084720  | 0.015    | 0.023280 | 0.022578 | 0.022546 | 966
office_room_2  | 0.030912  | 0.026    | 0.025042 | 0.024472 | 0.022009 | 881
office_room_3  | 0.034402  | 0.011    | 0.023617 | 0.019084 | 0.018483 | 1241
average        | 0.054680  | 0.038023 | 0.022517 | 0.017508 | 0.016106 |
Table 2. The number of plane landmarks and the average number of observed planes on the ICL-NUIM dataset. N_l denotes the number of plane landmarks, N_p the average number of observed planes per frame, and N_sp the average number of supposed planes per frame.
Sequence       | N_l | N_p     | N_sp
living_room_0  | 17  | 2.67197 | 0.082152
living_room_1  | 19  | 2.91511 | 0.362319
living_room_2  | 18  | 3.79115 | 0.589103
living_room_3  | 21  | 2.70991 | 0.321515
office_room_0  | 15  | 2.78515 | 0.043103
office_room_1  | 12  | 2.69151 | 0.073520
office_room_2  | 14  | 2.87855 | 0.178104
office_room_3  | 11  | 2.86301 | 0.118082
Table 3. Evaluation results of translation ATE RMSE (unit: m) on TUM RGB-D dataset. PP, PP+S, PP+SS denote proposed point-plane SLAM, point-plane SLAM using structural constraints, point-plane SLAM using structural constraints and supposed planes, respectively. Bold numbers represent the best performances.
Sequence           | ORB-SLAM2 | L-SLAM  | PP       | PP+S     | PP+SS    | Frames
fr1/desk           | 0.020163  | -       | 0.014597 | 0.014172 | 0.014341 | 573
fr1/xyz            | 0.009377  | -       | 0.009433 | 0.009530 | 0.009326 | 792
fr3/cabinet        | 0.075824  | 0.291   | 0.043939 | 0.028968 | 0.016801 | 1112
fr3/large_cabinet  | 0.045575  | 0.14    | 0.034616 | 0.032573 | 0.029702 | 984
fr3/str_notex_far  | 0.035837  | 0.141   | 0.028480 | 0.024204 | 0.022037 | 794
fr3/str_notex_near | 0.027648  | 0.066   | 0.017045 | 0.015473 | 0.012481 | 1054
fr3/str_tex_far    | 0.010693  | 0.212   | 0.010438 | 0.009843 | 0.009716 | 907
fr3/str_tex_near   | 0.012139  | 0.156   | 0.010039 | 0.010073 | 0.008481 | 1057
average            | 0.031386  | 0.16927 | 0.021929 | 0.018547 | 0.015391 |
Table 4. The number of plane landmarks and the average number of observed planes on the TUM RGB-D dataset. N_l denotes the number of plane landmarks, N_p the average number of observed planes per frame, and N_sp the average number of supposed planes per frame.
Sequence           | N_l | N_p     | N_sp
fr1/desk           | 7   | 1.26876 | 0
fr1/xyz            | 4   | 1.45707 | 0
fr3/cabinet        | 6   | 3.06745 | 1.281470
fr3/large_cabinet  | 6   | 2.04268 | 0.357724
fr3/str_notex_far  | 6   | 2.85264 | 0.618388
fr3/str_notex_near | 5   | 1.72486 | 0.299810
fr3/str_tex_far    | 6   | 2.94377 | 0.502756
fr3/str_tex_near   | 6   | 2.22422 | 0.284768
Table 5. Average runtime (unit: ms) of different components.
Main Components                  | Runtime (ms)
ORB Extraction                   | 11.54862
Plane Segmentation               | 6.803255
Supposed Plane Generation        | 7.060935
Matching and Tracking Landmarks  | 11.87778
Local Optimization               | 141.1678
Global Optimization (128 KFs)    | 344.3516
Frame Tracking                   | 38.07953
