Article

Contrast Maximization-Based Feature Tracking for Visual Odometry with an Event Camera

School of Electrical Engineering, Xi’an University of Technology, Xi’an 710048, China
* Author to whom correspondence should be addressed.
Submission received: 7 August 2022 / Revised: 2 October 2022 / Accepted: 10 October 2022 / Published: 14 October 2022

Abstract

As a new type of vision sensor, the dynamic and active-pixel vision sensor (DAVIS) outputs image intensity and asynchronous event streams from the same pixel array. We present a novel visual odometry algorithm based on the DAVIS in this paper. The Harris detector and the Canny detector are used to extract an initial tracking template from the image sequence. The spatio-temporal window is selected by determining the life cycle of the asynchronous event stream, and the alignment of timestamps is achieved by tracking the motion relationship between the template and the events within the window. A contrast maximization algorithm is adopted to estimate the optical flow. The IMU data are used to calibrate the position of the templates during the update process, and the updated templates are exploited to estimate the camera trajectory via the ICP algorithm. Finally, the proposed visual odometry algorithm is evaluated in several public object tracking scenarios and compared with several other algorithms. The results show that our visual odometry algorithm achieves more accurate and lower-latency trajectory tracking than the other methods.

1. Introduction

Visual odometry plays an important role in robot navigation [1], intelligent transportation [2], and intelligent inspection [3]. Although several decades of active research have brought a certain level of maturity, scenes with high dynamics, low texture, or harsh lighting conditions remain challenging [4]. Conventional visual odometry generally acquires images from frame-based cameras and estimates camera motion from several adjacent images [5]. However, for scenes with a high dynamic range or high-speed motion, frame-based cameras cannot obtain clear images, so it is difficult to extract feature points and estimate the camera pose in such cases. Furthermore, during the blind time between frames, a frame-based camera cannot capture the motion of features. In static scenes, frame-based cameras also capture redundant information, resulting in wasted storage and computational resources [6].
Bio-inspired event cameras, such as the dynamic vision sensor (DVS) [7], overcome the above-mentioned limitations of frame-based cameras [8]. As a new type of bio-inspired vision sensor, the event camera operates in a completely different mode from the conventional camera [9]: it outputs only pixel-level brightness changes. For each pixel, an event is triggered when the intensity change reaches a certain threshold; the event carries the pixel coordinates, a timestamp, and a polarity [10]. Compared with the low frame rate, large delay, and limited dynamic range of conventional cameras, an event camera offers fast response, high dynamic range, and low power consumption [11]. Although frame-based tracking algorithms have been vigorously developed, asynchronous event streams cannot be handled directly by current frame-based pose estimation methods [12]; event-based tracking algorithms are therefore required. The dynamic and active-pixel vision sensor (DAVIS) combines a frame-based camera and an asynchronous event-based camera in the same pixel array. Based on the assumption that events are mainly triggered by high-gradient edges in the image, the optimal motion parameters of the events can be computed by maximizing contrast in the Image of Warped Events (IWE) [13]. Various reward functions for evaluating contrast were proposed and analyzed in recent work [14]. Contrast maximization has also been successfully applied to various event-camera problems, such as optical flow estimation [15,16], motion segmentation [17,18], 3D reconstruction [19], and motion estimation [20]. Since event streams provide neither absolute brightness values nor output synchronized with image frames, a contrast maximization algorithm is utilized in this paper to resolve the data association between events and image frames, which further simplifies event-based feature tracking.
In this paper, we present a novel visual odometry algorithm based on template edges using a DAVIS camera. Our tracking approach leverages a combination of event-based and frame-based camera measurements. The initial tracking template is extracted by a feature detector in the image, and the spatio-temporal windows are selected by determining the life cycle of the asynchronous event stream. The calibrated tracking templates are then used to compute the event camera trajectory. The main contributions are summarized as follows:
  • Compared with traditional event-by-event tracking methods [20,21,22], a new tracking mechanism is presented to resolve data associations. A contrast maximization method is adopted to calculate the displacement parameters of the events, and the IMU data are used to calibrate the rotation parameters of the events, which greatly improves the speed and accuracy of processing the event stream.
  • Since the ICP algorithm depends heavily on scene depth, a robust Beta-Gaussian depth filter is presented to obtain a more accurate depth for the tracking template than depth estimation by triangulation alone [23,24].
  • We apply our method to evaluation experiments in several different scenarios on public event camera datasets. Compared with existing visual odometry algorithms [7,9], the proposed algorithm achieves better performance and a lower-latency camera trajectory.
The rest of this article is organized as follows. The related work on event-based camera visual odometry is presented in Section 2. The realization steps of fusion events and image frame tracking are described in Section 3. The effectiveness of the algorithm is verified on several DAVIS datasets in Section 4. Finally, conclusions are given in Section 5.

2. Related Work

Feature detection and tracking methods for frame-based cameras are well established. However, these methods cannot track during the blind time between two adjacent frames, and frame-based cameras still capture information across the entire pixel array, even in scenes without motion [17]. In contrast, an event-based camera only acquires information from the areas of the scene where the intensity has changed, and its high asynchronous response rate fills the blind time between adjacent frames. These advantages make event-based cameras well suited to applications such as driverless vehicles and motion tracking [22,25].
Event-based visual odometry can be divided into two main approaches: one accumulates event information and applies traditional frame-based methods for tracking, while the other tracks directly on the asynchronous event stream. Gallego et al. [26] presented 6-DoF pose tracking of an event camera against an existing photometric depth map (intensity plus depth information). Stoffregen et al. introduced event accumulation frames with edge contrast maximization for motion segmentation [27]. Alzugaray and Chli [28] proposed a purely event-based corner detector and a new corner tracker, showing that corners can be detected and tracked directly in the event stream. Although a great deal of research has been devoted to event-based feature detection, comparatively little work has considered event-based tracking. Some approaches for localization and mapping with event cameras share similarities with our method. Gallego et al. [29] detected independently moving objects by tracking corners detected in event images integrated over short time windows. Mueggler et al. [30] proposed a DVS-based ego-motion estimation method that uses a continuous-time framework to directly integrate the information transmitted by the camera; the pose trajectory of the DVS is estimated from the observed events. Kim et al. [31] used three decoupled probabilistic filters to estimate 6-DoF camera motion, log intensity gradient, and inverse depth in real time.
Recently, Mueggler et al. [32] extended a frame-and-event tracker for DAVIS cameras and integrated events for high-speed object tracking. In contrast to the above approaches, we consider the connection between the event stream and the images in our modeling process and provide higher-rate trajectory tracking.

3. Main Methods

This paper proposes a visual odometry algorithm based on the DAVIS. The main methods are divided into two parts: feature detection and feature tracking. As shown in Figure 1, the entire process is divided into six steps. We first detect features in the image sequence. The contrast maximization algorithm is then utilized to match the event stream in the corresponding spatio-temporal windows and to calculate the optical flow of the events. The estimated motion parameters and the IMU measurements are exploited to calibrate the tracking template. We then estimate the depth values of the tracking template using a depth filter. Finally, the 6-DoF event camera pose is estimated by the ICP algorithm.

3.1. Feature Detection

Since events are triggered more frequently at edge regions of the scene than in low-texture regions, we design features around edges for tracking. As shown in Figure 2, the method first extracts feature points and an edge map from the image frame with the Harris detector [9] and the Canny detector [20] in the feature detection stage. It then selects the edges in a specified area around each feature point as the template edge of that feature point. All regions are squares of the same size, which is a tunable parameter. Our method does not require frames at a constant rate, since they are only used to initialize features; keyframes can be added to replace lost features.
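A minimal sketch of this feature-detection stage is given below, assuming an 8-bit grayscale DAVIS frame. The function name, patch size, and detector thresholds are illustrative choices, not values taken from the paper.

```python
import cv2
import numpy as np

def detect_templates(frame, max_corners=50, patch_half=12):
    # Harris corners give the feature points to be tracked.
    corners = cv2.goodFeaturesToTrack(frame, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=10,
                                      useHarrisDetector=True, k=0.04)
    # Canny edges give the candidate template points.
    edges = cv2.Canny(frame, 100, 200)
    templates = []
    for c in corners.reshape(-1, 2):
        x, y = int(c[0]), int(c[1])
        # Keep edge pixels inside a square patch around each feature point.
        x0, y0 = max(x - patch_half, 0), max(y - patch_half, 0)
        patch = edges[y0:y + patch_half + 1, x0:x + patch_half + 1]
        ys, xs = np.nonzero(patch)
        pts = np.stack([xs + x0, ys + y0], axis=1)
        templates.append({"feature": (x, y), "edge_points": pts})
    return templates
```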

3.2. Feature Tracking

Assuming that feature point $x$ is detected in the image frame at time $t_0$, the motion of feature point $x$ can be described as follows:
$$x(t) = x(t_0) + \int_{t_0}^{t} \dot{x}(s)\, ds$$
where $\dot{x}(s)$ denotes the derivative of the position of feature point $x$ at time $s$.
A set of events is then selected from the event stream: starting from the initial time $t_0$ of feature point $x$, a spatio-temporal window is chosen, and $W$ denotes the set of events in that window:
$$W = \{ e_i \mid t_0 < t_{e_i} < t_1 \}_{i=1}^{n}$$
Here, $e_i$ denotes the $i$-th event in the spatio-temporal window, $t_{e_i}$ is the time at which event $e_i$ occurs, and $n$ is the number of events. Since $[t_0, t_1]$ is the first sub-time interval, the value of $t_1$ is determined by the chosen size of the spatio-temporal window; for example, if $W$ contains 10,000 events, then $n = 10{,}000$.
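The sketch below shows one way to build such a window, assuming the event stream is given as timestamp-sorted arrays (as in the IJRR dataset text files); the function name and default window size are illustrative.

```python
import numpy as np

def select_window(xs, ys, ts, ps, t0, n=10000):
    # First n events whose timestamps fall after t0.
    start = np.searchsorted(ts, t0, side="right")
    end = min(start + n, len(ts))
    t1 = ts[end - 1]                      # end of the first sub-time interval
    W = np.stack([xs[start:end], ys[start:end],
                  ts[start:end], ps[start:end]], axis=1)
    return W, t1
```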

3.2.1. Choice of Spatio-Temporal Windows

To make the feature point tracking asynchronous, the size of each sub-time interval is determined online by the method itself. The faster the scene moves, the smaller the sub-time interval and the higher the update frequency of the feature points.
The calculation proceeds as follows. After the optical flow of all feature points in the current iteration is obtained, the size of the next sub-time interval is calculated from this optical flow. Let $x_i$ denote the $i$-th feature point, $i = 1, \ldots, m$, where $m$ is the number of feature points, and let $\theta_i^n$ be the optical flow of feature point $x_i$ in the $n$-th sub-time interval $[t_{n-1}, t_n]$. Given the optical flow of all feature points $\{\theta_i^n\}_{i=1}^{m}$ in the sub-time interval $[t_{n-1}, t_n]$, the next sub-time interval $[t_n, t_{n+1}]$ can be calculated as:
$$\theta_{average}^{n} = \frac{1}{m} \sum_{i=1}^{m} \theta_i^n, \qquad t_{n+1} = t_n + \frac{3}{\theta_{average}^{n}}$$
where the constant 3 is in pixels and $\theta_{average}^{n}$ is the average optical flow of all feature points in the sub-time interval $[t_{n-1}, t_n]$. In other words, the time the feature points needed to move 3 pixels on average in the previous interval is used as the estimated length of the current interval $[t_n, t_{n+1}]$.
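A short sketch of this adaptive interval update follows, assuming per-feature optical flow is available as 2D vectors in pixels per second; the magnitude is averaged before applying the 3-pixel rule. Names are illustrative.

```python
import numpy as np

def next_interval_end(t_n, flows):
    # flows: (m, 2) array of per-feature optical flow estimated in [t_{n-1}, t_n]
    theta_avg = np.mean(np.linalg.norm(flows, axis=1))   # average flow magnitude (px/s)
    return t_n + 3.0 / theta_avg                         # time to travel 3 px on average
```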

3.2.2. Maximizing IWE Contrast

After the spatio-temporal window corresponding to the feature point is determined, we use the contrast maximization algorithm to match the event set $W$ around feature point $x$ with the template point set. It is assumed that all template points have the same optical flow as feature point $x$ (the optical flow $\theta$ of all pixels in the region is identical) and that the optical flow of feature point $x$ is constant within the sub-time interval $[t_0, t_1]$; let this optical flow be $v$. For each event $e_k$ in $W$, as shown in Figure 3, the Image of Warped Events (IWE) is built by warping the event to its position $X_k$ at time $t_0$:
$$X_k = \begin{pmatrix} x_k' \\ y_k' \end{pmatrix} = \begin{pmatrix} x_k \\ y_k \end{pmatrix} + \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} \left( t_0 - t_{ref} \right)$$
The weighted IWE is defined as:
$$I_j(x) = \sum_{k=1}^{N_e} P_k^j \, \delta \left( x - X_k^j \right)$$
where $X_k^j$ is the position of the $k$-th event after it is warped along the $j$-th optical flow $\theta$, $N_e$ is the number of events, $\delta$ is the Dirac delta function, $P_k^j$ is the probability that the $k$-th event belongs to the $j$-th optical flow, and $I_j(x)$ is the IWE corresponding to the $j$-th optical flow.
The events are aligned by maximizing the image contrast, which is measured by a sharpness/dispersion metric such as the variance:
$$\mathrm{Var}(I_j) = \int_{\Omega} \left( I_j(x) - \mu_j \right)^2 dx$$
where $\Omega$ is the image plane and $\mu_j$ is the mean of the $j$-th Image of Warped Events. The optical flow is then updated by gradient ascent:
$$\theta \leftarrow \theta + \mu \, \nabla_{\theta} \sum_{j=1}^{N_l} \mathrm{Var}(I_j)$$
where $\mu$ is the step size and $N_l$ is the number of clusters.
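The sketch below illustrates contrast maximization for a single flow hypothesis: events in the window are warped to $t_0$ along a candidate flow, accumulated into an IWE over the template patch, and scored by variance. A coarse grid search stands in for the gradient ascent above; the patch size, search range, and function names are illustrative assumptions.

```python
import numpy as np

def iwe_variance(W, flow, t0, patch_origin, patch_size=25):
    xs, ys, ts = W[:, 0], W[:, 1], W[:, 2]
    # Warp each event back to the reference time t0 along the candidate flow.
    wx = xs + flow[0] * (t0 - ts) - patch_origin[0]
    wy = ys + flow[1] * (t0 - ts) - patch_origin[1]
    H, _, _ = np.histogram2d(wy, wx, bins=patch_size,
                             range=[[0, patch_size], [0, patch_size]])
    return np.var(H)

def estimate_flow(W, t0, patch_origin, v_max=200.0, steps=21):
    candidates = np.linspace(-v_max, v_max, steps)
    best, best_var = np.zeros(2), -np.inf
    for vx in candidates:
        for vy in candidates:
            var = iwe_variance(W, (vx, vy), t0, patch_origin)
            if var > best_var:
                best, best_var = np.array([vx, vy]), var
    return best
```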

3.2.3. Template Edge Update

After the optical flow of the feature points is obtained, the feature points and template edges are updated. However, when the camera rotates, the template edges and template points move at significantly different speeds: points farther from the center of rotation move faster. Therefore, if the template points were updated using optical flow alone, the positions of the feature points would quickly deviate from the true values because of this rotation. The template edge update is therefore divided into two steps: first, the position of the template edge is updated using the optical flow, and then it is corrected using the IMU data.
Assuming that the optical flow of the feature point is constant within the sub-time interval $[t_0, t_1]$, the optical flow $\theta$ is used to update the position of feature point $x$ and the corresponding positions of the template edge points $x_j$:
$$x(t_j) = x(t_0) + \theta \cdot (t_j - t_0)$$
where $x(t_0)$ is the position of the initially extracted feature point, $x(t_j)$ is the updated position, and $t_j$ is the time of the tracking template within the sub-time interval $[t_0, t_1]$.
To eliminate the influence of rotation on the template edge update, we introduce IMU data to correct the template edge positions, so that the corrected positions are closer to the true values. The position of each template point relative to the feature point is computed at time $t_j$ and then corrected using the IMU data. For template point $x_j$, the relative position is defined as:
$$x_j^{relative} = x_j(t) - x(t)$$
Since the IMU data are expressed in the IMU coordinate system while the tracking template comes from the camera coordinate system, the data must be transformed. The IMU measurements contain accelerometer and gyroscope readings for each axis, but what we need during the update is the rotation matrix between templates. We therefore convert the linear acceleration and angular velocity into Euler angles and then convert the Euler angles into a rotation matrix. At rest, the accelerometer gives the roll angle $\alpha_{acc}$ and the pitch angle $\beta_{acc}$. Integrating the gyroscope's angular velocity over the time interval gives three angles: the roll angle $\alpha_{gyro}$, the pitch angle $\beta_{gyro}$, and the yaw angle $\gamma_{gyro}$. The Euler angle fusion can be expressed as:
$$\alpha = \alpha_{gyro} + \left( \alpha_{acc} - \alpha_{gyro} \right) k, \qquad \beta = \beta_{gyro} + \left( \beta_{acc} - \beta_{gyro} \right) k, \qquad \gamma = \gamma_{gyro}$$
where $\alpha$, $\beta$, and $\gamma$ are the angles obtained after complementary fusion of the accelerometer and gyroscope measurements, and $k$ is a proportional coefficient that is tuned to the actual situation (e.g., $k = 0.3$). The conversion from Euler angles to a rotation matrix is:
$$R_{wi} = \begin{pmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{pmatrix}$$
where $R_{wi}$ is the rotation matrix in the IMU coordinate system.
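A small sketch of this IMU step follows: a complementary filter blends the accelerometer and integrated-gyroscope angles (using $k = 0.3$ as suggested in the text), and the fused Euler angles are converted to $R_{wi}$ with the Z-Y-X product above. Function names are illustrative.

```python
import numpy as np

def fuse_euler(acc_roll, acc_pitch, gyro_roll, gyro_pitch, gyro_yaw, k=0.3):
    alpha = gyro_roll + (acc_roll - gyro_roll) * k
    beta = gyro_pitch + (acc_pitch - gyro_pitch) * k
    gamma = gyro_yaw            # yaw cannot be corrected by the accelerometer
    return alpha, beta, gamma

def euler_to_rotation(alpha, beta, gamma):
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    return Rz @ Ry @ Rx         # R_wi, Z-Y-X convention as in the text
```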
We then solve for the transformation of the rotation matrix between the camera coordinate system and the IMU coordinate system:
$$R_{wc} = R_{wi} R_{ic}$$
where $R_{wc}$ is the rotation matrix in the camera coordinate system and $R_{ic}$ is the transformation matrix between the IMU and camera coordinate systems obtained from camera calibration. The IJRR dataset takes the camera's first frame as the world coordinate system. Assuming that the pose transformation of the camera is linear, the rotation matrix of the tracking template can be obtained by linear interpolation on the time scale:
$$R_j = \mathrm{inv} \left( R_{wc}(t_0) \right) R_{wc}(t_1) \, \frac{t_j - t_1}{t_0 - t_1}$$
where $R_{wc}(t_0)$ is the rotation matrix in the camera coordinate system at time $t_0$; $R_{wc}(t_1)$ is the rotation matrix in the camera coordinate system at time $t_1$, computed from the rotation matrix $R_{wi}(t_1)$ in the IMU coordinate system at time $t_1$ and the IMU-to-camera transformation $R_{ic}$; and $R_j$ is the interpolated rotation matrix between $t_0$ and $t_j$.
Let $X_j$ and $X$ denote the 3D coordinates corresponding to the template point $x_j$ and the feature point $x$ in the camera coordinate system at time $t_0$.
At time $t_0$:
$$s_{x_j}^{0} \, x_j(t_0) = K X_j, \qquad s_x^{0} \, x(t_0) = K X$$
At time $t_j$:
$$s_{x_j}^{1} \, x_j(t_j) = K \left( R_j X_j + t \right), \qquad s_x^{1} \, x(t_j) = K \left( R_j X + t \right)$$
where $t$ is the translation vector between $t_0$ and $t_j$, $R_j$ is the interpolated rotation matrix between $t_0$ and $t_j$, and $K$ is the intrinsic parameter matrix of the camera.
Substituting the above formulas into (9) and normalizing the 3D coordinates, we obtain:
$$x_j^{relative}(t_j) = \mathrm{Nor}\!\left( K R_j K^{-1} x_j(t_0) \right) - \mathrm{Nor}\!\left( K R_j K^{-1} x(t_0) \right)$$
where $\mathrm{Nor}(\cdot)$ denotes normalization of the homogeneous coordinates. The position of $x_j$ at time $t_j$ is then obtained by adding the updated feature point position $x(t_j)$ and the relative position $x_j^{relative}(t_j)$:
$$x_j(t_j) = x_j^{relative}(t_j) + x(t_j)$$
Finally, all template points are updated through this formula.
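A sketch of the whole template-edge update is given below: the feature point is advanced by the estimated flow, the relative positions are corrected by the rotation-induced image warp $K R_j K^{-1}$ with homogeneous normalization, and the template points are re-attached to the updated feature position. Function and variable names are illustrative.

```python
import numpy as np

def normalize(p_h):
    # Homogeneous normalization: divide by the third coordinate.
    return p_h[:2] / p_h[2]

def update_template(feature_t0, template_t0, flow, t0, tj, K, R_j):
    # 1) Advance the feature point with the constant-flow model.
    feature_tj = feature_t0 + flow * (tj - t0)
    # 2) Rotation-induced warp on the image plane.
    H = K @ R_j @ np.linalg.inv(K)
    warped_feature = normalize(H @ np.append(feature_t0, 1.0))
    updated = []
    for xj in template_t0:
        rel = normalize(H @ np.append(xj, 1.0)) - warped_feature  # relative position at t_j
        updated.append(rel + feature_tj)                          # re-attach to the feature
    return feature_tj, np.array(updated)
```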

3.2.4. Depth Estimation

After the corresponding positions of the edge map in each frame are obtained, the 3D coordinates of the edge points can be recovered from the pixel correspondences between different edge maps. Triangulation is used: the coordinates of a 3D point are recovered from the pixel positions at which it is observed from different viewing angles. Since the same 3D point can be observed in multiple frames, a 3D estimate can be computed from any two of them. Our strategy is to fuse the successive observations of the same 3D point using a Gaussian-uniform mixture model, assuming that inlier depth measurements follow a Gaussian distribution around the true depth.
First, triangulation is used to recover the depth of the pixels in the edge map. Assume that pixel $p_0$ in one edge map and pixel $p_1$ in another edge map are a pair of matching points corresponding to the same 3D point $P$ in space. Then the following holds:
$$s_0 K^{-1} p_0 = s_1 R K^{-1} p_1 + t$$
where $R$ and $t$ are the rotation matrix and translation vector of the event camera between the two edge maps, and $s_0$ and $s_1$ are the depths of the 3D point $P$ in the event camera coordinate system at times $t_i$ and $t_j$, respectively. Equation (18) can be further simplified to obtain:
$$s_1 = \frac{\left( K^{-1} p_0 \right)^{\wedge} t}{\left( K^{-1} p_0 \right)^{\wedge} R K^{-1} p_1}$$
A depth value $y_i$ can then be computed from the corresponding points in the two images. The distribution of $y$ is jointly modeled by a Gaussian distribution and a uniform distribution:
$$p(y_i \mid \hat{y}, \pi) = \pi \, N(y_i \mid \hat{y}, \tau_i^2) + (1 - \pi) \, U(y_i \mid y_{min}, y_{max})$$
where $N(y_i \mid \hat{y}, \tau_i^2)$ is a Gaussian distribution centered on the true value $\hat{y}$ with variance $\tau_i^2$, and $\pi$ is the probability of a correct depth measurement; the closer the measurement is to the true value, the closer $\pi$ is to 1. As shown in Figure 4, $y_{min}$ and $y_{max}$ are the lower and upper limits of the uniform distribution.
Equation (20) can be approximated by the product of a Beta distribution and a Gaussian distribution. Step 4 of Algorithm 1 summarizes the depth estimation procedure of our algorithm.
$$q(y, \pi \mid a_i, b_i, \mu_i, \sigma_i) = \mathrm{Beta}(\pi \mid a_i, b_i) \, N(y \mid \mu_i, \sigma_i^2)$$
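The sketch below shows one Beta-Gaussian depth-filter update in this parameterization (a Vogiatzis-style seed update, as used in Step 4 of Algorithm 1): the seed $(a, b, \mu, \sigma^2)$ is refined with a new triangulated depth $y$ and its measurement variance $\tau^2$. The constants follow the standard update and are our reading of the listing, not a verbatim copy; `z_range` denotes the width $y_{max} - y_{min}$ of the uniform outlier component.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def update_seed(a, b, mu, sigma2, y, tau2, z_range):
    s2 = 1.0 / (1.0 / sigma2 + 1.0 / tau2)            # fused variance
    m = s2 * (mu / sigma2 + y / tau2)                 # fused mean
    C1 = a / (a + b) * gaussian_pdf(y, mu, sigma2 + tau2)
    C2 = b / (a + b) * (1.0 / z_range)
    C1, C2 = C1 / (C1 + C2), C2 / (C1 + C2)           # normalized mixture weights
    f = C1 * (a + 1) / (a + b + 1) + C2 * a / (a + b + 1)
    e = (C1 * (a + 1) * (a + 2) / ((a + b + 1) * (a + b + 2))
         + C2 * a * (a + 1) / ((a + b + 1) * (a + b + 2)))
    mu_new = C1 * m + C2 * mu
    sigma2_new = C1 * (s2 + m * m) + C2 * (sigma2 + mu * mu) - mu_new * mu_new
    a_new = (e - f) / (f - e / f)
    b_new = a_new * (1.0 - f) / f
    return a_new, b_new, mu_new, sigma2_new
```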

3.2.5. Pose Estimation

According to Algorithm 1, the corresponding 3D point sets $P = \{p_1, p_2, \ldots, p_n\}$ and $Q = \{q_1, q_2, \ldots, q_n\}$ are obtained through Section 3.2.3 and Section 3.2.4, where $n$ is the number of corresponding point pairs. The optimal coordinate transformation, i.e., the rotation matrix $R$ and the translation vector $t$, is then computed by the ICP algorithm; the problem is described as follows:
$$(R, t) = \arg\min_{R \in SO(3),\, t \in \mathbb{R}^3} \sum_{i=1}^{n} w_i \left\| \left( R p_i + t \right) - q_i \right\|^2$$
where $w_i$ is the weight of each point, and $R$ and $t$ are the required rotation matrix and translation vector. To reduce the reprojection error during tracking, the Tukey weight function is used:
$$w_i = \begin{cases} \left( 1 - \dfrac{x^2}{b^2} \right)^2 & |x| \le b \\ 0 & \text{otherwise} \end{cases}$$
where $x = \lVert p_w - p_w' \rVert$ is the reprojection residual and $b = 5$ pixels. We also find that multiplying $w_i$ by the inlier probability $\pi$ from the depth filter significantly improves the estimates: template points that are tracked well receive the highest weight, while template points whose matching does not converge are usually discarded because their errors are too large.
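A sketch of one weighted alignment step of this objective is given below: given matched 3D point sets $P$, $Q$ and per-point weights (e.g., the Tukey weight times the depth-filter probability $\pi$), the closed-form weighted Kabsch solution yields $R$ and $t$. A full ICP would iterate this with re-matching; the residual threshold $b = 5$ px follows the text, and the function names are illustrative.

```python
import numpy as np

def tukey_weight(residuals, b=5.0):
    w = (1.0 - (residuals / b) ** 2) ** 2
    w[np.abs(residuals) > b] = 0.0        # zero weight outside the cutoff
    return w

def weighted_kabsch(P, Q, w):
    # Minimize sum_i w_i || R p_i + t - q_i ||^2 in closed form.
    w = w / np.sum(w)
    p_bar = np.sum(P * w[:, None], axis=0)
    q_bar = np.sum(Q * w[:, None], axis=0)
    H = (P - p_bar).T @ np.diag(w) @ (Q - q_bar)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = q_bar - R @ p_bar
    return R, t
```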
Algorithm 1: Pseudo-code of pose estimation.
Step 1: Extract feature points and edge points
1: Extract feature points with the Harris detector
2: Extract edge points with the Canny detector
3: Pick the edge points around each feature point as its template edge
Step 2: Compute the optical flow and the spatio-temporal windows of the events
4: Input: template edges $P$ and event point set $W$
  Output: optical flow $\theta$
  Initialize the optical flow $\theta = 0$
  for $i = 1:m$ do
    for $j = 1:n$ do
      $I(x) = \sum_{j=1}^{n} \delta(X_i - X_j)$;  $\mathrm{Var}(I) = \int_{\Omega} (I(x) - \mu)^2 dx$;  $\theta \leftarrow \theta + \mu \nabla_{\theta} \sum_{k=1}^{N_l} \mathrm{Var}(I_k)$
    end for
  end for
5: Update the spatio-temporal window: $t_{n+1} = t_n + 3 / \theta_{average}^{n}$
Step 3: Update the template edges using the optical flow and the IMU data
6: Update the position of the feature points: $x_i(t_1) = x_i(t_0) + v \cdot (t_1 - t_0)$
7: Update the relative position: $x_j^{relative}(t_1) = \mathrm{Nor}(K R_j K^{-1} x_j(t_0)) - \mathrm{Nor}(K R_j K^{-1} x_i(t_0))$
8: Update the position of the edge points: $x_j(t_1) = x_j^{relative}(t_1) + x(t_1)$
Step 4: Compute the depth value of the template edges using a depth filter
9: Triangulated depth: $y_i$
10: Depth filter: $q(y, \pi \mid a_i, b_i, \mu_i, \sigma_i) = \mathrm{Beta}(\pi \mid a_i, b_i) \, N(y \mid \mu_i, \sigma_i^2)$
  Input: triangulated depth $y_i$
  Output: depth value $y$
  for $i = 1:n$ do
    $1/s_i^2 = 1/\sigma_i^2 + 1/\tau_i^2$;  $m_i = s_i^2 (\mu_i/\sigma_i^2 + y_i/\tau_i^2)$
    $f_i = C_1 \frac{a_i + 1}{a_i + b_i + 1} + C_2 \frac{a_i}{a_i + b_i + 1}$
    $e_i = C_1 \frac{(a_i + 1)(a_i + 2)}{(a_i + b_i + 1)(a_i + b_i + 2)} + C_2 \frac{a_i (a_i + 1)}{(a_i + b_i + 1)(a_i + b_i + 2)}$
    $\mu_i \leftarrow C_1 m_i + C_2 \mu_i$;  $\sigma_i^2 \leftarrow C_1 (s_i^2 + m_i^2) + C_2 (\sigma_i^2 + \mu_i^2)$
    $a_i \leftarrow \frac{e_i - f_i}{f_i - e_i / f_i}$;  $b_i \leftarrow \frac{1 - f_i}{f_i} a_i$
  end for
  ($C_1$ and $C_2$ are the normalized mixture weights of the Gaussian and uniform components.)
Step 5: Pose estimation using the ICP algorithm
11: ICP algorithm: $(R, t) = \arg\min_{R \in SO(3),\, t \in \mathbb{R}^3} \sum_{i=1}^{n} w_i \| (R p_i + t) - q_i \|^2$
Output: 6-DoF event camera pose (translation in $\mathbb{R}^3$, rotation in $SO(3)$).

4. Experiments

To demonstrate the performance of the proposed algorithm, several widely used event camera datasets were used to estimate camera trajectories. These datasets were recorded with the DAVIS240C from iniLabs and contain events, images, and IMU measurements. The algorithm was evaluated on several challenging sequences from the IJRR event camera datasets [33], covering indoor and outdoor scenes with different lighting conditions. Experiments 1 and 2 were conducted indoors; compared with Experiment 1, Experiment 2 has richer scene textures. In Experiment 3, the camera trajectory was evaluated outdoors. In each experiment, the camera trajectories obtained by our method were compared with frame-tracking-based (ORB) and event-tracking-based (EVO) visual odometry, and we also compared our method with and without IMU data; among the methods tested, only ours uses IMU data to calibrate rotation. The absolute pose error was computed between the ground truth and the poses estimated by each method.
From the IJRR datasets [33], we selected representative scenes for trajectory estimation and tested the different visual odometry methods on three sequences: shapes_6dof, boxes_6dof, and outdoors_6dof. Figure 5, Figure 6 and Figure 7 show the test results of the different algorithms on the three sequences, i.e., the trajectories produced by Our_method, ORB, and EVO. Table 1 reports the position error, orientation error, and absolute pose error of the different methods with respect to the ground truth in the different scenarios. The position error is the Euclidean distance between the estimated and true camera positions. The rotation error is the angular error between the estimated and true rotation matrices. The absolute pose error compares the estimated trajectory with the reference trajectory and computes statistics over the entire trajectory, which makes it suitable for assessing global trajectory consistency. Furthermore, to show the contribution of the IMU to the calibration of the estimate, we also report the results of our method on the shapes_6dof, boxes_6dof, and outdoors_6dof sequences without fusing the IMU data.
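For illustration, the sketch below computes the three error measures from time-aligned trajectories, assuming positions are given as (N, 3) arrays and orientations as (N, 3, 3) rotation matrices. It mirrors the definitions in the text rather than any specific evaluation toolbox.

```python
import numpy as np

def position_error(p_est, p_gt):
    # Mean Euclidean distance between estimated and true positions.
    return np.mean(np.linalg.norm(p_est - p_gt, axis=1))

def rotation_error_deg(R_est, R_gt):
    # Mean angular error between estimated and true rotation matrices.
    errs = []
    for Re, Rg in zip(R_est, R_gt):
        dR = Re.T @ Rg
        angle = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
        errs.append(np.degrees(angle))
    return np.mean(errs)

def absolute_pose_error(p_est, p_gt):
    # RMSE of the translational differences over the whole trajectory.
    res = np.linalg.norm(p_est - p_gt, axis=1)
    return np.sqrt(np.mean(res ** 2))
```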
In Experiment 1 (Figure 5), we tracked simple shapes using the different visual odometry methods. As the camera moves faster, the captured images become increasingly blurred; the ORB method then cannot extract enough feature points, which leads to tracking failure, so we chose a tracking duration of only 35 s. The position errors were 15.62 cm, 8.45 cm, 3.91 cm, and 4.96 cm for Our_method, ORB, Our_method (+IMU), and EVO, respectively, and the corresponding rotation errors were 2.46 deg, 1.83 deg, 0.64 deg, and 0.91 deg. In Experiment 2 (Figure 6), the natural textures of the box scene are richer and generate more event information; since the EVO algorithm tracks the camera trajectory from pure event information, the larger event stream makes its tracking time longer. The position errors were 22.55 cm, 9.56 cm, 4.36 cm, and 5.86 cm for Our_method, ORB, Our_method (+IMU), and EVO, respectively, and the rotation errors were 3.54 deg, 1.69 deg, 0.54 deg, and 0.51 deg. In Experiment 3 (Figure 7), we evaluated the methods on an outdoor dataset with a mean scene depth of 3 m. The position errors were 17.28 cm, 7.35 cm, 3.25 cm, and 2.65 cm for Our_method, ORB, Our_method (+IMU), and EVO, respectively, and the rotation errors were 3.47 deg, 1.60 deg, 0.45 deg, and 0.73 deg. Overall, our algorithm tracked slightly more accurately than EVO and far more accurately than ORB, while requiring much less tracking time than EVO in complex textured scenes. Our algorithm provides low-latency pose updates and preserves the nature of event-based data. These results demonstrate the success of fusing frames and events for tracking in natural scenes under 6-DoF motion.

5. Conclusions

In this paper, we presented a novel visual odometry algorithm based on the DAVIS. The initial tracking template was extracted from the image sequence. A contrast maximization algorithm was used to estimate the optical flow by aligning the events with the templates in the spatio-temporal windows. IMU data were then used to calibrate the rotational position of the tracking template, and a Beta-Gaussian depth filter was presented to update the depth of each pixel on the template edges. The resulting templates were used to obtain low-latency camera trajectories with the ICP algorithm. We tested our method on several scenes with different textures from the DAVIS datasets. Compared with the ORB- and EVO-based visual odometry algorithms, the proposed algorithm showed more advantageous performance in terms of accuracy and robustness.

Author Contributions

Formal analysis, H.X.; investigation, X.G.; methodology, X.G.; project administration, X.L.; writing—original draft, H.X.; writing—review & editing, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grant U2003110, in part by the Key Laboratory Project of the Shaanxi Provincial Department of Education (No. 20JS110), and in part by the Xi’an Science and Technology Planning Project (No. 2020KJRC0084).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, J.; Qin, J.H.; Wang, J.; Li, J. OpenStreetMap-Based Autonomous Navigation for the Four Wheel-Legged Robot Via 3D-Lidar and CCD Camera. IEEE Trans. Ind. Electron. 2022, 69, 2708–2717.
  2. Liu, D.; Yang, T.L.; Zhao, R.; Wang, J.; Xie, X. Lightweight Tensor Deep Computation Model with Its Application in Intelligent Transportation Systems. IEEE Trans. Intell. Transp. Syst. 2022, 23, 2678–2687.
  3. Rong, S.; He, L.; Du, L.; Li, Z.; Yu, S. Intelligent Detection of Vegetation Encroachment of Power Lines with Advanced Stereovision. IEEE Trans. Power Deliv. 2021, 36, 3477–3485.
  4. Leng, C.; Zhang, H.; Li, B.; Cai, G.; Pei, Z.; He, L. Local Feature Descriptor for Image Matching: A Survey. IEEE Access 2019, 7, 6424–6434.
  5. Tedaldi, D.; Gallego, G.; Mueggler, E.; Scaramuzza, D. Feature Detection and Tracking with the Dynamic and Active-Pixel Vision Sensor (DAVIS). In Proceedings of the Second International Conference on Event-Based Control, Communication, and Signal Processing (EBCCSP), Krakow, Poland, 13–15 June 2016.
  6. Schuman, C.D.; Potok, T.E.; Patton, R.M.; Birdwell, J.D.; Dean, M.E.; Rose, G.S.; Plank, J.S. A Survey of Neuromorphic Computing and Neural Networks in Hardware. arXiv 2017, arXiv:1705.06963.
  7. Rebecq, H.; Horstschaefer, T.; Gallego, G.; Scaramuzza, D. EVO: A Geometric Approach to Event-Based 6-DOF Parallel Tracking and Mapping in Real Time. IEEE Robot. Autom. Lett. 2017, 2, 593–600.
  8. Chen, S.; Guo, M. Live Demonstration: CELEX-V: A 1M Pixel Multi-Mode Event-Based Sensor. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019.
  9. Zhang, Z.; Wan, W. DOVO: Mixed Visual Odometry Based on Direct Method and Orb Feature. In Proceedings of the 2018 International Conference on Audio, Language and Image Processing (ICALIP), Shanghai, China, 16–17 July 2018.
  10. Guo, M.; Huang, J.; Chen, S. Live Demonstration: A 768 × 640 Pixels 200Meps Dynamic Vision Sensor. In Proceedings of the 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Baltimore, MD, USA, 28–31 May 2017.
  11. Posch, C.; Matolin, D.; Wohlgenannt, R. A QVGA 143 dB Dynamic Range Frame-Free PWM Image Sensor with Lossless Pixel-Level Video Compression and Time-Domain CDS. IEEE J. Solid-State Circuits 2011, 46, 259–275.
  12. Li, Y.; Li, J.; Yao, Q.; Zhou, W.; Nie, J. Research on Predictive Control Algorithm of Vehicle Turning Path Based on Monocular Vision. Processes 2022, 10, 417.
  13. Stoffregen, T.; Gallego, G.; Drummond, T.; Kleeman, L.; Scaramuzza, D. Event-Based Motion Segmentation by Motion Compensation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October 2019.
  14. Stoffregen, T.; Kleeman, L. Event Cameras, Contrast Maximization and Reward Functions: An Analysis. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
  15. Gallego, G.; Delbruck, T.; Orchard, G.M.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.; Conradt, J.; Daniilidis, K.; et al. Event-Based Vision: A Survey. arXiv 2020, arXiv:1904.08405.
  16. Chiang, M.L.; Tsai, S.H.; Huang, C.M.; Tao, K.T. Adaptive Visual Serving for Obstacle Avoidance of Micro Unmanned Aerial Vehicle with Optical Flow and Switched System Model. Processes 2021, 9, 2126.
  17. Duo, J.; Zhao, L. An Asynchronous Real-Time Corner Extraction and Tracking Algorithm for Event Camera. Sensors 2022, 22, 1475.
  18. Mitrokhin, A.; Fermuller, C.; Parameshwara, C.; Aloimonos, Y. Event-Based Moving Object Detection and Tracking. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018.
  19. Li, K.; Shi, D.; Zhang, Y.; Li, R.; Qin, W.; Li, R. Feature Tracking Based on Line Segments with the Dynamic and Active-Pixel Vision Sensor (DAVIS). IEEE Access 2019, 7, 110874–110883.
  20. Iaboni, C.; Patel, H.; Lobo, D.; Choi, J.W.; Abichandani, P. Event Camera Based Real-Time Detection and Tracking of Indoor Ground Robots. IEEE Access 2022, 9, 166588–166602.
  21. Zhu, A.Z.; Chen, Y.; Daniilidis, K. Realtime Time Synchronized Event-Based Stereo. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
  22. Ozawa, T.; Sekikawa, Y.; Saito, H. Accuracy and Speed Improvement of Event Camera Motion Estimation Using a Bird’s-Eye View Transformation. Sensors 2022, 22, 773.
  23. Kim, H.; Kim, H.J. Real-Time Rotational Motion Estimation with Contrast Maximization Over Globally Aligned Events. IEEE Robot. Autom. Lett. 2022, 6, 6016–6023.
  24. Pal, B.; Khaiyum, S.; Kumaraswamy, Y.S. 3D Point Cloud Generation from 2D Depth Camera Images Using Successive Triangulation. In Proceedings of the 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), Bengaluru, India, 21–23 February 2017.
  25. Umair, M.; Farooq, M.U.; Raza, R.H.; Chen, Q.; Abdulhai, B. Efficient Video-Based Vehicle Queue Length Estimation Using Computer Vision and Deep Learning for an Urban Traffic Scenario. Processes 2021, 9, 1786.
  26. Gallego, G.; Lund, J.E.A.; Mueggler, E.; Rebecq, H.; Delbruck, T.; Scaramuzza, D. Event-Based, 6-DOF Camera Tracking from Photometric Depth Maps. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2402–2412.
  27. Shao, F.; Wang, X.; Meng, F.; Rui, T.; Wang, D.; Tang, J. Real-Time Traffic Sign Detection and Recognition Method Based on Simplified Gabor Wavelets and CNNs. Sensors 2018, 18, 3192.
  28. Alzugaray, I.; Chli, M. Asynchronous Corner Detection and Tracking for Event Cameras in Real Time. IEEE Robot. Autom. Lett. 2018, 3, 3177–3184.
  29. Zhou, Y.; Gallego, G.; Shen, S. Event-Based Stereo Visual Odometry. IEEE Trans. Robot. 2022, 37, 1433–1450.
  30. Mueggler, E.; Gallego, G.; Scaramuzza, D. Continuous-Time Trajectory Estimation for Event-Based Vision Sensors. In Proceedings of the Robotics: Science and Systems, Rome, Italy, 17 July 2015.
  31. Kim, H.; Leutenegger, S.; Davison, A.J. Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 16 October 2016; Springer: Cham, Switzerland, 2016.
  32. Mueggler, E.; Gallego, G.; Rebecq, H.; Scaramuzza, D. Continuous-Time Visual-Inertial Odometry for Event Cameras. IEEE Trans. Robot. 2018, 34, 1425–1444.
  33. Mueggler, E.; Rebecq, H.; Gallego, G.; Delbruck, T.; Scaramuzza, D. The Event Camera Dataset and Simulator: Event-Based Data for Pose Estimation, Visual Odometry, and SLAM. Int. J. Robot. Res. 2017, 36, 142–149.
Figure 1. An overview of the visual odometry method. θ is the optical flow of the event in the corresponding spatio-temporal window.
Figure 2. Detection of the tracking template. Using the feature points image in (a) and the edge points image in (b), we can obtain the tracking template. As shown in (c), we visualize the events of the first spatio-temporal window.
Figure 3. Updating the tracking template. Figure (a) is the update process of feature points and template points in a fixed time interval, while Figure (b) is an optical flow visualization of the tracking template. As shown in Figure (c), the events obtain the optical flow in the first spatio-temporal window. Figure (d) presents variance convergence curves. Variance 1 and Variance 2 represent the variance change in IWE after events are warped along with different optical flows.
Figure 4. Depth filter measurement model. 3D space projection is performed between the 2D coordinates corresponding to the current template pose $T_{wc}$ and those of the previous template. As the number of template correspondences increases, the depth distribution gradually converges.
Figure 5. VO Experiment 1 (shapes_6dof): (a) 3D event camera trajectories estimated by different algorithms; (b) Comparison of absolute pose errors between different algorithms; (c) The translation position of the camera estimated by different algorithms; (d) The rotation position of the camera estimated by different algorithms.
Figure 6. VO Experiment 2 (boxes_6dof): (a) 3D event camera trajectories estimated by different algorithms; (b) Comparison of absolute pose errors between different algorithms; (c) The translation position of the camera estimated by different algorithms; (d) The rotation position of the camera estimated by different algorithms.
Figure 7. VO Experiment 3 (outdoors_6dof): (a) 3D event camera trajectories estimated by different algorithms; (b) Comparison of absolute pose errors between different algorithms; (c) The translation position of the camera estimated by different algorithms; (d) The rotation position of the camera estimated by different algorithms.
Table 1. Absolute pose error (m/s), position error (m/s), and rotation error (deg/s) test results in different scenarios with different methods [33].

                      Sequence        Our_Method   ORB      Our_Method (+IMU)   EVO
Absolute pose error   shapes_6dof     0.0815       0.0780   0.0315              0.0435
                      boxes_6dof      0.0434       0.0369   0.0204              0.0344
                      outdoors_6dof   0.1416       0.2357   0.0461              0.0227
Position error        shapes_6dof     0.1562       0.0845   0.0391              0.0496
                      boxes_6dof      0.2255       0.0956   0.0436              0.0586
                      outdoors_6dof   0.1728       0.0735   0.0325              0.0265
Rotation error        shapes_6dof     2.4562       1.8342   0.6443              0.9093
                      boxes_6dof      3.5432       1.6901   0.5426              0.5084
                      outdoors_6dof   3.4734       1.6042   0.4453              0.7253
