Article

A Multi-Objective Reinforcement Learning Based Controller for Autonomous Navigation in Challenging Environments

by Amir Ramezani Dooraki and Deok-Jin Lee *

School of Mechanical Design Engineering, Jeonbuk National University, Jeonju 54896, Korea

* Author to whom correspondence should be addressed.
Submission received: 23 April 2022 / Revised: 1 June 2022 / Accepted: 14 June 2022 / Published: 22 June 2022
(This article belongs to the Topic Advances in Mobile Robotics Navigation)

Abstract
In this paper, we introduce a self-trained controller for autonomous navigation in static and dynamic (with moving walls and nets) challenging environments (including trees, nets, windows, and a pipe) using deep reinforcement learning, trained simultaneously with multiple rewards. We train our RL algorithm in a multi-objective way. The algorithm learns to generate continuous actions for controlling the UAV, and it aims to generate waypoints for the UAV so that it reaches a goal area (shown by an RGB image) while avoiding static and dynamic obstacles. We use the RGB-D image as the input to the algorithm, which learns to control the UAV in 3-DoF (x, y, and z). We train our robot in environments simulated by Gazebo sim. For communication between our algorithm and the simulated environments, we use the Robot Operating System. Finally, we visualize the trajectories generated by our trained algorithm using several methods and illustrate results that clearly show the algorithm's capability to learn to maximize the defined multi-objective reward.

1. Introduction

Reflecting on nature, one sees a variety of intelligent animals, each with some degree of intelligence, all of which learn by experience to convert their potential capabilities into skills. This fact inspired the machine learning community to devise and develop machine learning approaches inspired by nature, within the field of bio-inspired artificial intelligence [1]. Further, it motivated us to use the idea of learning by reinforcement signal in this paper. Reinforcement signals (such as positive rewards and punishments) are among the primary sources of learning in intelligent creatures. This type of learning is implemented and applied by the reinforcement learning family of algorithms in machine learning.
While learning can be considered a type of intelligence, it can only manifest itself when applied to an agent or when used to control one. An agent could be any actual or simulated machine, robot, or application with sensors and actuators. Considering the several categories of robots, Unmanned Aerial Vehicles (UAVs) are among the most critical agents to be controlled because they can fly, which means they can move to locations and positions that are not possible for Unmanned Ground Vehicles (UGVs) and Unmanned Surface Vehicles (USVs). One of the essential considerations in controlling a UAV is its navigation and control algorithm. There are several kinds of methods used for navigation in UAVs: some depend on an external guiding system (such as GPS), and others are based on rule-based controllers, which differ from how a living creature learns.
In this paper, we introduce a self-trained controller for autonomous navigation (SCAN) (Algorithm 1) in static and dynamic challenging environments using a deep reinforcement learning-based algorithm trained by Depth and RGB images as input, and multiple rewards. We control the UAV using continuous actions (a sample generated trajectory by our algorithm is shown in Figure 1) and train our RL algorithm in a multi-objective way in order to be able to reach a defined goal area while learning to avoid static and dynamic obstacles. We train our robot in a simulated environment using Gazebo sim., and for the communication between the algorithm and simulated agent, we use Robot Operating System (ROS).
In the rest of this text, we first discuss the related literature and then explain our main work with a detailed explanation for each part of our proposed algorithm, such as enhancements, neural networks, obstacle detection method, reward calculation, and trajectory visualization (In this research, to visualize the 3D trajectory in a 3D environment, we used 3D-SLAM. However, SLAM is used only for visualization purposes, and it is not used in any way for training, playing, or control of the UAV). In the next step, we present our results and discuss some crucial aspects of our work before the conclusion. Before starting our literature review, we want to emphasize the contributions of our paper here:
  • First, we proposed a new RL-based self-learning algorithm for controlling a multi-rotor drone in 3-DoF (x, y, and z) in highly challenging environments;
  • Second, our proposed algorithm simultaneously trains the policy and value networks using several rewards to create a multi-objective agent capable of learning to traverse through an environment to reach a goal location while avoiding the obstacles;
  • Third, we verified our algorithm using two highly challenging environments, including static and dynamic obstacles (such as moving walls and nets);
  • Fourth, our algorithm is an end-to-end self-learning controller in which the RGB-D image is the input to the policy network, and a three-dimensional continuous action (gas (moving forward), rotation (to the left or right), and elevation) is the output of our algorithm;
  • Fifth, we used an onboard method (that is, defined within our algorithm, not received from external sources such as simulator) to calculate the reward related to obstacle avoidance and prediction of collisions, which makes our method capable of further learning after deployment on a real UAV.

2. Related Work

In this section, we first review works that used a rule-based controller such as PID, SMD, MPC, or LQR for control of the robot (UAV), then focus on works that used deep learning, or deep reinforcement learning, or reinforcement learning for low-level and high-level control of the UAV.
Looking at the literature, there are several approaches that use rule-based methods for autonomous navigation. For example, the authors of [2] proposed a path planning algorithm using an ant colony clustering algorithm. In [3], a method is proposed for generating a navigation path considering energy efficiency. In [4], Singh et al. proposed an approach for obstacle avoidance in challenging environments. An approach for navigation and guidance capable of multi-robot control and obstacle avoidance is proposed by Mina et al. [5]. Another approach for swarm robot control in terms of guidance and navigation is proposed in [6].
In addition, reinforcement learning and deep RL are used in several works to create intelligent autonomous robots by acting as the low-level and high-level flight controllers in Unmanned Aerial Vehicles (UAVs). In the high-level case, the controller generates the waypoints or trajectories and passes them to the low-level controller, and in the low-level case, the controller receives the waypoints or trajectories and executes them on the quad-copter, either by sending speeds to the motors or by generating the proper forces. In terms of low-level flight controllers, in the work by Ng et al. [7], the authors captured the flight data of a helicopter and used it to create a transition model for training low-level flight controllers for helicopters. A linear quadratic regulator (LQR) and differential dynamic programming (DDP) were used in [8] to develop an RL-based flight controller. Hwangbo et al. [9] proposed a low-level flight controller capable of generating motor speeds using reinforcement learning and a proportional-derivative (PD) controller. Molchanov et al. [10] proposed an approach focused on developing a low-level RL-based flight controller that automatically learns to reduce the gap between simulated and actual quadcopters. The authors of [11] developed a method combining a PD controller and a reinforcement learning-based controller for controlling a quadcopter in faulty conditions, using LSTM networks to detect the fault under different conditions.
In terms of high-level control using reinforcement learning, [12] proposed a method to control a drone using deep RL for exploration and obstacle avoidance. In [13], by Zhang et al., guided policy search (GPS) is used after training in a supervised learning approach using data provided by a model predictive controller (MPC) and a Vicon system for accurate position and attitude data. Liu et al. [14] used a virtual potential field to develop a multi-step reinforcement learning algorithm. In [15], deep RL is used for obstacle avoidance in challenging environments. Long et al. [16] used a deep neural network in an end-to-end approach for efficient distributed multi-agent navigation, and their algorithm is capable of obstacle avoidance during navigation. In [17], deep RL is used to create a controller capable of path following while interacting with the environment. Zhou et al. [18] analyzed deep RL in challenging environments for formation path planning and obstacle avoidance. Liu et al. [19] used RL and a probability map to create a search algorithm and improve its detection ability. Finally, deep RL is used for local motion planning in an unknown environment in [20], and for trajectory tracking and altitude control in [21].
In contrast to many rule-based and traditional methods, which perform weakly in complex and dynamic environments and high-dimensional state spaces, deep reinforcement learning algorithms can control the system by learning to generate the best action given the defined reward and high-dimensional sensor data as their state. Further, RL-based algorithms have the advantage of learning autonomously from their own experiences, which makes them highly flexible and capable of learning to control, a feature that is absent or limited in rule-based control algorithms.
In a nutshell, our work differs from rule-based algorithms and methods mainly because our algorithm learns to control, whereas in a rule-based control algorithm everything is designed and calculated beforehand. Further, it differs from several deep learning and deep RL works in that it learns from a high-dimensional state space to control the UAV by generating waypoints along the XYZ axes. Many of the works reviewed in the literature either use only two dimensions for control or do not use a high-dimensional RGB-D state space as we do. For example, [15] focused on obstacle avoidance from an RGB image, while our algorithm addresses obstacle avoidance and reaching the goal location from an RGB-D image. Moreover, while some works such as [17] focused on following a path, our algorithm generates a path. Further, compared to works such as [18], where a Deep Q-Network capable of generating discrete actions is used, our work uses a policy gradient-based reinforcement learning algorithm capable of generating continuous actions. Compared to another similar work [20], we use RGB-D data as the input to our deep RL algorithm and generate 3D actions, while they use 2D lidar data and generate 2D actions. Further, our goal is an image, whereas their goal is a point fed to the algorithm. In addition, our method calculates the distance to obstacles and predicts collisions internally, making it possible for our algorithm to continue training even after deployment on an actual drone.

3. Self-Trained Controller for Autonomous Navigation (SCAN)

We use an infinite-horizon Markov Decision Process (MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, D, \gamma)$, to formulate our problem of learning to navigate in challenging environments using a multi-objective reward. Let $\mathcal{S}$ be a finite set of states, $\mathcal{A}$ be a finite set of actions, $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ be the transition probability distribution, $r: \mathcal{S} \to \mathbb{R}$ be the reward function, $D: \mathcal{S} \to \mathbb{R}$ be the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ be the discount factor. Our goal is to train a reinforcement learning-based agent to learn to control a UAV to navigate through challenging environments. Our state space is a 64 x 64 pixel depth image or RGB image:
$$S_t^{depth} = \{64, 64, 1\}, \qquad S_t^{RGB} = \{64, 64, 3\}$$
Our action space is a three-dimensional continuous space vector, including controlling the height (Elevation), turning left or right (Rotation), and moving forward (Gas).
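As a concrete illustration (not the authors' code), the state and action spaces above can be written with gym-style space objects; the value ranges and action bounds below are placeholder assumptions, since the paper does not list the exact limits.

```python
import numpy as np
from gym import spaces

# 64x64 depth image (1 channel) or RGB image (3 channels) as the state.
# The low/high bounds are illustrative assumptions.
depth_state_space = spaces.Box(low=0.0, high=1.0, shape=(64, 64, 1), dtype=np.float32)
rgb_state_space = spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8)

# Three continuous actions: gas (forward), rotation (yaw), elevation (z).
action_space = spaces.Box(low=np.array([0.0, -1.0, -1.0]),
                          high=np.array([1.0, 1.0, 1.0]), dtype=np.float32)
```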
For solving the problem of learning a high-level flight controller, we use proximal policy optimization [22], a policy gradient-based reinforcement learning algorithm, with some enhancements mentioned at the end of this section, for training and optimizing our algorithm. We use the following conventions:
$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l\, r(s_{t+l})\Big], \qquad V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l\, r(s_{t+l})\Big], \qquad A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t),$$
we define $\pi$ to be a stochastic policy $\pi: \mathcal{S} \times \mathcal{A} \to [0, 1]$, and $\eta(\pi)$ to be its expected discounted reward. Then, we define the following expression as the expected return of another policy $\tilde{\pi}$ in terms of its advantage over $\pi$:
$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]$$
Let $\rho_\pi$ be the discounted visitation frequencies:
$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots,$$
where $s_0 \sim \rho_0$, and the actions are chosen according to $\pi$. We can rewrite Equation (1) with a sum over states instead of time steps:
$$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$
Given the complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$, we use the following approximation instead of the one in the previous equation:
$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$$
Considering $\pi_\theta(a \mid s)$ to be a differentiable function of the parameter vector $\theta$, $L_\pi$ then matches $\eta$ to the first order. That is, for any parameter value $\theta_0$,
$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta\, \eta(\pi_\theta)\big|_{\theta = \theta_0}$$
Further, as mentioned in the following, we use an improvement bound based on the Kullback–Leibler (KL) divergence, $D_{KL}$ [23]. Additionally, we overload our previous notation to use functions of $\theta$ rather than $\pi$, e.g., $\eta(\theta) := \eta(\pi_\theta)$, $L_\theta(\tilde{\theta}) := L_{\pi_\theta}(\pi_{\tilde{\theta}})$, and $D_{KL}(\theta \,\|\, \tilde{\theta}) := D_{KL}(\pi_\theta \,\|\, \pi_{\tilde{\theta}})$. $\theta_{old}$ refers to the parameters of the previous policy.
Next, we use the policy $\pi_{\theta_{old}}$ to generate a trajectory $s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T, a_T$, where $T$ is the length of an episode. We also set $q(a \mid s) = \pi_{\theta_{old}}(a \mid s)$. $Q_{\theta_{old}}(s, a)$ is calculated at each state–action pair $(s_t, a_t)$ by taking the discounted sum of future rewards along the trajectory. We then solve:
$$\max_\theta\; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta$$
Using Equation (3), we obtain the following equation:
$$\max_\theta\; \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\theta_{old}}(s, a) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \le \delta$$
For each s n , we can calculate the first term by:
$$\sum_a \pi_\theta(a \mid s_n)\, A_{\theta_{old}}(s_n, a) = \mathbb{E}_{a \sim q}\Big[\frac{\pi_\theta(a \mid s_n)}{q(a \mid s_n)}\, A_{\theta_{old}}(s_n, a)\Big]$$
In the next step, we try to solve the following constrained optimization problem and update policy parameters:
$$\max_\theta\; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\Big[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{old}}(s, a)\Big] \quad \text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{old}}}\big[D_{KL}(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s))\big] \le \delta$$
We generate a sequence of states using $s_0 \sim \rho_0$ and $\pi_{\theta_{old}}$ for $n$ steps, producing a trajectory $s_0, a_0, s_1, a_1, \ldots, s_{T-1}, a_{T-1}, s_T$, and re-calculate $q(a \mid s) = \pi_{\theta_{old}}(a \mid s)$. The following equation is used to form the general policy objective:
$$\max_\theta\; \hat{\mathbb{E}}_t\Big[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\Big] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\big[KL[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)]\big] \le \delta$$
To solve the optimization problem without a constraint and using a penalty, it is possible to use the following formula:
$$\max_\theta\; \hat{\mathbb{E}}_t\Big[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, KL[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)]\Big]$$
Considering $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, so that $r(\theta_{old}) = 1$, the surrogate objective to maximize is as follows:
$$L^{CPI}(\theta) = \hat{\mathbb{E}}_t\Big[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\Big] = \hat{\mathbb{E}}_t\big[r_t(\theta)\, \hat{A}_t\big]$$
where CPI is conservative policy iteration [24]. To maximize Equation (10) without a constraint the following equation is used to penalize changes to the policy that move r t ( θ ) away from 1:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_t\big)\Big]$$
where $\epsilon$ is a hyper-parameter (e.g., 0.2). Further, instead of Equation (11), or in addition to it, it is possible to use the following adaptive KL penalty coefficient loss:
$$L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\Big[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, KL[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)]\Big]$$
where in each iteration, we use the following calculation to obtain the new β :
$$d = \hat{\mathbb{E}}_t\big[KL[\pi_{\theta_{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)]\big]; \qquad \text{if } d < d_{targ}/1.5,\ \beta \leftarrow \beta/2; \qquad \text{if } d > d_{targ} \times 1.5,\ \beta \leftarrow \beta \times 2$$
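The adaptive coefficient rule above amounts to a few lines of bookkeeping; a minimal sketch, with illustrative variable names, is:

```python
# Adaptive KL penalty coefficient update: shrink beta when the measured KL is well
# below the target, grow it when the KL overshoots. `mean_kl` is the batch KL
# between the old and new policies; `d_targ` is the target KL (illustrative names).
def update_beta(beta, mean_kl, d_targ):
    if mean_kl < d_targ / 1.5:
        beta /= 2.0
    elif mean_kl > d_targ * 1.5:
        beta *= 2.0
    return beta
```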
Combining the clipped surrogate loss with a value-function loss and an entropy bonus gives the final objective:
$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\big[L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)\big]$$
where $c_1$ and $c_2$ are coefficients, $S$ is an entropy bonus, and $L_t^{VF}$ is a squared-error value loss $(V_\theta(s_t) - V_t^{targ})^2$.
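To make the combined objective concrete, the following is a sketch of Equation (13) as a TensorFlow 2 training loss (the sign is flipped because optimizers minimize); the tensor names and coefficient defaults are illustrative, not taken from the paper's implementation.

```python
import tensorflow as tf

def ppo_loss(new_logp, old_logp, advantages, values, value_targets, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    ratio = tf.exp(new_logp - old_logp)                       # r_t(theta)
    unclipped = ratio * advantages
    clipped = tf.clip_by_value(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    l_clip = tf.reduce_mean(tf.minimum(unclipped, clipped))   # L^CLIP
    l_vf = tf.reduce_mean(tf.square(values - value_targets))  # squared-error value loss
    l_s = tf.reduce_mean(entropy)                             # entropy bonus S
    # Maximizing L^CLIP - c1*L^VF + c2*S equals minimizing its negative.
    return -(l_clip - c1 * l_vf + c2 * l_s)
```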
Further, we use the policy gradient implementation mentioned in Mnih et al. [25]:
$$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)$$
Algorithm 1 Self-trained Controller for Autonomous Navigation (SCAN) Algorithm
Initialize neural network weights for policy and value networks
while True do
 Generate state s t
 Feed state s t to neural network, generate action a t , execute in the environment, receive the new RGB-D frame
 Distance, Collision Occurred = Process the depth frame using the method explained in Section 3:
  (1) Using RANSAC find the ground plane in the re-projected depth data;
   (2) Remove all the points belonging to the ground plane;
  (3) Consider the rest of the points as obstacle points, and calculate distances between the robot and the obstacle points;
  (4) Return Calculated Distance and Collision Occurred.
if (Collision Occurred) then
  Reset the UAV position using the enhancements explained in Section 3.1
  Generate new state s t
end if
r_t = r_MO, where r_MO = −1 if a collision occurred, else r_MO = r_A + r_RG + r_M, with
   r_A = +1 if the UAV moved to the next area, else r_A = 0.0
   r_RG = +5 if the UAV reached the goal, else r_RG = 0.0
   r_M = α‖pos_previous − pos_current‖ − β
if TRAINING then
  Store experience s t , a t , s t + 1 , r t in experience storage
  if EPISODE finished then
   Update neural network using the episode experiences considering the following equation:
    L_t^{CLIP+VF+S}(θ) = Ê_t[ L_t^{CLIP}(θ) − c_1 L_t^{VF}(θ) + c_2 S[π_θ](s_t) ]
   Use enhancements explained in Section 3.1 to reset UAV position
  end if
end if
end while
Then, we use a truncated version of generalized advantage estimation as shown in the following:
$$\hat{A}_t = \delta_t + (\gamma\lambda)\,\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t+1}\,\delta_{T-1}, \quad \text{where } \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$
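A minimal NumPy sketch of this truncated advantage estimator, computed backwards over one finished episode, is shown below; the variable names and the γ, λ defaults are illustrative.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # rewards: r_0..r_{T-1}; values: V(s_0)..V(s_{T-1}); last_value: V(s_T) bootstrap.
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # delta_t
        gae = delta + gamma * lam * gae                       # recursive (gamma*lambda) sum
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.asarray(values, dtype=np.float32)  # value-function targets
    return advantages, returns
```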
Finally, in order to increase the learning speed and performance of our proposed algorithm, we introduce some enhancements that are similar to enhancements mentioned in [26] targeting the exploration mechanism of our algorithm.
A neural network with two convolutional layers enhanced by residual layers, followed by one hidden layer of 256 neurons with the ReLU activation function, is used for SCAN (Figure 2). Since we use PPO as the RL algorithm in this work, we have two neural networks. The policy and value networks have a similar shape except for the output layers: the value network has one output, the state value, and the policy network has n outputs, the action means. The means generate the actions after noise is added to them. The flow of information in our algorithm is shown in Figure 3, and the simulated UAV quad-copter is shown in Figure 4.
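As a rough reconstruction of this architecture (ours, not the authors' exact code), the Keras sketch below builds two networks of similar shape, each with two convolutional stages carrying a simple residual connection, a 256-unit ReLU layer, and either an action-mean head or a state-value head; the filter counts and pooling are assumptions, since the paper does not list them.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_conv_block(x, filters):
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    skip = layers.Conv2D(filters, 1, padding="same")(x)   # match channels for the skip path
    return layers.ReLU()(layers.Add()([y, skip]))

def build_network(n_outputs, input_shape=(64, 64, 1)):
    inputs = layers.Input(shape=input_shape)
    x = residual_conv_block(inputs, 16)
    x = layers.MaxPooling2D()(x)
    x = residual_conv_block(x, 32)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    x = layers.Dense(256, activation="relu")(x)
    outputs = layers.Dense(n_outputs)(x)                  # action means or state value
    return tf.keras.Model(inputs, outputs)

policy_net = build_network(n_outputs=3)   # mean of each continuous action
value_net = build_network(n_outputs=1)    # V(s)
```

At inference time, the three action means are perturbed with noise (as described above) to produce the executed actions.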
Figure 1. Trajectory of the UAV controlled by our proposed algorithm (SCAN). SCAN is able to control the UAV in three dimensions (X, Y, and Z). The image is generated by the SLAM approach (only used for visualization purpose) explained in the Section 4.3.1 of this paper.
We use the method we published in [27], where a RANSAC-based algorithm detects the ground plane and distinguishes the obstacle points, for obstacle detection. Getting too close to these obstacle points is considered a collision. As a result, instead of relying on the simulator to report the collision, we calculate it within our algorithm. In this way, it is possible to detect obstacles even after deploying the algorithm onboard a real drone and to continue the training on the actual drone.
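A simplified sketch of this step (our reconstruction; see [27] for the authors' method) is shown below: fit a ground plane to the re-projected depth points with RANSAC, discard the ground inliers, and measure the distance to the remaining obstacle points. The thresholds and iteration count are illustrative.

```python
import numpy as np

def nearest_obstacle_distance(points, n_iters=100, ground_tol=0.05, collision_dist=0.5):
    # points: (N, 3) array of 3D points re-projected from the depth image, in the UAV frame.
    if len(points) < 3:
        return np.inf, False
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        idx = np.random.choice(len(points), 3, replace=False)
        p0, p1, p2 = points[idx]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-6:
            continue                                  # degenerate sample, skip
        normal /= norm
        dist = np.abs((points - p0) @ normal)         # point-to-plane distances
        inliers = dist < ground_tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers                    # keep the dominant plane (the ground)
    obstacles = points[~best_inliers]                 # everything off the ground plane
    if len(obstacles) == 0:
        return np.inf, False
    d_min = np.linalg.norm(obstacles, axis=1).min()   # distance from the UAV to nearest point
    return d_min, d_min < collision_dist              # (distance, collision occurred)
```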
For training our SCAN algorithm, we defined several rewards (that is, a multi-objective reward called r M O ) where each one of them can be considered as an objective for the algorithm. The rewards used in our algorithm are explained in the following:
  • Collision Avoidance reward: Hitting an obstacle gives a negative reward of −1, that is, r_MO = −1.0.
  • Area reward (r_A): Moving from one area to the next returns a reward of +1.
  • Reaching Goal reward (r_RG): Reaching the goal location (the red sign at the end of the environment) returns a positive reward of +5.
  • Motivation Reward (r_M): In order to keep the agent motivated to move, we give it a small reward proportional to the amount of its movement. Further, in order to increase the effectiveness of this reward, we include a constant β as shown in the following formula; depending on β, r_M can take a positive or negative value.
As a result, we define the r M O using the following equation:
$$r_{MO} = \begin{cases} -1 & \text{if a collision occurred} \\ r_A + r_{RG} + r_M & \text{otherwise} \end{cases}$$
where
$$r_A = \begin{cases} +1 & \text{if the UAV moved to the next area} \\ 0.0 & \text{otherwise} \end{cases} \qquad r_{RG} = \begin{cases} +5 & \text{if the UAV reached the goal} \\ 0.0 & \text{otherwise} \end{cases}$$
$$r_M = \alpha \,\lVert pos_{previous} - pos_{current} \rVert - \beta$$
where $\alpha$ is a scale factor and $\beta$ is a constant.
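Written as code, the multi-objective reward is a short function; the α and β values below are placeholders (the paper does not report the exact constants here), and the r_M term assumes the α‖Δpos‖ − β form described above.

```python
import numpy as np

def multi_objective_reward(collision, moved_to_next_area, reached_goal,
                           pos_previous, pos_current, alpha=1.0, beta=0.05):
    if collision:
        return -1.0                                   # collision avoidance reward
    r_a = 1.0 if moved_to_next_area else 0.0          # area reward
    r_rg = 5.0 if reached_goal else 0.0               # reaching-goal reward
    # Motivation reward: proportional to the displacement, shifted by beta.
    r_m = alpha * np.linalg.norm(np.asarray(pos_previous) - np.asarray(pos_current)) - beta
    return r_a + r_rg + r_m
```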

3.1. Enhancements

In this section, we define two modifications to the state exploration of the RL algorithm that substantially increase the performance of SCAN in learning to traverse the environments.

3.1.1. Random Starting Position (RSP)

In order to give the drone the capability of starting from different positions in the middle of the trajectories, we define a new distribution called Random Starting Position (RSP). This new distribution is used alongside the initial starting state distribution (D) of the MDP. We define RSP to be:
$$RSP = \{x_{pos}, y_{pos}, z_{pos}\}, \quad \text{where} \quad x_{pos} \in X_{desired},\; y_{pos} \in Y_{desired},\; z_{pos} \in Z_{desired},$$
$$X_{desired} = [x_{min}^{d}, x_{max}^{d}], \qquad Y_{desired} = [y_{min}^{d}, y_{max}^{d}], \qquad Z_{desired} = [z_{min}^{d}, z_{max}^{d}]$$
Considering that we are using a policy gradient algorithm that trains the policy and value networks on the trajectories of the same episode, RSP helps the agent gather sample trajectories from different positions on the map. This, in turn, helps the agent learn faster and in a balanced way, and at the same time it avoids catastrophic forgetting in neural network training, where emphasizing a specific subset of samples causes the network to diverge from what it has already learned.
This idea is borrowed from how infants learn. Consider a parent who tries to help their infant learn to play with a toy: every time the toy falls from the infant's hand, the parent places it back in the infant's hand so the infant can keep trying to play with it.
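A minimal sketch of drawing a starting position from RSP is shown below; the x/y/z ranges are placeholder values standing in for the desired region of the environment.

```python
import numpy as np

def sample_rsp(x_range=(-1.0, 20.0), y_range=(-3.0, 3.0), z_range=(0.5, 2.5)):
    # Uniformly sample a starting position inside the desired x/y/z intervals.
    return np.array([np.random.uniform(*x_range),
                     np.random.uniform(*y_range),
                     np.random.uniform(*z_range)])
```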

3.1.2. Controlled Starting State (CSS)

In addition to the idea of RSP, where we reset the UAV to random positions in the environment, we define the Controlled Starting State (CSS), which we use to measure the agent's performance when it starts from each subset of the environment. We divide the starting positions in the environment into n subsets and initially give each subset a value of 0. Then, every time we reset the UAV to one of these subsets, we measure the reward accumulated from that subset until the terminal state. The next time the agent starts from this subset, we add the new accumulated reward to the old CSS value of that subset and divide the result by 2; in other words, we always keep the mean of the latest two accumulated rewards gathered from that starting subset.
Finally, we use CSS together with RSP to alternately start the UAV from a random position and from the position with the minimum CSS value. Doing this helps the agent focus on learning to navigate from the positions in the environment from which it has statistically been unable to gather a good amount of reward.
In the following, we define $CSS_n^{value}$ to be the value of subset $n$ of all the subsets defined in RSP:
$$CSS_n^{value} = \frac{1}{2}\Big(CSS_n^{value} + \sum_{S_t}^{S_{t+n}} r\Big)$$
where the sum is the reward accumulated from the starting state of subset $n$ until the terminal state.
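A possible implementation of this bookkeeping, alternating RSP-style random starts with the minimum-value CSS subset as described above, could look like the following (the class and attribute names are ours):

```python
import numpy as np

class ControlledStartingState:
    def __init__(self, n_subsets):
        self.values = np.zeros(n_subsets)   # CSS value per starting subset
        self.use_random = True              # alternate RSP / minimum-CSS starts

    def update(self, subset_idx, episode_return):
        # Running mean of the previous CSS value and the new accumulated reward.
        self.values[subset_idx] = 0.5 * (self.values[subset_idx] + episode_return)

    def next_start(self):
        self.use_random = not self.use_random
        if self.use_random:
            return np.random.randint(len(self.values))   # RSP-style random subset
        return int(np.argmin(self.values))               # weakest subset so far
```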

4. Simulation Environment

The Gazebo simulator [28] is used in this work for training and evaluation of our models. We use the package provided for the AscTec Pelican. The benefit of using Gazebo sim. with the provided package is that we can easily connect it to our algorithm using ROS (Robot Operating System), which gives us the advantage of mobility; that is, we can seamlessly move our algorithm to other platforms. We send the actions from our wrapper environment to the Gazebo simulator through the topics provided by the RotorS package [29], and we receive the RGB-D data from the provided topics as well.
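For illustration, a waypoint generated by the algorithm could be published to the simulated Pelican roughly as follows; the topic name and PoseStamped message type are assumptions based on common RotorS conventions, not details confirmed by the paper.

```python
import rospy
from geometry_msgs.msg import PoseStamped

# Hypothetical waypoint publisher; the actual topic depends on the RotorS configuration.
rospy.init_node("scan_waypoint_publisher")
pub = rospy.Publisher("/pelican/command/pose", PoseStamped, queue_size=1)

def send_waypoint(x, y, z):
    msg = PoseStamped()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = "world"
    msg.pose.position.x, msg.pose.position.y, msg.pose.position.z = x, y, z
    msg.pose.orientation.w = 1.0          # keep a neutral orientation
    pub.publish(msg)
```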
Furthermore, we gather the RGB-D data and action results, plus the accumulated reward of the agent throughout the training process, to support our hypothesis in the results section.

4.1. Static and Dynamic Environments

In this work, as can be seen in Figure 5 and Figure 6, we defined a static environment, and as can be seen in Figure 7 and Figure 8, we added two moving parts to the previous environment (two moving walls and moving nets) to create a dynamic environment. The dynamic obstacles move in one direction until they reach a limit: the walls move until they reach the other walls, and the nets move until they reach the ground or the maximum height (3.0 m). Further, we arranged them to move slightly slower than the moving speed of the UAV.
We used these environments to train our agent. Further, as it can be seen in Figure 9 and Figure 10, we divided the environment into three virtual areas used for tracing the agent as well as calculating our several rewards for training our agent.

4.2. A New Unknown Environment

After training our agent in the static and dynamic environments mentioned in the previous section, we test it, without any further training, in a new unknown environment. This environment consists of many rooms and has two floors, and the only connection between these two floors is a staircase.

4.3. Navigation Visualization

Considering that our algorithm is a high-level controller that generates the proper sequence of the waypoints to control the drone in an environment with obstacles, the main point would be the trajectory of the waypoints. As a result, we implement some methods to visualize the sequence (trajectory) of waypoints generated by our algorithm in the top-view map of the environments and the XZ, XY, and YZ planes.

4.3.1. SLAM for Visualizing the Trajectory in 3D Map

In order to clearly illustrate the trajectory of the UAV controlled by our proposed algorithm, and considering that our algorithm controls the quad-copter in three dimensions, the best solution from our point of view was to use SLAM. We used SLAM since it can localize the agent in an environment and do the mapping at the same time, so we can see the trajectory of the agent's movement in the environment. The images generated by the SLAM we used in this paper are focused on localization and not on mapping. In other words, the mapping result seen in our SLAM images is the result of moving along the shown trajectory from its starting point to its ending point; we did not try to generate a perfect map. Finally, we again emphasize that SLAM is used only for visualization purposes, and the result of the SLAM algorithm is not used in any way in the training, playing, or control of our algorithm and the agent shown in this paper. We used the SLAM package RTAB-Map [30] and set it to use the RGB-D image and odometry data coming from the simulated agent in Gazebo sim. to generate the mapping and localization information.

5. Experimental Result

5.1. Learning Diagrams

By looking at our navigation results in Figure 11, Figure 12 and Figure 13 (untrained algorithm) and comparing them with Figure 14, Figure 15 and Figure 16 (trained algorithm), it is possible to see that our algorithm learns the shortest path for reaching the goal position while avoiding the obstacles. These figures also show the capability of our algorithm to control the UAV in the Z dimension. For example, the UAV is controlled to stay at a height of about 1.5 m in the pipe area to pass through it, whereas our algorithm increases and decreases the height of the UAV several times in the area with several trees. Further, to pass the nets (while they are moving up and down), the UAV should be capable of optimal control over the Z dimension. Finally, the data in Table 1 show the success rate of our trained algorithm, where the success rate is the rate of reaching the goal location when starting from the beginning of the environment.
Besides that, as our results in Figure 17 and Figure 18 show, our algorithm can learn to maximize the defined rewards and also avoid the obstacles. Further, it can learn to reach the final (goal) position, where the goal position is shown using a JPG image (a red H) at the end of the environment.

5.2. Trajectory of the UAV Controlled by SCAN

Figure 19 shows the trajectory of the UAV at the beginning of Area 1. Figure 20, Figure 21 and Figure 22 clearly show that the SCAN algorithm can learn to control the trajectory of the UAV to pass through Areas 1, 2, and 3 and maximize the defined rewards while avoiding obstacles. Further, Figure 23 and Figure 24 illustrate that the agent has learned to reach the final goal at the end of the challenging environment. Figure 25 shows the capability of the algorithm in controlling the UAV on the Z-axis, and Figure 26 shows that the algorithm is able to control the UAV inside the narrow and curved area of a pipe.

6. Discussion

The algorithm we describe in this work is a high-level controller; that is, it calculates the desired waypoints and sends them to the low-level controller of the UAV. As a result, we focus here on the issues related to the high-level controller and assume that if the drone can move from point (a) to point (b) in the simulator, the actual UAV will behave in the same way given the same command.
As mentioned earlier, our algorithm, as a high-level controller, uses the RGB and depth sensors. Further, we use these data to generate the collision signal and calculate the UAV's distance from the obstacle in front of it. While using this method slightly increases the computation time in each simulation frame, it makes online training possible on an actual UAV. If we used the simulator's physics engine to detect the obstacles and collisions, we would be unable to detect obstacles on a real drone and use them as new signals for onboard training after deploying the model trained in the simulator.
In this paper, considering that we focused on the capability of the algorithm in terms of being multi-objective and learning to avoid obstacles while being able to finish the environments and reach the goal area, we did not discuss the computational time of our algorithm in detail. However, our algorithm is a model-free deep RL algorithm, which is among the fastest and lightest approaches, considering that going from the RGB-D image to continuous action commands is just a forward pass through the DNN. Further, it does not use localization methods such as SLAM or methods based on point clouds, which are known to be computationally heavy. Finally, our method's computational load also depends on how many frames per second we need to generate new actions or waypoints, which in turn depends on the UAV application. In this research, we used Python and TensorFlow for creating and running our algorithm and generating the results, on a computer with an i9 CPU and an Nvidia 1080ti GPU. This setup takes about 40 ms to generate a new waypoint from the RGB-D image, which corresponds to approximately 25 frames per second. Nonetheless, at deployment time, the deep neural network weights should be extracted and loaded in C/C++ code, where the computation is much faster, which compensates for the lower computational power of an onboard system (such as an Nvidia TX2 or Nvidia Xavier NX for regular-size drones) in a UAV.
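The per-frame cost quoted above can be measured by timing a single forward pass; the snippet below uses a small stand-in network, so it only illustrates the measurement and does not reproduce the reported 40 ms figure, which depends on the actual SCAN network and hardware.

```python
import time
import numpy as np
import tensorflow as tf

# Placeholder policy network: 64x64x1 depth frame in, three action means out.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(3),
])
frame = np.random.rand(1, 64, 64, 1).astype(np.float32)
model(frame)                                     # warm-up pass

start = time.perf_counter()
model(frame)                                     # timed forward pass
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"forward pass: {elapsed_ms:.1f} ms")
```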
Compared to many famous algorithms for planning or obstacle avoidance, where the RGB-D image should be passed through several steps to be converted to useful data such as trees and graphs, our algorithm is an end-to-end algorithm that receives the raw RGB-D image and directly generates the optimal actions or next waypoint. Further, as mentioned earlier, it learns to generate the optimal action by itself; in other words, our algorithm is an autonomous learner agent.
Finally, regarding the dynamic obstacles, they are moving at a constant speed. For every three moves of the UAV, dynamic obstacles move one step. Every time the UAV hits an obstacle or reaches the goal position, obstacles will be re-positioned in a random position along the direction of movement.

6.1. UAV Trajectory in the Unknown Environment

The second, unknown environment consists of many rooms and trees and comprises two floors, where the only way to move to the second floor is through the staircase at the end of the first floor (Figure 27 and Figure 28). Considering the trajectory illustrated in Figure 29, Figure 30, Figure 31, Figure 32 and Figure 33 for the unknown environment, the UAV visits one of the rooms on the first floor, returns to the starting location, continues through the corridors to the staircase, climbs the staircase, moves through the corridors, passes through the trees, passes an obstacle similar to the nets, and flies to the target location. To explain the behavior of the UAV in the unknown environment, we believe the model trained in the static and dynamic challenging environments learned a couple of things. It learned to pass through the window and the pipe, and this probably best explains why the UAV prefers to move through the corridors, which produce a similar depth image. Further, it learned to ascend and descend when it wanted to pass the static and dynamic nets, which explains why it is easy for it to climb the staircase. One question could be why the UAV chooses to go to the first room on the first floor instead of choosing the corridor; we believe this is due to the Gaussian noise we use to generate the final actions, which gives a degree of randomness to the controller-generated actions.

6.2. Simulation to Real World Gap

The main issue that could present difficulties for implementing this algorithm (the model trained in the simulator) on an actual UAV (quad-copter) is the noise of a real RGB-D camera.
In order to ameliorate this issue, it is possible to: (1) Create an artificial noise in the simulated camera in Gazebo to simulate it as much as possible similar to an actual RGB-D sensor; and (2) Train a Variational Auto-encoder (VAE) network that converts the simulated data to real data before being used for the training of the SCAN algorithm.
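As an illustration of option (1), the clean simulated depth frames could be corrupted before training roughly as follows; the Gaussian noise level and dropout probability are arbitrary placeholders, not a calibrated sensor model.

```python
import numpy as np

def add_depth_noise(depth, sigma=0.02, dropout_prob=0.01, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    noisy = depth + rng.normal(0.0, sigma, size=depth.shape)     # additive Gaussian noise
    holes = rng.random(depth.shape) < dropout_prob               # random missing returns
    noisy[holes] = 0.0                                           # zero depth = no reading
    return np.clip(noisy, 0.0, depth.max())
```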

7. Conclusions

In this paper, we introduced a self-trained controller for autonomous navigation (SCAN) in static and dynamic (with moving walls and nets) challenging environments (including trees, nets, windows, and a pipe) using deep reinforcement learning trained by depth and RGB images. We trained our algorithm in the Gazebo simulator in dynamic and static environments, and by illustrating our results, we showed that our algorithm is capable of learning an optimal policy for high-level control of a UAV quadcopter. In this work, we trained our UAV using several rewards, including the Obstacle Avoidance reward, Motivation reward, Reaching Goal reward, and Area reward, in a multi-objective fashion with two primary purposes: (1) avoiding the obstacles (static and dynamic); (2) learning to reach the desired goal position, that is, where the red sign is shown in the environment. Further, we used an onboard method to calculate the distance to obstacles and predict collisions without using the simulator physics engine.
In this work, we designed and trained our algorithm for controlling the 3 DoF to have a practical algorithm for autonomous navigation of UAVs. Further, we defined our algorithm to use ROS for the communication, which gives the algorithm the advantage of mobility; that is, it can be moved to an actual drone with minimum changes and effort.
Our results show that deep RL is capable of learning to control a UAV and navigate through challenging environments to reach a goal that is shown as a photo rather than a single point. Further, it is possible to train the deep RL algorithm to simultaneously avoid obstacles in a multi-objective approach. While other algorithms that are not based on machine learning methods are also able to achieve the same goals, many of them are based on localization methods, such as SLAM or point clouds, which are computationally heavier than deep RL.
In future work, it is possible to develop the algorithm further to control multiple agents at the same time, while the agents maintain a reasonable formation while following the leader.

Author Contributions

Writing—original draft, A.R.D. and D.-J.L.; Writing—review and editing, A.R.D. and D.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Unmanned Vehicles Core Technology Research and Development Program through the National Research Foundation of Korea (NRF), Unmanned Vehicle Advanced Research Center (UVARC) funded by the Ministry of Science and ICT, the Republic of Korea (2020M3C1C1A01082375) (2020M3C1C1A02084772), and also was supported by DNA+Drone Technology Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (No. NRF-2020M3C1C2A01080819), and also was supported by the Technology Innovation Program (or Industrial Strategic Technology Development Program-Development and demonstration of automatic driving collaboration platform to overcome the dangerous environment of databased commercial special vehicles) (20018904, Development and field test of mobile edge for safety of special purpose vehicles in hazardous environment) funded By the Ministry of Trade, Industry & Energy (MOTIE, Korea).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DRL: Deep Reinforcement Learning
ANN: Artificial Neural Network
SLAM: Simultaneous Localization And Mapping
ROS: Robot Operating System
SCAN: Self-trained Controller for Autonomous Navigation

References

  1. Floreano, D.; Mattiussi, C. Bio-Inspired Artificial Intelligence: Theories, Methods, and Technologies; The MIT Press: Cambridge, MA, USA, 2008. [Google Scholar]
  2. Liu, X.; Li, Y.; Zhang, J.; Zheng, J.; Yang, C. Self-Adaptive Dynamic Obstacle Avoidance and Path Planning for USV Under Complex Maritime Environment. IEEE Access 2019, 7, 114945–114954. [Google Scholar] [CrossRef]
  3. Niu, H.; Lu, Y.; Savvaris, A.; Tsourdos, A. An energy-efficient path planning algorithm for unmanned surface vehicles. Ocean Eng. 2018, 161, 308–321. [Google Scholar] [CrossRef] [Green Version]
  4. Singh, Y.; Sharma, S.K.; Sutton, R.; Hatton, D.; Khan, A. A constrained A* approach towards optimal path planning for an unmanned surface vehicle in a maritime environment containing dynamic obstacles and ocean currents. Ocean Eng. 2018, 169, 187–201. [Google Scholar] [CrossRef] [Green Version]
  5. Mina, T.; Singh, Y.; Min, B.C. Maneuvering Ability-Based Weighted Potential Field Framework for Multi-USV Navigation, Guidance, and Control. Mar. Technol. Soc. J. 2020, 54, 40–58. [Google Scholar] [CrossRef]
  6. Singh, Y.; Bibuli, M.; Zereik, E.; Sharma, S.; Khan, A.; Sutton, R. A Novel Double Layered Hybrid Multi-Robot Framework for Guidance and Navigation of Unmanned Surface Vehicles in a Practical Maritime Environment. J. Mar. Sci. Eng. 2020, 8, 624. [Google Scholar] [CrossRef]
  7. Ng, A.Y.; Coates, A.; Diel, M.; Ganapathi, V.; Schulte, J.; Tse, B.; Berger, E.; Liang, E. Autonomous Inverted Helicopter Flight via Reinforcement Learning. In Proceedings of the Experimental Robotics IX, Singapore, 18–21 June 2006; Ang, M.H., Khatib, O., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 363–372. [Google Scholar]
  8. Abbeel, P.; Coates, A.; Quigley, M.; Ng, A.Y. An Application of Reinforcement Learning to Aerobatic Helicopter Flight. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, Vancouver, BC, Canada, 4–7 December 2006; MIT Press: Cambridge, MA, USA, 2006; pp. 1–8. [Google Scholar]
  9. Hwangbo, J.; Sa, I.; Siegwart, R.; Hutter, M. Control of a Quadrotor With Reinforcement Learning. IEEE Robot. Autom. Lett. 2017, 2, 2096–2103. [Google Scholar] [CrossRef] [Green Version]
  10. Molchanov, A.; Chen, T.; Hönig, W.; Preiss, J.A.; Ayanian, N.; Sukhatme, G.S. Sim-to-(Multi)-Real: Transfer of Low-Level Robust Control Policies to Multiple Quadrotors. arXiv 2019, arXiv:1903.04628. [Google Scholar]
  11. Arasanipalai, R.; Agrawal, A.; Ghose, D. Mid-flight Propeller Failure Detection and Control of Propeller-deficient Quadcopter using Reinforcement Learning. arXiv 2020, arXiv:2002.11564. [Google Scholar]
  12. Sadeghi, F.; Levine, S. (CAD)2RL: Real Single-Image Flight without a Single Real Image. arXiv 2016, arXiv:1611.04201. [Google Scholar]
  13. Zhang, T.; Kahn, G.; Levine, S.; Abbeel, P. Learning Deep Control Policies for Autonomous Aerial Vehicles with MPC-Guided Policy Search. arXiv 2015, arXiv:1509.06791. [Google Scholar]
  14. Liu, J.; Qi, W.; Lu, X. Multi-step reinforcement learning algorithm of mobile robot path planning based on virtual potential field. In Proceedings of the International Conference of Pioneering Computer Scientists, Engineers and Educators, Taiyuan, China, 17–20 September 2021; Springer: Berlin/Heidelberg, Germany, 2017; pp. 528–538. [Google Scholar]
  15. Wang, W.; Luo, X.; Li, Y.; Xie, S. Unmanned surface vessel obstacle avoidance with prior knowledge-based reward shaping. Concurr. Comput. Pract. Exp. 2021, 33, e6110. [Google Scholar] [CrossRef]
  16. Long, P.; Liu, W.; Pan, J. Deep-learned collision avoidance policy for distributed multiagent navigation. IEEE Robot. Autom. Lett. 2017, 2, 656–663. [Google Scholar] [CrossRef] [Green Version]
  17. Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng. 2019, 183, 155–166. [Google Scholar] [CrossRef]
  18. Zhou, X.; Wu, P.; Zhang, H.; Guo, W.; Liu, Y. Learn to navigate: Cooperative path planning for unmanned surface vehicles using deep reinforcement learning. IEEE Access 2019, 7, 165262–165278. [Google Scholar] [CrossRef]
  19. Liu, Y.; Peng, Y.; Wang, M.; Xie, J.; Zhou, R. Multi-usv system cooperative underwater target search based on reinforcement learning and probability map. Math. Probl. Eng. 2020, 2020, 7842768. [Google Scholar] [CrossRef]
  20. Doukhi, O.; Lee, D.J. Deep Reinforcement Learning for End-to-End Local Motion Planning of Autonomous Aerial Robots in Unknown Outdoor Environments: Real-Time Flight Experiments. Sensors 2021, 21, 2534. [Google Scholar] [CrossRef] [PubMed]
  21. Barzegar, A.; Lee, D.J. Deep Reinforcement Learning-Based Adaptive Controller for Trajectory Tracking and Altitude Control of an Aerial Robot. Appl. Sci. 2022, 12, 4764. [Google Scholar] [CrossRef]
  22. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal Policy Optimization Algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
  23. Schulman, J.; Levine, S.; Moritz, P.; Jordan, M.I.; Abbeel, P. Trust Region Policy Optimization. arXiv 2015, arXiv:1502.05477. [Google Scholar]
  24. Kakade, S.; Langford, J. Approximately Optimal Approximate Reinforcement Learning. In Proceedings of the Nineteenth International Conference on Machine Learning, ICML’02, Sydney, Australia, 8–12 July 2002; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2002; pp. 267–274. [Google Scholar]
  25. Mnih, V.; Badia, A.P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; Kavukcuoglu, K. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Balcan, M.F., Weinberger, K.Q., Eds.; PMLR: New York, NY, USA, 2016; Volume 48, pp. 1928–1937. [Google Scholar]
  26. Dooraki, A.R.; Lee, D. Multi-rotor Robot Learning to Fly in a Bio-inspired Way Using Reinforcement Learning. In Proceedings of the 2019 16th International Conference on Ubiquitous Robots (UR), Jeju, Korea, 24–27 June 2019; pp. 118–123. [Google Scholar] [CrossRef]
  27. Dooraki, A.R.; Hooshyar, D.; Yousefi, M. Innovative algorithm for easing VIP’s navigation by avoiding obstacles and finding safe routes. In Proceedings of the 2015 International Conference on Science in Information Technology (ICSITech), Yogyakarta, Indonesia, 27–28 October 2015; pp. 357–362. [Google Scholar] [CrossRef]
  28. Gazebo Simulator. Available online: http://gazebosim.org/ (accessed on 16 March 2017).
  29. Furrer, F.; Burri, M.; Achtelik, M.; Siegwart, R. Chapter RotorS—A Modular Gazebo MAV Simulator Framework. In Robot Operating System (ROS): The Complete Reference (Volume 1); Springer International Publishing: Cham, Switzerland, 2016; pp. 595–625. [Google Scholar]
  30. Labbé, M.; Michaud, F. RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. J. Field Robot. 2019, 36, 416–446. [Google Scholar] [CrossRef]
Figure 2. This diagram shows neural networks used for the policy and value network of the SCAN algorithm.
Figure 3. The flow of information in our proposed algorithm. A three-dimensional continuous action, including gas, rotation around the Z-axis, and elevation along the Z-axis, is sent to the environment for controlling the UAV. The result is the new state (RGB-D image) and the collection of rewards calculated for our multi-objective agent.
Figure 4. The simulated Ascending Technology Pelican quad-copter carrying a depth sensor in Gazebo sim. controlled by our proposed algorithm in our challenging environment explained in Section 4 of this paper.
Figure 5. Training environment (3D view).
Figure 6. Training environment (top view).
Figure 7. Dynamic training environment (top view).
Figure 8. Dynamic training environment (3D view).
Figure 9. This diagram shows the top view of the environment and the sections of the environment used for generating reward and resetting the UAV.
Figure 10. This diagram shows the top view of the dynamic environment and the sections of the environment.
Figure 11. This diagram illustrates the navigation of UAV using untrained SCAN in the challenging environment. The top left image shows the top-view navigation trajectory of UAV projected on the top-view of the environment map.
Figure 12. This diagram illustrates the navigation of UAV using untrained SCAN in the challenging environment (XY plane).
Figure 13. These diagrams illustrate the navigation of the UAV using untrained SCAN in the challenging environment. The left image shows the trajectory of UAV navigation in the YZ plane. The right image shows the trajectory of UAV navigation in the XZ plane.
Figure 14. This diagram illustrates the navigation of UAV using trained SCAN in the challenging environment. The top left image shows the top-view navigation trajectory of UAV projected on the top view of the environment map.
Figure 15. This diagram illustrates the navigation of the UAV using trained SCAN in the challenging environment (XY plane).
Figure 16. These diagrams illustrate the navigation of the UAV using trained SCAN in the challenging environment. The left image shows the trajectory of UAV navigation in the YZ plane. The right image shows the trajectory of UAV navigation in the XZ plane.
Figure 17. This figure shows that our algorithm is able to learn to maximize several rewards (multi-objective) in the first 250 episodes.
Figure 18. This figure shows the number of collisions and the number of times the agent reached the goal position, where the agent decreases the collisions and increases the goal reaches over time.
Figure 19. Using SLAM for visualizing the trajectory of UAV controlled by the SCAN algorithm (Beginning of Area 1). In this paper, SLAM is used only for visualization purposes.
Figure 20. Visualization of the trajectory of UAV controlled by the SCAN algorithm (Area 1). The left image is 3D-SLAM and the right image is 2D SLAM.
Figure 21. Visualization of the trajectory of UAV controlled by the SCAN algorithm (Area 1 and Area 3).
Figure 22. Visualization of the trajectory of UAV controlled by the SCAN algorithm (Area 1, Area 2, and Area 3).
Figure 23. Visualization of the trajectory of UAV controlled by the SCAN algorithm (Area 1, Area 2, and Area 3) in rviz, 3D view.
Figure 24. Visualization of the trajectory of UAV controlled by the SCAN algorithm (Area 1, Area 2, and Area 3) in rviz, top view.
Figure 25. Visualization of the trajectory of UAV controlled by the SCAN algorithm (Area 1, Area 2, and Area 3), side-view.
Figure 26. Visualization of the trajectory of UAV controlled by the SCAN algorithm (Area 1, Area 2, and Area 3), side view.
Figure 27. Unknown environment (top view of first floor). The green H is the starting location, similar to the previous environments.
Figure 28. Unknown environment (top view of second floor). The red H is the destination location, similar to the previous environments.
Figure 29. This figure illustrates the trajectory of the UAV from the starting location to the target location in the unknown environment from the starting location side (3D view).
Figure 30. This figure illustrates the trajectory of the UAV from the starting location to the target location in the unknown environment from the staircase side (3D view).
Figure 31. This figure illustrates the trajectory of the UAV from the starting location to the target location in the unknown environment (top view).
Figure 32. This figure illustrates the trajectory of the UAV from the starting location to the target location in the unknown environment (side view).
Figure 33. This figure illustrates parts of the trajectory of the UAV related to the start and end of the unknown environment.
Table 1. Success and Collision Rates.

Environment    Success Rate
Static         98%
Dynamic        82%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

