Article

Robust Quadrotor Control through Reinforcement Learning with Disturbance Compensation

Department of Mechanical Engineering, National Yang Ming Chiao Tung University, Hsinchu City 30010, Taiwan
*
Author to whom correspondence should be addressed.
Submission received: 9 March 2021 / Revised: 31 March 2021 / Accepted: 3 April 2021 / Published: 5 April 2021
(This article belongs to the Special Issue Unmanned Aerial Vehicles)

Abstract

In this paper, a novel control strategy is presented for reinforcement learning with disturbance compensation to solve the problem of quadrotor positioning under external disturbance. The proposed control scheme applies a trained neural-network-based reinforcement learning agent to control the quadrotor, and its output is directly mapped to four actuators in an end-to-end manner. The proposed control scheme constructs a disturbance observer to estimate the external forces exerted on the three axes of the quadrotor, such as wind gusts in an outdoor environment. By introducing an interference compensator into the neural network control agent, the tracking accuracy and robustness were significantly increased in indoor and outdoor experiments. The experimental results indicate that the proposed control strategy is highly robust to external disturbances. In the experiments, compensation improved control accuracy and reduced positioning error by 75%. To the best of our knowledge, this study is the first to achieve quadrotor positioning control through low-level reinforcement learning by using a global positioning system in an outdoor environment.

1. Introduction

A quadrotor is an underactuated, nonlinear coupled system. Because quadrotors have various applications, researchers have long focused on the problems of attitude stabilization and trajectory tracking in quadrotors. Many control methods are used for quadrotors. Proportional–integral–derivative (PID) controllers are widely used in consumer quadrotor products and are often treated as baseline controllers for comparison with other controllers [1]. In practice, tuning the PID controller's gains often requires expertise, and the gains are selected intuitively by trial and error. Advanced model-based control strategies, such as feedback linearization [2], model predictive control (MPC) [3,4], robust control [5], sliding mode control (SMC) [6,7,8], and adaptive control [9,10], have been applied to optimize the flight performance of quadrotors. However, the performance and robustness of the aforementioned strategies are highly dependent on the accuracy of the manually developed dynamic model.
During outdoor flight, quadrotors are susceptible to wind gusts, which affect flight performance or even lead to system instability [3]. Although quadrotors are sensitive to external disturbances [11], designers of most controllers have not accounted for this problem. Some active disturbance rejection methods have been proposed to estimate disturbances, and these methods perform well in cases of a sustained disturbance. Chovancova et al. [12] designed proportional–derivative (PD), linear quadratic regulator (LQR), and backstepping controllers for position tracking and compared their performance in a simulation. A disturbance observer with a position estimator was designed to improve controller positioning performance, which was evaluated when an external disturbance was applied in simulations. The active disturbance rejection control (ADRC) algorithm treats the total disturbance as a new state variable and estimates it through an extended state observer (ESO). Moreover, the ADRC algorithm does not require the exact mathematical model of the overall system to be known. Therefore, this algorithm has become an attractive technique for the flight control of quadrotor unmanned aerial vehicles (UAVs) [13,14]. Yang et al. proposed the use of ADRC and PD control in a dual closed-loop control framework [15], in which an ESO was used to estimate wind gust perturbations as dynamic disturbances in the inner control loop. A quadrotor flight controller with a sliding mode disturbance observer (SMC-SMDO) was used in [16]; the SMC-SMDO is robust to external disturbances and model uncertainties without the use of a high control gain. Chen et al. [17] constructed a nonlinear disturbance observer that considers external disturbances from wind and model uncertainties separately from the controller and compensates for the negative effects of the disturbances. In [18], a nonlinear observer based on an unscented Kalman filter was developed for estimating the external force and torque; this estimator reacted to a wide variety of disturbances in the experiment conducted in [18].
Reinforcement learning (RL) has solved many complicated quadrotor control problems in many studies. RL outperforms other optimization approaches and does not require a predefined controller structure, which would limit the performance of an agent. In [19], a quadrotor with a deep neural network (DNN)-based controller was proposed for following trails in an unstructured outdoor environment. In [20], RL and MPC were used to enable a quadrotor to navigate unknown environments; MPC handles vehicle control, whereas RL is used to guide the quadrotor through complex environments. In addition to high-level planning and navigation problems, RL has been used for achieving robust attitude and position control [21,22]. The control policy generated by RL training of a neural network achieves low-level stabilization and position control, and the policy can control the quadrotor directly from the quadrotor state inputs to the four motor outputs. The aforementioned studies have implemented their proposed control strategies in simulations and real environments. Although quadrotors with RL controllers exhibit stability under disturbance, the control policy cannot eliminate the steady-state error caused by wind or modeling error, and the performance of the controller can therefore be improved. In [23], an integral compensator was used to enhance tracking accuracy. The effect of this compensator on the tracking accuracy of the controller was verified by introducing a constant horizontal wind flowing parallel to the ground in a simulation. Although the aforementioned integral compensator can eliminate the steady-state error, it slows the controller response and causes a large overshoot.
This paper presents a unique disturbance compensation RL (DCRL) framework that includes a disturbance compensator and an RL controller. The external disturbance observer in this framework is based on the work of [24]. The rest of this paper is organized as follows. Section 2 introduces the dynamic model of a quadrotor and the basics of RL. Section 3 describes the proposed DCRL control strategy. Section 4 describes the training and implementation of the proposed DCRL strategy in a quadrotor experiment in indoor and outdoor environments. Finally, Section 5 concludes the paper.

2. Preliminary Information

This section briefly introduces the dynamic model of a quadrotor, the basics of RL and the use of RL in solving the quadrotor control problem.

2.1. Quadrotor Dynamic Model

In this paper, we assume that a quadrotor is a rigid and symmetrical body whose center of gravity coincides with its geometric center.
The vector $\mathbf{x} = [x, y, z]^T$ denotes the position of the quadrotor in an inertial frame. The translation dynamics of the quadrotor can be expressed as follows:
$$\dot{\mathbf{x}} = \mathbf{v}, \qquad \dot{\mathbf{v}} = m^{-1}\left(\sum_{i=1}^{4} T_i\,\mathbf{b}_z + \mathbf{F}_{ext}\right) - g\,\mathbf{i}_z,$$
where $m$ and $g$ are the mass of the quadrotor and the acceleration due to gravity, respectively; $\mathbf{R} = [\mathbf{b}_x\ \mathbf{b}_y\ \mathbf{b}_z] \in SO(3)$ is the rotation matrix, which is used to transform a coordinate from the body-fixed reference frame to the inertial reference frame; and $T_i$ is the thrust generated by motor $i$ and applied along the body-frame z-axis $\mathbf{b}_z$. Figure 1 displays the placement order of the motors. Finally, the vector $\mathbf{F}_{ext}$ represents the external disturbance force, accounting for all other forces acting on the quadrotor.
For the rotational dynamics of the system, we use a quaternion representation of the quadrotor attitude to avoid gimbal lock and ensure better computational efficiency:
$$\dot{q} = 0.5 \cdot q \otimes [0,\ \boldsymbol{\Omega}]^T, \qquad \dot{\boldsymbol{\Omega}} = \mathbf{J}^{-1}\left(\boldsymbol{\mu} - \boldsymbol{\Omega} \times \mathbf{J}\boldsymbol{\Omega}\right),$$
where $q = [q_w, q_x, q_y, q_z]^T$ is the normalized quaternion attitude vector, $\otimes$ is the quaternion multiplication, $\boldsymbol{\Omega}$ is the body-frame angular velocity, $\boldsymbol{\mu}$ is the control moment vector, and $\mathbf{J}$ is the moment of inertia tensor of the vehicle. The rotation transformation from the quaternion $q$ to the rotation matrix $\mathbf{R}$ can be expressed as follows:
$$\mathbf{R} = \begin{bmatrix} 1 - 2(q_y^2 + q_z^2) & 2(q_x q_y - q_z q_w) & 2(q_x q_z + q_y q_w) \\ 2(q_x q_y + q_z q_w) & 1 - 2(q_x^2 + q_z^2) & 2(q_y q_z - q_x q_w) \\ 2(q_x q_z - q_y q_w) & 2(q_y q_z + q_x q_w) & 1 - 2(q_x^2 + q_y^2) \end{bmatrix}$$
Each propeller's thrust axis is assumed to be perfectly aligned with the body-frame z-axis. The force $T_i$ and the motor moment vector $\boldsymbol{\mu}$ produced at motor spinning speeds $\omega_i$ can be expressed as follows:
$$T_i = c_f\,\omega_i^2, \qquad \boldsymbol{\mu} = \frac{1}{2}\begin{bmatrix} l c_f & -l c_f & -l c_f & l c_f \\ -l c_f & -l c_f & l c_f & l c_f \\ c_d & -c_d & c_d & -c_d \end{bmatrix}\begin{bmatrix} \omega_1^2 \\ \omega_2^2 \\ \omega_3^2 \\ \omega_4^2 \end{bmatrix},$$
where $i = 1, 2, 3, 4$; the signs correspond to the motor placement in Figure 1; $\omega_i$ is the spinning speed of the $i$-th motor; $l$ is the arm length of the quadrotor; and $c_f$, $c_d$ are the coefficients of the generated force and z-axis moment, respectively.
The developed dynamic model is based on the following assumptions: (a) the quadrotor structure is rigid, (b) the center of mass of the quadrotor and the rotor thrusts are in the same plane, and (c) blade flapping and aerodynamics can be ignored.
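To make the dynamic model concrete, the following is a minimal Python sketch of one integration step of the translational and rotational dynamics above, in the spirit of the Python simulator used for training (Section 4.1). The mass and inertia values follow Table 1, whereas the time step, explicit Euler integration, and function names are illustrative assumptions rather than the authors' implementation.

import numpy as np

# Illustrative setup (mass and inertia follow Table 1; everything else is assumed).
m, g = 0.665, 9.81                      # mass [kg], gravitational acceleration [m/s^2]
J = np.diag([0.0023, 0.0025, 0.0037])   # moment of inertia tensor [kg*m^2]
i_z = np.array([0.0, 0.0, 1.0])         # inertial-frame z-axis

def quat_to_rot(q):
    # Rotation matrix R (body frame -> inertial frame) from a unit quaternion [qw, qx, qy, qz].
    qw, qx, qy, qz = q
    return np.array([
        [1 - 2*(qy**2 + qz**2), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx**2 + qz**2), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx**2 + qy**2)]])

def quat_mul(p, q):
    # Quaternion product p (x) q.
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([pw*qw - px*qx - py*qy - pz*qz,
                     pw*qx + px*qw + py*qz - pz*qy,
                     pw*qy - px*qz + py*qw + pz*qx,
                     pw*qz + px*qy - py*qx + pz*qw])

def dynamics_step(x, v, q, omega, thrusts, mu, F_ext, dt=0.01):
    # One explicit-Euler step of the translational and rotational dynamics.
    R = quat_to_rot(q)
    b_z = R[:, 2]                                         # body z-axis expressed in the inertial frame
    v_dot = (np.sum(thrusts) * b_z + F_ext) / m - g * i_z
    q_dot = 0.5 * quat_mul(q, np.concatenate(([0.0], omega)))
    omega_dot = np.linalg.solve(J, mu - np.cross(omega, J @ omega))
    x_new, v_new = x + v * dt, v + v_dot * dt
    q_new = q + q_dot * dt
    q_new /= np.linalg.norm(q_new)                        # keep the quaternion normalized
    omega_new = omega + omega_dot * dt
    return x_new, v_new, q_new, omega_new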

2.2. Reinforcement Learning

The standard RL framework comprises a learning agent interacting with the environment according to a Markov decision process (MDP). A state transition has the Markov property if the probability of this transition is independent of its history. The MDP involves solving decision problems with the Markov property, and RL theories are based on the MDP. The standard MDP is defined by the tuple $(S, A, P_{ss'}^{a}, r, \rho_0, \gamma)$, where $S$ is the state space, $A$ is the action space, $P_{ss'}^{a}: S \times A \times S \rightarrow \mathbb{R}^{+}$ is the transition probability density of the environment, $r: S \times A \rightarrow \mathbb{R}$ is the reward function, $\rho_0: S \rightarrow \mathbb{R}^{+}$ is the distribution of the initial state $s_0$, and $\gamma \in (0, 1)$ is the discount factor. In modern deep RL conducted using neural networks, the agent selects an action at each time step according to the policy $\pi(a|s; \theta) = \Pr(a|s; \theta)$, where $\theta \in \mathbb{R}^{N_\theta}$ is the weight of the neural network.
The goal of the MDP is to find a policy $\pi(a \in A \mid s)$ that can maximize the cumulative discounted reward
$$\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t).$$
A state-dependent value function $V^{\pi}$ that measures the expected discounted reward with respect to $\pi$ can be defined as follows:
$$V^{\pi}(s) = \mathbb{E}\!\left[\left.\sum_{l \geq 0} \gamma^{l}\, r_{t+l}\ \right|\ s_t = s,\ \pi\right].$$
The state–action-dependent value function can be defined as follows:
$$Q^{\pi}(s, a) = \mathbb{E}\!\left[\left.\sum_{l \geq 0} \gamma^{l}\, r_{t+l}\ \right|\ s_t = s,\ a_t = a,\ \pi\right].$$
The advantage function can be defined as follows:
$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s),$$
where $A^{\pi}$ is the difference between the expected return of selecting a specific action $a$ and the expected return of following policy $\pi$. The advantage function can be used to determine whether the selected action is suitable with respect to policy $\pi$. Many basic RL algorithms, such as the policy gradient method [25], off-policy actor–critic algorithm [26], and trust region policy optimization [27], can be used to optimize a policy. To maximize the expected reward function $V^{\pi}(s)$, the neural-network-parameterized policy $\pi = \pi(a|s; \theta)$ is adjusted as follows:
$$\theta \leftarrow \theta + \alpha \sum_{s \in S,\, a \in A} \rho^{\pi}(s)\, Q^{\pi}(s, a)\, \frac{\partial \pi(a|s)}{\partial \theta},$$
where
$$\rho^{\pi}(s) = \sum_{t \geq 0} \gamma^{t}\, \Pr\!\left(s_t = s \mid s_0,\ a \sim \pi\right)$$
is the state occurrence probability and $\alpha > 0$ is the learning step size. Equation (9) is an expression for the policy gradient [28]. By using the state distribution $\rho$ and the state–action value function $Q$, a policy can be improved without any environmental information. The state distribution $\rho^{\pi}(s)$ depends on the policy $\pi$, which indicates that it must be re-estimated whenever the policy is changed. In [26], the policy gradient was analyzed by replacing the original policy $\pi$ with another policy $\mu$; therefore, (9) takes the following form in [26]:
$$\sum_{s \in S,\, a \in A} \rho^{\mu}(s)\, Q^{\pi}(s, a)\, \frac{\partial \pi(a|s)}{\partial \theta}.$$
Equation (11) can still maximize V π with distinct policy gradient strategies.
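As a concrete illustration of the policy gradient update in (9), the following Python sketch performs a Monte Carlo (score-function) estimate for a hypothetical one-dimensional linear-Gaussian policy. The policy form, learning rate, and discount factor are illustrative assumptions and are not the networks or algorithm used later in this paper.

import numpy as np

def score_function_grad(a, s, theta, sigma=0.1):
    # Gradient of log pi(a|s) w.r.t. theta for a linear-Gaussian policy a ~ N(theta^T s, sigma^2).
    return (a - theta @ s) / sigma**2 * s

def policy_gradient_update(theta, trajectories, alpha=1e-3, gamma=0.99):
    # Sampled form of the policy gradient: weight each score-function gradient by the
    # discounted return from that time step, which serves as a Monte Carlo estimate of Q.
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        ret = 0.0
        returns = np.zeros(len(rewards))
        for t in reversed(range(len(rewards))):
            ret = rewards[t] + gamma * ret
            returns[t] = ret
        for s, a, q_est in zip(states, actions, returns):
            grad += q_est * score_function_grad(a, s, theta)
    return theta + alpha * grad / len(trajectories)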
To solve the nonlinear dynamic control problem for a quadrotor, we used the proximal policy optimization algorithm (PPO) [29] and off-policy training method [22] to train the actor and critic functions. The following inequalities are valid for the actor function:
$$V^{\pi}(s) - V^{\mu}(s) \geq 0, \quad \forall (s, a) \in S \times A,$$
if
$$\left[\pi(a|s) - \mu(a|s)\right] A^{\mu}(s, a) > 0$$
or
$$\left[\pi(a|s) - \mu(a|s)\right] A^{\pi}(s, a) > 0.$$
Therefore, for a policy search iteration, (13) provides improvement criteria for the action policy under a certain state.

3. Disturbance Observation and Control Strategy

Figure 2 depicts a block diagram of the quadrotor control with the DCRL framework. The proposed DCRL strategy enhances the RL control policy with an external disturbance observer and compensator to strengthen the system robustness. The compensation algorithm estimates the external forces and adjusts the input command for the RL controller, and the RL controller then changes the motor thrusts accordingly. In Figure 2, the observer takes the attitude $q_f$ and acceleration $a_f$ as inputs and outputs the estimated external disturbance; $F_{ext}$ and $\hat{F}_{ext}$ denote the external disturbance and its estimate produced by the observer, respectively. The disturbance compensator calculates $q_{comp}$ from the quadrotor attitude $q_f$ and $\hat{F}_{ext}$. The original RL actor was trained to hover at the origin: it receives the state $s$, which contains the position, velocity, attitude, and angular velocity of the quadrotor, and outputs four motor thrusts according to the policy $\alpha(s)$. In the DCRL, to make the quadrotor hover at the reference position $x_{ref}$, the origin is shifted by the reference command, and the resulting position deviation $x_{dev}$ is fed to the RL actor as its position input. The DCRL generates the thrust command as the sum of the RL actor output and the compensation force.
The RL controller in the DCRL was trained to recover and hover at the origin in an ideal simulation environment; external disturbances were not considered during RL control policy training. Several reasons exist for not adding external disturbances in RL training. First, the RL controller sometimes outperforms traditional methods if the simulation environment is highly similar to the real-world controlled plant; however, such a model has numerous uncertainties and is highly difficult to reproduce in a simulator. In this study, sensor noise was one of the uncertain factors because each inertial measurement unit (IMU) sensor on the flight computer has different physical characteristics. Second, sensor noise does not follow the assumptions of the MDP in RL theory; thus, the final performance of the trained policy cannot be guaranteed to be suitable. Finally, the aforementioned traditional external disturbance estimation methods have been demonstrated to be effective. Therefore, in this study, we focused on eliminating the observed disturbances by combining an RL controller with a traditional observation method to achieve superior positioning performance.
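The following Python sketch outlines one DCRL control step corresponding to Figure 2. The callables estimate_disturbance, compensate_attitude, and rl_actor are hypothetical placeholders for the observer of Section 3.1, the compensator of Section 3.2, and the trained actor of Section 4.1; the state layout and the per-motor split of the compensation force are assumptions made for illustration.

import numpy as np

def dcrl_thrust_command(rl_actor, estimate_disturbance, compensate_attitude,
                        position, velocity, q, omega, accel_body, R, x_ref):
    # One DCRL step: estimate the external force, correct the attitude fed to the RL actor,
    # shift the position input by the reference command, and add the compensation thrust.
    # R is the rotation matrix corresponding to the attitude quaternion q.
    F_ext_hat = estimate_disturbance(q, accel_body)       # disturbance observer (Section 3.1)
    q_comp = compensate_attitude(q, F_ext_hat)            # compensation-frame attitude (Section 3.2)
    x_dev = position - x_ref                              # offset so the actor "hovers at the origin"
    s = np.concatenate([x_dev, velocity, q_comp, omega])  # 13-dimensional state fed to the actor
    b_z = R[:, 2]                                         # body z-axis in the inertial frame
    return rl_actor(s) + np.dot(F_ext_hat, b_z) / 4.0     # RL thrusts plus per-motor compensation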

3.1. Disturbance Observer

In general, by rearranging the terms in (1), the external force may be calculated directly from the acceleration information as follows:
$$\hat{\mathbf{F}}_{ext} = m\,\mathbf{R}\,\mathbf{a}_{b} - \sum_{i=1}^{4} T_{i}\,\mathbf{b}_{z} + m g\,\mathbf{i}_{z},$$
where the thrust forces are only applied on the z-axis of the quadrotor in the body frame and a b is the acceleration measured by the onboard IMU sensor. The parameter a b includes the gravitational acceleration.
A low-pass filter (LPF) with a cut-off frequency of 30 Hz is used to reduce the effects of noise caused by rotor-spinning vibrations or the IMU. The thrust and acceleration are transformed to the inertial reference frame prior to filtering. This pre-processing is performed because the external force in the inertial reference frame, $\hat{\mathbf{F}}_{ext}$, is assumed to be slow-changing relative to the LPF dynamics.
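A minimal Python sketch of the observer described above is given below, assuming that the accelerometer reading $\mathbf{a}_b$ is expressed in the body frame and that a simple first-order discrete low-pass filter is used. The class name, sample time, and filter discretization are illustrative assumptions.

import numpy as np

class DisturbanceObserver:
    # Estimates the inertial-frame external force from the IMU acceleration and the commanded
    # thrusts, then smooths the estimate with a first-order low-pass filter (cut-off ~30 Hz).

    def __init__(self, mass, cutoff_hz=30.0, dt=0.002, g=9.81):
        self.m, self.g = mass, g
        self.alpha = dt / (dt + 1.0 / (2.0 * np.pi * cutoff_hz))  # first-order LPF coefficient
        self.F_filt = np.zeros(3)

    def update(self, R, a_b, thrusts):
        i_z = np.array([0.0, 0.0, 1.0])
        b_z = R[:, 2]                                # body z-axis in the inertial frame
        # Raw estimate obtained by rearranging the translational dynamics, as in the equation above.
        F_raw = self.m * (R @ a_b) - np.sum(thrusts) * b_z + self.m * self.g * i_z
        self.F_filt += self.alpha * (F_raw - self.F_filt)
        return self.F_filt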

3.2. Disturbance Compensator

When an external disturbance $\mathbf{F}_{ext}$ is acting on the quadrotor (Figure 3), this disturbance generates a translational acceleration vector $\mathbf{a}_{ext}$. For disturbance compensation, a new compensation thrust vector $\mathbf{g}_{fc}$ is defined. This vector combines $\mathbf{a}_{ext}$ and the gravitational acceleration vector $g\,\mathbf{i}_z$, and its normalized direction can be expressed as follows:
$$\mathbf{g}_{fc} = \mathbf{a}_{ext} + g\,\mathbf{i}_z, \qquad \mathbf{i}_{zc} = \frac{\mathbf{g}_{fc}}{\left|\mathbf{g}_{fc}\right|},$$
which only considers the hovering situation without an acceleration command from a trajectory tracking reference. The normalized vector $\mathbf{i}_{zc}$ is then used to formulate a new coordinate frame (the force compensation frame) with a rotation matrix $\mathbf{R}_{ci}$ relative to the inertial frame.
In three-dimensional space, any rotation of a coordinate system about a fixed point is equivalent to a single rotation by an angle $\theta$ about a fixed axis (called the Euler axis) that passes through the fixed point.
To obtain the rotation matrix $\mathbf{R}_{ci}$ that rotates $\mathbf{i}_z$ onto $\mathbf{i}_{zc}$, the rotation is first represented as a quaternion by using the following equations:
$$\mathbf{v} = [0,\ \mathbf{i}_{zc}], \qquad \mathbf{q} = \left[\cos\!\left(\tfrac{\theta}{2}\right),\ \sin\!\left(\tfrac{\theta}{2}\right)\mathbf{u}\right], \qquad \mathbf{v}' = \mathbf{q}\mathbf{v}\mathbf{q}^{*} = \mathbf{q}\mathbf{v}\mathbf{q}^{-1} = [0,\ \mathbf{i}_z],$$
where $\mathbf{v}$ is the original quaternion vector, $\mathbf{u}$ is the unit vector of the rotation axis, $\mathbf{v}'$ is the rotated quaternion vector, $\mathbf{q}$ is the rotation quaternion between $\mathbf{v}$ and $\mathbf{v}'$, and $\theta$ is the rotation angle.
By substituting $\mathbf{q}_{ci}$ into (3), $\mathbf{R}_{ci}$ can be determined as follows:
$$\mathbf{v}' = \mathbf{q}_{ci}\,\mathbf{v}\,\mathbf{q}_{ci}^{*} = \mathbf{q}_{ci}\,\mathbf{v}\,\mathbf{q}_{ci}^{-1} = \mathbf{R}_{ci}\,\mathbf{v}.$$
The aforementioned equation is equivalent to Rodrigues' rotation formula. After obtaining the rotation matrix $\mathbf{R}_{ci}$, the quadrotor attitude in the force compensation frame, $\mathbf{R}_{cb}$, can be calculated using the following equation:
$$\mathbf{R}_{cb} = \mathbf{R}_{ci}\,\mathbf{R}_{ib}.$$
After the corrected coordinate frame for compensation is obtained, the thrust of each motor with compensation is
$$T_{ci} = \frac{\hat{\mathbf{F}}_{ext} \cdot \mathbf{b}_z}{4} + \alpha_i(s),$$
where $\alpha_i(s)$ is the $i$-th motor action output of the neural network, which is specified in the following section. By modifying the quadrotor attitude in the compensation frame $\mathbf{R}_{cb}$ and using this attitude as the input state of the RL controller, the controller generates the corresponding motor thrusts to maintain the target attitude and thereby eliminate the disturbance acting on the quadrotor.
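The following Python sketch shows how the compensation frame can be computed: the quaternion that rotates $\mathbf{i}_z$ onto $\mathbf{i}_{zc}$ is built from the Euler axis and angle, and the compensated per-motor thrust adds one quarter of the projected disturbance to the RL actor output. The function names and the small-angle guard are illustrative assumptions.

import numpy as np

def rotation_between(v_from, v_to):
    # Unit quaternion [qw, qx, qy, qz] that rotates unit vector v_from onto v_to (Euler-axis form).
    axis = np.cross(v_from, v_to)
    s = np.linalg.norm(axis)
    c = float(np.dot(v_from, v_to))
    if s < 1e-9:                                          # (anti)parallel vectors: no unique axis
        return np.array([1.0, 0.0, 0.0, 0.0]) if c > 0 else np.array([0.0, 1.0, 0.0, 0.0])
    theta = np.arctan2(s, c)
    u = axis / s
    return np.concatenate(([np.cos(theta / 2.0)], np.sin(theta / 2.0) * u))

def compensation_frame(F_ext_hat, mass, g=9.81):
    # Direction i_zc of the force compensation frame and the quaternion q_ci from i_z to i_zc.
    i_z = np.array([0.0, 0.0, 1.0])
    g_fc = F_ext_hat / mass + g * i_z                     # a_ext + g * i_z
    i_zc = g_fc / np.linalg.norm(g_fc)
    return i_zc, rotation_between(i_z, i_zc)

def compensated_thrusts(actor_thrusts, F_ext_hat, R):
    # Per-motor thrust with compensation: T_ci = (F_ext_hat . b_z) / 4 + alpha_i(s).
    b_z = R[:, 2]
    return actor_thrusts + np.dot(F_ext_hat, b_z) / 4.0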

4. Experiments

In this section, we introduce our training method for a low-level quadrotor control policy. The RL controller receives the information on the quadrotor state (position, velocity, attitude, and angular velocity) from sensors and directly outputs the control commands of four rotors. The training was first performed and tested in a simulator. The quadrotor simulator was established using Python according to the dynamic model described in Section 2.1 for training and verifying the flight performance.
After verifying that the RL control policy was trained successfully, we transferred the controller into our DCRL structure and performed real flights with the quadrotor. To implement the proposed DCRL control algorithm, the control software for the PixRacer flight controller hardware was developed and implemented using Simulink. The DCRL control strategy was first examined in an indoor environment by performing fixed-point hovering under an external wind disturbance; the quadrotor was then set to track a square trajectory in an outdoor experiment. In the indoor experiment, the position and velocity of the quadrotor were obtained using an OptiTrack motion capture system on the ground station computer, and these data were transmitted through Wi-Fi to the onboard PixRacer flight controller within a 10-m range. The physical parameters of the quadrotor platform are presented in Table 1. For outdoor trajectory tracking, position information was obtained only from an onboard global positioning system (GPS) sensor.

4.1. RL Controller Training

In the RL training process, we followed the dynamic equations in Section 2.1 and constructed a simulation environment in Python to generate training data. In this environment, the state space of the MDP comprised the position, velocity, attitude, and angular velocity of the quadrotor, and the four motor thrust outputs were chosen as the action space. The training process follows the work in [22] and uses two processes: one for data collection and one for updating the value and policy networks. The update is based on off-policy training. The main difference from on-policy training is that on-policy training uses the collected data only once and discards them after each neural network update. In contrast, in off-policy training, the collection thread keeps generating trajectory data, and the network-updating thread can reuse the collected data, which accelerates the learning process.
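A minimal sketch of this two-process structure is shown below, using Python threads and a shared trajectory buffer. The functions run_episode and update_networks are hypothetical placeholders for the data collection rollout and the trace-based network update described later in this section, and the buffer size and batch size are illustrative assumptions.

import collections
import random
import threading
import time

buffer = collections.deque(maxlen=1000)        # shared memory buffer of trajectories
buffer_lock = threading.Lock()

def collection_process(env, policy, episodes):
    # Continuously generates 200-step trajectories and appends them to the shared buffer.
    for _ in range(episodes):
        trajectory = run_episode(env, policy, steps=200)   # placeholder: states, actions, probs, rewards
        with buffer_lock:
            buffer.append(trajectory)

def update_process(policy, value_fn, updates):
    # Reuses previously collected trajectories (off-policy) to update the value and policy networks.
    while len(buffer) <= 10:                   # wait until enough trajectories have been collected
        time.sleep(0.1)
    for _ in range(updates):
        with buffer_lock:
            batch = random.sample(list(buffer), k=min(8, len(buffer)))
        update_networks(policy, value_fn, batch)           # placeholder for the trace-based update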
In the data collection process, the quadrotor was randomly launched in a 2-m cubic space with random states. The training data, which comprised the quadrotor states, actions, probabilities, and rewards, were recorded in each episode, which contained 200 steps over a two-second flight, and then saved as a single data trajectory in a memory buffer. The normalized reward function for evaluating the current state of the quadrotor is as follows:
$$r = -\left(0.002\,e_q + 0.002\,e_p + 0.002\,a\right),$$
where $e_q$ is the vehicle angle error, $e_p$ is the vehicle position error, and $a$ is the motor thrust command for constraining the energy cost.
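A direct Python transcription of this reward is shown below; treating $e_q$, $e_p$, and $a$ as scalar error and command magnitudes is an assumption made for illustration.

def reward(e_q, e_p, a):
    # Normalized reward: penalize attitude error, position error, and the motor thrust command.
    return -(0.002 * e_q + 0.002 * e_p + 0.002 * a)

# Example: a quadrotor near hover with small errors and a moderate thrust command.
print(reward(e_q=0.1, e_p=0.05, a=0.5))   # approximately -0.0013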
When the number of data trajectories in the memory buffer exceeded 10, the training process started. In the training process, trajectories were randomly sampled from the memory buffer. The advantage and value functions were defined recursively and calculated in the reverse time direction because they depend on quantities at the future time $t+1$. The functions were estimated according to
$$A_t^{trace} = A_t + \gamma\,\min\!\left(1, \frac{\pi_{t+1}}{\mu_{t+1}}\right) A_{t+1}^{trace},$$
and
$$V_t^{trace} = V_t + \min\!\left(1, \frac{\pi_t}{\mu_t}\right) A_t^{trace}.$$
With the two equations above, we use stochastic gradient descent to optimize the objectives as follows:
$$\text{maximize} \quad L_{policy} = \sum_{(s,a) \in T} \min\!\left(\left(\frac{\pi(a|s)}{\mu(a|s)} - 1\right) A^{trace},\ \epsilon\,\big|A^{trace}\big|\right)$$
$$\text{minimize} \quad L_{value} = \frac{1}{|T|}\sum_{(s,a) \in T} \left(V(s) - V^{trace}\right)^2.$$
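The backward recursion and the two surrogate objectives can be sketched in Python as follows. The sketch assumes that $A_t$ is the one-step temporal-difference advantage $r_t + \gamma V(s_{t+1}) - V(s_t)$, that the terminal value of an episode is bootstrapped with zero, and that the clipping parameter $\epsilon$ is 0.2; these details are assumptions, as they are not spelled out above.

import numpy as np

def compute_traces(rewards, values, pi_probs, mu_probs, gamma=0.99):
    # Backward recursion for A_trace and V_trace using truncated importance ratios min(1, pi/mu).
    T = len(rewards)
    next_values = np.append(values[1:], 0.0)              # bootstrap the terminal value with zero
    A = rewards + gamma * next_values - values            # one-step TD advantage A_t (assumed form)
    A_trace = np.zeros(T)
    V_trace = np.zeros(T)
    next_A_trace = 0.0
    for t in reversed(range(T)):
        rho_next = min(1.0, pi_probs[t + 1] / mu_probs[t + 1]) if t + 1 < T else 0.0
        A_trace[t] = A[t] + gamma * rho_next * next_A_trace
        next_A_trace = A_trace[t]
        V_trace[t] = values[t] + min(1.0, pi_probs[t] / mu_probs[t]) * A_trace[t]
    return A_trace, V_trace

def surrogate_losses(pi_probs, mu_probs, A_trace, values, V_trace, eps=0.2):
    # Clipped policy objective (to be maximized) and squared value-regression loss (to be minimized).
    policy_term = (pi_probs / mu_probs - 1.0) * A_trace
    L_policy = np.sum(np.minimum(policy_term, eps * np.abs(A_trace)))
    L_value = np.mean((values - V_trace) ** 2)
    return L_policy, L_value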
To approximate the function $\pi(a|s)$ for proposing actions and $V(s)$ for predicting the state value, two neural networks were formulated as the following equations, where a stochastic Gaussian policy was used for the actor network:
$$\theta(s) = \tilde{h}_1^{32} \circ \tilde{h}_0^{32}(s)$$
$$\alpha(s) = \sin\!\left(\tilde{y}_2^{4} \circ \theta(s)\right)$$
$$\sigma(s) = \frac{1}{2} + \frac{1}{2}\cos\!\left(\hat{y}_2^{4} \circ \theta(s)\right)$$
$$\pi(a|s) = \frac{1}{\sqrt{2\pi\sigma^2(s)}}\, e^{-\frac{\left(a - \alpha(s)\right)^2}{2\sigma^2(s)}}.$$
The state value function can be approximated as follows:
$$V(s) = y_2^{1} \circ h_1^{128} \circ h_0^{128}(s),$$
where $h_i^j$ and $y_i^j$ are the $i$-th hidden layer and output layer of the neural networks with width $j$, and $\circ$ denotes the composition of the fully connected layers with their activation functions. In both the actor and critic networks, the input state $s$ comprises the quadrotor's position, velocity, attitude, and angular velocity. The sin and cos functions are used to constrain the output range. A rectified linear unit (ReLU) is used as the activation function because of its fast computation and easy implementation in a microcontroller unit. When implementing the RL controller in the quadrotor flight computer system, only (25) is used to control the quadrotor.
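The actor and critic described above can be written compactly as the following Python forward passes. The random weights, the absence of bias terms, and the 13-dimensional state layout (position, velocity, quaternion attitude, angular velocity) are illustrative assumptions; in the actual system, the trained parameters are extracted and loaded into the Simulink model as described below.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative, untrained weights with the layer widths given above (no bias terms for brevity).
W = {"h0": rng.normal(size=(32, 13)),   "h1": rng.normal(size=(32, 32)),
     "y_mean": rng.normal(size=(4, 32)), "y_std": rng.normal(size=(4, 32)),
     "v0": rng.normal(size=(128, 13)),  "v1": rng.normal(size=(128, 128)),
     "v_out": rng.normal(size=(1, 128))}

def relu(x):
    return np.maximum(x, 0.0)

def actor(s):
    # Gaussian policy head: two 32-unit ReLU layers, then sin/cos-squashed mean and standard deviation.
    theta = relu(W["h1"] @ relu(W["h0"] @ s))
    alpha = np.sin(W["y_mean"] @ theta)                   # mean motor commands, constrained to [-1, 1]
    sigma = 0.5 + 0.5 * np.cos(W["y_std"] @ theta)        # standard deviations, constrained to [0, 1]
    return alpha, sigma

def critic(s):
    # State-value head: two 128-unit ReLU layers with a scalar linear output.
    return (W["v_out"] @ relu(W["v1"] @ relu(W["v0"] @ s))).item()

# During training an action is sampled from N(alpha(s), sigma(s)^2) to evaluate pi(a|s);
# at flight time only the mean action alpha(s) is used to command the four motors.
s = np.zeros(13)
alpha, sigma = actor(s)
print(alpha.shape, sigma.shape, critic(s))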
To apply the developed RL controller in an outdoor environment, we extracted the parameters of the trained neural network and loaded them into a Simulink model. The position input state was limited to the same finite range as that adopted in the training environment to prevent untrained conditions from occurring when the RL controller is used with a GPS in outdoor environments. After the external disturbance observer and the RL neural network controller were successfully implemented and trained, the DCRL control policy was transferred to the PixRacer flight computer on the real quadrotor to replace the original PID controller.

4.2. Results of the Indoor Experiment

In the indoor experiment, we placed an electric fan to simulate a constant wind disturbance (Figure 4). We used a self-made quadrotor with a flight control board and a GPS mounted on the airframe. An OptiTrack motion capture system provided reliable state information, and a multisensor fusion framework in the flight computer fused the measurements from the onboard IMU with the motion capture data to compensate for the time delay and low update frequency of the OptiTrack system. We compared the position errors of the RL controller with and without compensation under the wind disturbance. The measured wind speed was 3.6 m/s at the quadrotor, applied along its x-axis. Figure 5 displays the position error histogram. The mean errors of the original RL and DCRL controllers were 8.4 and 2 cm, respectively; that is, the compensation reduced the hovering error by 75%. Figure 6 presents the estimated position error and external disturbance force for a 30-s flight. Video clips of the indoor experiment can be found at https://youtu.be/RtAoiljZTSI (accessed on 5 April 2021).

4.3. Results of the Outdoor Experiment

After the DCRL control algorithm was verified in the laboratory environment, with accurate position and velocity measurements from motion capture and a relatively steady wind perturbation, the quadrotor was moved outdoors and a GPS was used to obtain position feedback. The maximum wind speed measured by an anemometer was 4.2 m/s. The position trajectory was a square with 10-m sides at a constant height and a velocity of 1 m/s. The results of the outdoor experiment are shown in Figure 7, and the estimated external forces acting on the x-axis and y-axis are presented in Figure 8. Table 2 summarizes the position errors in the indoor and outdoor experiments. In the outdoor experiment, a position waypoint was used as the reference for trajectory tracking without a velocity command; thus, the quadrotor had to maintain a certain position error to obtain the moving velocity required to follow the waypoint. The tracking errors may have also been caused by the 2.5-m horizontal position accuracy and 10-Hz update rate of the adopted GPS sensor. However, the experimental results still indicate that the DCRL structure can reduce the quadrotor positioning error.

5. Conclusions

In this paper, an RL control structure with external force compensation and an external force disturbance observer is proposed for quadrotors. The DCRL controller can reduce the effects of wind gusts on quadrotors in fixed-position hovering and trajectory tracking tasks and improve their flight performance. In the experiments, compared with the original RL control algorithm, the proposed control strategy reduced the fixed-position hovering error by 75%. To the best of our knowledge, this study is the first to use a low-level RL controller with a GPS in an outdoor environment to eliminate the external disturbance acting on flying quadrotors.

Author Contributions

Conceptualization, C.-H.P.; Methodology, C.-H.P.; Project administration, S.C.; Software, C.-H.P. and W.-Y.Y.; Supervision, S.C.; Validation, C.-H.P. and W.-Y.Y.; Writing—original draft, C.-H.P. and W.-Y.Y.; Writing—review & editing, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bouabdallah, S.; Noth, A.; Siegwart, R. PID vs. LQ control techniques applied to an indoor micro quadrotor. In Proceedings of the 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No.04CH37566), Sendai, Japan, 28 September–2 October 2004; Volume 3, pp. 2451–2456. [Google Scholar]
  2. Lee, D.; Kim, H.J.; Sastry, S.S. Feedback linearization vs. adaptive sliding mode control for a quadrotor helicopter. Int. J. Control. Autom. Syst. 2009, 7, 419–428. [Google Scholar] [CrossRef]
  3. Alexis, K.; Nikolakopoulos, G.; Tzes, A. Switching model predictive attitude control for a quadrotor helicopter subject to atmospheric disturbances. Control Eng. Pract. 2011, 19, 1195–1207. [Google Scholar] [CrossRef] [Green Version]
  4. Alexis, K.; Nikolakopoulos, G.; Tzes, A. Model predictive quadrotor control: Attitude, altitude and position experimental studies. IET Control Theory Appl. 2012, 6, 1812–1827. [Google Scholar] [CrossRef] [Green Version]
  5. Lee, T. Robust Adaptive Attitude Tracking on SO(3) With an Application to a Quadrotor UAV. IEEE Trans. Control Syst. Technol. 2013, 21, 1924–1930. [Google Scholar]
  6. Wang, H.; Ye, X.; Tian, Y.; Zheng, G.; Christov, N. Model-free–based terminal SMC of quadrotor attitude and position. IEEE Trans. Aerosp. Electron. Syst. 2016, 52, 2519–2528. [Google Scholar] [CrossRef]
  7. Xu, B. Composite Learning Finite-Time Control With Application to Quadrotors. IEEE Trans. Syst. Man Cybern. Syst. 2018, 48, 1806–1815. [Google Scholar] [CrossRef]
  8. Xu, R.; Özgüner, Ü. Sliding Mode Control of a Quadrotor Helicopter. In Proceedings of the 45th IEEE Conference on Decision and Control, San Diego, CA, USA, 13–15 December 2006; pp. 4957–4962. [Google Scholar]
  9. Dydek, Z.T.; Annaswamy, A.M.; Lavretsky, E. Adaptive Control of Quadrotor UAVs: A Design Trade Study with Flight Evaluations. IEEE Trans. Control Syst. Technol. 2013, 21, 1400–1406. [Google Scholar] [CrossRef]
  10. Zou, Y.; Meng, Z. Immersion and Invariance-Based Adaptive Controller for Quadrotor Systems. IEEE Trans. Syst. Man Cybern. Syst. 2019, 49, 2288–2297. [Google Scholar] [CrossRef]
  11. Liu, H.; Zhao, W.; Zuo, Z.; Zhong, Y. Robust control for quadrotors with multiple time-varying uncertainties and delays. IEEE Trans. Ind. Electron. 2016, 64, 1303–1312. [Google Scholar] [CrossRef]
  12. Chovancova, A.; Fico, T.; Hubinský, P.; Duchon, F. Comparison of various quaternion-based control methods applied to quadrotor with disturbance observer and position estimator. Robot. Auton. Syst. 2016, 79, 87–98. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Chen, Z.; Zhang, X.; Sun, Q.; Sun, M. A novel control scheme for quadrotor UAV based upon active disturbance rejection control. Aerosp. Sci. Technol. 2018, 79, 601–609. [Google Scholar] [CrossRef]
  14. Xu, L.X.; Ma, H.J.; Guo, D.; Xie, A.H.; Song, D.L. Backstepping sliding-mode and cascade active disturbance rejection control for a quadrotor UAV. IEEE/ASME Trans. Mechatronics 2020, 25, 2743–2753. [Google Scholar] [CrossRef]
  15. Yang, H.; Cheng, L.; Xia, Y.; Yuan, Y. Active Disturbance Rejection Attitude Control for a Dual Closed-Loop Quadrotor Under Gust Wind. IEEE Trans. Control Syst. Technol. 2018, 26, 1400–1405. [Google Scholar] [CrossRef]
  16. Besnard, L.; Shtessel, Y.; Landrum, B. Quadrotor vehicle control via sliding mode controller driven by sliding mode disturbance observer. J. Frankl. Inst. 2012, 349, 658–684. [Google Scholar] [CrossRef]
  17. Chen, F.; Lei, W.; Zhang, K.; Tao, G.; Jiang, B. A novel nonlinear resilient control for a quadrotor UAV via backstepping control and nonlinear disturbance observer. Nonlinear Dyn. 2016, 85, 1281–1295. [Google Scholar] [CrossRef]
  18. McKinnon, C.D.; Schoellig, A.P. Unscented external force and torque estimation for quadrotors. In Proceedings of the 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea, 9–14 October 2016; pp. 5651–5657. [Google Scholar]
  19. Smolyanskiy, N.; Kamenev, A.; Smith, J.; Birchfield, S. Toward low-flying autonomous MAV trail navigation using deep neural networks for environmental awareness. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; pp. 4241–4247. [Google Scholar]
  20. Greatwood, C.; Richards, A. Reinforcement learning and model predictive control for robust embedded quadrotor guidance and control. Auton. Robot. 2019, 1–13. [Google Scholar] [CrossRef] [Green Version]
  21. Hwangbo, J.; Sa, I.; Siegwart, R.; Hutter, M. Control of a quadrotor with reinforcement learning. IEEE Robot. Autom. Lett. 2017, 2, 2096–2103. [Google Scholar] [CrossRef] [Green Version]
  22. Pi, C.H.; Hu, K.C.; Cheng, S.; Wu, I.C. Low-level autonomous control and tracking of quadrotor using reinforcement learning. Control Eng. Pract. 2020, 95, 104222. [Google Scholar] [CrossRef]
  23. Wang, Y.; Sun, J.; He, H.; Sun, C. Deterministic Policy Gradient With Integral Compensator for Robust Quadrotor Control. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 3713–3725. [Google Scholar] [CrossRef]
  24. Tomić, T.; Ott, C.; Haddadin, S. External wrench estimation, collision detection, and reflex reaction for flying robots. IEEE Trans. Robot. 2017, 33, 1467–1482. [Google Scholar] [CrossRef]
  25. Sutton, R.S.; McAllester, D.A.; Singh, S.P.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems; ACM: New York, NY, USA, 2000; pp. 1057–1063. [Google Scholar]
  26. Degris, T.; White, M.; Sutton, R.S. Off-Policy Actor-Critic. arXiv 2012, arXiv:1205.4839. [Google Scholar]
  27. Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; Moritz, P. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1889–1897. [Google Scholar]
  28. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction. IEEE Trans. Neural Netw. 1988, 16, 285–286. [Google Scholar] [CrossRef]
  29. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar]
Figure 1. Body-fixed frame of the quadrotor and motor placement.
Figure 2. The structures of quadrotor control. (a) Block diagram of the quadrotor control structure with reinforcement learning control and external disturbance compensator. (b) The neural network in the reinforcement learning (RL) actor.
Figure 3. Force compensation frame.
Figure 4. Experimental setup for indoor fixed-position hovering. An electric fan was used for simulating a constant wind disturbance with a speed of 3.6 m/s which was applied along the x-axis of the quadrotor.
Figure 5. Position error histogram of the indoor fixed-position hovering experiment. The mean errors of the original RL controller and the disturbance compensation RL (DCRL) controller are 8.4 cm and 2 cm, respectively.
Figure 6. (a) The indoor fixed-position hovering experiment setup. An electric fan was used to simulate a constant wind disturbance with a wind speed of 3.6 m/s applied along the quadrotor's x-axis. (b) The position error along the x-axis. (c) The estimated external force along the x-axis.
Figure 7. Position tracking of a 10-m square trajectory in the outdoor experiment. The GPS sensor was used for position measurement.
Figure 8. The estimated external forces along the x- and y-axes in the outdoor tracking experiment.
Table 1. Physical parameters of the quadrotor.
Weight (g): 665
I_xx, I_yy, I_zz (kg·m²): 0.0023, 0.0025, 0.0037
Table 2. Root mean square errors along the three quadrotor axes in the indoor and outdoor experiments.
                 e_x RMSE (m)    e_y RMSE (m)    e_z RMSE (m)
Indoor, RL           0.08            0.04            0.02
Indoor, DCRL         0.02            0.06            0.02
Outdoor, RL          0.80            0.46            0.17
Outdoor, DCRL        0.45            0.41            0.07
