Article

Multiagent Reinforcement Learning Based on Fusion-Multiactor-Attention-Critic for Multiple-Unmanned-Aerial-Vehicle Navigation Control

1 Department of Computer Science and Engineering, Konkuk University, Seoul 05029, Korea
2 Konkuk Aerospace Design-Airworthiness Research Institute, Konkuk University, Seoul 05029, Korea
* Authors to whom correspondence should be addressed.
Submission received: 2 September 2022 / Revised: 4 October 2022 / Accepted: 5 October 2022 / Published: 10 October 2022
(This article belongs to the Special Issue Energy Efficiency in Wireless Networks)

Abstract

The proliferation of unmanned aerial vehicles (UAVs) has spawned a variety of intelligent services, in which efficient coordination plays a significant role in increasing the effectiveness of cooperative execution. However, due to the limited operational time and range of UAVs, achieving highly efficient coordinated actions is difficult, particularly in unknown dynamic environments. This paper proposes a multiagent deep reinforcement learning (MADRL)-based fusion-multiactor-attention-critic (F-MAAC) model for the energy-efficient cooperative navigation control of multiple UAVs. The proposed model is built on the multiactor-attention-critic (MAAC) model and offers two significant advances. The first is a sensor fusion layer, which enables the actor network to utilize all required sensor information effectively. The second is a layer that computes the dissimilarity weights of different agents, added to compensate for the information lost through the attention layer of the MAAC model. We utilize the UAV LDS (logistic delivery service) environment created with the Unity engine to train the proposed model and verify its energy efficiency. A feature that measures the total distance traveled by the UAVs is incorporated into the UAV LDS environment to validate energy efficiency. To demonstrate the performance of the proposed model, the F-MAAC model is compared with several conventional reinforcement learning models in two use cases. First, we compare the F-MAAC model to the DDPG, MADDPG, and MAAC models based on the mean episode reward over 20k training episodes. The two top-performing models, F-MAAC and MAAC, are then chosen and retrained for 150k episodes. Our study uses the total number of deliveries completed within the same period and the total number completed within the same travel distance to represent energy efficiency. According to our simulation results, the F-MAAC model outperforms the MAAC model, making 38% more deliveries in 3000 time steps and 30% more deliveries per 1000 m of distance traveled.

1. Introduction

In recent years, the use of unmanned aerial vehicles (UAVs) for various applications has increased rapidly. Multiple UAVs are deployed for cooperative missions such as passenger transportation, logistics delivery, and surveillance [1]. In order to successfully carry out such missions within limited resources and time, energy-efficient multiple-UAV navigation control is needed for cooperative tasks. Since the energy consumption of a UAV is proportional to its operating time, a UAV's energy efficiency is directly related to high performance [2]. To develop an energy-efficient multiple-UAV control model, control complexity is a typical problem that needs to be resolved. When UAVs perform cooperative missions together, the decision of one UAV affects the decisions of the other UAVs. Moreover, the complexity increases exponentially as the number of UAVs increases [3]. Consequently, there are clear limitations in solving such problems with conventional heuristic-based search algorithms.
Multiagent deep reinforcement learning (MADRL) is a novel approach that enables each agent to perform cooperative tasks by interacting with other agents through its own decisions. Compared with conventional models, MADRL is well suited to environments where multiple agents exist, such as multirobot control, multiplayer games, and multiple-UAV control [4,5]. Unlike a ground vehicle that moves on a 2D plane, a UAV's range of motion is much broader, so its movement strategies for mission performance are more diverse. Furthermore, UAVs must make appropriate decisions by using both their own sensor information and the information retrieved by other UAVs. For these reasons, a suitable MADRL model must be selected for efficient navigation control.
There has been considerable research on reinforcement learning (RL)-based UAV navigation and its applications. G. Muñoz et al. [6] developed a DQN-based model applied to a single UAV for navigation with obstacle avoidance. An AirSim-based, realistic, simulated 3D environment was utilized for training the agent, and the authors demonstrated that the proposed model outperformed other DQN-based algorithms. Similarly, H. Qie et al. [7] proposed a multiagent deep deterministic policy gradient (MADDPG)-based model for multiple-UAV target assignment and path planning. The results showed that agents could be assigned to targets at relatively close distances with clear behavior for avoiding threat areas. Linfei Feng [8] introduced a policy gradient (PG) model that could be applied to optimize the logistics distribution routes of a single UAV; the results showed that the UAV arranged delivery routes to multiple destinations along the shortest path. Ory Walker et al. [9] developed a framework based on the combination of proximal policy optimization (PPO) and the adaptive belief tree (ABT) for multiple-UAV exploration and target finding. The proposed algorithm was verified in both 2D and 3D environments with physically simulated UAVs using the PX4 software stack. W.J. Yun et al. [10] utilized the QMIX model for eVTOL mobility in drone taxi applications. The proposed QMIX-based algorithm showed optimal performance when compared with independent DQN (I-DQN) and a random walk in a drone taxi service scenario. Zhou W. et al. [11] proposed a reciprocal-reward multiagent actor-critic (MAAC-R) method and applied it to learning cooperative tracking policies for UAV swarms. The training results demonstrated that the proposed model performed better than the MAAC model in terms of the cooperative tracking behavior of UAV swarms. D. Xu et al. [12] improved the MADDPG-based algorithm and applied it to the autonomous and cooperative control of UAV clusters in combat missions. The proposed algorithm was tested on two conventional combat missions, and the results showed that the learning efficiency and the operational safety factor were improved compared with the original MADDPG algorithm. Similarly, Guang Zhan et al. [13] applied multiagent proximal policy optimization (MAPPO) in a Unity-based 3D-simulated air combat environment. The proposed algorithm was trained with a Ray-based distributed training framework, and in the experiments MAPPO outperformed COMA and BiCNet in average accumulated reward. Table 1 shows a detailed comparison of research conducted utilizing MADRL and RL.
As Table 1 shows, most of the research was carried out using actor-critic-based models. Additionally, based on previous research related to MADRL, we conclude that centralized training with decentralized execution is more suitable for real-world situations: during real-world execution, it is difficult for one UAV to obtain data from all other UAVs in real time, and a decentralized actor network can be used to infer actions in such a partially observable environment. We therefore focus on the multiactor-attention-critic (MAAC) model, which showed optimal performance among algorithms based on a centralized critic and decentralized policies and which can be used in environments where information exchange between agents is not guaranteed [16].
This study makes the following significant contributions.
  • The development of an MAAC-based model with two significant improvements by applying a sensor fusion layer in the actor network and a dissimilarity layer in the critic network.
  • A new feature to calculate the energy efficiency of UAVs is incorporated with the previously developed UAV LDS simulation environment.
  • The performance of existing RL and MADRL models is compared using two energy efficiency indicators.
In this research, we focus on optimizing learning efficiency by efficiently processing the observations of multiple UAVs by adding two features to the MAAC model. First, we introduce a sensor fusion layer in the actor network to extract features from various sensors such as a ray-cast sensor for preventing collision with adjacent obstacles, an inertial navigation system (INS) for the self-awareness of flight status, and a radio detection and ranging (RADAR) system for collecting location data from other UAVs. Second, in the critic network, a dissimilarity layer is added to provide more weight to the information of agents with fewer similarities. By implementing these functions, the efficiency of information processing is increased, and we prove through experiments that it plays a decisive role in achieving the goal of energy-efficient UAV navigation control.
To experiment with and validate our proposed MADRL model, the logistic delivery service virtual test bed is adopted from our previous research [21]. The test bed is customized by adding an energy efficiency module for multiple-UAV cooperation, specifically for logistics delivery. To determine whether UAVs can perform missions cooperatively, the environment includes a scenario in which two UAVs cooperate to transport logistics. A function to measure the total travel distance of the UAVs was added to validate their energy efficiency. Our proposed model shows the highest performance in terms of energy efficiency compared with conventional RL algorithms. We measure energy efficiency with the number of deliveries completed within the same time and the number of cargos delivered over the same distance traveled, and our model shows superiority in both indicators.
Our work is structured as follows. Section 2 covers the general background of the RL and MADRL algorithms. In Section 3, we expound on the proposed fusion-MAAC (F-MAAC) method. In Section 4, the test bed for training and evaluation is described in detail. Section 5 presents the results and discusses the performance evaluation. Finally, the study concludes with future directions in Section 6.

2. Background

RL has recently been spotlighted within machine learning. It is a technique that learns a model through the trial and error of an agent in a given environment, without any prior data. RL can be described as a learning process that develops behavior through trial and error to maximize the cumulative reward in a sequential decision-making problem, which can be expressed as a Markov decision process (MDP). RL is being utilized in various fields and situations that can be expressed as sequential decision-making problems, such as stock investment, driving, and games.

2.1. Markov Decision Process (MDP)

RL is an optimization method for solving sequential decision-making problems formulated as a Markov decision process (MDP). The MDP is defined as the tuple
$\langle S, A, P, R, \gamma \rangle$ (1)
Here, $S$ is the state space and $A$ is the action space. $P$ is the probability distribution of the next state $s'$ when the agent chooses an action $a \in A$ in a state $s \in S$, and $R$ is the reward received in the next state $s'$. For the cumulative reward, future rewards are discounted using the discount rate $\gamma$; this reflects future uncertainty and prevents divergence of the cumulative reward so that learning can proceed stably. Figure 1a exemplifies the basic concept of the MDP: when an agent chooses an action $a$, the environment proceeds to the next step according to $a$ and returns the next state $s'$ and reward $r$.
A Markov game is a multiagent extension of the MDP [22]. A Markov game is defined by a set of states and actions for $N$ agents. A probability distribution over the next state is given by the current state and the actions of all agents. The reward function for each agent depends on the global state and the actions of all agents. The observation $O_i$ is the partial state that agent $i$ can observe and includes some information from the global state. Each agent learns a policy $\pi_i : O_i \rightarrow P(A_i)$ that maximizes its expected sum of rewards. Figure 1b shows multiple agents interacting with the environment to receive rewards. The agents $\{A_1, \dots, A_N\}$ send the joint action $a_{1 \dots N} = \{a_1, a_2, \dots, a_N\}$ to the environment, which returns the next states $s_{1 \dots N} = \{s_1, s_2, \dots, s_N\}$ and the rewards $r_{1 \dots N} = \{r_1, r_2, \dots, r_N\}$.
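The interaction pattern of a Markov game can be illustrated with a short, self-contained Python toy that is not part of the paper: $N$ agents submit a joint action, and the environment returns a joint next state and joint reward, each depending on the global state and all actions. The dynamics below are arbitrary placeholders chosen only to show the interface.

```python
# Toy example of the Markov-game interaction in Figure 1b: a joint action in,
# a joint next state and joint reward out. The transition and reward rules
# here are arbitrary illustrations, not the UAV LDS dynamics.
import random

class ToyMarkovGame:
    def __init__(self, n_agents=2):
        self.n = n_agents
        self.states = [0.0] * n_agents

    def step(self, actions):
        # each agent's next state depends on the joint action
        mean_action = sum(actions) / self.n
        self.states = [s + a + 0.1 * mean_action for s, a in zip(self.states, actions)]
        # each agent's reward depends on the resulting global state
        rewards = [-abs(s) for s in self.states]
        return list(self.states), rewards

game = ToyMarkovGame(n_agents=2)
actions = [random.choice([-1, 1]) for _ in range(2)]
next_states, rewards = game.step(actions)
print(next_states, rewards)
```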

2.2. Bellman Equation

Solving the MDP is divided into prediction and control problems. Prediction is the problem of evaluating the value of each state given a policy. Control is the problem of finding the optimal policy. The policy and value need to be expressed through the Bellman equation to solve these problems. Bellman's equation is defined using the recursive relationship between the present time step $t$ and the next time step $t+1$. The value function $V(s)$ and the action value function $Q(s, a)$ can be expressed with the Bellman expectation equation and the Bellman optimality equation [23].
The return $G_t$ is derived using the following equation:
$G_t = R_{t+1} + R_{t+2} + \cdots + R_T$ (2)
where $R$ is the reward and $G_t$ is the sum of the rewards received from time step $t+1$ to the final time step $T$.
Since immediate rewards are more important than future rewards, the discount factor $\gamma$ is applied to Equation (2) to redefine $G_t$:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}$ (3)
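As a small illustration of the recursion in Equation (3), the following sketch, which is not part of the paper, computes a discounted return backwards over a reward sequence; the reward values and the choice of $\gamma$ are purely illustrative.

```python
# Minimal sketch: computing the discounted return G_t of Equation (3)
# backwards via the recursion G_t = R_{t+1} + gamma * G_{t+1}.

def discounted_return(rewards, gamma=0.99):
    """rewards holds R_{t+1}, ..., R_T in order; returns G_t."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Example: three future rewards with gamma = 0.9
print(discounted_return([1.0, 0.0, 10.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*10.0 = 9.1
```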
  • Bellman's expectation equation—The value function $v_\pi(s)$ is calculated as the expected value of $G_t$, as in Equation (4):
    $v_\pi(s) = \mathbb{E}_\pi[\, G_t \mid S_t = s \,]$ (4)
    where $\mathbb{E}_\pi$ is the expectation when following the policy $\pi$.
    To derive the recursive form of the value function $v_\pi(s_t)$, Equation (3) is substituted into Equation (4):
    $v_\pi(s_t) = \mathbb{E}_\pi[\, R_{t+1} + \gamma v_\pi(s_{t+1}) \,]$ (5)
    The action value function $q_\pi(s, a)$ is calculated using the expected value $\mathbb{E}_\pi(G_t)$:
    $q_\pi(s, a) = \mathbb{E}_\pi[\, G_t \mid S_t = s, A_t = a \,]$ (6)
    To derive the recursive form of the action value function $q_\pi(s_t, a_t)$, Equation (3) is substituted into Equation (6):
    $q_\pi(s_t, a_t) = \mathbb{E}_\pi[\, R_{t+1} + \gamma q_\pi(s_{t+1}, a_{t+1}) \,]$ (7)
  • Bellman's optimality equation—The optimal values $v_*(s)$ and $q_*(s, a)$ are calculated as follows:
    $v_*(s) = \max_\pi v_\pi(s_t) = \max_a \mathbb{E}[\, R_{t+1} + \gamma v_\pi(s_{t+1}) \,]$ (8)
    where $\max_\pi$ denotes the maximum cumulative reward over policies and $\max_a$ selects, among all actions $a_{t+1}$, the action $a$ that provides the maximum reward.
    $q_*(s, a) = \max_\pi q_\pi(s_t, a_t) = \mathbb{E}[\, R_{t+1} + \gamma \max_{a'} q(s_{t+1}, a') \,]$ (9)

2.3. Multiagent Deep Reinforcement Learning

Multiagent deep reinforcement learning (MADRL) is one of the most popular and effective approaches for solving more complex problems in which multiple agents collaborate to perform specific tasks, for example, a robot soccer game in which a team of robots collaborates to achieve the mission. One of the key challenges in such a setting is that the environment is more dynamic from the perspective of each agent, which may affect the learning rate of the individual agents as a team.
  • Multiagent deep deterministic policy gradient (MADDPG)—MADDPG [15] is a multiagent extension of DDPG [25], which itself builds on DPG [24] and adopts DQN [26] techniques such as the replay buffer and target network separation. Each agent has its own actor and critic. The MADDPG method uses centralized training with decentralized execution. The architecture of the MADDPG model is shown in Figure 2. A centralized critic network $Q_{1 \dots N}$ is used for centralized training, with the observations $o_{1 \dots N}$ and actions $a_{1 \dots N}$ of all agents as input. In decentralized execution, each agent uses its actor network $\pi_{1 \dots N}$ to choose an action using only local information. With this approach, the MADDPG model can be applied even in a partially observable environment where communication between agents is limited.
  • Multiactor-attention-critic (MAAC)—MAAC was developed by [16] as a MADRL model. It trains decentralized policies in multiagent environments by utilizing centrally computed critics with an attention mechanism, which selects relevant information for each agent at every time step. The multiattention head layer consists of multiple attention heads. The attention function in each attention head can be described as mapping a query and a set of key–value pairs to an output [27]. It is calculated as in Equation (10), where the query $Q$ has a corresponding key $K$ and value $V$, and $d_k$ is a scaling factor. As shown in Figure 3, the encodings of the agent's state and action, denoted the state–action encodings ($SAE_i$), are the key and value, and the encodings of the other agents' states ($SE_j$, $j \in \backslash i$) are the query. In each of the $N$ attention heads, different attention head values (AHVs) are derived according to the influence of the query, key, and value extractors. The final attention value (AV) is obtained by combining the AHVs, and the final output $Q_i(o, a)$ is derived through the fully connected layers $FC_1$ and $FC_2$ with AV and $SE_i$ as input. In the multiattention head layer, the agent assigns greater weight to the information of agents whose observations are more similar to its own. This attention mechanism enables more effective and flexible learning in complex multiagent environments compared with MADDPG.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \dfrac{Q K^T}{\sqrt{d_k}} \right) V$ (10)
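The scaled dot-product attention of Equation (10) can be sketched in a few lines of PyTorch; the tensor shapes and variable names below are assumptions for illustration and do not reproduce the authors' implementation.

```python
# Illustrative sketch of the scaled dot-product attention in Equation (10),
# as used inside each attention head of the MAAC critic.
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Q: (n_agents, d_k), K: (n_agents, d_k), V: (n_agents, d_v)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of queries and keys
    weights = F.softmax(scores, dim=-1)             # attention weights per query
    return weights @ V                              # weighted sum of values

Q = torch.randn(3, 16)   # e.g., one query per agent
K = torch.randn(3, 16)
V = torch.randn(3, 32)
print(attention(Q, K, V).shape)   # torch.Size([3, 32])
```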

3. Fusion-Multiactor-Attention-Critic (F-MAAC) Model

In this section, the F-MAAC model is discussed for the application of multiple-UAV cooperative navigation. To increase the learning efficiency of the agent, we used a new sensor fusion layer with MAAC. The sensor fusion layer was used for the UAV’s local observation, and another layer named cosine dissimilarity was added to utilize global information obtained by other UAVs efficiently. The overall architecture of the proposed F-MAAC model is exemplified in Figure 4.
The overall flow of the F-MAAC model follows that of the MAAC model, including the loss function and the gradients of the objective function. Each agent has its own independent actor and critic, following centralized training with decentralized execution. In the training phase, the observations of all agents are entered as inputs to each agent's critic network. In the execution phase, the decentralized actor network is used to infer an action using only the agent's own observation as input. This general F-MAAC model can be applied to N agents equipped with M types of sensors. The step-by-step training procedure of the F-MAAC model is as follows:
Step 1: Initialize the critic networks $Q^{\psi}_{1 \dots N}$ and the actor networks $\pi^{\theta}_{1 \dots N}$ with random parameters, and synchronize the parameters of the target critics $Q^{\bar{\psi}}_{1 \dots N}$ with the critics $Q^{\psi}_{1 \dots N}$ and the target actors $\pi^{\bar{\theta}}_{1 \dots N}$ with the actors $\pi^{\theta}_{1 \dots N}$.
Step 2: Get the observations $o_{1 \dots N}$ from the environment, feed them forward through the actors $\pi^{\theta}_{1 \dots N}(o)$, and select the actions $a_{1 \dots N}$.
Step 3: Proceed to the next time step with the actions $a_{1 \dots N}$ and get the next observations $o'_{1 \dots N}$ and rewards $r_{1 \dots N}$ from the environment.
Step 4: Push the obtained tuples $(o, a, o', r)_{1 \dots N}$ to the replay buffer.
Step 5: Repeat Steps 2 to 4 until $E$ transitions have been collected.
Step 6: Sample a minibatch $B = (o, a, o', r)_{1 \dots N}$ from the replay buffer.
Step 7: Perform a gradient descent step using $B$ to minimize the loss function in Equation (11) with respect to the critic parameters $\psi$:
$L_Q(\psi) = \sum_{i=1}^{N} \mathbb{E}\big[ (Q_i^{\psi}(o, a) - y_i)^2 \big]$ (11)
where $y_i = r_i + \gamma \, \mathbb{E}_{a' \sim \pi^{\bar{\theta}}(o')}\big[ Q_i^{\bar{\psi}}(o', a') - \alpha \log \pi_{\bar{\theta}_i}(a'_i \mid o'_i) \big]$.
Step 8: Perform a gradient ascent step using the observations $o_{1 \dots N}$ in $B$ to maximize the objective function in Equation (12) with respect to the actor parameters $\theta$:
$\nabla_{\theta_i} J(\pi_\theta) = \mathbb{E}_{o \sim D,\, a \sim \pi}\big[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i) \big( -\alpha \log \pi_{\theta_i}(a_i \mid o_i) + A_i(o, a) \big) \big]$ (12)
Step 9: Update the parameters of the target critics $Q^{\bar{\psi}}_{1 \dots N}$ with Equation (13) and of the target actors $\pi^{\bar{\theta}}_{1 \dots N}$ with Equation (14), using an update rate of $\tau = 0.005$:
$\bar{\psi} \leftarrow (1.0 - \tau)\,\bar{\psi} + \tau\,\psi$ (13)
$\bar{\theta} \leftarrow (1.0 - \tau)\,\bar{\theta} + \tau\,\theta$ (14)
Step 10: Repeat Steps 2 to 9 until the end of the episode.
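Two mechanical pieces of this procedure, the target-network synchronization of Step 1 and the soft update of Equations (13) and (14) used in Step 9, can be summarized with the minimal PyTorch sketch below. The tiny linear module is only a placeholder for the actual critic architecture, which is not specified here.

```python
# Minimal runnable sketch of target-network synchronization (Step 1) and the
# soft update of Eq. (13)/(14) used in Step 9. The nn.Linear module is a
# stand-in for the real critic.
import copy
import torch
import torch.nn as nn

def soft_update(target: nn.Module, source: nn.Module, tau: float = 0.005):
    # psi_bar <- (1 - tau) * psi_bar + tau * psi
    with torch.no_grad():
        for t_p, p in zip(target.parameters(), source.parameters()):
            t_p.mul_(1.0 - tau).add_(tau * p)

critic = nn.Linear(8, 1)                 # placeholder critic Q_i
target_critic = copy.deepcopy(critic)    # Step 1: synchronize target with critic

# ... after a gradient step on the critic (Steps 6-8) ...
soft_update(target_critic, critic, tau=0.005)   # Step 9
```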

3.1. Deep Fusion Layer in Actor Network

As illustrated in Figure 5, we propose a deep fusion layer in the actor network to increase efficiency. Observations are separated into the M sensor types so that features can be extracted from each sensor. For instance, three different types of sensors are used for the UAVs in our virtual UAV LDS environment: a ray-cast sensor for preventing collisions with surrounding obstacles, an INS for the self-awareness of flight status, and a RADAR for retrieving the coordinates of other UAVs and hubs. Each sensor's data pass through its sensor encoder. The encoded sensor data are concatenated and passed through two fully connected layers. The output of the deep fusion layer can be expressed by Equation (15), where $FC_1$, $FC_2$, and the sensor encoders ($SNE_{1 \dots 3}$) are fully connected layers.
$\mathrm{Output} = FC_2\big( FC_1\big( \mathrm{Concat}( SNE_1(\mathrm{sensor}_1), SNE_2(\mathrm{sensor}_2), SNE_3(\mathrm{sensor}_3) ) \big) \big)$ (15)
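A minimal PyTorch sketch of such a deep fusion layer is given below. The per-sensor input sizes roughly follow the observation sizes listed later in Table 2, while the encoder and hidden-layer widths are illustrative assumptions, not the values used by the authors.

```python
# Sketch of a deep fusion layer in the sense of Equation (15): three sensor
# encoders, concatenation, and two fully connected layers. All layer widths
# are illustrative assumptions.
import torch
import torch.nn as nn

class DeepFusionLayer(nn.Module):
    def __init__(self, ray_dim=27, ins_dim=9, radar_dim=48, enc_dim=32, hidden=128):
        super().__init__()
        self.sne1 = nn.Sequential(nn.Linear(ray_dim, enc_dim), nn.ReLU())    # ray-cast encoder
        self.sne2 = nn.Sequential(nn.Linear(ins_dim, enc_dim), nn.ReLU())    # INS encoder
        self.sne3 = nn.Sequential(nn.Linear(radar_dim, enc_dim), nn.ReLU())  # RADAR encoder
        self.fc1 = nn.Linear(3 * enc_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)

    def forward(self, ray, ins, radar):
        fused = torch.cat([self.sne1(ray), self.sne2(ins), self.sne3(radar)], dim=-1)
        return self.fc2(torch.relu(self.fc1(fused)))   # Eq. (15)

out = DeepFusionLayer()(torch.randn(1, 27), torch.randn(1, 9), torch.randn(1, 48))
print(out.shape)   # torch.Size([1, 128])
```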

3.2. Dissimilarity Layer in Critic Network

In the critic network, the state encodings ($SE_{1 \dots N}$) are shared with the other agents, as shown in Figure 6. The attention heads in the multiattention head layer select relevant information from the other agents' observations. Each attention head is constructed with a scaled dot product [27], which calculates the degree of similarity between the encoded observations of agent $i$ ($SE_i$, $SAE_i$) and the encoded observations of the other agents $j \in \backslash i$ ($SE_j$). UAVs at adjacent distances will have similar observation data. When more weight is given to similar observations, the UAVs have a wider field of view and less chance of colliding with each other.
However, the multiattention head layer also has drawbacks. For example, when the observation of a distant agent, which is dissimilar to the current agent's observation, plays an essential role in performing the mission, the attention mechanism can lead to serious performance degradation. More specifically, the observation of a far-away agent near a target point may provide helpful information. For these reasons, a dissimilarity layer was added to prevent performance degradation due to attention and to improve learning stability. In a previous study, we verified the effect of adding a dissimilarity layer to the MAAC model in a simple 2D cooperative navigation environment [28].
Cosine similarity measures the similarity between two vectors using the cosine of the angle between them. Additionally using the observations multiplied by the dissimilarity values can offset the effect of attention. The dissimilarity value is calculated from the encoded observation of agent $i$ ($SE_i$) and the encoded observations of the other agents $j \in \backslash i$ ($SE_j$). The dissimilarity value (DV) produced by the dissimilarity layer is concatenated with the attention value (AV) from the multiattention head layer and with $SE_i$. The concatenated value is then sent to the fully connected layers $FC_1$ and $FC_2$ to calculate the critic value $Q_i$.
Figure 7 shows the detailed process of the dissimilarity layer. The dissimilarity weight between the agent’s observations is calculated by multiplying the cosine similarity value by a negative number as in Equation (16).
$\mathrm{CosineDissimilarity}(SE_i, SE_n) = -1 \cdot \dfrac{SE_i \cdot SE_n}{\max\!\big( \lVert SE_i \rVert_2 \cdot \lVert SE_n \rVert_2,\ \varepsilon \big)}$ (16)
where $\varepsilon = 1 \times 10^{-8}$.
The negative dissimilarity values are replaced with 0 to focus on the information of agents with different observation patterns. Then, the observations of each agent are multiplied by the cosine dissimilarity weights and concatenated. The concatenated value is entered as the input of a fully connected layer. The output value DV of the dissimilarity layer, the output value AV of the multiattention head layer, and the encoded value $SE_i$ are concatenated as the input of the fully connected layers.
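A short PyTorch sketch of this cosine dissimilarity weighting is given below; the encoding dimension and the number of agents are illustrative assumptions, and F.cosine_similarity is used with the same $\varepsilon$ as in Equation (16).

```python
# Sketch of the cosine dissimilarity weight of Equation (16) with negative
# values clamped to zero, and the weighting of other agents' encodings.
import torch
import torch.nn.functional as F

def cosine_dissimilarity(se_i: torch.Tensor, se_j: torch.Tensor, eps: float = 1e-8):
    # -1 * cosine similarity, with negative dissimilarities replaced by 0
    dissim = -1.0 * F.cosine_similarity(se_i, se_j, dim=-1, eps=eps)
    return torch.clamp(dissim, min=0.0)

se_i = torch.randn(1, 64)              # encoded observation of agent i
se_j = torch.randn(4, 64)              # encoded observations of the other agents
weights = cosine_dissimilarity(se_i, se_j)      # one weight per other agent
weighted_obs = weights.unsqueeze(-1) * se_j     # encodings scaled by dissimilarity
print(weights.shape, weighted_obs.shape)        # torch.Size([4]) torch.Size([4, 64])
```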

4. Test Bed

4.1. UAV LDS Environment

In our previous work [21], we developed a UAV logistic delivery service (UAV LDS) environment for evaluating MADRL-based models. To calculate energy efficiency for this research, we added a feature that records the total movement of all agents. The UAV LDS environment is a virtual environment designed to reflect simplified logistics delivery scenarios in the real world and is implemented on the Unity platform equipped with a 3D physics engine. The modified source was updated in the following repository (https://github.com/leehe228/LogisticsEnv, accessed on 3 October 2022). The environment follows the OpenAI Gym API [29] design, which provides standard communication between learning algorithms and environments. In the LDS, multiple UAVs act as an air transportation system used to carry cargo through the three-dimensional city sky that connects land and air. To implement this as a simulated environment, we constructed blocks representing obstacles such as buildings, warehouses, and the cargo to be transported. In the scenario, UAVs delivered big cargo and small cargo from hubs to their destinations. What is unique about this environment is that two UAVs had to collaborate to move a big cargo. This scenario was included because it makes it possible to directly check whether the cooperation between UAVs works well. In addition, such cooperative situations can occur at any time in the real world, for example, when multiple UAVs need to move together to carry multiple cargos. Figure 8 shows a UAV carrying cargo in the UAV LDS environment. The gray boxes represent the buildings of the real world, the blue boxes are small cargos, and the red boxes are big cargos. Cargos are generated at the hubs, colored blue on the ground. The destination of the big cargo is colored pink and that of the small cargo is colored green.

4.2. Observation, Action, and Reward Design

This section describes the state, action, and reward of the environment which are essential elements of MDP. The state is the observation received by the agent, the action is the type of movement that can be selected, and the reward is the compensation according to the UAV’s action.
  • Observation—The UAVs received three types of sensor data: ray-cast data for preventing collisions with adjacent obstacles, INS data for the self-awareness of flight status, and RADAR data used to find the locations of the other UAVs and the hubs. A detailed description of the sensor data is provided in Table 2.
  • Actions—The UAVs could perform seven types of actions: ascend, descend, forward, backward, left, right, and not move.
  • Driving reward—To make the UAVs deliver cargo along the shortest path, a driving reward was given at every step. The reward was calculated from the difference between the distance at the previous time step $d_{pre}$ and the distance at the current time step $d_{curr}$, where each distance was measured to the target point. Before picking up a cargo, the nearest cargo was the target point; after picking up a cargo, the delivery point was the target point. If the UAV was not closer to the target point in the current time step than in the previous time step, a negative reward of $(d_{pre} - d_{curr}) \times 0.5$ was given (a minimal sketch of this reward is shown after this list).
  • Delivery rewards—The values in Table 3 were designed to make UAVs deliver cargo efficiently. For training numerous UAVs to work together to carry cargos, we delicately designed the rewards related to the delivery.
  • Collision penalty—The UAVs must avoid buildings and other UAVs with ray-cast observations. A negative reward of 10 was given when a collision occurred.
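The driving reward described above can be sketched as follows; the function name is illustrative, and the handling of the positive case (the raw distance difference) is an assumption, since only the 0.5-scaled penalty is stated explicitly.

```python
# Minimal sketch of the per-step driving reward: the change in distance to the
# current target, with the 0.5 scaling applied when the UAV moves away.

def driving_reward(d_prev: float, d_curr: float) -> float:
    delta = d_prev - d_curr          # positive when the UAV moved closer to the target
    if delta >= 0:
        return delta                 # assumed: raw progress toward the target
    return delta * 0.5               # negative reward (d_prev - d_curr) * 0.5 when moving away

print(driving_reward(10.0, 9.0))   #  1.0  (moved 1 m closer)
print(driving_reward(9.0, 10.0))   # -0.5  (moved 1 m away)
```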

4.3. Environmental Setup

The UAV environment provided custom settings for the environmental setup. In this research we used the default values in Table 4 for training and evaluation.

5. Experimental Simulation and Results

The proposed F-MAAC model was validated in the environment described in Section 4. For an efficient evaluation, we first compared the mean episode rewards of the F-MAAC, MAAC, MADDPG, and DDPG models after training for 20k episodes. Then, the two models with the highest performance, F-MAAC and MAAC, were selected and trained for 150k episodes. To evaluate the trained models over a meaningful horizon, the episode length was increased from 1000 to 3000 time steps. The timescale of the environment was decreased in the evaluation phase to observe and analyze the strategies of the UAVs. The total number of deliveries within one episode and within the same distance traveled was evaluated to verify the energy efficiency.
The hyperparameters for training the RL models are shown in Table 5.

5.1. Comparison of Performance of RL Models

Two MADRL models (MAAC, MADDPG) and one single agent RL model (DDPG) were compared with the proposed F-MAAC model. Each model was trained for 20k episodes with 1000 steps per episode in the proposed UAV LDS simulation environment.
According to Figure 9, the DDPG model showed the worst performance, as its mean episode reward did not increase significantly over the 20k episodes. Although it rose slightly from 5k to 20k episodes compared with DDPG, the increase in the mean episode reward of the MADDPG model was also minor. The F-MAAC and MAAC models, on the other hand, displayed impressive performance and successfully delivered some quantity of both large and small cargos. Between 10k and 20k training episodes, the F-MAAC model achieved a higher mean episode reward than the MAAC model. At the end of the 20k training episodes, the F-MAAC model's mean episode reward exceeded that of the MAAC model by more than 30%.

5.2. Comparison of Performance between F-MAAC and MAAC Models

We retrained the F-MAAC and MAAC models with 150k episodes, which took about six days with two GPU machines. The detailed specifications of the machine are listed below in Table 6.
Figure 10 shows the mean episode rewards of the MAAC and F-MAAC models for 150k training episodes. The mean episode rewards of both models increased noticeably in this experiment compared with the previous experiment in Section 5.1. The difference between them was negligible up to 40k training episodes. After 40k episodes, the F-MAAC model started to outperform the MAAC model. From 80k to 150k episodes, the mean episode reward of the MAAC model decreased while that of the F-MAAC model constantly increased. At the end of the training, the F-MAAC model obtained 50% more reward than the MAAC model. The randomness and instability of the complex 3D environment produced learning patterns different from those of the previous training, since the maps of the UAV LDS environment were generated randomly for every episode. However, both results showed that the F-MAAC model outperformed the MAAC model, and this experiment provides a more reliable comparison since training continued for 150k episodes.

5.3. Comparison of Energy Efficiency between F-MAAC and MAAC Models

For the energy efficiency evaluation, we executed the F-MAAC and MAAC models trained for 150k episodes. Each model was executed for 100 episodes with 3000 time steps per episode. The average performance per episode is shown as a box plot in Figure 11, including the number of successful deliveries of small and big cargos. The total performance was evaluated with $\mathrm{Score} = \mathrm{NumberOfSmallCargo} + 1.5 \times \mathrm{NumberOfBigCargo}$. The number of big cargos was weighted by 1.5 because we gave 50% more reward for big cargos in the training phase.
The result showed that the number of deliveries in both small and big cargos with the F-MAAC model was higher than in the MAAC model. Table 7 shows that the score of the F-MAAC model was 38% higher than that of the MAAC model during one episode, indicating that the F-MAAC model was more energy efficient.
We also measured energy efficiency with Score_movement, which is the performance per 1000 m of distance moved. We recorded the total movement of the UAVs during execution, and Score_movement was calculated as $\mathrm{Score\_movement} = \dfrac{\mathrm{Score}}{\mathrm{Movement}} \times 1000$. The results showed that the F-MAAC model was 30% more efficient than the MAAC model.
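The two indicators can be summarized with the short sketch below, which reproduces the F-MAAC Score_movement value reported in Table 7; the cargo counts passed to score() are hypothetical.

```python
# Sketch of the two energy-efficiency indicators used in this section.

def score(num_small: float, num_big: float) -> float:
    return num_small + 1.5 * num_big          # big cargos weighted 1.5x

def score_per_movement(score_value: float, movement_m: float) -> float:
    return score_value / movement_m * 1000.0  # deliveries per 1000 m traveled

print(score(10, 4))                                   # 16.0 for 10 small and 4 big cargos (hypothetical counts)
print(round(score_per_movement(18.31, 1270.0), 2))    # 14.42, matching the F-MAAC row of Table 7
```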
In addition, the number of collisions of the F-MAAC model was about 9% less than that of the MAAC model. The improvement of the F-MAAC model’s sensor processing efficiency can be interpreted as having a positive effect on the obstacle avoidance performance of the UAV.

6. Conclusions

This study proposed an MAAC-based multiple-UAV navigation control model that improved energy efficiency through efficient data processing of the UAVs. The following significant findings were obtained.
(a) In the proposed model, the sensor fusion layer was adopted in the actor network, and the dissimilarity layer was utilized in the critic network. When applied to the UAV LDS simulation environment, the model outperformed the conventional RL models in terms of energy efficiency.
(b) The sensor fusion layer extracted features from each sensor, enabling the UAVs to use the various sensor data efficiently. The dissimilarity layer compensated for the information lost through the attention layer by additionally providing the data of agents with high dissimilarity.
(c) The F-MAAC-applied UAVs transported more cargo than the MAAC in the same amount of time and distance with greater cooperation and fewer collisions.
The feature of measuring the total movement of the UAVs was added to the existing UAV LDS environment to calculate energy efficiency, and we provided two indicators that quantify the energy efficiency of the UAVs. The proposed model showed the best performance on both energy efficiency indicators among the various RL models, including the original MAAC model. In future studies, further verification and development of the model are needed in a more sophisticated environment that includes realistic sensors and dynamic flight models. Furthermore, scalability should be verified in a broader environment where more agents exist.

Author Contributions

Conceptualization, S.J.; Investigation, S.J. and H.C.; Methodology, S.J. and H.J.; Project administration, V.K.K. and D.M.; Software, H.L.; Supervision, V.K.K. and D.M.; Validation, V.K.K. and T.A.N.; Visualization, S.J.; Writing—original draft, S.J.; Writing—review & editing, V.K.K. and T.A.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (no. 2020R1A6A1A03046811). This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (Ministry of Science and ICT (MSIT)) (no. 2021R1A2C209494311).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

UAV: Unmanned aerial vehicle
RL: Reinforcement learning
MADRL: Multiagent deep reinforcement learning
LDS: Logistic delivery service
MAAC: Multiactor-attention-critic
F-MAAC: Fusion-multiactor-attention-critic
DDPG: Deep deterministic policy gradient
MADDPG: Multiagent deep deterministic policy gradient
MDP: Markov decision process
INS: Inertial navigation system
RADAR: Radio detection and ranging

References

  1. Roldán, J.J.; Cerro, J.D.; Barrientos, A. A proposal of methodology for multi-UAV mission modeling. In Proceedings of the 2015 23rd Mediterranean Conference on Control and Automation (MED), Torremolinos, Spain, 16–19 June 2015; pp. 1–7.
  2. Abeywickrama, H.V.; Jayawickrama, B.A.; He, Y.; Dutkiewicz, E. Comprehensive energy consumption model for unmanned aerial vehicles, based on empirical studies of battery performance. IEEE Access 2018, 6, 58383–58394.
  3. Zhang, J.; Jiahao, X.I.N.G. Cooperative task assignment of multi-UAV system. Chin. J. Aeronaut. 2020, 33, 2825–2827.
  4. Nguyen, T.T.; Nguyen, N.D.; Nahavandi, S. Deep reinforcement learning for multiagent systems: A review of challenges, solutions, and applications. IEEE Trans. Cybern. 2020, 50, 3826–3839.
  5. Chang, H.; Chen, Y.; Zhang, B.; Doermann, D. Multi-UAV mobile edge computing and path planning platform based on reinforcement learning. IEEE Trans. Emerg. Top. Comput. Intell. 2021, 6, 489–498.
  6. Muñoz, G.; Barrado, C.; Çetin, E.; Salami, E. Deep reinforcement learning for drone delivery. Drones 2019, 3, 72.
  7. Qie, H.; Shi, D.; Shen, T.; Xu, X.; Li, Y.; Wang, L. Joint optimization of multi-UAV target assignment and path planning based on multi-agent reinforcement learning. IEEE Access 2019, 7, 146264–146272.
  8. Feng, L. Reinforcement learning to optimize the logistics distribution routes of unmanned aerial vehicle. arXiv 2020, arXiv:2004.09864.
  9. Walker, O.; Vanegas, F.; Gonzalez, F. A framework for multi-agent UAV exploration and target-finding in GPS-denied and partially observable environments. Sensors 2020, 20, 4739.
  10. Yun, W.J.; Jung, S.; Kim, J.; Kim, J.H. Distributed deep reinforcement learning for autonomous aerial eVTOL mobility in drone taxi applications. ICT Express 2021, 7, 1–4.
  11. Zhou, W.; Li, J.; Liu, Z.; Shen, L. Improving multi-target cooperative tracking guidance for UAV swarms using multi-agent reinforcement learning. Chin. J. Aeronaut. 2022, 35, 100–112.
  12. Xu, D.; Chen, G. Autonomous and cooperative control of UAV cluster with multi-agent reinforcement learning. Aeronaut. J. 2022, 126, 932–951.
  13. Zhan, G.; Zhang, X.; Li, Z.; Xu, L.; Zhou, D.; Yang, Z. Multiple-UAV Reinforcement Learning Algorithm Based on Improved PPO in Ray Framework. Drones 2022, 6, 166.
  14. Yu, C.; Velu, A.; Vinitsky, E.; Wang, Y.; Bayen, A.; Wu, Y. The surprising effectiveness of PPO in cooperative, multi-agent games. arXiv 2021, arXiv:2103.01955.
  15. Lowe, R.; Wu, Y.I.; Tamar, A.; Harb, J.; Abbeel, O.P.; Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  16. Iqbal, S.; Sha, F. Actor-attention-critic for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 2961–2970.
  17. Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; Whiteson, S. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 4295–4304.
  18. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347.
  19. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the Advances in Neural Information Processing Systems 12 (NIPS 1999), Denver, CO, USA, 29 November–4 December 1999; Volume 12.
  20. Hasselt, H.V.; Guez, A.; Silver, D. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; Volume 30.
  21. Jo, H.; Lee, H.; Jeon, S.; Kaliappan, V.K.; Nguyen, T.A.; Min, D.; Lee, J.W. Multi-Agent Reinforcement Learning-based UAS Control for Logistics Environments. In Proceedings of the Asia-Pacific International Symposium on Aerospace Technology, Jeju, Korea, 15–17 November 2021.
  22. Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings; Morgan Kaufmann: Waltham, MA, USA, 1994; pp. 157–163.
  23. Glorennec, P.Y. Reinforcement learning: An overview. In Proceedings of the European Symposium on Intelligent Techniques (ESIT-00), Aachen, Germany, 14–15 September 2000; pp. 14–15.
  24. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 387–395.
  25. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
  26. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Hassabis, D. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  28. Jeon, S.; Kaliappan, V.K. Dissimilarity Multi Actor Attention Critic based model for robot navigations in cooperative disaster recovery applications. In Proceedings of the International Virtual Conference on Industry 4.0, Amsterdam, The Netherlands, 22–24 September 2022.
  29. OpenAI Gym. Available online: https://github.com/openai/gym (accessed on 3 October 2022).
Figure 1. Conceptual diagram: (a) Markov decision process and (b) Markov game (a—action, s—state, r—reward).
Figure 2. Overall architecture of DDPG.
Figure 3. Critic network of MAAC.
Figure 4. F-MAAC model's overall architecture with training flow.
Figure 5. Actor network of F-MAAC model.
Figure 6. Critic network of F-MAAC model.
Figure 7. Dissimilarity layer in critic network.
Figure 8. UAV logistic delivery service virtual environment.
Figure 9. Mean episode rewards comparison of different models for 20k episodes.
Figure 10. Mean episode rewards of the MAAC and F-MAAC models for 150k episodes.
Figure 11. Comparison of delivery performance.
Table 1. Comparison of RL-based UAV applications.
Name of the Research | Year | Baseline | Actor-Critic | Single/Multiagent | Centralized/Decentralized | Application | Simulated Environment
Multiple-UAV Reinforcement Learning Algorithm Based on Improved PPO in Ray Framework [13] | 2022 | MAPPO [14] | Yes | Multiagent | Centralized critic with decentralized actor | Distributed decision-making and complete cooperation task | Unity collaborative combat environment (3D)
Autonomous and cooperative control of UAV cluster with multi-agent reinforcement learning [12] | 2022 | MADDPG [15] | Yes | Multiagent | Centralized critic with decentralized actor | Autonomous and cooperative control of UAV clusters | Conventional combat environment
Improving multi-target cooperative tracking guidance for UAV swarms using multi-agent reinforcement learning [11] | 2021 | MAAC [16] | Yes | Multiagent | Centralized critic with decentralized actor | Tracking the perceived targets and searching the unknown targets | Coordinate plane (2D)
Distributed deep reinforcement learning for autonomous aerial eVTOL mobility in drone taxi applications [10] | 2021 | QMIX [17] | Yes | Multiagent | Centralized critic with decentralized actor | Computing the optimal passenger transportation routes | 200-by-200 grid map (2D)
A Framework for Multi-Agent UAV Exploration and Target-Finding in GPS-Denied and Partially Observable Environments [9] | 2020 | ABT + PPO [18] | Yes | Multiagent | Decentralized actor and critic | Multiple-UAV exploration and target finding | Occupancy map with OpenAI Gym (2D) + 3DR Iris and 3DR Solo with Gazebo (3D)
Reinforcement Learning to Optimize the Logistics Distribution Routes of Unmanned Aerial Vehicle [8] | 2020 | PG [19] | Yes | Single agent | - | Path planning for UAVs in complex surroundings | Coordinate plane (2D)
Joint Optimization of Multi-UAV Target Assignment and Path Planning Based on Multi-Agent Reinforcement Learning [7] | 2019 | MADDPG [15] | Yes | Multiagent | Centralized critic with decentralized actor | Multiple-UAV target assignment and path planning | OpenAI's platform (2D)
Deep reinforcement learning for drone delivery [6] | 2019 | DDQN [20] | No | Single agent | - | Navigation with obstacle avoidance in realistic environment | Realistic neighborhood environment on AirSim (3D)
Table 2. Summary of observations.
Sensor Type | Size | Description
Ray-cast | 1 × 9 | Distances in 9 directions of the ray-cast sensor
Ray-cast | 2 × 9 | One-hot encoding of the detected object (nothing, building) in 9 directions of the ray-cast sensor
INS | 3 | (x, y, z) coordinates of UAV_i
INS | 3 | (x, y, z) velocity of UAV_i
INS | 3 | One-hot encoded cargo type (not holding, small cargo, big cargo)
RADAR | 6 | (x, y, z, x, y, z) coordinates of a big cargo hub and a small cargo hub
RADAR | 2 | Distances from UAV_i to the big and small cargo hubs
RADAR | 6 | (x, y, z, x, y, z) coordinates of the nearest big and small cargos
RADAR | 2 | Distances from UAV_i to the nearest big and small cargos
RADAR | 4 | (x, y, z, d) coordinates and distance of the destination, given if UAV_i holds any cargo
RADAR | 7 × 4 | Coordinates of UAV_j (size 3), cargo type of UAV_j (size 3), and distance from UAV_i to UAV_j (size 1) *
* UAV_i is the current UAV, and UAV_j are all UAVs except UAV_i.
Table 3. Summary of delivery rewards.
Action | Collaborative | First UAV | Second UAV
UAV picks up a small cargo | No | +20.0 | -
Small cargo delivery completed | No | +20.0 | -
First UAV picks up a big cargo | Yes | +10.0 | -
The second UAV picks up a big cargo | Yes | +10.0 | +20.0
Big cargo delivery completed | Yes | +30.0 | +30.0
First UAV drops a big cargo | Yes | −8.0 | -
Both UAVs drop a big cargo | Yes | −15.0 | −15.0
Table 4. Summary of environmental setup.
Parameter | Description | Default (Training) | Default (Execution)
NumAgent | Total number of UAVs | 5 | 5
width | Width of the Unity window | 480 pixels | 1280 pixels
height | Height of the Unity window | 270 pixels | 720 pixels
timescale | The multiplier for the time | 20× | -
mapsize | Size of the map | 13 m | 13 m
numbuilding | Number of buildings | 3 units | 3 units
MaxSmallbox | Total number of small cargos that can be generated | 100 units | 100 units
MaxBigbox | Total number of big cargos that can be generated | 100 units | 100 units
Table 5. Hyperparameter settings of RL models.
Hyperparameter | DDPG | MADDPG | MAAC | F-MAAC
Number of episodes | 1000 | 1000 | 1000 | 1000
Steps per update | 100 | 100 | 250 | 250
Batch size | 1024 | 1024 | 1024 | 1024
Number of attention heads | - | - | 4 | 4
Policy hidden dimension | 128 | 128 | 128 | 128
Learning rate of critic | 0.01 | 0.01 | 0.001 | 0.001
Learning rate of policy | 0.01 | 0.01 | 0.001 | 0.001
Table 6. Specifications and environmental setup of the GPU machine.
CPU | Intel i7-8700K
GPU | Nvidia RTX 3080
RAM | 64 GB
OS | Ubuntu 20.04 LTS
Deep Learning Framework | PyTorch 1.8.2
Table 7. Overall comparison of MAAC and F-MAAC models.
Metric | MAAC | F-MAAC
Score | 13.29 | 18.31
Movement | 1200 m | 1270 m
Collision | 9.2 | 8.4
Score_movement | 11.08 | 14.42
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

