Reinforcement Learning Algorithms

A special issue of Algorithms (ISSN 1999-4893). This special issue belongs to the section "Evolutionary Algorithms and Machine Learning".

Deadline for manuscript submissions: closed (31 March 2022) | Viewed by 13927

Special Issue Editor

Dr. Mehmet Aydin
Computer Science and Creative Technologies, University of the West of England, Bristol BS16 1QY, UK
Interests: metaheuristics; parallel computing; multi-agent systems; planning and scheduling

Special Issue Information

Dear Colleagues,

Reinforcement learning is a modern machine learning paradigm in which an agent gains experience at run-time and uses it to learn the relationship between input and output sets. It is particularly useful for modern robotic applications and softbot systems: it lets robotic and software agents interact intelligently with their operating environments, taking environmental events as stimuli and deciding on the best response to what is happening around them. The main principle behind reinforcement learning is to let an agent learn how to act optimally so as to maximise the environmental reward generated in response to its actions. Reinforcement learning is a hot research topic in AI, machine learning, and data science. A Special Issue on advances in reinforcement learning research will help capture the current state of the art in approaches, technologies, and applications. We are seeking recent research results in reinforcement learning.
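
For readers less familiar with the paradigm, the following sketch shows the core idea in its simplest form: a tabular Q-learning agent improving its behaviour purely from environmental reward. The toy chain environment and all hyperparameters are illustrative assumptions, not taken from any paper in this issue.

```python
import random

# Toy chain MDP: states 0..4; action 0 moves left, action 1 moves right.
# Reaching state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, (0, 1)
alpha, gamma, epsilon = 0.1, 0.95, 0.1   # illustrative hyperparameters

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy: mostly exploit the current value estimates
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # temporal-difference (Q-learning) update towards the bootstrapped target
        target = r + (0.0 if done else gamma * max(Q[(s2, act)] for act in ACTIONS))
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

print(max(ACTIONS, key=lambda act: Q[(0, act)]))   # learned first move: 1 (right)
```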

Dr. Mehmet Aydin
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Reinforcement learning
  • Q-learning
  • Temporal differences
  • Markovian decision processes
  • Deep reinforcement learning
  • Deep Q-network algorithm
  • Policy optimisation
  • Policy-based reinforcement learning
  • Actor-critic RL algorithms

Published Papers (5 papers)

Research

20 pages, 765 KiB  
Article
Outsmarting Human Design in Airline Revenue Management
by Giovanni Gatti Pinheiro, Michael Defoin-Platel and Jean-Charles Régin
Algorithms 2022, 15(5), 142; https://doi.org/10.3390/a15050142 - 22 Apr 2022
Cited by 2 | Viewed by 2676
Abstract
The accurate estimation of how future demand will react to prices is central to the optimization of pricing decisions. The systems responsible for demand prediction and pricing optimization are called revenue management (RM) systems, and, in the airline industry, they play an important role in the company’s profitability. As airlines’ current pricing decisions impact future knowledge of the demand behavior, the RM systems may have to compromise immediate revenue by efficiently performing price experiments with the expectation that the information gained about the demand behavior will lead to better future pricing decisions. This earning while learning (EWL) problem has captured the attention of both the industry and academia in recent years, resulting in many proposed solutions based on heuristic optimization. We take a different approach that does not depend on human-designed heuristics. We present the EWL problem to a reinforcement learning agent, and the agent’s goal is to maximize long-term revenue without explicitly considering the optimal way to perform price experimentation. The agent discovers through experience that “myopic” revenue-maximizing policies may lead to a decrease in the demand model quality (which it relies on to take decisions). We show that the agent finds novel pricing policies that balance revenue maximization and demand model quality in a surprisingly effective way, generating more revenue over the long run than current practices.
(This article belongs to the Special Issue Reinforcement Learning Algorithms)
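
Not the authors' agent, but a minimal sketch of the earning-while-learning tension the abstract describes: a seller estimates demand for each price from its own sales data, so a purely greedy ("myopic") policy stops refining its estimates for the prices it abandons. The two-price demand model and all numbers here are hypothetical.

```python
import random

PRICES = [100.0, 140.0]
TRUE_BUY_PROB = {100.0: 0.6, 140.0: 0.5}   # hypothetical demand; 140 wins in expectation

offers = {p: 0 for p in PRICES}
sales = {p: 0 for p in PRICES}

def estimated_revenue(p):
    # demand model fitted to the agent's own observations;
    # unexplored prices get an optimistic estimate
    return p * sales[p] / offers[p] if offers[p] else float("inf")

random.seed(0)
epsilon = 0.05                               # small budget for price experiments
for _ in range(10_000):
    p = random.choice(PRICES) if random.random() < epsilon \
        else max(PRICES, key=estimated_revenue)
    offers[p] += 1
    sales[p] += random.random() < TRUE_BUY_PROB[p]

print({p: round(estimated_revenue(p), 1) for p in PRICES})
```

With epsilon = 0 the estimate for an abandoned price never improves, which is exactly why RM systems may need to trade immediate revenue for information about demand.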

44 pages, 3503 KiB  
Article
Long-Term Visitation Value for Deep Exploration in Sparse-Reward Reinforcement Learning
by Simone Parisi, Davide Tateo, Maximilian Hensel, Carlo D’Eramo, Jan Peters and Joni Pajarinen
Algorithms 2022, 15(3), 81; https://doi.org/10.3390/a15030081 - 28 Feb 2022
Cited by 2 | Viewed by 2526
Abstract
Reinforcement learning with sparse rewards is still an open challenge. Classic methods rely on getting feedback via extrinsic rewards to train the agent, and in situations where this occurs very rarely the agent learns slowly or cannot learn at all. Similarly, if the agent also receives rewards that create suboptimal modes of the objective function, it will likely stop exploring prematurely. More recent methods add auxiliary intrinsic rewards to encourage exploration. However, auxiliary rewards lead to a non-stationary target for the Q-function. In this paper, we present a novel approach that (1) plans exploration actions far into the future by using a long-term visitation count, and (2) decouples exploration and exploitation by learning a separate function assessing the exploration value of the actions. Contrary to existing methods that use models of reward and dynamics, our approach is off-policy and model-free. We further propose new tabular environments for benchmarking exploration in reinforcement learning. Empirical results on classic and novel benchmarks show that the proposed approach outperforms existing methods in environments with sparse rewards, especially in the presence of rewards that create suboptimal modes of the objective function. Results also suggest that our approach scales gracefully with the size of the environment.
(This article belongs to the Special Issue Reinforcement Learning Algorithms)
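
The paper's key move is to learn a separate, long-term exploration value rather than augmenting the reward; the sketch below shows only the simpler count-based ingredient such methods build on, an intrinsic bonus that decays with state-action visitation counts. The bonus scale and discount are illustrative assumptions.

```python
import math
from collections import defaultdict

counts = defaultdict(int)     # N(s, a): state-action visitation counts
BETA, GAMMA = 0.5, 0.99       # illustrative bonus scale and discount

def exploration_bonus(s, a):
    # classic count-based bonus: large for rarely tried pairs, vanishing as N grows
    return BETA / math.sqrt(counts[(s, a)] + 1)

def augmented_td_target(s, a, r, s2, done, Q, actions):
    counts[(s, a)] += 1
    bootstrap = 0.0 if done else GAMMA * max(Q[(s2, a2)] for a2 in actions)
    # adding the bonus to the extrinsic reward makes the Q-target non-stationary --
    # precisely the drawback the paper avoids by decoupling exploration value from Q
    return r + exploration_bonus(s, a) + bootstrap
```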

17 pages, 550 KiB  
Article
Transfer Learning for Operator Selection: A Reinforcement Learning Approach
by Rafet Durgut, Mehmet Emin Aydin and Abdur Rakib
Algorithms 2022, 15(1), 24; https://doi.org/10.3390/a15010024 - 17 Jan 2022
Cited by 4 | Viewed by 2706
Abstract
In the past two decades, metaheuristic optimisation algorithms (MOAs) have become increasingly popular, particularly in logistics, science, and engineering problems. The fundamental characteristic of such algorithms is that they depend on a parameter or a strategy. Some online and offline strategies are employed in order to obtain optimal configurations of the algorithms. Adaptive operator selection is one of them, and it determines whether or not to update a strategy from the strategy pool during the search process. In the field of machine learning, Reinforcement Learning (RL) refers to goal-oriented algorithms, which learn from the environment how to achieve a goal. In MOAs, reinforcement learning has been utilised to control the operator selection process. However, existing research fails to show that learned information may be transferred from one problem-solving procedure to another. The primary goal of the proposed research is to determine the impact of transfer learning on RL and MOAs. As a test problem, a set union knapsack problem with 30 separate benchmark problem instances is used. The results are statistically compared in depth. According to the findings, the learning process improved the convergence speed while significantly reducing the CPU time.
(This article belongs to the Special Issue Reinforcement Learning Algorithms)
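
As a rough, hedged illustration of RL-driven adaptive operator selection (a bandit-style stand-in, not the paper's exact formulation): each operator carries a credit estimate updated from the improvement it produces, and transfer learning amounts to warm-starting a new instance's estimates with those learned on a previous one.

```python
import random

class OperatorSelector:
    """Bandit-style adaptive operator selection (illustrative sketch)."""

    def __init__(self, operators, alpha=0.1, epsilon=0.1, warm_start=None):
        # transfer learning: reuse credits learned on a previous problem instance
        self.q = dict(warm_start) if warm_start else {op: 0.0 for op in operators}
        self.alpha, self.epsilon = alpha, epsilon

    def select(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.q))        # explore an operator
        return max(self.q, key=self.q.get)            # exploit the best credit

    def update(self, op, improvement):
        # credit = observed improvement in the objective after applying op
        self.q[op] += self.alpha * (improvement - self.q[op])

source = OperatorSelector(["flip", "swap", "insert"])
source.update("swap", improvement=0.8)                # hypothetical objective gain
target = OperatorSelector(["flip", "swap", "insert"], warm_start=source.q)
print(target.select())
```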

22 pages, 682 KiB  
Article
Multi-Objective UAV Positioning Mechanism for Sustainable Wireless Connectivity in Environments with Forbidden Flying Zones
by İbrahim Atli, Metin Ozturk, Gianluca C. Valastro and Muhammad Zeeshan Asghar
Algorithms 2021, 14(11), 302; https://doi.org/10.3390/a14110302 - 21 Oct 2021
Cited by 3 | Viewed by 1572
Abstract
A communication system based on unmanned aerial vehicles (UAVs) is a viable alternative for meeting the coverage and capacity needs of future wireless networks. However, because of the limitations of UAV-enabled communications in terms of coverage, energy consumption, and flying laws, the number of studies focused on the sustainability element of UAV-assisted networking has been limited thus far. We present a solution to this problem in this study; specifically, we design a Q-learning-based UAV placement strategy for long-term wireless connectivity while taking into account major constraints such as altitude regulations, no-fly zones, and transmit power. The goal is to determine the best location for the UAV base station (BS) while reducing energy consumption and increasing the number of users covered. Furthermore, a weighting method is devised, allowing energy usage and the number of users served to be prioritized based on network/battery circumstances. The suggested Q-learning-based solution is contrasted with the standard k-means clustering method, in which the UAV BS is positioned at the centroid location with the shortest cumulative distance between it and the users. The results demonstrate that the proposed solution outperforms the baseline k-means clustering-based method in terms of the number of users covered while achieving the desired minimization of energy consumption.
(This article belongs to the Special Issue Reinforcement Learning Algorithms)
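
A minimal sketch of the kind of weighting method the abstract describes, combining coverage and energy objectives in one reward; the signature, normalisation, and the battery-driven choice of the weight are illustrative assumptions, not the paper's exact formulation.

```python
def uav_reward(users_covered, total_users, energy_used, energy_budget, w=0.5):
    """Weighted multi-objective reward for a UAV base station placement agent.

    w -> 1 prioritises coverage (e.g., when the battery is healthy);
    w -> 0 prioritises energy saving (e.g., when the battery runs low).
    """
    coverage_term = users_covered / total_users           # in [0, 1]
    energy_term = 1.0 - energy_used / energy_budget       # in [0, 1]
    return w * coverage_term + (1.0 - w) * energy_term

# hypothetical numbers: 42 of 60 users covered, 30% of the energy budget spent
print(uav_reward(users_covered=42, total_users=60,
                 energy_used=30.0, energy_budget=100.0, w=0.7))
```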

20 pages, 1211 KiB  
Article
Multiagent Hierarchical Cognition Difference Policy for Multiagent Cooperation
by Huimu Wang, Zhen Liu, Jianqiang Yi and Zhiqiang Pu
Algorithms 2021, 14(3), 98; https://doi.org/10.3390/a14030098 - 21 Mar 2021
Viewed by 2604
Abstract
Multiagent cooperation is one of the most attractive research fields in multiagent systems. Researchers in this field have made many attempts to promote cooperative behavior. However, several issues still exist, such as complex interactions among different groups of agents and redundant communication content from irrelevant agents, which prevent the learning and convergence of agent cooperation behaviors. To address the limitations above, a novel method called multiagent hierarchical cognition difference policy (MA-HCDP) is proposed in this paper. It includes a hierarchical group network (HGN), a cognition difference network (CDN), and a soft communication network (SCN). HGN is designed to distinguish the underlying information in diverse groups’ observations (including a friendly group, an enemy group, and an object group) and to extract different high-dimensional state representations for different groups. CDN is designed based on a variational auto-encoder to allow each agent to choose its neighbors (communication targets) adaptively according to its cognition difference with respect to the environment. SCN is designed to handle the complex interactions among the agents with a soft attention mechanism. The results of simulations demonstrate the superior effectiveness of our method compared with existing methods.
(This article belongs to the Special Issue Reinforcement Learning Algorithms)
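
The soft communication idea can be pictured with a generic scaled dot-product attention over neighbours' messages; the sketch below is a plain NumPy illustration of that mechanism, not the paper's trained SCN.

```python
import numpy as np

def soft_attention_messages(query, messages):
    """Aggregate neighbours' messages with soft attention weights.

    query:    (d,)   the agent's own state embedding
    messages: (n, d) one embedding per communicating neighbour
    """
    scores = messages @ query / np.sqrt(query.size)   # scaled dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over neighbours
    return weights @ messages                         # weighted aggregation

rng = np.random.default_rng(0)
agg = soft_attention_messages(rng.normal(size=8), rng.normal(size=(3, 8)))
print(agg.shape)   # (8,)
```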
