Article

A Modified Rainbow-Based Deep Reinforcement Learning Method for Optimal Scheduling of Charging Station

1 School of Electrical Engineering, Southeast University, Nanjing 210096, China
2 School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
* Author to whom correspondence should be addressed.
Sustainability 2022, 14(3), 1884; https://0-doi-org.brum.beds.ac.uk/10.3390/su14031884
Submission received: 10 January 2022 / Revised: 29 January 2022 / Accepted: 3 February 2022 / Published: 7 February 2022

Abstract

To improve operating efficiency and economic benefits, this article proposes a modified rainbow-based deep reinforcement learning (DRL) strategy for the optimal scheduling of charging stations (CSs). Because the charging process is a real-time matching between electric vehicles' (EVs) charging demand and CS equipment resources, the CS charging scheduling problem is formulated as a finite Markov decision process (FMDP). Considering the multi-stakeholder interaction among EVs, CSs, and distribution networks (DNs), a comprehensive information perception model is constructed to extract the environmental state required by the agent. Given the random nature of EV arrival and departure times, the startup of the charging pile control module is taken as the agent's action space. The modified rainbow approach is then used to develop a time-scale-based CS scheduling scheme that compensates for the mismatch between charging demand and equipment resources on the energy scale. Case studies were conducted on a CS integrated with a photovoltaic and energy storage system. The results reveal that the proposed method effectively reduces the CS operating cost and improves new energy consumption.

1. Introduction

As an environmentally friendly means of transportation, electric vehicles (EVs) have received much attention in recent years. Governments worldwide encourage transportation electrification to address energy and environmental problems [1,2]. However, the explosive growth of EVs poses challenges for distribution network (DN) operation and charging infrastructure construction [3,4,5]. Problems such as the difficulty EV users face in charging and the uneven distribution of charging facilities hinder the healthy development of EVs. In particular, EV owners commonly do not start charging immediately after arriving at the charging station (CS) and continue to occupy charging piles after their charging tasks are completed; this is known as the "overstay issue." A real-time CS optimal scheduling strategy is therefore a prerequisite for resolving the mismatch between EV charging requirements and CS resources and for improving the charging experience of EV users.
Various optimal scheduling methods have been proposed to realize the economic operation of CSs. A dynamic electricity price model based on the ant colony algorithm is proposed in [6]; it solves the optimal pricing strategy and reduces the overlap between residential and CS charging loads. Luo et al. [7] customize a stochastic dynamic programming (SDP)-based dynamic pricing strategy for charging service providers; compared with a greedy algorithm, the SDP effectively improves their economic benefits. Given that static peak-valley electricity prices cannot reflect EV charging demands, a dynamic pricing strategy that considers user satisfaction is developed in [8] to optimize the economic benefits of EVs and power grids. The work presented in [9] proposes a nonlinear optimization model for EV charging scheduling; it uses a hybrid Nelder–Mead cuckoo search algorithm and effectively reduces the energy demand gap of EV users. Li et al. [10] propose a scheme for optimizing the cost of CSs and the contract capacity of aggregators; a chance-constrained programming method is adopted, reducing the risk that the aggregator cannot meet the power grid demand. For a CS equipped with photovoltaic (PV) generation and an energy storage system (ESS), reference [11] exploits the peak–valley tariff mechanism to formulate a CS energy management strategy that promotes PV consumption. To maximize CS profit, Nishimwe H. et al. [12] present an optimal scheduling method considering intermittent PV generation and the EV arrival pattern. To address overstay parking in CSs, an interchange mechanism is presented in [13] to improve the utilization of charging facilities.
Although the above-mentioned studies analyze the CS optimization control mechanism in depth, they rely on conventional mathematical programming models, which are essentially time-consuming offline approaches. With the massive proliferation of EVs expected in the future, it is challenging for such methods to deploy charging scheduling schemes flexibly and respond rapidly to EV charging requests.
Fortunately, with advances in artificial intelligence, especially deep reinforcement learning (DRL), applying DRL to CS scheduling has become a popular research direction. A batch reinforcement learning-based optimal charging strategy is proposed in [14] to reduce the charging cost of EV users and realize proper CS scheduling across different dimensions. Wang et al. [15] improve reinforcement learning through a feature-based linear function approximation algorithm and optimize the charging scheduling and pricing scheme. However, limited by the dimensions of the action and state spaces, tabular reinforcement learning struggles with large-scale, complex decision-making problems. To balance charging costs and battery degradation, Wan et al. [16] model the uncertainty of the scheduling process as a Markov decision process; a long short-term memory network captures the temporal characteristics of charging prices, and a Q network controls the real-time charging power of EVs. Considering the randomness of EVs and the uncertainty of new energy generation, the authors of [17] propose a deep Q network (DQN)-based real-time scheduling method to reduce power fluctuations and optimize charging costs. Considering dynamic traffic information and the uncertainty of future states, studies [18,19] use DRL to help EV owners choose optimal routes and CSs to minimize the total time cost. The charging and discharging scheduling problem of CSs is formulated as a finite Markov decision process (FMDP) in [20], and a safe deep reinforcement learning method is used to reduce users' charging cost while ensuring that batteries are fully charged.
The above-mentioned studies represent preliminary explorations of combining CS scheduling with DRL methods, but state-of-the-art scheduling methods in this field still have significant limitations. In terms of application, most existing DRL studies focus only on the single perspective of user interests or CS interests, which is unrealistic given the interaction of multiple stakeholders in the charging service process. In addition, DRL-based methods rarely consider application scenarios such as special working conditions and extreme weather, and the scalability of online scheduling lacks further evaluation. In terms of algorithms, the most popular DQN algorithm suffers from slow convergence and Q-value overestimation.
The rainbow algorithm proposed in [21] integrates multiple DQN-based extensions and shows remarkable performance in demand response [22], predictive panoramic video delivery [23], and wind farm generation control [24]. However, the adaptive learning ability of the rainbow agent still needs improvement. Therefore, combining CS application requirements with DQN-based extensions, this study proposes a modified rainbow algorithm to formulate the optimal scheduling strategy for CSs. The main contributions of this study are threefold.
  • To the best of our knowledge, this is the first CS scheduling strategy that combines the random charging behavior characteristics of EVs with DRL. It improves the agent's perception and learning ability through comprehensive perception of the "EV-CS-DN" environment information. Considering the uncertainty of EV arrival and departure times, the grid-connection time of each EV is controlled through the relay action of the charging module to match energy resources within the EV parking time slot. The proposed method reasonably resolves the overstay issue and improves the operating efficiency of CSs.
  • As the basic version of the DQN-based rainbow algorithm suffers from overlearning and poor stability in the late training stage, we improve it by introducing a learning rate attenuation strategy. The agent maintains a large learning rate in the early training stage to ensure exploration ability; as the episodes increase, the learning rate gradually decays until it settles at a low level, ensuring that the agent makes full use of previous experience in the later training stage.
  • Under realistic CS operating scenarios, we further verify the practicability of the proposed model and algorithm. The experimental results show that the modified rainbow method overcomes the low training efficiency and poor application stability of classical DRL algorithms, and the CS operating cost and new energy consumption are effectively optimized. In particular, the proposed method adapts well to extreme weather and equipment failure scenarios.
The rest of the article is organized as follows. The problem formulation based on FMDP is introduced in Section 2. Then, Section 3 sketches the modified rainbow-based DRL algorithm. Case studies are reported in Section 4 for assessing our proposed methodology. Finally, Section 5 concludes the article.

2. Problem Formulation

The essence of the CS scheduling problem is matching time-scale-based EV charging demands with energy-scale-based CS equipment resources. Because the state of EVs, CSs, and DNs at the next moment depends only on the current state and is independent of earlier states, the CS optimization problem can be formulated as a finite Markov decision process (FMDP) [25]. The scheduling architecture is detailed in Figure 1. Based on its perception of the current "EV-CS-DN" environment state, the agent selects the startup time of the charging pile control module using the evaluation network and obtains the corresponding reward. Historical samples are accumulated through these interactions, and the agent is then trained with the modified rainbow algorithm. Finally, the optimal mapping from the state space S to the action space A is obtained, which constitutes the online scheduling strategy. The developed model optimizes the cost and operating efficiency of CSs while meeting the charging demands of EV users. The modeling process is described as follows.

2.1. State

The state is the agent's perception of the external environment, and the state space S is the set of environment states. The environment is divided into the EV, CS, and DN parts to capture comprehensive information accurately. The state $s_t$ is given in Equation (1).
$$ s_t = \left\{ \underbrace{T_t^{\mathrm{arr}},\, S_t^{\mathrm{arr}},\, T_t^{\mathrm{lea}},\, S_t^{\mathrm{exp}}}_{\text{EV}},\; \underbrace{P_t^{\mathrm{PV}},\, E_t^{\mathrm{ESS}},\, P_t^{\mathrm{EV}}}_{\text{CS}},\; \underbrace{\lambda_t^{\mathrm{DN}}}_{\text{DN}} \right\} \tag{1} $$
where $T_t^{\mathrm{arr}}$ and $S_t^{\mathrm{arr}}$ are the EV arrival time and the battery state of charge (SOC) when the user arrives at the CS, respectively; $T_t^{\mathrm{lea}}$ and $S_t^{\mathrm{exp}}$ are the departure time and the expected SOC when the user leaves the CS, respectively; $P_t^{\mathrm{PV}}$ is the PV output; $E_t^{\mathrm{ESS}}$ is the remaining energy of the ESS; $P_t^{\mathrm{EV}}$ is the total charging load in the CS; and $\lambda_t^{\mathrm{DN}}$ is the time-of-use price of the DN.
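To make Equation (1) concrete, the following minimal Python sketch flattens the state into a numeric feature vector suitable as a neural-network input. The normalization constants taken from Table 1 (225 kW PV maximum, 295.68 kWh ESS capacity) are from the article; the remaining values, such as the assumed 600 kW station rating, are illustrative only.

# Illustrative sketch: flattening the state of Equation (1) into a feature
# vector. Constants marked "assumed" are not taken from the article.
import numpy as np

def build_state(T_arr, S_arr, T_lea, S_exp, P_pv, E_ess, P_ev, price_dn):
    """Return s_t as a normalized 8-dimensional vector."""
    return np.array([
        T_arr / 24.0,      # EV arrival time (h)
        S_arr,             # SOC on arrival, already in [0, 1]
        T_lea / 24.0,      # EV departure time (h)
        S_exp,             # expected SOC at departure
        P_pv / 225.0,      # PV output over its 225 kW maximum (Table 1)
        E_ess / 295.68,    # remaining ESS energy over its capacity (Table 1)
        P_ev / 600.0,      # total CS charging load over an assumed 600 kW rating
        price_dn,          # time-of-use price of the DN (USD/kWh)
    ], dtype=np.float32)

s_t = build_state(T_arr=8.5, S_arr=0.30, T_lea=17.0, S_exp=0.90,
                  P_pv=120.0, E_ess=150.0, P_ev=180.0, price_dn=0.12)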

2.2. Action

The action is the decision taken by the agent in response to the environment state. To address the overstay issue in CSs, the startup time of the charging module is controlled by the agent according to the "EV-CS-DN" environment state. The action $a_t$ is defined below.
$$ a_t = \kappa_t, \quad \kappa_t \in [0, T_i^{\mathrm{park}}] \tag{2} $$
where $\kappa_t$ is the startup time of the charging pile control module, i.e., the moment at which the EV is connected to the DC line of the CS, and $T_i^{\mathrm{park}}$ is the parking duration after the EV is plugged into the charging pile. The selection range in Equation (2) alone cannot guarantee that the EV battery reaches the expected SOC when the owner leaves the CS; therefore, the agent's action selection is further guided through the reward function.
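As an illustration of the action space in Equation (2), the sketch below discretizes the startup time $\kappa_t$ into 15-minute slots over the parking window; the slot length is an assumption, since the article does not state the discretization step.

# Illustrative sketch: discretizing the startup time kappa_t of Equation (2).
# The 15-minute slot length is an assumption.
SLOT_H = 0.25  # hours per decision slot (assumed)

def candidate_startup_times(T_park_hours):
    """All admissible startup times kappa_t in [0, T_park], in hours."""
    n_slots = int(T_park_hours / SLOT_H) + 1
    return [k * SLOT_H for k in range(n_slots)]

def startup_time(action_index, T_park_hours):
    """Map a discrete action index to kappa_t, clipped to the parking window."""
    candidates = candidate_startup_times(T_park_hours)
    return candidates[min(action_index, len(candidates) - 1)]

# An EV parked for 6 h has 25 candidate startup times: 0.0, 0.25, ..., 6.0 h.
print(startup_time(action_index=10, T_park_hours=6.0))  # 2.5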

2.3. Reward

The reward is the feedback obtained by the agent after selecting an action and is the most critical element in training the agent to learn an ability or achieve a goal. Considering that the EV charging process involves multiple stakeholders, the reward function is designed from three aspects: EVs, CSs, and clean energy consumption. The total reward $r_t$ of the agent is constructed as follows.
$$ r_t = r_t^{\mathrm{EV}} + r_t^{\mathrm{CS}} + r_t^{\mathrm{PV}} \tag{3} $$
where $r_t^{\mathrm{EV}}$ is the EV charging satisfaction reward, $r_t^{\mathrm{CS}}$ is the reward associated with the CS operating cost, and $r_t^{\mathrm{PV}}$ is the PV curtailment penalty.
1. EV charging satisfaction cost
The most fundamental task of CSs is to respond to the charging demand of EV users. Given that an EV may not be fully charged when the user leaves, the reward representing the EV charging satisfaction cost is designed as follows.
$$ r_t^{\mathrm{EV}} = -\lambda^{\mathrm{EV}} \left( S^{\mathrm{exp}} - S^{\mathrm{lea}} \right) \tag{4} $$
where $S^{\mathrm{exp}}$ and $S^{\mathrm{lea}}$ are the user's expected SOC and the actual SOC when leaving the CS, respectively, and $\lambda^{\mathrm{EV}}$ is the penalty coefficient for incomplete EV charging.
2. CS operation cost
The power purchase cost accounts for a significant portion of the CS operating cost. In addition, considering the impact of charging and discharging on ESS life, we take the power purchase cost and the ESS charging–discharging degradation cost together as the operating cost of the CS.
$$ r_t^{\mathrm{CS}} = -\sum_{t=T_t^{\mathrm{arr}}}^{T_t^{\mathrm{lea}}} P_t^{\mathrm{DN}} \lambda_t^{\mathrm{DN}} \Delta t - \sum_{t=T_t^{\mathrm{arr}}}^{T_t^{\mathrm{lea}}} \left| P_t^{\mathrm{ESS}} \right| \lambda^{\mathrm{ESS}} \Delta t \tag{5} $$
where $P_t^{\mathrm{DN}}$ is the power provided by the DN to the CS, $P_t^{\mathrm{ESS}}$ is the charging/discharging power of the ESS, and $\lambda^{\mathrm{ESS}}$ is the degradation cost coefficient of the ESS [26].
3. PV curtailment penalty
To promote friendly interaction between CS equipment resources and EVs, a PV curtailment penalty is introduced to improve clean energy consumption and the efficient utilization of charging resources.
$$ r_t^{\mathrm{PV}} = -\sum_{t=T_t^{\mathrm{arr}}}^{T_t^{\mathrm{lea}}} \left( P_t^{\mathrm{PV,e}} - P_t^{\mathrm{PV}} \right) \lambda^{\mathrm{PV}} \Delta t \tag{6} $$
where $P_t^{\mathrm{PV,e}}$ and $P_t^{\mathrm{PV}}$ are the available PV generation and the actual PV output power, respectively, and $\lambda^{\mathrm{PV}}$ is the PV curtailment penalty coefficient [27].
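The sketch below assembles the three reward terms of Equations (3)–(6) for one EV's parking window. The coefficients follow Table 1, while the per-step power profiles passed in are hypothetical inputs.

# Illustrative sketch of the reward of Equations (3)-(6) over one parking
# window. Power profiles are hypothetical; coefficients follow Table 1.
LAMBDA_EV, LAMBDA_ESS, LAMBDA_PV = 15.82, 0.01, 0.0158  # USD, USD/kWh, USD/kWh
DT = 0.25  # hours per step (assumed)

def ev_reward(S_exp, S_lea):
    # Equation (4): penalty for leaving below the expected SOC.
    return -LAMBDA_EV * (S_exp - S_lea)

def cs_reward(P_dn, price_dn, P_ess):
    # Equation (5): power purchase cost plus ESS degradation cost.
    purchase = sum(p * pr * DT for p, pr in zip(P_dn, price_dn))
    degradation = sum(abs(p) * LAMBDA_ESS * DT for p in P_ess)
    return -(purchase + degradation)

def pv_reward(P_pv_avail, P_pv_used):
    # Equation (6): penalty on curtailed PV energy.
    curtailed = sum((a - u) * LAMBDA_PV * DT for a, u in zip(P_pv_avail, P_pv_used))
    return -curtailed

def total_reward(S_exp, S_lea, P_dn, price_dn, P_ess, P_pv_avail, P_pv_used):
    # Equation (3): r_t = r_EV + r_CS + r_PV
    return (ev_reward(S_exp, S_lea)
            + cs_reward(P_dn, price_dn, P_ess)
            + pv_reward(P_pv_avail, P_pv_used))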

2.4. Action-Value Function

The action-value function $Q_\pi(s,a)$ (Q-value) evaluates the quality of action $a_t$ taken in a particular state $s_t$. It can be expressed as follows.
$$ Q_\pi(s,a) = \mathbb{E}\left[ \sum_{k=0}^{K} \gamma^k r_{t+k} \,\Big|\, s_t = s,\, a_t = a \right] \tag{7} $$
where $\pi$ is the policy that maps a comprehensive state to a charging schedule, $K$ is the horizon of time steps, and $\gamma \in [0,1]$ is the discount factor. When $\gamma$ is close to 0, the agent is more concerned with the immediate reward; otherwise, the agent is more concerned with future rewards.
Using the Bellman equation, the action-value function can be further expressed as follows:
$$ Q_\pi(s_t, a_t) = \mathbb{E}\left[ r_t(s_t, a_t, s_{t+1}) + \gamma Q_\pi(s_{t+1}, a_{t+1}) \right] \tag{8} $$
Furthermore, solving the optimal policy π * is equivalent to maximizing the action-value function:
$$ Q_{\pi^*}(s,a) = \max_{\pi} Q_\pi(s,a) \tag{9} $$
Accordingly, the Bellman optimality equation of the action-value function can be denoted as Equation (10).
$$ Q_{\pi^*}(s_t, a_t) = \mathbb{E}\left[ r_t(s_t, a_t, s_{t+1}) + \gamma \max_{a_{t+1}} Q_{\pi^*}(s_{t+1}, a_{t+1}) \right] \tag{10} $$
In Q-learning, the state, action, and action value are recorded in a table (the Q-table), and the agent recommends actions by querying this table. During training, the action value is updated according to Equation (11).
$$ Q(s_t, a_t) = Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \tag{11} $$
where $\alpha$ is the learning rate, which balances the importance of prior knowledge and the current estimate.
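A minimal tabular implementation of the update in Equation (11) is sketched below with a toy integer state/action encoding; it is meant only to show the update mechanics.

# Minimal tabular Q-learning update of Equation (11).
# States and actions are assumed to be small integer indices (toy setting).
from collections import defaultdict

Q = defaultdict(float)        # Q-table: (state, action) -> value
alpha, gamma = 0.1, 0.9
ACTIONS = range(5)

def q_update(s, a, r, s_next):
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    td_target = r + gamma * best_next
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

q_update(s=0, a=2, r=-1.2, s_next=1)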
However, limited by the action and state dimensions, Q-learning struggles to fit high-dimensional nonlinear relationships. Building on the Q-learning framework, DQN replaces the Q-table with a deep neural network to fit the mapping among state, action, and action value; its update rule is expressed below.
$$ Q(s_t, a_t; \theta^+) = Q(s_t, a_t; \theta^+) + \alpha \left[ r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^-) - Q(s_t, a_t; \theta^+) \right] \tag{12} $$
where $\theta^+$ and $\theta^-$ are the parameters of the evaluation network and the target network, respectively. During training, the parameters $\theta^+$ are copied to the target network every few steps, and the cooperation of the two networks improves the stability of the algorithm.
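A compact PyTorch-style sketch of the update in Equation (12) is given below, including the periodic copy of $\theta^+$ to the target network. The 120/80 hidden layer sizes and the 300-step sync interval follow Section 4.1; the choice of Adam and its learning rate are placeholders, not the authors' settings.

# Sketch of the DQN update of Equation (12) with evaluation/target networks.
# Optimizer choice and its learning rate are placeholders.
import copy
import torch
import torch.nn as nn

n_state, n_action = 8, 25

def make_net():
    return nn.Sequential(nn.Linear(n_state, 120), nn.ReLU(),
                         nn.Linear(120, 80), nn.ReLU(),
                         nn.Linear(80, n_action))

eval_net = make_net()
target_net = copy.deepcopy(eval_net)          # theta- starts as a copy of theta+
optimizer = torch.optim.Adam(eval_net.parameters(), lr=1e-3)
gamma, sync_every = 0.9, 300

def dqn_update(step, s, a, r, s_next, done):
    # s, s_next: float tensors [batch, n_state]; a: long tensor [batch];
    # r, done: float tensors [batch] (done = 1.0 at episode end).
    q_sa = eval_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        target = r + gamma * (1.0 - done) * q_next        # Equation (12) target
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % sync_every == 0:                            # copy theta+ -> theta-
        target_net.load_state_dict(eval_net.state_dict())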

3. Proposed Modified Rainbow-Based Solution

The effectiveness of various DQN-based extensions has been verified in many fields [21,22,23,24]. However, with a fixed learning rate, the popular rainbow algorithm cannot simultaneously handle early-stage exploration and later-stage exploitation; overlearning therefore appears in the later training stage, making the algorithm unstable. Thus, we introduce a modified rainbow-based version for the CS scheduling problem. Double DQN (DDQN), dueling DQN, and a prioritized replay buffer are integrated into the classical DQN to improve the stability and convergence of the algorithm. Furthermore, a learning rate attenuation strategy is incorporated, which increases the exploration ability in early training and makes effective use of accumulated experience in later training. The details of the proposed algorithm are as follows.
1. Double DQN
DDQN changes the Q-value update rule of the classical DQN: the action with the maximum Q-value is selected by the evaluation network, and the Q-value of that action is then provided by the target network. The cooperation of the two networks effectively alleviates Q-value overestimation.
2. Dueling DQN
The structure of the evaluation network is modified so that the Q-value is output as the combination of the state value $V(s_t)$ and the action advantage $A(s_t, a_t)$. When the agent takes different actions whose Q-values are close, dueling DQN removes redundant degrees of freedom and improves stability.
3. Prioritized replay buffer
In the standard DQN training phase, samples are drawn from the experience replay buffer with uniform probability, whereas the prioritized replay buffer uses the loss value to specify the priority level. As shown in Equation (13), samples are ranked according to their temporal-difference (TD) error values.
$$ P_t \propto \left| r_t + \gamma Q\!\left( s_{t+1}, \arg\max_{a} Q(s_{t+1}, a; \theta^+); \theta^- \right) - Q(s_t, a_t; \theta^+) \right|^{\omega} \tag{13} $$
4. Learning rate attenuation
To balance training speed and algorithm stability, the learning rate in deep learning is usually attenuated as the iterations proceed. We adopt linear cosine decay to adjust the learning rate of the DQN. The attenuation model is described below.
$$ \alpha = c_{\mathrm{decay}} \, \alpha_0 \tag{14} $$
where $\alpha_0$ is the initial learning rate and $c_{\mathrm{decay}}$ is the attenuation coefficient.
$$ c_{\mathrm{decay}} = \frac{1}{2}\left( 1 + \cos(\xi_n \pi) \right)\left( 1 - \alpha_{\min} \right) + \alpha_{\min} \tag{15} $$
$$ \xi_n = \frac{\min(n, n_{\mathrm{decay}})}{n_{\mathrm{decay}}} \tag{16} $$
where $\xi_n$ is the cosine coefficient, $\alpha_{\min}$ is the minimum learning rate, and $n$ and $n_{\mathrm{decay}}$ are the current episode and the attenuation episode, respectively.
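The learning rate schedule of Equations (14)–(16) takes only a few lines to implement. The sketch below uses the settings reported in Section 4.1 ($\alpha_0 = 0.15$, $\alpha_{\min} = 0.001$, $n_{\mathrm{decay}} = 700$); with these values, episode 287 gives $\alpha \approx 0.096$, consistent with the value quoted in Section 4.2.

# Cosine learning rate attenuation of Equations (14)-(16),
# using the settings reported in Section 4.1.
import math

ALPHA_0, ALPHA_MIN, N_DECAY = 0.15, 0.001, 700

def learning_rate(n):
    """Learning rate alpha for training episode n."""
    xi = min(n, N_DECAY) / N_DECAY                                        # Equation (16)
    c = 0.5 * (1 + math.cos(xi * math.pi)) * (1 - ALPHA_MIN) + ALPHA_MIN  # Equation (15)
    return c * ALPHA_0                                                    # Equation (14)

# alpha decays smoothly from 0.15 and is held at its floor after episode 700:
# roughly 0.15, 0.096, and 0.00015 at episodes 0, 287, and 700.
print(learning_rate(0), learning_rate(287), learning_rate(700))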
Figure 2 illustrates the training architecture of the proposed modified rainbow-based method, and Algorithm 1 summarizes the training process. In each episode, the evaluation network with the dueling DQN architecture captures the "EV-CS-DN" state and selects the startup time of the EV's charging pile relay module. Once the action $a_t$ is executed, the immediate reward $r_t$ is calculated and the new state $s_{t+1}$ is observed. Historical samples are accumulated through this interaction process, and the loss of every sample in the replay buffer $D$ is computed via Equation (13). Then, $N_b$ samples with larger loss values are selected for mini-batch gradient descent to update the parameters of the evaluation network, and the parameters $\theta^+$ are copied to the target network every $N_f$ steps. This process is repeated until the agent completes the scheduling task for a whole day. After each completed episode, the learning rate $\alpha$ is attenuated to reduce the agent's exploration. These steps are repeated until the maximum training episode is reached. Details of the $\varepsilon$-greedy strategy in line 5 can be found in [27].
Algorithm 1: Modified Rainbow-based Solution Method
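The training procedure of Algorithm 1 can be outlined in PyTorch-style code as in the sketch below. This is an illustrative outline under our own naming assumptions, not the authors' implementation: a dueling evaluation network selects actions $\varepsilon$-greedily, transitions are stored in the replay buffer, the $N_b$ transitions with the largest TD errors are used for a DDQN-style mini-batch update, the target network is synchronized every $N_f$ steps, and the learning rate supplied from Equations (14)–(16) is applied at the start of each episode. The environment object is assumed to return torch tensors; terminal masking is omitted for brevity.

# Sketch of one training episode of Algorithm 1 (illustrative, not the
# authors' code): dueling head, DDQN target, loss-prioritized sampling,
# periodic target sync, and per-episode learning rate attenuation.
import random
import torch
import torch.nn as nn

N_STATE, N_ACTION = 8, 25
GAMMA, N_B, N_F, BUFFER_CAP = 0.9, 128, 300, 2000   # values from Section 4.1

class DuelingQNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(N_STATE, 120), nn.ReLU(),
                                  nn.Linear(120, 80), nn.ReLU())
        self.value = nn.Linear(80, 1)          # V(s)
        self.adv = nn.Linear(80, N_ACTION)     # A(s, a)

    def forward(self, s):
        h = self.body(s)
        a = self.adv(h)
        return self.value(h) + a - a.mean(dim=1, keepdim=True)   # Q = V + A - mean(A)

eval_net, target_net = DuelingQNet(), DuelingQNet()
target_net.load_state_dict(eval_net.state_dict())
optimizer = torch.optim.Adam(eval_net.parameters(), lr=0.15)
replay = []                                    # list of (s, a, r, s_next) tensors

def ddqn_targets(r, s_next):
    with torch.no_grad():
        a_star = eval_net(s_next).argmax(dim=1, keepdim=True)      # eval net selects a*
        q_next = target_net(s_next).gather(1, a_star).squeeze(1)   # target net evaluates a*
    return r + GAMMA * q_next

def run_episode(env, epsilon, alpha, step=0):
    for g in optimizer.param_groups:           # alpha from Equations (14)-(16)
        g["lr"] = alpha
    s, done = env.reset(), False
    while not done:
        if random.random() < epsilon:          # epsilon-greedy action selection
            a = random.randrange(N_ACTION)
        else:
            with torch.no_grad():
                a = int(eval_net(s.unsqueeze(0)).argmax())
        s_next, r, done = env.step(a)
        replay.append((s, torch.tensor([a]), torch.tensor([float(r)]), s_next))
        if len(replay) > BUFFER_CAP:
            replay.pop(0)

        if len(replay) >= N_B:
            s_b = torch.stack([t[0] for t in replay])
            a_b = torch.cat([t[1] for t in replay])
            r_b = torch.cat([t[2] for t in replay])
            sn_b = torch.stack([t[3] for t in replay])
            target = ddqn_targets(r_b, sn_b)
            q_sa = eval_net(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
            # Equation (13): train on the N_b samples with the largest TD error
            # (the exponent omega does not change the top-N_b ordering).
            prio = (target - q_sa).detach().abs()
            idx = torch.topk(prio, N_B).indices
            loss = nn.functional.mse_loss(q_sa[idx], target[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        step += 1
        if step % N_F == 0:                    # copy theta+ -> theta- every N_f steps
            target_net.load_state_dict(eval_net.state_dict())
        s = s_next
    return step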

4. Case Studies

4.1. Case Study Setup

The simulation experiments are conducted on a CS equipped with PV and an ESS. Table 1 lists the environment parameters. The charging time characteristics are detailed in reference [28], and the duration of EV users' stays at the CS follows the normal distribution N(58.32, 12.44) [29].
The rainbow network has two hidden layers with 120 and 80 neurons, respectively. In the training process, the initial learning rate $\alpha_0$ is 0.15 and the minimum learning rate $\alpha_{\min}$ is 0.001. The attenuation episode $n_{\mathrm{decay}}$ is 700, and the discount factor $\gamma$ is 0.9. For the evaluation network, 128 samples with higher priority are selected from the replay buffer $D$ (capacity: 2000 samples) to perform gradient descent. After every 300 updates of the evaluation network, the trained parameters are copied to the target network.
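For convenience, the hyperparameters listed above can be gathered into a single configuration object, as in the sketch below; the values come from this subsection, while the field names are our own.

# Training hyperparameters from Section 4.1, collected in one place.
# Field names are our own; values follow the text above.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    hidden_sizes: tuple = (120, 80)   # neurons in the two hidden layers
    alpha_0: float = 0.15             # initial learning rate
    alpha_min: float = 0.001          # minimum learning rate
    n_decay: int = 700                # attenuation episode
    gamma: float = 0.9                # discount factor
    batch_size: int = 128             # N_b: prioritized samples per update
    buffer_size: int = 2000           # replay buffer D capacity
    target_sync: int = 300            # N_f: eval-net updates per target copy

cfg = TrainConfig()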

4.2. Training Process Analysis

Figure 3 shows the reward of each training episode. Because the EV states and the PV output are random in each episode, the rewards obtained by the agent fluctuate significantly. Nevertheless, the rewards show a clear upward trend in the early training stage and converge between episodes 180 and 210. The average reward stabilizes at −1.69, indicating that the agent has fitted the optimal state-action mapping. Note that the agent is encouraged to explore the environment with a large learning rate in early training, which leads to unstable learning during episodes 200–400; in particular, the reward was only −2.35 in episode 287, where the learning rate was 0.096. The stability of the agent improved as the training episodes increased.

4.3. Application Results Analysis

Furthermore, we benchmarked the performance of our method against an uncoordinated policy. Figure 4 details the CS power distribution under the two methods. Figure 4a shows that the charging load reached its peak at around 12:00, when the ESS, PV, and DN had to cooperate to serve the charging load. However, the charging load dropped sharply at around 14:00; owing to strong solar radiation and high temperature, the PV output greatly exceeded the maximum charging power of the ESS, resulting in PV curtailment, and the PV utilization rate was only 87.40%. Because the ESS was depleted after 19:00, the CS could only rely on the DN for power supply, and the power purchase cost for the whole day reached USD 180.90. As shown in Figure 4b, the PV utilization rate increased to 92.23% under the proposed optimal scheduling approach. Moreover, as the charging load during the peak electricity price period was shifted, the dependence on the DN was significantly reduced, and the electricity purchase cost fell to USD 140.03, a reduction of 22.59% compared with the benchmark.

4.4. Generalization Performance Assessment

In this part, three experimental scenarios were designed to analyze the generalization performance of the proposed modified rainbow method: (1) the ESS works under normal conditions on clear days (episodes 1–1000); (2) the ESS works under normal conditions on cloudy days (episodes 1001–2000); and (3) the ESS works under failure conditions on clear days (episodes 2001–3000). The total number of training episodes was 3000, with the scenario changing every 1000 episodes. The training rewards are shown in Figure 5. The stable average rewards in the three scenarios were −1.69, −2.08, and −2.06, respectively. Owing to the low PV output, PV curtailment rarely appeared in scenario 2; however, the power purchase cost increased significantly, and the reward was reduced by 23.08% compared with scenario 1. The ESS failure brings further uncertainty to scenario 3: the rewards received by the agent fluctuated significantly, and the average reward was reduced by 21.89% compared with scenario 1.
Regarding application performance, the agent trained only in scenario 1 was tested online for 100 episodes in each of the three scenarios, and the CS operating cost is detailed in Figure 6. The proposed method effectively reduced the CS cost in all three scenarios. In scenario 1, the average operating cost under the uncoordinated policy was USD 170.80, whereas under the proposed method it was only USD 149.13. Across the three scenarios, the operating cost was reduced by an average of 9.72%. In particular, although the operating conditions of scenarios 2 and 3 were not seen during training, the CS operating cost was still effectively reduced during testing, showing strong generalization ability. The proposed method quickly adapts to the current environment and is suitable for engineering application.

4.5. Algorithm Performance Comparison

We also benchmarked our algorithm against DQN and DDQN solutions. Figure 7 shows their training processes. DQN rises rapidly and tends to converge within 150 episodes; however, because DQN has difficulty solving complex decision-making problems with high-dimensional state and action spaces, its average reward after convergence was only −1.94. The reward of DDQN was 7.10% higher, at the expense of convergence speed. The comprehensive performance of our proposed algorithm was greatly improved: while maintaining the convergence speed, its reward increased by 12.81% compared with the classical DQN solution.
Finally, to measure the application performance of the different algorithms, the three algorithms were employed for CS scheduling over one month (10 days for each scenario). Table 2 lists the resulting operating indicators. The modified rainbow algorithm achieved promising performance in terms of the power purchase cost and the PV utilization rate: the average power purchase cost was 10.49% lower than that of DQN, and the PV utilization rate increased by 4.2 percentage points. It is worth noting that the ESS charging–discharging throughput under the modified rainbow algorithm was the largest among the three algorithms, indicating that the scheduling strategy of the proposed method relies more on ESS regulation.

5. Conclusions

Considering the time-scale uncertainty of EV charging behavior, this article proposed an optimal DRL-based CS control strategy. The scheduling model was formulated as an FMDP, and four improvement mechanisms were introduced into the classical DQN. Based on experimental verification in multiple scenarios, the main conclusions are as follows.
  • The well-trained agent intelligently formulates an EV charging plan according to the current environment state to achieve the optimal benefit of multiple stakeholders. Especially under extreme scenarios, the proposed method exhibits superior generalization and meets the needs of engineering applications. The proposed method improved the PV utilization rate to 90.31% and reduced the CS operating cost by 9.72% on average.
  • The modified rainbow method overcomes the low training efficiency and poor stability of classical DRL algorithms. It effectively balances convergence and stability and improves the reward by 12.81%.
This study is significant in that it extends the set of improvements available for DRL algorithms, showing that specific extensions can be added to a base algorithm to address problem-specific issues. This study mainly analyzed the effect of the proposed method in engineering applications; the influence of parameters on the proposed method was not discussed. Accordingly, one future direction is to evaluate the sensitivity of the DRL training parameters and further improve the proposed method. In addition, we plan to integrate multiple extensions into the deep deterministic policy gradient algorithm to see whether it yields better training results.

Author Contributions

Conceptualization, R.W. and Q.X.; methodology, R.W.; software, R.W.; validation, Z.C. and Z.Z.; formal analysis, R.W. and Q.X.; investigation, Z.C.; resources, T.Z.; data curation, T.Z.; writing—original draft preparation, R.W. and Q.X.; writing—review and editing, Q.X., Z.Z. and Z.C.; visualization, R.W.; supervision, Q.X.; project administration, Z.C. and Q.X.; funding acquisition, Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the editor of this journal and the reviewers for their detailed and helpful comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kapustin, N.O.; Grushevenko, D.A. Long-term electric vehicles outlook and their potential impact on electric grid. Energy Policy 2020, 137, 111103.
  2. Dong, F.; Liu, Y. Policy evolution and effect evaluation of new-energy vehicle industry in China. Resour. Policy 2020, 67, 101655.
  3. Rajendran, G.; Vaithilingam, C.A.; Misron, N.; Naidu, K.; Ahmen, M.R. A comprehensive review on system architecture and international standards for electric vehicle charging stations. J. Energy Storage 2021, 42, 103099.
  4. Das, H.S.; Rahman, M.M.; Li, S.; Tan, C.W. Electric vehicles standards, charging infrastructure, and impact on grid integration: A technological review. Renew. Sustain. Energy Rev. 2020, 120, 109618.
  5. Zhang, J.; Yan, J.; Liu, Y.; Zhang, H.; Lv, G. Daily electric vehicle charging load profiles considering demographics of vehicle users. Appl. Energy 2020, 274, 115063.
  6. Moghaddam, Z.; Ahmad, I.; Habibi, D.; Masoum, M.A.S. A coordinated dynamic pricing model for electric vehicle charging stations. IEEE Trans. Transp. Electr. 2019, 5, 226–238.
  7. Luo, C.; Huang, Y.; Gupta, V. Stochastic dynamic pricing for EV charging stations with renewable integration and energy storage. IEEE Trans. Smart Grid 2018, 9, 1494–1505.
  8. Zhang, Q.; Hu, Y.; Tan, W.; Li, C.; Ding, Z. Dynamic time-of-use pricing strategy for electric vehicle charging considering user satisfaction degree. Appl. Sci. 2020, 10, 3247.
  9. Raja S, C.; Kumar N M, V.; J, S.K.; Nesamalar J, J.D. Enhancing system reliability by optimally integrating PHEV charging station and renewable distributed generators: A Bi-level programming approach. Energy 2021, 229, 120746.
  10. Li, D.; Zouma, A.; Liao, J.; Yang, H. An energy management strategy with renewable energy and energy storage system for a large electric vehicle charging station. eTransportation 2020, 6, 100076.
  11. Yang, M.; Zhang, L.; Zhao, Z.; Wang, L. Comprehensive benefits analysis of electric vehicle charging station integrated photovoltaic and energy storage. J. Clean. Prod. 2021, 302, 126967.
  12. Nishimwe H., L.F.; Yoon, S.-G. Combined optimal planning and operation of a fast EV-charging station integrated with solar PV and ESS. Energies 2021, 14, 3152.
  13. Zeng, T.; Zhang, H.; Moura, S. Solving overstay and stochasticity in PEV charging station planning with real data. IEEE Trans. Ind. Inform. 2020, 16, 3504–3514.
  14. Sadeghianpourhamami, N.; Deleu, J.; Develder, C. Definition and evaluation of model-free coordination of electrical vehicle charging with reinforcement learning. IEEE Trans. Smart Grid 2020, 11, 203–214.
  15. Wang, S.; Bi, S.; Zhang, Y.A. Reinforcement learning for real-time pricing and scheduling control in EV charging stations. IEEE Trans. Ind. Inform. 2021, 17, 849–859.
  16. Wan, Z.; Li, H.; He, H.; Prokhorov, D. Model-free real-time EV charging scheduling based on deep reinforcement learning. IEEE Trans. Smart Grid 2019, 10, 5246–5257.
  17. Li, H.; Li, G.; Wang, K. Real-time dispatch strategy for electric vehicles based on deep reinforcement learning. Automat. Electr. Power Syst. 2020, 44, 161–167.
  18. Lee, K.; Ahmed, M.A.; Kang, D.; Kim, Y. Deep reinforcement learning based optimal route and charging station selection. Energies 2020, 13, 6255.
  19. Qian, T.; Shao, C.; Wang, X.; Shahidehpour, M. Deep reinforcement learning for EV charging navigation by coordinating smart grid and intelligent transportation system. IEEE Trans. Smart Grid 2020, 11, 1714–1723.
  20. Li, H.; Wan, Z.; He, H. Constrained EV charging scheduling based on safe deep reinforcement learning. IEEE Trans. Smart Grid 2020, 11, 2427–2439.
  21. Hessel, M.; Modayil, J.; van Hasselt, H.; Schaul, T.; Ostrovski, G.; Dabney, W.; Horgan, D.; Piot, B.; Azar, M.; Silver, D. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, LA, USA, 2–7 February 2018.
  22. Harrold, D.J.B.; Cao, J.; Fan, Z. Data-driven battery operation for energy arbitrage using rainbow deep reinforcement learning. Energy 2022, 238, 121958.
  23. Xiao, G.; Wu, M.; Shi, Q.; Zhou, Z.; Chen, X. DeepVR: Deep reinforcement learning for predictive panoramic video streaming. IEEE Trans. Cogn. Commun. 2019, 5, 1167–1177.
  24. Yang, J.; Yang, M.; Wang, M.; Du, P.; Yu, Y. A deep reinforcement learning method for managing wind farm uncertainties through energy storage system control and external reserve purchasing. Int. J. Electr. Power 2020, 119, 105928.
  25. Yang, T.; Zhao, L.; Li, W.; Zomaya, A.Y. Reinforcement learning in sustainable energy and electric systems: A survey. Annu. Rev. Control 2020, 49, 145–163.
  26. Cui, S.; Wang, Y.; Shi, Y.; Xiao, J. An efficient peer-to-peer energy-sharing framework for numerous community prosumers. IEEE Trans. Ind. Inform. 2020, 16, 7402–7412.
  27. Huang, Y.; Huang, W.; Wei, W.; Tai, N.; Li, R. Logistics-Energy Collaborative Optimization Scheduling Method for Large Seaport Integrated Energy System. Available online: https://kns.cnki.net/kcms/detail/11.2107.TM.20210811.1724.013.html (accessed on 12 August 2021).
  28. Liu, X.; Feng, T. Energy-storage configuration for EV fast charging stations considering characteristics of charging load and wind-power fluctuation. Glob. Energy Interconnect. 2021, 4, 48–57.
  29. Lin, X.; Liu, T.; Wang, Z. Annual Report on Green Development of China's Urban Transportation (2019); Social Science Literature Press: Beijing, China, 2019.
Figure 1. The overview of the proposed CS optimal scheduling architecture.
Figure 2. The training architecture of the proposed modified rainbow-based method.
Figure 3. The training process of the proposed method.
Figure 4. CS power balance comparison before and after control. (a) Uncoordinated power balance; (b) power balance under the proposed method.
Figure 5. The training process in three scenarios of the proposed method.
Figure 6. Comparison of the CS operating cost in three scenarios.
Figure 7. Average reward values in each episode for different algorithms (DQN, double DQN, and modified rainbow).
Table 1. Settings of environment parameters.

Parameter | Value | Unit
Charging pile output power | 60 | kW
PV maximum power | 225 | kW
ESS capacity | 295.68 | kWh
ESS maximum power | 90 | kW
ESS maximum SOC | 0.95 | –
ESS minimum SOC | 0.1 | –
EV battery capacity | 40 | kWh
EV expected SOC | 0.9 | –
Number of EVs | 100 | –
Equipment efficiency | 0.95 | –
Penalty coefficient $\lambda^{\mathrm{EV}}$ | 15.82 | USD
Penalty coefficient $\lambda^{\mathrm{ESS}}$ | 0.01 | USD/kWh
Penalty coefficient $\lambda^{\mathrm{PV}}$ | 0.0158 | USD/kWh
Table 2. Application performance for different algorithms (uncoordinated, DQN, double DQN, and modified rainbow).

Algorithm | CS Power Purchase Cost (USD) | PV Utilization Rate | ESS Charging–Discharging Capacity (kWh)
Uncoordinated | 172.95 | 86.04% | 763.58
DQN | 164.25 | 86.33% | 784.45
DDQN | 151.47 | 87.38% | 745.32
Modified rainbow | 147.02 | 90.53% | 792.14