3.1. Experiment Design and Implementation
In this study, a crowd dynamics simulation system based on the social force model [3] is used as the agent’s interactive environment. A typical multi-room, dual-exit indoor scene is constructed, in which the guidance signs are marked as green arrows and the exits as green rectangles. The simulation system takes the display states of the evacuation guidance signs as input. In the simulation, each individual first determines a subjective driving-force direction based on the nearby guidance signs and the distance to each exit. When an individual cannot see any evacuation guidance sign, they choose the closest exit and escape along the static shortest route. When an individual notices an evacuation guidance sign, they escape in the direction indicated by the sign, unless they are close to an exit, in which case they may ignore the sign. When congestion occurs, the individual’s speed is limited, and a subjective driving force that tends to escape the congestion is added. The dynamic simulation then calculates the speed and displacement of each individual’s movement, simulating collisions between individuals and between individuals and walls. Individuals within an exit area are deemed to have successfully escaped. At the end of a simulation step, the crowd’s positions and the signs’ states are rendered into a scene image, which is passed to the reinforcement learning agent as input to its neural network. The simulation system is implemented in C++ with the Qt library.
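The decision rule each simulated individual applies when choosing a desired direction can be sketched as follows. This is a minimal illustration, not the simulation system's actual code; the function name, the `exit_radius` threshold, and the choice of the first visible sign are assumptions.

```python
import math

def desired_direction(pos, exits, visible_signs, exit_radius=2.0):
    """Subjective driving-force direction for one simulated individual.

    pos: (x, y) position; exits: list of exit (x, y) positions;
    visible_signs: unit direction vectors of guidance signs in view.
    All names and the exit_radius threshold are illustrative.
    """
    # Find the nearest exit and the distance to it.
    nearest = min(exits, key=lambda e: math.dist(pos, e))
    d = math.dist(pos, nearest)
    # Very close to an exit, or no sign in view: head for the nearest
    # exit directly (individuals near an exit ignore the signs).
    if d < exit_radius or not visible_signs:
        dx, dy = nearest[0] - pos[0], nearest[1] - pos[1]
        norm = math.hypot(dx, dy) or 1.0
        return (dx / norm, dy / norm)
    # Otherwise follow the direction indicated by a visible sign.
    return visible_signs[0]
```

The congestion-relief force and collision handling described above would then be added on top of this desired direction by the social force dynamics.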
As shown in Figure 4, the simulation scene’s size is
m × 19.7 m, and the size of the floor map is
pixels. The scene contains two exits and six rooms, where the upper and lower channels connect the rooms and the exits. Each channel contains five dynamic guidance signs, and each sign can indicate one of two directions. The number of people is 200, and their initial positions are randomly distributed in a circular range; the range of the distribution center and radius is
, and the maximum individual movement speed is 5 m/s. Each time step of the simulation dynamics calculation is 40 ms, and the upper limit of the simulation time is 100 s.
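The dynamics step implied by these parameters can be outlined as follows. This is a minimal sketch under stated assumptions: the social forces themselves are taken as given inputs, and all names are illustrative rather than the simulator's actual API.

```python
DT = 0.040        # dynamics time step: 40 ms
T_MAX = 100.0     # upper limit of the simulation time: 100 s
V_MAX = 5.0       # maximum individual movement speed: 5 m/s

def step_crowd(positions, velocities, forces):
    """Advance every individual by one 40 ms dynamics step.

    forces holds the net (social + driving) force per unit mass for
    each individual; speeds are clamped to the 5 m/s maximum.
    """
    for i, (f, v) in enumerate(zip(forces, velocities)):
        vx, vy = v[0] + f[0] * DT, v[1] + f[1] * DT
        speed = (vx * vx + vy * vy) ** 0.5
        if speed > V_MAX:                      # enforce the speed cap
            vx, vy = vx * V_MAX / speed, vy * V_MAX / speed
        velocities[i] = (vx, vy)
        positions[i] = (positions[i][0] + vx * DT,
                        positions[i][1] + vy * DT)
    return positions, velocities
```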
The CA-DQN method is implemented in Python with the TensorFlow platform and the OpenAI Baselines library. In each time step of the reinforcement learning agent, the simulation system first performs a five-step calculation, simulating the crowd’s movement over 200 ms. The last four images obtained are down-sampled into half-size grayscale images and combined into a four-channel image of
pixels, which is input to the agent’s Q network as the state. The Q network structure, shown in Figure 3, comprises a three-layer convolutional neural network followed by a three-layer fully connected network. The 20 neurons in the output layer are divided into ten groups; from each group, the larger output value is selected as the display signal for the corresponding dynamic guidance sign. Combining them forms a ten-dimensional discrete output vector, the agent’s action, which acts on the simulation system to change the directions displayed by the ten guidance signs, thereby directing the crowd’s movement. At this point, one interaction cycle between the agent and the simulation environment is complete. The agent’s reward for each step is fixed at
, implying that it obtains a reward of
The agent’s training goal is to reduce the overall evacuation time. For example, exit congestion and imbalanced exit utilization lengthen the evacuation time and lower the reward, giving the agent negative feedback. Through trial and error, the agent learns to avoid such situations and to find the optimal strategy.
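The grouped selection described for the output layer — ten groups of two neurons, with the larger value in each group determining one sign's display signal — can be sketched as follows. The network itself is omitted; `q_values` stands for the 20 raw outputs, and the names and grouping order are illustrative assumptions.

```python
import numpy as np

def decode_action(q_values):
    """Decode 20 Q-network outputs into a 10-dimensional discrete action.

    q_values: array of shape (20,), two Q-values per guidance sign.
    Returns a direction index (0 or 1) for each of the ten signs,
    taking the larger Q-value within each group of two.
    """
    grouped = np.asarray(q_values).reshape(10, 2)  # one row per sign
    return grouped.argmax(axis=1)                  # larger value wins

# Example: the first five groups prefer direction 0, the last five direction 1.
q = np.array([1.0, 0.0] * 5 + [0.0, 1.0] * 5)
action = decode_action(q)
```

Decomposing the joint action this way keeps the output layer small, since each sign contributes only two neurons rather than multiplying the action space.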
Among the training parameters, the batch size is 64, the learning rate is , the total time step is , the experience pool’s sample size is , and the current Q network parameters are copied to the target Q network every steps. The experimental hardware platform comprises an AMD Threadripper 2990WX CPU, an NVIDIA RTX 2080Ti GPU, and 128 GB of memory.
3.2. Experimental Results and Analysis
As the original DQN method would require
output layer nodes for the experiments conducted herein, compared with the 20 nodes for CA-DQN, the DQN network is too large to implement under the existing conditions. Therefore, two methods are chosen for comparison in this study: static guidance signs, and an evacuation guidance algorithm based on topology-map modeling and the dynamic Dijkstra shortest path method [3]. The static guidance signs indicate the nearest exit, regardless of the real-time distribution of the crowd. Because individuals in this simulation environment escape to the nearest exit if they cannot find a guidance sign, the static guidance signs play a role similar to having no signs at all. The dynamic Dijkstra shortest path method requires experts to manually build a topology-map model of the channel structure: they must set up multiple virtual camera nodes, count the crowd density on different paths, adjust the real-time weights of each edge, and run the Dijkstra algorithm for path planning to achieve effective crowd evacuation.
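The baseline's density-weighted routing can be sketched as follows. This is a generic illustration, assuming a `(1 + density)` congestion penalty on each edge; the paper's actual weighting scheme and topology map are not specified here, and all names are illustrative.

```python
import heapq

def dijkstra(adj, source):
    """Shortest-path distances on a weighted topology map.

    adj: {node: [(neighbor, weight), ...]} with non-negative weights.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def congested_graph(base_adj, density):
    """Re-weight each edge by the real-time crowd density on it (assumed penalty)."""
    return {u: [(v, w * (1.0 + density.get((u, v), 0.0)))
                for v, w in edges]
            for u, edges in base_adj.items()}

# Two routes from room R to exit E; congestion on the A-route flips the choice.
base = {"R": [("A", 5.0), ("B", 6.0)], "A": [("E", 5.0)], "B": [("E", 6.0)]}
free = dijkstra(base, "R")
jam = dijkstra(congested_graph(base, {("R", "A"): 2.0, ("A", "E"): 2.0}), "R")
```

With no congestion the A-route (length 10) wins; once density 2.0 triples its edge weights, the B-route (length 12) becomes the planned path.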
The agent only needs to be trained once for a given scene, and the training process takes about three days. The training curve in Figure 5 shows that the agent reaches the optimal strategy after approximately 30,000 training periods, when the number of interactions between the agent and the simulation environment is approximately
time steps. When the agent executes the optimized strategy obtained from training, for a 200 ms step, the execution of the strategy takes
ms, and the crowd simulation takes
ms. This processing speed meets the real-time requirement for actual deployment.
The CA-DQN-mean method is more stable in the later stage of training; moreover, the training effect of the CA-DQN-mean method is better than that of the CA-DQN-max method. This result indicates that under the influence of random interference, the CA-DQN-mean method can calculate the sample importance more accurately with the information of all the dimensions. Consequently, it is more appropriate to define the sample priority as the average of absolute TD errors in each dimension.
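The two priority definitions being compared can be sketched as follows. This is a minimal illustration assuming the per-dimension TD errors of a sample are available as a vector; the names are illustrative, and the full prioritized-replay machinery is omitted.

```python
import numpy as np

def priority_max(td_errors):
    """CA-DQN-max: priority is the largest absolute TD error over dimensions."""
    return np.abs(td_errors).max()

def priority_mean(td_errors):
    """CA-DQN-mean: priority is the mean absolute TD error over dimensions."""
    return np.abs(td_errors).mean()

# One noisy dimension dominates the max-based priority but barely
# moves the mean-based one, which uses information from all dimensions.
td = np.array([0.1, -0.1, 0.1, 0.1, -0.1, 0.1, 0.1, -0.1, 0.1, 5.0])
```

This illustrates the observation above: under random interference in a single dimension, the mean definition gives a more faithful estimate of a sample's overall importance.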
As shown in Table 1, a 100-cycle evacuation simulation was performed for the different evacuation methods using new random crowd-distribution parameters. The bold numbers indicate the best period reward and the shortest evacuation time. The average period reward of the agent’s optimal strategy was
, implying that the average evacuation time was
s, which was better than the
s obtained using static guidance signs and the
s obtained using the dynamic Dijkstra shortest path method. This demonstrates that the intelligent evacuation guidance agent based on CA-DQN can effectively guide an evacuation.
Figure 6 shows a typical evacuation process. Figure 6a shows the crowd’s initial distribution, which is concentrated in the four rooms on the left. Without dynamic guidance, the static evacuation strategy of taking the shortest path to an exit would cause congestion at the left exit, while the right exit would not be used effectively. In Figure 6b, the agent perceives the crowd’s distribution from the scene image with the CNN and leads part of the crowd in the upper-left room to the left exit and the rest of the crowd to the right exit. Notice that the side signs in the lower channel point in an unexpected direction. This occurs because, when individuals are very close to an exit in the simulation environment, they ignore the signs and go directly to the exit; such signs have little influence on the evacuation efficiency, so it is hard to obtain training feedback for them. For the same reason, the left-side signs in Figure 6c are mostly ignored.
In Figure 6d,e, the congestion at the left exit has been relieved. The right exit is expected to evacuate more people; consequently, the agent leads the remaining people in the lower-left area to the left exit. Notice that the first and third signs in the upper channel of Figure 6e seem unexpected. This occurs because jamming slows down the crowd in the simulation environment; the agent has learned this feature and tries to delay part of the crowd to avoid exit congestion. Finally, in Figure 6f, the crowd is evacuated through both exits simultaneously, indicating that the evacuation guidance agent has maximized the crowd evacuation efficiency. The signs in the empty areas of Figure 6f do not affect the evacuation efficiency, and the agent cannot learn from feedback there; therefore, the signs in these areas are uncertain.
The number of people initialized in the simulation scene was varied, and 100 cycles of evacuation simulation were run; a comparison of the evacuation effects of the different methods is shown in Figure 7. When the number of people was small, each channel remained unobstructed, and the static guidance method was effective. As the number of people increased, the static guidance method was affected more, whereas CA-DQN and the dynamic shortest path method avoided crowd congestion. When the number of people exceeded 80, the three dynamic methods evacuated the crowd more effectively than the static method. The CA-DQN-mean method, which uses the average of the absolute TD errors over the dimensions to define the sample priority according to Formula (13), performed better than the dynamic shortest path method. The CA-DQN-max method, defined by Formula (12), performed on par with the dynamic Dijkstra shortest path method.
The experimental results show that, compared with static signs that cannot perceive crowd-distribution information, the proposed CA-DQN reinforcement-learning-based crowd evacuation guidance method dynamically adjusts the display signals of the guidance signs and effectively improves the efficiency of crowd evacuation. Compared with the dynamic Dijkstra shortest path method based on topology-map modeling, the proposed method demonstrates higher evacuation guidance efficiency while avoiding the workload and potential errors of manual topology-map construction.