1. Introduction
Coronavirus disease 2019 (COVID-19), which spread across the world throughout 2020, has not only undermined global public health security but also seriously threatened human health [1,2]. Although the whole country is actively coordinating to control the epidemic, COVID-19 is still spreading. Therefore, accurately identifying the current high-risk areas and assessing the epidemic risk level of different areas are both important prerequisites for formulating epidemic prevention policies [3]. With the approach of winter in the Northern Hemisphere, COVID-19 is becoming increasingly active, so preventing a second outbreak remains an important challenge for the global epidemic response. The prevention and control of COVID-19 will therefore continue to be an important issue for maintaining urban public security and achieving sustainable urban development. This is also why assessing the current epidemic risk can provide reliable support for urban safety decision-making.
It is generally believed that relatively effective anti-infection measures include limiting mass human migration, classifying areas with more confirmed cases as high-risk areas, and locking down smaller regions where the epidemic risk is relatively high. Although these measures played an active role in the prevention and control of COVID-19, they failed to highlight the key areas [4]. Therefore, accurately assessing the risk of COVID-19 from the perspective of geographical space supports rational prevention and control of the epidemic [5].
Since the outbreak of the epidemic, scholars have carried out many studies on the clinical diagnosis [6], transmission relationships [7], drugs and vaccines [8], spatiotemporal patterns [9], risk assessment [10], and epidemic transmission [11] of COVID-19, which have contributed greatly to its prevention and control. With the arrival of the post-epidemic period, epidemic risk assessment has gradually become the focus of attention [12,13]. Existing risk assessments mainly include using a population-flow risk model to assess the epidemic risk [14], using Tencent positioning data to classify infection risk [15], using big data on population mobility together with the transmission law of the epidemic to predict its spread pattern [16], and using big data on population mobility together with social factors to reveal how the epidemic spreads and changes [17]. Although these studies have advanced the risk assessment of COVID-19, current results mostly focus on the macro-regional scale and therefore fail to reflect the risk of COVID-19 at the micro-geospatial scale of cities under the influence of urban elements [18,19].
As an abstract representation of various spatial factors in virtual geographic space, Point of Interest (POI) data mainly presents the agglomeration of urban factors through the density of POI data points [20,21]. Because of this characteristic, POI data can represent the spatial differences among urban factors well, and it is becoming increasingly prevalent in research on geographical space [22]. Moreover, POI data has contributed to the simulation of urban spatial structure [23] and the analysis of urban hot spots [24,25]. Tencent-Yichuxing data is a macro-scale spatial population mobility index for a given time period, generated from the location information of users of Tencent apps. This index represents the interaction among people in regional space well, which makes Tencent-Yichuxing data increasingly popular in research on urban agglomerations and on information exchange among cities [26,27].
At present, the heterogeneity of urban space makes it difficult for single-source data to keep pace with the rapid transformation of urban spatial structure [28]. Therefore, more and more scholars are paying attention to the use of fused multi-source data in urban research. Multi-source data fusion mainly refers to the combined use of traditional and emerging data sources [29]. Traditional data includes statistical data and remote sensing data [30], while emerging data mainly refers to data obtained from the Internet, such as POI data [31], heat map data [32] and so on [33]. Fused data has been shown to have an edge over single-source data in the extraction of urban built-up areas [34], the delineation of urban centers [35], the identification of urban functions [36] and the regulation of urban spatial form [37]. Data fusion can reduce or even avoid the inaccuracy that single-source data introduces into urban space research, and thus improve the accuracy of research results [38]. Fusing multi-source data has therefore become an emerging way of resolving urban problems [39], which is why it should be widely applied when assessing the risk of COVID-19 or exploring its influencing factors within urban spaces.
With the development of computer technology, computing methods such as machine learning have been widely applied in urban research [40]. Machine learning obtains general patterns from existing data samples and then predicts results based on those patterns [41]. Compared with general linear regression [42], the GNN-CA model [43], the cellular automata model [44], etc., logistic regression, as one of the most common machine learning models, has a more objective algorithm, a rigorous calculation process, and a simpler and more efficient calculation between normal hypothesis and variables [45]. At present, logistic regression has achieved good results in urban land prediction [46], urban expansion simulation [47] and urban spatial change [48]. This study draws on the advantages of logistic regression models in urban spatial prediction to assess the current risk level of COVID-19.
Because the observed data of different variables distributed in the same region are potentially interdependent, the spatial factors affecting the risk level of the epidemic show obvious spatial differentiation [49]. At present, there are few analytical methods for spatial differentiation; they mainly include the spatial analytical measure [49], geodetector statistics [50], MSN for stratified samples [51], Bshade for sample deviation [52], the SPA model for single-point samples [53] and the Sandwich model for multi-unit conversion [54]. The geodetector is a statistical method that detects spatial differentiation and reveals its driving force. Its basic premise is that if an independent variable has an important influence on the dependent variable, then the spatial distributions of the two variables should be similar [50]. Compared with other spatial differentiation analysis methods, the geodetector can examine not only the data themselves but also the interaction between different factors [55,56]. Therefore, the geodetector can both analyze the main role of different factors and judge the relationship between factors, which cannot be achieved by other spatial differentiation analysis methods [57].
Although the super first-tier cities of Beijing, Shanghai, Guangzhou and Shenzhen have not experienced large-scale COVID-19 infections like Wuhan, they face the greatest potential threat from COVID-19 because they have the most complex inter-city and city-region population flows in mainland China. To prevent and control the epidemic more efficiently, with a more rational use of anti-epidemic resources and a clearer focus, this study takes Guangzhou and Wuhan as cases. It first uses a logistic regression model to fuse big data such as POI and Tencent-Yichuxing data and evaluates the current epidemic risk level on this basis. The correlation between the spatial factors affecting the epidemic risk level is then analyzed using the geodetector model, and the accuracy is finally verified. These results provide an important reference for the formulation of epidemic prevention policy.
3. Results
3.1. Model Training
Based on COVID-19 data from January to April 2020 in Guangzhou and Wuhan, COVID-19 risk areas are delineated, and samples of risk areas and risk-free areas are constructed. The factors are first tested for collinearity, because the eight spatial factors used in this study may exhibit multicollinearity, which would seriously bias the results of the logistic regression model. TOL (tolerance) and VIF (variance inflation factor), whose product is 1 by definition, are common indicators of the degree of collinearity among factors. Generally speaking, a VIF greater than 10 or a TOL less than 0.1 indicates a high degree of collinearity among factors, which does not meet the modeling conditions. The results of the multicollinearity analysis of the eight spatial factors are shown in Table 2; the product of VIF and TOL is around 1 for all factors, and the diagnosis confirms that all factors meet the collinearity requirements. Therefore, all eight spatial factors are introduced into model training.
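The collinearity diagnosis described above can be illustrated with a minimal sketch. It assumes the standard definitions (VIF of a factor equals the corresponding diagonal element of the inverse correlation matrix, and TOL = 1/VIF); the function names and the toy columns are hypothetical and are not the study's data or software.

```python
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix of equal-length, non-constant factor columns."""
    n = len(columns[0])
    k = len(columns)
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    corr = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            cov = sum((a - means[i]) * (b - means[j])
                      for a, b in zip(columns[i], columns[j])) / n
            corr[i][j] = cov / (stds[i] * stds[j])
    return corr


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting."""
    k = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(matrix)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("factors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif_and_tolerance(columns):
    """Return (VIF list, TOL list); VIF_j is the j-th diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    vifs = [inverse[j][j] for j in range(len(columns))]
    return vifs, [1.0 / v for v in vifs]


# Hypothetical factor columns (one value per grid cell), for illustration only.
station_density = [1.0, 2.0, 3.0, 4.0]
restaurant_density = [1.0, -1.0, -1.0, 1.0]
vifs, tols = vif_and_tolerance([station_density, restaurant_density])
flagged = [v > 10 or t < 0.1 for v, t in zip(vifs, tols)]
```

A factor would be flagged when its VIF exceeds 10 or its TOL falls below 0.1, which is the screening rule stated in the text.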
3.2. Risk Assessment of COVID-19 of Guangzhou
It can be found from the training results of the model that the higher the risk level is, the higher the probability of COVID-19 occurrence will be. Combined with the actual geographical location, the distribution map of epidemic risk level in Guangzhou is obtained (
Figure 9).
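The link between factor values, occurrence probability and risk level can be sketched as follows. This is only an illustration under stated assumptions: a single factor scaled to [0, 1], plain batch gradient descent, and equal-interval binning of the probability into five levels. The binning rule, function names and toy data are hypothetical, since the text does not specify how probabilities are mapped to levels.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(k):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Predicted probability that a cell is a risk area."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def risk_level(p, n_levels=5):
    """Assumed equal-interval binning of a probability into levels 1..n_levels."""
    return min(int(p * n_levels), n_levels - 1) + 1


# Toy samples: one scaled factor per cell, label 1 = risk area, 0 = risk-free area.
X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
levels = [risk_level(predict_proba(x, w, b)) for x in X]
```

In this sketch a higher predicted probability always maps to an equal or higher level, which mirrors the monotonic relationship described in the text.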
According to the distribution map of the COVID-19 risk level, high-risk regions are concentrated in Yuexiu District, Tianhe District, Liwan District and Haizhu District, which are the core areas of Guangzhou. A comparison of Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 shows that the areas with higher COVID-19 risk are also those with higher densities of transportation stations and restaurants, higher population mobility and higher resident population density. Therefore, COVID-19 risk has a strong spatial correlation with traffic stations, restaurants and population distribution, while its spatial correlation with hotels and living space is relatively weak.
Yuexiu District, Tianhe District, Liwan District and Haizhu District all have a relatively high population density within Guangzhou. The denser traffic stations and restaurants in these four districts further strengthen population flow and interaction. Apart from external traffic stations, newly confirmed COVID-19 cases after the outbreak are mainly concentrated in these four districts, which indicates that the density of the permanent population is one of the most important factors influencing the transmission of the epidemic. In addition, the high-risk areas, including those in Haizhu District, are all urban villages with dense residential areas and relatively underdeveloped facilities.
There is also an obvious increase in confirmed imported cases at external traffic stations such as Baiyun International Airport and Guangzhou South Railway Station, where population mobility is the main factor behind the higher epidemic risk level. Previous studies show that the main modes of spread are population aggregation and large-scale population flow. Although COVID-19 was first reported in Wuhan, the lockdown of Wuhan and the restrictions on population mobility implemented by the Chinese government cut off the route of transmission quickly and brought the epidemic under effective control. This is a clear example showing that restricting population flow can indeed help prevent and control the epidemic.
The infection risk of COVID-19 increases greatly if people are exposed to a dangerous environment for a long time. The “home quarantine” policy was resolutely implemented after the outbreak of COVID-19 in China; it not only brought population mobility across the country under control in a short time, but also promptly reduced the population density in public areas. Although this policy has effectively reduced the epidemic risk, public areas with higher population density and a larger floating population are still high-risk areas at present.
3.3. Verification of COVID-19 Risk Level in Wuhan
Although the confusion matrix and ROC curve can support the correctness of the model and results of this study, the number of people infected with COVID-19 in Guangzhou is only 349, while the total population of Guangzhou is 15.3059 million, so the results may be subject to a certain degree of randomness. Therefore, Wuhan, the city with the most COVID-19 patients in China, was selected as the verification area. Because Wuhan has one of the largest numbers of COVID-19 infections in China, many studies on its COVID-19 risk have been carried out. The correctness of this study can thus be judged by comparing the COVID-19 risk calculated here with that calculated by other relevant studies.
By the end of April 2020, a total of 50,333 people had been infected with COVID-19 in Wuhan. The evaluation results for Wuhan can therefore directly demonstrate the correctness of this study. By collecting urban spatial element data for Wuhan and feeding it into the model, the risk distribution of COVID-19 in Wuhan is obtained, as shown in Figure 10.
The distribution of the COVID-19 risk level in Wuhan shows that high-risk areas were mainly concentrated in Wuchang District, Jianghan District, Qiaokou District and Hongshan District, which the Google map of Wuhan shows to be old urban areas with a more concentrated permanent population. Moreover, these high-risk areas have a relatively dense floating population and relatively dense public facilities. This result is consistent with assessments of Wuhan’s epidemic risk made with other methods and from other research perspectives [76,77,78], which further supports the correctness of the results of this study.
A comparison of the results for Guangzhou and Wuhan shows that, although there is no significant difference in the number of permanent residents between the two cities, the number of infected cases in Wuhan is far greater than that in Guangzhou, and the risk level of COVID-19 differs between regions. The analysis of the data, methods and results of this study further verifies the accuracy of the assessed risk level distributions in Guangzhou and Wuhan. It indicates that, although the sample size of COVID-19 cases affects the risk level value of individual regions (the larger the sample size, the higher the risk level), it does not affect the distribution of the epidemic risk level across the entire region. Therefore, large cities where the epidemic is more prevalent and more patients are infected can also use the method proposed in this study to assess the risk level of COVID-19.
3.3.1. Verification of Confusion Matrix
Accuracy verification is important for evaluating how well the logistic regression model assesses the COVID-19 risk level. Here it is used to verify the distribution of epidemic risk level in Guangzhou and Wuhan shown in Figure 9 and Figure 10. The confusion matrix results are shown in Table 3: the precision for risk areas and non-risk areas is 0.892 and 0.996, respectively, with Kappa values of 0.806 and 0.811, which indicates that the logistic regression model assesses the epidemic risk level with high accuracy.
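For readers unfamiliar with these measures, the sketch below shows how per-class precision and Cohen's Kappa are derived from a confusion matrix. The matrix layout (rows are actual classes, columns are predicted classes), the function names and the counts are illustrative assumptions and are not the values behind Table 3.

```python
def precision_per_class(matrix):
    """matrix[i][j] = number of samples of actual class i predicted as class j."""
    k = len(matrix)
    result = []
    for j in range(k):
        predicted_j = sum(matrix[i][j] for i in range(k))
        result.append(matrix[j][j] / predicted_j if predicted_j else 0.0)
    return result


def cohen_kappa(matrix):
    """Agreement between prediction and truth, corrected for chance agreement."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / total
    expected = sum(
        sum(matrix[i]) * sum(matrix[r][i] for r in range(k)) for i in range(k)
    ) / (total * total)
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts: class 0 = risk area, class 1 = non-risk area.
example = [[45, 5],
           [10, 40]]
precisions = precision_per_class(example)
kappa = cohen_kappa(example)
```

Kappa is reported alongside precision because it discounts the agreement that would be expected by chance when one class dominates the samples.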
3.3.2. Verification of ROC (Receiver Operating Characteristic) Curve
ROC curve verification evaluates the prediction accuracy of the logistic regression model comprehensively by means of the AUC (Area Under Curve) value; the closer the value is to 1, the higher the prediction accuracy of the model. Figure 11 and Figure 12 show that the AUC values of the training sample, test sample and overall data are 0.99, 0.99 and 0.99, respectively. The values are all close to 1, which indicates that the assessment is highly accurate and that the logistic regression model can accurately assess the risk distribution of COVID-19.
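As a minimal sketch of what the AUC value measures, the function below uses the rank interpretation of AUC: the probability that a randomly chosen risk sample receives a higher predicted score than a randomly chosen non-risk sample, with ties counted as one half. The function name and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = risk area, 0 = non-risk area, scores = predicted probabilities.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.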
3.4. Analysis of Influence Factor
1. Risk Factor Detector
The factor detection results in Figure 7 show that the density of the permanent population is the most crucial factor determining the risk level of COVID-19, followed by population mobility, which is consistent with the assessment of the epidemic risk level. This also indicates that the most efficient anti-epidemic measure is to avoid population mobility and aggregation through home quarantine. The densities of traffic stations, living markets and restaurants have similar degrees of influence on the distribution of COVID-19, all lower than those of permanent population density and population mobility. The reason is that the density of traffic stations, restaurants and living space can affect the epidemic risk only through the people who use them. As long as the population density in public areas is controlled, the risk of the epidemic will decrease. This further shows that a rational anti-epidemic measure is to reduce population flow and interaction so that people are not exposed to public environments in areas where the population gathers. During the epidemic period, people’s demand for daily living is the highest, followed by catering and transportation. Different demands are also reflected in different risk level distributions, as shown in the risk factor detector table (Table 4).
The factors with the lowest influence on the risk level of COVID-19 are the densities of hotels and fever clinics, for the following reasons. On the one hand, the decrease in the number of people going out during the epidemic directly reduces the number of people staying in hotels, which makes the spread of COVID-19 more difficult. Moreover, even where hotels are designated as isolation hotels, the epidemic risk level is reduced by the epidemic prevention measures in place. On the other hand, wherever infected people appear, they can always be sent to a fever clinic for timely treatment.
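The influence degree reported by the factor detector is conventionally expressed as the geodetector q-statistic, q = 1 - (sum over strata of N_h * sigma_h^2) / (N * sigma^2), where the strata are the classes of a discretized factor. The sketch below assumes this standard formulation; the function name and toy data are hypothetical and do not reproduce the values in Table 4.

```python
from collections import defaultdict


def q_statistic(values, strata):
    """Geodetector q: share of the variance of `values` explained by the stratification."""
    if len(values) != len(strata):
        raise ValueError("values and strata must have the same length")
    n = len(values)
    overall_mean = sum(values) / n
    total_ss = sum((v - overall_mean) ** 2 for v in values)  # N * sigma^2
    if total_ss == 0:
        raise ValueError("the dependent variable has no variance")
    groups = defaultdict(list)
    for v, s in zip(values, strata):
        groups[s].append(v)
    within_ss = 0.0
    for members in groups.values():
        group_mean = sum(members) / len(members)
        within_ss += sum((v - group_mean) ** 2 for v in members)  # N_h * sigma_h^2
    return 1.0 - within_ss / total_ss


# Toy example: risk values per cell and the class of one discretized factor.
risk = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
population_class = ["low", "low", "low", "high", "high", "high"]
q = q_statistic(risk, population_class)
```

A q close to 1 means the factor's strata almost fully account for the spatial variation in risk, while a q close to 0 means the factor explains almost none of it.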
2. Interaction Detector
The results of detecting the interaction between different factors are shown in Table 5. It can be found that the lowest risk level can be reached when permanent population density interacts with floating population density. With a q value of 0.71, the combined effect of the two factors is greater than that of either factor alone. This also indicates that the spread of the epidemic can be efficiently controlled and its risk effectively reduced by rationally controlling the permanent and floating populations.
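The comparison made by the interaction detector can be sketched as a simple classification of the joint q value against the two single-factor q values, following the five interaction types commonly used in geodetector studies. The function name, the tolerance and the single-factor values in the example are hypothetical; only the joint value of 0.71 comes from the text.

```python
def classify_interaction(q1, q2, q12, tol=1e-9):
    """Classify how two factors interact, given single-factor and joint q values."""
    low, high = min(q1, q2), max(q1, q2)
    if q12 > q1 + q2 + tol:
        return "nonlinear enhancement"
    if abs(q12 - (q1 + q2)) <= tol:
        return "independent"
    if q12 > high:
        return "bi-factor enhancement"
    if q12 > low:
        return "single-factor nonlinear weakening"
    return "nonlinear weakening"


# Joint q of 0.71 as reported; the single-factor values 0.5 and 0.4 are made up.
interaction = classify_interaction(0.5, 0.4, 0.71)
```

Any joint q above the larger single-factor q indicates that the two factors together explain more of the risk distribution than either does alone.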
3. Ecological Detector
With the significance level of the F test set to 0.05, the ecological detector results in Table 6 are obtained, in which Y stands for a significant difference and N for a non-significant difference. In terms of the risk distribution of COVID-19, there is a significant difference between permanent population and the other spatial factors, which indicates that the density of the permanent population is indeed the most important factor affecting epidemic risk, followed by floating population. Compared with the population factors, other factors, including supermarkets, hotels and living markets, show no significant difference in COVID-19 risk distribution. This suggests that other public places will not directly increase the epidemic risk level provided that the density of the regional permanent population is rationally controlled. In other words, population is a direct contributor to the increase in the risk of COVID-19 in public places in urban spaces.
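For reference, the ecological detector is usually formulated as an F test that compares the within-strata variance produced by two factors. The notation below follows the common geodetector formulation and is an assumption, since the text does not give the formula: N_{X1} and N_{X2} are the sample sizes of the two factors, L1 and L2 are their numbers of strata, and SSW is the within-strata sum of squares.

```latex
F = \frac{N_{X1}\,(N_{X2}-1)\,\mathrm{SSW}_{X1}}{N_{X2}\,(N_{X1}-1)\,\mathrm{SSW}_{X2}},
\qquad
\mathrm{SSW}_{X1} = \sum_{h=1}^{L1} N_h \sigma_h^2,
\qquad
\mathrm{SSW}_{X2} = \sum_{h=1}^{L2} N_h \sigma_h^2
```

Under the null hypothesis that SSW_{X1} = SSW_{X2}, F follows an F distribution with (N_{X1} - 1, N_{X2} - 1) degrees of freedom; rejecting the null at the 0.05 level corresponds to a Y entry in the detector table.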
The geodetector analysis of the factors affecting the COVID-19 risk level shows that population is the direct factor affecting the level of epidemic risk. Provided that the population is reasonably controlled, public places will not directly cause an increase in the level of epidemic risk. Therefore, it is not necessary to implement complete lockdowns or restrict the use of public spaces in large cities. As long as population density restrictions are implemented in public spaces and places with public uses, the risk level of the epidemic can be effectively controlled without affecting the operation of the city. Compared with the strict epidemic prevention measures adopted in public areas such as schools and administrative centers, restricting population density in transportation stations, restaurants and living spaces is a more convenient and effective choice. “Home quarantine” is the best epidemic prevention measure at present, because it greatly limits the movement and interaction of the population, which helps reduce population density in public space and thereby directly reduces the risk of the epidemic. In addition, there is no significant difference in the results of the Risk Factor Detector, Interaction Detector and Ecological Detector between Guangzhou and Wuhan, indicating that the factors influencing the risk level of the COVID-19 epidemic are the same within urban space, which also makes the results of this study generalizable.
4. Discussion
Using fused spatial geographical big data such as POI data and Tencent-Yichuxing data, this study assesses the epidemic risk level of Guangzhou and, by combining the logistic regression model with the geodetector model, analyzes the spatial differences among the urban spatial factors that affect the distribution of COVID-19. Moreover, the main factors affecting the epidemic risk level are obtained. The logistic regression model calculates the contribution of the different factors affecting the epidemic in urban space and then simulates the final result. In fact, fitting a logistic regression model is a process of constantly seeking the optimal solution. Compared with other machine learning models, its calculation process is simpler and its results are expressed more directly. The geodetector, in turn, detects the heterogeneity of the spatial distribution patterns of dependent and independent variables through the spatial differences between variables, and then measures the degree to which the variables explain each other. Therefore, compared with other statistical methods, the geodetector can better reflect the causal relationship between different variables.
Since the outbreak of COVID-19, research on the epidemic has mainly been carried out from the perspective of population mobility [79], whether at the regional, national or global scale. Population mobility has indeed played an important role in epidemic risk assessment, and it has been shown to be one of the most crucial factors influencing epidemic risk [7]. However, population mobility is not the only factor that leads to a higher risk of COVID-19, which is why this study assesses the epidemic risk level by comprehensively considering population, public places and other open spatial factors. Using fused spatial geographical big data such as POI data and Tencent-Yichuxing data, this study not only assesses the risk level but also explores the primary and secondary factors that affect epidemic risk and the interrelationships among these factors. In addition, the verification results show that the epidemic risk assessment conducted in this study is highly accurate.
The high-risk areas of the epidemic are mainly concentrated in areas with a denser permanent population and more frequent interaction among people [7,19], which is consistent with previous research. However, compared with previous research, this study not only takes various urban spatial factors into account to discuss the degree of influence of different factors comprehensively, but also analyzes the distribution of the epidemic risk level comprehensively and objectively, which could be instrumental in the targeted prevention and control of regional epidemics.
Compared with the existing studies on the distribution of epidemic risk [15,17], the main contribution of this study lies in its research methods and ideas. First, in terms of research methods, this study uses the geodetector to analyze the urban spatial factors that affect the risk of the epidemic, derives the primary and secondary factors, and analyzes the correlation between different factors. Second, from the geographical perspective, this study explores the distribution of the epidemic risk level in urban space, which highlights the role and influence of urban space in the spread of the epidemic. The results not only contribute to the formulation of urban epidemic prevention policies, but also play a positive role in guiding the risk prevention and control of COVID-19.
In the prevention, control and management of the epidemic, individuals, families and countries alike have made great efforts and sacrifices to defeat it. From the policy point of view, restricting population mobility undoubtedly affects how people interact, and home quarantine in particular restricts interaction among individuals, families and cities to a certain extent. A comparison of Table 4, Table 5 and Table 6 shows that population density (permanent population and floating population) is the main factor affecting the risk of the epidemic. Therefore, reducing population density can effectively reduce the risk level of the epidemic and contribute substantially to its management, as this study also demonstrates. If there were no restrictions among individuals, families and cities, the risk of the epidemic would increase sharply. Therefore, this study also has theoretical guidance value for epidemic prevention and control.
This study still has many deficiencies and much room for improvement. Although COVID-19 has been brought under effective control and people’s lives have gradually returned to normal, it is still necessary to conduct more meticulous assessment and analysis. To prevent a second outbreak of COVID-19 more efficiently, especially in winter when the virus will be more active, it is essential to carry out simulation analysis of the global pandemic.