Article

Applying PCA to Deep Learning Forecasting Models for Predicting PM2.5

1 Department of Agricultural Economics and Rural Development, Seoul National University, Gwanak-ro 1, Gwanak-gu, Seoul 08826, Korea
2 Program in Agricultural and Forest Meteorology, Research Institute of Agriculture and Life Sciences, Seoul National University, 1 Gwanangno, Gwanak-gu, Seoul 08826, Korea
* Author to whom correspondence should be addressed.
Sustainability 2021, 13(7), 3726; https://0-doi-org.brum.beds.ac.uk/10.3390/su13073726
Submission received: 5 March 2021 / Revised: 16 March 2021 / Accepted: 22 March 2021 / Published: 26 March 2021
(This article belongs to the Section Environmental Sustainability and Applications)

Abstract
Fine particulate matter (PM2.5) is one of the main air pollution problems in major cities around the world. A country’s PM2.5 concentration can be affected not only by domestic factors but also by the air quality of neighboring countries. Therefore, forecasting PM2.5, which is necessary for policies and plans, requires collecting data from outside the country as well as from within it. A data set with many variables but a relatively small number of observations can cause a dimensionality problem and limit the performance of a deep learning model. This study used five years of daily data to predict PM2.5 concentrations in eight Korean cities through deep learning models. PM2.5 data from China were collected and used as input variables, and principal component analysis (PCA) was applied to solve the dimensionality problem. The deep learning models used were a recurrent neural network (RNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM). The performance of the models with and without PCA was compared using root-mean-square error (RMSE) and mean absolute error (MAE). Applying PCA to LSTM and BiLSTM, but not to the RNN, improved performance, with decreases of up to 16.6% in RMSE and 33.3% in MAE. The results indicate that applying PCA in deep learning time series prediction can contribute to practical performance improvements, even with a small number of observations. It also provides a more accurate basis for establishing PM2.5 reduction policy in the country.

1. Introduction

Fine particulate matter (PM2.5) refers to particles with an aerodynamic diameter of 2.5 μm or less. It is not a specific chemical, such as sulfur oxides (SOx) or nitrogen oxides (NOx), but a mixture of particles of varying sizes, components, and shapes. Typical substances that form PM2.5 include elemental carbon (EC), organic carbon (OC), NOx, volatile organic compounds (VOCs), ozone (O3), ammonia (NH3), SOx, condensate particles, metal particles, and mineral particles. Because of its small size, it penetrates the body through the respiratory tract, causing inflammation or damaging organs [1]. The WHO considers PM2.5 a major environmental risk factor for cardiovascular and respiratory diseases and various cancers [2]. Figure 1 shows the effects of PM2.5 on the body [3].
Korea’s PM2.5 concentration was the highest among the 37 OECD (Organization for Economic Co-operation and Development) countries in 2019 [4], and studies have shown that it has a negative effect on people’s health. Han et al. [5] stated that 1763 early deaths in Seoul in 2015 were closely related to PM2.5. Hwang et al. [6] explained that, when the average annual concentration of PM2.5 in Seoul increases by 10 μg/m³, the risk of death for people over 65 years of age increases by 13.9%. This is in line with the major causes of death for Koreans in 2019. Statistics Korea shows that cancer (158.2 deaths per 100,000 people), cardiovascular diseases (60.4 deaths per 100,000 people), and pneumonia (45.1 deaths per 100,000 people) are the three major causes of death [7]. This suggests that PM2.5 is highly correlated with the main causes of death for Koreans.
The Korean government is making great efforts to reduce PM2.5 concentration to protect people’s health. The government has divided the crisis into three stages according to the current status and prediction of PM2.5 concentration and has devised a manual for local governments for each stage of action. The government also aims to reduce the annual average concentration of PM2.5 by 35% compared to 2016 by establishing a five-year plan for PM2.5 concentration reduction. To achieve this purpose, the government selected 15 major tasks by evaluating its potential reduction, cost effectiveness, linkage with other policies, and social impact. These tasks are implemented by each local government [8].
Table 1 shows Korea’s crisis stage standard for PM2.5 concentration, which reflects both the current concentration of PM2.5 and future forecast values. This means that accurate prediction of PM2.5 concentration is needed in both the short and long terms. In this regard, several studies have conducted air quality prediction using deep learning methods with domestic data (wind speed, NO2, SO2, temperature, etc.) in Korea, and new deep learning models have been developed that show high performance in air quality prediction [9,10]. However, foreign factors should also be considered in predicting PM2.5 concentration in Korea, as the concentration of PM2.5 in the Shandong region of China has been found to affect Korea’s PM2.5 concentration [11]. Because China’s historical PM2.5 concentration data are composed of daily data, Korea’s data must also be organized on a daily basis for deep learning PM2.5 prediction. This data composition can cause a “curse of dimensionality” due to the small number of observations relative to the number of variables, which can reduce the performance of the model.
This study aims to show that the application of principal component analysis (PCA) in the deep learning time series prediction models for PM2.5—a recurrent neural network (RNN), long short-term memory (LSTM), and bidirectional LSTM (BiLSTM)—can result in better performance by comparing the root-mean-square error (RMSE) and mean absolute error (MAE) with the same models without PCA application.

2. Previous Research

Several studies have shown the association of PM2.5 with lung and cardiovascular disease (CVD). Wang et al. [12] reported that CVD is one of the main mortality factors among elderly people and found that ambient PM2.5 concentration is related to several CVDs, linking PM2.5 exposure and CVD through multiple pathophysiological mechanisms. César et al. [13] showed that exposure to PM2.5 can cause hospitalizations for pneumonia and asthma in children younger than 10 years of age, using an ecological time series study and a generalized additive model of Poisson regression. Kim et al. [14] reported associations of short-term PM2.5 exposure with acute upper respiratory infection and bronchitis among children aged 0–4 years, using a difference-in-differences approach generalized to multiple spatial units (regions) and time periods (days) with distributed lag non-linear models. Vinikoor-Imler et al. [15] studied the relationship between PM2.5 concentration, lung cancer incidence, and mortality by linear regression and concluded that there may be an association between them. Choe et al. [16] estimated the effect of changes in PM2.5 emissions on the probabilities of outpatient visits and hospitalization due to respiratory diseases through Probit and Tobit models: when PM2.5 emissions change by 1%, the probability of a visit due to respiratory disease increases by 0.755% to 1.216%, and the probability of hospitalization increases by 0.150% to 0.197%.
The need for PMx prediction research is growing, and various studies on PMx prediction are underway. Ross et al. [17] developed a land use regression model to predict PM2.5 in New York City and showed that urbanization factors such as traffic volume and population density have high explanatory power in predicting PM2.5. Beelen et al. [18] compared the performance of ordinary kriging, universal kriging, and regression mapping in developing EU-wide maps of air pollution and showed that universal kriging performs better in mapping NO2, PM10, and O3. Singh et al. [19] suggested a cokriging-based approach and interpolated PM10 in areas not observed by the monitoring network, using secondary variables from the results of a deterministic chemical transport model (CTM) simulation. The results showed that the proposed method provides flexibility in collecting ultrafine dust data.
Other studies have shown examples of predicting PM2.5 through machine learning and deep learning. Zhao et al. [20] predicted the PM2.5 contamination of stations in Beijing using long short-term memory—fully connected (LSTM-FC), LSTM, and an artificial neural network (ANN) with historical air quality data, meteorological data, weather forecast data, and the day of the week data. They showed that the LSTM-FC model outperforms LSTM and the ANN, with MAE = 23.97–50.13 and RMSE = 35.82–69.84 over 48 h. Karimian et al. [21] also predicted Tehran’s PM2.5 concentration by implementing multiple additive regression trees (MARTs), a deep feedforward neural network (DFNN), and a new hybrid model LSTM with meteorological data (temperature, surface-level pressure, relative humidity, etc.). The best model in this research was LSTM in 12, 24, and 48 h prediction, with RMSE = 7.03–11.73 μg/m3 and MAE = 5.59–8.41 μg/m3. Qadeer et al. [22] used XGBoost (XGB), the light gradient boosting machine (LGBM), the gated recurrent unit (GRU), convolutional neural network–LSTM (CNNLSTM), BiLSTM, and LSTM to predict PM2.5 concentration of eight sites in Seoul and Gwangju with community multiscale air quality (CMAQ) data. The result showed that LSTM performs best, with MAE = 3.5847 μg/m3, RMSE = 4.8292 μg/m3, R = 0.8989, and IA = 0.9368 of the mean in all sites.
The RNN, LSTM, and BiLSTM models were used in this study because previous studies have shown that deep learning sequence models perform better in prediction. As in previous studies, local weather and air quality data were used as predictive input variables, together with regional data from China, which have been found to affect PM2.5 in Korea.

3. Data

3.1. Spatial Area

Figure 2 shows the spatial range of the research. A total of eight cities in Korea were selected for analysis. Of the eight cities, six are metropolitan cities (Busan, Daejeon, Daegu, Gwangju, Incheon, and Ulsan) representing each province, one is the capital city (Seoul), and one is the most populous city (Wonju) in the province without a metropolitan city. In each city, daily air quality data (PM2.5, SO2, O3, NO2, and CO) [23] and meteorological data (temperature, wind speed, wind direction, humidity, precipitation, etc.) [24] were collected in consideration of the internal factors of PM2.5 generation. Air quality data were collected within 5 km of each city’s meteorological data observatory.
Figure 3 shows that the prevailing winds in Korea are mainly northerly and westerly. As a result, the air quality of Korea can be directly and indirectly affected by the air quality of China, located to the west and north. Figure 4 [25] also shows the concentration of PM2.5 in Korea and China at the same times before and after the outbreak of COVID-19. According to Bao et al. [26], the lockdown of Chinese factories after the COVID-19 outbreak actually improved Chinese air quality. Considering this together with the wind direction in Korea, we can see that the air quality of Korea is strongly affected by the air quality in China. Accordingly, daily PM2.5 concentrations in 55 areas of China close to Korea were selected as input variables in this study, including the PM2.5 concentration in Shandong province, which was found to increase PM2.5 concentration in Korea.

3.2. Data Preprocessing

All variables have a time range from 1 January 2015 to 31 December 2019 and are collected as daily data. There are missing values in some variables, and these missing values were processed by the exponentially weighted moving average (EWMA) using the imputeTS package of the R software [27]. The EWMA gives higher weights to the latest data, reducing the weight of older values, and the formula for EWMA imputation suggested by Hunter [28] is as follows:
$$\hat{S}_t = \hat{S}_{t-1} + \alpha e_{t-1}$$
$$= \hat{S}_{t-1} + \alpha (S_{t-1} - \hat{S}_{t-1})$$
$$= \alpha S_{t-1} + (1 - \alpha)\,\hat{S}_{t-1}$$
$$= \alpha \sum_{k=1}^{t-2} (1 - \alpha)^{k-1} S_{t-k} + (1 - \alpha)^{t-2} S_2$$
where $\alpha = \frac{2}{n+1}$, $n$ is the moving average period, $k \in \{1, 2, \ldots\}$, and $t \geq 2$.
$\hat{S}_t$ is the predicted value at time $t$, $S_t$ is the observed value at time $t$, $e_t$ is the observed error at time $t$, and $\alpha$ is a constant weight between zero and one. The higher the $\alpha$ value, the less the estimate reflects past data.
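The study performed this imputation with the imputeTS package in R; as a rough illustration of the recursion above, a minimal Python sketch might look as follows (the window length n and the handling of leading missing values are assumptions, not taken from the paper):

```python
import math

def ewma_impute(series, n=4):
    """Fill missing values (NaN) with the running exponentially weighted
    moving average of the preceding observations:
    S_hat_t = alpha * S_{t-1} + (1 - alpha) * S_hat_{t-1}."""
    alpha = 2.0 / (n + 1)            # weight derived from the moving-average period n
    filled = list(series)
    s_hat = None                     # running EWMA estimate
    for t, value in enumerate(filled):
        if math.isnan(value):
            if s_hat is not None:
                filled[t] = s_hat    # impute with the current EWMA estimate
        else:
            # update the estimate with the newly observed value
            s_hat = value if s_hat is None else alpha * value + (1 - alpha) * s_hat
    return filled
```

With n = 3 the weight is α = 0.5, so each imputed value halves the influence of older observations, matching the property that higher α discounts past data more strongly.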
Figure 5 and Figure A1, Figure A2 and Figure A3 show the concentration of PM2.5 in China (Figure 5a), Seoul (Figure 5b) with air quality, and the meteorological data of Seoul (Figure A1, Figure A2 and Figure A3). Each variable shows the values in a different range due to the differences in units of measurement and the characteristics within the region. In the case of Chinese data, the concentration of PM2.5 in each city over time seems to be constant, but some cities have outliers. If one variable has a relatively greater value, or a wider range of values than the others, in the composition of the data, it can result in a significant impact on the predicted value, regardless of the predictive importance of the variable.
To solve these problems, the range of the variables should be adjusted through normalization. In this study, min–max normalization was applied to all data of each city, as shown in the following equation:
$$\text{Normalized value} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Because the wind direction data were collected as 16 cardinal points, they were label-encoded to transform the direction data into numerical data.
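As an illustration, the min–max scaling and the label encoding of the 16 cardinal wind directions can be sketched as follows (the ordering of the direction labels is an assumption; any consistent mapping to integers would serve):

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant series: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Label-encode the 16 cardinal wind directions to integers 0-15
# (clockwise from north; the ordering here is illustrative).
CARDINAL_16 = ["N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
               "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW"]
DIRECTION_CODE = {d: i for i, d in enumerate(CARDINAL_16)}
```

Normalization is applied per variable and per city, so each input feature falls in the same [0, 1] range regardless of its original units.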

3.3. Variable Correlation Analysis

As mentioned above, the prediction target of this study is the concentration of PM2.5. The efficiency of forecast results in deep learning and machine learning depends on the correlation between the dependent and independent variables. It is important to add independent variables with a strong negative or positive correlation with the dependent variable. In addition, correlation results are useful for data analysis because they provide a basis for determining the influence of each independent variable on the dependent variable. In this study, the Pearson correlation coefficient, which is expressed through the covariance and standard deviations of the variables, was calculated for the observation vector $X = (X_1, X_2, X_3, \ldots, X_n)$ as shown in the following equations:
$$\text{Correlation Matrix} = \begin{bmatrix} \dfrac{\sum (X_1 - \bar{X}_1)(X_1 - \bar{X}_1)}{\sqrt{\sum (X_1 - \bar{X}_1)^2} \sqrt{\sum (X_1 - \bar{X}_1)^2}} & \cdots & \dfrac{\sum (X_1 - \bar{X}_1)(X_n - \bar{X}_n)}{\sqrt{\sum (X_1 - \bar{X}_1)^2} \sqrt{\sum (X_n - \bar{X}_n)^2}} \\ \vdots & \ddots & \vdots \\ \dfrac{\sum (X_n - \bar{X}_n)(X_1 - \bar{X}_1)}{\sqrt{\sum (X_n - \bar{X}_n)^2} \sqrt{\sum (X_1 - \bar{X}_1)^2}} & \cdots & \dfrac{\sum (X_n - \bar{X}_n)(X_n - \bar{X}_n)}{\sqrt{\sum (X_n - \bar{X}_n)^2} \sqrt{\sum (X_n - \bar{X}_n)^2}} \end{bmatrix}$$
$$= \begin{bmatrix} \dfrac{\mathrm{Cov}(X_1, X_1)}{\sqrt{\mathrm{Var}(X_1)} \sqrt{\mathrm{Var}(X_1)}} & \cdots & \dfrac{\mathrm{Cov}(X_1, X_n)}{\sqrt{\mathrm{Var}(X_1)} \sqrt{\mathrm{Var}(X_n)}} \\ \vdots & \ddots & \vdots \\ \dfrac{\mathrm{Cov}(X_n, X_1)}{\sqrt{\mathrm{Var}(X_n)} \sqrt{\mathrm{Var}(X_1)}} & \cdots & \dfrac{\mathrm{Cov}(X_n, X_n)}{\sqrt{\mathrm{Var}(X_n)} \sqrt{\mathrm{Var}(X_n)}} \end{bmatrix}$$
Each element in the correlation matrix has a value between −1 and 1; a value greater than 0 indicates a positive correlation and a value less than 0 a negative correlation. The correlation matrix is symmetric, and all diagonal elements of the matrix are 1, since $\mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i)$, $i \in \{1, 2, \ldots, n\}$.
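The matrix above can be computed directly from an observations-by-variables array; a minimal NumPy sketch (not the authors' code, just an illustration of the definition):

```python
import numpy as np

def pearson_correlation_matrix(X):
    """Compute the Pearson correlation matrix of an (observations x variables)
    array: Corr_ij = Cov(X_i, X_j) / sqrt(Var(X_i) * Var(X_j))."""
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix
    sd = np.sqrt(np.diag(cov))              # per-variable standard deviations
    return cov / np.outer(sd, sd)           # normalize covariances to correlations
```

The result is symmetric with a unit diagonal, as noted above, and agrees with `numpy.corrcoef` on the same data.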
Figure 6 is a visualization of the correlation between PM2.5 concentrations and the eight most strongly correlated factors within Seoul, Korea. Appendix A Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8 and Table A9 show the correlation between PM2.5 concentrations and the meteorological and air quality factors of each city in Korea. Overall, the factors that have a strong positive correlation with PM2.5 are the air quality factors, except for O3. PM2.5 also appears to have a positive correlation with local air pressure (LAP), sea-level pressure (SP), wind direction, and relative humidity. Conversely, temperature, wind speed, O3, wind flow sum (the distance that the air flows; the Korea Meteorological Administration produces a day-to-day, 24 h wind flow sum), and daily precipitation were found to have a negative correlation with PM2.5 concentrations. However, the variables that have a relatively weak correlation with PM2.5 changed the sign of their correlation depending on the region.
Figure 7 shows an origin–destination map of PM2.5 correlations between Chinese [29] and Korean cities. The correlations between PM2.5 concentrations in each Chinese city and PM2.5 concentrations in each Korean city vary, but as shown in Table 2, an overall correlation between 0.13 and 0.55 is shown. Comparing this with the factors inside the Korean cities, we can see that the PM2.5 concentration of each city in China is as much related with the PM2.5 concentration in Korea as the data of air quality inside the city. This suggests that China’s PM2.5 concentration could be an important independent variable in predicting PM2.5 concentrations in Korea.

4. Analytical Methods

4.1. PCA

PCA reduces dimensions through linear combinations of variables with high explanatory power for the overall data variability, explaining variation in high-dimensional data in fewer dimensions. A vector with $p$ variables can have a total of $p$ principal components, and the principal components of a vector $x$ $(1 \times p)$, whose covariance matrix is $\Sigma$ $(p \times p)$, can be generated as follows:
$$PC = a^{T} x = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p \quad (9)$$
$$\mathrm{Var}(a^{T} x) = a^{T} \mathrm{Var}(x)\, a = a^{T} \Sigma a \quad (10)$$
$$L = a^{T} \Sigma a - \lambda (a^{T} a - 1) \quad (11)$$
$$\frac{\partial L}{\partial a} = 2 \Sigma a - 2 \lambda a = 0 \quad (12)$$
$$\Sigma a = \lambda a \quad (13)$$
$$\mathrm{Var}(PC) = a^{T} \Sigma a = a^{T} (\lambda a) = \lambda \quad (14)$$
$$PC_i = a_i^{T} x = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{ip} x_p \quad (15)$$
$$\mathrm{Var}(PC_i) = \lambda_i \quad (16)$$
Because the principal component is a linear combination of $X$, it can be expressed as Equation (9), and the variance of this linear combination can be expressed as Equation (10). PCA has to preserve the variance of the original data as much as possible, so Equation (10) should be maximized. Therefore, generating principal components can be transformed into the problem of obtaining the vector $a$ $(p \times 1)$ that maximizes $a^{T} \Sigma a$ under the condition $a^{T} a = 1$. Equation (11) is derived by applying Lagrange's multiplier method to Equation (10); partially differentiating Equation (11) with respect to $a$ gives Equation (12), which yields Equation (13). Equation (13) shows that $\lambda$ is an eigenvalue of $\Sigma$ and $a$ is the corresponding eigenvector of $\Sigma$. As a result, the linear combination that maximizes Equation (10), i.e., the principal component, can be expressed as Equation (9), and Equation (10), the variance of the principal component, equals $\lambda$ under the condition $a^{T} a = 1$. Therefore, for a vector with $p$ variables, the $i$-th principal component is given by Equation (15), and its variance by Equation (16). Subsequently, the number of principal components is conventionally chosen so that their cumulative variance exceeds 80% to 90% of the total variance; that is, to select $i$ principal components out of $p$, the ratio in Equation (17) has to exceed 80% to 90%:
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_i}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \cdots + \lambda_p} \quad (17)$$
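The eigendecomposition route of Equations (9)–(17) can be sketched in NumPy as follows; this is an illustrative implementation, not the one used in the study, and the 0.8 variance target mirrors the lower end of the 80–90% convention:

```python
import numpy as np

def pca_components(X, variance_target=0.8):
    """Eigendecompose the covariance matrix (Equation (13)) and keep the
    leading principal components whose cumulative eigenvalue share
    (Equation (17)) first reaches variance_target."""
    Xc = X - X.mean(axis=0)                     # center the data
    cov = np.cov(Xc, rowvar=False)              # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()  # cumulative explained variance
    k = int(np.searchsorted(ratio, variance_target) + 1)
    scores = Xc @ eigvecs[:, :k]                # projected principal components
    return scores, ratio[:k]
```

On data with strongly correlated columns, a single component typically passes the threshold, which is how the study compressed its input variables to five components per city.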

4.2. RNN

The RNN is a deep learning model for processing sequence data, such as stock charts [30], music [31], and natural language [32]. It remembers the state entered at the previous time point (t − 1) through the hidden layer and passes the hidden layer state at the current time point (t) to the next time point (t + 1). That is, the state at the previous time point affects the state at the present time point, and the state at the present time point affects the state at the next time point. This procedure is repeated until the output values are optimized; hence the name “recurrent neural network.”
$$h_{t-1} = \tanh(W_{hh} h_{t-2} + W_{xh} x_{t-1} + b_h) \quad (18)$$
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \quad (19)$$
$$\hat{y}_t = W_{hy} h_t + b_y \quad (20)$$
$$L_t = \mathrm{MSE} = \frac{\sum (y_t - \hat{y}_t)^2}{n} \quad (21)$$
Figure 8b shows the unrolled inner structure of Figure 8a. In Equations (18)–(20), $x_t$ is an input and $h_t$ is a hidden state at time $t$. $W_{ij}$ is the weight from layer $i$ to layer $j$, and $b_i$ is the bias in each layer. In Equation (21), $L_t$ is the loss at time $t$, and $y_t$ and $\hat{y}_t$ are the actual and predicted values, respectively, at time point $t$.
The RNN model shares the weights and biases across all time points and circulates the input data to output the results. Model training is repeated until the loss value is minimized by gradient descent on the loss function, using information from previous time steps. At the same time, the weights are updated to find the optimal values. This is called backpropagation through time (BPTT), which for an RNN can be expressed as follows [33]:
$$W_{xh}^{\text{updated}} = W_{xh}^{\text{existing}} - \eta \sum_{t=1}^{n} \sum_{k=0}^{n} \frac{\partial L_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_{xh}} \quad (22)$$
$$W_{hh}^{\text{updated}} = W_{hh}^{\text{existing}} - \eta \sum_{t=1}^{n} \sum_{k=0}^{n} \frac{\partial L_t}{\partial \hat{y}_t} \frac{\partial \hat{y}_t}{\partial h_t} \frac{\partial h_t}{\partial h_k} \frac{\partial h_k}{\partial W_{hh}} \quad (23)$$
$$W_{hy}^{\text{updated}} = W_{hy}^{\text{existing}} - \eta \sum_{t=1}^{n} \frac{\partial L_t}{\partial W_{hy}} \quad (24)$$
where $\eta$ is the learning rate, $\eta \in [0, 1]$.
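As an illustration of the shared-weight recursion that BPTT differentiates through, a minimal NumPy forward pass of Equations (18)–(20) might look as follows (layer sizes are arbitrary; this is a sketch, not the study's implementation):

```python
import numpy as np

def rnn_forward(x_seq, Wxh, Whh, Why, bh, by):
    """Unrolled RNN forward pass: h_t = tanh(Whh h_{t-1} + Wxh x_t + bh),
    y_t = Why h_t + by, with the same weights shared at every time step."""
    h = np.zeros(Whh.shape[0])                 # initial hidden state
    outputs = []
    for x_t in x_seq:
        h = np.tanh(Whh @ h + Wxh @ x_t + bh)  # hidden state carries history
        outputs.append(Why @ h + by)           # prediction at time t
    return np.array(outputs), h
```

Because tanh bounds every hidden unit to (−1, 1), repeated multiplication of its derivative through Equations (22)–(23) is what shrinks the gradient over long sequences, motivating the LSTM discussed next.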

4.3. LSTM and BiLSTM

In an RNN, tanh is used as an activation function to train the model in a non-linear way. However, the RNN’s BPTT suffers from a long-term dependency problem caused by the “vanishing gradient”: the gradient (weight update rate) disappears as the derivative of the tanh function with respect to $h_t$, a value less than 1, is multiplied repeatedly. Thus, the state at a relatively distant past time point has almost no effect on the output at the present time point. As a result, the model relies only on short-term data and is limited in achieving the best performance. To solve this problem, Hochreiter et al. [34] suggested the LSTM model. Figure 9 shows the internal structure of LSTM and its process.
LSTM is a model in which the forget gate ($f_t$), the input gate ($i_t$), the inner cell state candidate ($\tilde{C}_t$), the inner cell state conveyed at time point $t$ ($C_t$), and the output gate ($o_t$) are added to the RNN model. In particular, $C_t$, which penetrates all time points, greatly contributes to solving the long-term dependency problem. The order of each part and the internal algorithm can be explained by the following process:
$$f_t = \sigma(W_{xh}^{(f)} x_t + W_{hh}^{(f)} h_{t-1} + b_h^{(f)}) \quad (25)$$
$$i_t = \sigma(W_{xh}^{(i)} x_t + W_{hh}^{(i)} h_{t-1} + b_h^{(i)}) \quad (26)$$
$$\tilde{C}_t = \tanh(W_{xh}^{(\tilde{C})} x_t + W_{hh}^{(\tilde{C})} h_{t-1} + b_h^{(\tilde{C})}) \quad (27)$$
$$C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \quad (28)$$
$$o_t = \sigma(W_{xh}^{(o)} x_t + W_{hh}^{(o)} h_{t-1} + b_h^{(o)}) \quad (29)$$
$$h_t = o_t \circ \tanh(C_t) \quad (30)$$
where $\circ$ denotes the Hadamard product and $\sigma$ is the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$.
Equation (25), the output of the forget gate, determines whether the historical state is forgotten based on the combination of $x_t$ and $h_{t-1}$. The output of this step is converted to a number between 0 and 1 by the sigmoid function and multiplied by $C_{t-1}$ (the memory of past data, i.e., the historical state) to determine how much past data to preserve or forget: a value of 0 means forgetting, and 1 means memorizing past data. Equations (26) and (27) are involved in storing the inner cell state at time point $t$. Equation (26), the output of the input gate, determines how much of the data at time point $t$ is memorized; it takes a value between 0 and 1 indicating the degree of memorization of the new information. At the same time, Equation (27) generates the inner cell state candidate at time point $t$. Equation (28) generates the new cell state at time point $t$ and passes it on to the LSTM cell at the next time point (t + 1). In other words, LSTM solves the RNN’s long-term dependency problem by balancing memorization and forgetting of the past and present states through Equations (25)–(28). Finally, the output is decided by Equations (29) and (30). Equation (29), the output of the output gate, decides which part of the new cell state becomes output. The new cell state is transformed by the tanh function and combined with the result of Equation (29) to produce the final output at time point $t$, as shown in Equation (30).
BiLSTM is a variant of the bidirectional RNN proposed by Schuster et al. [35]. Figure 10 shows an example of applying the bidirectional approach to sentence learning. If (A) is given to the model and “went” is set as the target, (B) predicts in a forward direction, while (C) predicts in both forward and backward directions. Whereas LSTM uses only the historical state to predict the value at time point t, bidirectional LSTM predicts the value at time point t by adding an LSTM layer that reads data from the future state. The computations within the model are the same as those of LSTM, and both LSTM and BiLSTM update their weights during training in the same way as an RNN [36].
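A minimal numerical sketch of one LSTM step (Equations (25)–(30)) and of reading a sequence in both directions, as BiLSTM does, follows; the gate parameters W, U, b are assumed to be supplied (e.g., randomly initialized), and for simplicity both directions share parameters here, whereas a real BiLSTM learns separate forward and backward weights:

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, the gate activation in Equations (25), (26), (29)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by gate name
    ('f' forget, 'i' input, 'c' candidate, 'o' output)."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate state
    c = f * c_prev + i * c_tilde                                # new cell state
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    h = o * np.tanh(c)                                          # new hidden state
    return h, c

def bilstm_last_states(x_seq, W, U, b, hidden):
    """Run the sequence forward and backward and concatenate the final
    hidden states, mirroring how BiLSTM reads past and future context."""
    h_f = np.zeros(hidden); c_f = np.zeros(hidden)
    h_b = np.zeros(hidden); c_b = np.zeros(hidden)
    for x_t in x_seq:                 # forward pass over the sequence
        h_f, c_f = lstm_step(x_t, h_f, c_f, W, U, b)
    for x_t in x_seq[::-1]:           # backward pass over the reversed sequence
        h_b, c_b = lstm_step(x_t, h_b, c_b, W, U, b)
    return np.concatenate([h_f, h_b])
```

The elementwise products implement the Hadamard product of Equation (28), and the additive cell-state update is what lets gradients survive over long horizons.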

4.4. Evaluation Model Performance

In this study, MAE (Mean Absolute Error) and RMSE (Root Mean Square Error) were used as evaluation indicators to compare the performance of each model with and without PCA application. The calculations of each indicator are expressed as follows:
$$\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} |y_t - \hat{y}_t|$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (y_t - \hat{y}_t)^2}$$
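These two indicators translate directly into code; a minimal sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_t - y_hat_t|."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def rmse(y_true, y_pred):
    """Root-mean-square error: sqrt of the average squared residual."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
```

Because RMSE squares the residuals before averaging, it penalizes large misses (such as peak PM2.5 days) more heavily than MAE does, which is why the study reports both.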

4.5. Workflow

The flow of this study is divided into four stages: data collection, data preprocessing, prediction, and evaluation (Figure 11). PCA is applied in the data preprocessing stage to reduce the number of variables and increase the performance of model predictions. Thus, the data preprocessing stage was divided into two cases: Case 1 is a prediction without PCA, and Case 2 is a prediction with PCA. Afterwards, the two cases were compared using the evaluation indicators (MAE and RMSE).

5. Results

5.1. PC Selection

For each city, PCA was performed on the input variables, except the dependent variable, PM2.5. The variance of each city’s data was explained by a relatively small number of principal components, which resulted in the selection of five principal components in all cities. This reduced the number of input variables to about 1/16. Table A10, Table A11, Table A12, Table A13, Table A14, Table A15, Table A16 and Table A17 show the results of the PCA of each city, and Table 3 shows how much the five principal components describe the overall variation of each city.

5.2. Setup and Case Comparison

China’s daily PM2.5 concentrations and Korea’s air quality and meteorological data from 1 January 2015 to 31 December 2019 were used to predict the concentration of PM2.5 in eight Korean cities. In total, 85% of the collected data were allocated to the train set and 15% to the test set. Regarding model details, the three models have 256 units per layer, a tanh activation function, 200 epochs, a batch size of 64, and the adaptive moment estimation (ADAM) optimizer [37]. To avoid overfitting, 30% of the train set was designated as a validation set, and a 30% dropout was used between the input layer and the output layer. Additionally, EarlyStopping, one of the Keras callback functions, was applied to stop training at the epoch where optimal learning was reached within the 200 epochs.
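The early-stopping behavior can be illustrated with a simple patience rule in plain Python (the patience value below is hypothetical, as the paper does not report its exact callback settings; Keras's EarlyStopping additionally restores the best weights if asked to):

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the epoch at which training would stop under a simple
    early-stopping rule: halt once the validation loss has failed to
    improve for `patience` consecutive epochs."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, wait = loss, 0       # improvement: reset the counter
        else:
            wait += 1                  # no improvement this epoch
            if wait >= patience:
                return epoch           # stop here
    return len(val_losses) - 1         # ran through every epoch
```

Combined with the 30% validation split and dropout, this caps training well before the 200-epoch budget whenever the validation loss plateaus.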
Figure 12 shows the predicted and actual values of PM2.5 for each case and model in Seoul. Figure A4, Figure A5, Figure A6, Figure A7, Figure A8, Figure A9 and Figure A10 show the PM2.5 concentration prediction of each city except for Seoul. Unlike LSTM and BiLSTM, the RNN appears to have outputted average values for all time periods and shows relatively low predictive power in both Case 1 and Case 2. The RNN without PCA seems to follow the trend more and to show relatively higher performance than the RNN with PCA. However, although there are differences between cities, LSTM and BiLSTM show that they follow the trend relatively well, regardless of whether PCA is applied or not. Furthermore, it can be seen that PCA application in all cities corrects the difference between the predicted and actual values that exists if PCA is not applied. It also appears to have produced more accurate results in predicting peak values.
Table 4 and Table 5 are numerical representations of these visual results. As noted above, dimension reduction leads to relatively low RNN performance in all cities, except for Daegu in terms of the MAE. This means that, for RNNs, reducing the number of variables does not help model learning; rather, providing a large amount of information over a short period can lead to better performance, given that the model relies on short-term information. Instead of the RNN, which lacks overall accuracy, the results that should be considered are those of LSTM and BiLSTM. Unlike the RNN, applying PCA to LSTM and BiLSTM produced better RMSE and MAE results, consistent with the visual results. The order of cities by performance is as follows: Busan > Daejeon > Gwangju > Daegu > Seoul > Ulsan > Wonju > Incheon, while the order of cities by improvement in MAE and RMSE is as follows: Busan > Incheon > Gwangju > Seoul > Ulsan > Daegu > Daejeon > Wonju. LSTM showed high performance in Daejeon, Daegu, and Busan, while BiLSTM showed higher performance in the remaining cities.
The differences in performance and in performance improvement from city to city make it worthwhile to consider which characteristics of each city cause regional differences in the performance of the same model, and which model would perform better depending on regional characteristics. Such studies are expected to require multidisciplinary considerations.

6. Conclusions

Performance degradation due to the curse of dimensionality can occur in deep learning and machine learning. We proposed a PCA-applied model to solve this problem, and through performance comparison with a non-PCA model, we showed that PCA applications produce better results in deep learning time series prediction. Such a performance improvement technique can be a way to increase the efficiency of the government system by providing better forecasts as a basis for issuing crisis alerts and establishing air pollution reduction policies in the future.
As the correlation analysis shows, the concentration of PM2.5 in China has positive correlations with the concentration of PM2.5 in Korea, indicating that we have to consider China’s air pollution factors in predicting the concentration of PM2.5 in Korea. This provides justification for setting up real-time air pollution databases between the two countries through the ongoing joint research between Korea and China [38].
However, while PCA can improve model performance, the results show relatively weak predictions of the minimum and maximum PM2.5 concentrations in each city. This appears to stem from the small number of observations (daily rather than hourly). Future joint cross-border research is expected to yield better performance by collecting many more observations. Some meteorological variables in each Korean city showed relatively weak correlations with PM2.5 concentration, so it seems necessary to find variables with causal or strongly correlated relationships in fields beyond deep learning. For example, if spatial factors (spatial homogeneity, autocorrelation, etc.) of Chinese and Korean cities are added to the model as input variables, the model is expected to perform better by learning both temporal and spatial features of the data.
Future research will continue to maximize the prediction performance of deep learning models by collecting more observations and optimizing the models, while applying new algorithms and adding variables that have causal relationships with PM2.5 concentration in terms of econometrics and spatial econometrics.

Author Contributions

Both authors have contributed to the ideas of the paper. S.W.C. analyzed the data, interpreted the results, and drafted the article. B.H.S.K. provided core advice, critically revised the article in terms of content and language, and approved this version for publication. Both authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Ministry of Education (NRF-2020S1A5A2A01044582).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The open access fee was supported by the Research Institute for Agriculture and Life Sciences, Seoul National University.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Acronym list.

| Acronym | Meaning |
|---|---|
| Min Temp | Minimum temperature (°C) |
| Max Temp | Maximum temperature (°C) |
| Mean Temp | Mean temperature (°C) |
| Daily prep | Daily precipitation (mm) |
| Max inst WS | Maximum instantaneous wind speed (m/s) |
| Max inst WSD | Maximum instantaneous wind speed directions (16 cardinal points) |
| Max WS | Maximum wind speed (m/s) |
| Max WSD | Maximum wind speed directions (16 cardinal points) |
| Mean WS | Mean wind speed (m/s) |
| WFS | Wind flow sum (100 m) |
| Max freq WD | Maximum frequent wind directions (16 cardinal points) |
| Mean DP | Mean dew point (°C) |
| Mean RH | Mean relative humidity (%) |
| Mean LAP | Mean local atmospheric pressure (hPa) |
| Max SP | Maximum sea-level pressure (hPa) |
| Min SP | Minimum sea-level pressure (hPa) |
| Mean SP | Mean sea-level pressure (hPa) |
| Min RH | Minimum relative humidity (%) |
| NPC | The number of principal components |
| CV | Cumulative variance |
Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8 and Table A9 show the correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in each city.
Table A2. The correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in Seoul.

Air quality factors: O3 (ppm) −0.021; CO (ppm) 0.565; NO2 (ppm) 0.627; SO2 (ppm) 0.417.
Meteorological factors: Min Temp −0.175; Max Temp −0.185; Mean Temp −0.156; Daily prep −0.143; Max inst WS −0.196; Max inst WSD 0.09; Max WS −0.098; Max WSD 0.118; Mean WS −0.174; WFS −0.175; Max freq WD 0.041; Mean DP −0.141; Mean RH 0.013; Mean LAP 0.166; Max SP 0.17; Min SP 0.169; Mean SP 0.168; Min RH −0.047.
Table A3. The correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in Gwangju.

Air quality factors: O3 (ppm) 0.108; CO (ppm) 0.532; NO2 (ppm) 0.562; SO2 (ppm) 0.276.
Meteorological factors: Min Temp −0.223; Max Temp −0.122; Mean Temp −0.179; Daily prep −0.212; Max inst WS −0.226; Max inst WSD 0.108; Max WS −0.214; Max WSD 0.102; Mean WS −0.28; WFS −0.281; Max freq WD 0.11; Mean DP −0.2; Mean RH −0.164; Mean LAP 0.192; Max SP 0.186; Min SP 0.196; Mean SP 0.192; Min RH −0.235.
Table A4. The correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in Daegu.

Air quality factors: O3 (ppm) −0.113; CO (ppm) 0.665; NO2 (ppm) 0.702; SO2 (ppm) 0.437.
Meteorological factors: Min Temp −0.291; Max Temp −0.193; Mean Temp −0.244; Daily prep −0.157; Max inst WS −0.305; Max inst WSD 0.156; Max WS −0.317; Max WSD 0.14; Mean WS −0.373; WFS −0.374; Max freq WD 0.053; Mean DP −0.214; Mean RH −0.056; Mean LAP 0.252; Max SP 0.26; Min SP 0.253; Mean SP 0.256; Min RH −0.128.
Table A5. The correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in Daejeon.

Air quality factors: O3 (ppm) −0.086; CO (ppm) 0.535; NO2 (ppm) 0.483; SO2 (ppm) 0.41.
Meteorological factors: Min Temp −0.299; Max Temp −0.236; Mean Temp −0.272; Daily prep −0.18; Max inst WS −0.241; Max inst WSD 0.121; Max WS −0.239; Max WSD 0.1; Mean WS −0.265; WFS −0.265; Max freq WD 0.211; Mean DP −0.271; Mean RH −0.101; Mean LAP 0.271; Max SP 0.27; Min SP 0.272; Mean SP 0.272; Min RH −0.173.
Table A6. The correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in Busan.

Air quality factors: O3 (ppm) 0.029; CO (ppm) 0.32; NO2 (ppm) 0.554; SO2 (ppm) 0.366.
Meteorological factors: Min Temp −0.139; Max Temp −0.095; Mean Temp −0.119; Daily prep −0.17; Max inst WS −0.231; Max inst WSD 0.196; Max WS −0.07; Max WSD 0.249; Mean WS −0.162; WFS −0.162; Max freq WD 0.178; Mean DP −0.125; Mean RH −0.126; Mean LAP 0.102; Max SP 0.086; Min SP 0.124; Mean SP 0.104; Min RH −0.187.
Table A7. The correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in Ulsan.

Air quality factors: O3 (ppm) 0.095; CO (ppm) 0.665; NO2 (ppm) 0.667; SO2 (ppm) 0.525.
Meteorological factors: Min Temp −0.084; Max Temp 0.053; Mean Temp −0.015; Daily prep −0.167; Max inst WS −0.198; Max inst WSD 0.023; Max WS −0.166; Max WSD 0.014; Mean WS −0.318; WFS −0.319; Max freq WD −0.055; Mean DP −0.055; Mean RH −0.125; Mean LAP 0.064; Max SP 0.016; Min SP 0.051; Mean SP 0.032; Min RH −0.233.
Table A8. The correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in Wonju.

Air quality factors: O3 (ppm) −0.129; CO (ppm) 0.686; NO2 (ppm) 0.675; SO2 (ppm) 0.575.
Meteorological factors: Min Temp −0.384; Max Temp −0.339; Mean Temp −0.366; Daily prep −0.184; Max inst WS −0.187; Max inst WSD 0.171; Max WS −0.187; Max WSD 0.14; Mean WS −0.257; WFS −0.259; Max freq WD 0.077; Mean DP −0.326; Mean RH −0.018; Mean LAP 0.299; Max SP 0.318; Min SP 0.302; Mean SP 0.309; Min RH −0.077.
Table A9. The correlation coefficients of the meteorological and air quality factors with PM2.5 concentrations in Incheon.

Air quality factors: O3 (ppm) −0.102; CO (ppm) 0.621; NO2 (ppm) 0.667; SO2 (ppm) 0.559.
Meteorological factors: Min Temp −0.15; Max Temp −0.122; Mean Temp −0.142; Daily prep −0.144; Max inst WS −0.288; Max inst WSD 0.045; Max WS −0.254; Max WSD 0.049; Mean WS −0.308; WFS −0.309; Max freq WD 0.07; Mean DP −0.054; Mean RH 0.214; Mean LAP 0.143; Max SP 0.155; Min SP 0.151; Mean SP 0.149; Min RH 0.091.
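The coefficients in Tables A2–A9 are Pearson correlations between each factor and the PM2.5 series. A minimal sketch of how such coefficients are computed, using synthetic stand-in series rather than the study's data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 365  # one year of daily observations

pm25 = rng.normal(25, 10, n)                 # hypothetical PM2.5 series (ug/m3)
mean_temp = rng.normal(12, 8, n)             # unrelated factor -> r near 0
co = 0.02 * pm25 + rng.normal(0.5, 0.1, n)   # factor constructed to co-move with PM2.5

for name, series in [("Mean Temp", mean_temp), ("CO (ppm)", co)]:
    r = np.corrcoef(pm25, series)[0, 1]      # Pearson correlation coefficient
    print(f"{name}: r = {r:+.3f}")
```

In the study, the same computation would be applied column-by-column between each meteorological or air quality factor and the city's PM2.5 concentration.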
Table A10, Table A11, Table A12, Table A13, Table A14, Table A15, Table A16 and Table A17 show the results of the PCA of each city.
Table A10. The PCA result of Seoul.

| NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 82.06% | 11 | 97.88% | 21 | 98.99% | 31 | 99.49% | 41 | 99.77% | 51 | 99.92% | 61 | 99.99% | 71 | 100.00% |
| 2 | 93.11% | 12 | 98.04% | 22 | 99.06% | 32 | 99.52% | 42 | 99.79% | 52 | 99.93% | 62 | 100.00% | 72 | 100.00% |
| 3 | 94.58% | 13 | 98.19% | 23 | 99.12% | 33 | 99.56% | 43 | 99.81% | 53 | 99.94% | 63 | 100.00% | 73 | 100.00% |
| 4 | 95.71% | 14 | 98.33% | 24 | 99.18% | 34 | 99.59% | 44 | 99.82% | 54 | 99.95% | 64 | 100.00% | 74 | 100.00% |
| 5 | 96.31% | 15 | 98.45% | 25 | 99.23% | 35 | 99.62% | 45 | 99.84% | 55 | 99.96% | 65 | 100.00% | 75 | 100.00% |
| 6 | 96.69% | 16 | 98.56% | 26 | 99.28% | 36 | 99.65% | 46 | 99.85% | 56 | 99.97% | 66 | 100.00% | 76 | 100.00% |
| 7 | 97.01% | 17 | 98.66% | 27 | 99.33% | 37 | 99.68% | 47 | 99.87% | 57 | 99.98% | 67 | 100.00% | 77 | 100.00% |
| 8 | 97.28% | 18 | 98.76% | 28 | 99.37% | 38 | 99.70% | 48 | 99.88% | 58 | 99.98% | 68 | 100.00% | | |
| 9 | 97.50% | 19 | 98.85% | 29 | 99.41% | 39 | 99.72% | 49 | 99.90% | 59 | 99.99% | 69 | 100.00% | | |
| 10 | 97.70% | 20 | 98.92% | 30 | 99.45% | 40 | 99.75% | 50 | 99.91% | 60 | 99.99% | 70 | 100.00% | | |
Table A11. The PCA result of Gwangju.

| NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 79.42% | 11 | 97.44% | 21 | 98.79% | 31 | 99.38% | 41 | 99.72% | 51 | 99.90% | 61 | 99.99% | 71 | 100.00% |
| 2 | 91.66% | 12 | 97.64% | 22 | 98.87% | 32 | 99.43% | 42 | 99.74% | 52 | 99.92% | 62 | 100.00% | 72 | 100.00% |
| 3 | 93.45% | 13 | 97.82% | 23 | 98.94% | 33 | 99.47% | 43 | 99.76% | 53 | 99.93% | 63 | 100.00% | 73 | 100.00% |
| 4 | 94.80% | 14 | 97.99% | 24 | 99.01% | 34 | 99.51% | 44 | 99.78% | 54 | 99.94% | 64 | 100.00% | 74 | 100.00% |
| 5 | 95.53% | 15 | 98.13% | 25 | 99.08% | 35 | 99.54% | 45 | 99.80% | 55 | 99.95% | 65 | 100.00% | 75 | 100.00% |
| 6 | 95.98% | 16 | 98.26% | 26 | 99.13% | 36 | 99.58% | 46 | 99.82% | 56 | 99.96% | 66 | 100.00% | 76 | 100.00% |
| 7 | 96.36% | 17 | 98.39% | 27 | 99.19% | 37 | 99.61% | 47 | 99.84% | 57 | 99.97% | 67 | 100.00% | 77 | 100.00% |
| 8 | 96.69% | 18 | 98.52% | 28 | 99.24% | 38 | 99.64% | 48 | 99.86% | 58 | 99.98% | 68 | 100.00% | | |
| 9 | 96.96% | 19 | 98.61% | 29 | 99.29% | 39 | 99.66% | 49 | 99.87% | 59 | 99.98% | 69 | 100.00% | | |
| 10 | 97.21% | 20 | 98.70% | 30 | 99.34% | 40 | 99.69% | 50 | 99.89% | 60 | 99.99% | 70 | 100.00% | | |
Table A12. The PCA result of Daegu.

| NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 89.40% | 11 | 98.69% | 21 | 99.38% | 31 | 99.69% | 41 | 99.86% | 51 | 99.95% | 61 | 100.00% | 71 | 100.00% |
| 2 | 95.70% | 12 | 98.79% | 22 | 99.42% | 32 | 99.71% | 42 | 99.87% | 52 | 99.96% | 62 | 100.00% | 72 | 100.00% |
| 3 | 96.62% | 13 | 98.88% | 23 | 99.46% | 33 | 99.73% | 43 | 99.88% | 53 | 99.97% | 63 | 100.00% | 73 | 100.00% |
| 4 | 97.33% | 14 | 98.96% | 24 | 99.49% | 34 | 99.75% | 44 | 99.89% | 54 | 99.97% | 64 | 100.00% | 74 | 100.00% |
| 5 | 97.70% | 15 | 99.04% | 25 | 99.53% | 35 | 99.77% | 45 | 99.90% | 55 | 99.98% | 65 | 100.00% | 75 | 100.00% |
| 6 | 97.93% | 16 | 99.11% | 26 | 99.56% | 36 | 99.79% | 46 | 99.91% | 56 | 99.98% | 66 | 100.00% | 76 | 100.00% |
| 7 | 98.13% | 17 | 99.17% | 27 | 99.59% | 37 | 99.80% | 47 | 99.92% | 57 | 99.99% | 67 | 100.00% | 77 | 100.00% |
| 8 | 98.30% | 18 | 99.23% | 28 | 99.61% | 38 | 99.82% | 48 | 99.93% | 58 | 99.99% | 68 | 100.00% | | |
| 9 | 98.44% | 19 | 99.28% | 29 | 99.64% | 39 | 99.83% | 49 | 99.94% | 59 | 99.99% | 69 | 100.00% | | |
| 10 | 98.57% | 20 | 99.33% | 30 | 99.66% | 40 | 99.85% | 50 | 99.95% | 60 | 100.00% | 70 | 100.00% | | |
Table A13. The PCA result of Daejeon.

| NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 78.91% | 11 | 97.36% | 21 | 98.74% | 31 | 99.36% | 41 | 99.71% | 51 | 99.91% | 61 | 99.99% | 71 | 100.00% |
| 2 | 91.31% | 12 | 97.56% | 22 | 98.82% | 32 | 99.41% | 42 | 99.74% | 52 | 99.92% | 62 | 100.00% | 72 | 100.00% |
| 3 | 93.19% | 13 | 97.75% | 23 | 98.90% | 33 | 99.45% | 43 | 99.76% | 53 | 99.93% | 63 | 100.00% | 73 | 100.00% |
| 4 | 94.62% | 14 | 97.92% | 24 | 98.97% | 34 | 99.49% | 44 | 99.78% | 54 | 99.95% | 64 | 100.00% | 74 | 100.00% |
| 5 | 95.39% | 15 | 98.06% | 25 | 99.04% | 35 | 99.53% | 45 | 99.81% | 55 | 99.96% | 65 | 100.00% | 75 | 100.00% |
| 6 | 95.86% | 16 | 98.20% | 26 | 99.10% | 36 | 99.57% | 46 | 99.83% | 56 | 99.97% | 66 | 100.00% | 76 | 100.00% |
| 7 | 96.26% | 17 | 98.33% | 27 | 99.16% | 37 | 99.60% | 47 | 99.84% | 57 | 99.97% | 67 | 100.00% | 77 | 100.00% |
| 8 | 96.60% | 18 | 98.45% | 28 | 99.21% | 38 | 99.63% | 48 | 99.86% | 58 | 99.98% | 68 | 100.00% | | |
| 9 | 96.88% | 19 | 98.55% | 29 | 99.26% | 39 | 99.66% | 49 | 99.88% | 59 | 99.99% | 69 | 100.00% | | |
| 10 | 97.13% | 20 | 98.65% | 30 | 99.31% | 40 | 99.69% | 50 | 99.89% | 60 | 99.99% | 70 | 100.00% | | |
Table A14. The PCA result of Busan.

| NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 90.96% | 11 | 98.92% | 21 | 99.49% | 31 | 99.74% | 41 | 99.88% | 51 | 99.96% | 61 | 100.00% | 71 | 100.00% |
| 2 | 96.47% | 12 | 99.00% | 22 | 99.52% | 32 | 99.75% | 42 | 99.89% | 52 | 99.97% | 62 | 100.00% | 72 | 100.00% |
| 3 | 97.22% | 13 | 99.08% | 23 | 99.55% | 33 | 99.77% | 43 | 99.90% | 53 | 99.97% | 63 | 100.00% | 73 | 100.00% |
| 4 | 97.79% | 14 | 99.15% | 24 | 99.58% | 34 | 99.79% | 44 | 99.91% | 54 | 99.98% | 64 | 100.00% | 74 | 100.00% |
| 5 | 98.10% | 15 | 99.21% | 25 | 99.61% | 35 | 99.80% | 45 | 99.92% | 55 | 99.98% | 65 | 100.00% | 75 | 100.00% |
| 6 | 98.30% | 16 | 99.27% | 26 | 99.63% | 36 | 99.82% | 46 | 99.93% | 56 | 99.98% | 66 | 100.00% | 76 | 100.00% |
| 7 | 98.47% | 17 | 99.32% | 27 | 99.65% | 37 | 99.83% | 47 | 99.93% | 57 | 99.99% | 67 | 100.00% | 77 | 100.00% |
| 8 | 98.60% | 18 | 99.37% | 28 | 99.68% | 38 | 99.85% | 48 | 99.94% | 58 | 99.99% | 68 | 100.00% | | |
| 9 | 98.72% | 19 | 99.41% | 29 | 99.70% | 39 | 99.86% | 49 | 99.95% | 59 | 99.99% | 69 | 100.00% | | |
| 10 | 98.83% | 20 | 99.45% | 30 | 99.72% | 40 | 99.87% | 50 | 99.95% | 60 | 100.00% | 70 | 100.00% | | |
Table A15. The PCA result of Ulsan.

| NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 83.59% | 11 | 98.04% | 21 | 99.07% | 31 | 99.52% | 41 | 99.78% | 51 | 99.92% | 61 | 99.99% | 71 | 100.00% |
| 2 | 93.60% | 12 | 98.19% | 22 | 99.13% | 32 | 99.55% | 42 | 99.80% | 52 | 99.93% | 62 | 100.00% | 72 | 100.00% |
| 3 | 94.97% | 13 | 98.32% | 23 | 99.18% | 33 | 99.58% | 43 | 99.82% | 53 | 99.94% | 63 | 100.00% | 73 | 100.00% |
| 4 | 96.00% | 14 | 98.45% | 24 | 99.24% | 34 | 99.61% | 44 | 99.83% | 54 | 99.95% | 64 | 100.00% | 74 | 100.00% |
| 5 | 96.55% | 15 | 98.56% | 25 | 99.28% | 35 | 99.64% | 45 | 99.85% | 55 | 99.96% | 65 | 100.00% | 75 | 100.00% |
| 6 | 96.90% | 16 | 98.66% | 26 | 99.33% | 36 | 99.67% | 46 | 99.86% | 56 | 99.97% | 66 | 100.00% | 76 | 100.00% |
| 7 | 97.21% | 17 | 98.76% | 27 | 99.37% | 37 | 99.69% | 47 | 99.88% | 57 | 99.97% | 67 | 100.00% | 77 | 100.00% |
| 8 | 97.46% | 18 | 98.86% | 28 | 99.41% | 38 | 99.71% | 48 | 99.89% | 58 | 99.98% | 68 | 100.00% | | |
| 9 | 97.68% | 19 | 98.93% | 29 | 99.45% | 39 | 99.74% | 49 | 99.90% | 59 | 99.99% | 69 | 100.00% | | |
| 10 | 97.87% | 20 | 99.00% | 30 | 99.48% | 40 | 99.76% | 50 | 99.91% | 60 | 99.99% | 70 | 100.00% | | |
Table A16. The PCA result of Wonju.

| NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 71.67% | 11 | 96.38% | 21 | 98.28% | 31 | 99.13% | 41 | 99.61% | 51 | 99.87% | 61 | 99.99% | 71 | 100.00% |
| 2 | 88.12% | 12 | 96.65% | 22 | 98.39% | 32 | 99.19% | 42 | 99.64% | 52 | 99.89% | 62 | 100.00% | 72 | 100.00% |
| 3 | 90.65% | 13 | 96.91% | 23 | 98.50% | 33 | 99.25% | 43 | 99.68% | 53 | 99.91% | 63 | 100.00% | 73 | 100.00% |
| 4 | 92.60% | 14 | 97.14% | 24 | 98.60% | 34 | 99.31% | 44 | 99.71% | 54 | 99.92% | 64 | 100.00% | 74 | 100.00% |
| 5 | 93.66% | 15 | 97.34% | 25 | 98.69% | 35 | 99.36% | 45 | 99.73% | 55 | 99.94% | 65 | 100.00% | 75 | 100.00% |
| 6 | 94.32% | 16 | 97.53% | 26 | 98.77% | 36 | 99.41% | 46 | 99.76% | 56 | 99.95% | 66 | 100.00% | 76 | 100.00% |
| 7 | 94.86% | 17 | 97.71% | 27 | 98.85% | 37 | 99.45% | 47 | 99.78% | 57 | 99.96% | 67 | 100.00% | 77 | 100.00% |
| 8 | 95.33% | 18 | 97.88% | 28 | 98.93% | 38 | 99.50% | 48 | 99.81% | 58 | 99.97% | 68 | 100.00% | | |
| 9 | 95.71% | 19 | 98.02% | 29 | 99.00% | 39 | 99.54% | 49 | 99.83% | 59 | 99.98% | 69 | 100.00% | | |
| 10 | 96.05% | 20 | 98.15% | 30 | 99.06% | 40 | 99.57% | 50 | 99.85% | 60 | 99.98% | 70 | 100.00% | | |
Table A17. The PCA result of Incheon.

| NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV | NPC | CV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 91.12% | 11 | 98.93% | 21 | 99.49% | 31 | 99.74% | 41 | 99.88% | 51 | 99.96% | 61 | 100.00% | 71 | 100.00% |
| 2 | 96.52% | 12 | 99.01% | 22 | 99.52% | 32 | 99.75% | 42 | 99.89% | 52 | 99.96% | 62 | 100.00% | 72 | 100.00% |
| 3 | 97.25% | 13 | 99.08% | 23 | 99.55% | 33 | 99.77% | 43 | 99.90% | 53 | 99.97% | 63 | 100.00% | 73 | 100.00% |
| 4 | 97.83% | 14 | 99.15% | 24 | 99.58% | 34 | 99.79% | 44 | 99.91% | 54 | 99.97% | 64 | 100.00% | 74 | 100.00% |
| 5 | 98.12% | 15 | 99.21% | 25 | 99.61% | 35 | 99.80% | 45 | 99.92% | 55 | 99.98% | 65 | 100.00% | 75 | 100.00% |
| 6 | 98.31% | 16 | 99.27% | 26 | 99.63% | 36 | 99.82% | 46 | 99.92% | 56 | 99.98% | 66 | 100.00% | 76 | 100.00% |
| 7 | 98.47% | 17 | 99.32% | 27 | 99.65% | 37 | 99.83% | 47 | 99.93% | 57 | 99.99% | 67 | 100.00% | 77 | 100.00% |
| 8 | 98.61% | 18 | 99.37% | 28 | 99.68% | 38 | 99.84% | 48 | 99.94% | 58 | 99.99% | 68 | 100.00% | | |
| 9 | 98.72% | 19 | 99.42% | 29 | 99.70% | 39 | 99.85% | 49 | 99.95% | 59 | 99.99% | 69 | 100.00% | | |
| 10 | 98.83% | 20 | 99.45% | 30 | 99.72% | 40 | 99.87% | 50 | 99.95% | 60 | 99.99% | 70 | 100.00% | | |
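The cumulative variance (CV) values in Tables A10–A17 are running sums of the per-component explained variance ratio. A hedged sketch of the computation on synthetic data (not the study's inputs):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))  # 200 observations of 20 synthetic features

# PCA via SVD of the centered data matrix
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)   # singular values
ratio = s ** 2 / np.sum(s ** 2)           # explained variance ratio per component
cv = np.cumsum(ratio)                     # cumulative variance for NPC = 1, 2, ...

for npc in (1, 5, 10, 20):
    print(f"NPC={npc:2d}  CV={cv[npc - 1]:.2%}")
```

The number of components to retain is then chosen as the smallest NPC whose CV exceeds the desired threshold (e.g., roughly 95% with five components in this study, cf. Table 3).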
Figure A1, Figure A2 and Figure A3 show the meteorological data of Seoul and Figure A4, Figure A5, Figure A6, Figure A7, Figure A8, Figure A9 and Figure A10 show the PM2.5 concentration prediction of each city.
Figure A1. The meteorological data of Seoul (atmospheric, sea-level pressure).
Figure A2. The meteorological data of Seoul (wind).
Figure A3. The meteorological data of Seoul (temperature, relative humidity, precipitation).
Figure A4. The PM2.5 prediction in Gwangju by two cases.
Figure A5. The PM2.5 prediction in Daegu by two cases.
Figure A6. The PM2.5 prediction in Daejeon by two cases.
Figure A7. The PM2.5 prediction in Busan by two cases.
Figure A8. The PM2.5 prediction in Ulsan by two cases.
Figure A9. The PM2.5 prediction in Wonju by two cases.
Figure A10. The PM2.5 prediction in Incheon by two cases.

References

  1. Gong, S. A Study on the Health Impact and Management Policy of PM2.5 in Korea 1.; Korea Environment Institute: Sejong, Korea, 2012; pp. 1–209. (In Korean) [Google Scholar]
  2. World Health Organization. Ambient (Outdoor) Air Pollution. Available online: https://www.who.int/news-room/fact-sheets/detail/ambient-(outdoor)-air-quality-and-health (accessed on 8 December 2019).
  3. French National Health Agency, InVS, European Environment Agency. Available online: https://news.yahoo.com/micro-pollution-ravaging-china-south-asia-study-031634307.html (accessed on 3 March 2020).
  4. OECD. Available online: https://data.oecd.org/air/air-pollution-exposure.htm (accessed on 11 December 2019).
  5. Han, C.; Kim, S.; Lim, Y.-H.; Bae, H.-J.; Hong, Y.-C. Spatial and Temporal Trends of Number of Deaths Attributable to Ambient PM2.5 in the Korea. J. Korean Med. Sci. 2018, 33, e193. [Google Scholar] [CrossRef]
  6. Hwang, I.C.; Kim, C.H.; Son, W.I. Benefits of Management Policy of Seoul on Airborne Particulate Matter; The Seoul Institute Policy Research: Seoul, Korea, 2018; pp. 1–113. (In Korean) [Google Scholar]
  7. Statistics Korea Office Press Release. “Results of Cause of Death Statistics in 2019”, Statistics Korea. Available online: http://kostat.go.kr/portal/korea/kor_nw/1/6/2/index.board?bmode=read&bSeq=&aSeq=385219&pageNo=1&rowNum=10&navCount=10&currPg=&searchInfo=&sTarget=title&sTxt= (accessed on 22 September 2020). (In Korean).
  8. Joint Association of Related Korean Ministries of Korea. Comprehensive Plan for Fine Dust Management (2020–2024); Joint Association of Related Korean Ministries of Korea: Seoul, Korea, 2019. (In Korean) [Google Scholar]
  9. Xayasouk, T.; Lee, H.; Lee, G. Air Pollution Prediction Using Long Short-Term Memory (LSTM) and Deep Autoencoder (DAE) Models. Sustainability 2020, 12, 2570. [Google Scholar] [CrossRef] [Green Version]
  10. Mengara, A.M.; Kim, Y.; Yoo, Y.; Ahn, J. Distributed Deep Features Extraction Model for Air Quality Forecasting. Sustainability 2020, 12, 8014. [Google Scholar] [CrossRef]
  11. Park, S.; Shin, H. Analysis of the Factors Influencing PM2.5 in Korea: Focusing on Seasonal Factors. J. Environ. Policy Adm. 2017, 25, 227–248. (In Korean) [Google Scholar] [CrossRef]
  12. Wang, C.; Tu, Y.; Yu, Z.; Lu, R. PM2.5 and Cardiovascular Diseases in the Elderly: An Overview. Int. J. Environ. Res. Public Health 2015, 12, 8187–8197. [Google Scholar] [CrossRef] [Green Version]
  13. César, A.C.G.; Nascimento, L.F.C.; Mantovani, K.C.C.; Vieira, L.C.P. Fine particulate matter estimated by mathematical model and hospitalizations for pneumonia and asthma in children. Rev. Paul. Pediatr. 2016, 34, 18–23. [Google Scholar] [CrossRef] [PubMed]
  14. Kim, K.-N.; Kim, S.; Lim, Y.-H.; Song, I.G.; Hong, Y.-C. Effects of short-term fine particulate matter exposure on acute respiratory infection in children. Int. J. Hyg. Environ. Health 2020, 229, 113571. [Google Scholar] [CrossRef]
  15. Vinikoor-Imler, L.C.; Davis, J.A.; Luben, T.J. An Ecologic Analysis of County-Level PM2.5 Concentrations and Lung Cancer Incidence and Mortality. Int. J. Environ. Res. Public Health 2011, 8, 1865–1871. [Google Scholar] [CrossRef]
  16. Choe, J.-I.; Lee, Y.S. A Study on the Impact of PM2.5 Emissions on Respiratory Diseases. J. Environ. Policy Adm. 2015, 23, 155. (In Korean) [Google Scholar] [CrossRef]
  17. Ross, Z.; Jerrett, M.; Ito, K.; Tempalski, B.; Thurston, G. A land use regression for predicting fine particulate matter concentrations in the New York City region. Atmos. Environ. 2007, 41, 2255–2269. [Google Scholar] [CrossRef]
  18. Beelen, R.; Hoek, G.; Pebesma, E.; Vienneau, D.; de Hoogh, K.; Briggs, D.J. Mapping of background air pollution at a fine spatial scale across the European Union. Sci. Total. Environ. 2009, 407, 1852–1867. [Google Scholar] [CrossRef] [PubMed]
  19. Singh, V.; Carnevale, C.; Finzi, G.; Pisoni, E.; Volta, M. A cokriging based approach to reconstruct air pollution maps, processing measurement station concentrations and deterministic model simulations. Environ. Model. Softw. 2011, 26, 778–786. [Google Scholar] [CrossRef]
  20. Zhao, J.; Deng, F.; Cai, Y.; Chen, J. Long short-term memory—Fully connected (LSTM-FC) neural network for PM2.5 concentration prediction. Chemosphere 2019, 220, 486–492. [Google Scholar] [CrossRef] [PubMed]
  21. Karimian, H.; Li, Q.; Wu, C.; Qi, Y.; Mo, Y.; Chen, G.; Zhang, X.; Sachdeva, S. Evaluation of Different Machine Learning Approaches to Forecasting PM2.5 Mass Concentrations. Aerosol Air Qual. Res. 2019, 19, 1400–1410. [Google Scholar] [CrossRef] [Green Version]
  22. Qadeer, K.; Rehman, W.U.; Sheri, A.M.; Park, I.; Kim, H.K.; Jeon, M. A Long Short-Term Memory (LSTM) Network for Hourly Estimation of PM2.5 Concentration in Two Cities of South Korea. Appl. Sci. 2020, 10, 3984. [Google Scholar] [CrossRef]
  23. Air Korea. Available online: http://www.airkorea.or.kr/web (accessed on 30 January 2020). (In Korean).
  24. Korea Meteorological Agency. Available online: https://data.kma.go.kr/cmmn/main.do (accessed on 15 February 2019). (In Korean).
  25. Nullschool. Available online: https://earth.nullschool.net/ko/ (accessed on 30 January 2020).
  26. Bao, R.; Zhang, A. Does lockdown reduce air pollution? Evidence from 44 cities in northern China. Sci. Total Environ. 2020, 731, 139052. [Google Scholar] [CrossRef] [PubMed]
  27. Moritz, S.; Bartz-Beielstein, T. imputeTS: Time Series Missing Value Imputation in R. R J. 2017, 9, 207–218. [Google Scholar] [CrossRef] [Green Version]
  28. Hunter, J.S. The Exponentially Weighted Moving Average. J. Qual. Technol. 1986, 18, 203–210. [Google Scholar] [CrossRef]
  29. China National Environmental Monitoring Centre. Available online: http://www.cnemc.cn/sssj/ (accessed on 1 March 2020). (In Chinese).
  30. Hsieh, T.-J.; Hsiao, H.-F.; Yeh, W.-C. Forecasting stock markets using wavelet transforms and recurrent neural networks: An integrated system based on artificial bee colony algorithm. Appl. Soft Comput. 2011, 11, 2510–2525. [Google Scholar] [CrossRef]
  31. Franklin, J.A. Recurrent Neural Networks for Music Computation. INFORMS J. Comput. 2006, 18, 321–338. [Google Scholar] [CrossRef] [Green Version]
  32. Goldberg, Y. Neural Network Methods for Natural Language Processing. Synth. Lect. Hum. Lang. Technol. 2017, 10, 1–309. [Google Scholar] [CrossRef]
  33. Chen, G. A gentle tutorial of recurrent neural network with error backpropagation. arXiv 2016, arXiv:1610.02583. [Google Scholar]
  34. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  35. Schuster, M.; Paliwal, K. Bidirectional recurrent neural networks. IEEE Trans. Signal Process. 1997, 45, 2673–2681. [Google Scholar] [CrossRef] [Green Version]
  36. Gonzalez, J.; Yu, W. Non-linear system modeling using LSTM neural networks. IFAC-PapersOnLine 2018, 51, 485–489. [Google Scholar] [CrossRef]
  37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference Learn, Represent (ICLR), San Diego, CA, USA, 5–8 May 2015. [Google Scholar]
  38. Ministry of Environment. Ministry of Environment Press Release “Korea-China Joint Research Group to Reduce Fine Dust”. Available online: http://me.go.kr/home/web/board/read.do?boardMasterId=1&boardId=1201300&menuId=286 (accessed on 22 January 2020). (In Korean)
Figure 1. Effects of fine particulate matter (PM2.5) on the body. Source: French National Health Agency, InVS (Institut de veille sanitaire), European Environment Agency, and AFP.
Figure 2. Spatial range of the research.
Figure 3. The wind direction frequency of Korea's selected cities in 2019.
Figure 4. PM2.5 distribution maps (before and after the COVID-19 outbreak: 2019 vs. 2020).
Figure 5. Visualization of China's and Seoul's air quality data set.
Figure 6. Correlation between the highest eight factors and PM2.5 concentrations in Seoul, Korea.
Figure 7. Origin–destination map of PM2.5 correlations between Chinese cities and Korean cities.
Figure 8. Internal structure of the recurrent neural network (RNN).
Figure 9. Internal structure of long short-term memory (LSTM).
Figure 10. Internal structure of bidirectional long short-term memory (BiLSTM) and an example.
Figure 11. Workflow of the PCA application deep learning model for predicting PM2.5.
Figure 12. The PM2.5 prediction in Seoul by two cases.
Table 1. Crisis stage standard.

| Crisis Stages | Criteria | Main Contents |
|---|---|---|
| Stage 1 | 150 μg/m3 for 2 h or longer + 75 μg/m3 for the following day | Strengthening the current system |
| Stage 2 | 200 μg/m3 for 2 h or longer + 150 μg/m3 for the following day | Strengthening public sector measures |
| Stage 3 | 400 μg/m3 for 2 h or longer + 200 μg/m3 for the following day | Strengthening private sector measures/disaster response |
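The staging criteria in Table 1 can be read as a simple decision rule. The function below is a hypothetical sketch (the function name and inputs are illustrative, not part of the official standard):

```python
def crisis_stage(two_hour_avg, next_day_forecast):
    """Return the PM2.5 crisis stage (0 = none) based on Table 1's criteria.

    two_hour_avg: concentration sustained for 2 h or longer (ug/m3)
    next_day_forecast: forecast concentration for the following day (ug/m3)
    """
    if two_hour_avg >= 400 and next_day_forecast >= 200:
        return 3  # strengthen private sector measures / disaster response
    if two_hour_avg >= 200 and next_day_forecast >= 150:
        return 2  # strengthen public sector measures
    if two_hour_avg >= 150 and next_day_forecast >= 75:
        return 1  # strengthen the current system
    return 0

print(crisis_stage(160, 80))   # 1
print(crisis_stage(420, 210))  # 3
```

This illustrates why forecast accuracy matters for the alert system: both the sustained concentration and the next-day forecast must cross their thresholds before a stage is declared.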
Table 2. Correlation range from Chinese cities to Korean cities.

| Cities | Minimum | Maximum | Cities | Minimum | Maximum |
|---|---|---|---|---|---|
| Seoul | 0.1993 | 0.4994 | Busan | 0.1320 | 0.5084 |
| Gwangju | 0.1446 | 0.4035 | Ulsan | 0.1419 | 0.5394 |
| Daegu | 0.1813 | 0.5087 | Wonju | 0.1824 | 0.5550 |
| Daejeon | 0.1839 | 0.5087 | Incheon | 0.2556 | 0.5415 |
Table 3. The ratio of variance explained by five principal components in each city.

| Cities | Cumulative Variance | Cities | Cumulative Variance |
|---|---|---|---|
| Seoul | 0.9631 (96.31%) | Busan | 0.98102 (98.102%) |
| Gwangju | 0.9553 (95.53%) | Ulsan | 0.9655 (96.55%) |
| Daegu | 0.9770 (97.70%) | Wonju | 0.9366 (93.66%) |
| Daejeon | 0.9539 (95.39%) | Incheon | 0.98123 (98.123%) |
Table 4. Evaluation results from PM2.5 prediction in each Korean city (Case 1).

| City | Model | RMSE | MAE | City | Model | RMSE | MAE |
|---|---|---|---|---|---|---|---|
| Seoul | RNN | 9.730 | 7.328 | Gwangju | RNN | 9.002 | 7.472 |
| | LSTM | 8.020 | 6.374 | | LSTM | 7.7415 | 5.797 |
| | BiLSTM | 8.101 | 6.168 | | BiLSTM | 8.300 | 6.590 |
| Daegu | RNN | 10.171 | 8.110 | Busan | RNN | 8.410 | 7.224 |
| | LSTM | 7.654 | 6.223 | | LSTM | 7.770 | 6.504 |
| | BiLSTM | 7.707 | 6.193 | | BiLSTM | 7.897 | 6.578 |
| Daejeon | RNN | 9.361 | 7.497 | Ulsan | RNN | 10.558 | 8.988 |
| | LSTM | 7.042 | 5.753 | | LSTM | 8.660 | 6.959 |
| | BiLSTM | 7.231 | 5.927 | | BiLSTM | 8.383 | 6.772 |
| Wonju | RNN | 11.603 | 9.208 | Incheon | RNN | 13.686 | 11.408 |
| | LSTM | 8.718 | 6.520 | | LSTM | 11.900 | 9.828 |
| | BiLSTM | 8.459 | 6.251 | | BiLSTM | 10.393 | 8.285 |
Table 5. Evaluation results from PM2.5 prediction in each Korean city (Case 2).

| City | Model | RMSE | MAE | City | Model | RMSE | MAE |
|---|---|---|---|---|---|---|---|
| Seoul | RNN | 11.680 (20%↑) | 9.310 (27%↑) | Gwangju | RNN | 9.492 (5.4%↑) | 7.746 (3.7%↑) |
| | LSTM | 7.667 (4.6%↓) | 5.455 (16.8%↓) | | LSTM | 7.148 (8.3%↓) | 5.541 (4.6%↓) |
| | BiLSTM | 7.567 (7.1%↓) | 5.368 (14.9%↓) | | BiLSTM | 7.110 (16.7%↓) | 5.455 (20.8%↓) |
| Daegu | RNN | 10.208 (0.4%↑) | 7.824 (3.5%↓) | Busan | RNN | 9.924 (18%↑) | 8.316 (15.1%↑) |
| | LSTM | 7.491 (2.2%↓) | 5.664 (9.9%↓) | | LSTM | 6.668 (16.5%↓) | 4.881 (33.3%↓) |
| | BiLSTM | 7.552 (2.1%↓) | 5.703 (8.6%↓) | | BiLSTM | 6.779 (16.5%↓) | 4.999 (31.6%↓) |
| Daejeon | RNN | 9.602 (2.6%↑) | 7.824 (4.4%↑) | Ulsan | RNN | 11.160 (5.7%↑) | 9.389 (4.5%↑) |
| | LSTM | 6.967 (1.1%↓) | 5.374 (7.1%↓) | | LSTM | 8.021 (8%↓) | 6.251 (11.3%↓) |
| | BiLSTM | 7.098 (1.9%↓) | 5.537 (7%↓) | | BiLSTM | 7.871 (6.5%↓) | 5.993 (13%↓) |
| Wonju | RNN | 12.132 (4.6%↑) | 9.758 (6%↑) | Incheon | RNN | 14.744 (7.7%↑) | 12.427 (8.9%↑) |
| | LSTM | 8.424 (3.5%↓) | 6.251 (4.3%↓) | | LSTM | 10.205 (16.6%↓) | 8.000 (22.9%↓) |
| | BiLSTM | 8.345 (1.4%↓) | 6.137 (1.9%↓) | | BiLSTM | 9.709 (7%↓) | 7.354 (12.7%↓) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Choi, S.W.; Kim, B.H.S. Applying PCA to Deep Learning Forecasting Models for Predicting PM2.5. Sustainability 2021, 13, 3726. https://0-doi-org.brum.beds.ac.uk/10.3390/su13073726