Predicting Daily PM2.5 Exposure with Spatially Invariant Accuracy Using Co-Existing Pollutant Concentrations as Predictors

Araki, Shin; Shimadera, Hikari; Hasunuma, Hideki; Yoda, Yoshiko; Shima, Masayuki

doi:10.3390/atmos13050782



¹ Graduate School of Engineering, Osaka University, Suita 565-0871, Japan

² Department of Public Health, Hyogo College of Medicine, Nishinomiya 663-8501, Japan

* Author to whom correspondence should be addressed.

Atmosphere 2022, 13(5), 782; https://doi.org/10.3390/atmos13050782

Submission received: 22 April 2022 / Accepted: 10 May 2022 / Published: 12 May 2022

(This article belongs to the Section Air Quality)


Abstract

The spatiotemporal variation of PM_2.5 should be accurately estimated for epidemiological studies. However, the accuracy of prediction models may change over geographical space, which is not conducive to proper exposure assessment. In this study, we developed a prediction model to estimate daily PM_2.5 concentrations from 2010 to 2017 in the Kansai region of Japan with co-existing pollutant concentrations as predictors. The overall objective was to obtain daily estimates over the study domain with spatially homogeneous accuracy. We used the random forest algorithm to model the relationship between the daily PM_2.5 concentrations and various predictors. The model performance was evaluated via spatial and temporal cross-validation, and the daily PM_2.5 surface was estimated from 2010 to 2017 at a 1 km × 1 km resolution. We achieved R² values of 0.91 and 0.92 for spatial and temporal cross-validation, respectively. The prediction accuracy for each monitoring site was found to be consistently high, regardless of the distance to the nearest monitoring location, up to 10 km. Even for distances greater than 10 km, the mean R² value was 0.88. Our approach yielded spatially homogeneous prediction accuracy, which is beneficial for epidemiological studies. The daily PM_2.5 estimates will be used in a related birth cohort study to evaluate the potential impact on human health.

Keywords:

random forests; machine learning; fine particulate matter; Japan

1. Introduction

Exposure to fine particulate matter (PM_2.5) has been associated with adverse health effects in previous studies [1,2,3,4,5]. Thus, the spatiotemporal variation in PM_2.5 concentrations should be accurately estimated for proper exposure assessment. The statistical modeling approach, often referred to as a land use regression (LUR) model, has been successfully used to provide spatiotemporal variations in daily PM_2.5 [6,7,8,9,10]. This approach builds a regression model with air pollutant concentrations as the dependent variable and variables that can potentially influence the concentrations as predictors. After the predictive performance of the model has been validated, the model is used to estimate concentrations at unmonitored locations. Recently, machine learning algorithms have been used to model the relationship between air pollutant concentrations and various predictors [6,7,8,9,10].

The developed models are evaluated via various validation methods, which summarize the overall predictive ability with a statistical index such as R² between the observed and predicted concentrations. However, the accuracy of the models may change over the geographical domain because of, for instance, the monitoring density, local terrain, and pollution level. Di et al. estimated the daily PM_2.5 in the continental U.S. and reported that the cross-validated R² values varied spatially depending on the elevation and the PM_2.5 level at each monitoring site [11]. A similar spatial variation in prediction accuracy was also reported even when the model performance improved [9]. Huang et al. showed a decrease in cross-validated R² as a function of distance to the nearest site [10]. Similarly, a lower predictive performance was observed where monitors are spatially sparse in the daily NO₂ model in the U.S. [12]. The spatial heterogeneity in model performance could result in biased exposure estimates and thus should be avoided, especially when the model outcomes are applied to epidemiological studies.

For better predictive accuracy, Li et al. built a generalized additive model using particulate matter (PM₁₀) as a predictor to estimate daily PM_2.5 in California, U.S. [13]. Wang and Sun used gaseous pollutants such as NO₂, SO₂, CO, and O₃ as predictor variables of a deep neural network to improve the prediction accuracy of the daily PM_2.5 model in China [14]. Particulate and gaseous pollutant concentrations were rated as important predictors in the monthly PM_2.5 model in Japan [15]. These variables are typically obtained from routine monitoring networks with continuous monitors, thereby providing a fine temporal resolution. In addition, when the monitoring network is spatially dense, the predictors can represent the spatial variations in the target pollutant at a fine spatial scale. Given the informative features of these co-existing pollutants, applying these pollutant concentrations as predictors can lead to spatially homogeneous model performance. However, the effect of these pollutants on the spatial dependency of accuracy has not been well studied.

In this study, we developed a model to estimate daily PM_2.5 concentrations in the Kansai region of Japan from 2010 to 2017 with spatially homogeneous accuracy. We used a machine learning algorithm (random forest) to model the relationship between the daily PM_2.5 concentrations and various predictors, including particulate and gaseous pollutant concentrations. The performance of the model was validated through cross-validation. We also evaluated the predictors that contribute to the spatially robust model performance. The daily PM_2.5 surface from 2010 to 2017 was estimated using the developed model as well. Furthermore, the accuracy of the developed model and the estimated temporal trend was analyzed.

2. Methods

2.1. Study Area

The study area covered the Kansai region of Japan (133.56° E–137.07° E, 33.15° N–36.04° N), where megacities such as Osaka and Nagoya are located (Figure S1). The area has a population of 32 million and an area of 53,540 km². The Japan Environment and Children’s Study (JECS) is an ongoing nationwide birth cohort study that recruited pregnant women from 15 areas across Japan from January 2011 to March 2014 [16]. This study was conducted as an adjunct study to JECS, and we aimed to estimate the daily PM_2.5 exposure of participants in Amagasaki, which is one of the JECS study areas. Amagasaki is located at the center of the study area.

2.2. Air Quality Measurements

The hourly PM_2.5 measurements were obtained from the national air quality monitoring network (AQMN) database (https://www.nies.go.jp/igreen/ (accessed on 1 October 2021)). AQMN is maintained according to the Air Pollution Control Act (Act No. 97 of 10 June 1968). PM_2.5 monitoring started in 2010, a year after the implementation of the national air quality standard. The number of monitoring stations in the study area gradually increased from 10 in 2010 to 238 in 2017 (Figure 1). However, the monitoring locations are not distributed homogeneously, and more monitors are placed in urban areas (Figure S1). We calculated the daily mean values from the hourly values. For temporal representativeness, we used only daily values with a temporal coverage of more than 80% of the hours in each day. Consequently, the number of input observations for the model was 397,145. A summary of the daily values is shown in Figure S2. The annual box plot of PM_2.5 observations is shown in Figure S3.
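The coverage rule described above can be illustrated with a minimal Python sketch. It is not the authors' code (the analyses were run in R); the function name `daily_mean`, the use of `None` for a missing hour, and the assumption of exactly 24 hourly slots per day are all illustrative choices.

```python
from statistics import fmean

HOURS_PER_DAY = 24
MIN_COVERAGE = 0.80  # threshold stated in the text ("more than 80%")


def daily_mean(hourly_values):
    """Return the mean of one day's hourly values, or None if coverage is too low.

    Missing hours are represented by None (an assumption of this sketch).
    A day is kept only when the share of valid hours is strictly greater
    than MIN_COVERAGE, i.e. at least 20 of 24 hours.
    """
    valid = [v for v in hourly_values if v is not None]
    coverage = len(valid) / HOURS_PER_DAY
    if coverage <= MIN_COVERAGE:
        return None
    return fmean(valid)
```

With 24 hourly slots, a day with 20 valid hours (83% coverage) is kept, whereas a day with 19 valid hours (79%) is discarded.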

2.3. Data

The particulate and gaseous pollutant concentrations were also retrieved from the AQMN database. We downloaded hourly suspended particulate matter (SPM), NO₂, SO₂, and O₃ concentrations, and calculated the daily mean values. According to Japanese air quality standards, SPM is defined as particulate matter with a 100% aerodynamic cut-off diameter of less than 10 µm. This corresponds to particulate matter with a 50% aerodynamic cut-off diameter of less than 7 µm (PM₇) [17]. Similar to the PM_2.5 observations, we only utilized daily mean values with a temporal coverage of more than 80% in each day.

We spatially interpolated the daily mean values of SPM, NO₂, SO₂, and O₃ via ordinary kriging and assigned the kriged values to the PM_2.5 monitoring and prediction locations [14]. Spatially outlying values were identified and removed prior to the kriging interpolation via a spatial outlier detection method [18]. Additionally, we used the co-existing pollutant data obtained at stations that are not directly affected by specific emission sources (categorized as general environment stations); thus, the interpolated values can be regarded as background or baseline concentrations [18]. When a PM_2.5 monitor is co-located with a co-existing pollutant monitor (which is often the case), we obtained the interpolated values (predictors) at that location without using the value monitored there. Without this step, the kriged value would be identical to the monitored value, which would make the model validation result optimistic and introduce heterogeneity in the uncertainty of the predictors. This procedure is similar to leave-one-out cross-validation and has been used to prepare predictors in previous LUR studies [19,20]. Figure S4 shows the validation results of ordinary kriging for the co-existing pollutants. The numbers of monitoring locations for SPM, NO₂, SO₂, and O₃ in 2017 were 379, 360, 272, and 326, respectively (Figure 1). Note that the number of PM_2.5 monitors in the study area was much smaller than that for these pollutants.
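The handling of co-located monitors can be sketched as follows. To stay self-contained, this example swaps ordinary kriging for inverse distance weighting (IDW), which is a different and simpler interpolator; the point is only to show why the co-located value must be dropped. The function names, the `(x, y, value)` tuple layout, and the projected coordinates in kilometres are assumptions of the sketch.

```python
import math


def idw(target, monitors, power=2.0):
    """Inverse-distance-weighted estimate at target=(x, y).

    monitors is a list of (x, y, value) tuples in the same projected units.
    IDW stands in for ordinary kriging here; like kriging without a nugget,
    it returns the monitored value when the target coincides with a monitor.
    """
    num = den = 0.0
    for x, y, value in monitors:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # exact interpolation at a monitor location
        w = d ** -power
        num += w * value
        den += w
    return num / den


def predictor_at(target, monitors, tol=1e-9):
    """Interpolate at target after dropping any monitor co-located with it."""
    others = [m for m in monitors
              if math.hypot(m[0] - target[0], m[1] - target[1]) > tol]
    return idw(target, others)
```

For monitors at (0, 0), (1, 0), and (-1, 0) with values 30, 10, and 20, `idw((0, 0), ...)` simply returns the co-located 30, whereas `predictor_at((0, 0), ...)` returns 15, an estimate built only from the neighbouring monitors, as at an unmonitored prediction location.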

We used the simulated daily PM_2.5 concentrations as predictors obtained from the Community Multiscale Air Quality (CMAQ) model v5.2.1 [21] coupled with meteorological fields produced by the Weather Research and Forecasting model (WRF) v3.8 [22] at 5 km × 5 km spatial resolution. Detailed information on the simulation is provided in Thongthammachart et al. [23]. We also used WRF-simulated meteorological parameters, including temperature, precipitation, wind speed, relative humidity, and planetary boundary layer (PBL) height, as model predictors. A scatter plot of the daily CMAQ-simulated and observed values is shown in Figure S5.

We downloaded National Numeric Land Information data (http://nlftp.mlit.go.jp/ksj/ (accessed on 1 September 2021)) and derived the built-up, agriculture, and green area ratio values [24]. The monthly emission intensities of PM_2.5 from large anthropogenic point sources were obtained from EAGrid2010-Japan [25]. The annual correction was not considered. These variables were processed using the moving window approach [26] to consider the distance-decay effect [24]. We calculated the road length in a grid cell and the distance from roads to the monitoring and prediction locations using road network data obtained from OpenStreetMap (https://www.openstreetmap.org (accessed on 1 September 2021)) for the categories defined in Araki et al. (highways and primary roads) [15]. Population size and elevation were also used as predictors. The predictors used in this study are presented in Table S1. The data sources are listed in Table S2.

2.4. Random Forest

Random forest is a decision tree-based non-parametric algorithm [27] that has been used in many previous studies [6,7,28,29,30] because of its ease of use, robustness against overfitting [27], and insensitivity to model parameters [31]. We built a prediction model for daily PM_2.5 using this algorithm. For the model parameters, we set the number of trees to 500 and the number of variables in the subset in each node (m_try) to 10 via preliminary test runs. The variable importance measures that indicate the predictive power of each predictor were obtained by training the random forest.

2.5. Validation

We validated the model performance using spatial and temporal cross-validation approaches. For spatial cross-validation, the monitoring locations were randomly divided into five groups. We retained observations from a group of locations and developed a model using observations from the remaining four groups of locations. We then predicted the PM_2.5 concentrations at the locations of the retained group. This process was repeated for each group. For temporal cross-validation, the input data were divided based on the monitored dates and a similar process was conducted. To eliminate the random effect of the splits, we repeated the spatial and temporal cross-validations 10 times with different random seeds. Then, R² and root mean squared error (RMSE) were calculated between the observations and the averaged predictions over the repetitions. R² value is obtained by

$$R^2 = 1 - \frac{\mathrm{MSE}}{\mathrm{var}(Y)},$$

where $Y$ is a vector of observed values and $\mathrm{MSE}$ is the mean squared error between the observed and predicted values. The cross-validation results were summarized by year to check the temporal stability of model performance over the study period.
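The site-wise split and the two accuracy metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R implementation: the function names are hypothetical, the round-robin assignment after shuffling is one of several ways to form five random groups, and var(Y) is taken here as the population variance because the text does not say which estimator was used.

```python
import random
from statistics import fmean


def spatial_folds(site_ids, k=5, seed=0):
    """Randomly assign each monitoring site to one of k folds.

    All observations of a site share its fold, so a held-out site never
    contributes training data. Returns {site: fold_index}.
    """
    sites = sorted(set(site_ids))
    random.Random(seed).shuffle(sites)
    return {site: i % k for i, site in enumerate(sites)}


def r2_and_rmse(observed, predicted):
    """Return (R^2, RMSE) with R^2 = 1 - MSE / var(Y).

    var(Y) is the population variance of the observations (assumption).
    """
    mse = fmean((o - p) ** 2 for o, p in zip(observed, predicted))
    mean_obs = fmean(observed)
    var = fmean((o - mean_obs) ** 2 for o in observed)
    return 1.0 - mse / var, mse ** 0.5
```

Temporal cross-validation follows the same pattern with monitored dates in place of site identifiers. Repeating the split with different `seed` values and averaging the predictions before calling `r2_and_rmse` mirrors the ten repetitions described above.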

2.6. Spatial Dependency in the Model Performance

The prediction performance of the model at a given location may depend on the distance from that location to the nearest monitoring station. We calculated R² values for each monitoring location using the spatial cross-validation results and examined their relationship with the distance to the nearest monitor. Because the nearest monitor for a given location may change during the study period as PM_2.5 monitors are added or removed, we calculated R² values only when more than 300 observations (approximately 80% of a year) were available for a given combination of monitoring location and shortest distance, to ensure the robustness of the results. In addition, we conducted spatial cross-validation without particulate and gaseous pollutant concentrations and obtained R² values through the same process to examine the performance gain from these predictors.
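A simplified version of this per-site analysis might look like the following Python sketch. It assumes a fixed set of monitors, so it does not reproduce the authors' handling of a nearest monitor that changes over time; the haversine distance, the record layout `(site, observed, predicted)`, and all function names are assumptions made for illustration.

```python
import math
from collections import defaultdict
from statistics import fmean

MIN_OBS = 300  # "more than 300 observations" in the text
EARTH_RADIUS_KM = 6371.0


def nearest_distance_km(site, coords):
    """Great-circle distance (km) from site to its nearest other site.

    coords maps site -> (latitude, longitude) in degrees.
    """
    lat1, lon1 = map(math.radians, coords[site])
    best = math.inf
    for other, (lat, lon) in coords.items():
        if other == site:
            continue
        lat2, lon2 = math.radians(lat), math.radians(lon)
        a = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2)
             * math.sin((lon2 - lon1) / 2) ** 2)
        best = min(best, 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a)))
    return best


def site_r2(records):
    """Per-site R^2 from cross-validated (site, observed, predicted) records.

    Sites with MIN_OBS or fewer records are skipped.
    """
    by_site = defaultdict(list)
    for site, obs, pred in records:
        by_site[site].append((obs, pred))
    out = {}
    for site, pairs in by_site.items():
        if len(pairs) <= MIN_OBS:
            continue
        obs = [o for o, _ in pairs]
        mean_obs = fmean(obs)
        mse = fmean((o - p) ** 2 for o, p in pairs)
        var = fmean((o - mean_obs) ** 2 for o in obs)
        out[site] = 1.0 - mse / var
    return out
```

Plotting `site_r2(...)` against `nearest_distance_km(...)` for each site gives the kind of relationship examined in this section.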

2.7. Estimation

We estimated the daily concentrations of PM_2.5 from 2010 to 2017 in the study area at a 1 km × 1 km resolution using the developed model. The total number of prediction grid cells was 53,455. The annual trends in the study area and in Amagasaki were obtained by temporally and spatially aggregating the daily estimates.

2.8. Computation

The analyses were performed through R statistical software 4.1.1 (R Core Team, Vienna, Austria) [32] using sf [33] and stars [34] for spatial computation, and ranger [31] to train the random forest algorithm.

3. Results

The variable importance measures for the predictors are shown in Figure 2. The predictor variables are arranged from top to bottom according to their importance measures. As can be observed, SPM is the most informative predictor, followed by CMAQ-simulated PM_2.5. Gaseous pollutants such as SO₂, NO₂, and O₃ are ranked as important. WRF-simulated temperature is also ranked as important. These predictors have a temporal resolution of one day. Meanwhile, other predictors that were temporally constant over the study period are ranked as less important.

Figure 3 presents the results of the spatial and temporal cross-validation. The model exhibits high R² values of 0.91 and 0.92 for the spatial and temporal cross-validation, respectively. RMSE values are 2.3 and 2.2 µg m⁻³ for the spatial and temporal cross-validation, respectively. The spatial and temporal cross-validation results are summarized by year (Figures S6 and S7). Although the number of monitoring stations and concentration range differ between years, there is no clear difference in the prediction performance between years.

We calculated the R² values from the spatial cross-validation results for each monitoring location and plotted them as a function of the distance to the nearest PM_2.5 station (Figure 4). For comparison, we plotted the same relationship using the spatial cross-validation results obtained without co-existing pollutant (NO₂, SO₂, O₃, and SPM) concentrations as predictors. Although the R² values varied among monitoring sites, they were consistently high and did not change with distance to the nearest monitor location up to 10 km. Even at distances greater than 10 km, the mean R² value was 0.88. In contrast, when particulate and gaseous pollutant concentrations were not used as predictors, the R² values decreased with distance.

We estimated the daily PM_2.5 concentrations in the study area from 2010 to 2017 at a resolution of 1 km × 1 km. Figure 5a presents the temporally aggregated daily PM_2.5 predictions through the study period (eight-year mean). Higher concentrations are observed in the middle-western part, where a large industrial area is located. In addition, megacities such as Osaka and Nagoya exhibit relatively high concentrations. Figure S8 shows the annual average prediction maps. The estimated PM_2.5 concentrations generally decreased throughout the study area during the study period. This is also confirmed from Figure 5b, which shows the difference in annual aggregated predictions between 2010 and 2017. Figure 6 presents the temporal trend of the annually and spatially aggregated estimations in the study area. It shows a downward trend, particularly after 2013. A similar tendency, but with a slightly higher concentration level, is seen in the annual trend in Amagasaki (Figure S9).

Based on the PM_2.5 estimations, the cumulative percentage of the population and grid cells of the annual means are shown for 2013 and 2017, respectively (Figure 7). In addition, Figure 7 presents the cumulative percentage of annual mean observations. We used the results for 2013 because the number of monitors was limited before 2013 (Figure 1). In 2013, 23.4% of the grid cells exhibited annual mean PM_2.5 concentrations of more than 15 µg m⁻³, which is the national air quality standard in Japan. Meanwhile, more than 62% of the population in the study area lived in areas that exceeded the standard. Similarly, 70% of PM_2.5 monitors recorded annual means of more than 15 µg m⁻³. In contrast, in 2017, these proportions reduced drastically to 0.8%, 2.4%, and 7.8% for grid cells, population, and monitors, respectively.
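The exceedance percentages reported above are unweighted (grid cells, monitors) or population-weighted shares of annual means above the standard. A minimal Python sketch of that calculation follows; the function name and the toy numbers in the usage note are hypothetical and are not data from the study.

```python
STANDARD = 15.0  # annual-mean standard in µg m^-3, as stated in the text


def share_above(values, threshold=STANDARD, weights=None):
    """Percentage of items whose value exceeds the threshold.

    With weights=None every item counts equally (grid cells or monitors).
    Passing the population of each grid cell as weights gives the share
    of the population living in cells above the threshold.
    """
    if weights is None:
        weights = [1.0] * len(values)
    total = sum(weights)
    above = sum(w for v, w in zip(values, weights) if v > threshold)
    return 100.0 * above / total
```

For four hypothetical cells with annual means of 12, 16, 14, and 18 µg m⁻³ and populations of 100, 500, 100, and 300, half of the cells exceed the standard but 80% of the population lives in them, which illustrates how the grid-cell and population shares can diverge.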

4. Discussion

We developed a daily PM_2.5 estimation model and estimated the daily PM_2.5 concentrations from 2010 to 2017 in the Kansai region of Japan. We achieved high spatial and temporal cross-validated R² values of 0.91 and 0.92, respectively. No significant over- or under-estimation was observed (Figure 3). The cross-validation accuracy did not change over the years during the study period (Figures S6 and S7), indicating the temporal robustness of the developed model. Therefore, the daily estimates obtained by the model are accurate and can be applied to the exposure assessment of PM_2.5. The model performance is similar to or better than that reported in previous studies in the U.S. (R²: 0.84 [11], 0.80 [6], 0.89 [9]), Europe (R²: 0.72–0.79 [7], 0.79–0.81 [8]), China (R²: 0.83 [35], 0.88 [10]), and Japan (R²: 0.81 [23], 0.86 [36]). Note that the geographical space, sample size, data range, modeling approach, and validation methods differ between studies.

We used particulate and gaseous pollutants as predictors for the daily PM_2.5 model. These variables were rated as important predictors (Figure 2). This is consistent with the results of the monthly PM_2.5 model in our previous work [15], as well as the daily PM_2.5 model in the U.S. [13] and China [14]. Sulfate and nitrate, the major constituents of PM_2.5, are produced via oxidation of SO₂ and NO₂, respectively, where O₃ is involved [14]. The primary components of PM_2.5 share emission sources with other primary air pollutants such as NO₂ and SO₂. Because of these features, these pollutants are informative predictor variables for PM_2.5 models with various temporal scales and regions.

The CMAQ-simulated PM_2.5 concentrations were ranked as the second most important predictor variable (Figure 2). Previous studies reported that the CMAQ output is an informative predictor in the daily PM_2.5 models in the U.S. [9], China [10], and Japan [23]. These predictors, including WRF-simulated meteorological parameters, are rated as being more important than temporally invariant variables. The traffic-related variables were assumed to be constant during the study period, despite day-to-day variation. To improve the accuracy of the model, finer temporal variations in these variables should be considered.

Most of the monitoring stations are distributed in urban and densely populated areas, compared to the rural or mountainous areas. The uneven distribution results in inhomogeneous training data that does not fully represent the relationship between target air pollutant concentrations and predictors in the whole study domain. Because the relationship may change depending on regions with different characteristics, an estimation model cannot adequately learn the relationship for the areas where sufficient data are unavailable. This is a possible reason for the spatial heterogeneity in the model accuracy. This cannot be overcome solely by introducing or developing state-of-the-art models, because a model cannot learn what is not in the training data. The redistribution of monitoring stations only for prediction accuracy purpose would be difficult. Therefore, exploring effective predictors is a potential and practical solution for the spatially invariant model accuracy that enables unbiased exposure estimates and contributes to health impact assessment.

Our model exhibited accurate predictive performance regardless of the distance to the neighboring PM_2.5 monitoring site (Figure 4). This indicates that the estimates from the model are accurate even for areas distant from the PM_2.5 monitors, owing to the use of the co-existing pollutant concentrations as predictors. Spatial homogeneity in estimation accuracy is extremely important for prediction models. When the prediction performance varies over the study domain, there is heterogeneity in the accuracy of the exposure estimates, which may lead to unreliable health impact assessment. Therefore, the modeling approach presented in this paper can significantly contribute to epidemiological studies. The improvement from these predictors may be because the monitoring network for the associated pollutants is denser than that for PM_2.5 (Figure 1). Specifically, SPM was found to be the most informative predictor in the model (Figure 2), and it has the largest number of monitoring locations among the pollutants. Although it is not clear from our results how these co-existing pollutants work as predictors when PM_2.5 monitors are as dense as those of these pollutants, it can be expected that they at least contribute to the overall model accuracy. Di et al. showed that some of the cross-validated R² values at individual monitoring sites were between 0.5 and 0.7, in spite of the high overall R² value close to 0.9 for the daily PM_2.5 model in the U.S. [9]. Similar spatial variation was observed in their previous study [11]. In contrast, in our work, 88% of the monitoring sites exhibited R² values greater than 0.85. Huang et al. showed that R² decreased for monitoring sites whose distance to the neighboring monitoring site was greater than approximately 4 km in the daily PM_2.5 model in China [10]. Meanwhile, our R² values at each monitoring site remained high for distances to the nearest site of at least up to 10 km. These results suggest that our modeling approach using co-existing pollutant concentrations as predictors enabled spatially invariant predictive accuracy for the daily PM_2.5 model. In particular, given its importance as a predictor of the model, the kriged SPM may have contributed to the spatially homogeneous accuracy. Based on the accurate daily estimates combined with the ongoing birth cohort, critical windows of prenatal PM_2.5 exposure that have a greater impact on birth outcomes and the health of the child can be identified, thereby contributing to environmental policy making.

The PM_2.5 concentrations decreased throughout the study area during the study period (Figure 5b, Figure 6 and Figure S8). PM_2.5 concentrations in Japan are influenced by long-range transport from the Asian continent [37]. The downward trend of PM_2.5 after 2013 in Japan was influenced by the reduction in emissions in China [38]. Meanwhile, a large decline in PM_2.5 was observed in areas with high pollution levels (Figure 5b), which generally reflects the locations of large industrial and densely populated areas (not shown here). Thus, the overall improvement in the PM_2.5 pollution status in the study area may be attributed to the decrease in outflow from the continent as well as in local pollution.

The remarkable difference in the cumulative distribution between the grid cells and monitors (Figure 7) indicates that the monitoring data do not represent the pollution status in the entire study area. This discrepancy reflects the concentrated distribution of monitoring sites in urban areas (Figure S1). A similar situation is observed in the U.S., where monitoring site averages are higher than grid-based averages of nationwide NO₂ [12]. Although the distribution is similar between the monitored and population exposure concentrations, there is a noticeable discrepancy. The proportions of the monitored concentrations are higher in both the lower and higher concentration ranges. This is probably because the PM_2.5 monitors are generally located in the more populated areas, as well as in background areas for surveillance purposes. Consequently, the monitored concentrations do not adequately represent population exposure. Therefore, spatial variations should be estimated properly by a prediction model for both accurate exposure estimates and a proper understanding of the pollution status.

This study has some limitations. First, it could be difficult for epidemiological studies to distinguish the health effects associated with PM_2.5 from those of the co-existing pollutants, which are used as inputs to the estimation model. This is a trade-off of the spatially invariant predictive accuracy achieved by using co-existing pollutants. However, as the importance of the gaseous pollutants was relatively small in the developed model (Figure 2), this side effect is limited in our study. Second, our approach needs to be evaluated further to improve its robustness from several perspectives. It should be validated particularly in areas where the relationship between PM_2.5 and particulate pollutants (SPM or PM₁₀) changes spatially and temporally. This may happen when the composition of the particulate matter varies owing to changes in the emission characteristics. The modeling region of our study is small compared to those of the previous studies that reported spatially varying model accuracy; therefore, further evaluation in a larger geographical domain is also required. The monitoring network for the co-existing pollutants was denser than that for PM_2.5 in our study area. Therefore, it should be investigated how the proposed approach works where the number of SPM or PM₁₀ monitors is similar to that of PM_2.5 monitors.

5. Conclusions

We successfully developed a daily PM_2.5 model and accurately estimated the daily PM_2.5 surface from 2010 to 2017 at a 1 km × 1 km resolution in the Kansai region of Japan. A spatially homogeneous prediction accuracy was achieved by incorporating gaseous and particulate pollutant concentrations as predictors of the model, which is beneficial for epidemiological studies. The daily PM_2.5 estimates will be further used in a related birth cohort study to evaluate the potential impact on human health.

Supplementary Materials

The following are available online at https://www.mdpi.com/article/10.3390/atmos13050782/s1, Figure S1: Study area, Figure S2: Histograms of the daily PM_2.5 observations used in this study, Figure S3: Boxplots of the daily PM_2.5 observations used in this study, Table S1: Predictors used in this study, Table S2: Data sources, Figure S4: Validation results of ordinary kriging for co-existing pollutants, Figure S5: Scatter plots of the CMAQ-simulated and observed daily PM_2.5 concentrations in the study area from 2010 to 2017, Figure S6: Scatter plots of the observed and predicted concentrations obtained by spatial cross-validation summarized by year, Figure S7: Scatter plots of the observed and predicted concentrations obtained by temporal cross-validation summarized by year, Figure S8: Annual prediction maps obtained by aggregating the daily predictions for the study period in each year, Figure S9: Temporal trend of PM_2.5 concentrations in Amagasaki.

Author Contributions

Conceptualization, S.A.; methodology, S.A.; validation, H.S. and M.S.; formal analysis, S.A.; data curation, H.S.; writing—original draft preparation, S.A.; writing—review and editing, H.S., H.H., Y.Y. and M.S.; visualization, S.A.; project administration, M.S.; funding acquisition, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI Grant Number JP19K12370 and JP21H03205, and the Environment Research and Technology Development Fund (JPMEERF20185002 and JPMEERF20195055) of the Environmental Restoration and Conservation Agency of Japan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rich, D.Q.; Demissie, K.; Lu, S.E.; Kamat, L.; Wartenberg, D.; Rhoads, G.G. Ambient air pollutant concentrations during pregnancy and the risk of fetal growth restriction. J. Epidemiol. Community Health 2009, 63, 488–496.
2. Faiz, A.S.; Rhoads, G.G.; Demissie, K.; Lin, Y.; Kruse, L.; Rich, D.Q. Does ambient air pollution trigger stillbirth? Epidemiology 2013, 24, 538–544.
3. Puett, R.C.; Hart, J.E.; Yanosky, J.D.; Spiegelman, D.; Wang, M.; Fisher, J.A.; Hong, B.; Laden, F. Particulate Matter Air Pollution Exposure, Distance to Road, and Incident Lung Cancer in the Nurses’ Health Study Cohort. Environ. Health Perspect. 2014, 122, 926–932.
4. Fleischer, N.L.; Merialdi, M.; van Donkelaar, A.; Vadillo-Ortega, F.; Martin, R.V.; Betran, A.P.; Souza, J.P. Outdoor Air Pollution, Preterm Birth, and Low Birth Weight: Analysis of the World Health Organization Global Survey on Maternal and Perinatal Health. Environ. Health Perspect. 2014, 122, 425–430.
5. Cohen, A.J.; Brauer, M.; Burnett, R.; Anderson, H.R.; Frostad, J.; Estep, K.; Balakrishnan, K.; Brunekreef, B.; Dandona, L.; Dandona, R.; et al. Estimates and 25-year trends of the global burden of disease attributable to ambient air pollution: An analysis of data from the Global Burden of Diseases Study 2015. Lancet 2017, 389, 1907–1918.
6. Hu, X.; Belle, J.H.; Meng, X.; Wildani, A.; Waller, L.A.; Strickland, M.J.; Liu, Y. Estimating PM_2.5 Concentrations in the Conterminous United States Using the Random Forest Approach. Environ. Sci. Technol. 2017, 51, 6936–6944.
7. Stafoggia, M.; Bellander, T.; Bucci, S.; Davoli, M.; Hoogh, K.D.; Donato, F.D.; Gariazzo, C.; Lyapustin, A.; Michelozzi, P.; Renzi, M.; et al. Estimation of daily PM₁₀ and PM_2.5 concentrations in Italy, 2013–2015, using a spatiotemporal land-use random-forest model. Environ. Int. 2019, 124, 170–179.
8. Shtein, A.; Kloog, I.; Schwartz, J.; Silibello, C.; Michelozzi, P.; Gariazzo, C.; Viegi, G.; Forastiere, F.; Karnieli, A.; Just, A.C.; et al. Estimating Daily PM_2.5 and PM₁₀ over Italy Using an Ensemble Model. Environ. Sci. Technol. 2019, 54, 120–128.
9. Di, Q.; Amini, H.; Shi, L.; Kloog, I.; Silvern, R.; Kelly, J.; Sabath, M.B.; Choirat, C.; Koutrakis, P.; Lyapustin, A.; et al. An ensemble-based model of PM_2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environ. Int. 2019, 130, 104909.
10. Huang, C.; Hu, J.; Xue, T.; Xu, H.; Wang, M. High-Resolution Spatiotemporal Modeling for Ambient PM_2.5 Exposure Assessment in China from 2013 to 2019. Environ. Sci. Technol. 2021, 55, 2152–2162.
11. Di, Q.; Kloog, I.; Koutrakis, P.; Lyapustin, A.; Wang, Y.; Schwartz, J. Assessing PM_2.5 Exposures with High Spatiotemporal Resolution across the Continental United States. Environ. Sci. Technol. 2016, 50, 4712–4721.
12. Di, Q.; Amini, H.; Shi, L.; Kloog, I.; Silvern, R.F.; Kelly, J.T.; Sabath, M.B.; Choirat, C.; Koutrakis, P.; Lyapustin, A.; et al. Assessing NO₂ Concentration and Model Uncertainty with High Spatiotemporal Resolution across the Contiguous United States Using Ensemble Model Averaging. Environ. Sci. Technol. 2019, 54, 1372–1384.
13. Li, L.; Wu, A.H.; Cheng, I.; Chen, J.C.; Wu, J. Spatiotemporal estimation of historical PM_2.5 concentrations using PM₁₀, meteorological variables, and spatial effect. Atmos. Environ. 2017, 166, 182–191.
14. Wang, X.; Sun, W. Meteorological parameters and gaseous pollutant concentrations as predictors of daily continuous PM_2.5 concentrations using deep neural network in Beijing–Tianjin–Hebei, China. Atmos. Environ. 2019, 211, 128–137.
15. Araki, S.; Hasunuma, H.; Yamamoto, K.; Shima, M.; Michikawa, T.; Nitta, H.; Nakayama, S.F.; Yamazaki, S. Estimating monthly concentrations of ambient key air pollutants in Japan during 2010–2015 for a national-scale birth cohort. Environ. Pollut. 2021, 284, 117483.
16. Kawamoto, T.; Nitta, H.; Murata, K.; Toda, E.; Tsukamoto, N.; Hasegawa, M.; Yamagata, Z.; Kayama, F.; Kishi, R.; Ohya, Y.; et al. Rationale and study design of the Japan environment and children’s study (JECS). BMC Public Health 2014, 14, 25.
Katanoda, K.; Sobue, T.; Satoh, H.; Tajima, K.; Suzuki, T.; Nakatsuka, H.; Takezaki, T.; Nakayama, T.; Nitta, H.; Tanabe, K.; et al. An association between long-term exposure to ambient air pollution and mortality from lung cancer and respiratory diseases in Japan. J. Epidemiol. 2011, 21, 132–143. [Google Scholar] [CrossRef] [Green Version]
Araki, S.; Shimadera, H.; Yamamoto, K.; Kondo, A. Effect of spatial outliers on the regression modelling of air pollutant concentrations: A case study in Japan. Atmos. Environ. 2017, 153, 83–93. [Google Scholar] [CrossRef] [Green Version]
Wu, C.D.; Zeng, Y.T.; Lung, S.C.C. A hybrid kriging/land-use regression model to assess PM_2.5 spatial-temporal variability. Sci. Total Environ. 2018, 645, 1456–1464. [Google Scholar] [CrossRef]
Chen, T.H.; Hsu, Y.C.; Zeng, Y.T.; Candice Lung, S.C.; Su, H.J.; Chao, H.J.; Wu, C.D. A hybrid kriging/land-use regression model with Asian culture-specific sources to assess NO₂ spatial-temporal variations. Environ. Pollut. 2020, 259, 113875. [Google Scholar] [CrossRef]
Byun, D.; Schere, K.L. Review of the Governing Equations, Computational Algorithms, and Other Components of the Models-3 Community Multiscale Air Quality (CMAQ) Modeling System. Appl. Mech. Rev. 2006, 59, 51–77. [Google Scholar] [CrossRef]
Skamarock, W.C.; Klemp, J.B. A time-split nonhydrostatic atmospheric model for weather research and forecasting applications. J. Comput. Phys. 2008, 227, 3465–3485. [Google Scholar] [CrossRef]
Thongthammachart, T.; Araki, S.; Shimadera, H.; Eto, S.; Matsuo, T.; Kondo, A. An integrated model combining random forests and WRF/CMAQ model for high accuracy spatiotemporal PM_2.5 predictions in the Kansai region of Japan. Atmos. Environ. 2021, 262, 118620. [Google Scholar] [CrossRef]
Araki, S.; Shima, M.; Yamamoto, K. Spatiotemporal land use random forest model for estimating metropolitan NO₂ exposure in Japan. Sci. Total Environ. 2018, 634, 1269–1277. [Google Scholar] [CrossRef] [PubMed]
Fukui, T.; Kokuryo, K.; Baba, T.; Kannari, A. Updating EAGrid2000-Japan emissions inventory based on the recent emission trends. J. Jpn. Soc. Atmos. Environ. 2014, 49, 117–125. [Google Scholar] [CrossRef]
Vienneau, D.; de Hoogh, K.; Briggs, D. A GIS-based method for modelling air pollution exposures across Europe. Sci. Total Environ. 2009, 408, 255–266. [Google Scholar] [CrossRef]
Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
Brokamp, C.; Jandarov, R.; Rao, M.B.; LeMasters, G.; Ryan, P. Exposure assessment models for elemental components of particulate matter in an urban environment: A comparison of regression and random forest approaches. Atmos. Environ. 2017, 151, 1–11. [Google Scholar] [CrossRef] [Green Version]
Meng, X.; Hand, J.L.; Schichtel, B.A.; Liu, Y. Space-time trends of PM_2.5 constituents in the conterminous United States estimated by a machine learning approach, 2005–2015. Environ. Int. 2018, 121, 1137–1147. [Google Scholar] [CrossRef]
Ma, R.; Ban, J.; Wang, Q.; Zhang, Y.; Yang, Y.; He, M.Z.; Li, S.; Shi, W.; Li, T. Random forest model based fine scale spatiotemporal O₃ trends in the Beijing-Tianjin-Hebei region in China, 2010 to 2017. Environ. Pollut. 2021, 276, 116635. [Google Scholar] [CrossRef]
Wright, M.N.; Ziegler, A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J. Stat. Softw. 2017, 77, 1–17. [Google Scholar] [CrossRef] [Green Version]
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021. [Google Scholar]
Pebesma, E. Simple Features for R: Standardized Support for Spatial Vector Data. R J. 2018, 10, 439–446. [Google Scholar] [CrossRef] [Green Version]
Pebesma, E. Stars: Spatiotemporal Arrays, Raster and Vector Data Cubes, R Package Version; 0.5-2; 2021. Available online: https://r-spatial.github.io/stars/ (accessed on 9 May 2022).
Chen, G.; Li, S.; Knibbs, L.D.; Hamm, N.; Cao, W.; Li, T.; Guo, J.; Ren, H.; Abramson, M.J.; Guo, Y. A machine learning method to estimate PM_2.5 concentrations across China with remote sensing, meteorological and land use information. Sci. Total Environ. 2018, 636, 52–60. [Google Scholar] [CrossRef] [PubMed]
Jung, C.R.; Chen, W.T.; Nakayama, S.F. A national-scale 1-km resolution PM_2.5 estimation model over japan using maiac aod and a two-stage random forest model. Remote Sens. 2021, 13, 3657. [Google Scholar] [CrossRef]
Shimadera, H.; Kojima, T.; Kondo, A. Evaluation of Air Quality Model Performance for Simulating Long-Range Transport and Local Pollution of PM_2.5 in Japan. Adv. Meteorol. 2016, 2016, 5694251. [Google Scholar] [CrossRef] [Green Version]
Uno, I.; Wang, Z.; Yumimoto, K.; Itahashi, S.; Osada, K.; Irie, H.; Yamamoto, S.; Hayasaki, M.; Sugata, S. Is PM_2.5 Trans-boundary Environmental Problem in Japan dramatically improving? J. Jpn. Soc. Atmos. Environ. 2017, 52, 177–184. [Google Scholar] [CrossRef]

Figure 1. Number of monitoring locations for PM_2.5, SPM, NO₂, O₃, and SO₂ in the AQMN in the Kansai region over the study period.

Figure 2. Variable importance measures of the prediction model. The predictor variables are listed in the order of importance from top to bottom. The horizontal axis represents the measure of importance.

Figure 3. Scatter plots of the observed and predicted concentrations obtained by spatial and temporal cross-validation. Red represents higher point density and blue represents lower density. The line in each panel is the 1:1 line.

Figure 4. Relationship between the spatially cross-validated R² value and the distance to the nearest station for each monitoring location (red dots). The corresponding relationship for the results obtained without co-existing pollutant concentrations is plotted as blue dots.

Figure 5. Prediction map obtained by aggregating the daily predictions over the study period (a) and the difference in the estimated concentrations between 2010 and 2017 (b). Units are µg m⁻³.

Figure 6. Temporal trend of PM_2.5 in the study area. The annual values were obtained by averaging the daily predictions for each year over the study area.

Figure 7. Cumulative distribution of the annual mean PM_2.5 for population, grid cells, and monitors. The vertical dashed lines represent the air quality standard in Japan.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
