Given its relatively long lifetime in the troposphere, carbon
monoxide (CO) is commonly employed as a tracer for characterizing airborne
pollutant distributions. The present study aims to estimate the
spatiotemporal distributions of ground-level CO concentrations across China
during 2013–2016. We refined the random-forest–spatiotemporal kriging
(RF–STK) model to simulate the daily CO concentrations on a 0.1
Ground-level carbon monoxide (CO) is a worldwide atmospheric pollutant posing risks to human health and the environment (White et al., 1990; Reeves et al., 2002). While CO is formed naturally from the oxidation of methane and non-methane volatile organic compounds, anthropogenic emissions from incomplete combustion of fossil fuels and biofuels contribute approximately 42 % of the total atmospheric CO (Holloway et al., 2000; Pommier et al., 2013). Although satellite retrievals indicate a slow decrease in CO concentrations in recent years (Xia et al., 2016; Zheng et al., 2018), China is still one of the countries with the most severe CO pollution in the world, and the combustion of fossil fuels is the dominant source of anthropogenic CO emissions (Wang et al., 2004; Duncan et al., 2007a). Due to its relatively long lifetime in the troposphere (i.e., 1 to 2 months), CO is commonly employed as a tracer for characterizing pollutant transport in the atmosphere (Goldan et al., 2000; Pommier et al., 2010). It is therefore essential to obtain the spatiotemporal distribution of CO for air quality management. The national air pollution monitoring network in mainland China has been regularly observing ground-level CO concentrations since 2013 (MEPC, 2017) by the non-dispersive infrared absorption method and the gas filter correlation infrared absorption method (CNEMC, 2013), but these site-based measurements are inadequate to represent the spatially continuous distributions of CO (Xu et al., 2014).
Chemical transport models (CTMs) have been employed to estimate ground-level CO concentrations (Arellano and Hess, 2006; Hu et al., 2016). On the basis of meteorological conditions generated by climate models, CTMs simulate reactions, transport, and deposition of chemicals in the atmosphere, which generally require high computational cost and a large number of data inputs such as emission inventories. The predictive performance of CTMs tends to be affected by uncertainties in the simulation algorithms and the emission inventories (Li et al., 2010; Hu et al., 2017a). A CTM comparison study found that the difference in transport simulation resulted in considerable discrepancies between inter-model CO predictions (Arellano and Hess, 2006; Duncan et al., 2007b). It has been reported that a certain CTM underpredicted the monthly average CO concentrations in China by more than 60 % (Hu et al., 2016). Although the emission inventories for China have been refined in recent years, high uncertainties still exist (Li et al., 2017). For instance, biomass combustion, residential biofuel consumption, and transient fire events tend to be underreported, consequently leading to underestimation of CO emissions in the emission inventories (Wang et al., 2002; Streets et al., 2003). Despite underestimation by CTMs, the general patterns of CO concentrations are captured, and they can be used as the a priori information for deriving posterior estimates based on satellite retrievals (Deeter et al., 2014).
Multiple satellite instruments have been operating to measure atmospheric CO
for more than a decade, including the Measurements of Pollution in the
Troposphere (MOPITT) (Deeter et al., 2003; Worden et al., 2013a; Jiang et
al., 2015; Deeter et al., 2017), the Atmospheric Infrared Sounder
(McMillan, 2005; Wang et al., 2018), the Scanning Imaging Absorption
Spectrometer for Atmospheric Chartography (Kopacz et al., 2010; Ul-Haq et
al., 2016), and the Infrared Atmospheric Sounding Interferometer
(Fortems-Cheiney et al., 2009; Barret et al., 2016). Strong absorption
lines of CO occur in the thermal infrared (4.7
Machine learning models have been applied to predict spatiotemporal
distributions of atmospheric pollutants, such as fine particulate matter
(
The present study aims to estimate the spatiotemporal distributions of
ground-level CO concentrations across China during 2013–2016. We refined the
RF–STK model to simulate the daily gridded CO concentrations (0.1
Ground-level CO monitoring network for China in 2013–2016 with 1656 sites in total. The central Tibetan Plateau (CTP) and the North China Plain (NCP) are labeled on the map. The red dashed line represents the Heihe–Tengchong Line, an imaginary “geo-demographic demarcation line” reflecting the disparity in the population distribution. Around 95 % of the population lives to the east of the line, where 82 % of the monitoring sites are located.
Figure 1 shows the locations of the 1656 monitoring sites across China
that measured the ground-level CO concentrations
(MEPC, 2017; EPAROC, 2017; EPDHK, 2017). Most of the sites were
in the cities of eastern China, leading to non-negligible sampling
biases. Hourly average CO concentrations (mg m
Correlations among the ground observations, MOPITT CO, and the RF–STK predictions (Pearson correlation coefficients).
The MOPITT operational gas correlation spectroscopy CO product (MOP02J.007),
containing retrievals of surface CO mixing ratios, was obtained from the
Atmospheric Science Data Center (ASDC, 2017). The MOPITT onboard the
Terra satellite provides tropospheric CO density with global coverage every
3 d (Edwards et al.,
2004). The CO surface mixing ratios from the Level-2 data product have a
spatial resolution of 22 km at nadir. The Level-2 product has daytime and
nighttime data fields, which are highly correlated (
According to the ideal gas law, we converted the unit of MOPITT CO data from parts per billion
(ppb, the unit presented in the MOPITT product) to milligrams per cubic meter (mg m
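The conversion from a mixing ratio to a mass concentration can be sketched as follows. This is a minimal illustration of the ideal gas law only; the default pressure and temperature (101 325 Pa and 298.15 K) and the function name are assumptions for illustration, since the conditions actually used in the study are not stated in the surviving text.

```python
R_GAS = 8.314462618  # universal gas constant, J mol^-1 K^-1
M_CO = 28.01         # molar mass of CO, g mol^-1


def ppb_to_mg_m3(ppb, pressure_pa=101325.0, temperature_k=298.15):
    """Convert a CO mixing ratio (ppb) to a mass concentration (mg m^-3).

    The default pressure and temperature are illustrative assumptions.
    """
    # Ideal gas law: moles of air per cubic meter = P / (R T).
    air_mol_per_m3 = pressure_pa / (R_GAS * temperature_k)
    # Moles of CO per cubic meter of air.
    co_mol_per_m3 = ppb * 1e-9 * air_mol_per_m3
    # Moles -> grams (molar mass) -> milligrams.
    return co_mol_per_m3 * M_CO * 1e3


# Example: 1000 ppb of CO at the assumed conditions.
print(round(ppb_to_mg_m3(1000.0), 3))
```

Because the air density term depends on pressure and temperature, the same mixing ratio corresponds to a lower mass concentration at high elevations such as the Tibetan Plateau.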
In order to evaluate the dependence of the MOPITT surface retrievals on the a priori information, we also extracted the averaging kernels and the a priori information from the MOPITT product. For each averaging kernel (a matrix), the sum of the elements in the row associated with the surface layer of the CO profile (hereafter referred to as the row-sum value) measures the overall dependence of the MOPITT surface CO retrievals on the a priori information (Deeter, 2017). A small row-sum value indicates strong dependence of the MOPITT retrieval on the a priori information, i.e., low sensitivity of the actual MOPITT retrieval. Please refer to Sect. S2 in the Supplement for the explanation of the averaging kernels.
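The role of the row-sum value can be illustrated with a small sketch. It assumes the standard linearized retrieval relation, in which the retrieved profile equals the a priori profile plus the averaging kernel applied to the difference between the true and a priori profiles. The 3-level kernel, the profile values, and the choice of index 0 for the surface layer are made-up toy inputs, not values from the MOPITT product.

```python
def surface_row_sum(kernel, surface_index=0):
    """Sum of the averaging-kernel row associated with the surface layer."""
    return sum(kernel[surface_index])


def retrieved_surface(kernel, prior, truth, surface_index=0):
    """Surface value of prior + A (truth - prior) for a toy profile."""
    row = kernel[surface_index]
    adjustment = sum(a * (t - p) for a, t, p in zip(row, truth, prior))
    return prior[surface_index] + adjustment


# Toy 3-level averaging kernel (rows: retrieved levels, surface first).
kernel = [
    [0.20, 0.10, 0.05],
    [0.10, 0.30, 0.15],
    [0.05, 0.15, 0.40],
]
prior = [2.0, 1.9, 1.8]            # toy a priori profile (e.g., log mixing ratio)
truth = [p + 0.4 for p in prior]   # true profile shifted uniformly by 0.4

row_sum = surface_row_sum(kernel)
surface = retrieved_surface(kernel, prior, truth)
# For a uniform shift, the retrieval moves by row_sum * shift from the prior.
print(row_sum, surface - prior[0])
```

In this toy case only about a third of the true departure from the a priori profile reaches the retrieved surface value, which is the sense in which a small row-sum value indicates strong dependence on the a priori information.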
The RF–STK model, consisting of a random forest (RF) submodel and a
spatiotemporal kriging (STK) submodel, was refined to predict the daily ground-level
CO concentrations across China. The RF–STK model utilizes the strengths of
both RF and STK, which showed the capability of predicting
The RF submodel is an ensemble of regression trees, and the RF prediction is the average of the predictions of all the trees. To grow each tree, a random training dataset is drawn from the original training dataset through bootstrap resampling, and a random subset of the predictors is chosen in order to reduce the correlation among the trees. At each tree node, the best split is the one that yields the largest decrease in the squared error. Please refer to Sect. S3 in the Supplement for the detailed description of the RF algorithm.
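The node-splitting criterion can be sketched for a single predictor as follows. This is a simplified illustration, not the scikit-learn implementation used in the study: it omits bootstrap resampling and the random predictor subset, and the variable names and toy values (a wind-speed predictor and CO response) are hypothetical.

```python
def sum_squared_error(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, decrease) for the split of one predictor that
    gives the largest decrease in the squared error."""
    parent_sse = sum_squared_error(y)
    pairs = sorted(zip(x, y))
    best_threshold, best_decrease = None, 0.0
    for i in range(1, len(pairs)):
        left_x, right_x = pairs[i - 1][0], pairs[i][0]
        if left_x == right_x:
            continue  # no threshold separates identical predictor values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        decrease = parent_sse - sum_squared_error(left) - sum_squared_error(right)
        if decrease > best_decrease:
            best_threshold = (left_x + right_x) / 2.0
            best_decrease = decrease
    return best_threshold, best_decrease


# Toy data: CO is high at low wind speed and low at high wind speed.
wind_speed = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
co = [5.0, 5.0, 5.0, 1.0, 1.0, 1.0]
threshold, decrease = best_split(wind_speed, co)
print(threshold, decrease)
```

A full regression tree repeats this search over the candidate predictors at every node and recurses on the two child nodes until a stopping rule is met.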
As the CO concentrations approximated a lognormal distribution, they were
log-transformed for variance stabilization (De'Ath and Fabricius,
2000). Variable selection was conducted based on
preliminary experiments. The out-of-bag (OOB) errors (representing the RF prediction
residuals) of the back-transformed RF predictions were filtered with the
“three-sigma rule” and subsequently interpolated with the STK submodel.
Finally, the CO concentrations were predicted as the sums of the STK
interpolations and back-transformed RF predictions. It is worth mentioning
that the RF submodel was refined in the present study by inversely weighting
each training sample with the surrounding population density to alleviate
the effects of sampling bias towards populous areas for the monitoring
network. The loss function (
The predictors of environmental conditions for the RF–STK model covered the
meteorological conditions, land uses, emission inventories, elevation,
population densities, normalized difference vegetation index (NDVI), and
road densities. The meteorological conditions included the atmospheric
pressure, air temperature, precipitation, evaporation, relative humidity,
insolation duration, wind speed, and planetary boundary layer height (PBLH).
Land uses mainly recorded the areas of forests, grasslands, wetlands,
artificial surfaces, and waterbodies. The emission inventories comprised
emission distributions of 10 major atmospheric chemical constituents, such
as CO, organic carbon, and black carbon. The meteorological conditions,
except for PBLH, were interpolated to the 0.1
The predictive performance and the predictor effects of the RF–STK model
were investigated. We compared the predictive performance of the RF–STK
models with and without the MOPITT data (either the a priori information or the
MOPITT retrievals) by using two cross-validation strategies, including the
site- and region-based cross-validation. With the 10-fold site-based
cross-validation, all the monitoring sites were approximately evenly divided
into 10 groups. In each iteration, nine groups were used to develop a
model, and the remaining group was used for validation. The training and
prediction steps were repeated 10 times so that every ground-level CO
observation had a paired prediction. While the site-based cross-validation
is a commonly used strategy, it tends to overestimate the predictive
performance given the fact that the monitoring sites tend to be clustered.
Therefore, we also employed the region-based cross-validation strategy by
following the concept of cluster-based cross-validation that was proposed to
resolve the issue of clustered sites (Young et al., 2016). Different from
the site-based cross-validation, the region-based cross-validation divided
the training data by the geographic regions (e.g., North China and East
China; Fig. 1) for the cross-validation. Various statistical metrics, such
as the coefficient of determination (
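The difference between the two cross-validation strategies lies in how sites are assigned to folds, which can be sketched as follows. The site identifiers, the region labels, and the random seed are hypothetical toy inputs, and the helper names are introduced here for illustration only.

```python
import random


def site_based_folds(site_ids, n_folds=10, seed=0):
    """Assign each site to one of n_folds groups of nearly equal size."""
    shuffled = sorted(site_ids)
    random.Random(seed).shuffle(shuffled)
    return {site: i % n_folds for i, site in enumerate(shuffled)}


def region_based_folds(site_regions):
    """Assign each site to the fold of its geographic region."""
    regions = sorted(set(site_regions.values()))
    fold_of_region = {region: i for i, region in enumerate(regions)}
    return {site: fold_of_region[r] for site, r in site_regions.items()}


def train_validation_split(folds, held_out):
    """Sites used for training and for validation in one iteration."""
    train = [s for s, f in folds.items() if f != held_out]
    valid = [s for s, f in folds.items() if f == held_out]
    return train, valid


# Toy network: 25 hypothetical sites spread over three regions.
region_names = ["North", "East", "Southwest"]
site_regions = {f"site_{i:02d}": region_names[i % 3] for i in range(25)}

site_folds = site_based_folds(list(site_regions))
region_folds = region_based_folds(site_regions)
train, valid = train_validation_split(region_folds, held_out=0)
print(len(train), len(valid))
```

With region-based folds, every validation site lies in a region that contributed no training data, so the resulting scores reflect extrapolation to unmonitored areas rather than interpolation among clustered neighbors.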
Detailed spatiotemporal analyses were performed to investigate the
correlation strength between the MOPITT data (including the a priori
information and the MOPITT retrievals) and ground-level CO observations, as
well as the distributions of the ground-level CO predictions. The whole
nation was divided into seven conventional regions, including Central, East,
North, Northeast, Northwest, South, and Southwest China (Fig. 1). For each
region, the effectiveness of the MOPITT CO was evaluated by estimating its
correlation with the ground-level CO observations at daily, seasonal, and
annual scales. In addition, the seasonal and annual average concentrations maps
were delineated based on the full-coverage CO predictions. The
population-weighted averages of MOPITT CO (
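A population-weighted average can be computed as in the minimal sketch below, which weights each grid cell's concentration by the population of that cell. The function name and the three-cell toy values are assumptions for illustration; the exact formulation used in the study is not recoverable from the surviving text.

```python
def population_weighted_average(concentrations, populations):
    """Average concentration weighted by the population of each grid cell."""
    total_population = sum(populations)
    if total_population <= 0:
        raise ValueError("total population must be positive")
    weighted_sum = sum(c * p for c, p in zip(concentrations, populations))
    return weighted_sum / total_population


# Toy grid of three cells: CO in mg m^-3 and population counts.
co_cells = [0.5, 1.0, 2.0]
pop_cells = [100, 300, 600]
print(population_weighted_average(co_cells, pop_cells))
```

Here the result (1.55) exceeds the simple mean of the three cells (about 1.17) because the most polluted cell is also the most populous, which is the typical situation for urban pollutants.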
The data processing and modeling were mainly performed using Python and R (R Core Team, 2018). The scikit-learn Python package was used to develop random forests (Pedregosa et al., 2012). Spatial operations, such as spatiotemporal kriging, were conducted using the R packages gstat (Gräler et al., 2016), rgdal (Bivand et al., 2017), and sp (Pebesma and Bivand, 2005).
Seasonal averages of the MOPITT retrieved surface CO
concentrations (mg m
The ground-level CO observations from the monitoring network show that the
average CO concentration for China was
Temporal variations of the average ground-level CO concentrations
for
The MOPITT CO data, with an overall coverage rate of
Performance of the RF–STK model in predicting
The spatiotemporal pattern of the MOPITT CO was generally consistent with
that of the ground-level CO observations in China, with
The MOPITT CO satisfactorily reflected the west–east spatial gradient and
the seasonality (i.e., low in warm seasons and high in cold seasons) of
ground-level CO concentrations (Figs. 3 and S5). Severe CO pollution in eastern China resulted from the intensive anthropogenic emissions (Fig. S6).
At both national and regional scales, the correlation coefficients between
ground-level CO observations and MOPITT CO were generally higher in winter
than in the other three seasons. The stronger correlation in winter was mainly
attributed to the higher signal-to-noise ratios that accompany higher
CO concentrations, indicating that the MOPITT CO was more sensitive
to high CO concentrations. In addition, the correlation strength of
daily values exhibited considerable spatial heterogeneity, with
The discrepancies between the MOPITT CO and the ground-level CO observations could be mainly attributed to the low sensitivity of the satellite instrument to the ground-level CO variations and the high uncertainty associated with the a priori information for deriving the MOPITT retrievals. The low sensitivity caused high uncertainties in the measured radiances (associated with the instrumental noise) and hence led to large measurement errors (ASDC, 2017). In addition, the accuracy of the a priori information was influenced by the data quality of the emission inventory and the sophistication of the CTM (i.e., the CAM-Chem model), which subsequently affected the accuracy of the posterior estimation (Dekker et al., 2017). The CO emission amounts for China were reported to be largely underestimated (Streets et al., 2003; Wang et al., 2004), which might explain the fact that the MOPITT CO was approximately half of the ground-level CO observations. For the CTP in particular, inadequate information about the CO emissions could be the main reason why MOPITT CO largely underestimated the ground-level CO concentrations; in reality, some relatively densely populated cities (such as Naqu and Qamdo; Fig. 1) had high CO concentrations (Chen et al., 2019). The populations in Naqu and Qamdo are over 1 million, reflecting intensive anthropogenic activities (NBS, 2010). Biomass (e.g., yak dung) combustion, which has low utilization efficiency, is widely used in the CTP for energy, resulting in considerable CO emissions (Cai and Zhang, 2006; Wen and Tu, 2011; Xiao et al., 2015). Naqu is sandwiched between the Tanggula and the Nyenchen Tanglha Mountains (Fig. 1), a setting that is unfavorable for CO dispersion and promotes CO accumulation.
On the basis of the site-based cross-validation results, the RF–STK model
showed reasonable performance in predicting the daily ground-level CO
concentrations, with
Compared to the original RF–STK model proposed in the previous study (Zhan et al., 2018), this refined RF–STK model had two major modifications: sample weighting and logarithmic transformation of the response variable (i.e., ground-level CO observations in the present study). Inversely weighting the training samples by their surrounding population densities alleviated the effects of sampling bias towards populous areas for the monitoring network. As a result, the CO monitoring data from the sparsely populated areas (e.g., the Tibetan Plateau) gained higher weights in the model training process to compensate for the scarcity of the training samples, leading to more realistic predictions for those areas. In addition, observations with higher variations would naturally gain higher weights during model training given the loss function of squared errors, for which it was suggested to transform the response variable to achieve homogeneity of variance (De'Ath and Fabricius, 2000). The ground-level CO observations followed a heavy-tailed distribution, and hence logarithmic transformation was conducted prior to training the RF submodel. Compared with the original RF submodel, the refined RF submodel showed similar performance in the cross-validation but predicted more realistic spatial distributions of ground-level CO across China (Table S4 and Fig. S8). The spatial distributions predicted by the original RF submodel showed the prevalence of higher concentrations than those predicted by the refined RF submodel, resulting from overweighting of the training data from the areas with more severe CO pollution, e.g., the NCP.
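The two refinements can be sketched together as a weighted squared-error loss evaluated in log space. This is an illustrative assumption rather than the loss function of the study: the weights are taken as the normalized reciprocals of the population densities, and the function names and toy densities are hypothetical.

```python
import math


def inverse_population_weights(pop_densities):
    """Normalized sample weights proportional to 1 / population density."""
    if any(d <= 0 for d in pop_densities):
        raise ValueError("population densities must be positive")
    raw = [1.0 / d for d in pop_densities]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_log_squared_error(observed, predicted, weights):
    """Weighted mean squared error between log-transformed values."""
    numerator = sum(
        w * (math.log(o) - math.log(p)) ** 2
        for o, p, w in zip(observed, predicted, weights)
    )
    return numerator / sum(weights)


# Toy samples: a remote site, a town, and a dense city (persons per km^2).
densities = [10.0, 100.0, 1000.0]
weights = inverse_population_weights(densities)
observed_co = [0.4, 0.9, 2.5]   # mg m^-3, made-up values
predicted_co = [0.5, 0.9, 2.0]
print(weights)
print(weighted_log_squared_error(observed_co, predicted_co, weights))
```

In scikit-learn, weights of this kind can be supplied through the `sample_weight` argument of `RandomForestRegressor.fit`, with the log-transformed concentrations passed as the response.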
Performance comparisons of the RF–STK models with and without MOPITT data in predicting daily ground-level CO concentrations across China during 2013–2016.
Relative importance of the predictor variables in the RF–STK model. Please refer to Table S1 for the detailed descriptions of these variables.
It is noteworthy that the RF–STK model with MOPITT CO (
Annual average ground-level CO concentrations predicted by the
RF–STK model for
As a machine learning approach, the RF–STK model exhibited stable
performance across regions and seasons (Fig. S9), which was comparable or
superior to that of previous CTMs or statistical methods simulating ground-level
CO concentrations (Table S5). As the simulation areas and episodes were
considerably different among these studies, their predictive performance was
not strictly comparable. A hybrid statistical model (partial least squares
and support vector machine) exhibited decent goodness of fit in simulating
daily CO concentrations in Tehran, Iran, with fitting
On the basis of the variable importance evaluation, MOPITT CO was the most important predictor in the RF–STK model with relative importance of 9.4 %, and the emission-related predictors together accounted for 30.0 % of the total importance (Fig. 6). The partial dependence plots delineated the complicated relationships between the predictors and the ground-level CO concentrations, which could be difficult to specify in parametric models (Fig. S10). While MOPITT CO contained essential information for the RF–STK model to make accurate predictions, the high uncertainties pertaining to the MOPITT retrievals prevented the MOPITT CO from playing a dominant role in the model, and the other predictors were also indispensable. Among the emission-related predictors, the spatial-convolution-processed emission of organic carbon was the most important predictor (importance: 8.5 %), which reflected the spatiotemporal patterns of anthropogenic emissions from industrial and residential sectors (Fig. S11). Given the high intercorrelations among the predictors associated with anthropogenic emissions, only the most informative predictors were retained in the model after the variable selection (Figs. S6 and S12).
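The idea behind a partial dependence plot can be sketched as follows: for each grid value of one predictor, that predictor is set to the grid value in every sample, and the model predictions are averaged. The stand-in model, the two predictors (wind speed and an emission index), and all values are hypothetical; in practice the fitted RF submodel would take the place of `toy_model`.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction as one predictor is fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = value  # overwrite the predictor of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    """Stand-in for a fitted model: CO falls with wind, rises with emissions."""
    wind_speed, emission = row
    return 2.0 - 0.1 * wind_speed + 0.5 * emission


# Toy samples: [wind speed, emission index].
samples = [[1.0, 1.0], [2.0, 3.0], [3.0, 2.0]]
wind_grid = [0.0, 5.0, 10.0]
curve = partial_dependence(toy_model, samples, feature_index=0, grid=wind_grid)
print(curve)
```

Because the other predictors keep their observed values while one is overwritten, the curve averages over combinations that may never occur together, which is one way spatial or temporal confounding can distort the apparent relationship.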
As the most important group of predictors, the meteorological conditions
together accounted for 35.6 % of the total importance (Fig. 6). The
relative importance of temperature, evaporation, wind speed, atmospheric
pressure, PBLH, relative humidity, and insolation duration ranged from 2.8 %
to 8.6 %. In general, stagnant weather conditions, which were
characterized by shallow mixed layers, less precipitation, and low wind
speeds, occurred more frequently in winter. These weather conditions caused
accumulation of atmospheric pollutants discharged by local emissions or
transported from outside, which aggravated local air pollution (Wang et
al., 2014). Similar to other atmospheric pollutants, the CO concentrations
were also sensitive to meteorological conditions
(Xu et al., 2011). For instance, the apparently
negative associations of the CO concentrations with the PBLH and the wind
speed were delineated by the corresponding partial dependence plots (Fig. S10). Nevertheless, it should be noted that the partial dependence plot
illustrated the overall relationship and could be distorted by spatial
and/or temporal confounders. For instance, the partial dependence plot for
temperature, with a peak around 20
Seasonal average ground-level CO concentrations (mg m
The RF–STK predictions showed spatiotemporal patterns similar to those of MOPITT CO
while presenting more fine-scale details (Figs. 3 and 7). The predictions of
the RF–STK adequately assimilated the information of ground-level CO
observations, with
Temporal trends of the population-weighted average ground-level CO
concentrations (mg m
During 2013–2016, the nationwide
In comparison to the RF–STK predictions (which were very similar to
ground-level CO observations given the good model fit), the MOPITT CO
tended to underestimate the decreasing trends of ground-level CO
concentrations (Fig. 9). The absolute decreasing rate of
The issue of bias drift for the MOPITT retrievals, which could result from
long-term instrumental degradation (Deeter et al., 2017),
should also be considered in the trend analyses. The bias drift for
MOPITT CO was found to be approximately
In order to advance the knowledge of ground-level CO distributions, future
work should extend the study period and improve the spatiotemporal
resolution. We chose the period of 2013–2016 due to the data
availability at the beginning of 2018 when we started to conduct this study.
While the air pollution in China was more severe in earlier years
(Krotkov et al., 2016), no large-scale monitoring data
were available before 2013 for training the RF–STK model. Back extrapolation,
such as that in a previous study (Gulliver et al., 2016), may be conducted based
on MOPITT CO since 2000, although the issue of bias drift is currently
difficult to address. In addition, measurements or model predictions with
high spatial (e.g., 1 km) and temporal resolutions (e.g., 1 h) are
important to studies focusing on small regions, such as the CTP in this study.
In spite of its relatively coarse resolution (22 km at nadir), the MOPITT
product provided the best publicly available satellite-based measurements of
surface CO for China during 2013–2016. Since July of 2018, the TROPOspheric
Monitoring Instrument onboard the Sentinel-5P satellite has been providing
the CO product at a higher resolution of 7 km
The spatiotemporal distributions of ground-level CO concentrations for China
during 2013–2016 were derived by using the RF–STK model to assimilate the
satellite and ground-based measurements. The RF–STK model showed reasonable
performance in predicting the daily CO concentrations on the 0.1
On the basis of the spatiotemporal predictions, the population-weighted
average of ground-level CO concentrations was
The code for random forest is available from scikit-learn
(
The hourly CO concentration data are from the Ministry of Ecology and Environment of the People's Republic of China (MEPC, 2017)
(
The supplement related to this article is available online at:
DL and YZ performed research and wrote the paper. BD, YL, XD, and HZ analyzed data. FY and MLG provided extensive comments on the paper.
The authors declare that they have no conflict of interest.
The authors would like to thank Daniel Jaffe at University of Washington Bothell, Ya Tang and Guangming Shi at Sichuan University, and the anonymous referees for reviewing this paper. This research is supported by the National Natural Science Foundation of China (21607127, 41875162), the Fundamental Research Funds for the Central Universities (YJ201765), and the Sichuan “1000 Plan” Young Scholar Program.
This research has been supported by the National Natural Science Foundation of China (grant nos. 21607127 and 41875162) and the Fundamental Research Funds for the Central Universities (grant no. YJ201765).
This paper was edited by Alex B. Guenther and reviewed by two anonymous referees.