Exploring the uncertainty associated with satellite-based estimates of premature mortality due to exposure to fine particulate matter

The negative impacts of fine particulate matter (PM2.5) exposure on human health are a primary motivator for air quality research. However, estimates of the air pollution health burden vary considerably and strongly depend on the data sets and methodology. Satellite observations of aerosol optical depth (AOD) have been widely used to overcome limited coverage from surface monitoring and to assess the global population exposure to PM2.5 and the associated premature mortality. Here we quantify the uncertainty in determining the burden of disease using this approach, discuss different methods and data sets, and explain sources of discrepancies among values in the literature. For this purpose we primarily use the MODIS satellite observations in concert with the GEOS-Chem chemical transport model. We contrast results in the United States and China for the years 2004–2011. Using the Burnett et al. (2014) integrated exposure response function, we estimate that in the United States, exposure to PM2.5 accounts for approximately 2 % of total deaths compared to 14 % in China (using satellite-based exposure), which falls within the range of previous estimates. The difference in estimated mortality burden based solely on a global model vs. that derived from satellite is approximately 14 % for the US and 2 % for China on a nationwide basis, although regionally the differences can be much greater. This difference is overshadowed by the uncertainty in the methodology for deriving PM2.5 burden from satellite observations, which we quantify to be on the order of 20 % due to uncertainties in the AOD-to-surface-PM2.5 relationship, 10 % due to the satellite observational uncertainty, and 30 % or greater uncertainty associated with the application of concentration response functions to estimated exposure.


Introduction
By 2030, air pollution is projected to become the leading environmentally related cause of premature mortality worldwide (OECD, 2012). The World Health Organization (WHO) estimates that exposure to outdoor air pollution resulted in 3.7 million premature deaths in 2012. Many epidemiological studies have shown that chronic exposure to fine particulate matter (PM2.5) is associated with an increase in the risk of mortality from respiratory diseases, lung cancer, and cardiovascular disease, with the underlying assumption that a causal relationship exists between PM and health outcomes (Dockery et al., 1993; Jerrett et al., 2005a; Krewski et al., 2009; Pope et al., 1995, 2002, 2004, 2006). This has been shown through single and multi-population time series analyses, long-term cohort studies, and meta-analyses.
In order to stress the negative impacts of air pollution on human health and inform policy development (particularly with regard to developing strategies for intervention and risk reduction), many studies have used health impact assessment methods to calculate the total number of premature deaths each year attributable to air pollution exposure, or the "burden of disease". One of the main obstacles in attributing specific health impacts to PM2.5 is determining exposure and linking it to specific health outcomes. Jerrett et al. (2005a) suggest that personal monitors would be the optimal method, because they would make it easier to attribute individual recorded health outcomes to specific particulate levels, but point out that the financial costs and time-intensiveness limit widespread use. Many studies have instead relied on fixed-site monitors within a certain radius to estimate population-level exposure. However, these monitoring networks are generally located in urban regions and provide no information on concentration gradients between sites. Thus, epidemiological studies typically have to quantify the aggregate population response to an area-average concentration. Additionally, health data can be limited, and therefore the responses may be determined from a subset of individuals that may not be representative of the wider population.
Estimating the burden of disease associated with particulate air pollution requires robust estimates of PM2.5 exposure. Fixed-site monitoring networks can be costly to operate and maintain, and many of these monitors in the United States sample only every third or sixth day. Given the high spatial and temporal variability in aerosol concentrations, this makes it difficult to determine exposure and widespread health impacts. Worldwide, monitoring networks are even scarcer, with many developing countries lacking any long-term measurements. "Satellite-based" concentrations are now used extensively for estimating mortality burdens and health impacts (e.g. Crouse et al., 2012; Evans et al., 2013; Fu et al., 2015; Hystad et al., 2012; Villeneuve et al., 2015). Satellite observations of aerosol optical depth (AOD) can offer observational constraints for population-level exposure estimates in regions where surface air quality monitoring is limited; however, they represent the vertically integrated extinction of radiation due to aerosols, and thus additional information on the vertical distribution and the optical properties of particulate matter is required (often provided by a model) to translate these observations to surface air quality (van Donkelaar et al., 2006, 2010; Liu et al., 2004, 2005). Alternatively, some studies have relied on model-based estimates of PM2.5 exposure. Table 1 shows that the resulting estimates of premature mortality vary widely. Here, we discuss these different methods and contrast the uncertainty in these approaches for estimating exposure for both the US, where air quality has improved due to regulations and control technology, and China, where air quality is a contemporary national concern. Our objective is to investigate the factors responsible for uncertainty in estimates of the burden of disease from chronic exposure to PM2.5, and to use these uncertainties to contextualize the comparison of satellite-based and model-based estimates of premature mortality with previous work. As health impact assessment methods become more popular in the scientific literature, a greater understanding of the uncertainties in these methods and the data sets that are used is important.

General formulation to calculate the burden of disease
To estimate the burden of premature mortality due to a specific factor like PM2.5 exposure, we rely on Eqs. (1) and (2) (Eqs. 6 and 8 in Ostro, 2004, and as previously used in van Donkelaar et al., 2011; Evans et al., 2013; Marlier et al., 2013; Zheng et al., 2015). The attributable fraction (AF) of mortality due to PM2.5 exposure depends on the relative risk value (RR), which here is the ratio of the probability of mortality (all-cause or from a specific disease) occurring in an exposed population to the probability of mortality occurring in a non-exposed population. The total burden due to PM2.5 exposure (ΔM) can be estimated by multiplying the AF by the baseline mortality (equal to the baseline mortality rate M_b × exposed population P). The relative risk is assumed to change (ΔRR) with concentration, so that, in general, exposure to higher concentrations of PM2.5 should pose a greater risk for premature mortality (Sect. 2.4).
$$\mathrm{AF} = \frac{\mathrm{RR} - 1}{\mathrm{RR}} \quad \text{(or, in the alternate form, } \mathrm{AF} = \frac{\Delta\mathrm{RR}}{\Delta\mathrm{RR} + 1}\text{, where } \Delta\mathrm{RR} = \mathrm{RR} - 1\text{)} \tag{1}$$
$$\Delta M = M_b \times P \times \mathrm{AF} \tag{2}$$
Application of this approach requires information on the baseline mortality rates and population, along with the RR, which is determined through a concentration response function (including a shape and initial relative risk, Sect. 2.4), and ambient surface PM2.5 concentrations.
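As an illustration of how the attributable fraction and the burden calculation described above combine, a minimal Python sketch follows. The function names and the input values (a relative risk of 1.25, a baseline mortality rate of 0.002 deaths per person per year, and an exposed population of one million) are hypothetical and chosen only for illustration; they are not values used in this study.

```python
def attributable_fraction(rr):
    """Attributable fraction AF = (RR - 1) / RR for a relative risk rr."""
    if rr <= 0:
        raise ValueError("relative risk must be positive")
    return (rr - 1.0) / rr


def excess_mortality(rr, baseline_rate, population):
    """Burden = baseline mortality rate * exposed population * AF."""
    return baseline_rate * population * attributable_fraction(rr)


# Hypothetical inputs, for illustration only (not values from this study).
deaths = excess_mortality(rr=1.25, baseline_rate=0.002, population=1_000_000)
print(f"Attributable deaths per year: {deaths:.0f}")
```

With these illustrative inputs the attributable fraction is 0.2, so about 400 of the 2000 baseline deaths would be attributed to exposure.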

Baseline mortality and population
For population data, we use the Gridded Population of the World, Version 3 (GPWv3), created by the Center for International Earth Science Information Network (CIESIN) and available from the Socioeconomic Data and Applications Center (SEDAC). This gridded data set has a native resolution of 2.5 arcmin (∼ 5 km at the equator) and provides population estimates for 1990, 1995, and 2000, and projections (made in 2004) for 2005, 2010, and 2015. We linearly interpolate between available years to get population estimates for years not provided. Population densities for China and the United States for the year 2000 are shown in Fig. 1 along with the projected change in population density by the year 2015, illustrating continued growth of urbanized areas (at the expense of rural regions in China). We also compare mortality estimates using only urban area population (similar to Lelieveld et al., 2013, which estimates premature mortality in megacities). For this, we rely on the populated places data set (provided by Natural Earth, which gives values for a point location rather than a grid and includes all major cities and towns along with some smaller towns in sparsely inhabited regions), which is determined from LandScan population estimates (Dobson et al., 2000). In the US, approximately 80 % of the population lives in urban areas. For China, 36 % of the population lived in urban areas in 2000, but this number rose to 53 % in 2013 (World Bank, 2015).
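The linear interpolation between available population years can be sketched as follows. This is a minimal illustration, assuming the gridded estimates have been stacked into a NumPy array of shape (n_years, ny, nx); the function name and array layout are hypothetical, and only the list of available years is taken from the text.

```python
import numpy as np

# Years with gridded estimates or projections, as listed in the text.
available_years = np.array([1990, 1995, 2000, 2005, 2010, 2015])


def interpolate_population(year, years, grids):
    """Linearly interpolate a stack of population grids (n_years, ny, nx) to `year`."""
    if not years[0] <= year <= years[-1]:
        raise ValueError("year outside the range of available data")
    upper = int(np.searchsorted(years, year))  # first index with years[i] >= year
    if years[upper] == year:
        return grids[upper].copy()
    lower = upper - 1
    weight = (year - years[lower]) / (years[upper] - years[lower])
    return (1.0 - weight) * grids[lower] + weight * grids[upper]


# Hypothetical 2 x 2 grids whose population grows by 10 per 5 years.
example_grids = np.stack([np.full((2, 2), v) for v in (100.0, 110.0, 120.0, 130.0, 140.0, 150.0)])
population_2004 = interpolate_population(2004, available_years, example_grids)
```

Interpolating each grid box independently keeps the spatial pattern of the bracketing years, at the cost of assuming steady growth within each 5-year interval.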
To determine baseline mortalities in the US for cardiovascular disease (ischemic heart disease and stroke), lung cancer, and respiratory disease, we use mortality rates for each cause of death for all ages from the Centers for Disease Control and Prevention (cdc.gov) for each year and each state. We multiply the gridded population by these state-level mortality rates to obtain the baseline mortality in each grid box. Other studies have also used country-wide (or regional) (e.g. Evans et al., 2013) or county-level (e.g. Fann et al., 2013) average death rates by cause. Some studies use the mortality rate for all cardiovascular diseases, which would produce larger estimates than just using ischemic heart disease and cerebrovascular disease (stroke). Additionally, some studies consider respiratory deaths only in relation to ozone exposure. Mortality values are not as readily available for China, so we rely on country-wide values for baseline mortality (WHO age-standardized mortality rates by cause). Therefore, in China spatial variations in M_b are only due to variations in population and not regional variations in actual death rates (i.e. we do not account for cause-specific mortality rates varying between provinces). In order to account for some regional variability in mortality rates, we use a population threshold to distinguish between urban and rural regions for lung cancer mortality rates (Chen et al., 2013a).

Table 1. Premature mortality from PM2.5 exposure by all cause (All), heart disease (heart), and lung cancer (LC) as estimated in other studies for the globe, US (or North America), and China (or Asia). Values are in × 1000 deaths per year. All-cause values for this study are calculated as the sum of heart disease, lung cancer, and respiratory disease deaths (as opposed to calculating this based on an all-cause CRF). a Study provides several estimates determined using different CRFs. b Study provides several estimates from 14 different atmospheric models.

Relative risk
The relative risk (RR) is the ratio of the probability of a health endpoint (in this case premature mortality) occurring in a population exposed to a certain level of pollution to the probability of that endpoint occurring in a population that is not exposed. Values greater than one suggest an increased risk, while a value of one would suggest no change in risk. These values are determined through epidemiological studies which relate individual health impacts to changes in concentrations, and literature values span a large range (Fig. 2). While these studies attempt to account for differences in populations, lifestyles, pre-existing conditions, and co-varying pollutants, relative risk ratios determined from each study still differ. This is likely due to variables not taken into consideration, errors in exposure estimates ("exposure misclassification") (Sheppard et al., 2012), and because, although the long-term effects of exposure to atmospheric pollutants have been well-documented, the pathophysiological mechanisms linking exposure to mortality risk are still unclear (Chen and Goldberg, 2009; Pope and Dockery, 2013; Sun et al., 2010), which makes it difficult to determine how transferable results are from the context in which they were generated.
For our initial estimates, we use the integrated risk function from Burnett et al. (2014) for heart disease, respiratory disease, and lung cancer premature mortality due to chronic exposure. We also compare our results to premature mortality estimates using risk ratios determined by Krewski et al. (2009), which is an extended analysis of the American Cancer Society study (Pope et al., 1995), and from Laden et al. (2006), which is an updated and extended analysis of the Harvard Six Cities study (Dockery et al., 1993). The updated Krewski et al. (2009) risk ratios have been widely used in similar studies due to the large study population with national coverage, 18-year time span, and extensive analysis of confounding variables (ecological covariates, gaseous pollutants, weather, medical history, age, smoking, etc.). However, the Burnett et al. (2014) function is becoming more widely used in the literature (e.g. Lelieveld et al., 2015; Lim et al., 2012; Apte et al., 2015) because it provides the shape of the mortality function for the global range of exposure concentrations. Using these different risk ratios makes our results more directly comparable to the studies in Table 1, which rely on the risk ratios from these four studies (Burnett et al., 2014; Krewski et al., 2009; Laden et al., 2006; Pope et al., 2002).

Concentration response function
In order to determine an attributable fraction, it is necessary to understand how the response changes with concentration (i.e. does the relative risk increase, decrease, or level off with higher concentrations?). The shape of this concentration response function is an area of on-going epidemiological research (e.g. Burnett et al., 2014; Pope et al., 2015).
In the simplest form, it might be assumed that the change in relative risk (RR, given as per 10 µg m−3) linearly depends on the surface PM2.5 concentration (C, in µg m−3) as given in Eq. (3).
In this equation, C0 can be considered the "policy relevant (PRB)/target", "natural background" or "threshold"/"counterfactual"/"lowest effect level" surface PM2.5 concentration. Studies have shown that there is not a concentration level below which there is no adverse health effect for PM (e.g. Pope et al., 2002; Shi et al., 2015), and most experts in health impacts of ambient air quality agree that there is no population-level threshold (although there may be individual-level thresholds, e.g. Roman et al., 2008). However, there are few epidemiological studies in regions with very low annual average concentrations (Crouse et al., 2012, does record a 1.9 µg m−3 annual concentration in rural Canada), making it difficult to determine the health risks in relatively clean conditions. How to extrapolate the relationship out of the range of observed measurements is uncertain. Therefore, rather than assuming that the function is linear down to zero, studies often set C0 to the value of the lowest measured level (LML) observed in the epidemiology study from which the RRs are derived (e.g. Evans et al., 2013, use 5.8 µg m−3 with the RR from Krewski et al., 2009) or use the "policy relevant" background (PRB, generally 0-2 µg m−3) concentration. This is the level to which policies might be able to reduce concentrations and is generally determined from model simulations in which domestic anthropogenic emissions have been turned off (e.g. Fann et al., 2012). Similarly, some studies have set this value to preindustrial (1850) pollution levels (e.g. Fang et al., 2013; Silva et al., 2013). Linear response functions are generally a good fit to observed responses at lower concentrations (Pope et al., 2002). However, studies suggest that linear response functions can greatly overestimate RR at high concentrations (e.g. Pope et al., 2015), where responses may start to level off. There is uncertainty at high concentrations because most epidemiology studies of the health effects of air pollution exposure have generally been conducted under lower concentrations (i.e. in the US). In order to determine the shape of this response at higher concentrations, smoking has been used as a proxy (Burnett et al., 2014; Pope et al., 2009, 2011), which does show a diminishing response at higher concentrations. Therefore, both log-linear (Eqs. 4 and 5, where β = 0.15515/0.23218 for heart disease/lung cancer from Pope et al., 2002, or β = 0.18878/0.21136 for heart disease/lung cancer from Krewski et al., 2009, in Eq. 5 and β = 0.01205/0.01328 for heart disease/lung cancer from Krewski et al., 2009, in Eq. 4) and power law (Eq. 6, where I is the inhalation rate of 18 m3 day−1, β = 0.2730/0.3195, α = 0.2685/0.7433 for heart disease/lung cancer from Pope et al., 2011, and as used in Marlier et al., 2013) functions have also been explored.

[Figure 2: relative risk values reported in individual epidemiological studies, plotted on an axis spanning 0.5 to 3.5.]

B. Ford and C. L. Heald: Uncertainty of premature mortality estimates
We note that Cohen et al. (2005) and Anenberg et al. (2010) reference Eq. (4) as a log-linear function, while Ostro (2004), Evans et al. (2013), and Giannadaki et al. (2014) use this as their linear function and instead use Eq. (5) as their log-linear function; we will refer to these equation numbers for clarity in other sections. Another method to limit the response at high concentrations is to simply use a "ceiling", "maximum exposure/high-concentration threshold", or "upper truncation" value, above which it is assumed that the response remains the same (e.g. Anenberg et al., 2012; Cohen et al., 2005; Evans et al., 2013). This can be a somewhat arbitrary value or the highest observed concentration in the original epidemiological study.
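The two log-linear forms and the ceiling discussed above can be sketched in Python as follows. The equations themselves are not reproduced in this text, so the functional forms below are an assumption: they follow the two forms given in Ostro (2004), RR = exp[β(C − C0)] for Eq. (4) and RR = [(C + 1)/(C0 + 1)]^β for Eq. (5). The function names and the `ceiling` argument are hypothetical. The β values are the heart disease coefficients quoted above from Krewski et al. (2009), and C0 = 5.8 µg m−3 is the lowest measured level used by Evans et al. (2013).

```python
import math


def rr_exponential(c, beta, c0=0.0, ceiling=None):
    """Assumed Eq. (4) form: RR = exp(beta * (C - C0)), with RR = 1 below C0."""
    if ceiling is not None:
        c = min(c, ceiling)  # response held constant above the ceiling
    return math.exp(beta * max(c - c0, 0.0))


def rr_log_concentration(c, beta, c0=0.0, ceiling=None):
    """Assumed Eq. (5) form: RR = ((C + 1) / (C0 + 1)) ** beta, with RR = 1 below C0."""
    if ceiling is not None:
        c = min(c, ceiling)
    return ((max(c, c0) + 1.0) / (c0 + 1.0)) ** beta


# Heart disease coefficients quoted in the text (Krewski et al., 2009).
BETA_EQ4 = 0.01205
BETA_EQ5 = 0.18878
C0 = 5.8  # ug m-3

for conc in (10.0, 30.0, 100.0):
    print(conc, rr_exponential(conc, BETA_EQ4, C0), rr_log_concentration(conc, BETA_EQ5, C0))
```

Under these assumed forms the two functions stay close at low concentrations and diverge at high ones, where the Eq. (4) form keeps growing exponentially while the Eq. (5) form flattens, which is why the choice of form matters most for highly polluted regions.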
More recently, Burnett et al. (2014) fit an integrated exposure response (IER) model using RRs from a variety of epidemiological studies on ambient and household air pollution, active smoking, and secondhand tobacco smoke in order to determine RR functions over all global PM2.5 exposure ranges for ischemic heart disease, cerebrovascular disease, chronic obstructive pulmonary disease, and lung cancer (Eq. 7). Monte Carlo simulations were conducted in order to derive the 1000 sets of coefficients for the IER function (the coefficients are available at http://ghdx.healthdata.org/record/global-burden-diseasestudy-2010-gbd-2010-ambient-air-pollution-risk-model-1990-2010). This form is now being widely used (Apte et al., 2015; Lelieveld et al., 2015; Lim et al., 2012), and we use it here for our baseline estimates. In the following sections, we will discuss the uncertainty in the burden of disease associated with the shape of the concentration response function and threshold concentration.
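The IER equation itself is not reproduced in this text. A minimal Python sketch is given below, assuming the form commonly cited from Burnett et al. (2014): RR = 1 for C ≤ C0 and RR = 1 + α{1 − exp[−γ(C − C0)^δ]} otherwise. The coefficient values are hypothetical placeholders, not values from the published coefficient sets; they only stand in for the Monte Carlo draws so that the averaging step can be shown.

```python
import math
from statistics import mean


def rr_ier(c, alpha, gamma, delta, c0):
    """Assumed IER form: relative risk at PM2.5 concentration c (ug m-3)."""
    if c <= c0:
        return 1.0  # no excess risk at or below the counterfactual concentration
    return 1.0 + alpha * (1.0 - math.exp(-gamma * (c - c0) ** delta))


# Hypothetical (alpha, gamma, delta, c0) sets standing in for Monte Carlo draws.
coefficient_sets = [
    (1.0, 0.010, 0.9, 7.0),
    (1.2, 0.008, 1.0, 6.5),
    (0.9, 0.012, 0.8, 7.5),
]


def mean_rr(c, sets=coefficient_sets):
    """Average the relative risk over all coefficient sets."""
    return mean(rr_ier(c, *coeffs) for coeffs in sets)
```

Evaluating every coefficient set and then summarizing (here with a mean) carries the spread of the fitted curves through to the relative risk, and the same loop could return percentiles instead if an uncertainty range is wanted.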

Estimating surface PM2.5
We use both a global model and satellite observations to estimate surface PM2.5 concentrations and translate these to PM2.5 exposure and health burden. In addition, we use surface measurements of PM2.5 to test the accuracy of these estimates.

Unconstrained model simulation
We use the global chemical transport model GEOS-Chem (geos-chem.org) to simulate both surface PM2.5 and AOD.
We use v9.01.03 of the model, driven by GEOS-5 meteorology, in the nested grid configuration over North America and Asia (0.5° × 0.667° horizontal resolution) for 2004-2011. Using this longer time period gives greater confidence in our uncertainty results. The GEOS-Chem aerosol simulation includes sulfate, nitrate, ammonium (Park et al., 2004), primary carbonaceous aerosols (Park et al., 2003), dust (Fairlie et al., 2007; Ridley et al., 2012), sea salt (Alexander et al., 2005), and secondary organic aerosols (SOA) (Henze et al., 2008). There are several regional anthropogenic emission inventories used in the model, such as BRAVO over Mexico (Kuhns et al., 2003), EMEP over Europe (Vestreng et al., 2007), CAC for Canada (http://www.ec.gc.ca/pdb/cac/cac_home_e.cfm), the EPA NEI05 inventory (Hudman et al., 2007, 2008) over the US, and Streets et al. (2006) over Asia. Any location not covered by one of these regional inventories relies on the GEIA (Benkovitz et al., 1996) and EDGAR (Olivier and Berdowski, 2001; Vestreng, 2003) inventories. Biofuel emissions over the US are also from the EPA NEI05 inventory (Hudman et al., 2007, 2008), and anthropogenic emissions of black and organic carbon over North America follow Cooke et al. (1999) with the seasonality from Park et al. (2003). Biogenic VOC emissions are calculated interactively following MEGAN (Guenther et al., 2006), while year-specific biomass burning is specified according to the GFED2 inventory (van der Werf et al., 2006). Surface dry PM2.5 is calculated by combining sulfate, nitrate, ammonium, elemental carbon, organic matter, fine dust, and accumulation mode sea salt concentrations in the lowest model grid box. In the following discussion, these values are referred to as the "unconstrained model". Simulated AOD is calculated at 550 nm based on aerosol optical and size properties as described in Ford and Heald (2013).

Satellite-based
We also derive a satellite-based surface PM2.5 using satellite-observed aerosol optical depth, with additional constraints from the GEOS-Chem model, in a similar manner to Liu et al. (2004, 2007) and van Donkelaar et al. (2006, 2010, 2011). This method relies on the following relationship:
$$\mathrm{PM}_{2.5,\,\mathrm{satellite}} = \eta \times \mathrm{AOD}_{\mathrm{satellite}}, \qquad \eta = \frac{\mathrm{PM}_{2.5,\,\mathrm{model}}}{\mathrm{AOD}_{\mathrm{model}}}$$
where the satellite-derived PM2.5 is estimated at the resolution of the unconstrained model by multiplying the satellite-observed AOD by the value η, which is the ratio of model-simulated surface PM2.5 to simulated AOD at the time of the satellite overpass. This is then a combined product which relies on a chemical transport model to simulate the spatially and temporally varying relationship between AOD and surface PM2.5 by accounting for all the aerosol properties and varying physical distribution, and then constrains these results by "real" (i.e. satellite) measurements of AOD. This is extremely useful in regions where emissions inventories and model processes are less well known.
For AOD, we use observations from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument and from the Multi-angle Imaging SpectroRadiometer (MISR) instrument. For this work we use MODIS 550 nm Level 2, Collection 6, Atmosphere Products for Aqua as well as Level 2, Collection 5 for Aqua. We filter these data for cloud fraction (CF < 0.2) and remove observations with high AOD (> 2.0) as in Ford and Heald (2012), as cloud contamination causes known biases in AOD (Zhang et al., 2005), although we note that this could remove high pollution observations, particularly in China. For MISR, we also use the Level 2 AOD product (F12, version 22, 500 nm). We note that this is a different wavelength than that of the MODIS product, but we neglect that difference for these comparisons. We use both of these observations for comparison, as MODIS has a greater number of observations while MISR is generally considered to better represent the spatial and temporal variability of AOD over China (Cheng et al., 2012; Qi et al., 2013; You et al., 2015). Satellite observations are gridded to the GEOS-Chem nested grid resolution. We sample GEOS-Chem to days and grid boxes with valid satellite observations to calculate the η used to translate the AOD to surface PM2.5.
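The filtering and scaling steps described above can be sketched as follows. This is a minimal illustration, assuming that the satellite AOD, cloud fraction, and co-sampled model fields have already been gridded to a common resolution as NumPy arrays; the function name and argument names are hypothetical, while the thresholds (CF < 0.2, AOD ≤ 2.0) are those stated in the text.

```python
import numpy as np


def satellite_pm25(sat_aod, cloud_fraction, model_pm25, model_aod,
                   max_cloud_fraction=0.2, max_aod=2.0):
    """Scale satellite AOD by eta = model PM2.5 / model AOD where observations are valid.

    Returns an array with NaN wherever the satellite observation is missing,
    cloudy, above the AOD cutoff, or the model AOD is not positive.
    """
    sat_aod = np.asarray(sat_aod, dtype=float)
    cloud_fraction = np.asarray(cloud_fraction, dtype=float)
    model_pm25 = np.asarray(model_pm25, dtype=float)
    model_aod = np.asarray(model_aod, dtype=float)

    valid = (
        np.isfinite(sat_aod)
        & (cloud_fraction < max_cloud_fraction)
        & (sat_aod <= max_aod)
        & (model_aod > 0)
    )
    result = np.full(sat_aod.shape, np.nan)
    eta = model_pm25[valid] / model_aod[valid]  # model PM2.5 per unit AOD
    result[valid] = eta * sat_aod[valid]
    return result


# Hypothetical values: only the first observation passes every filter.
example = satellite_pm25(
    sat_aod=[0.3, 2.5, 0.4, np.nan],
    cloud_fraction=[0.1, 0.1, 0.5, 0.1],
    model_pm25=[10.0, 20.0, 30.0, 40.0],
    model_aod=[0.2, 0.4, 0.3, 0.4],
)
```

Computing η only where a valid retrieval exists mirrors the sampling of the model to days and grid boxes with satellite observations, so that the model ratio and the observed AOD refer to the same conditions.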
In Fig. 3, we show the long-term average (2004-2011) of satellite-based PM2.5 for the US and China using MODIS Aqua Collection 6 and compare this to model-only estimates. In the following sections, most of our results will be shown using Collection 6, but reference and comparisons will be made to other products as a measure of uncertainty. In general the unconstrained model and satellite-based estimates show similar spatial features and magnitudes, with stronger local features apparent in the satellite-based PM2.5. The satellite-based estimate suggests that concentrations should be higher over much of the western US, particularly over California, Nevada, and Arizona (comparisons with surface measurements are discussed in Sect. 2.5.3). In China, the satellite-derived PM2.5 is higher in Eastern China, around Beijing and the Hebei province, Tianjin, and Shanghai, but lower in many of the central provinces. While many previous studies suggest that MODIS may be biased high (and MISR biased low) over China (e.g. Cheng et al., 2012; Qi et al., 2013; You et al., 2015) and the Indo-Gangetic Plain (Bibi et al., 2015), Wang et al. (2013) note that the GEOS-Chem model underestimates PM2.5 in the Sichuan basin, suggesting that the MODIS satellite-based estimate could reduce the bias in this province.

Surface-based observations
We use observations of PM2.5 mass from two networks in the United States (where long-term values are more readily available than in China) to evaluate the model and satellite-derived PM2.5: the Interagency Monitoring of Protected Visual Environments (IMPROVE) and the EPA Air Quality System (AQS) database. The IMPROVE network measures PM2.5 over a 24 h period every third day, and these measurements are then analyzed for concentrations of fine, total, and speciated particle mass (Malm et al., 1994). We use the reconstructed fine mass (RCFM) values, which are the sum of ammonium sulfate, ammonium nitrate, soil, sea salt, elemental carbon and organic matter.
Previous studies have generally shown good agreement between measurements and GEOS-Chem simulations of PM2.5 (e.g. Ford and Heald, 2013; van Donkelaar et al., 2010). In Fig. 4, we show the long-term average of PM2.5 at AQS and IMPROVE sites in the US overlaid on simulated concentrations. In general, GEOS-Chem agrees better with measurements at IMPROVE sites, likely because these are located in rural regions where simulated values will not be as impacted by the challenge of resolving urban plumes in a coarse Eulerian model. There are noted discrepancies in California (Schiferl et al., 2014) and the Appalachia/Ohio River Valley region, where the model is biased low. The model has a low mean bias of −25 % compared to measurements at the EPA AQS sites and a bias of −6 % compared to measurements at IMPROVE sites. Annual mean bias at individual sites ranges from −100 to 150 %. At these same AQS sites, the satellite-derived PM2.5 is less biased (−12 % using MODIS C6 or −8 % using MISR).
To estimate the uncertainty in satellite AOD, we also rely on surface-based measurements of AOD from the global AErosol RObotic NETwork (AERONET) of sun photometers. AOD and aerosol properties are recorded at eight wavelengths in the visible and near-infrared (0.34-1.64 µm) and are often used to validate satellite measurements (e.g. Remer et al., 2005). AERONET AOD has an uncertainty of 0.01-0.015 (Holben et al., 1998). For this work, we use hourly Version 2 Level 2 measurements sampled to 2-hour windows around the times of the satellite overpasses. We also perform a least-squares polynomial fit to interpolate measurements to 550 nm.

Estimated health burden associated with exposure to PM2.5
We compare national exposure estimates for the US and China using unconstrained and satellite-based (MODIS and MISR) annual average PM2.5 concentrations in Fig. 5, which is a cumulative distribution plot calculated as the sum of the population in each grid box which has an annual average concentration at or above each concentration level. For the US, satellite-based estimates suggest that a slightly greater fraction of the population is exposed to higher annual average concentrations, while in China, the satellite-based estimates suggest a lower fraction. Using MISR AOD suggests higher annual average concentrations in the US and much lower concentrations in China, as MISR has a high bias in regions of low AOD and a low bias in regions of high AOD (Jiang et al., 2007; Kahn et al., 2010). The large discrepancy between results from MISR and MODIS could be due to differences in sampling, but studies have also shown that MODIS is biased high in China and MISR is biased low (Cheng et al., 2012; Qi et al., 2013; You et al., 2015). We further discuss the uncertainties in these estimates in Sect. 4.
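The cumulative exposure distribution described above (population in grid boxes with an annual average concentration at or above each level) can be sketched in Python as follows. The function name, the normalization to a fraction of total population, and the example values are assumptions made for illustration.

```python
import numpy as np


def population_fraction_at_or_above(pm25, population, levels):
    """Fraction of the population in grid boxes with annual mean PM2.5 >= each level."""
    pm25 = np.asarray(pm25, dtype=float).ravel()
    population = np.asarray(population, dtype=float).ravel()
    valid = np.isfinite(pm25)  # skip grid boxes without a concentration estimate
    total = population[valid].sum()
    return np.array([
        population[valid & (pm25 >= level)].sum() / total
        for level in levels
    ])


# Hypothetical three-box domain.
fractions = population_fraction_at_or_above(
    pm25=[5.0, 10.0, 20.0],
    population=[100.0, 200.0, 700.0],
    levels=[0.0, 10.0, 15.0, 25.0],
)
print(fractions)
```

Computing this curve once per exposure data set (unconstrained model, MODIS-based, MISR-based) allows the curves to be compared directly at any concentration level of interest.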
These exposure estimates are used to calculate the fraction of mortality from heart disease, lung cancer, and respiratory disease attributable to chronic exposure, using both model and satellite-based annual average concentrations for the US and China (Table 1). In the US, we estimate that exposure to PM2.5 accounts for approximately 2 % of total deaths (6 % of heart diseases and 5 % of respiratory diseases) compared to 14 % (33 % of heart and 22 % of respiratory) in China using satellite-based concentrations. The Global Burden of Disease study estimates that, for 2010, 10 % of total deaths in China and 3 % of total deaths in the US are attributable to exposure to PM2.5 (Lim et al., 2012). We present these as an average over the 2004-2011 time period in order to provide more robust results that are not driven by an outlier year, as there is considerable year-to-year variability in AOD and surface PM2.5 concentrations (for example, heavy dust years in China). However, there are trends in population (Fig. 1) and surface concentrations that can influence these results. For example, there is a significant decreasing trend in AOD over the northeastern US simulated in the model, which is also noticeable in the satellite observations and the surface concentrations (Hand et al., 2012). This decreasing trend can be attributed to declining SO2 emissions in the US, as noted in Leibensperger et al. (2012). Trends in China are more difficult to ascertain, as emissions have been variable over this period in general (Lu et al., 2011; Zhao et al., 2013), with widespread increases from 2004 to 2008 followed by variable trends in different regions through 2011. The difference between mortality burden estimates using model or satellite concentrations is approximately 14 % for the US and 2 % for China on a nationwide basis, although regionally the difference can be much greater. A question we aim to address here is whether these model and satellite-based estimates are significantly different. We compare our results to premature mortality burden estimates from other studies in Table 1. In general, our estimates for China are higher than most previous estimates, except for Lelieveld et al. (2015) and Rohde and Muller (2015). However, these studies provide estimates for 2010 and 2014, respectively, and we did find an increasing trend in our mortality estimates over the study time period. For the US, our estimates are in the lower range of previous studies. The spread among these studies can be attributed to the data used (i.e. MODIS Collection 5 rather than Collection 6 or unconstrained model concentrations, choice of baseline mortality rates, and population), the resolution of the data, the years studied, as well as the risk ratios and response functions. For example, Evans et al. (2013) also use satellite-based concentrations (using MISR/MODIS Collection 5 and GEOS-Chem), but use a different concentration response function and regional baseline mortality rates. In the following sections, we delineate some of the uncertainty in these results and reasons for differences compared to previous studies.

Uncertainty in satellite-based PM 2.5
Uncertainties in the PM2.5 concentrations derived from satellite observations arise from the two pieces of information which inform this estimate: (1) satellite AOD and (2) model η. Here we explore some of the limitations and uncertainties associated with each of these inputs.
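To make the role of the two inputs concrete, the following minimal sketch assumes that η is the model ratio of surface PM2.5 to column AOD and that the satellite-based PM2.5 is the product of η and the satellite AOD. The function names are hypothetical, and the sketch treats a single grid cell on a single day.

```python
def eta(model_pm25: float, model_aod: float) -> float:
    """Model ratio of surface PM2.5 (ug m-3) to column AOD (unitless)."""
    if model_aod <= 0.0:
        raise ValueError("model AOD must be positive")
    return model_pm25 / model_aod


def satellite_pm25(model_pm25: float, model_aod: float, satellite_aod: float) -> float:
    """Scale the satellite AOD by the model eta to estimate surface PM2.5."""
    return eta(model_pm25, model_aod) * satellite_aod
```

Because the estimate is a simple product, a relative error in either the satellite AOD or the model η carries through to the derived PM2.5 with the same relative size, which is why the two sources of uncertainty are examined separately below.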

Uncertainty associated with satellite AOD
While satellite observations of aerosols are often used for model validation (e.g. Ford and Heald, 2012), these are indirect measurements with their own limitations and errors. The uncertainty in satellite AOD can be due to a variety of issues such as the presence of clouds, the choice of optical model used in the retrieval algorithm, and surface properties (Toth et al., 2014; Zhang and Reid, 2006). For validation of satellite products, studies have often relied on comparisons against AOD measured with sun photometers at AERONET ground sites (e.g. Kahn et al., 2005; Levy et al., 2010; Remer et al., 2005, 2008; Zhang and Reid, 2006). The uncertainty in AOD over land from MODIS is estimated as ±(0.05 + 15 % × AOD) (Remer et al., 2005), while Kahn et al. (2005) suggest that 70 % of MISR AOD data are within 0.05 (or 20 % × AOD) of AERONET AOD.
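The validation statistics quoted above can be read as an error envelope around the reference AOD. The sketch below assumes the MODIS over-land envelope takes the form ±(0.05 + 15 % × AOD) and counts the fraction of collocated retrievals that fall inside it; the function names and the sample pairs are hypothetical.

```python
def within_expected_error(sat_aod: float, ref_aod: float,
                          absolute: float = 0.05, relative: float = 0.15) -> bool:
    """True if the retrieval lies within +/-(absolute + relative * reference AOD)."""
    return abs(sat_aod - ref_aod) <= absolute + relative * ref_aod


def fraction_within(pairs, absolute: float = 0.05, relative: float = 0.15) -> float:
    """Fraction of (satellite, reference) AOD pairs inside the envelope."""
    if not pairs:
        raise ValueError("need at least one collocated pair")
    hits = sum(within_expected_error(s, r, absolute, relative) for s, r in pairs)
    return hits / len(pairs)


# Hypothetical collocated (satellite, AERONET) AOD pairs.
sample_pairs = [(0.10, 0.12), (0.50, 0.30), (0.22, 0.20), (0.05, 0.20)]
```

Setting `relative=0.20` would give a rough analogue of the MISR criterion, although that criterion is stated as the larger of 0.05 or 20 % × AOD rather than their sum.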
There are also discrepancies between AOD measured by the different instruments due to different observational scenarios and instrument design. The Aqua platform has an afternoon overpass while the Terra platform has a morning overpass. It might be expected that there would be some differences in retrieved AOD associated with diurnal variations in aerosol loading. However, the difference of 0.015 in the globally averaged AOD between MODIS onboard Terra and Aqua (Collection 5), although within the uncertainty range of the retrieval, is primarily attributed to uncertainties and a drift in the calibration of the Terra instrument, noted in Zhang and Reid (2010) and Levy et al. (2010). Collection 6 (as will be discussed further) reduces the AOD divergence between the two instruments (Levy et al., 2013). MISR employs a different multi-angle measurement technique with a smaller swath width; as a result the correlation between MISR AOD and MODIS AOD is only 0.7 over land (0.9 over ocean) (Kahn et al., 2005).
Not only are there discrepancies in AOD between instruments, there are also differences between product versions for the same instrument. The MODIS Collection 6 Level 2 AOD is substantially different from Collection 5.1 (Levy et al., 2013, and Fig. 6). In general, AOD decreases over land and increases over ocean with Collection 6. These changes are due to a variety of algorithm updates including better detection of thin cirrus clouds, a wind speed correction, a cloud mask that now allows heavy smoke retrievals, better assignments of aerosol types, and updates to the Rayleigh optical depths and gas absorption corrections (Levy et al., 2013). These differences can also impact the derived PM2.5 (and can explain some differences between our results and previous studies). In particular, because Collection 6 suggests higher AOD over many of the urbanized regions, the derived PM2.5 and resulting exposure estimates (all other variables constant) are greater. The difference between these two retrieval products, given the same set of radiance measurements from the same platform, gives a sense of the uncertainty in the satellite AOD product (Fig. 6a).
We estimate the uncertainty in satellite AOD used here by comparing satellite observations to AERONET and determining the normalized mean bias (NMB) between AOD from each satellite instrument and AERONET for the US and China (Fig. 7). Although there is a very limited number of sites in China, from these comparisons, we find that the satellites generally agree with AERONET better in the eastern US and northeastern China than in the western US and western and southeastern China. There are larger biases in the west near deserts and at coastal regions where it may be challenging to distinguish land and water in the retrieval algorithm. NMBs at each AERONET site are generally similar among the instruments (MISR comparison not shown), with greater differences at these western sites. While Collection 6 does reduce the bias at several sites along the East Coast in the US, it is generally more biased at the Four Corners region of the US. We use these NMBs to regionally "bias correct" our AOD values and estimate the associated range of uncertainty in our premature mortality estimates. Compared to the standard MODIS AOD retrieval uncertainty, our overall NMB is less in the eastern US (−1 %) and western China (11 %) and higher in the western US (40 %) and eastern China (18 %).
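The following sketch shows one way the NMB and a regional correction could be computed. The NMB definition (summed differences divided by the summed reference values) is standard, but the multiplicative form of the correction, AOD / (1 + NMB), is an assumption made here for illustration and is not necessarily the procedure used in this analysis.

```python
def normalized_mean_bias(satellite, reference) -> float:
    """NMB = sum(satellite - reference) / sum(reference)."""
    if len(satellite) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(s - r for s, r in zip(satellite, reference)) / sum(reference)


def bias_correct(aod: float, nmb: float) -> float:
    """Remove a multiplicative regional bias (assumed form: AOD / (1 + NMB))."""
    if nmb <= -1.0:
        raise ValueError("NMB must be greater than -1")
    return aod / (1.0 + nmb)


# Hypothetical collocated values at one region's AERONET sites.
satellite_aod = [0.28, 0.42, 0.70]
aeronet_aod = [0.20, 0.30, 0.50]
regional_nmb = normalized_mean_bias(satellite_aod, aeronet_aod)  # about 0.4
```

With these hypothetical values the regional NMB is about 40 %, and a retrieved AOD of 0.28 would be corrected to about 0.20.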
There may also be biases associated with the satellite sampling, should concentrations on days with available observations be skewed. In order to assess the sampling bias, we use the model and compare the annual mean to the mean of days with valid observations (Fig. 6b). In general, sampling leads to an underestimation in AOD (average of 20 % over the US). This can partly be attributed to the presence of high aerosol concentrations below or within clouds which cannot be detected by the satellite, the mistaken identification of high aerosol loading as cloud in retrieval algorithms, as well as the removal of anomalously high AOD values (> 2.0) from the observational record. This suggests that the average AOD values can also be influenced by the chosen filtering and data quality standards. Analysis of the impact of satellite data quality on the AOD to PM2.5 relationship is discussed in Toth et al. (2014). They find that using higher quality observations does tend to improve correlations between observed AOD and surface PM2.5 across the US, though in general correlations are low (< 0.55).
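The sampling test described above amounts to comparing a full-record mean with the mean over days that have a valid retrieval. A minimal sketch, with hypothetical names and a hypothetical four-day record in which the highest-AOD day is missing, is:

```python
def sampling_bias(daily_aod, has_retrieval) -> float:
    """Relative difference between the sampled-day mean and the full-record mean."""
    if len(daily_aod) != len(has_retrieval) or not daily_aod:
        raise ValueError("inputs must be non-empty and of equal length")
    sampled = [a for a, ok in zip(daily_aod, has_retrieval) if ok]
    if not sampled:
        raise ValueError("no days with valid retrievals")
    full_mean = sum(daily_aod) / len(daily_aod)
    sampled_mean = sum(sampled) / len(sampled)
    return (sampled_mean - full_mean) / full_mean


# Hypothetical model AOD; the third (highest) day has no valid retrieval.
model_aod = [0.1, 0.2, 0.9, 0.2]
valid = [True, True, False, True]
```

A negative return value indicates that sampling underestimates the mean AOD, as happens here because the excluded day carries the largest aerosol loading.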

Uncertainty associated with model η
In general, the model simulates PM2.5 well (Fig. 4) and represents important processes, but satellite AOD can help to constrain these estimates to better represent measured concentrations (van Donkelaar et al., 2006). However, in specific regions or periods of time, errors in η could lead to discrepancies between satellite-derived and actual surface mass. Snider et al. (2015) do show some regional biases in the GEOS-Chem model η compared to η determined from collocated surface measurements of AOD and PM2.5. In order to assess the potential uncertainty in model-based η, we perform multiple sensitivity tests to determine the impact that different aerosol properties, grid-size resolution and timescales will have on η and, ultimately, on the resulting satellite-based PM2.5 (listed in Table 2). These sensitivity tests are performed solely with model output, which can provide a complete spatial and temporal record, and results from the modified simulations are compared to the standard model simulation. We note that these are "errors" only with respect to our baseline simulation; we do not characterize how each sensitivity simulation may be "better" or "worse" compared to true concentrations of surface PM2.5, but rather how different they are from the baseline, thus characterizing the uncertainty in derived PM2.5 resulting from the model estimates of η. We make these comparisons for both the US and China and show results in Fig. 8. Because mass concentrations in China are generally much higher, the absolute value of potential errors can also be much greater.
The timescale of the estimated PM2.5 influences the error metric we choose for this analysis. We use the NMB for estimating error associated with annual PM2.5 exposure (the metric of interest for chronic exposure). This allows for the possibility that day-to-day errors may compensate, resulting in a more generally unbiased annual mean value. The error on any given day of satellite-estimated PM2.5 is likely larger, and not characterized by the NMB used here.
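The distinction between compensating day-to-day errors and errors that persist in the annual mean can be illustrated with a short sketch. All names and values below are hypothetical; the example contrasts daily errors that cancel with a constant offset that does not.

```python
def normalized_mean_bias(estimate, truth) -> float:
    """NMB of the long-term mean: sum(estimate - truth) / sum(truth)."""
    return sum(e - t for e, t in zip(estimate, truth)) / sum(truth)


def mean_absolute_daily_error(estimate, truth) -> float:
    """Average relative error of the individual daily values."""
    return sum(abs(e - t) / t for e, t in zip(estimate, truth)) / len(truth)


# Hypothetical daily PM2.5 (ug m-3): baseline values and two kinds of error.
baseline = [10.0, 10.0, 10.0, 10.0]
compensating = [13.0, 7.0, 12.0, 8.0]   # daily errors of +/-20-30 % that cancel
systematic = [11.5, 11.5, 11.5, 11.5]   # constant +15 % offset
```

In this example the compensating series has an NMB of zero despite a mean daily error of 25 %, whereas the systematic series has the same 15 % error on both timescales.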
Our first sensitivity tests relate specifically to the methodology. Deriving a satellite-based PM2.5 with this method requires both model output and valid satellite observations for every day. Running a model can be labor intensive, and there are specific regions and time periods with poor satellite coverage. Therefore, it might be beneficial to be able to use a climatological η or a climatological satellite AOD. To test the importance of daily variability in AOD, we compute daily η values and then solve for daily surface PM2.5 values using a seasonally averaged model simulated AOD (AvgAOD). This mimics the error introduced by using seasonally averaged satellite observations, an attractive proposition to overcome limitations in coverage. This approximation often produces the greatest error (∼ 20 % in the US and 0-50 % in China), especially in regions where AOD varies more dramatically and specifically where transported layers aloft can significantly increase AOD (Fig. 8). For the seasonally averaged η test (AvgEta), we estimate daily PM2.5 values (which are averaged into the annual concentration) from the seasonally averaged η and daily AOD values. As regional η relationships can be more consistent over time than PM2.5 or AOD, this test evaluates the necessity of using daily model output to define the η relationship. The error in the annual average of daily PM2.5 values determined using a seasonally averaged η is very similar to the error found when the daily PM2.5 values are calculated using a seasonally averaged AOD.
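The AvgAOD and AvgEta tests can be sketched for a single averaging period as follows. The function names and the four-day series are hypothetical, and the sketch averages over one period only rather than over seasons within a year.

```python
from statistics import mean


def baseline_pm25(eta_daily, aod_daily) -> float:
    """Mean of daily PM2.5 computed from daily eta and daily AOD."""
    return mean(e * a for e, a in zip(eta_daily, aod_daily))


def avg_aod_pm25(eta_daily, aod_daily) -> float:
    """AvgAOD test: daily eta applied to the period-mean AOD."""
    aod_bar = mean(aod_daily)
    return mean(e * aod_bar for e in eta_daily)


def avg_eta_pm25(eta_daily, aod_daily) -> float:
    """AvgEta test: period-mean eta applied to daily AOD."""
    eta_bar = mean(eta_daily)
    return mean(eta_bar * a for a in aod_daily)


# Hypothetical days; on the third day a lofted layer raises AOD while eta is low.
eta_series = [40.0, 60.0, 20.0, 80.0]
aod_series = [0.5, 0.25, 1.0, 0.125]
```

When both averages are taken over the same set of days, each approximation reduces to mean(η) × mean(AOD), which is consistent with the two tests giving very similar errors. The offset from the baseline is then set by the covariance between daily η and daily AOD, so days with high AOD but low η (such as lofted layers) produce an overestimate.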
The model η also inherently prescribes a vertical distribution of aerosol, which may be inaccurately represented by the model and introduce errors in the satellite-derived PM2.5. Previous studies have shown that an accurate vertical distribution is essential for using AOD to predict surface PM2.5 (e.g. Li et al., 2015; van Donkelaar et al., 2010). We test the importance of the variability of the vertical distribution in the η relationship for predicting surface PM2.5 concentrations by comparing values from the standard simulation against using an η from a seasonally averaged vertical distribution (AvgProf). For this comparison, we allow the column mass loading to vary day-to-day, but we assume that the profile shape does not change (i.e. we re-distribute the simulated mass to the same seasonally averaged vertical profile). We note that this is not the same as assuming a constant η, as relative humidity and aerosol composition are allowed to vary. Additionally, this differs from other studies (van Donkelaar et al., 2010; Ford and Heald, 2013) in that we are not testing the representativeness of the seasonal average profile, but testing the importance of representing the daily variability in the vertical profile. From Fig. 8, we see that using a seasonally averaged vertical distribution (AvgProf) can lead to large errors in surface concentrations. Information on how the pollutants are distributed is extremely important because changes in column AOD can be driven by changes in surface mass loading, but also by layers of lofted aerosols that result from production aloft or transport (and changes in the depth of the boundary layer). This is important in areas that are occasionally impacted by transported elevated biomass burning plumes or dust. Large errors often occur in China, especially during the spring when these regions are influenced by transported dust from the Taklamakan and Gobi Deserts (Wang et al., 2008). Southeastern China has the largest NMB due to transport not only from interior China, but also from other countries in Southeast Asia. There is a positive bias in most regions because, on average, most of the aerosol mass is located at the surface; therefore, using an average profile will overpredict the surface concentrations. Similar to the average AOD and η (AvgAOD and AvgEta), average vertical distributions generally overpredict PM2.5 due to the presence of outliers. This stresses the importance of not only getting the mean profile correct, but also simulating the variability in the profile on shorter timescales.
We also test the sensitivity of derived PM2.5 to aerosol water uptake. This is done by recalculating η using a seasonally averaged relative humidity (RH) profile (AvgRH). This generally reduces the seasonally averaged AOD (less water uptake) in every season (because hygroscopic growth of aerosols is non-linear with RH). This leads to an overestimate of η when applied to the AOD values from the standard simulation and generally overestimates surface PM2.5 in regions with potentially higher RH and more hygroscopic aerosols (eastern US and eastern China). This is because, for the same AOD, a higher η value would suggest more mass at the surface in order to compensate for optically smaller particles aloft. Western China (and some of central China) has a negative bias, suggesting that using a mean relative humidity actually underestimates PM2.5. However, this is because the RH is generally low but can have large variability, and concentrations (outside of the desert regions) are also low so that the NMB may be large although the absolute error is not.
A higher resolution model, although more computationally expensive, will likely better represent small-scale variability and is better suited for estimating surface air quality. Punger and West (2013) find that coarse resolution models often drastically underestimate exposure in urban areas. We therefore investigate the grid-size dependence of our simulated η. For this, we determine the η values from a simulation running at 2° × 2.5° grid resolution (with the same emission inputs and time period), re-grid these values to the nested grid resolution (0.5° × 0.666°) and solve for the derived PM2.5 concentrations using the AOD values from the nested simulation (noted as 2 × 2.5 in Fig. 8). From Fig. 8, we see larger discrepancies in regions which are dominated by more spatially variable emissions (Northeastern US and China) rather than areas with broad regional sources (Southeastern US). This is in line with Punger and West (2013), who show smaller differences due to resolution in estimated premature mortality due to PM2.5 exposure in rural areas than in urban areas. Compared to the other sensitivity tests, using the coarser grid leads to mean errors of only 10-15 % in the US and in China, which suggests that spatially averaged η are potentially more useful than temporally averaged η for constraining surface PM2.5. Thompson and Selin (2012) and Thompson et al. (2014) show that coarse grids can overpredict pollutant concentrations and consequently health impacts, but using very fine grids does not significantly decrease the error in simulated concentrations compared to observations. This effect is more pronounced with ozone. Additionally, their coarsest grid resolution is 36 km, which they compare to results at 2, 4, and 12 km. Punger and West (2013) compare health impacts at a variety of resolutions out to several hundred km and show that coarser resolutions underestimate health impacts because concentrations are diluted over larger areas instead of allowing high concentrations to be co-located with large urban populations.
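The dilution effect attributed to coarse grids can be illustrated with a population-weighted mean. The sketch below assumes equal-area fine cells that are replaced by their simple average to mimic one coarse cell; the names and values are hypothetical.

```python
def population_weighted_mean(concentrations, populations) -> float:
    """Population-weighted mean concentration over a set of grid cells."""
    total = sum(populations)
    if total <= 0:
        raise ValueError("total population must be positive")
    return sum(c * p for c, p in zip(concentrations, populations)) / total


def coarsen(concentrations):
    """Replace equal-area fine cells by their mean, as one coarse cell would."""
    avg = sum(concentrations) / len(concentrations)
    return [avg] * len(concentrations)


# Hypothetical fine-grid PM2.5 (ug m-3) and population; the first cell is urban.
fine_pm25 = [30.0, 10.0, 10.0, 10.0]
population = [900, 50, 25, 25]
fine_exposure = population_weighted_mean(fine_pm25, population)
coarse_exposure = population_weighted_mean(coarsen(fine_pm25), population)
```

Here the fine-grid exposure is 28 µg m−3 while the coarsened exposure is 15 µg m−3, because the high urban concentration is no longer co-located with the urban population.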
The GEOS-Chem simulation of surface nitrate aerosol over the US is biased high (Heald et al., 2012). This can be an issue in regions where nitrate has a drastically different vertical profile (or η) from other species. To test how this nitrate bias could impact η and the derived PM2.5, we compute η without nitrate aerosol, and then derive PM2.5 using the standard AOD (No NO3). This is not a large source of potential error (< 15 %), with only slightly larger errors in winter and in regions where nitrate has a significant high bias (central US). Furthermore, these errors are less than the bias between the model and surface observations of nitrate in the US (1-2 µg m−3 compared to 2-7 µg m−3), suggesting that even though there is a known bias in the model, using satellite observations may largely correct for this by constraining the total AOD when estimating satellite-derived PM2.5. We also did this comparison for China. Measured nitrate concentrations are not widely available for evaluation, but Wang et al. (2014) suggest that model nitrate is also too high in eastern China. The NMB is even less in regions in China (< 10 %), with negative values in eastern China (where nitrate concentrations are high) and positive values in western and central China (where nitrate concentrations are lower and have less bias compared to observations).
To further explore the role of aerosol composition (and possible mischaracterization in the model), we take the simulated mass concentrations and compute the AOD assuming that the entire aerosol mass is sulfate (SO4 in Fig. 8) or, alternatively, hydrophobic black carbon (BC in Fig. 8). Black carbon has a high mass extinction efficiency, which is constant with RH given its hydrophobic nature, while sulfate is very hygroscopic, resulting in much higher extinction efficiencies at higher relative humidity values. Overall, assuming that all the mass is sulfate leads to low biases on the order of 15-20 % as the AOD in many regions in the US is dominated by inorganics. Errors are largest in regions and seasons with larger contributions of less hygroscopic aerosols (organic carbon and dust) and/or high relative humidity. Assuming the entire aerosol mass is black carbon can lead to greater errors than sulfate because BC has a larger mass extinction at lower relative humidity values and hydrophobic black carbon generally makes up a small fraction of the mass loading in all regions in the US and China. When RH is low, this assumption increases the AOD, which leads to an underprediction in the derived PM2.5. When RH is high, this decreases the AOD and leads to an overprediction in derived PM2.5. The largest percentage changes occur in the southwestern US and western China (∼ −30 %) due to the low relative humidity, low mass concentrations, and large contribution of dust.
We also compare these sensitivity tests on daily timescales. We do not show the results here because we rely on chronic exposure (annual average concentrations) for calculating mortality burdens. The normalized mean biases in annual average concentrations (Fig. 8) are generally much less (range of ±20 % in US and ±50 % in China) than potential random errors in daily values as many of these daily errors cancel out in longer term means. This is the case for our sensitivity tests regarding the vertical profile and relative humidity, which have much larger errors on shorter timescales. However, because our method to test the sensitivity to aerosol type assumes that all aerosol mass is black carbon or sulfate, we introduce a systematic bias that is not significantly reduced in the annual NMB. This highlights the differing potential impacts due to systematic and random errors, which is an important distinction for determining the usefulness of this method. Systematic errors may not be as obvious on short timescales compared to random errors (related to meteorology and/or representation of plumes) that can lead to large biases in daily concentrations. However, these random errors have less impact when we examine annual average concentrations and mortality burdens. Systematic errors, potentially related to sources or processes, may be harder to counteract even on longer timescales and even when the model is constrained by satellite observations. However, we also show that random daily errors can bias the long-term mean, stressing the importance of not only correcting regional biases, but also accurately simulating daily variability.
We translate this potential uncertainty in η to potential uncertainty in mortality estimates determined from the satellite-based PM2.5. We use the normalized mean bias in annual PM2.5 determined from the sensitivity tests for RH, the vertical profile, grid resolution, and aerosol composition for each grid box and then use these values to "bias correct" our satellite-based annual PM2.5 concentrations and recalculate exposure (shown in Fig. 5) and mortality (discussed in Sect. 6). From Fig. 5, we see that the uncertainty in η, when translated to an annual exposure level, is larger than the differences in exposure levels estimated from model and satellite-based PM2.5, suggesting that satellite-based products which rely strongly on the model, or which do not account for the variability in the aforementioned variables, do not necessarily provide a definitively better estimate of exposure. Secondly, these uncertainties in many regions are greater than the difference between both the model and surface PM2.5 and the satellite-based and surface observations. While these comparisons are limited spatially and temporally, this highlights that constraining the model with the satellite observations can improve estimates of PM2.5 but there remains a large amount of uncertainty in these estimates.

Selection of concentration response function and relative risk
The choice of the shape of the concentration response function (CRF) and relative risk ratio value explains much of the difference in burden estimated in different studies listed in Table 1. In general, it is difficult to determine risks at the population level and studies have found that using ambient concentrations tends to underpredict health effects (e.g. Hubbell et al., 2009). However, personal monitoring is costly and time-intensive, and therefore, epidemiology studies generally rely on determining population-level concentration response functions rather than personal-level exposure responses. Populations also respond differently, and therefore the shape and magnitude of this response varies among studies. The uncertainty associated with the RR determined in the original epidemiology study will impact results in any health impact assessment.
For an initial metric of the uncertainty in the risk ratios, studies often include estimates generated using the 95 % confidence intervals of the RR determined in the original study (as shown in Fig. 2). A confidence interval shows the statistical range within which the true PM coefficient for the study population is likely to lie, which could be a single city, region, or population group. The Krewski et al. (2009) study, which is a reanalysis of the American Cancer Society (ACS) Cancer Prevention Study II (CPS-II), included 1.2 million people in the Los Angeles and New York City regions, whereas the Laden et al. (2006) study, an extended analysis of the Harvard Six Cities Studies, included 8096 white participants. Using just these confidence intervals as a measure of uncertainty suggests that there exists a large range of uncertainty in population-level health responses to exposure and caution should be exercised when attempting to transfer these values beyond the population from which they were determined in order to estimate national-level mortality burdens based on ambient concentrations. The IER coefficients from Burnett et al. (2014) are generated using the risk ratios, threshold values, and confidence intervals from previous studies and therefore also provide a large range in premature mortality estimates. To depict this range, we also include the 5th and 95th percentile estimates in addition to the mean estimate. We also show the maximum value in our sensitivity tests.
To test the impact of methodological choices associated with the burden calculation, we compare results using different concentration response functions and relative risk ratios that previous studies have used. Table 3 lists the different choices that we explore regarding the CRF and relative risk, the study that used these values, and the resulting percent change in burden compared to our initial estimates using the IER from Burnett et al. (2014). In particular we compare our results using risk ratio values from Krewski et al. (2009), Pope et al. (2002) and Laden et al. (2006), and log-linear and power law relationships. Figure 9 shows that the largest difference in burden is associated with using the higher risk ratios from Laden et al. (2006) vs. using Krewski et al. (2009) or the mean estimates determined using the IER coefficients from Burnett et al. (2014); the former suggest a much greater mortality response to PM2.5 exposure.
Our estimates of Sect. 3 also use the same relative risk values for every location. However, studies have found that different populations have varied responses to exposure (potential for "effect modification") (Dominici et al., 2003). One of the main uncertainties in these methods is relying on risk ratios that are primarily determined from epidemiology studies conducted in the US, which may not represent the actual risks for populations in China. Long-term epidemiology studies examining exposure to PM2.5 across broad regions of China are scarce, but studies using acute exposure to PM2.5 or chronic exposure to PM10 or total suspended particles have suggested lower exposure-response coefficients than determined by studies conducted in the US and Europe (Aunan and Pan, 2004; Chen et al., 2013b; Shang et al., 2013). This suggests that risk ratios from studies conducted in the US might overestimate the health effects in China.

Figure 10. Burden of mortality due to outdoor exposure to fine particulate matter as determined in previous studies (Table 1, gray bars with values from individual studies designated by black lines), calculated using model (GEOS-Chem, solid) and satellite-based (hatched) annual concentrations (colored by disease, whiskers denote 5th and 95th percentile estimates generated using the Burnett et al., 2014, coefficients). The uncertainty range on the MODIS-based estimates due to satellite AOD (taupe), model η (coral), and CRF (blue) are shown on the right.

We also explore using different "threshold" values. The IER function uses threshold values between 5.8 and 8.8 µg m−3. In the US, higher threshold values can significantly reduce burden estimates. When we compare sensitivity tests that use the same CRF (Krewski et al., 2009) but with a regional PRB concentration instead of the lowest measured level (5.8 µg m−3), the premature mortality estimates are significantly reduced, suggesting that the choice of this value is very important in the US where annual mean concentrations are relatively low. However, in China these threshold values have less impact on our results because annual mean concentrations are high enough that subtracting a threshold makes little difference. Conversely, using a ceiling value of 30 or 50 µg m−3 produces no difference in the US (0 % of the population experiences annual concentration values greater than 30 µg m−3), while strongly reducing burden estimates in China.
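The contrasting effects of threshold and ceiling values can be sketched with a simple exponential response, RR = exp(β × ΔC), where ΔC is the concentration above the threshold after capping at the ceiling. This functional form and the coefficient below (a 6 % increase in risk per 10 µg m−3) are assumptions chosen for illustration and are not the coefficients or equations used in this analysis.

```python
import math
from typing import Optional

# Hypothetical coefficient: 6 % increase in risk per 10 ug m-3.
BETA = math.log(1.06) / 10.0


def relative_risk(conc: float, beta: float = BETA,
                  threshold: float = 0.0, ceiling: Optional[float] = None) -> float:
    """RR = exp(beta * dC), with dC counted above the threshold and below the ceiling."""
    capped = conc if ceiling is None else min(conc, ceiling)
    delta = max(capped - threshold, 0.0)
    return math.exp(beta * delta)


def excess_risk(conc: float, **kwargs) -> float:
    """Excess relative risk, RR - 1."""
    return relative_risk(conc, **kwargs) - 1.0
```

At a hypothetical US-like concentration of 10 µg m−3, raising the threshold from 5.8 to 8.8 µg m−3 cuts the excess risk by roughly 70 %, while a 30 µg m−3 ceiling has no effect. At a hypothetical China-like concentration of 60 µg m−3, the same change in threshold matters far less, whereas the ceiling removes more than half of the excess risk.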
We also see that the shape of the CRF produces different results between the US and China. Using a power law or log-linear (Eq. 6) function increases relative risks at low concentrations and decreases risk ratios at high concentrations such that total disease burden estimates increase in the US and decrease in China. In the US, a log-linear CRF is almost equivalent to a linear response because of the low concentrations. In general, the shape of the concentration response function is more important at low or very high concentrations.

Comparison of uncertainty
Figure 10 provides a summary of the different sources of uncertainty discussed here. We show the mortality burdens for respiratory disease, lung cancer and heart disease associated with chronic exposure to ambient PM2.5 and calculated using annual average model-based and "satellite-based" values (from MISR and MODIS) for both the US and China. We show here that the satellite-based estimates suggest slightly higher national burdens in the US and slightly lower in China. However, our values using these different annual average concentrations fall within the range of values found in the literature (Table 1).
We further contrast these estimates to the range in uncertainty associated with our observations and methodology. The difference between the burden calculated using strictly the model or the satellite-based approach is greater than the uncertainty range in the satellite AOD, suggesting that this difference is outside of the scope of measurement limitations and errors. However, the potential uncertainty in the satellite-based estimate due to the conversion from AOD to surface PM2.5 (represented by the model η) is substantially larger, larger even than the difference between model-derived and satellite-derived estimates. Therefore, while constraining the model estimate of PM2.5 by actual observations should improve our health effect estimates, the uncertainty in the required model information may limit the accuracy of this approach. Again, we stress that these are "potential" model uncertainties which may overestimate the true uncertainty in regions where the model accurately represents the composition and distribution of aerosols. We also acknowledge that we have investigated a limited set of factors; additional biases may exacerbate these uncertainties. However, adding additional observational data and model estimates can also help to better constrain these satellite-based PM2.5 estimates (Brauer et al., 2012, 2016; van Donkelaar et al., 2015a, b).
Figure 10 also conveys the range in mortality estimates for the US and China that can result from varying choices for the risk ratio or shape of the concentration response. While epidemiology studies attempt to statistically account for differences in populations and confounding variables, there is still a large spread in determined risk ratios. The application of response functions is just as important as, or perhaps more important than, the determination of ambient concentrations in quantifying the burden of mortality due to outdoor air quality. Differences in exposure estimates can be overshadowed by these different approaches. As an added example, we calculated the mortality burden using only populated places, similar to Lelieveld et al. (2013) and Cohen et al. (2004), and find that for the US this decreases the burden by 13 % (satellite-based; 18 % for model). For China, this reduces the burden estimate by 72 %. Differences in our estimates here and those found in the literature can be partly attributed to differences in application of the CRF, along with differences in baseline mortalities and population estimates. Disease burdens estimated in various studies can therefore only be truly compared when the methodology is harmonized.

Conclusions
Calculating health burdens is an extremely important endeavor for informing air pollution policy, but literature estimates cover a large range due to differences in methodology regarding both the measurement of ambient concentrations and the health impact assessment. Satellite observations have proved useful in estimating exposure and the resulting health impacts (van Donkelaar et al., 2015b; Yao et al., 2013). However, there remain large uncertainties associated with these satellite measurements and the methods for translating them into surface air quality that need to be further investigated. Our goal with this work is to explore how mortality burden estimates are made and how choices within this methodology can explain some of these discrepancies. We also aim to provide a context for interpreting the quantification of PM2.5 chronic exposure health burdens.
While we have discussed several potential sources for uncertainty in calculating health burdens with satellite-based PM2.5, there are still a significant number of other sources of uncertainty that we did not explore. There are processes that could impact the AOD to PM2.5 relationship in the model, such as different emissions and removal processes. Additionally, our sensitivity test results are likely partly tied to the spatial resolution of the model and the satellite AOD, and their ability to capture finer spatial variations in pollution in regions with high populations. However, Thompson et al. (2014) suggest that uncertainty in the CRF will likely still have a larger impact than uncertainties in population-weighted concentrations due to model resolution.
Satellite measurements have provided great advancements in monitoring global air quality, providing information in regions with previously few measurements.However, further progress still needs to be made in determining how to characterize exposure to ambient PM 2.5 using these satellite observations, especially as they are becoming more widely used in epidemiological studies and health impact assessments.Reducing uncertainty, even at the lower concentrations observed in the US, is important if these methods and data sets are to be used for policy assessment or air quality standards.However, as air pollution is a leading environmentallyrelated cause of premature mortality, the difficulties in applying these data should not negate the importance of this endeavor.Overcoming sampling limitations in satellite observations and better accounting for regional biases could help to reduce the uncertainty in satellite-retrieved AOD and adding additional observational data and model estimates can help to better constrain satellite-based PM 2.5 estimates (Brauer et al., 2012(Brauer et al., , 2016;;van Donkelaar et al., 2015a, b).Future geostationary satellites will also be critical to advance this methodology and will provide extremely valuable information for daily monitoring and tracking of air quality.Furthermore, these geostationary observations, in concert with greater surface monitoring, will offer new constraints for epidemiological studies to develop health risk assessments and lessen the uncertainty in applying concentration-response functions and determining health burdens.

Figure 1. Population density (per km²) for the year 2000 from the GPWv3 data for (a) the continental US and (c) China. The projection for increase in population density by the year 2015 for (b) the continental US and (d) China.

Figure 3. Long-term average (2004-2011) unconstrained model simulation of PM2.5 for the (a) continental US and (b) China, along with the (MODIS-Aqua Collection 6) satellite-based PM2.5 for the (c) continental US and (d) China, and the difference between the satellite-constrained and unconstrained model PM2.5 concentrations.

Figure 5. Percent of the population exposed to different annual PM2.5 concentrations in the US (a) and China (b). Lines denote estimates using the unconstrained GEOS-Chem simulation (red) or using satellite-based estimates with MODIS (green) and MISR (blue). Shading represents potential uncertainty associated with the model η (described in Sect. 4.2) and dashed black lines represent national annual air quality standards.
Figure 6. (a) Percent difference between annual mean AOD from MODIS Collection 6 and Collection 5 and (b) simulated bias in satellite-derived annual average surface PM2.5 associated with satellite sampling.

Figure 7. Normalized mean bias in AOD between MODIS-Aqua Collection 6 and AERONET sites for (a) the US and (b) China.

Figure 8. Distribution of normalized mean biases in annual average PM2.5 for grid boxes in different regions of the US (top row) and China (bottom row) determined from sensitivity tests to investigate the uncertainty in η. Sensitivity tests are described (and abbreviations defined) in Table 3.

Figure 9. Premature mortality estimates for (a) the US and (b) China determined using different RR, CRFs, and threshold/ceiling values, as described in Table 3. Colors represent cause of death estimated using PM2.5 concentrations from unconstrained model simulations (solid) and satellite-based estimates (hatched).
, indicating that assessments which use CRFs from
B. Ford and C. L. Heald: Uncertainty of premature mortality estimates
Table 3 provides additional information on the data sources and concentration response functions used in these studies.

Table 2. List of model sensitivity tests and descriptions with results shown in Fig. 8.
Calculate AOD and η without the contribution of nitrate.

Table 3. Input for premature mortality burden estimate sensitivity tests and the resulting percent change in mortality due to chronic exposure determined from satellite-based concentrations. Values in parentheses are determined from model-simulated concentrations.