Final Author Comments: acp-2015-118 – “Use of North American and European air quality networks to evaluate global chemistry-climate modeling of surface ozone

We test the current generation of global chemistry-climate models in their ability to simulate observed, present-day surface ozone. Models are evaluated against hourly surface ozone from 4,217 stations in North America and Europe that are averaged over 1° x 1° grid cells, allowing commensurate model-measurement comparison. Models are generally biased high during all hours of the day and in all regions. Most models simulate the shape of regional summertime diurnal and annual cycles well, correctly matching the timing of hourly (~15:00) and monthly (mid-June) peak surface ozone abundance. The amplitude of these cycles is less successfully matched. The observed summertime diurnal range (~25 ppb) is underestimated in all regions by about 7 ppb, and the observed seasonal range (~21 ppb) is underestimated by about 5 ppb except in the most polluted regions where it is overestimated by about 5 ppb. The models generally match the pattern of the observed summertime ozone enhancement, but they overestimate its magnitude in most regions. Most models capture the observed distribution of extreme episode sizes, correctly showing that about 80% of individual extreme events occur in large-scale, multiday episodes of more than 100 grid cells. The models also match the observed linear relationship between episode size and a measure of episode intensity, which shows increases in ozone abundance by up to 6 ppb for larger-sized episodes.


Introduction
We test simulated present-day surface ozone in global chemistry-climate models on temporal scales from diurnal to multi-year variability and on statistics from median geographic patterns to the timing and size of extreme air quality episodes. The tests use gridded hourly surface ozone abundances based on a decade of observations from 4217 air quality monitoring sites in North America (NA) and Europe (EU). Chemistry-climate models provide a valuable means for projecting future air quality in a changing climate (Kirtman et al., 2013), but recent assessments have lacked commensurate observational comparisons to establish their credibility in reproducing current cycles of surface ozone over polluted regions . Model-measurement comparisons to date have identified model faults, yet they have often been limited to monthly statistics, biased to picking clean-air sites over limited parts of the continents Doherty et al., 2013), and avoided evaluating diurnal cycles and the patterns of major pollution episodes (Schnell et al., 2014. The factors driving future surface ozone (O 3 ) changes include (1) local-to-regional emissions, (2) global-scale emissions of air pollution transported across continents and oceans, (3) global emissions and physical climate change that alters the hemispheric-scale abundances of tropospheric O 3 , and (4) climatic shifts in the meteorology that creates the worst pollution episodes. Factors (1), (2), and (3) have been studied extensively with global chemical transport models (CTMs) and chemistry-climate models (CCMs), and there is some agreement on model projections given an emissions scenario (e.g., Prather et al., 2003;Reidmiller et al., 2009;HTAP, 2010;Wild et al, 2012;Doherty et al., 2013;. The importance of (4), however, lies in the recognition that air quality extremes (AQX), the worst pollution episodes in a decade, are triggered by meteorological conditions. Air quality absolute exceedances are known to occur in multi-day, spatially extensive episodes over the USA (Logan, 1989;Seinfeld et al., 1991), but it was not until the regular gridding of all station data over North America and Europe and the statistical definition of extremes in S2014 that the extent, coherence, and decadal variability of the episodes became clear. If climate change increases the duration and/or extent of the worst decadal AQX episodes, then the overall health impact of poor air quality may be worse than expected based on precursor emission changes alone (Fiore et al., 2012). A warming climate appears to increase the number of stagnation days (Horton et al., 2014) and may decrease the frequency of ventilating midlatitude cyclones (e.g., Mickley et al., 2004), but it is unclear how these meteorological indices relate to surface O 3 or particulate matter, especially with respect to the worst AQX episodes as identified in S2014.
The models in the Atmospheric Chemistry and Climate Model Intercomparison Project (ACCMIP; Lamarque et al., 2013) were used in the recent assessment of the Intergovernmental Panel on Climate Change (IPCC; Kirtman et al., 2013) and represent the most advanced attempt to simulate global surface O 3 in a future climate. However, in order to place any confidence in their projections, their ability to simulate the observed, present-day surface O 3 climatology must be evaluated. In this paper we present the first such modelmeasurement comparisons, specifically addressing (4) by applying the methodologies from S2014 to the current generation of CCMs in an effort to quantify their ability to simulate the decadal statistics of the AQX episodes. Due to the complexity and nonlinearity of the underlying processes, accurately simulating surface O 3 over both clean and polluted environments is a formidable task for global models with resolutions of 100 km at best. For example, it has been shown that choices in the parameterization of surface deposition can shift modeled surface O 3 levels by 10 ppb or more (Val Martin et al., 2014). Moreover, there are new, phenologically based land-surface models for interactions between atmospheric chemistry and the biosphere (Büeker et al., 2012) that have yet to be fully implemented in global models. In any case, both recent and future land-use change is expected to impact surface O 3 abundances (Ganzeveld et al., 2010). Thus, we recognize that this model-measurement comparison is just one of the first steps in evaluating global model simulations of surface O 3 pollution. A summary of the observational and model data sets as well as a brief overview of the methods developed in S2014, and used here, is presented in Sect. 2. Model-measurement comparisons are presented in Sect. 3 with concluding remarks and further discussion in Sect. 4.

Observations of surface O 3
We use 10 years (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009) of hourly surface O 3 measurements from air quality networks in NA and EU. Following S2014, in NA we use 1633 stations from the US Environmental Protection Agency's (EPA) Air Quality System (AQS) and also increase the spatial coverage in NA by including 92 stations from the US EPA's Clean Air Status and Trends Network (CASTNet) and 207 stations from Environment Canada's National Air Pollution Surveillance Program (NAPS). The data sets used for EU remain the same as S2014: 2123 stations from the European Environment Agency's air quality database (AirBase) and 162 stations from the European Monitoring and Evaluation Programme (EMEP; Hjellbrekke et al., 2013). Table 1 provides a summary of the observational data sets.
A major advance by S2014 was the generation of average surface O 3 abundance in a grid cell from observational products, one that could be directly compared to gridded model output. The station measurements are used to gen-  erate a 1 • × 1 • hourly grid-cell average surface O 3 product over NA and EU using the interpolation scheme described in S2014. The interpolation is similar to an inverse distance-weighted interpolation but additionally incorporates a declustering technique employed to reduce data redundancy, similar to that of Kriging (Wackernagel, 2003). The method also avoids disproportionately representing stations that often are preferentially placed in the most polluted urban environments. S2014 first derived the maximum daily 8 h averages (MDA8) of the individual stations and then interpolated onto the 1 • × 1 • grid, while here we interpolate the hourly measurements and subsequently derive the MDA8 at each grid cell. Differences between the two methods are small (e.g., some missing station data, different 8 h periods for nearby stations), but the new approach allows modeled diurnal cycles to be analyzed. The effects of (i) the new hourly 1 • × 1 • cells being used to calculate MDA8 and (ii) the addition of CASTNet and NAPS stations on the decadal 25th, 50th, and 95th percentiles at each grid cell in NA are shown in Fig. S1 in the Supplement. Overall, the difference (this work minus S2014) is about −0.6 parts per billion (ppb) O 3 for each of the three percentiles. These decreases are most likely a result of deriving MDA8 from the interpolated hourly abundances rather than first deriving each station's MDA8 and then interpolating. Other notable changes are the northeast edge of the domain (−5 ppb) for all three percentiles due to the generally lower O 3 abundances of Canadian NAPS stations, and Wyoming and Colorado at the 25th percentile (+5 ppb) possibly from CASTNet stations, reflecting either cumulative production of O 3 as polluted air reaches them or else more prevalent stratospheric influx.

Description of models (ACCMIP + UCI CTM)
The ACCMIP consists of 16 global models (12 CCMs, two CTMs, and two chemistry general circulation models, CGCMs) and was designed with the intent to better understand the relationships between atmospheric chemistry and climate change . We focus on the acchist experiment, designed to test the models' ability to reproduce the observed climatology of quantities specifically relevant to chemistry modeling . We use the eight ACCMIP models (six CCMs, one CTM, and In any case, all ACCMIP simulations are climatologically representative of the average 2000s with respect to meteorology and emissions. Table 2 provides a brief summary and the references of the models used in this study. Detailed descriptions of the AC-CMIP models can be found in Lamarque et al. (2013) and references therein. We also include a hindcast simulation over the same period as the observations from the University of California Irvine Chemical Transport Model (UCI CTM) performed at T42L60 resolution (Holmes et al., 2013) to both compare our model with the current generation models and to highlight differences between model simulations using free-running and hindcast meteorological conditions. The UCI CTM had many updates since the 1 • × 1 • × L40 version (Tang and Prather, 2010) used by S2014, but calculates similar, not unexpectedly high-biased patterns of surface O 3 .
For commensurate comparison of the models and measurements, we regrid the modeled hourly O 3 abundances (typically at 2 to 3 • resolution) to the same 1 • × 1 • cells as the observations using first-order conservative mapping (i.e., proportion of overlapping grid-cell areas). Modeled hourly abundances are adjusted by 1 h per 15 • longitude to be consistent with the local time of the observations. Our two major domains are NA bounded by 25-49 • N and 125-67 • W and EU bounded by 36-71 • N and 11 • W-34 • E. A further masking drops coastal grid cells for which the quality of prediction index, Q P < 2/3 (the number of independent stations at an effective distance of 100 km used to calculate the grid-cell values), see S2014 and Fig. S2 in the Supplement. Table S1 in the Supplement provides the latitudes and longitudes used in the final masking for both domains. Because of their differing chemical regimes, some of our analyses split the NA domain into western (WNA) and eastern (ENA) regions at 96 • W, and EU into southern (SEU) and northern (NEU) regions at 53 • N.

Air quality extremes
We define AQX events on a daily basis using local (i.e., gridcell) climatologies to identify the 10 times N worst days (i.e., highest MDA8) in an N-year period (i.e., the ∼ 97.3 percentile; e.g., the 100 worst days in a decade). The space-time connectedness of the AQX events into episodes is defined using a hierarchal clustering algorithm described in S2014. Because AQX episodes span across the regions, statistics for these analyses are done only on the two major domains NA and EU. The total size of an AQX episode (S, units = km 2 days) is calculated by integrating the areal extent of an episode (km 2 ) through time (days). For a given set of episodes, the mean size (S) is calculated as a weighted geometric mean, with the weights equal to the AQX episode sizes (Eq. 6 in S2014). Because the lower native resolutions of the models typically map onto four to eight contiguous 1 • × 1 • grid cells, the modeled episode sizes have artificial minima; however, S2014 demonstrated that this has little effect on the resultant episode size distributions.

Diurnal cycles
We test the models' abilities to reproduce the observed shape (i.e., phase and amplitude) of the diurnal cycle, averaged over summer (JJA) and winter (DJF) months. For each of the four regions, average hourly values (local solar time) are calculated as the area-weighted mean of all grid cells' O 3 abundances. We calculate the phase (h, hour of peak O 3 abundance, with h = 0.0 corresponding to 00:00 LT) and peak-topeak amplitude (H , ppb difference from minimum to maximum) of the diurnal cycle using a cosine fit with a period of 24 h. Although the diurnal cycle could be more accurately represented by a higher-order fit, this simple method provides objective and continuous measures of h and H for each data set, avoiding subjective, ambiguous results in cases of flat and/or multiple maxima. Figure 1a-h show the diurnal cycle of the observations and models averaged over JJA (top row) and DJF (second row) in WNA, ENA, SEU, and NEU (columns from left to right). A triangle for each data set is plotted as (x, y) = (h, H ). The large number of data points (∼ 10 6 × 24 h per model) provides a smooth and robust estimate of each data set's diurnal cycle. The color scheme and model abbreviations in the legend of Fig. 1 are common to all similar figures and text throughout. The Taylor diagrams (Taylor, 2001) in Fig. S3ah in the Supplement show an alternate, commonly used summary of the results in terms of the correlation coefficient (R), the normalized standard deviation (NSD), and centered rootmean-square difference. Figures 1 and S3 in the Supplement show very similar quantities (e.g., model-measurement discrepancies in h and H roughly correspond to R and NSD, respectively); however, we consider the representation in Fig. 1 to be more useful. The panels of Fig. S3 in the Supplement correspond to panels in Fig. 1 in terms of region and variable. Summary statistics on diurnal cycles, annual cycles, and AQX events for ENA are presented in Table 3, with all regions and additional statistics provided in Tables S2-S4 in the Supplement.
The shape of the diurnal cycle of O 3 is driven primarily by sunlight, meteorology (e.g., temperature and variations in boundary layer mixing), surface deposition, and the daily cycle of precursor emissions. The hour of the maximum phase h occurs when these factors align, usually in midafternoon. Indeed, for seven of eight region-seasons in Fig. 1a  is no obvious diurnal cycle in observations and the double minimum may simply reflect the titration of O 3 from the morning and afternoon peaks in transport NO x emissions. In this case there is little information from the diurnal cycle except that the amplitude H is small. The ACCMIP models, but not the UCI CTM, mostly show h within ± 1 h, generally later than observed (Tables 3 and S2 in the Supplement). Although the ACCMIP models' diurnal phase closely matches the observed, the peak-to-peak amplitude H is less successfully simulated. For JJA the observed H is 27, 29, 24, and 14 ppb in WNA, ENA, SEU, and NEU, respectively; for DJF, H is 10, 9, 5, and 0.2 ppb. We characterize the three largest H 's as high-photochemical region-seasons (JJA in WNA, ENA, and SEU) and the remaining five as lowphotochemical. In this sense JJA in NEU is closer to DJF in ENA in terms of near-surface O 3 production. The AC-CMIP models generally underestimate H by about 7 ppb in the highest three region-seasons but cluster around H for the lowest five. Model A is the only ACCMIP model to overestimate H in any of the highest three, possibly as a result of its large total VOC (volatile organic compounds, excluding methane) emissions (55 % larger than the average of the other seven models). The 24 h mean bias (MB, see Tables 3 and S2 in the Supplement) for the ACCMIP models is typically positive in all eight region-seasons (up to 28 ppb), although some models (e.g., C and E in JJA, E in DJF) show little or no mean bias, even though they underestimate H in JJA by about 25 % like all ACCMIP models.
The underestimation of the summertime diurnal amplitude H by most ACCMIP models suggests that they either underestimate net daytime production or have too little nighttime loss of O 3 or its precursors through either in situ chemical loss or dry deposition. From the derivative of the diurnal cycles in Fig. 1a-d, there are two periods of modelobservation discrepancy: in the early morning (∼ 06:00 LT) models underestimate the observed slope and in the early evening (∼ 19:00 LT) they overestimate it. The models generally match the observed slope to within ± 1 % h −1 during midday and throughout the night. Thus the model error is to underestimate net O 3 production in the early morning and overestimate it in early evening, which may be caused by the lack of a diurnal emission cycle in these global models. The mismatch of the slope in the early morning, during which the boundary layer grows rapidly, may be caused by the models underestimating entrainment of free troposphere air. We find no clear evidence that modeling errors in the nocturnal planetary boundary layer (Lin et al., 2008) or missing near-surface processes affect the diurnal cycle on a regional average.
Underestimated daytime production could result from limited representation of VOC chemistry, since discrepancies are largest in summer when VOCs play a larger role. Indeed, model A, which simulates the most chemical species of all the ACCMIP models in addition having the largest VOC emissions, is one of the few models to consistently overestimate H . In contrast, C and E are two of the better-performing models despite their comparatively simple representation of VOC chemistry (C -only isoprene, E -none). The only models to include the small and relatively uncertain fractional yield of HNO 3 from the reaction of HO 2 and NO are A, G, and H . This reduces daytime production and could partly explain why the models G and H consistently underestimate H more than others, however model A overestimates H .
The ACCMIP models reproduce the phase of the observed diurnal cycle in both seasons despite not accounting for hourly variation in emissions. The weekly emission-driven cycles in MDA8 O 3 were diagnosed by S2014, but we do not apply that diagnostic here because the models did not include such variability in emissions. The lack of hourly variation of emissions may account for the overall underestimates of H by the ACCMIP models, since NO emissions can be lost heterogeneously at night, less effectively than those during the morning and afternoon peaks in traffic. In addition, if the early morning peak in transport NO x emission was included, the modeled morning rise in O 3 would most likely  Menut et al. (2013) notes that at least for one model, increasing its vertical resolution results in very small surface O 3 changes. The UCI CTM's values of h and H show that it drastically overestimates net daytime O 3 production, especially during early morning hours. Its values of h are about 2.4 h earlier than observed in the highest three region-seasons, and in contrast to the ACCMIP models its H values are too large by 10 s of ppb. This diagnostic identifies a serious problem with the UCI CTM diurnal cycle over polluted regions that needs to be investigated (e.g., missing heterogeneous loss of NO 2 at night, capped boundary layer in the morning) and which will be done after publication of this research. S2014 found that the UCI CTM accurately hindcast the summertime probability distribution of MDA8 O 3 , the occurrence of AQX events, and the size of these episodes, albeit with high bias of about +29 ppb in JJA over both NA and EU. This new diurnal diagnostic has clearly identified model errors and pathways to improve our model as well as models like G, which gravely underpredicts the amplitude of the diurnal cycle. The tests shown here emphasize a large-scale average over different photochemical regimes in the four regions, and thus individual model developers may wish to analyze the observations for smaller regions using the data sets generated here, which are available by request from the corresponding author.

Annual cycle
We test the models' abilities to reproduce the observed phase and amplitude of the annual cycle over the four regions. Average monthly values for each region are calculated as the area-weighted mean of all encompassed cells' MDA8 O 3 abundance, reflecting the EPA air quality metric (www.epa. gov/air/criteria.html). Similar to the diurnal cycle, we derive the phase (m, month of peak O 3 abundance, with m = 0.0 corresponding to 1 January) and peak-to-peak amplitude (M, ppb difference from minimum to maximum) using a cosine fit assuming 12 equally spaced monthly means. Figure 1i-l show the annual cycle of the observations and models over our four regions with triangles plotted for each model and data set as (x, y) = (m, M). The filled gray curve shows ± 1 standard deviation of each monthly mean based on 10 years of observations. This interannual variability is quite narrow, much less than the spread across models. As for the diurnal cycle, the Taylor diagrams in Fig. S3i-l show an alternate presentation of the annual cycle results with summary statistics given in Tables 3 and S3 in the Supplement.
In northern midlatitudes, processes that drive the shape of the annual cycle are similar to those of the diurnal cycle (i.e., sunlight, temperature, and precursor emissions) but occur on continental to hemispheric scales. Dry deposition through stomatal uptake and large-scale meteorological conditions including stratosphere-troposphere exchange and the position of the jet stream (Barnes and Fiore, 2013) also play important roles. These surface observations show the same well-known cycle that has been seen in the northern hemispheric midlatitude troposphere from ozone sondes and clean-air remote sites (Logan, 1989;Fiore et al., 2009): lowest values in late fall (ND), increasing through winter (JFM) followed by a broad flat peak over spring-summer (AMJJA). The lower reactivity region NEU peaks in April and declines until January, indicating meteorologically driven increases through the winter (e.g., stratospheric influx). The observations show a phase m = 5.6, 5.3, 5.5, and 4.3 month of year for WNA, ENA, SEU, and NEU, respectively; and corresponding amplitudes M = 22, 21, 26, and 17 ppb. By fitting a cosine curve to each grid cell's time series, we find that in terms of specific locations, the earliest m occur in Canada, Florida, and NEU while the latest m occur in California, south-central NA, and SEU (not shown). Most ACCMIP models have m within ± 1 month of the observations, generally earlier in NEU, later in ENA and SEU, and split in WNA. Models C and G have difficulty producing the observed seasonal cycles, and their derived phases are not meaningful.
The amplitude M is controlled by both meteorology and photochemistry. For the very large regional values of M, it is clearly chemical, occurring in regions with large O 3 precursor emissions: California with ∼ 40 ppb, the Great Lakes region with ∼ 30 ppb, and northern Italy with ∼ 45 ppb (not shown). The smallest values of M (∼ 15 ppb) are found in northwest and southeast NA and in NEU. The ACCMIP models generally underestimate M by about 5 ppb in WNA, SEU, and NEU, while they overestimate it by about 5 ppb in ENA. The low values of M for C and G suggest they are either overestimating net production of O 3 in winter or underestimating it in summer; however, their wintertime biases (see Fig. 1e-h, Tables 3 and S2 in the Supplement) indicate that wintertime production or representation of wintertime physical climate could be causing the low M values.
The annual cycles here are constructed using the MDA8 O 3 derived from hourly data. Many models, including eight other ACCMIP models not analyzed here, do not report hourly surface O 3 but only monthly means (i.e., the average of all hours within a month). We chose MDA8 values to conform to the US EPA primary air quality standards and statistics, but if we used monthly averages then more models could be evaluated. Unfortunately, without at least daily diagnostics (e.g., daily mean or maximum value) analysis of percentile patterns and AQX events and episodes (see Sects. 3.3-3.7) are precluded. Further, we tested the difference in annual cycles diagnosed both ways and found that the bias of a model can differ and thus these two diagnostics cannot be mixed. For example, the ACCMIP ensemble mean bias for JJA using MDA8 averages is 2, 11, 11, and 8 ppb in WNA, ENA, SEU, and NEU, respectively; however, the corresponding bias using 24 h averages is consistently larger at 6, 14, 13, and 9 ppb. This result was expected since the ACCMIP model ensemble generally has the largest biases outside of MDA8 hours. These conclusions are generally true for all seasons and models, as illustrated in Fig. S4 in the Supplement, which shows the mean bias (model minus observed) of MDA8 minus 24 h average for each model, season, and region.
For the UCI model, excess production in the diurnal cycle is also evident in the annual cycle, overestimating M in all regions, most in ENA (+44 ppb) and least in NEU (+9 ppb). In addition, the month of peak abundance is always later than observed, sometimes by more than 1 month. Not unexpectedly, the bias in M using 24 h averages is significantly less than that using MDA8 (e.g., +30 vs. +44 ppb in ENA) because the largest errors occur near midday. We conclude that using 24 h averages to construct the annual cycle is basically a different, almost independent diagnostic than that constructed from the daily MDA8 O 3 , and further it would predict different health impacts if used to project summertime surface O 3 in a future climate.

AQX events
Next, we test the models' ability to reproduce the annual cycle of the individual AQX events, identified for each grid cell as the 100 days with the highest MDA8 in the decade (40 in 4 years for A, 50 in 5 years for G). Figure 1m-p show the annual cycle of AQX events for the observations and models over our four regions. The filled gray curve shows ± 1 standard deviation for each month based on 10 years of observations. The interannual variability is much larger than that seen in the observed MDA8 cycle with most models falling in its range in SEU and NEU but not in WNA or ENA. An alternate presentation as Taylor diagrams is shown in Fig. S3 in the Supplement, and the summary statistics are given in Tables 3 and S4 in the Supplement. The month of maximum AQX events for most models is within ± 1 month of that observed in each region (m AQX in Tables 3 and S4 in the Supplement). Based on S2014, we expect the annual cycle of AQX events to be highly correlated with that of MDA8, as the observations show correlations R MDA8 (i.e., AQX vs. MDA8) of 0.81 to 0.87 for all regions. For the ACCMIP models this correlation is not as good, but they still show R MDA8 > 0.70 (Tables 3 and S4 in the Supplement). Models whose monthly MDA8 correlates well with observed MDA8 also have monthly AQX events that correlate well with observed. Nevertheless, matching the AQX events annual cycle is more difficult than matching the cycle of MDA8 (Tables 3,  S3, S4, and Fig. S3 in the Supplement) because AQX events are driven by meteorological extremes which are not necessarily represented in these climatological simulations.
The UCI CTM also reproduces the annual AQX events well, and since it is a hindcast, we can extend the analysis to how well it identifies each AQX event on an exact-match basis ("model skill" by S2014). For a climatological model that exactly matches the annual cycle (i.e., matching the number of AQX events in each month) but is synoptically random in each month, a skill score of ∼ 8 % is expected; however, the UCI hindcast correctly identifies 28, 33, 33, and 21 % of AQX individual cell events in WNA, ENA, SEU, and NEU, respectively.

Mapping O 3 percentiles and enhancements
We can define baseline levels of O 3 from observations as the statistically lowest percentiles (National Research Council, 2009). Baseline levels are independent of attribution to specific emissions or policy relevance implied by US EPA's use of the term background. We can expect, or possibly assume, that baseline levels are not influenced by recent, locally emitted or produced pollution (HTAP, 2010). To estimate the daytime enhancement in summertime O 3 , presumably caused by continental emissions, we first want to define a baseline level for each grid cell as a lower percentile of the daily surface O 3 . We seek a percentile that represents the cleanest air possible over the summer season (even if it is never realized during the summer), and one that does not change across years. We use MDA8 rather than 24 h average data to prevent nighttime values from determining the baseline. We calculate percentiles for each cell on an annual basis and then derive regional area-weighted averages of the percentiles. The resulting percentiles by region (Fig. 2) show that the year-to-year variability is small below the 40th percentile, but the largest pollution years are evident at and above the 50th percentile. Thus, we select the 30th percentile as each grid cell's baseline level, which corresponds roughly to the lower levels of spring-fall days. One might argue choosing, for example, the 10th percentile of JJA to estimate summertime enhancement; however, this assumes JJA in all models is the peak of the annual cycle and still sees clean air. We define O 3 enhancement (E X , unit = ppb) here as the difference between the 30th percentile and any larger value, where subscripts will describe the reference value.
To estimate the summertime O 3 enhancement from local to continental-scale pollution, we assume that the 92 days of JJA are the highest O 3 values of the year, pick their median value (87th percentile), and subtract from it the spring-fall baseline (30th percentile). Maps of the summer enhancement E JJA (i.e., 87th minus 30th percentile) in NA and EU in observations and models are shown in Fig. 3  large anthropogenic sources. Two models (C, G) are unusually uniform across NA (except California). Surprisingly, this sorting of the models does not hold for EU. For example, there must be some clue as to why model B greatly overestimates E JJA over NA but underestimates it over EU. Such behavior from model C (uniform E JJA ) may be expected since the tropospheric VOC chemistry is highly simplified. The uniform pattern of E JJA is also somewhat evident in EU for model E, which has even simpler VOC chemistry compared to C, although this may be due to biases in the representation of physical climate rather than chemistry. The E JJA diagnostic provides an excellent geographically resolved test for CCM development. It also provides a useful measure of O 3 regional pollution changes in a future climate with shifting O 3 baselines due to hemispheric-scale changes in methane, water vapor, temperature, and stratospheric influx. Over each of our four regions, we calculate the average summertime enhancement E JJA (see Tables 3 and S3 in the Supplement), expecting to find the values and model-measurement differences similar to those found in the seasonal amplitude M. Indeed, this is true, albeit E JJA is generally smaller than M. In addition, the spatial pattern of the values and model-measurement differences are also consistent between E JJA and M (not shown).

AQX episode size
We examine the models' ability to simulate the observed distribution of AQX episode sizes over the decade 2000-2009. Our hierarchical clustering analysis identifies connected-cell, multi-day AQX episodes of size S (given here in units of 10 4 km 2 days). We do not split the NA and EU domains here because episodes span across regions. Figure 4a-b show the distribution of episode sizes in the observations and each model as the complementary cumulative distribution (CCD, %), i.e., the fraction of total AQX area-day events occurring in episodes of size S or larger.
Atmos. Chem. Phys., 15, 10581-10596, 2015 www.atmos-chem-phys.net/15/10581/2015/ For NA observations, the fraction of AQX area-weighted events that occur in episodes with S > 100 × 10 4 km 2 days (CCD 100 ) is 79 %; and those with S > 1000 × 10 4 km 2 days (CCD 1000 ) is 38 %. For EU observations, most AQX events also occur in large episodes: CCD 100 = 80 % and CCD 1000 = 35 %. Model C is aberrant in having extremely large episodes (e.g., NA CCD 100 = 93 %), which fall mostly in the spring rather than summer months (see Fig. 1m-p). This may result from the model's simplified chemistry or unrealistic widespread stratospheric intrusion of O 3 . In any case, this model's summertime high ozone events are obscured. Model A, with much more complex chemistry, however, shows significantly smaller episodes. For CCD 100 , the other models (B, D-I) are close to the observed: 73-85 % for NA and 71-85 % for EU. For CCD 1000 , however, this model spread diverges substantially: 13-69 % for NA and 15-70 % for EU. In general, models A, B, E, and G do not produce the larger episodes and thus their physical climate may lack the synoptically correlated persistent stagnation episodes. The UCI CTM, using observed meteorology, captures the shape of the observed CCDs extremely well compared to the freerunning climate of the ACCMIP models.
Integrating over all episodes, we calculate the weighted geometric mean size (S) (see S2014). Observations have mean episode sizes (S) of 415 (10 4 km 2 days) and 444 in NA and EU, respectively. Models C, D, F, H, and I are biased high in (S), while models A, B, E, and G are biased low for both NA and EU ( Fig. 4a-b, Tables 3 and S4 in the Supplement).

Non-stationarity and possible trends
One problem with diagnosing decadal AQX size statistics is that they can be biased when more AQX events occur at one end of the decade due to a trend in O 3 precursor emissions. A greater density of events in one summer generally means larger episodes. A linear fit of annually derived O 3 percentiles calculated over years 2000-2008 (2009 was excluded due to lack of NO x and VOC emission data, see below) for each of the four regions (Fig. S5 in the Supplement) shows clearly decreasing surface O 3 abundances at the higher percentiles (see also Fig. 2), presumably through reductions in NO x and VOC emissions (Hudman et al., 2009;Xing et al., 2014). To test if these trends are emission-driven or artifacts of the meteorological time slice, we analyze the UCI CTM results (dashed lines, Fig. S5 in the Supplement), which are forced by observed meteorology but have constant anthropogenic pollution emissions over the time period. We also obtain total NO x and VOC emissions from version 4.2 of the Emission Database for Global Atmospheric Research (EDGAR, EC-JRC/PBL, 2009) for years 2000-2008 (2009 was unavailable at time of publication) and calculate their trends over the period. Over WNA and ENA, meteorology seems to be driving the small positive trends at lower O 3 percentiles (where UCI and observed trends roughly agree), but above the 60th percentile (where UCI and observed trends diverge) emissions reductions are the most likely cause. In SEU and NEU the trends are less conclusive for either meteorology or emission based, but most EU NO x reductions occurred prior to 2000 (Xing et al., 2015). Koumoutsaris and Bey (2012) compare GEOS-Chem hindcasts with NA and EU trends at a limited number of stations from CASTNet and EMEP (∼ 40 in each domain) and find similar trends. They also attribute the negative trends at high percentiles to reduced precursor emissions; however, they attribute the positive trends at low percentiles to changing background O 3 as opposed to changing meteorology posited here.
In an effort to correct the AQX decadal statistics for changes in O 3 precursors, we searched for correlations on a cell-by-cell basis between high-percentile MDA8 O 3 vs. NO x emissions on an annual basis for years 2000-2008. No simple linear relation emerged, and we could find no satisfactory way to "correct" the observations for this regionally varying, monotonic but nonlinear, decline in NO x and VOC emissions that did not corrupt the data. The post-CMIP5 plans for the Chemistry-Climate Model Initiative (CCMI) include hindcast simulations with time-dependent emissions that will allow for the simulation of the observed O 3 nonstationarity.
One option for analyzing extremes in a non-stationary decadal data set is to define AQX events annually on a 10per-year basis. This approach greatly dampens the observed episode mean size and across-year standard deviation from 415 ± 307 (100 per decade) to 249 ± 67 (10 per year) in NA and from 444 ± 720 to 355 ± 48 in EU. Moreover, it gives a false positive impression of the severity of air pollution in extreme years. Thus, we maintain our primary analysis with AQX defined as 100-per-decade. In parallel with Fig. 4a-b, we show the CCDs using a 10-per-year basis for AQX in Fig. S6a-b in the Supplement.

Severity of pollution in largest episodes
As a measure of O 3 produced during AQX events/episodes, we map out the enhancement at the AQX threshold level E AQX (∼ 97.3 percentile) as shown in Fig. S7 in the Supplement (parallel to Fig. 3, also relative to the local 30th percentile). We also calculate the average AQX enhancement E AQX over our regions (Tables 3 and S4 in the Supplement). For ENA, the ACCMIP modeled range of E AQX is 29-52 ppb, spanning the observed of 35 ppb (Table 3). This average result is encouraging for the ACCMIP models except that, as for E JJA (Fig. 3), the pattern match is not as good (Tables 3, S3 and S4 in the Supplement).
Of the 100 AQX events in each cell, many will lie above the local AQX threshold value. We expect that larger, longerduration episodes accumulate more O 3 , and thus these super episodes might have O 3 enhancements (relative to the 30th percentile) well above the AQX threshold enhancement, E AQX . For each AQX event, we calculate an enhancement  Figure 4c-d plot the observed density distribution of all E S , quantized every 2.5 ppb for E S and every decade in 10 4 km 2 days for S. These plots show large variability in the observed E S frequency (gray pixels) and yet a consistent picture of the mean enhancements as a function of S (open circles). For episode sizes of 0.3 (i.e., 0.1 to 0.99), 3, and 30, E S is almost constant (∼ 32 ppb for both NA and EU), but for sizes 300 and 3000 it increases almost linearly per decade. We calculate this slope E S as the average of the 30-to-300 increase (1 decade in S) plus half of the 30-to-3000 increase (2 decades), getting values of 2.9 (NA) and 1.7 (EU) ppb increase per decade of episode size. Similar results are seen for the 10-per-year AQX definition (Fig. S6c-d in the Supplement), with E S of 2.7 (NA) and 3.3 (EU). The slope E S is not simply an expected result from our statistical sorting since in NA we find that compared to the observations, model C has slope that is a factor of about 4 smaller, while A has a slope nearly a factor of 4 larger, and F has a negative slope.
The models generally produce the shape of E S vs. S, although most models (except A and I, see Fig. 4 caption) underestimate the enhancement for all sizes. The obvious discrepancies are for NA episodes, where many models predict that the largest enhancements occur in the smallest episodes (S = 0.3). This anomaly does not occur for EU episodes. These small episodes are uncommon, representing only a small fraction of events (see Sect. 3.5), and we find them mostly along the coasts at the edge of the mask. We understand them to be the effect of very polluted air masses being advected to the neighboring ocean cells which are typically low-O 3 regions with very low 30th percentile baselines, resulting in large enhancements from the highly polluted air. The observations are interpolated and not capable of following a pollution shift offshore. Thus the models are probably correct, but the method of masking and station interpolation makes this discrepancy a systematic feature. The lack of such a feature in EU can be understood by the lack of such sharp coastal gradients. Overall, most models agree with the observations, showing that the super-episodes have the largest O 3 enhancements.

Conclusions and discussion
Confidence in modeled projections of future air quality is based fundamentally on our ability to accurately simulate the present-day observed climatology of surface O 3 and particulate matter over NA and EU where dense, long-term, reliable measurements are available. In this work we evaluate the surface O 3 climatologies from eight global models (six CCMs, one CTM, and one CGCM) that reported hourly surface O 3 as part of the ACCMIP. In addition we test the UCI CTM simulation as an exact hindcast of the 2000-2009 decade of observations used here. Our tests follow the unique approach of S2014 in which over 4000 heterogeneously spaced air quality stations are used to calculate the hourly O 3 averaged over 1 • × 1 • grid cells that can then be compared unambiguously with the modeled grid. Diagnostics include the hourly diurnal cycle, monthly seasonal cycles, and sizes and intensity of air quality extreme episodes. For the most part, the models are biased high during all hours of the day, all months of the year, and in all regions.
Averaged over large regions, the ACCMIP models simulate the shape of the observed summertime diurnal cycle well, with the hour of maximum within ± 1 h of observed (∼ 15:00 LT). The observed peak-to-peak amplitude (25 to 29 ppb over the more polluted regions) is not as well matched and typically underestimated by about 7 ppb. The UCI CTM hindcast, which performed well in the S2014 tests except for a uniform high bias, clearly fails these new diurnal tests and indicates model error in the morning boundary layer chemistry. In general, the ACCMIP models simulate the observed regional annual cycle of monthly mean MDA8 O 3 . They match the month of maximum to within ± 1 month of observed (mid-June), although two models are in error with almost no annual cycle and no clear maximum. The other models overestimate the peak-to-peak amplitude of the observed cycle by about 5 ppb (20 %) in the most polluted region (eastern North America) while underestimating it by about 5 ppb in the other three regions. Model skill in matching the annual cycle of AQX events is fair but not good. This annual cycle has much larger interannual variability than that of MDA8 O 3 , and many models shift the month of maximum AQX events to later in the summer than is observed.
Measures of the enhancement in surface O 3 driven by pollution are derived from the statistics of the decade of daily gridded MDA8 values. For our measure of summertime enhancement (87th minus 30th percentile), the models generally replicate the observed spatial structures but overestimate the magnitude in the most polluted regions. Two models are surprisingly uniform across both continents and fail to highlight areas with the largest emissions of O 3 precursors. Typically, modeled high biases appear in the upper percentiles, not the 30th percentile, which appears to be a good measure of the baseline O 3 across the decade.
About 80 % of the AQX events in NA and EU occur in large, connected, multi-day episodes consisting of 100 grid cells or more. This result is closely matched by all but two models, with C producing much larger episodes and A much smaller ones. It remains unclear whether such errors result from chemical or physical processes. The observations show that super-sized episodes of 100 cells or more have successively greater O 3 levels as they become bigger, with the 100times-larger episodes having 4-6 ppb greater O 3 . Most, but not all of the models match this increase. It is likely that larger, longer-lasting episodes allow for greater accumulation of O 3 from neighboring pollution sources.

What are the best air quality diagnostics for model development?
For testing and identifying the model strengths and weaknesses and improving simulations of air quality, modelers save a large number of diagnostics during the model development process. This typical model development process is far less limiting than the experiment analyzed here, which is based on the voluntary contributions of many models and many terabytes of diagnostics imposed in the ACCMIP. We would still recommend saving the diagnostic of hourly surface O 3 over a decade or more of simulation from which all of the primary diagnostics here can be readily derived and compared with the observations. To segue from the surface O 3 over NA and EU to the sondes and remote sites, a monthly averaged 3-D O 3 would probably suffice. Hourly data observed at coastal or mountain sites likely include a diurnal meteorology that is not represented in the global models, even at a resolution of 0.5 • × 0.5 • . Furthermore, the 24 h and MDA8 averages show different biases and should not be treated as the same diagnostic. There may be inventive ways to avoid the massive hourly data sets by storing the diurnal cycle as a monthly mean and calculating MDA8 inline or just storing the maximum daily O 3 value, which would then require similar analysis of the observations.
The open questions are what model simulations are practical and which would be most useful to identify model errors. The ACCMIP simulations forced by a decade of 2000s climate-model sea surface temperatures are useful in comparing decadal statistics, but the UCI CTM hindcast provides unique tests on the ability to simulate specific events and years. Even if the observed sea surface temperatures were used, the synoptic extreme events would not likely coincide with the observed, so a hindcast meteorology based on reanalysis for forecast fields provides an important test of the model.
The surface O 3 data here are based on an interpolation algorithm that was optimized for the 50-100 km scale averages. Thus, the supplied grid-cell averaged data could be regenerated at 0.5 • resolution, but if one wants 10 km cell averages for regional models then the parameters in the current algorithm would need to be revised and re-optimized. The surface O 3 data set will be expanded to include more than 2 decades  and thus longer simulations would be desirable to investigate interannual variability.

What are the most important tests for these
chemistry-climate models, assuming that hindcasts and detailed emission data are not being used?
Another major question is what emissions to use. With AC-CMIP the choice of a single year of representative emissions for the decade was the optimal choice. The downward trending emissions in NA and EU over the 2000-2009 decade, however, created a non-stationary data set. Going to a longer data set, 1993-2015, will make the comparison between models and measurements more awkward. Model developers will need to take some account of this non-stationarity, possibly as a sensitivity study using two different emissions sets representative of the early and late periods of observations, when not tracking emission changes each year. An emissions problem not resolved here is whether the modeled diurnal cycle over heavily polluted regions in summer would be affected by imposing a more accurate diurnal and weekly cycle in emissions. This is probably beyond what can be imposed in a model intercomparison project such as the ACCMIP but should be part of the individual model development as a sensitivity assessment.
The four-region decadal average statistics here provide a fairly broad view of the models' ability to predict the buildup of O 3 and extreme events in polluted regions. Clear examples of model error are identified. The general agreement of the diurnal cycle between models and measurements still needs to be tested with diurnal emissions. Going beyond the mean regional cycles, the ability to test models at the grid-cell level provides clear geographic coverage, identifying patterns of the discrepancy that are sometimes disturbing, as shown in Fig. 3, but not developed further in this paper. The next study of the CMIP-generated surface O 3 needs to evaluate this.

What tests provide the best confidence in model prediction of future air quality?
Accurate projections of future air quality rely on our ability to predict the changes in both baseline level and pollution buildup in response to both specified future climatic conditions and a change in local-to-global emissions. Both the baseline and the amount of O 3 produced from pollution are likely to change and need to be assessed separately. For that purpose, we find that the maps of summertime (87th percentile) and baseline (30th percentile) and their difference are one of the more important tests of a model's simulation of the present day. The annual cycle of monthly means is also in some way a measure of the summertime enhancement but not as useful as the percentiles. One key measure of future change would be in the size and intensity of extreme episodes. The intensity needs to be assessed relative to the baseline, but the size of the episodes clearly relates to their intensity and would be independent of shifts in baseline. Thus the AQX statistics based on the daily MDA8 values here are an important model test.
The Supplement related to this article is available online at doi:10.5194/acp-15-10581-2015-supplement.