Skill in forecasting extreme ozone pollution episodes with a global atmospheric chemistry model

. From the ensemble of stations that monitor surface air quality over the United States and Europe, we identify extreme ozone pollution events and ﬁnd that they occur predominantly in clustered, multiday episodes with spatial extents of more than 1000 km. Such scales are amenable to forecasting with current global atmospheric chemistry models. We develop an objective mapping algorithm that uses the heterogeneous observations of the individual surface sites to calculate surface ozone averaged over 1 ◦ by 1 ◦ grid cells, matching the resolution of a global model. Air quality extreme (AQX) events are identiﬁed locally as statistical extremes of the ozone climatology and not as air quality exceedances. With the University of California, Irvine chemistry-transport model (UCI CTM) we ﬁnd there is skill in hindcasting these extreme episodes, and thus identify a new diagnostic using global chemistry–climate models (CCMs) to identify changes in the characteristics of extreme pollution episodes in a warming climate.

With climate change, several factors may affect local pollution: changing meteorological conditions, shifting background atmospheric composition, and chemistry-climate interactions that control the efficacy or residence time of pollutants. All of these factors may alter the efficiency of local emissions in generating pollution events (Weaver et al., 2009) and need systematic evaluation. Thus, global chemistry-climate models (CCMs) are a necessary component in projecting future air quality on a continental scale Kirtman et al., 2013). Here, we provide an approach that can evaluate CCMs in terms of their ability to match this new observed climatology of ozone pollution, one that specifically examines how climate change might alter the meteorological conditions that create the multiday, large-scale extreme ozone episodes found in the US and Europe (EU) today (e.g., Barnes and Fiore, 2013).
Even at their best typical resolution (∼ 1 • ≈ 100 km), current global chemistry models are known to have high biases in their production of global tropospheric ozone from pollution (Wild and Prather, 2006). This high bias in production extends to surface ozone on a continental scale (e.g., Nolte et al., 2008;Appel et al., 2012;Lamarque et al., 2012;Rasmussen et al., 2012), although in one case the bias is negligible (Mao et al., 2013). These chemistry-transport models (CTMs) or CCMs also have serious limitations in modeling peak ozone levels (Dawson et al., 2008). The use of such global models for air quality projections is seen as being limited until such errors are accurately diagnosed and corrected Murazaki and Hess, 2006;Reidmiller et al., 2009). There is a need for observation-based tests of the ability of atmospheric chemistry models to simulate pollution episodes over the time-and space scales possible in a global model. In this study, we develop such diagnostics, specifically a grid-average climatology of daily surface ozone concentrations, with a focus on CTMs that should be able to simulate past events (hindcasts) using a meteorology representative of the time of the observations (e.g., ERA-Interim or GEOS MERRA). The goal is to characterize statistical errors and systematic biases in the hindcast and to provide clear metrics that can document improvements in the model.
Observations of surface O 3 from monitoring stations provide the basis for testing models, but measurements at individual stations are generally not representative of model grid cells (Valari and Menut, 2008;Dennis et al., 2010). This problem is referred to as "incommensurability" or "change of support" (Gelfand et al., 2001;Swall and Foley, 2009) and prevents ready quantitative assessment of model errors. If station observations are used to generate an observed ozone product that is directly comparable to what a model predicts, viz. the average O 3 concentration in a grid cell, then geographic patterns and statistics of the pollution episodes can be readily and commensurably tested. In Sect. 2, we present our new algorithm for mapping the individual station data onto cell averages on a regular grid. As part of this analysis we generate an objective measure, the quality of prediction (Q P ), for the mapping of each cell (i.e., how many independent points were used and how far away they are). This gridcell product has the added advantage of allowing direct and commensurate comparison of independent sets of overlapping but not collocated observing sites, and we examine the biases between the two European ozone networks (European Monitoring and Evaluation Programme (EMEP) and Air-Base) for both clean and polluted periods. This assessment uses a full decade of observations (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009) from three networks (Environmental Protection Agency (EPA) over the US).
In Sect. 3, we compare the maximum daily 8 h average (MDA8) grid-average observations over the US and Europe with the University of California, Irvine chemistry-transport model (UCI CTM)-simulated values for years 2005-2006. The model errors are diagnosed in terms of location, time of year, and pollution level by comparing different percentiles at each grid cell while maintaining exact-day matches (concurrent sampling) over the 2 years. Simple comparison of highand low-end statistics of the ozone distribution is found to be misleading. In Sect. 4 we define extreme pollution events for each grid cell in a climatological sense, as the 100 worst days (i.e., highest MDA8 concentrations) in a decade (∼ 97.3 percentile) or the 20 worst days in 2 years when comparing the observations to the UCI CTM. We then identify the struc-ture of the multiday, continental-scale pollution episodes that make up most of these events. The CTM's ability to match these extreme episodes is shown to have considerable skill, which degrades as the quality of prediction of the cell decreases and as random noise is added to the observations. In Sect. 5, we develop statistics of the extreme events from a decade of observations that can be used without hindcasting to compare with free-running chemistry-climate models. Using clustering algorithms, we define the size in space and time of the episodes and the fraction of all events that occur within large clusters. In Sect. 6 we conclude and discuss how to use the current climate archive Coupled Model Intercomparison Project Phase 5/Atmospheric Chemistry and Climate Model Intercomparison Project (CMIP5/ACCMIP), or to design the next-generation chemistry-climate simulations, to assess climate-driven changes in extreme ozone pollution episodes.

Observations of surface O 3 over the US and EU
For our observations of surface O 3 we use 10 years (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009) of hourly surface O 3 measurements from air quality networks in the United States and Europe (see Table 1 for summary of data sets). For the US we primarily use the EPA's Air Quality System (AQS). The EPA's Clean Air Status and Trends Network (CASTNET) is used for independent evaluation as described in Sect. 2.3. For EU we combine EMEP (Hjellbrekke et al., 2013) and the European Environment Agency's AirBase network except in Sect. 2.4, where we compare these two independent but overlapping data sets. The AirBase data set includes information on the zoning type of the stations (e.g., rural, suburban, urban, traffic), and we choose to use all but the traffic stations for the most complete and representative data, a decision corroborated by Pirovano et al. (2012). The hourly measurements from EMEP and Air-Base are reported as µg m −3 and are converted to parts per billion (ppb = 10 −9 mol mol −1 = nmol mol −1 ) using a temperature of 20 • C; essentially mass concentrations are multiplied by 0.5 ppb ug −1 m 3 .
From these data sets we calculate the maximum daily 8 h average O 3 concentration (MDA8), which is the primary air quality standard for the US (www.epa.gov/air/criteria.html) and is commonly used in human and agricultural health studies (Chan and Wu, 2005;Bell et al., 2006) and climate studies (e.g., Tagiris et al., 2007). We calculate the MDA8 by beginning the 8 h averaging period at 24:00 LT and calculating 17 8 h averages for each day, picking the maximum of those 17 (i.e., the averaging only considers windows that fully reside within 1 day). Thus the maximum can occur during different 8 h intervals at adjacent sites or on consecutive days at the same station, although afternoon and early-evening maxima are most common (Bruntz et al., 1974). The location of the stations and their 10-year mean MDA8 surface O 3 concentrations are shown in Fig. 1.  The mask for interpolating the 1 • × 1 • grid cells is also shown, with light gray indicating cells with Q P < 0.67 used here (see text).

Choosing a method for interpolating grid-cell averages
We develop an interpolation scheme that provides grid-cellaverage values of surface O 3 over the US and EU domains, essential to compare observations to a gridded model. Our goal is to use all representative station data, recognizing the heterogeneity of surface O 3 that must be averaged over to compare with gridded model simulations. The most commonly used technique used to compare observations with a gridded model is to simply average all observing sites within the grid cells to be compared (e.g., Fiore et al., 2002). This results in an incomplete domain as well as the calculated averages disproportionately representing urban stations, especially in areas where exceedances are likely to occur. Fiore et al. (2003) accounts for the clustering of urban stations by first averaging the station observations on a finer grid (0.5 • × 0.5 • ) and then averages those cells to match the coarser model grid. In any case, Diem (2003) notes that almost all ozone-mapping methods have major problems and that this is neither a simple nor a solved task. The task here is very different from that of interpolating spatial extremes to infer regions of O 3 exceedance (e.g., Cooley et al., 2007;Padoan et al., 2010).
Inverse distance weighting (IDW) and ordinary Kriging are the most common interpolation techniques, with generally small or modest differences found between the two (Rojas-Avellaneda and Silvan-Cardenas, 2006). Both produce estimates at unmeasured points using a weighted linear combination of the values at neighboring sites, determined by some function of the separation between the unmeasured point and observation sites. The difference is that the weights in Kriging are formulated to minimize the variance in the estimated values (error) using a predefined model of the spatial covariance of the data, while the weights in IDW are determined without specific need for the covariance function.
Kriging is often favored as it provides prediction error estimates and incorporates a declustering mechanism designed to account for data redundancy, effectively treating highly clustered data more like a single site (Wackernagel, 2003). Since many observation sites in the US and EU data sets are located in close proximity to one another, some form of declustering is desired in our interpolation. Isaaks and Srivastava (1989) note that, when the effect of data clustering is accounted for in IDW, the advantages of using Kriging are slight. In addition, the covariance function required for Kriging can easily be modeled incorrectly, especially at short separation distances (Diem, 2003), when many sites are close in geographic space but their reported values differ by a large amount, as in the case of air pollution. Many of the geographically clustered sites in our data sets are located in urban areas associated with high variability, so the covariance function could easily be incorrectly modeled at short separation distances. Consequently, the Kriging weights given to these clustered stations would not necessarily provide the desired declustering. For this reason, we use a modified from of IDW that incorporates a declustering scheme without the need to model the underlying covariance function.
From O 3 observations Z k at sites x k , we interpolate the O 3 mole fraction at an unobserved location x as a weighted sum of the observations where K is the number of observations sites and weights w k are defined as follows. In standard inverse-distance weighting w k = |x −x k | −β , with β typically in the range 1 ≤ β ≤ 4. We optimize β as described below after adjusting the weights for distant and clustered observations. Weights are set to zero when |x − x k | exceeds a threshold L to avoid meaninglessly small contributions from distant sites. We choose L = 500 km based on the typical scale of synoptic meteorology that influences surface O 3 and test other choices below. We also reduce the weights of clustered stations, which tend to lie in urban areas, to avoid excessive influence of the cluster on surrounding rural regions and to avoid the shielding effect whereby an observation site screens all those that are located immediately behind it (Falke, 1999). The weight of each observation site is reduced by a factor M k that is the number of other observation sites located within a distance D of site k. We choose D = 25 km as a typical size scale for urban areas and test other choices below. Furthermore, all observation sites within the region |x − x k | < D are given equal weight to avoid singularities in the interpolation. Taken together, the weights in Eq. (1) are If the sum of the weights for point x from sites k is zero, a null value is given to that point. Our interpolation algorithm calculates values at points for a single day using only measurements from that day. Implementation of spatiotemporal interpolation is complex, with no specific implementation well agreed upon for applications to air quality data  (Huang and Hsu, 2004). Falke (1999) incorporates a temporal component by reducing the weights of highly variable (mostly urban) sites using the variance of the sites. We do not include this since we assume urban sites are representative of the true processes controlling surface O 3 . In addition, the weights of these sites are often already significantly reduced by the declustering scheme. We optimize the interpolation parameters using the leavek-out cross-validation scheme (Cressie, 1993). This involves removing k = 10 % of observation sites and predicting their values using the remaining observations and IDW interpolation defined above, recording the root mean square error (RMSE) of the predicted sites. This is done for 365 randomly selected sample days from 2000 to 2009 with different randomly selected k sites for each day. The primary optimization is for β, keeping D = 25 km and L = 500 km fixed. All tested β values use the same days and prediction sites. Where there are many nearby sites, the RMSE is at a minimum of about 6 ppb (see Fig. 2 and discussion of quality of prediction below) and does not change much for the range of 2.5 < β < 3.5. The use of large β values can lead to sharp gradients near sites, and, since we seek an average concentration over a grid cell, we select the lower value of the shallow minimum, β = 2.5. Subsequently we look at the error for a range of D and L values, and find it relatively insensitive (< 10 % change from the mean) over reasonable values (D = 10 km, 25 km, 50 km; L = 250 km, 500 km) and β = 2.5 (see Table S1 in the Supplement). Thus we retain our original estimates for D and L.
To obtain grid-cell-average values, we use the IDW procedure above to determine the ozone values at 25 equally spaced points in latitude and longitude within each cell and then use trapezoidal integration over the area, similar to block Kriging (Cressie, 1993). The 4-corner points are each shared with 4 grid cells, and the 12-edge points shared with 2 cells. The trapezoidal integration weights account for latitudinal variation of the points. Thus the weight w * i of each point x i for i = 1 : 25 in the grid cell X is where θ i is the latitude and T i is the trapezoidal integration weight, which takes values of 0.25 for corner points, 0.5 for edge points, and 1.0 for the interior points. The calculation of the average ozone value at the grid cell X, ((Z)(X)), is then the weighted sum of ozone at points x i , Z i : We do not report (Z)(X) for grid boxes where over half of the interior points Z(x i ) are zero.

Quality of prediction and the interpolation mask
The interpolation procedure should be limited to the region being modeled and where a reliable prediction can be made. We begin with a desired mask of 1 • × 1 • cells and then check if the interpolation is adequate. For the US, we use the landmass of the contiguous states (CONUS) and include ocean cells adjacent to CONUS. For EU we draw a similar mask but also include areas in the North Sea and in the Mediterranean Sea west of Italy. We then calculate a measure of the quality of prediction, Q P , for the points within this desired mask to determine the final grid masks for the US and EU. We define Q P as the effective number of independent stations at a distance of 100 km that went into the interpolation.
Thus, for β = 2.5, one station at 50 km or less distance counts as 5.7 stations, and one at 200 km counts as 0.18 stations. Grid-cell-average Q P values are calculated in the same manner as the average O 3 in Eq. (4). The observing sites do not always provide continuous daily data for the decade 2000-2009, and thus the numbers of sites that go into the daily interpolation of each grid cell may vary. In order to keep the masking consistent over the period, it is based on the location of all observing sites, effectively the largest possible Q P values over the time period. The declustering weighting for each site, M k , is recomputed on a daily basis. The Q P values reflect the ability of the observing network to predict O 3 ; the highest (lowest) Q P values have the smallest (largest) RMSE (Fig. 2). Using this relationship and with the intent of providing as nearly contiguous a grid for EU and the US as possible, we select the value of Q P = 0.67 as the cutoff for our masks. Figure 1 shows the constructed masks (gray boxes) for the EPA (Fig. 1a) and combined EU ( Fig. 1b). When comparing the EU observations with the UCI CTM, we truncate the mask northward of 65 • N. Note that the mask over the US excludes parts of Montana that are too distant from sites. Figure S1 in the Supplement shows the logarithm of Q P values for all of the retained grid cells for the US and EU. The lowest Q P values for our US mask (apart from the coasts) are found from west-central Texas and north, due to the low density of observing sites in this area. The lowest values in EU are found in the northernmost and easternmost edges of the domain for the same reason.

Interpolation error
The error of our interpolation method can be objectively measured for the individual sites as described in Sect. 2.1. The average RMSE for the sites can be plotted as a function of our estimate of the quality of the interpolation (Q P ) as shown in Fig. 2. For large values of Q P the RMSE levels off at about 6 ppb. This is a measure of the small-scale, nearest-neighbor variability in ozone that is simply not resolved by our interpolation. Our analysis does show that the RMSE begins to increase when Q P falls below about 30 (effective number of independent sites at a distance of 100 km). Note that the lowest Q P value for the US is about 3, because the sites tend to be located near one another. Thus Q P is a measure of error in interpolation.
Deriving an error for the interpolated grid-cell-average values is more difficult since we have no objective measure of the cell-averaged ozone values. Clearly the minimum RMSE of 6 ppb for individual sites is an exaggeration of the error when averaging over a 1 • grid cell (∼ 10 4 km 2 ). Using the error analysis done for the sites (removing randomly 10 % of the sites), we can examine how the cell-average values change relative to standard result using the full set of sites. The RMSE for this case is also plotted in Fig. 2. It provides a measure of the error in the cell-average ozone, but is at best a lower limit. The RMSE remains small, at about 1 ppb or less, for Q P = 0.7 to 100 and increases to 2 ppb for Q P = 0.33. It is encouraging that relative error estimates can be made and that our cutoff of Q P = 0.67 is a good choice. Note that this approach does not inform us about extrapolation error arising from, e.g., gradients near the coasts. Results for both the US and EU are similar, and the range of Q P is much larger than in the site-error analysis because we are trying to interpolate cells that are distant from sites.
With the daily MDA8 O 3 values interpolated, we can begin to analyze the results for each domain. direction. This is even more pronounced for cell C on the edge of the domain. Figure 4 shows the gridded, masked MDA8 ozone concentrations for both the US (Fig. 4a, c, e) and combined EU ( Fig. 4b, d, f) data sets for two representative percentiles, the 95th ( Fig. 4a-b) and 25th , and their differences (95th minus 25th, Fig. 4e-f). The percentiles here are calculated with respect to years 2005 and 2006, since these are to be compared with the CTM hindcast. The highest 95th-percentile values (∼ 70 ppb) occur in California and then in a broad swath from Texas to New England. For EU they lie mostly around the Mediterranean. The lowest 95th percentiles occur in the northern latitudes for both the US and EU. The 25th percentile represents clean air, typically in winter, and here the largest concentrations (∼ 40 ppb) in the US occur over the Rocky Mountains and the plains to the east, while for the EU ozone concentrations greater than 30 ppb are found only at the southern extent of the mask. Note that Greece and southern Italy stand out as maximal in both percentiles. The difference, 95th minus 25th percentile, is a measure of the pollution buildup, and it tends to follow the regions of largest emissions. California, the Midwest, and the eastern seaboard have the greatest differences in the US (> 40 ppb), while in EU the greatest differences are concentrated in central countries (e.g., France, Germany, northern Italy).

Comparison of overlapping observational O 3 networks
The grid-cell-average O 3 MDA8 product developed here provides a ready comparison of the two independent but overlapping networks, for which individual adjacent stations are not available. For the comparison, we calculate Q P values for each data set and apply a mask using a cutoff of 0.33 rather than 0.67 in order to examine a larger area. We define the bias as AirBase minus EMEP and present biases for the 25th, 50th, and 95th percentiles calculated with respect to years 2000. Note that these comparisons are not exact-day matches, and hence each percentile may correspond to a different day. The AirBase data set is mostly biased low over all three percentiles, with greatest differences (below −10 ppb) for the 25th percentile in Alpine regions. In this case the area-weighted mean bias (MB) is background O 3 , so they are placed at more remote, higheraltitude locations; while AirBase is selected to reflect population exposure, so stations are more readily placed in the valleys, where the population is greater). The bias could also reflect interpolation errors at the edge of the EMEP domain, as there are far fewer stations than in AirBase. Differences between AirBase and EMEP are much smaller in the 50th and 95th cases, with MBs of −2.7 ± 1.9 and −1.7 ± 2.2 ppb, respectively. The biases could also be due the cumulative production of O 3 as polluted air disperses since the EMEP sites are located in rural areas while AirBase sites are generally in or near populated areas.
We also present the difference between the interpolation using only AQS data compared to using only CASTNET data in Fig. S2 in the Supplement. We present the bias (= AQS minus CASTNET) for the 25th, 50th, and 95th percentiles calculated using independent sampling with respect to years 2000-2009. For the comparison, we calculate Q P values for each data set and apply a mask using a cutoff of 0.10 rather than 0.67 to examine a larger area. In addition, this value of Q P corresponds to having one station at a distance of 250 km (i.e., the station is representative of a ∼ 5 • × 5 • grid cell). This figure shows that the AQS interpolation is systematically lower than the CASTNET one for almost all locations and percentiles, particularly over California and from the central plains east to New York City. The bias is least for the most polluted times (95th percentile). Similar to the EMEP-AirBase comparison, CASTNET sites are located in rural areas while AQS sites are generally in or near populated areas, and thus we believe this difference is due to the titration of O 3 by NO x emissions and then the cumulative production of O 3 as polluted air disperses.
Overall, these comparisons show excellent agreement across the networks, particularly in the high-O 3 events. Further comparisons of the AirBase and EMEP networks and the AQS and CASTNET networks could use a smaller mask with higher-quality score and focus on exact-day matches (concurrent sampling) as we do with the CTM hindcasts below.

UCI CTM simulation of years 2005-2006
We use the gridded daily O 3 observations described above to evaluate the UCI CTM. This model is a tropospheric CTM driven by meteorology from the European Centre for Medium-Range Weather Forecasts (ECMWF) Integrated Forecast System. The model is configured as described by Prather (2010, 2012a, b). Simulations are 1 • × 1 • resolution with 40 vertical layers, which is amongst the highest resolution for current global chemistry models, and covers 2005-2006, which is the duration of the high-resolution meteorological fields. The lowest model layer is about 80 m thick, and we use that layer-mean value as the surface O 3 concentration. MDA8 values are calculated from hourly simulated mole fractions in the same way as the observations. As noted above, the MDA8 most often occurs during the afternoon, which coincides with periods of a deep convective boundary layer and avoids problems with the poorly modeled nighttime boundary layer (Lin et al., 2008b;Lin and McElroy, 2010). The present model configuration was designed for studies of stratosphere-troposphere exchange, rather than for surface air quality analysis. As a result, emissions are specified monthly, based on the Quantifying the Climate Impact of Global and European Transport Systems (QUANTIFY) inventory (Hoor et al., 2009), and do not account for daily, weekly, or monthly cycles. Because the surface O 3 simulation has not been optimized, the CTM performance described below may be similar to chemistry-climate models that are used for present to future scenarios.

Evaluating the central tendency of O 3 in models
Many global chemistry models, including the UCI CTM, predict surface O 3 concentrations that are higher than observations (Dawson et al., 2008;Nolte et al., 2008;Zanis et al., 2011;Appel et al., 2012;Lamarque et al., 2012;Rasmussen et al., 2012). The CTM grid-cell O 3 averaged over years 2005-2006 is larger than observed everywhere for both US and EU, in both summer and winter (see Fig. 6; Table S2 Fig. 6a-b) is +19 ± 6 ppb (standard deviation across the grid cells) for the US and +18 ± 5 ppb for EU. The high-latitude background air (northern EU, upper Midwest US) has only a small bias (5-15 ppb); but air coming in from the mid-latitude oceans (east and west coast US, southern EU) has a higher bias (20-30 ppb) and extends beyond just polluted regions. The winter domain model correlation coefficient (MCC) derived from the daily time series of MDA8, shown in Fig. 6e-f, shows relatively good model hindcasting with an average MCC of 0.47 ± 0.13 for the US and 0.61 ± 0.10 for EU. MCC is greatest for the most part where Q P is large and lowest in coastal areas. For wintertime, most of the variability is driven synoptically by large-scale gradients in background O 3 .
The summer domain average MB (Fig. 6c-d) is larger than in winter: +30 ± 14 ppb for the US and +29 ± 8 ppb for EU. Here the largest biases are often in polluted regions, like the Los Angeles basin and the Chicago-to-New York corridor, and the easternmost part of the EU domain. This pattern indicates exaggerated photochemical production of O 3 in the model, possibly a consequence of NO x plumes being spread over the 100 km model grid or other nonlinear interactions involving hydrocarbons and NO x (Lin et al., 2008b;Pusede and Cohen, 2012;Rasmussen et al., 2012). Supporting this hypothesis, the model's summertime bias for the US has a similar pattern to our measure of pollution buildup (95th minus 25th percentile, Fig. 4e, the two maps have a correlation coefficient, r = 0.66). For EU, this conclusion is less obvious (Fig. 4f, r = 0.20). In terms of MCC, the verisimilitude of the model hindcast of daily summertime pollution is quite good (Fig. 6g-h) because in this case the variability is driven synoptically by buildup of regional pollution: MCC = 0.60 ± 0.16 for the US and 0.55 ± 0.19 for EU. In addition, the bias for each month of the year at three representative percentiles (84th, 50th, and 16th) can be derived from Table S2 in the Supplement.

Developing objective measures of model biases
While evaluation of the central tendency of a model provides an important test and can be used to identify bias in either hindcasts or climate simulations, it is the distribution of extremes, both high and low, that we want our climate models to simulate accurately. The lows tell us about baseline (cleanair) O 3 , and the highs show the efficiency of O 3 production from the local emissions. Here we examine the distribution of MDA8, combining the daily gridded US and EU values for a season over the 2 years 2005-2006 from both observations and the CTM hindcast. The probability distribution functions (PDFs) for winter (DJF) and summer (JJA) months are shown in Fig. 7. The observations, sorted into percentile bins (0-5 %, 5-10 %, etc.) calculated separately for each grid cell and plotted relative to the median, are shown in red; the CTM values, sorted independently of the observations, are in blue; and the CTM values sorted according the observed percentiles (concurrent sampling) are in green. For concurrent sampling, the CTM values are averaged for exactday matches for each day and location of the observations that fall in that percentile bin. In a perfect model, the green and red curves would match, meaning that the CTM predicts changes relative to the median at the right time and place. The blue curve treats the CTM effectively like a climate simulation and does not try to locate the high-O 3 periods over the correct cells at the correct time. Because the CTM hindcast has errors, the sorting by observed percentiles will always result in a shallower curve, which may not even be monotonic. From Fig. 7 we conclude (correctly) that during summer the CTM has a uniform bias of +30 ppb over the full range about the median (−15 to +20 ppb), but that during winter it has serious errors beyond the median bias of +17 ppb probably related to the baseline tropospheric O 3 . If we had done this as a climatology comparison, we would have completely reversed this diagnosis. We show maps of model bias as calculated using independent and concurrent sampling and their difference at five representative percentiles (5th, 25th, 50th, 75th, and 95th) for the US and EU in Figs. S3 and S4 in the Supplement, respectively. Biases at the 5th percentile calculated using independent sampling are 7 ± 3 ppb (5 ± 2 ppb) less than concurrent sampling for the US (EU); however for increasing percentiles the trend reverses, with biases for independent sampling at the 95th percentile 9 ± 5 ppb (8 ± 4 ppb) greater than concurrent sampling for the US (EU). We conclude that O 3 PDFs simply cannot be used in comparing observations with climate models.

Identifying and characterizing extreme events
To determine if air quality extreme (AQX) events involving high O 3 concentrations are changing with climate, we must be able to characterize those AQX events observed today and demonstrate that global chemistry models can reproduce them. As demonstrated for the UCI CTM above, surface O 3 concentrations in global chemistry models are often biased high, with higher biases often occurring during peak pollution episodes, but there is skill in hindcasting pollution variability. These biases hinder the ability to predict AQX events based strictly on absolute concentrations (Dawson et al., 2008;Nolte et al., 2008;Zanis et al., 2011). We define AQX events based on the local PDF of O 3 concentrations, rather than based on exceeding a concentration threshold. This enables us to identify linked extreme events whose absolute magnitudes evolve over space and time. For example, Fig. 8 shows daily MDA8 O 3 for June 2002 in four grid cells in the Midwest and eastern US (Chicago, IL; Cincinnati, OH; New York, NY; and rural Virginia). The time series are highly correlated across these sites, but the peak magnitudes differ across sites. In Chicago, MDA8 values above 67 ppb exceed the local 97.3 percentile and frequently occur a few days before local maxima in New York and Virginia, due to west-to-east motion of weather systems. If extremes were identified based on an absolute threshold (e.g., 75 ppb), then the peak values in Chicago might not be labeled as extremes, and their connection to extremes in the eastern US might be overlooked.

Defining individual, grid-cell level ozone pollution extremes
We define the threshold value for AQX events as a frequency (return time) based on the local climatology. This is shown in Fig. 8  arbitrary, and such choices can have undesirable results in some cases (e.g., Coles, 2001). While the top 2.7 % of O 3 MDA8 may seem extreme, most of these events occur during the summer, and hence the AQX events are essentially the upper 10 % of summer days. In general, the wider the range for defining an extreme event, the easier it will be for the model to simulate.

Skill of the CTM
We define the skill of the CTM for each grid cell as the percentage of events that match the day of the observed AQX events. With this definition a random model is expected to correctly identify 2.7 % of events. This metric does not take into account the geographic pattern or persistence of AQX, for which we apply clustering algorithms (see Sect. 4.4). Skill here is calculated over all months of both years (2005)(2006), although almost all AQX events occur from May to September. Figure 9 shows the geographic pattern of CTM skill for US and EU domains. For the US it is 24.4 ± 12 % (standard deviation across grid cells) and a min-to-max range of 0 to 65 % for the grid cells (Fig. 9a). The CTM skill was slightly better for EU: 32.2 ± 17 % (Fig. 9b). For the wider AQX threshold of 94.5th percentile, the skill increases as expected and the standard deviation is reduced: 35.6 ± 11 % for the US and 37.5 ± 14 % for EU. While CTM skill at individual grid cells in the US shows no distinct pattern, that in EU shows a strong east-west trend, with significantly higher skill to the west. These patterns of skill are evident for both threshold choices with correlations (R 2 ) between them of 0.86 for the US and 0.87 for EU. The east-west gradient in EU, as well as the lack of pattern in the US, can partly be understood from the relationship between skill and Q P . Low CTM skill is caused by model errors as well as errors in observations and interpo-lation. As shown in Fig. S5 in the Supplement, the CTM skill is largest in grid cells with large Q P and small interpolation errors.

Organized episodes of AQX events
The AQX events often occur as clustered, multiday episodes with spatial extents of more than 1000 km (note that an event is a single identified AQX event and an episode is a grouping of AQX events). Figure 10 shows an example of one of the larger episodes of the 2005-2006 period for EU, 3-8 July 2006. The episode, although not completely shown, is one of the largest observed, with a size of 1500 × 10 4 km 2 -days, and also the largest in the CTM hindcast, at 1700 × 10 4 km 2days (10 4 km 2 is our basic areal unit since our grid resolution is 1 • ). The skill of the CTM on these 6 days was 75.4 %, with both data sets showing the episode's structure and trajectory. These extreme events are connected in space-time and can be reproduced in a hindcast by a global model. These attributes provide an opportunity to develop a climatology of extreme ozone episodes (e.g., areal extent, duration, intensity, seasonal cycles) that can be used as metrics to test global chemistry climate models' (GCCMs) future climate simulations.
The size of the largest AQX episodes (defining an episode as connected events as in Fig. 10) is driven by a combination of meteorology as well as regionally connected emissions and active photochemistry. To objectively identify these episodes we use an agglomerative hierarchal cluster analysis. Ideally, the clustering algorithm will connect AQX events occurring within a large, slow-moving, stagnant, highpressure system over several days. Locations and times of AQX events are provided to the clustering algorithm, which then groups them into clusters that we call AQX episodes. The linkage criteria that define the clusters are flexible, and  we choose AQX events to be clustered if they are within a predefined cutoff in both space and time. We use the Chebyshev (maximum coordinate difference) distance metric and the single (nearest-neighbor) linkage criterion. We prescribe a cutoff value of 1 (i.e., events are not connected at greater than 1 • and 1 day ahead or behind). We recognize two obvious limitations to using this linkage method: (1) we have essentially considered time as another dimension in space (i.e., 1 • = 1 day), and (2) geographic distance between two grid cells varies with latitude and is not accounted for in the clustering. We consider the former to be of no consequence since a time separation cutoff of less than 1 day is not possible using daily MDA8 values to identify AQX events. Also, a larger cutoff value would be unfavorable since events could be statistically linked even if they occurred at the same grid cell and were separated by a full day. We avoid problems associated with latitudinal variations by developing statistical measures that are independent of resolution (see Sect. 5.2).
Since we want to characterize AQX episodes by their size, effectively a measure of their areal extent (km 2 ) and duration (days), the robustness of the clustering algorithm, particularly the linkage across days, needs to be examined. Most episodes showed a progression of area vs. time that resembled a normal distribution. Occasionally episodes resemble Table 2. Domain mean number of air quality extreme events (AQX) defined for the grid-cell interpolated MDA8 O 3 series and the MDA8 O 3 concentration (ppb) corresponding to the 84th, 50th, and 16th percentiles for each month of the year and day of the week for the 2000-2009 observations in the US and EU. The 84th-and 16th-percentile values are given relative to the 50th percentile. Correlation coefficients (R 2 ) are defined with respect to the number of AQX events per month of the year or day of the week. −10.5 −10.8 −11.3 −11.4 −11.5 −11.3 −11.0 0.00 a multi-peaked or bimodal distribution. In our first algorithm these bimodal episodes were counted as a larger, single episode, but human discernment identifies them as two different episodes adjoined by only a small number of events. Our revised algorithm defines a cutoff in order to separate these dangling episodes. For each episode identified with the primary algorithm, we calculate the area of the events for each day and the area of events that are shared with the previous day (i.e., the same grid cell on 2 consecutive days). If the ratio of the shared area divided by the total area of that day is less than 0.10, we truncate the episode at the previous day and start a new episode on the current day. We do not apply this secondary algorithm to the first 2 or last 2 days of an episode, to provide flexibility for formation and dissipation. In addition, this detaching can occur more than once as we follow the evolution of an episode.

Developing climatologies
The grid-cell-average statistics for MDA8 developed here provide a climatology of surface O 3 that can be used to test and evaluate CCMs. This approach holds promise given that one global CTM has skill in hindcasting specific years and events in spite of some large systematic errors in surface O 3 abundance. Here we seek to develop climate records for surface O 3 over the US and EU that can be used to improve both CTMs and CCMs and to develop confidence in CCM projections of changing air quality in a warming climate. First, we develop statistics for the basic cycles of O 3 over a week, a season, and a year, using a decade of observa-tions (Sect. 5.1). These statistics present a useful climatology for testing the means and perhaps standard deviations (see Chang and Hanna (2004) for more examples), but extreme high-and low-probability events are not so useful as a climatology (Sect. 3.2). The characterization of AQX events as large-scale, multiday episodes is investigated with clustering algorithms (Sect. 5.2), and we develop climate statistics on the scale of these episodes as a new data set to evaluate CCMs (Sect. 5.3) and opening a novel test of whether climate change alters theses extreme episodes.

Weekly and annual cycles
The well-known weekly and annual cycles (Bruntz et al., 1974) in MDA8 O 3 concentrations are summarized for our decadal data sets in Table 2, where we combine typical measures (16th, 50th, 86th percentiles in ppb) with AQX frequencies (based on 100 per decade). Higher percentiles are of interest, but then the geographic patterns need to be examined. The table gives an average over the entire domain (US or EU), and the results for each grid cell or region can be derived from the supplementary data, but are not shown here. The day-of-the-week and month-of-the-year statistics include a decade of observations (years 2000-2009). The direct comparison with the CTM, for weekly and annual cycles using only statistic from years 2005-2006, is in the supplementary material (Table S2 in the Supplement) and shows excellent agreement, except for the weekly cycle, an expected result (see below). For Table 2, the annual cycle of the number of AQX events in the US follows a normal distribution with most events identified in June, while in EU the cycle is slightly weighted towards spring months. Similar patterns are seen in the 84th-and 50th-percentile values, while the highest values in the 16th percentile are slightly weighted towards the spring. These MDA8 values corresponding to these percentiles show excellent agreement with the monthly AQX frequencies. For the 2005-2006 case (Table S2 in the Supplement), July dominates in the EU observations due to the 2006 summer having 14 out of 20 of the events, while in the CTM June had the most, with 2006 having slightly less events than the observations at 12 out of 20 events.
The weekly cycle is also evident in both observational data sets. The largest values of AQX events, the 84th percentile, and the 50th percentile, generally occur at the end of the week (Friday, Saturday, Sunday), a phenomenon termed "the weekend effect" with lower values in the beginning of the week Karl, 1978;Tonse et al., 2008;Pierce et al., 2010). For the 16th percentile, the trend is less obvious. The 84th-percentile values show excellent agreement with the day-of-week AQX frequencies. As expected, we did not see significant evidence of a weekly cycle in the CTM, as there is not a parameterization for the day of the week within the model. The mean skill of the CTM was generally higher for months and days that had higher combined numbers of events. Although seemingly trivial, this result provides us with assurance that the CTM is accurately representing the mechanisms responsible for the ozone episodes' formation and not just representing general interannual cycles.
In Table 3, the AQX frequencies for each year clearly show the extraordinary 2003 and 2006 summer heat waves in Europe, as well as a declining number of events throughout the decade (more evident for the US than EU), associated with reductions in criteria pollutants like NO 2 (see www. epa.gov/airtrends/nitrogen.html and www.epa.gov/airtrends/ ozone.html; Hudman et al., 2009). We also show the annual mean summertime (June, July, August) MDA8 concentrations from our interpolated product and the raw station data, both of which show excellent agreement with the annual AQX values.

Size distribution of extreme episodes
We define the size of an episode as the integral of AQX area over time (km 2 -days). The area of a low-latitude grid cell in the US is about 10 4 km 2 , while that in EU northern latitudes is about 0.6 × 10 4 km 2 . From size we can estimate two additional metrics -mean daily areal extent (km 2 ) and duration (days) of the episode. Since we only want the effective duration (i.e., the time frame that includes the majority of the episode), we do not take the total duration from first to last day. Instead, we define the duration of the peak episode to be 2 times the weighted standard deviation of the time indices, where the weight for each time index is the areal extent of the episode on that day. This method reduces the effect of the tails of the episode (early and late days with few events), providing a more robust measure of the duration of extreme pollution. The mean daily areal extent is simply the total size divided by the duration. Finally, we define the mean episode size, (S), over a given time frame (e.g., individual years, full decade) as the weighted geometric mean of AQX episodes: where n is the number of episodes and S i is the size of the episode. Equation (6) was chosen over the simple arithmetic mean to reduce the influence of the numerous small episodes while giving more weight to larger episodes. The majority of AQX events are grouped into large-area, multiday clusters that we define as AQX episodes. The complementary cumulative distribution function (CCDF = 1 minus cumulative distribution function) of the percentage of the total areal extent of all events as a function of episode size is shown in Fig. 11. For years 2005-2006 and gridded US observations, about 74 % of all events occurred in episodes greater than 100 × 10 4 km 2 -days and about 31 % in episodes greater than 1000 × 10 4 km 2 -days; for the CTM, the corresponding fractions are 66% greater than 100 × 10 4 km 2 -days and 37 % greater than 1000 × 10 4 km 2 -days (Fig. 11a). For years [2005][2006] and gridded EU observations the fractions are 84 and 67 %, respectively; for the CTM, the fractions are 73 and 42 %, respectively (Fig. 11b). In EU, the events are clustered into larger-size episodes. Figure 11 also shows that the decadal climatology (years 2000-2009) of episode sizes (green) is quite different from the 2 yr climatology (blue) that overlaps with the CTM hindcast. Thus, interannual variability is an important factor that must be considered, but interannual variability is also an important diagnostic that provides a key test for the CCMs as well as a metric that can help assess the significance of changes between two different decades. This is especially evident when each year's individual CCDF is examined (see Fig. S6 in the Supplement). In addition to climate variability in AQX episodes, there is the problem of stationarity in the observations due primarily to continuing mitigation of emissions. For the US, a clear pattern of decreasing episode sizes for successive years in the decade can be seen, consistent with reductions in precursor emissions. For EU, this pattern is less apparent; however the standout features are the CCDFs for 2003 and 2006, which have much larger episodes than other years. The annual number of AQX events and (S) values support this conclusion, as seen in Table 3.
The sensitivity of these diagnostics to grid resolution needs to be determined as we have differing resolution across CCMs and the climatology is a useful model diagnostic only if it is robust across different model resolutions. We create a 2 • × 2 • data set (typical of CCM resolution) using simple means of the MDA8 concentrations from the 1 • × 1 • observational data set. AQX events and episodes are defined as Table 3. Climatology of O 3 air quality and extreme episodes (AQX) observations over the US andEU (2000-2009). Each grid cell has AQX events defined as the 100 worst days per decade, except for AQX yr , which is normalized to have 10 events per year. The mean AQX size (S) ((S) yr for the 10-events-per-year case) is computed from Eq. (6) after the clustering algorithm that couples nearest neighbors and successive days, with units of 10 4 km squared days (km 2 d), where 10 4 km 2 is about a 1 • × 1 • grid cell. Average summertime (JJA) MDA8 O 3 (ppb) from the grid-interpolated data (grid) is area weighted, but the station average (station) is raw with all stations equally weighted. The mean (µ) and standard deviation (σ ) of the annual values over the decade are given. Correlation coefficients (R 2 ) are defined with respect to the number of AQX events per year. Using the stations' redundancy weightings derived here gives a slightly greater R 2 , but still less than that for the gridded O 3 . before (note that the clustering cutoff distance is essentially 2 • = 1 day). The resulting episode size CCDFs are shown in Fig. 11 (red) and are extremely similar to the 1 • × 1 • case. This is encouraging for CCM comparisons. From our 1 • × 1 • CTM simulation (black) we find too many small episodes, but the correct likelihood for the larger episodes that comprise about 50 % of all AQX events. This test does not use the hindcast, exact-day matching and thus should be a robust climate statistic that can test CCMs in the CMIP5 archive.

Developing climate statistics of AQX episodes
The episode size distributions in Fig. S6 in the Supplement show clear differences across the years; however we need an objective measure of these differences. The Anderson-Darling (AD) test (Anderson and Darling, 1952) compares two CDFs (equivalently CCDFs) and gives a confidence level that they occur from the same underlying and unknown distribution (the AD null hypothesis). The AD test is nonparametric, distribution free, does not require normality, and it is more sensitive to differences in the tails of the distribution than the widely used Kolmogorov-Smirnoff test (Engmann and Cousineau, 2011). We compare the distributions in Fig. 11 for episodes larger than 10 × 10 4 km 2 -days (10 to 16 connected grid cells) since we are mostly interested in the largest episodes and, further, more than 90 % of the events are in episodes of size greater than this. For the US, the CTM hindcast was found to be statistically different (p < 0.05) from the observations, while for EU both distributions are the same (p < 0.05).
By defining AQX events as the 100 worst days per decade, we can quantify interannual variability in the number of events or large episodes per year. If we wish to ascertain whether individual years have differences in their pollution episodes in terms of areal extent or duration, then the events need to be renormalized (i.e., 10 worst days per year). In the 100-per-decade case, those years with more events will more likely to have bigger episodes, with all else being equal. This can easily be seen by the CCDFs in Fig. S6 in the Supplement and the (S) values in Table 3. Even when each year is forced to have the same number of events, the CCDFs for each of the years are not similar (see Fig. S7 in the Supplement). Using these renormalized AQX episode size distributions, we test if we can statistically identify "good" and "bad" years (based on row one of Table 3) by comparing the individual years to one another. The AD test shows that, in EU, year 2006 (a relatively bad year) was statistically different from several years (2000,2001,2002,2004,2005) (Table 3, denoted (S) yr for the 10-per-year case) also varies from year to year and shows a strong agreement with the annual number of AQX events in the 100-per-decade case. This agreement provides strong evidence that the severity of a given year is largely dependent on its meteorology, since all years' values of (S) yr are derived using the same number EU OBS − 10yr @ 1° x 1°E U OBS − 2yr @ 2° x 2°E U OBS − 2yr @ 1° x 1°E U CTM − 2yr @ 1° x 1°F igure 11. Complementary cumulative distribution function of the percentage of the total areal extent of all individual AQX events as a function of AQX episode size (10 4 km 2 -days) they are clustered into for the (a) US and (b) EU. Results are shown for the 2 yr observations at 1 • and 2 • , the CTM at 1 • , and the 10 yr observations at 1 • . Note: only latitudes <65 • N were used for the 10 yr EU OBS. of events. These tests, among others to be further developed, provide us with a measure of the interannual variability of meteorologically driven AQX episodes and thus allow us to test different decades from the ACCMIP climate simulations to detect a shift in such episodes that falls outside the expected variations.

Conclusions
In evaluating a future scenario for air quality, one can identify four major contributing factors: (1) global emissions that alter atmospheric composition and thence baseline levels (lowest percentiles) of near-surface O 3 and particulate matter (PM); (2) global changes in climate that also alter these baselines (e.g., temperature, water vapor, convection, lightning, biogenic emissions); (3) climate-driven changes in the meteorological regimes over polluted regions that lead to AQX episodes; and (4) changes in the efficacy of local emissions to generate pollution within a governance region (e.g., air quality management district, an EU country). While these factors are all part of a coupled system, an integrated model that combines all would be almost impossible to verify. Thus an assessment approach would be to evaluate each of them separately using observations and an ensemble of models (e.g., HTAP, 2010; Kirtman et al., 2013). This paper focuses on factor (3), providing clear measures of bias and skill in global chemistry models run in hindcast mode, and developing climatologies that can be used to test climate models and to detect a climatic shift in AQX episodes.
The approach developed here establishes a reliable method for gridding the air quality station observations so that direct comparison with global atmospheric chemistry models can be made. We then examine climatologies of surface ozone (percentiles, seasonality, probability distributions, AQX episodes) based on the observations and use them to test a chemistry-transport model (UCI CTM) run in hindcast mode, attempting to simulate each day's MDA8 O 3 concentrations for the years [2005][2006]. Surprisingly, we find that the often-used test of the probability distribution of MDA8 O 3 values over a region gives different results when testing a hindcast model than when treating the identical model simulation as climate statistics. Nevertheless, comparing the gridded observations directly with the hindcast MDA8 O 3 values clearly defines model deficiencies in terms of biases, baseline values (lowest percentiles at ocean boundaries), seasonality, and the ability to predict the relative increase in O 3 during high-pollution events. When used to test a chemistry-climate model, more caution is needed.
AQX events are defined here in terms of the return time of such events for each cell (i.e., as in climate extremes) rather than as an absolute O 3 threshold. Such definition clearly identifies large-scale pollution episodes associated with stagnant meteorological regimes. The AQX events (10 worst days per year = 97.3 percentile) contain a disproportionately large fraction of the excess MDA8 O 3 . We test the ability the UCI CTM to hindcast the 1000 km, multiday giant AQX episodes that include most of the individual, cell-based AQX events. Although we have no formal error estimate of the gridding procedure, we feel our quality of prediction (Q P ) provides a similar quantity, as shown with both the observations themselves (Fig. 2) and with the ability of the UCI CTM to hindcast AQX events (Fig. S4 in the Supplement).
We also tested our interpolation algorithm by applying random noise to the raw station data and then recalculating the cell-average values. This analysis, although not shown, revealed the CTM's skill did not significantly degrade until the amplitude reached ±10 ppb.
Our goal of providing observational validation of the air quality simulated by the chemistry-climate models is centered on the size and duration of AQX episodes and their interannual variability. This is a bias-free test as shown with the UCI CTM, and should be able to identify when more bad years occur in a decade under a future climate, independent of global changes in baseline levels of pollutants. Our statistics will be used to test the chemistry-climate models used in the recent IPCC assessment (CMIP5/ACCMIP).
One advantage of the approach here is that it can be readily applied to satellite observations. The regridding allows for somewhat sparse measurements resulting from day-today cloud obscuration to be filled to a regular grid with a measure of the quality of the prediction (Q P ). Our definition of AQX events takes into account natural gradients in aerosol optical depth or tropospheric ozone column.
Uncertainties and unresolved issues remain. Although Q P provides a measure of the cell-averaged data, it still lacks a formal uncertainty estimate. The decade analyzed here (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009) has an apparent trend in O 3 concentrations driven at least in part by reductions in precursor emissions (Turner et al., 2013). For climate statistics, this nonstationary pattern needs to be recognized and if possible corrected for. One could remove a linear trend from the station observations prior to their use in the interpolation or calculate a fit to the O 3 precursor emissions over the decade and adjust the data year by year. In terms of AQX events, one could define them on a year-by-year basis and look at size only; however the absolute interannual variability over a decade remains a very important test of the models.
The Supplement related to this article is available online at doi:10.5194/acp-14-7721-2014-supplement.