Insights into the deterministic skill of air quality ensembles from the analysis of AQMEII data

. Simulations from chemical weather models are subject to uncertainties in the input data (e.g. emission in-ventory, initial and boundary conditions) as well as those in-trinsic to the model (e.g. physical parameterization, chemical mechanism). Multi-model ensembles can improve the forecast skill, provided that certain mathematical conditions are ful(cid:2)lled. In this work, four ensemble methods were applied to two different datasets, and their performance was compared for ozone (O 3 ), nitrogen dioxide (NO 2 ) and particulate matter (PM 10 ). Apart from the unconditional ensemble average, the approach behind the other three methods relies on adding optimum weights to members or constrain-ing the ensemble to those members that meet certain conditions in time or frequency domain. The two different datasets were created for the (cid:2)rst and second phase of the Air Quality Model Evaluation International Initiative (AQMEII). The methods are evaluated against ground level observations col-lected from the EMEP (European Monitoring and Evaluation Programme) and AirBase databases. The goal of the study is to quantify to what extent we can extract predictable signals from an ensemble with superior skill over the single models and the ensemble mean. Veri(cid:2)cation statistics show that the deterministic models simulate better O 3 than NO 2 and PM 10 , linked to different levels of complexity in the represented processes. The unconditional ensemble mean achieves higher skill compared to each station’s best deterministic model at no more than 60 % of the sites, indicating a combination of members with unbalanced skill difference and error dependence for the rest. The promotion of the right amount of accuracy and diversity within the ensemble results in an average additional skill of up to 31 % compared to using the full ensemble in an unconditional way. The skill improvements were higher for O 3 and lower for PM 10 , associated with the extent of potential changes in the joint distribution of accuracy


Introduction
Uncertainties in atmospheric models, such as the chemical weather models, whether due to the input data or the model itself, limit the predictive skill. The incorporation of data assimilation techniques and the continued effort in understanding the physical, chemical and dynamical processes result in better forecasts (Zhang et al., 2012). In addition, ensemble methods provide an extra channel for forecast improvement and uncertainty quantification. The benefits from ensemble averaging arise from filtering out the components of the forecast with uncorrelated errors (Kalnay, 2003).
The European Centre for Medium-Range Weather Forecast (ECMWF) reports an increase in forecast skill of 1 day per decade for meteorological variables, evaluated on the geopotential height anomaly (Simmons, 2011). The air quality modelling and monitoring has a shorter history that does not allow a similar adequate estimation of such trends for the numerous species being modelled. Moreover, the skill changes dramatically from species to species and is strongly connected to the availability of accurate emission data. Results for ozone suggest that medium-range forecasts can be performed with a quality similar to the geopotential height anomaly forecasts (Eskes et al., 2002). Aside from the continuous increase in skill due to the improved scientific understanding, harmonized emission inventories, more accurate and denser observations, as well as ensemble averaging, an extra gain of similar magnitude can be achieved for ensemble-based deterministic modelling using conditional averaging (e.g. Galmarini et al., 2013;Mallet et al., 2009;Solazzo et al., 2013).
Ideally, for continuous and unbiased variables, the multimodel ensemble mean outscores the skill of the deterministic models provided that the members have similar skill and independent errors (Potempski and Galmarini, 2009;Weigel et al., 2010). Practically, the multi-model ensemble mean usually outscores the skill of the deterministic models if the evaluation is performed over multiple observation sites and times. This occurs because over a network of stations, there are some where the essential conditions (e.g. the skill difference between the models is not too large) for the ensemble members are fulfilled, favouring the ensemble mean; for the remaining stations, where the conditions are not fulfilled, local verification identifies the best model, but generally no single model is the best at all sites. Hence, although the skill of the numerical models varies in space (latitude, longitude, altitude) and time (e.g. hour of the day, month, season), the ensemble mean is usually the most accurate spatio-temporal representation.
One of the challenges in multi-model ensemble forecasting is the processing of the deterministic model datasets prior to averaging in order to construct another dataset for which its members ideally constitute an independent and identically distributed (i.i.d.) sample (Kioutsioukis and Galmarini, 2014;Bishop and Abramowitz, 2013). This statistical process favours the ensemble mean at each observation site. Two basic pathways exist to achieve this goal: model weighting or model sub-selecting. There are several methods to assign weights to ensemble members, such as the singular value decomposition (Pagowski et al., 2005), dynamic linear regression Djalalova et al., 2010), Kalman filtering (Delle Monache et al., 2011), Bayesian model averaging (Riccio et al., 2007;Monteiro et al., 2013) and analytical optimization (Potempski and Galmarini, 2009), while model selection usually relies on the quadratic error or its proxies in time (e.g. Solazzo et al., 2013;Kioutsioukis and Galmarini, 2014) or frequency space . Kioutsioukis et al.: Insights into the deterministic skill of air quality ensembles 15631 The majority of those ensemble studies focus on O 3 , and only recently the studies also involve particulate matter (Djalalova et al., 2010;Monteiro et al., 2013).
In this work, we apply and intercompare both approaches (weighting and sub-selecting) using the Air Quality Model Evaluation International Initiative (AQMEII) datasets from phase I and phase II. The ensemble approaches are evaluated against ground level observations from the EMEP (European Monitoring and Evaluation Programme) and Air-Base databases, focusing on the pollutants O 3 , NO 2 and PM 10 that exhibit different levels of forecast skill. The differences between the multi-model ensembles of phase I (hereafter AQMEII-I) and phase II (hereafter AQMEII-II) originate from many sources, related to both the input data and the models: (a) the simulated years are different (2006 vs. 2010); therefore, the meteorological conditions are different. (b) Emission methodologies have changed, (c) boundary conditions are very different, (d) the composition of the ensembles is different, (e) the models in AQMEII-II use online coupling between meteorology and chemistry, and (f) the models may have been updated with new science processes apart from feedback processes. The uncertainties arising from observational errors are not taken into consideration.
In spite of these differences we consider the analysis of the two sets of ensembles revealing. In detail, the objectives of the paper are (a) to interpret the skill of the unconditional multi-model mean within AQMEII-I and AQMEII-II, (b) to calculate the maximum expectations in the skill of alternative ensemble estimators and (c) to evaluate the operational implementation of the approaches using cross-validation. The originality of the study includes (a) the comparison of several ensemble methods on pollutants of different skill using different datasets, (b) the introduction of an approach based on high-dimension spectral optimization, and (c) the introduction of innovative charts for the interpretation of the error of the unconditional ensemble mean with respect to indicators reflecting the skill difference and error dependence of the models as well as the effective number of models. Therefore, we carry out an analysis of the performance of different ensemble techniques rather than a comparison of the results from the two phases of the AQMEII activity.
The paper is structured as follows: Sect. 2 provides a brief description of the ensemble's basic properties through a series of conditions expressed by mathematical equations. In Sect. 3, the experimental setup is described. Results are presented in Sect. 4, where the skill of the deterministic models, the unconditional ensemble mean and the conditional ensemble estimators are analysed and intercompared. Conclusions are drawn in Sect. 5.

Minimization of the ensemble error
The notation conventions used in this section are briefly presented in the following section. Assuming an ensemble composed of M members (i.e. output of modelling systems) denoted as f i , i = 1, 2, . . . , M, the multi-model ensemble The weights (w i ) sum up to one and can be either equal (uniform ensemble) or unequal (non-uniform ensemble). The desired value (measurement) is µ. Assuming a uniform ensemble, the mean squared error (MSE) of the multi-model ensemble mean can be broken down into three components, namely, the average bias (first term), the average error variance (second term) and the average error covariance (third term) of the ensemble members (Ueda and Nakano, 1996): (1) The decomposition provides the reasoning behind ensemble averaging: as we include more ensemble members, the variance factor is monotonically decreasing and the MSE converges towards the covariance factor. Covariance, unlike the other two positive definite terms, can be either positive or negative; its minimization requires an ensemble composed by independent, or even better, negatively correlated members. In addition, bias correction should be a necessary step prior to any ensemble manipulation. More details regarding this decomposition within the air quality ensemble context can be found in Kioutsioukis and Galmarini (2014). In a similar fashion, the squared error of the multi-model ensemble mean can be decomposed into the difference of two positive-definite components, with their expectations characterized as accuracy and diversity (Krogh and Vedelsby, 1995): This decomposition proves that the error of the ensemble mean is guaranteed to be less than or equal to the average quadratic error of the component models. The minimum ensemble error depends on the right trade-off between accuracy (first term on the r.h.s. (right-hand side) of Eq. 2) and diversity (second term on the r.h.s. of Eq. 2). If the evaluation is applied to multiple sites, then Eqs. (1) and (2) should be replaced with their expectations over the stations. An error decomposition approach can also be applied to the spectral components (SC) of the observed and modelled time series. The data can be spectrally decomposed with the Kolmogorov-Zurbenko (kz) filter (Zurbenko, 1986), while the original time series can be obtained with the linear combination of the spectral components. Assuming the pollution data at the frequency domain yield N principal spectral bands, the squared error of the multi-model ensemble mean can be broken down into N 2 components Solazzo and Galmarini, 2016): This decomposition shows that the error of the ensemble mean could be split into the sum of N errors associated with different parts of the spectrum (first term), provided the spectral components are independent (the covariance term is zero). The minimization of the error at each spectral band can be achieved with another approach such as the decompositions presented in Eqs (1) and (2).
The three decompositions presented assume uniform ensembles, i.e. all members receive equal weight. For the case of a non-uniform ensemble, the MSE of the multi-model ensemble mean can be analytically minimized to yield the optimal weights, provided that the participating models are biascorrected (Potempski and Galmarini, 2009): where, w is the vector of optimal weights, K is the error covariance matrix and l is the unitary vector. In its simplest form, the equation assigns one weight for each model at each measurement site; more complicated versions, like multidimensional optimization for many variables (e.g. chemical compounds) at many sites simultaneously, are not discussed here.
Unlike the straightforward calculation of the optimal weights, the sub-selecting schemes make use of a reduceddimensionality ensemble. An estimate of the effective number of models (N EFF ) sufficient to reproduce the variability of the full ensemble is calculated as (Bretherton et al., 1999) where s i is the eigenvalue of the error covariance matrix. Theoretical evidence shows that the fraction of the overall variance expressed by the first N EFF eigenvalues is 86 %, provided that the modelled and observed fields are normally distributed (Bretherton et al., 1999). The highest eigenvalue is denoted as s m . It is apparent from the considerations above that the skill of the unconditional ensemble mean has the potential for certain advantages over the single members, provided some properties are satisfied. As those properties are not systematically met in practice, superior ensemble skill can be achieved through sub-selecting or weighting schemes presented in this section. An intercomparison of the following approaches in ensemble averaging is investigated in this work using observed and simulated air quality time series: -The unconditional ensemble mean (mme) is investigated.
-The conditional (on selected members) ensemble mean in time domain (mme<) is investigated. The optimal trade-off between accuracy and diversity (Eq. 2) is identified across all possible combinations of the available M models (Kioutsioukis and Galmarini, 2014). The number of members in the ensemble combination that give the minimum error will be used as the effective number of models (N EFF ) rather than their estimate based on the independent components of the ensemble (Eq. 5).
-The conditional (on selected members) ensemble mean in frequency domain (kzFO) is investigated. Following Eq.
(3), an ensemble estimator is synthesized from the best member at each spectral band . The original time series are decomposed into four spectral components (see Appendix), namely the intra-diurnal, diurnal, synoptic and longterm components, using the Kolmogorov-Zurbenko filter (Zurbenko, 1986).
-The conditional (on selected members) ensemble mean in frequency domain (kzHO) is investigated. It is an extension of the kzFO, where the spectral components of the ensemble estimator are averaged from N EFF members at each spectral band (rather than the best).
The skill of the models and the examined ensemble averages was scored with the following statistical parameters: (1) normalized mean square error (NMSE), i.e. the mean square error (MSE) divided by OM, where O and M are the mean value of the observation and the model respectively, (2) probability of detection (POD) and false alarm rate (FAR), i.e. the proportion of occurrences (e.g. events exceeding threshold value) that were correctly identified and the proportion of non-occurrences that were incorrectly identified, and (3) Taylor plots (Taylor, 2001), which summarize standard deviation, root mean square error (RMSE) and Pearson productmoment correlation coefficient in a single point on a twodimensional plot. AQMEII phase I: Standard boundary conditions, provided from GEMS project (Global and regional Earth-system Monitoring using Satellite and in situ data). Refer to Schere et al. (2012) for details. * Standard anthropogenic emissions and biogenic emissions derived from meteorology (temperature and solar radiation) and land use distribution implemented in the meteorological driver. Refer to Solazzo et al. (2012a, b) and references therein for details. AQMEII phase II: Standard boundary conditions, 3-D daily chemical boundary conditions were provided by the ECMWF IFS-MOZART model run in the context of the MACC-II project (Monitoring Atmospheric Composition and Climate -Interim Implementation) every 3 h and 1.125 spatial resolution. Refer to Im et al. (2015a, b) for details. Standard emissions, based on the TNO-MACC-II (Netherlands Organization for Applied Scientific Research, Monitoring Atmospheric Composition and Climate -Interim Implementation) framework for Europe. Refer to Im et al. (2015a, b) for details.

Setup: experiments, models and observations
The two AQMEII ensemble datasets have simulated the air quality for Europe (10 • W-39 • E, 30-65 • N) and North America (125-55 • W, 26-51 • N. Despite the common domains, the modelling systems across the two phases have profound differences. The simulation year was 2006 for AQMEII-I and 2010 for AQMEII-II; therefore, the two sets are dissimilar with respect to the input data (emissions, chemical boundary conditions, meteorology). Boundary conditions are obtained from GEMS (Global and Regional Earth-System Monitoring using Satellite and in situ data) in AQMEII-I and MACC (Monitoring Atmospheric Composition and Climate) in AQMEII-II. The air quality models of the second phase are coupled with their meteorological driver (chemistry feedbacks on meteorology), while those of the first phase are not. The participating models are also different. Detailed analysis of the emissions, boundary conditions and meteorology for the modelled year 2006 (AQMEII-I) is presented in Pouliot et al. (2012), Schere et al. (2012) and Vautard et al. (2012). For 2010 (AQMEII-II), similar information is presented in Pouliot et al. (2015),  and Brunner et al. (2015).
The participating models follow a restrictive protocol concerning the emissions and the meteorological and chemical boundary conditions. In AQMEII-I, meteorological models applied nudging to the NCEP GFS (National Centers for Environmental Prediction, Global Forecast System) meteo- Table 2. The statistical distribution of (a) the normalized mean square error (NMSE) of the best model (NMSE BEST ), (b) the ensemble average NMSE (<NMSE>) and (c) the skill difference indicator (NMSE BEST /<NMSE>). In addition, the coefficient of variation (CoV = standard deviation divided by the mean) of the number of cases where each model was identified as best is shown. All indicators were evaluated at each monitoring site for the examined species of the two AQMEII phases. rological analysis. In AQMEII-II, the simulations were run more in a way as if they were real forecasts; meteorological boundary conditions for the majority of the models were from the ECMWF operational archive (see Tables 1 and 2 in Brunner et al., 2015), and no nudging or FDDA (fourdimensional data assimilation) was applied. However, the driving meteorological data were analysis (but no reanalysis) for all simulations, with exception of the COSMO-MUSCAT run. Hence, the runs from AQMEII-II are more like forecasts than those from AQMEII-I. Recent studies with regional air quality models yielded that the full variability of the ensemble can be retained with only an effective number of models (N EFF ) on the order of 5-6 (e.g. Solazzo et al., 2013;Kioutsioukis and Galmarini, 2014;Marécal et al., 2015). The minimum number of ensemble members to sample the uncertainty should be well above N EFF ; for this reason, we focus on the European domain (EU) due to its sufficient number of models for forming the ensemble. Table 1 summarizes the features of the modelling systems analysed in this study with regard to O 3 , NO 2 and PM 10 concentrations in the EU. The modelling contribution to the two phases of AQMEII consists of 12, 13 and 10 models for O 3 , NO 2 and PM 10 respectively in AQMEII-I, while 14 members were available for all species in AQMEII-II. Several discrete simulations of WRF-Chem with alternative chemistry and physics configurations are included in AQMEII-II San José et al., 2015;Baró et al., 2015).
Following the statements of Sect. 2, each model was biascorrected prior to the analysis, i.e. its own mean bias over the examined 3-month period was subtracted from its modelled time series at each monitoring site. For each modelling system, its long-term systematic error is a known quantity estimated during its validation stage; therefore, the subtraction of the seasonal bias does not restrict the generality of the study. Actually, the requirement for bias removal is a necessary condition only for the weighted ensemble mean. In the results section we will address this issue and its effect on the skill of the ensemble estimators.
The observational datasets for O 3 , NO 2 and PM 10 derived from the surface Air Quality monitoring networks operating in the EU constitute the same dataset used in the first and second phases of AQMEII to support model evaluation. All monitoring stations are rural and have data at least 75 % of the time. The network is denser for O 3 (451 and 450 stations in AQMEII-I and II) for which there are as many monitoring stations as for NO 2 (290 and 337 stations in AQMEII-I and II) and PM 10 (126 and 131 stations in AQMEII-I and II) combined, with PM 10 having the fewest observations. Figure 1 compares the statistical distribution of all three species between the two AQMEII phases, through the cumulative density function composed from the mean value at each percentile of the observations. The Kolmogorov-Smirnov test CDF Concentration (ug m3 ) -1 Figure 1. The cumulative density functions of the observations (O 3 , NO 2 , PM 10 ) in the two AQMEII phases (Phase I: filled circles, Phase II: unfilled circles). Each bullet represents the median at the specific percentile. (Massey, 1951) yields that only the PM 10 distributions differ at the 1 % significance level. This results from the unavailability of data for France and the UK in AQMEII-II for PM 10 (station locations are shown in Fig. 3).

Results
In this section we apply the conceptual context briefly presented in Sect. 2 to investigate the effect of the differences in the ensemble properties within each of the two AQMEII phases (Rao et al., 2011) in the skill of the unconditional multi-model mean. The potential for improved estimates through conditional ensemble averages and their robustness is ultimately assessed.
From the station-based hourly time series provided, we analysed one season (3-month period) with continuous data and relatively high concentrations: for O 3 , June-July-August was selected, while September-October-November was used for NO 2 and PM 10 .

Single models
The distributions of each model's NMSE for O 3 , NO 2 and PM 10 over all monitoring stations are presented in Fig. 2 as box-and-whisker plots. On each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles respectively. The whiskers extend to the most extreme data points not considered outliers (i.e. points with distance from the 25th and 75th percentiles smaller than 1.5 times the interquartile range). Among the examined pollutants, the models simulate the O 3 concentrations better, as is evident from the axis scale. The highest variability in the skill between and within the models is observed for NO 2 .
The distribution of average NMSE at each station (<NMSE>) has a median on the order of 0.1 for O 3 and 0.5 for NO 2 and PM 10 for both phases ( Table 2). The application of the Kolmogorov-Smirnov test (Massey, 1951) to the <NMSE> distributions across AQMEII-I and AQMEII-II shows that there are no statistically significant differences in the <NMSE> distributions between the two ensemble datasets at the 1 % significance level. The same also applies for the statistical distribution of the minimum NMSE at each station (NMSE BEST ) at each monitoring station. Hence, despite the different modelling systems and input data, the <NMSE> and NMSE BEST distributions between AQMEII-I and AQMEII-II are indistinguishable for the three examined pollutants.
Aside from <NMSE> and NMSE BEST , we evaluate the percentage of cases each model identified as being "best" and calculate the coefficient of variation (CoV = std/mean) of this index for each ensemble. If models were behaving like i.i.d., the probabilities of being best would be roughly equal (∼ 1/M) for all models and the CoV would generally be well below unity for the examined range of ensemble members. As can be inferred from Table 2, the proportion of equally good models is higher for O 3 and NO 2 in the second dataset. Among the pollutants, the CoV of NO 2 exhibits the most dramatic change.

Pitfalls of the unconditional multi-model mean
The skill of the multi-model mean was compared to the skill of the best deterministic model, independently evaluated at each monitoring site (hereafter bestL). The geographical distribution of the ratio RMSE(mme)/RMSE BESTMODEL is presented in Fig. 3. The indicator does not exhibit any longitudinal or latitudinal dependence. Summary statistics indicate that the mme outscores the bestL at roughly half of the stations for O 3 (namely 52 and 49 for AQMEII-I and II) and at approximately 40 % of the stations for PM 10 (38 and 42). The same statistic for NO 2 varies considerably (39 and 64). The Kolmogorov-Smirnov test yields that the corresponding distributions (pI and pII) are different at the 1 % significance level, but the t test demonstrates that the mean of the distributions differ significantly only for NO 2 . The reason behind the skill of mme with respect to the bestL is investigated next with respect to the skill difference and the error dependence of each ensemble.
The skill difference between the best model and the average skill is inferred from the indicator NMSE BEST /<NMSE> (Table 2). High values of the indicator correspond to small skill differences between the ensemble members (desirable). The distribution of the NMSE BEST /<NMSE> at each station has a median on the order of 0.6-0.8, variable with respect to the dataset and the pollutant. The spread of the indicator, measured by its interquartile range, is higher for NO 2 and lower for O 3 .
The eigenvalues of the covariance matrix calculated from the model errors provide information on the member diversity and the ensemble redundancy (Eq. 5). Following the eigenanalysis of the error covariance matrix at each station separately and converting the eigenvalues to cumulative amount of explained variance, the resulting matrix is presented in a box and whisker plot (Fig. 4). The error dependence of the ensemble members is deduced from the explained variation by the maximum eigenvalue s m . Low values of the indicator correspond to independent members with small error dependence (desirable). The average variation explained by s m ranges between 65 and 79 %, taking the lower values for NO 2 . The spread of the indicator, measured by its interquartile range, is higher for NO 2 and lower for O 3 .
All species demonstrate smaller skill difference and higher error dependence in the AQMEII-II dataset. The Kolmogorov-Smirnov test yielded that the difference in the corresponding distributions of the indicators between AQMEII-I and AQMEII-II is significant at the 1 % level. However, it is the joint distribution of skill difference and error dependence that modulates the mme skill with respect to the bestL, as seen in Fig. 5. Shifts in the distributions of the indicators at opposite directions eventually cancel out, yielding no change in the mme skill. This case is observed for O 3 and PM 10 . For NO 2 , skill difference was improved more than error dependence was worsened, yielding a net improvement of mme in AQMEII-II.
The area below the diagonal in Fig. 5 corresponds to monitoring sites with disproportionally low diversity under the current level of accuracy. This area of the chart indicates high spread in skill difference and relatively highly dependent errors. This situation practically means a limited number of skilled models with correlated errors, which in turn denotes a small N EFF value, as demonstrated in Fig. 6. The opposite state is true for the area above the diagonal. It corresponds to locations that are constituted from models with comparable skill and relatively independent errors, reflecting a high N EFF value. This matches the desired synthesis for an ensemble.
The cumulative distribution of N EFF from the error minimization (i.e. the optimal trade-off between accuracy and diversity) across all possible combinations of M models at each site is also presented in Fig. 4 (solid line). At over 90 % of the stations, we do not need more than five members for O 3 , six members for PM 10 and six to seven members for NO 2 . Furthermore, from a pool of 10-14 models, the benefits of ensemble averaging cease after five to seven members (but not five to seven particular members across all stations).

Conditional multi-model mean
Following the identification of the weaknesses in the ensemble design, the potential for corrections through more so-phisticated schemes is now investigated. We consider the skill of the multi-model mean as the starting point, and we investigate pathways for further enhancing it through the non-trivial problem of weighting or sub-selecting. The optimal weights (mmW) are estimated from the analytical formulas presented in Potempski and Galmarini (2009). The sub-selection of members was built upon the optimization of either the accuracy-diversity trade-off (mme<) (Kioutsioukis and Galmarini, 2014) or the spectral representation of first-order components from different models (kzFO) . Another approach built upon higher order (namely, N EFF ) spectral components (kzHO) is also investigated. In this section we mark the boundaries of the possible improvements for different ensemble mean estimators applicable to the AQMEII datasets and their sensitivity to suboptimal conditions using cross-validation.
The global skill of all the single models and the ensemble estimators, evaluated at all stations, is presented in Fig. 7 in the form of Taylor plots. For O 3 , the deterministic models have standard deviations that are smaller compared to observations and a narrow correlation pattern (∼ 0.7) that is slightly deteriorated in AQMEII-II. For NO 2 , members with higher and lower variance than the observed variance exist in the ensemble, while the correlation spread becomes narrower in AQMEII-II and demonstrates a minor improvement. Last, simulated PM 10 from the deterministic models displays smaller standard deviation compared to observations with a wide correlation spread (0.3-0.6). The multi-model mean is always found closer to the reference point, in an area that incorporates lower error and increased correlation but at the same time generally low variance. The examined ensemble estimators (mmW, mme<, kzFO, kzHO) are horizontally shifted from mme; hence, they demonstrate even lower error and increased correlation and variance. Among them, the highest composite skill was found for mmW, followed by kzHO.
A comparison between the skill of the examined ensemble estimators versus the mme and the best single model is now conducted ( Table 3). The best single model is evaluated globally (bestG is the average across all stations) and locally (bestL is at each station separately). The former estimates the best average deterministic skill among the candidate models; the latter provides a useful indicator for controlling whether the anticipated benefits of ensemble averaging holds. The skill scores were evaluated against the guaranteed minimum gain of the ensemble (<MSE>), the ensemble mean (mme) and the best single model globally (bestG). The estimations calculated from the unprecedented AQMEII datasets (2 years of hourly measurements and simulations from two different ensembles of 10-14 models each at over 450 stations for three pollutants) allows the following interpretation: -The mme always achieves a lower error than bestG.
The advancement is higher for O 3 (9-22 %), followed by NO 2 (7-9 %), while the PM 10 demonstrate the least 1 Figure 2. Model skill difference via the NMSE. For each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles respectively. The whiskers extend to the most extreme data points not considered outliers and the outliers (points with distance from the 25th and 75th percentiles larger than 1.5 times the interquartile range) are plotted individually using the "+" symbol.
skill improvement (1-3 %). With respect to bestL, the mme generally attains similar or slightly higher MSE. Hence, the average error over multiple stations statistically favours the ensemble mean over the single models but the comparison at each site generally does not as it depends on the skill difference and the error dependence of the models.
-The skill score of mme over <MSE> (i.e. the guaranteed upper ceiling for the MSE of mme, from Eq. 2) ranges between 15 and 30 %, higher for NO 2 and lower for PM 10 . According to Eq.
(2), this number also represents the diversity as percentage of the accuracy. Therefore, aside from improving the single models, their combination in an ensemble confines the mme skill if their diversity is limited.
I. Kioutsioukis et al.: Insights into the deterministic skill of air quality ensembles -The skill score of the examined ensemble estimators (mmW, mme<, kzFO, kzHO) over <MSE> ranges between 25 and 50 %, higher for O 3 and NO 2 and lower for PM 10 . Among them, the improvement is higher for mmW and lower for mme< and kzFO. Thus, the promotion of accuracy and diversity within the ensemble almost doubles the distance to <MSE> compared to mme and results in an additional skill over the mme between 14 and 31 % (for mmW).
-The improvement of the ensemble estimator using selected N EFF members (mme<) over all members (mme) is illustrated in Fig. 8 in the context of skill difference and error dependence. The charts demonstrate no points below the diagonal, i.e. the sub-selection results in an ensemble constituted from models with comparable skill and relatively independent errors (compared to the full ensemble).
-The theoretical minimum MSE of mme for the case of unbiased and uncorrelated models (from Eq. 1) is far from being achieved from all ensemble estimators. The statistical distributions of the skill scores of the examined ensemble estimators (mmW, mme<, kzFO, kzHO) over mme are well bounded from higher than unity values to lower than unity values (Fig. 9). The only exception exists for roughly 10 % of the stations for all pollutants, where kzFO demonstrates higher MSE compared to mme. Unlike the other ensemble estimators, kzFO utilizes independent spectral components, each obtained from a single model, eliminating the possibility for "cancelling out" of random er-rors. All cases belonging to this 10 % of the samples (lower tail of the cdf) demonstrate high N EFF , where the benefits from unconditional ensemble averaging are optimal (Kioutsioukis and Galmarini, 2014). Conversely, for another 10 % of the stations (upper tail of the cdf), there is an abrupt improvement from the conditional ensemble estimators. Those cases demonstrate low N EFF , where the benefits from unconditional ensemble averaging are minimal.

Explained variation Explained variation
Explained variation Explained variation Figure 5. Interpretation of Fig. 3: the explanation of the mme skill against the best local deterministic model with respect to skill difference (evaluated from MSE BEST /<MSE>) and error dependence (evaluated from the explained variation by the highest eigenvalue).
The ability to simulate extreme values is now examined through the POD and FAR indices. Two thresholds were utilized for each pollutant, being 120 and 180 µg m −3 for O 3 , 25 and 50 µg m −3 for NO 2 , and 50 and 90 µg m −3 for PM 10 . The average 90th percentile across the stations was 129 and 117 µg m −3 (AQMEII-I and II) for O 3 , 30 and 26 µg m −3 for NO 2 and 52 and 33 µg m −3 for PM 10 (Fig. 1). Hence, the thresholds fall into the upper 10 % of the distributions, being even more extreme for PM 10 in AQMEII-II. The numbers in Table 4 give rise to the following inferences: -For O 3 and NO 2 , mme achieves somewhat higher POD than bestG at the lower threshold, but the order is reversed at the higher threshold. For PM 10 , bestG always performs better than mme for values exceeding the lower threshold. As we move towards the tail, the POD of bestG dominates over the mme. Thus, the ranking of mme and bestG at the extreme percentiles and on average (seen earlier) are opposite.
-The mme< generally achieves somewhat higher POD than bestL at the lower threshold, but the order is re-

Explained variation Explained variation
Explained variation Explained variation Figure 6. Like Fig. 5 but showing the N EFF with respect to skill difference and error dependence.
versed at the higher threshold. Over that level, kzFO and mmW are the only estimators with POD higher than bestL.
-As we move towards higher percentiles, the first-order spectral model (kzFO) has higher POD than the higherorder spectral model (kzHO) due to the averaging in the latter. In addition, the frequency domain averaging (kzHO) had slightly higher POD compared to the time domain averaging (mme<).
-The mmW, aside from its lower MSE, has the highest POD among all models and ensemble estimators.
-The variation of FAR was very small between all examined models and ensemble estimators.
The combination of the results from the average error and the extremes identify mmW as the estimator that outscores the others across all percentiles. kzFO has a high capacity for extremes but requires attention for the limited sites with high N EFF , where its skill is inferior to mme. kzHO and mme< have both high skill across all percentiles (better for kzHO), but they could have inferior POD compared to bestL at the extreme percentiles. With respect to the pollutants, the advancement of mmW skill over mme was higher for O 3 . The additional skill over mme in the range between 8 and 31 % from the statistical approaches applied to a pool of ensemble simulations identifies the upper ceiling of the im-provements from the corrections in the skill difference and the error dependence of the ensemble members. The bound results from the removal of the seasonal bias from the time series and the optimal training of the methods. We now proceed with splitting the datasets into training and testing, and we explore the sensitivity of the mmW skill arising from improper bias removal and weights. Both factors are estimated Table 3. The MSE from (a) the best deterministic models globally (bestG) and locally (bestL), (b) the unconditional ensemble mean (mme), and (c) the four conditional ensemble estimators (mme<, kzFO, kzHO, mmW). In addition, the bounds for the MSE of the ensemble mean are also presented. The maximum value (<MSE>) arises for ensemble members without diversity and the minimum value (mmeMIN) was estimated from the variance term only (i.e. calculated for unbiased and uncorrelated ensemble members). The ability of the estimators is evaluated through their skill scores (SS REF (Kioutsioukis and Galmarini, 2014); kzFO: conditional spectral ensemble mean with first-order components ; kzHO: conditional spectral ensemble mean with second-and higher-order components (kzHO); mmW: optimal weighted ensemble (Potempski and Galmarini, 2009).
on the training set for variable time series length that is progressively increasing from 1 to 60 days, for all monitoring stations and pollutants. The evaluation period for all training windows is the same 30-day segment, not available in the training procedure. The analysis will provide a perspective on applying the techniques in a forecasting context, although the examined simulations did not operate in forecasting mode. The interquartile range of the day-to-day difference in the weights is calculated and its range over all stations is displayed in Fig. 10. No convergence occurs; however, the vari-ability of the mmW weights is notably reduced after a certain amount of time. If we set a tolerance level at the second decimal, to be satisfied at all stations, we need at a minimum 20-45 days of hourly time series. The variability of weights is smaller for O 3 and higher for NO 2 and PM 10 , explained by the larger NMSE spread in the latter case. The identification of the necessary training or learning period will be assessed by its effect on the mmW skill. Table 5 presents the mmW skill obtained from training over time series of different lengths varying from 5 to 60 days. For O 3 , mmW trained over 10 days yields similar results with mme, while longer periods result in large departures from mme. NO 2 and PM 10 require larger training periods than O 3 . The use of mmW is practically of no benefit compared to mme if the training period is less than 20 days for NO 2 and 30 days for PM 10 . For all pollutants, the variability of the weights and the bias have no effect on the error after 60 days. The results demonstrate that the ensemble estimators based on the analytical optimization become insensitive to inaccuracies in the bias and weights for training periods exceeding 60 days. However, other published studies with weighted ensembles using non-analytical optimization (e.g. linear regression; Monteiro et al., 2013), argue that 1 month is sufficient for the weights and the bias. The subselecting schemes are more robust compared to the optimal weighting scheme in the variations of their parameters (bias, members). Using data from AQMEII-I, training periods in the order of 1 week were found essential for mme< (Kioutsioukis and Galmarini, 2014) and kzFO . Therefore, the operational implementation of each ensemble approach requires knowledge of its safety margins for the examined pollutants. Figure 9. The cumulative density function of the skill score (1 − MSE X /MSE MME , X = mmW, mme<, kzFO, kzHO) over mme, evaluated at each monitoring site for the examined species of the two AQMEII phases.

Conclusions
In this paper we analyse two independent suites of chemical weather modelling systems regarding their effect on the skill of the ensemble mean (mme). The results are interpreted with respect to the error decomposition of the mme. Four ways to extract more information from an ensemble aside from the mme are ultimately investigated and evaluated. The first approach applies optimal weights to the models of the ensemble (mmW), and the other three methods utilize se-lected members in time (mme<) or frequency (kzFO, kzHO) domain. The study focuses on O 3 , NO 2 and PM 10 , using the unprecedented datasets from two phases of AQMEII over the European domain.
The comparison of the mme skill versus the globally best single model (bestG is identified from the evaluation over all stations) points out that mme achieves lower average (across all stations) error compared to bestG. The enhancement of accuracy is highest for O 3 (up to 22 %) and lowest for PM 10 (below 3 %). We then investigate whether this benefit of en- Table 4. The probability of detection (POD) and false alarm rate (FAR) from (a) the best deterministic models, globally (bestG) and locally (bestL), (b) the unconditional ensemble mean (mme), and (c) the four conditional ensemble estimators (mme<, kzFO, kzHO, mmW (Kioutsioukis and Galmarini, 2014); kzFO: conditional spectral ensemble mean with first-order components ; kzHO: conditional spectral ensemble mean with second-and higher-order components (kzHO); mmW: optimal weighted ensemble (Potempski and Galmarini, 2009).  -The skill score of mme over its guaranteed upper ceiling (case of zero diversity) ranges between 15 and 30 %, being lower for PM 10 . Those percentages also represent the diversity normalized by the accuracy. Therefore, aside from improving the single models, their combination in an ensemble confines the mme skill if their diversity is limited.
-The promotion of the right amount of accuracy and diversity in the conditional ensemble estimators almost doubles the distance to the guaranteed upper ceiling. The skill score over mme is higher for O 3 (in the range of 18-31 %) and lower for NO 2 and PM 10 (in the range of 8-25 %), associated to the extent of potential changes in the joint distribution of accuracy and diversity in the respective ensembles. The improvement is larger for mmW and smaller for mme< and kzFO.
I. Kioutsioukis et al.: Insights into the deterministic skill of air quality ensembles -The theoretical minimum MSE of mme for the case of unbiased and uncorrelated models is far from being achieved from all ensemble estimators.
-As we move towards the tail, the probability of detection (POD) of bestG (bestL) dominates over the mme (mme<). At the extreme percentiles, kzFO and mmW are the only estimators with POD higher than bestL.
-The combination of the results from the average error and the extremes identifies mmW as the estimator that outscores the others across all percentiles. kzFO has a high capacity for extremes but requires attention for the limited sites with high N EFF , where its skill is inferior to mme. kzHO and mme< have both high skill across all percentiles (better for kzHO), but they could have inferior POD compared to bestL at the extreme percentiles.
The skill enhancement is superior using the weighting scheme but the required training period to acquire representative weights was longer compared to the sub-selecting schemes. For all pollutants, the variability of the weights and the bias has negligible effect on the error for training periods longer than 60 days. For the schemes relying on member selection, accurate recent representations on the order of a week were sufficient. The learning periods constitute the necessary time for acquiring similar prior and posterior distributions in the controlling parameters of samples. The risks of all the statistical learning processes originate from the violation of this assumption, which holds in the case of changing weather or chemical regimes for example. Therefore, the operational implementation of each ensemble approach requires knowledge of its safety margins for the examined pollutants as well as its risks.
The improvement of the physical, chemical and dynamical processes in the deterministic models is a continuous procedure that results in better forecasts. Furthermore, mathematical optimizations in the input data (e.g. data assimilation) or the model output (e.g. ensemble estimators) have a significant contribution in the accuracy of the whole modelling process. The presented post-simulation advancements were the result only of favourable ensemble design. However, the theoretical minimum MSE of mme for the case of unbiased and uncorrelated models is far from being achieved from all ensemble estimators. Further development is underway in the presented ensemble methods that take into account the meteorological and chemical regimes.

Data availability
All data used in the study are available in the ensemble platform (http://ensemble.jrc.ec.europa.eu/public/) upon request at aqmeii@jrc.ec.europa.eu. stituto Nacional de Ecología-INE) (North American national emissions inventories); US EPA (North American emissions processing); TNO (European emissions processing); ECMWF/MACC project & Météo-France/CNRM-GAME (Chemical boundary conditions). Ambient North American concentration measurements were extracted from Environment Canada's National Atmospheric Chemistry Database (NAtChem) PM database and provided by several US and Canadian agencies (AQS, CAPMoN, CASTNet, IMPROVE, NAPS, SEARCH and STN networks); North American precipitation-chemistry measurements were extracted from NAtChem's precipitation-chemistry database and were provided by several US and Canadian agencies (CAPMoN, NADP, NBPMN, NSPSN, and REPQ networks); the WMO World Ozone and Ultraviolet Data Centre (WOUDC) and its data-contributing agencies provided North American and European ozone sonde profiles; NASA's Aerosol Robotic Network (AeroNet) and its data-contributing agencies provided North American and European AOD measurements; the MOZAIC Data Centre and its contributing airlines provided North American and European aircraft take-off and landing vertical profiles. For European air quality data the following data centres were used: EMEP European Environment Agency, European Topic Center on Air and Climate Change, and AirBase provided European air-and precipitation-chemistry data. The Finnish Meteorological Institute is acknowledged for providing biomass burning emission data for Europe. Data from meteorological station monitoring networks were provided by NOAA and Environment Canada (for the US and Canadian meteorological network data) and the National Center for Atmospheric Research (NCAR) data support section. Joint Research Center Ispra and Institute for Environment and Sustainability provided their ENSEMBLE system for model output harmonization and analyses and evaluation. The co-ordination and support of the European contribution through COST Action ES1004 EuMetChem is gratefully acknowledged. The views expressed here are those of the authors and do not necessarily reflect the views and policies of the US Environmental Protection Agency (EPA) or any other organization participating in the AQMEII project. This paper has been subjected to EPA review and approved for publication. The UPM authors thankfully acknowledge the computer resources, technical expertise, and assistance provided by the Centro de Supercomputación y Visualización de Madrid (CESVIMA) and the Spanish Supercomputing Network (BSC). GC and PT were supported by the Italian Space Agency (ASI) in the frame of the PRIMES project (contract no. I/017/11/0). The same authors are deeply thankful to the Euro Mediterranean Centre on Climate Change (CMCC) for having made available the computational resources.
Edited by: G. Carmichael Reviewed by: M. Plu and one anonymous referee