Improving the deterministic skill of air quality ensembles

Ioannis Kioutsioukis, Ulas Im, Efisio Solazzo, Roberto Bianconi, Alba Badia, Alessandra Balzarini, Rocío Baró, Roberto Bellasio, Dominik Brunner, Charles Chemel, Gabriele Curci, Hugo Denier van der Gon, Johannes Flemming, Renate Forkel, Lea Giordano, Pedro Jiménez-Guerrero, Marcus Hirtl, Oriol Jorba, Astrid Manders-Groot, Lucy Neal, Juan L. Pérez, Guido Pirovano, Roberto San Jose, Nicholas Savage, Wolfram Schroder, Ranjeet S. Sokhi, Dimiter Syrakov, Paolo Tuccella, Johannes Werhahn, Ralf Wolke, Christian Hogrefe, Stefano Galmarini

Atmos. Chem. Phys. Discuss., doi:10.5194/acp-2016-513, 2016. Manuscript under review for journal Atmos. Chem. Phys. Published: 30 June 2016. © Author(s) 2016, CC-BY 3.0 License.


Introduction
Uncertainties in atmospheric models such as chemical weather models, whether due to the input data or the model itself, limit the predictive skill. The incorporation of data assimilation techniques and the unceasing improvement in the understanding of the physical, chemical and dynamical processes result in better forecasts (Zhang et al., 2012). In addition, mathematical tools such as ensemble forecasting provide an extra channel for uncertainty quantification and, eventually, reduction. The method appears similar to the Monte Carlo approach; in practice, the similarity is only phenomenological, since the probability density function of the uncertainty is not sampled in any statistical design (random, Latin hypercube, etc.). The benefits from ensemble forecasting arise from the averaging out of the unpredictable components (Kalnay, 2003).
ECMWF reports an increase in forecast skill of one day per decade for meteorological variables, evaluated on the geopotential height anomaly (Simmons, 2011). Air quality modelling and monitoring have a shorter history that does not allow a similarly adequate estimation of such a trend for the numerous species being modelled. Moreover, the skill changes dramatically from species to species. Recent results for ozone suggest that medium-range forecasts can be performed with a quality similar to the geopotential height anomaly forecasts (Eskes et al., 2002). Besides the continuous increase in skill due to enlarged scientific understanding, more accurate and denser observations as well as ensemble forecasting, an extra gain of similar magnitude can be achieved for ensemble-based deterministic forecasting using conditional averaging (e.g., Galmarini et al., 2013; Mallet et al., 2009; Solazzo et al., 2013).
Ideally, for continuous and unbiased variables, the multi-model ensemble mean outscores the skill of the deterministic models provided that the members have similar skill and independent errors (Potempski and Galmarini, 2009; Weigel et al., 2010). In practice, the multi-model ensemble mean usually outscores the skill of the deterministic models if the evaluation is performed over multiple observation sites and times. This occurs because, over a network of stations, there are some where the essential conditions for the ensemble members (e.g. a skill difference between the models that is not too large) are fulfilled, favouring the ensemble mean; for the rest, where the conditions are not met, local verification highlights one or another atmospheric model but no single model consistently. Hence, although the skill of the numerical models varies in space (latitude, longitude, altitude) and time (e.g., hour of the day, month, season), the ensemble mean is usually the most accurate spatio-temporal representation.
One of the challenges in ensemble forecasting is the processing of the deterministic model datasets prior to averaging, in order to construct another dataset whose members ideally constitute an independent and identically distributed (i.i.d.) sample (Kioutsioukis and Galmarini, 2014; Bishop and Abramowitz, 2013). This statistical process favours the ensemble mean at each observation site. Two basic pathways exist to achieve this goal: model weighting and model sub-selecting. There are several methods to assign weights to ensemble members, such as singular value decomposition (Pagowski et al., 2005), dynamic linear regression (Pagowski et al., 2006; Djalalova et al., 2010), Kalman filtering (Delle Monache et al., 2011), Bayesian model averaging (Riccio et al., 2007) and analytical optimization (Potempski and Galmarini, 2009), while model selection usually relies on the quadratic error or its proxies (e.g. Solazzo et al., 2013; Kioutsioukis and Galmarini, 2014). In this work, we apply both approaches in an inter-comparison study of two air quality ensemble systems (hereafter, Phase I and Phase II), generated within the Air Quality Model Evaluation International Initiative (AQMEII). The differences between the ensembles of Phase I and Phase II originate from many sources, related to both the input data and the models: (a) the year is different (2006 vs. 2010), therefore the meteorological conditions are different; (b) emission methodologies have changed (see Table 3 in Pouliot et al., 2015); (c) the boundary conditions are very different (obtained from GEMS in Phase I and MACC in Phase II); (d) the composition of the ensembles is different; (e) the models in Phase II use on-line coupling between meteorology and chemistry; (f) the models may have been updated with new science processes apart from the feedback processes. Recent studies with regional air quality models yielded that the full variability of the ensemble can be retained with only an effective number of models (N EFF ) on the order of 5-6 (e.g. Solazzo et al., 2013; Kioutsioukis and Galmarini, 2014; Marecal et al., 2015). The minimum number of ensemble members needed to sample the uncertainty should be well above N EFF ; for this reason, we focus on the European domain due to its sufficient number of models to form the ensemble. The uncertainties arising from observational errors are not taken into consideration.
The objectives of the paper are (a) to interpret the skill of the unconditional multi-model mean within Phases I and II of AQMEII, (b) to calculate the maximum expectations for the skill of alternative ensemble estimators and (c) to evaluate the operational implementation of the approach using cross-validation. The paper is structured as follows: Section 2 provides a brief description of the ensemble's basic properties through a series of conditions expressed by mathematical equations. In Section 3, a comparison of the skill of the deterministic models and the unconditional ensemble mean across Phase I and Phase II is performed. In Section 4, the skill of the alternative ensemble estimators is demonstrated. Conclusions are given in Section 5.

Minimization of the ensemble error
The notation conventions used in this section are briefly presented in the following. Assume an ensemble composed of M members (i.e. the output of M modelling systems) denoted $f_i$, $i = 1, 2, \dots, M$. The multi-model ensemble mean is evaluated as $\bar{f} = \sum_{i=1}^{M} w_i f_i$ with $\sum_{i=1}^{M} w_i = 1$. The weights $w_i$ sum up to one and can be either equal (uniform ensemble) or unequal (non-uniform ensemble). The desired value (the measurement) is denoted $y$.

Assuming a uniform ensemble, the mean squared error (MSE) of the multi-model ensemble mean can be broken down into three components, namely bias, error variance and error covariance (Ueda and Nakano, 1996):

$$\mathrm{MSE}(\bar{f}) = \overline{\mathrm{bias}}^{\,2} + \frac{1}{M}\,\overline{\mathrm{var}} + \left(1 - \frac{1}{M}\right)\overline{\mathrm{cov}} \qquad (1)$$

The decomposition provides the reasoning behind ensemble averaging: as more ensemble members are included, the variance factor decreases monotonically and the MSE converges towards the covariance factor. Covariance, unlike the other two positive-definite factors, can be either positive or negative; its minimization requires an ensemble composed of independent or, even better, negatively correlated members. In addition, bias correction should be a necessary step prior to any ensemble manipulation. More details regarding this decomposition in the air quality ensemble context can be found in Kioutsioukis and Galmarini (2014). In similar fashion, the squared error of the multi-model ensemble mean can be decomposed into the difference of two positive-definite components, with their expectations characterized as accuracy and diversity (Krogh and Vedelsby, 1995):

$$(\bar{f} - y)^2 = \frac{1}{M}\sum_{i=1}^{M}(f_i - y)^2 - \frac{1}{M}\sum_{i=1}^{M}(f_i - \bar{f})^2 \qquad (2)$$

This decomposition proves that the error of the ensemble mean is guaranteed to be less than or equal to the average quadratic error of the component models. The ideal ensemble error depends on the right trade-off between accuracy (1st term on the r.h.s. of Eq. 2) and diversity (2nd term on the r.h.s. of Eq. 2).
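The ambiguity decomposition in Eq. 2 can be checked numerically. The sketch below uses NumPy with purely synthetic data (the member time series and their noise levels are illustrative assumptions, not AQMEII values):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(50.0, 10.0, size=500)        # synthetic "observations"
M = 6
# M bias-free synthetic members: truth plus independent noise
preds = y + rng.normal(0.0, 5.0, size=(M, 500))

ens = preds.mean(axis=0)                    # uniform multi-model mean

mse_ens = np.mean((ens - y) ** 2)           # l.h.s. of Eq. 2
accuracy = np.mean((preds - y) ** 2)        # average member error (1st r.h.s. term)
diversity = np.mean((preds - ens) ** 2)     # spread around the mean (2nd r.h.s. term)
```

The identity `mse_ens = accuracy - diversity` holds exactly for uniform weights, which is why the ensemble mean can never be worse than the average member error.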
The two decompositions presented assume uniform ensembles, i.e. all members receive equal weight. For the case of a non-uniform ensemble, the MSE of the multi-model ensemble mean can be analytically minimized to yield the optimal weights, provided that the participating models are bias-corrected (Potempski and Galmarini, 2009):

$$w = \frac{K^{-1}\mathbf{1}}{\mathbf{1}^{T} K^{-1} \mathbf{1}} \qquad (3)$$

where $w$ is the vector of optimal weights, $K$ is the error covariance matrix and $\mathbf{1}$ the unit vector of ones. In its simplest form, the equation assigns one weight to each model at each measurement site; more complicated versions, such as multidimensional optimisation for many variables (e.g. chemical compounds) at many sites simultaneously, are not discussed here.
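Eq. 3 translates directly into a few lines of NumPy. This is a minimal sketch with synthetic errors; `optimal_weights` is an illustrative helper name, not code from the AQMEII analysis:

```python
import numpy as np

def optimal_weights(errors):
    """Weights minimizing the MSE of the weighted ensemble mean (Eq. 3):
    w = K^-1 1 / (1^T K^-1 1), with K the error covariance matrix.
    `errors`: (M, T) array of bias-corrected model errors at one site."""
    K = np.cov(errors)
    ones = np.ones(K.shape[0])
    Kinv_ones = np.linalg.solve(K, ones)   # solve instead of forming K^-1
    return Kinv_ones / (ones @ Kinv_ones)

rng = np.random.default_rng(1)
# five members with independent errors of increasing variance
err = rng.normal(size=(5, 1000)) * np.array([1.0, 2.0, 3.0, 4.0, 5.0])[:, None]
w = optimal_weights(err)
```

For independent errors the weights reduce to being proportional to the inverse error variances, so the least noisy member receives the largest weight.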
It appears that the skill of the unconditional ensemble mean (mme) has the potential for certain advantages over the single members, provided some properties are satisfied. As those properties are not systematically met in practice, better ensemble skill can be achieved through sub-selecting schemes such as the ideal trade-off between accuracy and diversity (mme<) or through optimal weighting (mmW). A further scheme is also considered, derived from ensemble optimization at selected spectral bands obtained with the Kolmogorov-Zurbenko (kz) filter (Zurbenko, 1986) and combined either linearly (kzFO) or nonlinearly (kzHO) (Galmarini et al., 2013). An inter-comparison of all those approaches to ensemble averaging is explored in this work using observed and simulated air quality time series.
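The kz filter itself is simple: k passes of a centred m-point moving average. A minimal sketch follows; the window and iteration choices here are illustrative, not the bands used in Galmarini et al. (2013):

```python
import numpy as np

def kz_filter(x, m, k):
    """Kolmogorov-Zurbenko filter kz(m, k): k iterations of an
    m-point moving average; damps periods shorter than ~ m*sqrt(k)."""
    kernel = np.ones(m) / m
    for _ in range(k):
        x = np.convolve(x, kernel, mode="same")
    return x

# separate an hourly series into a slow baseline and a short-term residual
t = np.arange(24 * 60, dtype=float)                      # 60 days, hourly
x = 40.0 + 10.0 * np.sin(2 * np.pi * t / (24 * 365)) \
         + 8.0 * np.sin(2 * np.pi * t / 24)              # seasonal + diurnal
baseline = kz_filter(x, m=25, k=3)                       # diurnal cycle removed
short_term = x - baseline
```

Note that `mode="same"` zero-pads the edges, so the first and last few tens of points are damped; an operational implementation would also handle missing data.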

Reducing dimensionality
The combination of redundant models (i.e., models with highly correlated errors) results in a loss of valuable information due to the dependent biases (Solazzo et al., 2013). To improve the accuracy of the ensemble, redundant information in the sub-selecting schemes is discarded by means of the effective number of models (N EFF ) sufficient to reproduce the variability of the full ensemble. N EFF is calculated as (Bretherton et al., 1999):

$$N_{\mathrm{EFF}} = \frac{\left(\sum_{i=1}^{M} s_i\right)^2}{\sum_{i=1}^{M} s_i^2} \qquad (4)$$

where $s_i$ is the i-th eigenvalue of the error covariance matrix. The fraction of the overall variance expressed by the first N EFF eigenvalues is 86%, provided that the modelled and observed fields are normally distributed (Bretherton et al., 1999). The largest eigenvalue is denoted $s_m$.
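Eq. 4 can be sketched in NumPy as follows, contrasting a diverse and a redundant synthetic ensemble (`n_eff` is an illustrative helper, not AQMEII code):

```python
import numpy as np

def n_eff(errors):
    """Effective number of models (Eq. 4): (sum s_i)^2 / sum s_i^2,
    with s_i the eigenvalues of the error covariance of `errors` (M, T)."""
    s = np.linalg.eigvalsh(np.cov(errors))
    return s.sum() ** 2 / (s ** 2).sum()

rng = np.random.default_rng(2)
diverse = rng.normal(size=(10, 2000))                    # independent errors
redundant = np.tile(rng.normal(size=(1, 2000)), (10, 1)) \
            + 0.01 * rng.normal(size=(10, 2000))         # near-copies of one model
```

An ensemble of independent members yields N EFF close to the nominal M, while near-duplicated members collapse it towards 1.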

Verification metrics
The skill of the forecasts has been measured with the following statistical parameters: (1) the normalised mean square error (NMSE), i.e. the mean square error (MSE) divided by $\bar{O}\,\bar{M}$, where $\bar{O}$ and $\bar{M}$ are the mean values of the observations and the model respectively; (2) the hit rate (HR), i.e. the proportion of occurrences (e.g. events exceeding a threshold value) that were correctly identified; (3) Taylor plots (Taylor, 2001), which summarize the standard deviation, root mean square error (RMSE) and Pearson product-moment correlation coefficient in a single point on a two-dimensional plot.
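The first two metrics can be sketched as follows (the function names and the toy numbers are illustrative assumptions):

```python
import numpy as np

def nmse(obs, mod):
    """Normalised mean square error: MSE / (mean(obs) * mean(mod))."""
    return np.mean((mod - obs) ** 2) / (obs.mean() * mod.mean())

def hit_rate(obs, mod, threshold):
    """Fraction of observed exceedances of `threshold` that the
    model also flags as exceedances."""
    events = obs > threshold
    return float(np.mean(mod[events] > threshold)) if events.any() else float("nan")

obs = np.array([80.0, 95.0, 120.0, 140.0, 60.0])
mod = np.array([85.0, 90.0, 115.0, 100.0, 70.0])
```

With these toy series, two observed values exceed 110 and the model reproduces one of them, giving a hit rate of 0.5.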

Results
In this section we apply the conceptual framework briefly presented in Section 2 to investigate the differences and commonalities of the ensembles across the two AQMEII phases (Rao et al., 2011). As mentioned in the introduction, the two ensembles are dissimilar with respect to their input data (emissions, boundary conditions) and their participating models (off-line/on-line coupling), apart from the different meteorology/photochemistry due to the different simulation years. The model settings and input data for Phase I are described in Solazzo et al. (2012a, b), Schere et al. (2012) and Pouliot et al. (2012); for Phase II, similar information is presented in Im et al. (2015a, b), Brunner et al. (2015), Baro et al. (2015) and Pouliot et al. (2015). In both cases, the modelling communities simulated annual air quality over Europe and North America for the years 2006 (Phase I) and 2010 (Phase II). From the provided station-based hourly time series, we analysed the three-month period with relatively high concentrations: June-July-August for O3, and September-October-November for NO2 and PM10. All monitoring stations are rural and have data at least 75% of the time.
We start the analysis with a presentation of the ensemble properties in the two phases, originating from variations in the components (observations, models and their interactions).
Only the unconditional full ensemble average (i.e. mme) is assessed in this section.

Observations
The observation networks across the two phases of AQMEII have similar characteristics per species, such as the number of stations and the fraction of missing data (Table 1). The network is densest for O3, for which there are as many monitoring stations as for NO2 and PM10 combined, with PM10 having the fewest observations. Figure 1 compares the statistical distribution of all three species between the two AQMEII phases, through the cumulative density function composed from the mean value at each percentile of the observations. All three pollutants demonstrate a decrease from 2006 to 2010, in line with the emission reductions, as already documented (European Environmental Agency, 2013). However, we should mention that the decline appears unrealistically large for PM10 due to the different spatial coverage of the sampling stations: unlike for the other pollutants, no valid PM10 data for France and the UK were available in Phase II (station locations are shown in Figure 4).

Models
The number of ensemble members available from Phase I ranges from 10 (PM10) to 12 (O3) and 13 (NO2), while in Phase II 14 members were available for all species (Table 1). Following Section 2, each model has been bias-corrected prior to the analysis, i.e. its own mean bias over the examined three-month period has been subtracted from its modelled time series at each monitoring site.
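The per-site bias correction described above can be sketched as follows (the helper is hypothetical; real station series contain gaps, represented here as NaNs):

```python
import numpy as np

def bias_correct(model_ts, obs_ts):
    """Subtract the model's own mean bias over the period from its
    time series at one site, using only hours with valid observations."""
    valid = ~np.isnan(obs_ts)
    bias = np.mean(model_ts[valid] - obs_ts[valid])
    return model_ts - bias

obs = np.array([30.0, 40.0, np.nan, 50.0])
mod = np.array([35.0, 45.0, 60.0, 55.0])
corrected = bias_correct(mod, obs)
```

After correction the mean residual over the valid hours is zero by construction, which is the precondition for the decompositions and the optimal weighting of Section 2.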
The boxplots of NMSE over all monitoring stations are presented in Figure 2. The aggregated mean skill of the individual models across the two phases appears similar for O3, shows an improvement for NO2 (median <NMSE> shifted from 0.53 to 0.49) and a worsening for PM10 (median <NMSE> shifted from 0.47 to 0.50) (Table 2). At the same time, the best model at each monitoring station has similar behaviour for O3 and NO2 across the two phases and experiences a degradation for PM10 (median <NMSE> shifted from 0.34 to 0.37). In summary, the changes in individual-model skill from Phase I to Phase II are species-dependent: neutral for O3, improved for NO2 and degraded for PM10.

Multi-model mean
As shown above, the differences between Phase I and Phase II in terms of the individual accuracy of the models varied between the three examined species. We now examine the consequences for the behaviour of the multi-model mean and interpret the results with respect to the presented error decompositions. As suggested by Equations 1 and 2, the error of the multi-model mean relies on the skill difference of its members and their error dependence.

Skill difference
Despite the different changes in individual model skill for the different species, when the models are combined to form an ensemble, the skill difference between the best model and the average skill has decreased for all species from Phase I to II. This is inferred from the values of the indicator NMSE BEST /<NMSE>, which increase (Table 2). The increase occurs because Phase II contains more equally good models. To demonstrate this, we evaluate the percentage of cases in which each model has been identified as 'best' and record the number of models exceeding specific percentage thresholds. If the models behaved like an i.i.d. sample, the probabilities of being best would be roughly equal (~1/M) for all models. As can be inferred from Table 2, the proportion of equally good models has increased in Phase II for O3 and NO2, since the set of models exceeding the 1/M percentage contains half of the models, compared to one third in Phase I. This is not, however, true for the Phase II PM10 simulations, where one model outscores the others at roughly 40% (~6/M) of the stations, implying a missing process in the majority of the models. It turned out that this model was erroneously running with off-line coupling between meteorology and chemistry.

Error dependence
The combination of models with correlated errors brings redundant information into the ensemble and reduces the benefits of ensemble averaging. The eigenvalues of the covariance matrix calculated from the model errors provide information on the members' diversity and the ensemble redundancy. Following the eigen-analysis of the error covariance matrix at each station separately and converting the eigenvalues to the cumulative amount of explained variance, the resulting matrix is presented as a box-and-whisker plot (Figure 3). The number of eigenvalues necessary to capture 86% of the variation is referred to as the effective number of models (N EFF ). In Phase I, the maximum value of N EFF across all stations is 6 for O3 and NO2 and 4 for PM10. In Phase II, this number is approximately 5 for all species. Hence, 5±1 models are sufficient for all species in both phases. Therefore, from a pool of 10-14 models, the benefits of ensemble averaging cease after 6 members (but not 6 particular members).
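The per-station count of eigenvalues needed to reach 86% of the error variance can be sketched as follows (synthetic redundant ensemble; the helper name is illustrative):

```python
import numpy as np

def n_components_for(errors, frac=0.86):
    """Number of leading eigenvalues of the error covariance needed
    to explain `frac` of the total error variance at one station."""
    s = np.sort(np.linalg.eigvalsh(np.cov(errors)))[::-1]
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, frac) + 1)

rng = np.random.default_rng(5)
base = rng.normal(size=(4, 3000))         # 4 independent error patterns
# a 12-member ensemble in which each pattern is shared by 3 members
errors = np.repeat(base, 3, axis=0) + 0.01 * rng.normal(size=(12, 3000))
```

Although the nominal ensemble size is 12, only 4 components carry independent information, mirroring the N EFF values reported above.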
Further, the average variance explained by the largest eigenvalue (s m ) has increased for all species in Phase II, indicating a decrease in ensemble diversity.
Similar values across the two phases for the effective number of models are found from an estimation based on the optimal trade-off between accuracy and diversity, shown in the same figure. Rather than using a benchmark for the error dependence (i.e., the error covariance matrix), N EFF is here estimated from the error minimization across all possible combinations of the M models at each site. At 50% of the stations, the optimum number of ensemble members is less than or equal to 3, while at 95% of the stations the maximum optimum number of models becomes 6. In other words, we do not need more than 6 members at most stations. The only exception is the NO2 (II) case, where N EFF across the two phases differs by 1 (higher in Phase II). As we will see later, this is because only for NO2 (II) is there an imbalance in the relative changes of skill difference and error dependence.

Multi-model mean skill
The Phase II ensemble consists of models with, compared to Phase I, generally improved skill for NO2, worse skill for PM10 and similar skill for O3. The Phase II ensemble as a whole demonstrates smaller skill differences between models for all species. Lastly, increased error dependence is evident in Phase II, arising primarily from the fact that 50% of the ensemble members run the same model, with differences arising only from the choice of physical or chemical parameterizations. The modulation of the ensemble mean skill owing to the changes in its properties across the two phases is now examined.
The skill of the multi-model mean has been compared against the skill of the best available deterministic model, independently evaluated at each monitoring site. The geographical distribution of the ratio RMSE(mme)/RMSE BESTMODEL is presented in Figure 4. The indicator does not exhibit any longitudinal or latitudinal dependence. We also observe that the number of extreme cases where the mme skill was notably inferior to the best model has dropped from Phase I to II. Specifically, the percentage of stations where the RMSE(mme) was 10-30% higher than the RMSE BESTMODEL dropped from 17.2% to 9.3% for O3 and from 10.0% to 5.6% for NO2. As presented in more detail in Table 3 for the statistical distribution of the indicator:
- no major differences exist for O3, with the mme outscoring the best model at half of the stations; extreme values of the indicator at both tails are trimmed in Phase II;
- a clear improvement is evident for NO2, with the mme providing more skilled forecasts at 63% of the sites, compared to 38% in the previous phase; all ranges exhibit improvement, indicating a distribution shift;
- a mild improvement is also evident for PM10, where the number of stations at which mme performs better increased from 38% to 42%; extreme values of the indicator at both tails are increased in Phase II.
The reason behind the behaviour of mme is given in Figure 5 and emerges from the joint distribution of skill difference and error dependence. Skill difference decreased and error dependence increased for all species from Phase I to II. It is their relative change that modulates the mme skill. For O3, both are altered by a comparable amount, resulting in similar mme skill across Phases I and II. For NO2, the skill difference improved more than the error dependence worsened, yielding a net improvement of mme. For PM10, the situation is similar to NO2, though with a milder relative difference.
The area below the diagonal in Figure 5 corresponds to monitoring sites with disproportionally low diversity for the current level of accuracy. Seen from another angle, this area of the chart indicates a high spread in skill difference and relatively highly dependent errors. This situation practically means a limited number of skilled models with correlated errors, which in turn denotes a small N EFF value, as demonstrated in Figure 6. The opposite holds for the area above the diagonal: it corresponds to locations where the ensemble consists of models with comparable skill and relatively independent errors, reflecting a high N EFF value. This is the desired composition for an ensemble. In the next section we will examine some approaches that are able to put all points in the area above the diagonal. Figure 7 demonstrates such a case with an ensemble built from selected members (mme<).

Ensemble improvements
Following the identification of the weaknesses in the ensemble design, the potential for corrections through more sophisticated schemes is now investigated. Given the observations, optimal weights or members can be estimated or selected. In this section we mark the boundaries of the possible improvements for different ensemble mean estimators applicable to the AQMEII datasets, and in the next subsection we investigate the actual forecast skill under sub-optimal conditions using cross-validation. The average error across all monitoring stations was lower for mme compared to the single models in both phases. The spatio-temporal robustness of the mme skill has increased in Phase II, for different reasons per species, as analysed in the previous section. We consider the skill of the multi-model mean as the starting point and investigate pathways for further enhancing it through the non-trivial problem of weighting or sub-selecting. The optimal weights (mmW) are estimated from the analytical formulas presented in Potempski and Galmarini (2009). The sub-selection of members has been built upon the optimization of either the accuracy/diversity trade-off (mme<) (Kioutsioukis and Galmarini, 2014) or the spectral representation of 1st and higher-order components by different models (kzFO, kzHO) (Galmarini et al., 2013).
The results evaluated at all stations are presented in Figure 8 in the form of Taylor plots. For O3, the deterministic models have standard deviations that are smaller than observed and a narrow correlation pattern (~0.7) that slightly deteriorates in Phase II. For NO2, members with variance both higher and lower than observed exist in the ensemble, while the correlation spread becomes narrower in Phase II and demonstrates a minor improvement. Lastly, simulated PM10 from the deterministic models displays a smaller standard deviation than observed, with a wide correlation spread (0.3-0.6). The multi-model mean is always found closer to the reference point, in an area characterized by lower error and increased correlation but, at the same time, generally low variance. The examined ensemble estimators (mmW, mme<, kzFO, kzHO) are horizontally shifted from mme; hence they demonstrate even lower error together with increased correlation and variance.
Among them, the highest composite skill was found for mmW, followed by kzHO.
A comparison between the skill of the examined improvements and mme, at each station separately, is now conducted. The cumulative density function of the indicator MSE X /MSE MME (X = mmW, mme<, kzFO, kzHO), evaluated at each monitoring site, is shown in Figure 9. For O3, the median improvement was 27% for mmW, 22-25% for kzHO and 17% for kzFO and mme<, roughly equal across the two phases. At ten percent of the stations, the improvement can be over 41%. For NO2, the median improvement for Phase I (Phase II) was 21% (17%) for mmW, 20% (13%) for kzHO and 13% (7-9%) for kzFO and mme<. The magnitude of improvement can exceed 39% (30%) at roughly ten percent of the stations.
The ability to forecast extreme values is now examined through the hit rate indicator (probability of detecting events exceeding a certain threshold). Due to the lowering of the concentrations from Phase I to II, a percentile threshold is more appropriate for the comparison than a fixed threshold. Therefore, a threshold reflecting the average 90th percentile across the stations has been selected, being 129/117 µg/m3 (Phase I/II) for O3, 30/26 µg/m3 for NO2 and 52/33 µg/m3 for PM10. The ability of the models to simulate the tails changed in line with the <NMSE> changes from Phase I to II. For O3, the percentage of successfully detected events exceeding the 90th percentile for mme was 29% (25%) for Phase I (II). The major improvement occurred for mmW, where the aggregated hit rate was 51% (48%), and the smallest improvement for mme<, with a value of 42% (38%). The spectral estimators yielded 47% (42%) and 46% (40%) for kzFO and kzHO respectively. For NO2, the successful hits for mme were 35% (42%) and reached 45% (49%) for mmW. For the other ensemble averages, the result was 39% (45%) for mme<, 39% (44%) for kzFO and 40% (47%) for kzHO. For PM10, the total percentage of successful hits for mme was 19% (16%) and became 33% (42%) for mmW, while the other estimators yielded 28% (27%), 29% (30%) and 31% (28%) for mme<, kzFO and kzHO respectively.
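The percentile-based threshold used above can be sketched as follows (synthetic station data; the distribution and its parameters are illustrative assumptions, not the AQMEII values):

```python
import numpy as np

def p90_threshold(station_obs):
    """Average 90th percentile across stations: a common exceedance
    threshold for one species/phase, robust to overall concentration shifts."""
    return float(np.mean([np.percentile(o, 90) for o in station_obs]))

rng = np.random.default_rng(3)
# 50 synthetic stations drawn from the same skewed distribution
stations = [rng.gamma(4.0, 15.0, size=2000) for _ in range(50)]
thr = p90_threshold(stations)
# roughly 10% of the values at each station exceed the common threshold
exceed_frac = float(np.mean([np.mean(o > thr) for o in stations]))
```

Using a percentile rather than a fixed concentration keeps the event base rate comparable between the two phases despite the overall decrease in concentrations.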
The range of forecast error, from the worst deterministic model to the optimum ensemble-based average, is presented in Table 4. Statistics were calculated for the three-month evaluation period and averaged over all monitoring sites. All values have been normalized by the error of the best deterministic model in order to quantify the potential extent of improvement that each method can achieve as a function of species and feedbacks. We observe that the benefits from ensemble averaging in the form of mme range from 1% to 12% when compared to the best numerical model. Under proper weighting, this distance is at least doubled: the range of improvement for mmW over the best single model was from 9% to 27%.
To summarize:
- [Error] The analytical optimization of the error through non-uniform weighting (mmW) achieved a lower MSE than the sub-selecting schemes. Among species, the improvements over mme are larger for O3 and smaller for PM10, i.e. proportional to the skill of the deterministic models.
- [Extremes] The ranking of the methods with respect to their capability for extremes was in line with their skill for the mean error. The ability of all models to capture levels exceeding a fixed threshold was better for O3 and PM10 in Phase I and for NO2 in Phase II. Among species, mme performed best for NO2 and worst for PM10. The total percentage of successfully modelled extreme values obtained through the statistical treatments increased by up to 10% for NO2, 23% for O3 and 26% for PM10.

Forecasting performance
The statistical treatments applied to a pool of ensemble simulations generated results with improved skill in diagnostic mode. To provide a perspective on applying these techniques in a forecasting context, we explore the temporal robustness of the weighting scheme, i.e. its predictability window. For this reason, the weights have been re-calculated for variable time-series lengths, progressively increasing from 1 to 60 days, for all monitoring stations across the two phases. The evaluation period for all training windows is the same 30-day segment, not used in the training procedure. The interquartile range of the day-to-day difference in the weights is calculated, and its range over all stations is displayed in Figure 10.
No convergence occurs; however, the variability of the mmW weights is notably reduced after a certain amount of time. If we set a tolerance level at the second decimal, to be satisfied at all stations, we need 20 days of hourly time series for O3 and NO2 and 30 days for PM10 (Phase I). This period can be thought of as the necessary training or learning period. In Phase II, these periods increase to 25 days for O3 and 45 days for NO2 and PM10.
Weights are unpredictable for smaller periods. In practice, even safer margins should be employed: using half of the applied tolerance, we need an approximate learning period of 50 days for Phase I and 60 days for Phase II. Lastly, the sub-selecting schemes, unlike the analytical optimization, are quite robust even for very small training periods (e.g. one week), whether in the form of mme< (Kioutsioukis and Galmarini, 2014) or kzFO/kzHO (Galmarini et al., 2013).
Table 5 presents the mmW skill obtained from training over time series of different lengths, varying from 5 to 60 days. For O3, mmW trained over 10 days yields results similar to mme, while longer periods result in large departures from mme. NO2 and PM10 require longer training periods than O3: the use of mmW is of practically no benefit compared to mme if the training period is less than 20 days for NO2 and 30 days for PM10. For all pollutants, the variability of the weights has no effect on the error after 60 days.
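The training-window experiment above can be sketched as follows: the Eq. 3 weights are recomputed over progressively longer windows and their stability inspected (synthetic errors; `weight_history` is an illustrative helper, not the AQMEII code):

```python
import numpy as np

def weight_history(errors, windows):
    """Optimal weights (Eq. 3) recomputed over training windows of
    increasing length; returns one weight vector per window."""
    ones = np.ones(errors.shape[0])
    out = []
    for n in windows:
        K = np.cov(errors[:, :n])          # covariance over the first n hours
        v = np.linalg.solve(K, ones)
        out.append(v / (ones @ v))
    return np.array(out)

rng = np.random.default_rng(4)
# five members, hourly errors over 60 days, increasing noise levels
err = rng.normal(size=(5, 24 * 60)) * np.array([1.0, 1.5, 2.0, 2.5, 3.0])[:, None]
windows = [24 * d for d in (5, 10, 20, 40, 60)]
hist = weight_history(err, windows)
# the weight estimates fluctuate for short windows and settle as the
# training period grows, mirroring the learning periods reported above
```

Every row of `hist` is a valid weight vector (summing to one); it is the window-to-window change in these vectors that defines the learning period.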

Conclusions
In this paper we give an overview of the performance of the forecast systems in the two phases of AQMEII and their effect on the skill of the ensemble mean. The results are interpreted with respect to the error decomposition of the ensemble. Ways to extract more information from an ensemble besides the ensemble mean are ultimately investigated and evaluated.
Air quality models simulate the atmospheric composition through a series of complex physical, chemical and dynamical processes. In the hypothetical scenario where a simulation experiment with an ensemble of chemical weather models is performed twice, with the only difference being off-line or on-line coupling between the meteorological and chemical modules, the increased non-linearity in the latter case is expected to enhance model independence and hence generate more diverse results between models. Assuming the accuracy of the models remains the same, the increased diversity favours the skill of the multi-model mean in the simulation with feedbacks compared to the one without interactions.
However, maintaining the same level of accuracy when feedbacks are incorporated into the models is not guaranteed. Besides feedbacks, the varying factors between the two AQMEII experiments also included different models, emissions, boundary conditions and simulation years.
The indirect comparison presented here demonstrated that the ensembles of Phase I and Phase II have several key differences. The average accuracy in Phase II has improved for NO2, decreased for PM10 and remained the same for O3. At the same time, the accuracy of the best model remained the same for NO2 and O3 and decreased for PM10. In other words, without pushing the predictability limits further, many models simulate NO2 better in Phase II; the opposite is true for PM10, where the Phase II modelling accuracy deteriorated. In terms of redundancy, despite the expected increase in variability, the ensemble diversity was reduced in Phase II, mainly because half of the ensemble members originated from the same model, differing only in their physical or chemical parameterizations. The combined effect on the multi-model mean, in terms of the NMSE, was neutral, regardless of the idealized theoretical expectations. However, the relative changes in accuracy and diversity in Phase II always favoured the multi-model mean over the best local deterministic model, further enhancing its spatio-temporal robustness. This raises the topic of ensemble design and again supports the critical importance of having the right amount of accuracy and diversity within an ensemble.
Several improvements to the multi-model mean skill were also examined, in the form of weighting or sub-selecting. The skill enhancement was superior with the weighting scheme, but the training phase required to acquire representative weights was longer than for the sub-selecting schemes. For all pollutants, the variability of the weights has a negligible effect on the error for training periods longer than 60 days. The improvement of the optimal multi-model mean over the best single model ranged from 9% (PM10) to 27% (O3), whereas the corresponding range for the traditional unconditional multi-model average was from 1% to 12%. The skill of the other approaches, which use reduced-size ensembles, closely follows that of the optimal scheme. The post-simulation advancements presented here are the result of favourable ensemble design alone. The combined skill gained from conditional versus unconditional ensemble averaging is comparable to that obtained each decade from the aggregated advancements in numerical prediction due to more and better assimilated observations, higher computing power and progress in our understanding of dynamics and physics.
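The train-then-apply structure of a weighting scheme such as mmW can be sketched as follows. The paper's actual weight-estimation method is not reproduced here; an ordinary least-squares fit on a synthetic 60-day training window (matching the training length found adequate in the text) serves as an illustrative stand-in, and all series are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, n_models = 60, 30, 5             # 60-day training window
obs = rng.normal(40.0, 8.0, size=n_train + n_test)
# hypothetical member time series with systematic biases
models = np.stack([obs + rng.normal(b, 4.0, size=obs.size)
                   for b in np.linspace(-4, 4, n_models)], axis=1)

# fit weights on the training window by ordinary least squares
# (an illustrative stand-in, not the paper's exact weighting scheme)
w, *_ = np.linalg.lstsq(models[:n_train], obs[:n_train], rcond=None)

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

test_obs = obs[n_train:]
rmse_w = rmse(models[n_train:] @ w, test_obs)            # weighted mean (mmW-like)
rmse_mme = rmse(models[n_train:].mean(axis=1), test_obs) # unweighted mean
```

Evaluating both estimators on a held-out test period, as above, is what distinguishes genuine skill gain from overfitting the training window.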
The improvement of the physical, chemical and dynamical processes in the deterministic models is a ceaseless procedure that results in better forecasts. Beyond that, mathematical optimizations of the input data (e.g. data assimilation) or the model output (e.g. ensemble estimators) contribute significantly to the accuracy of the whole modelling process.
Further development of the presented ensemble methods, taking into account the meteorological and chemical regimes, is underway.
(Pouliot et al., 2015); (c) the boundary conditions are very different (obtained from GEMS in phase I and MACC in phase II); (d) the composition of the ensembles is different; (e) the models in phase II use on-line coupling between meteorology and chemistry; (f) the models may have been updated with new science. For an ensemble composed of M members (i.e. outputs of modelling systems) denoted as x_i, i = 1, 2, ..., M, the multi-model ensemble mean can be evaluated as x̄ = Σ_{i=1}^{M} w_i x_i, with Σ_{i=1}^{M} w_i = 1. The weights (w_i) sum up to one and can be either equal (uniform ensemble) or unequal (non-uniform ensemble). The desired value (measurement) is denoted y.
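The weighted multi-model mean x̄ = Σ w_i x_i defined above translates directly into code. The function below is an illustrative sketch (the function and argument names are our own), covering both the uniform and non-uniform cases and enforcing the sum-to-one constraint on the weights.

```python
import numpy as np

def multi_model_mean(members, weights=None):
    """Weighted multi-model mean of an (M, T) array of member time series.

    weights: length-M sequence summing to one; None gives the uniform ensemble.
    """
    members = np.asarray(members, dtype=float)
    m = members.shape[0]
    w = np.full(m, 1.0 / m) if weights is None else np.asarray(weights, dtype=float)
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must sum to one")
    return w @ members  # weighted sum over members, one value per time step

members = np.array([[1.0, 2.0, 3.0],
                    [3.0, 2.0, 1.0]])
print(multi_model_mean(members))                # uniform: [2. 2. 2.]
print(multi_model_mean(members, [0.75, 0.25]))  # non-uniform: [1.5 2.  2.5]
```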
(a) Many models improved their skill for NO2 in the phase II simulations, although no improvement occurred in the prediction capacity of the best model; (b) the model skill generally deteriorated for PM10 in phase II, shifting the NMSE distribution towards higher values; (c) no notable changes were seen for O3. The indirect feedback mechanisms available in phase II generally improved the simulation of meteorological drivers such as temperature, radiation and precipitation, which in turn improved the forecast of many atmospheric gases, while particulate matter and cloud processes require updated parameterizations (Brunner et al., 2015; Makar et al., 2015).
kzFO. The magnitude of improvement surpasses 22% (37% in phase II) at ten percent of the stations. The statistical distributions of all MSE_X/MSE_MME indicators (X = mmW, mme<, kzFO, kzHO) are well bounded from above by unity. The only exception occurs for roughly 10% of the stations, for all pollutants, where kzFO shows a higher MSE than mme. Unlike the other ensemble estimators, kzFO utilises independent spectral components, each obtained from a single model, which eliminates the possibility of 'cancelling out' random errors. All cases belonging to this 10% of the samples demonstrate a high N_EFF.
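The spectral components used by kzFO are obtained with the Kolmogorov–Zurbenko (KZ) filter, i.e. k iterated passes of a centred m-point moving average, where successive filter settings split a series into baseline and shorter-term components. The sketch below shows the mechanism only; the window choices and the synthetic hourly series are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def kz_filter(x, m, k):
    """Kolmogorov-Zurbenko filter: k passes of a centred m-point moving average."""
    y = np.asarray(x, dtype=float)
    kernel = np.ones(m) / m
    for _ in range(k):
        y = np.convolve(y, kernel, mode="same")  # note: edge values are attenuated
    return y

# hypothetical hourly series: baseline + diurnal cycle + noise
t = np.arange(24 * 60)
series = (40.0 + 10.0 * np.sin(2 * np.pi * t / 24)
          + np.random.default_rng(2).normal(0.0, 3.0, t.size))

baseline = kz_filter(series, 13, 5)   # retains fluctuations longer than a few days
short_term = series - baseline        # residual component, e.g. diurnal + noise
```

Because each spectral band can then be taken from a different (single) model, random errors in the recombined series no longer cancel across members, consistent with the behaviour of kzFO noted above.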

Figure 1. Comparison of the cumulative density functions of the observations (O3, NO2, PM10).

Figure 2. Model skill difference via the NMSE. On each box, the central mark indicates the median.

Figure 3. Model error dependence through the eigenvalue spectrum. The average explained

Figure 4. Comparison of the mme skill against the best local deterministic model by means of the indicator RMSE_MME/RMSE_BEST.

Figure 6. Like Figure 5 but showing the N_EFF with respect to skill difference and error dependence.

Figure 7. Like Figure 5 but for the mme< skill in the reduced ensemble. Please note the change in the colour scale.

Figure 10. The interquartile range over all stations of the day-to-day difference in the weights arising from variable time-series length.

Table 1. The forecasting systems and the evaluation network in Europe in the inter-comparison exercise of the AQMEII phases I and II: simulation models, number of rural stations and data coverage per species.

Table 3. The percentage of stations lying in various bins of the indicator RMSE_MME/RMSE_BEST, evaluated at each monitoring site for the examined species of the two AQMEII phases.

Table 4. The RMSE from the worst deterministic model to the optimum ensemble average, averaged over all stations. The worst and the best model have been evaluated at each site. The worst (best) deterministic model is the set containing the worst (best) time series at each station. All values have been normalized with the RMSE of the composite best deterministic model.

Table 5. The RMSE of mmW for various training lengths, calculated for the testing time series (i.e. not used in the training phase) that contains all stations. All values have been normalized with the RMSE of the composite best deterministic model.