Presentation of the EURODELTA III intercomparison exercise - evaluation of the chemistry transport models’ performance on criteria pollutants and joint analysis with meteorology

. The EURODELTA III exercise has facilitated a comprehensive intercomparison and evaluation of chemistry transport model performances. Participating models performed calculations for four 1-month periods in different seasons in the years 2006 to 2009, allowing the inﬂuence of different meteorological conditions on model performances to be evaluated. The exercise was performed with strict requirements for the input data, with few exceptions. As a consequence, most of differences in the outputs will be attributed to the differences in model formulations of chemical and physical processes. The models were evaluated mainly for background rural stations in Europe. The performance was assessed in terms of bias, root mean square error and correlation with respect to the concentrations of air pollutants (NO 2 , O 3 , SO 2 , PM 10 and PM 2 . 5 ), as well as key meteorological variables. Though most of meteorological parameters were prescribed, some variables like the planetary boundary layer (PBL) height and the vertical diffusion coefﬁcient were derived in the model preprocessors and can partly explain the spread in model results. In general, the daytime PBL height is underestimated by all models. The largest variability of predicted PBL is observed over the ocean and seas. For ozone, this study shows the importance of proper boundary conditions for accurate model calculations and then on the regime of the gas and particle chemistry. The models show similar and quite good performance for nitrogen dioxide, whereas they struggle to accurately reproduce measured sulfur dioxide concentrations (for which the agreement with observations is the poorest). In general, the models provide a close-to-observations map of particulate matter (PM 2 . 5 and PM 10 ) concentrations over Europe rather with correlations in the range 0.4–0.7 and a systematic underestimation reaching − 10 µg m − 3 for PM 10 . The highest concentrations are much more underestimated, particularly in wintertime. Further evaluation of the mean diurnal cycles of PM reveals a general model tendency to overestimate the effect of the

Abstract. The EURODELTA III exercise has facilitated a comprehensive intercomparison and evaluation of chemistry transport model performances. Participating models performed calculations for four 1-month periods in different seasons in the years 2006 to 2009, allowing the influence of different meteorological conditions on model performances to be evaluated. The exercise was performed with strict requirements for the input data, with few exceptions. As a consequence, most of differences in the outputs will be attributed to the differences in model formulations of chemical and physical processes. The models were evaluated mainly for background rural stations in Europe. The performance was assessed in terms of bias, root mean square error and correlation with respect to the concentrations of air pollutants (NO 2 , O 3 , SO 2 , PM 10 and PM 2.5 ), as well as key meteorological variables. Though most of meteorological parameters were prescribed, some variables like the planetary boundary layer (PBL) height and the vertical diffusion coefficient were derived in the model preprocessors and can partly explain the spread in model results. In general, the daytime PBL height is underestimated by all models. The largest variability of predicted PBL is observed over the ocean and seas.
For ozone, this study shows the importance of proper boundary conditions for accurate model calculations and then on the regime of the gas and particle chemistry. The models show similar and quite good performance for nitrogen dioxide, whereas they struggle to accurately reproduce measured sulfur dioxide concentrations (for which the agreement with observations is the poorest). In general, the models provide a close-to-observations map of particulate matter (PM 2.5 and PM 10 ) concentrations over Europe rather with correlations in the range 0.4-0.7 and a systematic underestimation reaching −10 µg m −3 for PM 10 . The highest concentrations are much more underestimated, particularly in wintertime. Further evaluation of the mean diurnal cycles of PM reveals a general model tendency to overestimate the effect of the PBL height rise on PM levels in the morning, while the intensity of afternoon chemistry leads formation of secondary species to be underestimated. This results in larger modelled PM diurnal variations than the observations for all seasons. The models tend to be too sensitive to the daily variation of the PBL. All in all, in most cases model performances are more influenced by the model setup than the season. The good representation of temporal evolution of wind speed is the most responsible for models' skillfulness in reproducing the daily variability of pollutant concentrations (e.g. the development of peak episodes), while the reconstruction of the PBL diurnal cycle seems to play a larger role in driving the corresponding pollutant diurnal cycle and hence determines the presence of systematic positive and negative biases detectable on daily basis.

Introduction
The ongoing project EURODELTA has very successfully extended the European Air Quality Modelling capability by providing a forum in which modelling teams could share experiences in simulating technically interesting and policyrelevant problems. The joint exercises contribute to further improving modelling techniques as well as quantifying and understanding the sources of model uncertainties related to the parameterization of processes and the quality of input data. EURODELTA is now an activity contributing to the scientific work of the UNECE (United Nations Economic Commission for Europe) Task Force on Measurement and Modelling (TFMM) under the Convention on Long-range Transboundary Air Pollution (CLRTAP). The TFMM was established in 2000 to provide a forum for the parties, the EMEP (European Monitoring and Evaluation Programme) centres and other international organizations for scientific discussions to evaluate measurements and modelling and to further develop working methods and tools. These are used for policy studies in support of the Gothenburg Protocol signed in 1999, which is a multi-pollutant protocol of the convention designed to reduce acidification, eutrophication and groundlevel ozone by setting emission ceilings for sulfur dioxide, nitrogen oxides, volatile organic compounds, fine particulate matter and ammonia.
In 2004, EURODELTA I (van Loon et al., 2007) examined the common performance of the chemistry transport models (CTMs) in predicting recent (2000) and future (2020) air quality in Europe using the concept of a model ensemble to measure robustness of predictions. The spread of model predictions about the ensemble mean gave a measure of uncertainty for each predicted value. In a 2020 world, the effect of making emission reductions for key pollutants in specific geographic areas was investigated. The pollutants were NO x (nitrogen dioxide), SO 2 (sulfur dioxide), VOC (volatile organic compound), PM (particulate matter as PM 10 and PM 2.5 for particle diameters below 10 and 2.5 µm, respectively) and NH 3 (ammonia). The countries were France, Germany and Italy. The effect of reducing NO x and SO x in sea areas was also investigated. Source-receptor relationships used in integrated assessment (IA) modelling were derived for all the models and compared to assess how model choice might affect this key input. EURODELTA II (Thunis et al., 2008) built on this project by taking a closer look at how the different models represent the effect on pollutant impacts on a European scale by applying emission reductions to individual emission sectors.
The new EURODELTA III (ED-III) exercise was designed to exploit and interpret intensive measurement campaigns carried out by EMEP . As far as possible, the models have been run in ED-III with the same input data (emissions, meteorology, boundary conditions) and over the same domain (domain extension and resolution). This distinguishes the study from other model intercomparisons. ED-III focused on four EMEP intensive measurement periods: The four different periods, within a rather limited number of years, allowed the influence of different meteorological conditions on model performance to be evaluated. The list of modelling teams participating in the ED-III is reported in Table 1. FUB ran two of the four periods. The ED-III framework (emissions, model configurations) was also used to assess the impact of the horizontal resolution on the performance of air quality models .
The ED-III exercise allowed a very comprehensive intercomparison and evaluation of chemistry transport model performance with a joint analysis of some meteorological variables to be made. A first evaluation of the 2009 campaign with an interim version of models was published in Bessagnet et al. (2014). Moreover, the selected periods coincide with EMEP intensive measurement periods so that an extended set of observational data was available. Therefore, in addition to EMEP operational monitoring data, size disaggregated (in PM 2.5 and PM 10 ) aerosol data and hourly measurements for studying diurnal cycles have been used. Additional AirBase data (Mol and de Leeuw, 2005) were used to evaluate the impact of meteorology on air pollutant concentrations. Finally, the exercise was performed under strict requirements (with some exceptions) concerning the input data. As a consequence, most of differences in the outputs will be attributed to the simulation of chemical and physical processes. The objective of this paper is 2-fold: (i) to present the exercise, the input data and the participating models, and (ii) to analyse the behaviour of models in the four campaigns focusing on the criteria pollutants PM 10 , PM 2.5 , O 3 , NO 2 and SO 2 as defined in the EU directive on air quality 2008/50/EC (EC, 2008) and relevant meteorological variables. Complementary analyses of deposition fluxes and PM composition data at high temporal resolution will be discussed in companion papers in order to better understand the behaviour of models.
2 Description of models 2.1 Overall description of models The models are synthetically described in Tables 2 and 3. All the models were run on the same domain at 0.25 • 0.25 • resolution in longitude and latitude except CMAQ. CMAQ simulations were performed on a Lambert-conformal conic projection with the standard parallels at 30 and 60 • and a grid of 112 × 106 cells of size 24 km × 24 km. The results of the CMAQ simulations were interpolated to the prescribed EURODELTA grid.
Participants delivered both air concentrations and meteorological parameters. Most of variables were delivered on an hourly basis, while dry and wet deposition fluxes were provided on a daily basis. The output species include, among others, O 3 , NO 2 and SO 2 , total PM mass concentrations both in 10 and 2.5 µm fractions (PM 10 and PM 2.5 ). Secondary inorganic aerosols such as ammonium (NH + 4 ), sulfate (SO 2− 4 ) and nitrate (NO − 3 ) and other PM components relevant for the analysis as well as wet deposition of sulfur and nitrogen compounds were also collected and will be used in companion papers. The delivered air concentrations should approximately correspond to the standard measurement height (typically 3 m) and were directly derived from the first model layer, except for LOTOS-EUROS and EMEP that corrected the concentrations from the first layer to be representative of 3 m concentrations. The PM 2.5 and PM 10 concentrations are calculated as follows in each model: where xx = 2.5 or 10 µm; PPM stands for primary particulate matter and includes elemental carbon, primary organic aerosol and primary non-carbonaceous aerosol; SOA represents secondary organic aerosol; and sea salt and dust represent the contribution of the corresponding natural processes mainly controlled by the wind speed. The participating models differ in the availability of PM components and formation routes. For instance, EMEP, LOTOS-EUROS and RCG contain coarse-mode nitrate formation (produced by reaction of nitric acid with sea salt and dust), whereas the others do not. In CMAQ, additional anthropogenic dust is calculated as 90 % of unspecified PM coarse emissions and attributed to fugitive dust (Binkowsky and Roselle, 2003). CAMx did not activate the parameterization of sea salt in this exercise.
Based on the setup of models and completeness of datasets, an ensemble called ENS has been built based on  ) CHIM 2006HZG CMAQ (Byun and Schere, 2006Matthias et al., 2008) CMAQ 2006MSC-W -Met.NO EMEP (Simpson et al., 2012) EMEP 2006TNO LOTOS-EUROS (Sauter et al., 2014) LOTO 2006ENEA/ARIANET MINNI (ARIANET, 2004) MINNI 2006FUB RCG (Stern et al., 2006 (1970) has been used to derive K z profiles as Eq. (1): where K z is a value of K A at the height of the atmospheric boundary layer z A , and K B at the height of the surface layer z B , the so-called constant-flux layer. Minimum K z values have been set to 1 m 2 s −1 . Any values of K z calculated below will be set to this value. By default, CAMx employs a standard K-theory approach for vertical diffusion to account for sub-grid-scale mixing layer to layer.

CHIMERE
In this study, the PBL is directly taken from the ECMWF IFS data. Horizontal turbulent fluxes were not considered. Vertical turbulent mixing takes place only in the boundary layer. The formulation uses K diffusion following the parameterization of Troen and Mahrt (1986), without a counter-gradient term. In each model column, diffusivity K z is calculated as Eq. (2): where w s is a vertical velocity scale given by similarity formulae.
-In the unstable case: w s = u 3 * + 2.8ew 3 * 1/3 , where e = max(0.1, z/ h), L is the Monin-Obukhov length, w * is the convective velocity scale, u * the friction velocity and h the boundary layer height. The minimum value of K z is assumed to be 0.01 m 2 s −1 . K z and the wind speed were corrected in urban zones according to Terrenoire et al. (2015) by applying a correction factor to limit the diffusion within the urban canopy, but this correction has very little effect at this resolution.

CMAQ
The boundary layer height in COSMO is calculated with the turbulent kinetic energy (TKE) method (Doms et al., 2011). CMAQ directly used the PBL fields from COSMO.
In CMAQ, the vertical turbulent mixing is estimated using the Asymmetric Convective Model scheme version 2 (ACM2; Pleim, 2007a, b). The ACM2 replaces the simple eddy viscosity (K-theory) scheme. The ACM2 scheme allows the non-local mixing, which means upward turbulent mixing from the surface across non-adjacent layers through the convective boundary layer. Pleim (2006) compared the eddy viscosity and the ACM2 schemes in CMAQ, finding that the ACM2 scheme tends to predict larger concentrations of secondary pollutants and smaller concentrations of primary pollutants at the surface, and has a more well-mixed profile in the PBL than the eddy viscosity scheme.

EMEP
The mixing height is calculated using a slightly modified Richardson number (Ri B ) following Jeričevič et al. (2010) and defined as the lowest height at which the Ri B > 0.25. Finally, the PBL is smoothed with a second-order Shapiro filter in space. The PBL height is not allowed to be less than 100 m or exceed 3000 m. The initial calculation of the vertical exchange coefficients is done using the Ri number and wind speed vertical gradient for the whole domain. Then, K z values within the PBL are recalculated based on Jeričevič et al. (2010) for stable and neutral conditions. For unstable situations, K z is calculated based on the similarity theory of Monin-Obukhov for the surface layer, whereas K z profiles from O'Brien (1970) are used for the PBL above the surface layer. For more details, see Simpson et al. (2012).

LOTOS-EUROS
The first model layer is, by definition, the mixing layer, with height equal to the boundary layer height as given by ECMWF. Horizontal diffusion is not used, but for vertical mixing the vertical diffusion coefficient is calculated according to Eq. (3): where κ is the von Karman constant; u * the friction velocity; the functions proposed by Businger (1971) for stable, neutral or unstable atmosphere; z the height and L the Monin-Obukhov length. The friction velocity is calculated depending on the wind at reference height (10 m), the Businger functions and the roughness length per land use class. The vertical structure of LOTOS-EUROS is determined by the mixing layer height, with a shallow surface layer (25 m) to avoid mixing of near-surface emissions that is too fast and a second layer equal to the mixing layer as given by ECMWF.

MINNI
In MINNI, the friction velocity and Monin-Obukhov length are determined by using the Holtslag and van Ulden (1983) iterative scheme for unstable conditions and the Venkatram (1980) iterative method for stable conditions. Micrometeorological parameters over water are derived with the profile method, using air-sea temperature difference (Hanna et al., 1985), with the needed roughness length, depending on wind speed, supplied by the Hosker (1974) parameterization.
During daytime, both convective and mechanical heights are determined, keeping then the maximum value between the two parameters. The convective height is calculated following the Maul (1980) version of the Carson (1973) algorithm, essentially based on the heat conservation equation. The mechanical mixing height is estimated by using the Venkatram (1980) algorithm. During nighttime, the bulk Richardson number method is applied (Sorensen, 1998), in which the height of the boundary layer is given by the smallest height at which the bulk Richardson number reaches the critical value fixed to 0.25.

RCG
The mixing layer depth in the model is the height of the layer closest to the input boundary layer height taken from the ECMWF IFS data. Vertical diffusion parameters for stable and unstable conditions are derived using the Monin-Obukhov similarity theory for the diabatic surface layer. The friction velocity and Monin-Obukhov length are calculated iteratively depending on the 10 m wind, the stability correction factors and the roughness length determined from land use.
3 Input data

Anthropogenic emissions
The first step in the emission preparation was to calculate the spatial pattern of emissions for the reference year 2007, which was selected because it was a key year for the TNO-MACC inventory (Kuenen et al., 2011). The anthropogenic emission input was harmonized following the methodology described in Terrenoire et al. (2015). The total emissions per sector and country were then scaled to the corresponding year of the campaigns: 2006, 2007, 2008 and 2009. Emission categories are broken down into 11 classes called SNAP (selected nomenclature for air pollutants): (1) public power stations, (2) residential and comm./inst. combustion, (3) industrial combustion, (4) production processes, (5) extraction and distribution fossil fuel, (6) solvents use, (7) road traffic, (8) other mobile sources (trains, shipping, aircraft, etc.), (9) waste treatment and (10) agriculture. Natural emissions (11) were calculated by the models as set out in Sect. 3.2.
The gridded distribution of anthropogenic emissions was provided by INERIS and it was based on a merging of different databases from Kuenen et al., 2011); (Vestreng et al., 2007); emission data from the GAINS database (http://gains. iiasa.ac.at/gains).

Emission regridding was based on INERIS expertise and performed by means of various proxies:
population data coming from the EEA database merged with global data (from the Socioeconomic Data and Applications Center http://sedac.ciesin.columbia.edu) to fill gaps in Europe; the US Geophysical Survey land use at 1 km resolution (http://www.usgs.gov/); -French bottom-up emission data for wood combustion to derive a proxy based on population density; -EPER data for industries. The EPER decision is based on Article 15(3) of Council Directive 96/61/EC (EC, 1996) concerning integrated pollution prevention and control. EPER is a web-based register which enables the public to view data on emissions of 50 key pollutants to water and air from large and medium-sized industrial point sources in the European Union.
The TNO-MACC dataset provides two distinct datasets: (i) large point sources (LPSs) with the coordinates of stacks and (ii) surface emissions on a fine grid (0.125 • × 0.0625 • ). In the gridding process, the first step consisted in summing up LPS emissions from the TNO-MACC emissions inventory for 2007 with surface emissions to obtain total emissions as in the EMEP inventory. LPSs were aggregated with surface emissions because no data were available to calculate plume rise heights for point sources emissions. For the various SNAP sectors, the processing steps were the following: -SNAP 2: The country emissions were regridded with coefficients based on population density and French bottom-up data; the methodology  was extrapolated to all of Europe. For PM 2.5 emissions, the annual EMEP national totals were kept except in the following countries: Czech Republic, Bosnia and Herzegovina, Belgium, Belarus, Spain, France, Croatia, Ireland, Lithuania, Luxembourg, Moldova, Republic of Macedonia, the Netherlands, Turkey. For these countries, PM 2.5 emissions from GAINS were used as this database provides higher numbers and certainly more realistic since wood burning is known to be underestimated in the EMEP database (Denier van der . Additional factors were applied on two Polish regions for both PM 2.5 and PM 10 emissions. As a preliminary solution, domestic combustion emissions from provinces with active coal mines were multiplied by a factor of 8, while those in neighbouring provinces were adjusted by a factor of 4 (Kiesewetter et al., 2015). The former activity in coal mine regions still leads to high emissions of PM due to domestic uses of coal.
For countries where TNO-MACC emissions were not available, the EMEP 0.5 • × 0.5 • emissions were used (Iceland, Liechtenstein, Malta and Asian countries) and regridded with adequate proxies (artificial land use, EPER data for industries).
The following emitted species were used in the models: methane (this species comes from the TNO-MACC inventory), carbon monoxide, ammonia, sulfur oxides, nonmethane volatile organic compounds (NMVOCs), nitrogen oxides, primary particulate matter.
Residential emissions of particulate matter are dominant in wintertime. In most countries, they come from wood-burning or coal uses. Germany, Sweden, Spain clearly have the lowest levels of PM 2.5 emissions for this activity sector. Romania, Poland and France have the highest levels of annual total emissions per country . For this activity sector, the PM 2.5 emissions by components are provided in the Supplement S8.
The time profiles are those used in Thunis et al. (2008). Three types of profiles were provided: -Seasonal factors with one value per species, month, activity sector and country.
-Weekly factors with one value per species, day of the week (Monday-Sunday), activity sector and country.
-Hourly factors with one value per hour (local time), species and activity sector.
The vertical injection profile in CTMs was prescribed according to Bieser et al. (2011) where industrial sectors and residential heating were assigned in lower levels compared to the lower vertical levels than other literature default profiles . Since only PM 2.5 and coarse PM emissions were provided by EMEP, a PM speciation profile provided by IIASA (based on Klimont et al., 2013) was used to estimate the fraction of non-carbonaceous species, elemental carbon and organic matter per activity sectors and country. Models used their own split for NO x , SO x and NMVOC emissions. These emission inventories did not account for recent changes in the way to measure semi-volatile organic compounds from wood-burning emissions as discussed in Denier van der .
The full emission dataset is available upon request to INERIS. www.atmos-chem-phys.net/16/12667/2016/ 2012). The model of emissions of gases and aerosols from nature (MEGAN) is a modelling framework for estimating fluxes of biogenic compounds between terrestrial ecosystems and the atmosphere using simple mechanistic algorithms to account for the major known processes controlling biogenic emissions. It is available as an offline code and has also been coupled into land surface and atmospheric chemistry models. EMEP, LOTOS-EUROS and RCG used parameterizations derived from Simpson et al. (1999) for the temporal variations according to temperature and light, with maps of tree species from Koeble and Seufert (2001).

Natural
CMAQ used the BEIS (Biogenic Emission Inventory System; Vukovich and Pierce, 2002) module developed by the US EPA. BEIS estimates volatile organic compound (VOC) emissions from vegetation and nitric oxide (NO) and carbon monoxide (CO) emissions from soils. Because of resource limitations, recent BEIS development has been incorporated into the Sparse Matrix Operational Kernel Emissions (SMOKE) system (available at https://www.cmascenter.org/ smoke), so that the native version of BEIS is built within the SMOKE architecture.

Soil nitrogen monoxide (NO) emissions
CHIMERE and MINNI used version 2.04 and CAMx used version 2.1 of the MEGAN model to calculate the NO emissions. RCG used a parameterization of NO emissions described in Simpson et al. (1999). LOTOS-EUROS did not include NO emissions in this simulation. CMAQ used the BEIS module. The soil NO emission parameterization for EMEP is described in Simpson et al. (2012).

Sea salt emissions
All models host very different schemes based on Monahan (1986) with some updates from Martensson et al. (2003) for LOTOS-EUROS, and Gong et al. (1997) for RCG. CMAQ and MINNI used the Zhang et al. (2005) parameterization and CAMx had no sea salt for this exercise due to uncertainty that is too high in sea salt parameterization. EMEP used parameterization from Monahan (1986) for larger sizes of sea spray and Martensson et al. (2003) for smaller sizes.
CMAQ also emits sea salt sulfate using a fraction of 7.76 % of emitted sea salt split into the accumulation and coarse modes.

NO emissions from lightning
The only model to describe NO emissions from lightning is the EMEP model, following Köhler et al. (1995).

Wildfire emissions
Fire emissions were provided by the GFASv1.0 database (Kaiser et al., 2012) only for the 2006 campaign. The Global Fire Assimilation System (GFASv1.0) calculates biomass burning emissions by assimilating fire radiative power (FRP) observations from the MODIS instruments onboard the Terra and Aqua satellites. It corrects for gaps in the observations, which are mostly due to cloud cover, and filters spurious FRP observations of volcanoes, gas flares and other industrial activities. For all models, the wildfire emissions were assigned in the whole PBL layer. Only the following species were taken into account: CO, CH 4 , NO x , SO 2 , PM 2.5 , TPM (total primary matter), OC (organic carbon) and EC (elemental carbon).

Dust emissions
For CAMx, CHIMERE and CMAQ, no natural dust module is activated for this exercise. For these three models, natural dust only comes from the boundary conditions. For EMEP, windblown dust parameterization is documented in Simpson et al. (2012), road dust calculations are included in the calculations from Denier van der Gon et al. (2010). LOTOS-EUROS contains emission parameterizations for several sources of mineral dust . Only wind-blown dust, resulting from wind erosion of bare soils, was taken into account here, together with dust from boundary conditions. Other sources (agricultural activities, road dust resuspension) were not activated in ED-III. In MINNI, dust emissions from local erosion and particle resuspension (Vautard et al., 2005) with attenuation in the presence of vegetation from Zender et al. (2003) is activated in this exercise. RCG considers resuspension of mineral aerosol as a function of friction velocity and the nature of soils. Two mechanisms are treated: direct release of small dust particles by the wind (Loosmore and Hunt, 2000), and indirect release by collision with bigger soil grains that are lifted by the wind but return to the surface by sedimentation (saltation process from Claiborn et al., 1998).

Meteorology
All models except CMAQ and RCG share the same meteorological dataset at 0.2 • resolution based on ECMWF IFS (Integrated Forecast System) calculations.
Because of its importance for applications (e.g. in air pollution modelling), the boundary layer height, as diagnosed in the ECMWF IFS model, was made available. The parameterization of the mixed layer (and entrainment) uses a boundary layer height from an entraining parcel model. But, in order to get a continuous field, in neutral and stable situations the bulk Richardson method proposed by Troen and Mahrt (1986) is used as a diagnostic, independently of the turbulence parameterization. Boundary layer height is defined as the level where the bulk Richardson number, based on the difference between quantities of energy at that level and the lowest model level, reaches the critical value Ri cr = 0.25.
For RCG, a different meteorological dataset was used. The 3-D data for wind, temperature, humidity and density were produced employing a diagnostic meteorological analysis system developed at Freie Universität (Berlin, Germany) and based on an optimum interpolation procedure on isentropic surfaces. The system takes into account all available observed synoptic surface and upper air data as well as topographical and land use information (Reimer and Scherer, 1992). Rain data, cloud data and boundary layer heights were retrieved from the IFS dataset. Boundary layer parameters as friction velocity and Monin-Obukhov length were calculated on-the-fly by applying standard boundary layer theory.
The CMAQ model used meteorological variables calculated with the COSMO model in CLimate Mode (COSMO-CLM) version 4.8 CLM 11. The COSMO model is the nonhydrostatic operational weather prediction model applied and further developed by the national weather services joined in the COnsortium for SMall scale MOdeling (COSMO) described in Bettems (2015).

Boundary conditions
In this study, the MACC reanalysis was used as input data for the boundary conditions (Inness et al., 2013;Benedetti et al., 2009). The MACC II project (Modelling Atmospheric Composition and Climate) established the core global and regional atmospheric environmental service delivered as a component of the Copernicus initiative (http://copernicus. eu/). The reanalysis production stream provides analyses and 1-day forecasts of global fields of O 3 , CO, NO 2 , SO 2 , HCHO, CO 2 , CH 4 and aerosols. Other reactive gases are available from the coupled chemistry transport model. The reanalysis covers the period 2003-2011 with a 1-month spinup. It runs at approximately 78 km × 78 km horizontal resolution over 60 levels. The coupled chemistry transport model has the same 60 vertical levels and a horizontal resolution of 1.125 • × 1.125 • . For aerosols only elemental carbon, organic carbon, dust and sulfate were used.
Stratospheric ozone fields from the MACC reanalysis agree with ozone sondes and ACE-FTS (atmospheric chemistry experiment Fourier transform spectrometer) data within ±10 % in most seasons and regions. In the troposphere, the reanalysis shows biases of −5 to +10 % with respect to ozone sondes and aircraft data in the extratropics, while larger negative biases are shown in the tropics. Areaaveraged total column ozone agrees with ozone fields from a multi-sensor reanalysis dataset within a few percent. For aerosols, the observed Aerosol Optical Depth (AOD) is assimilated in the model with a feedback on individual PM species (sea salt, dust, elemental carbon, organic carbon and sulfate). When available, the MACC reanalysis is compared with observations, the model acronym in the supporting material is MACCA.

Air pollutant concentrations
The evaluation was carried out with the available EMEP standard monitoring (Tørseth et al., 2012) and intensive period observations for  on hourly and daily bases (see Supplement S8 for the description of background sites). Elevated sites above 1500 m in altitude have been excluded from the analysis. The measurements were downloaded from the EBAS database (http://ebas.nilu.no/). Additional AirBase data (Mol and de Leeuw, 2005) were used to evaluate the impact of meteorology on air pollutant concentrations in Sect. 7.2.
It is important to note that daily measurement for a day, N , is the averaged value between day N HH:00 and day N +1 HH:00, with HH usually varying in the range [00,09] in GMT. For most of the species, measurements on a daily and hourly basis are not necessarily performed for the same set of stations. Deposition and the PM composition are also available; the dataset will be detailed in the companion papers.

Temperature and wind speed
The temperature, wind speed and precipitation measurements come from 2016 synoptic stations in Europe reported by the European meteorological centres. The data are provided on an hourly basis. The temperature is measured at 2 m and the wind speed at 10 m. Some meteorological data are also reported at some EMEP sites. At EMEP sites, daily accumulated measurements (e.g. precipitation) for a day N represent the integral between day N at HH:00 and day N+1 at HH:00, with HH usually varying in the range [00, 09] in GMT.

PBL height
The sounding data were extracted from the University of Wyoming database (http://weather.uwyo.edu/). For each site and each day, two soundings are available at 00:00 and 12:00 GMT. The provided meteorological parameters are pressure (hPa), the corresponding height above ground level (m), dew point temperature ( • C), relative humidity (%), mixing ratio (g kg −1 ), wind direction (degrees) and wind speed (expressed in knots and converted to m s −1 by applying the conversion factor 0.514), potential and virtual potential temperature (K). For the present study, data were extracted over 77 stations in Europe. The boundary layer height is estimated using the calculation of the bulk Richardson number profile and searching for the altitude where the critical value of Ri cr = 0.25 is reached. The analysis was limited to the first 25 vertical points, roughly corresponding to an altitude of 5000 m above ground level. Since the boundary layer Atmos. Chem. Phys., 16, 12667-12701, 2016 www.atmos-chem-phys.net/16/12667/2016/

Mean bias
height is a concept valid only for convective periods, only the soundings of 12:00 GMT were analysed and used for the model evaluation.
In addition to the previous PBL data, hourly heights of the atmospheric boundary layer were calculated from lidar measurements in a background site near Paris (SIRTA in Palaiseau, France). A new objective method for the determination of the atmospheric boundary layer depths using routine lidar measurements has been used (Pal et al., 2013).

Error statistics for the evaluation of model performances
The errors statistics considered in this report are presented in Table 4. In the Supplement S0-S1 the performances of all models for the four campaigns are reported. For a given pollutant or meteorological variable, model performance is computed for a common set of stations (over the same common geographic area). All maps of pollutant concentrations and meteorological variables concerning individual models and ensemble are provided in the Supplement (Sects. S2-S6). For the analysis of the ensemble, a coefficient of variation (VAR) is defined as follows in Eq. (4): where C m is the concentration of individual model m included in the ensemble (CHIMERE, LOTOS-EUROS, MINNI and EMEP), M is the number of models and C ENS is the ensemble mean concentration.

Evaluation of the meteorology
Some general features for each campaign can be provided; they are taken from the NOAA (National Oceanic and Atmospheric Administration) global analysis (https://www.ncdc. noaa.gov/sotc/global/). June 2006 temperatures were above average everywhere in Europe with low precipitation except in Balkan countries and Spain compared to the 1961-1990 base period. January 2007 was characterized by windy conditions in Europe with temperatures above the average everywhere except in Spain, where temperatures were close to the average values. In the beginning of February, temperatures were particularly low in Scandinavia. Precipitation was low over the Mediterranean basin but above the climate average, compared to the 1961-1990 period in the rest of Europe.
In September-October 2008, no clear general characteristics were recorded; this transition period was characterized by slight negative temperature anomalies over the western part of Europe, mainly France, the United Kingdom and north of Spain.
After some cold spells in the end of February, March 2009 turned milder with average warmer temperatures compared to the 1961-1990 base period. Precipitation was below average in the western part of Europe and above average in the central and eastern parts of Europe.

The 2 m temperature
As summarized in the Supplement S0, the models using ECMWF data show comparable high temporal correlation www.atmos-chem-phys.net/16/12667/2016/ Atmos. Chem. Phys., 16, 12667-12701, 2016 Correlations are lower for all models over north of Italy and Austria. On average for the considered period, the bias is negative for all models in the range [−0.3 K, −0.7 K] for CAMx, CHIMERE, EMEP and LOTOS-EUROS. The negative bias for this group of models is more important for the two wintertime campaigns; however, in Switzerland and Austria this bias exceeds −2 K for all campaigns. Since this group of models shares the same meteorology, the error statistics are very similar; the discrepancies are due to the different interpolation methods used to regrid the 3-D and 2-D ECMWF variables to the final CTM grid. RCG displays a very low absolute bias close to zero for the 2009 campaign, and CMAQ displays the lowest negative bias up to −2 K for the 2009 campaign. CMAQ has a lower correlation coefficient particularly in Germany and Poland for the 2008 and 2009 campaigns.
As displayed in Fig. 1, the negative bias is driven by afternoon temperatures that are underestimated by all models; this statement is valid for all campaigns. The nighttime temperatures are more in line with the observations. The RCG diurnal cycle is rather different with a flatter profile, but for the other models using ECMWF or COSMO data, the general pattern is well captured.

The 10 m wind speed
All the models using ECMWF data overestimate the wind speed from +0.1 to +0.9 m s −1 , while CMAQ, driven by COSMO, showed on average the lowest absolute bias. The biases are the highest for the two winter (2009)  The bias is generally higher in eastern and northern Europe than in western and Mediterranean areas. In Europe, the spatial pattern of biases shows high positive bias in several coastal areas and negative bias in mountainous areas (Alps). This clearly points out a problem in some regions for the calculation of some emissions directly relying on IFS U10 fields. According to Ingleby et al. (2013), ECMWF 10 m wind speeds are slightly overestimated especially at night. In the IFS, only 10 m winds from ocean-bound ships are used in the data assimilation due to problems with station representa-Atmos. Chem. Phys., 16, 12667-12701, 2016 www.atmos-chem-phys.net/16/12667/2016/ tiveness for inland sites. Moreover, errors in wind speed measurements are higher for low winds. For the lowest winds, the comparison of the predicted diurnal cycle with observations shows a larger positive bias at night than during the afternoon (Fig. 1), this behaviour could lead to an overestimation of the advection process in the chemistry transport models. Time correlations are better for models using ECMWF data but all models exhibit low correlations over the Alpine region (north of Italy, southeast of France, Switzerland and Austria). The RCG model shows higher correlation coefficients over northern Europe (Finland and Sweden) for the 2009 campaign.

PBL and mixing
As explained in Sect. 4.2, the observed PBL height was calculated at 12:00 GMT because of methodology hypotheses, except at the SIRTA site where hourly measurements are available for 2008 and 2009. All models have a negative bias. The lowest RMSEs are shown for CAMx and CHIMERE which use the ECMWF PBL; the biases are in the range −237 and −100 m for these two models. It is worth noting that CAMx and CHIMERE exhibit exactly the same performance, while LOTOS-EUROS and EMEP, which adopted IFS PBL too, show partially different performances. Some differences are attributed to different interpolation schemes and the use of minimum PBL values during nighttime as for EMEP. The largest underestimation of the PBL height is usually found for MINNI particularly for the 2006 campaign (up to −616 m) and EMEP (up to −451 m) and the correlation coefficients for these models are lower compared to the others. CMAQ has the lowest bias for most of campaigns. Models using IFS PBL data showed the best performance for temporal correlation (see Supplement S0), the main discrepancies are observed for the 2006 campaign with several sites in Europe with negative correlations. The largest negative biases are observed in the south of the domain; in these regions CMAQ performs better. In some regions over the Mediterranean basin, particularly in coastal areas, the MINNI's PBL is sometimes strongly biased up to −1000 m. The obtained results suggest that either the Carlson algorithm or the micro-meteorological parameterization implemented by MINNI tends to underestimate the intensity of convection.
The spatial representation of the PBL for the 2009 campaign shows higher differences between the models mainly over the ocean and seas where the coefficient of variation reaches 40 % in some areas (Fig. 2). While LOTOS-EUROS, CHIMERE, RCG and CAMx use the PBL from IFS with some differences in spatial and time interpolations, the other models use their own parameterizations discussed in Sect. 2.2. The diurnal cycles displayed in Fig. 2 show that MINNI simulates a higher PBL at night and a lower PBL during daytime compared to ECMWF. The difference in the afternoon PBL is quite important over countries influenced by the ocean like Great Britain. CMAQ and EMEP simulate the highest PBL at night over France and Great Britain. The hourly times series at the SIRTA site confirm the underestimation of the ECMWF PBL but at this station, the negative bias of MINNI is of the same order of magnitude as those of the other models. The correlations based on hourly values are somewhat lower for CMAQ, EMEP, MINNI (below 0.50) compared to the models using ECMWF data.
The differences in treatment of advection and mixing as reported in Sect. 2.2 can lead to differences in the reconstruction of pollutant dispersion. Figure 3 shows the mean coefficient of variation of CO concentrations predicted by the models sharing the same raw meteorology (IFS) for the 2006 campaign. This pollutant can be considered a tracer with low influences of deposition and chemistry processes, most of the differences in concentrations are related to transport and mixing. The figure clearly shows that mixing in emission areas, such as big cities, produces the highest differences exceeding 20 % of variations. The next highest coefficients of variation are observed over the seas and ocean, which are related to the differences of PBL predicted by the models (Fig. 2); elsewhere, this coefficient remains below 10 %. Correlations are similar for all models in the range 0.5-0.6, only CMAQ has lower correlations, on average. For the summertime campaign in 2006 CHIMERE and CMAQ display the lowest correlation for daily averaged concentrations but CHIMERE has the lowest bias with EMEP. The low correlation for CMAQ and CHIMERE is due to the difficulties in reproducing both spatial patterns and day-to-day variations. For this campaign, most models underestimate concentrations in the mountainous regions in Spain and over the Alps (Fig. 5). The models tend to overpredict ozone concentrations in background stations influenced by large urban areas like GR01 station in Greece and IT01 close to Rome. All models simulate high ozone concentrations over the Mediterranean Sea; most of them behave satisfactorily in Malta and Cyprus stations in agreement with the ozone concentrations pattern over the seas for the ensemble shown in Fig. 5 and particularly in Malta (Nolle et al., 2002). The  and CHIMERE use exactly the same PBL height of IFS, but nighttime performances of the two models are rather different. In Fig. 5, the right side is the gridded coefficient of variation that is a standardized measure of the dispersion of model results. It is defined as the ratio of the standard deviation to the mean of models. This coefficient is very low for the 2006 campaign, below 10 %, and the models have different responses along the ship tracks. The coefficients of variation are the highest for the 2007 campaign (Supplement S2) associated with low performances of the ensemble (high normalized root mean square errors). France, Spain and Norway show the lowest coefficients of variation, indicating a more coherent behaviour among the models, but not necessarily corresponding to better model performance than other areas. At Mace Head (IE31) located on the western part of the domain, the time series of model results vs. ozone observations shows flat shape for the two winter campaigns with very low time correlations in 2009 (Fig. 7). The best correlation coefficients are observed for 2006 and 2008; the models are able to capture the peaks. At this station, the negative bias mentioned in 2009 is roughly the same for LOTOS-EUROS, MINNI and RCG and comparable to the MACC analysis (−20 µg m −3 ), the other models CAMx, EMEP, CHIMERE and CMAQ have a lower absolute bias (about −10 µg m −3 ). This behaviour shows that concentrations close to boundary conditions are quickly modified, certainly because the regional models restore their own chemical equilibrium in relation to dynamical processes like deposition and vertical dispersion.

Nitrogen dioxide
For NO 2 , all models perform similarly in terms of correlation with values in the range 0.6-0.7 ( Fig. 4 and Supplement S1). The spatial correlation is much higher in the range 0.7-0.9 for all models. Only CMAQ strongly overestimates the mean concentrations and CAMx underestimates the con-centrations for all campaigns. Bessagnet et al. (2014) showed rather low concentrations of elemental carbon compared to other models; this inert species is particularly sensitive to vertical mixing and CAMx presents the highest minimum diffusion coefficient that is of major importance during stable conditions and partly explains the lower NO 2 concentrations.  For CAMx, the enhanced mixing influences also O 3 concentrations that are higher than in other models. The spatial pattern of the ensemble shown for 2009 (Fig. 8) displays high concentrations over the Benelux region, north Italy, the biggest cities and over the shipping tracks. The bias of the ensemble is rather good except for one station in Serbia (RS05) with high observed values, probably due to local sources. The gridded coefficients of variation provided in Fig. 8 show that most of differences between models are observed over remote areas far from emission regions even if errors are expected to occur more frequently for low values. As shown for a less reactive species like CO, the differences of mixing in models over emission areas can lead to large differences in modelled concentrations. This effect can be clearly seen over the east Mediterranean for maritime emissions where the PBL is different from model to model. Over land, the NO 2 chemistry and the different biogenic NO emission modules in the models are believed to explain a large part of the differences in NO 2 concentrations far from urban areas. As shown in Fig. 8, the root mean square errors of the models are the highest for the stations close to the emission areas. The diurnal cycles in Fig. 9 show a general underestimation during the afternoon. It should be pointed out that the observed NO 2 concentrations can be slightly overestimated. For some types of analyzers, NO 2 is catalytically converted to NO on a heated molybdenum surface and subsequently measured by chemiluminescence after reacting with ozone. The drawback of this technique is that other oxidized nitrogen compounds such as peroxyacetyl nitrate and nitric acid are also partly converted to NO (Steinbacher et al., 2007). In the observations, the presence of two peaks in NO 2 concentrations is related to the traffic emission peaks occurring in the morning and the evening. The timing of the peak occurrences is also modulated by the meteorology: for the 2006 and 2008 campaigns performed with identical summer time shift, we clearly see a time shift of +1 and −1 h, respectively, for the morning and evening peaks corresponding to a later rise and earlier fall of the PBL. Thus, as expected, the narrowest time lag between the two peaks is observed for the 2007 campaign. Most of the models predict the first peak too early, particularly CHIMERE and CMAQ for the 2006 campaign, and the second peak generally occurs too late.
CMAQ shows the strongest nighttime bias that contributes to explain the overall overestimation shown by the model in all campaigns. CMAQ was driven by a different meteorology that was characterized by very good performance with respect to both wind speed and PBL height mean bias. Conversely, IFS-driven models overestimated nighttime wind speed. As nighttime vertical mixing is mainly driven by mechanical forces, the model results suggest that models tend to underestimate mixing during stable conditions and, as a consequence, IFS-driven models show better results, suggesting compensation processes.

Sulfur dioxide
The correlations are rather low for all models in the range from 0.2-0.4 for the 2006 campaign to 0.5-0.6 for the 2007 campaign ( Fig. 4 and Supplement S1 for all statistics). Two groups of models are identified CAMx, MINNI and RCG that largely overestimate the concentrations and CHIMERE, CMAQ, EMEP and LOTOS-EUROS which are closer to the observations on average with the best performances for the RMSE. The overestimation in the MINNI model could be partially explained by the low model PBL height. For CAMx, possible reasons such as the vertical distribution of SO 2 emissions near the harbours and coastal areas, insufficient conversion to sulfate and deposition that was too low were discussed in Ciarelli et al. (2016). This leads to a positive bias of the ensemble as shown in Fig. 10 (Supplement S4) particularly in western Europe; the normalized RMSE is frequently above 100 % in most parts of Europe. The main hot spots are located in eastern Europe, in addition to high concentrations along the shipping routes. The coefficient of variation is the lowest over emission areas but very high in remote areas like the oceans far from shipping tracks and over mountain areas. This behaviour, very different from a primary species like CO, is a first indication of the very different way to simulate the SO 2 chemistry and deposition processes in the models.
The diurnal cycles presented in Fig. 11 show a peak at about 10:00-12:00 GMT. This peak corresponds with the hourly emission profiles of the industrial sector showing an emission peak at the same hours; however, most of models predict a larger decrease in the afternoon. Only CMAQ for the 2007 campaign captures satisfactorily the diurnal profile.

PM 10
Concerning the RMSE, on average the performances of the models are similar except CMAQ which has the highest values driven by low correlations and high negative biases particularly for the 2006 campaign (Fig. 4). All models underestimate the concentrations generally in the range −3 to −10 µg m −3 . Except CMAQ, the correlations are in the range 0.4-0.6, but CHIMERE and EMEP reach 0.7 for the 2006 campaign. MINNI has the lowest absolute biases for the 2007, 2008 and 2009 campaigns. The ensemble provides a good picture of the PM 10 concentrations in Europe ( Fig. 12 and Supplement S5) except for two stations (IT01 in Italy and CY02 in Cyprus) with high recorded values. For CY02, high PM 10 concentrations are linked to high calcium concentrations (Bessagnet et al., 2014) due to dust events issued from north Africa. This dust event can be clearly observed for EMEP in Fig. 14. The spatial patterns show low concentrations below 5 µg m −3 in remote Scandinavia and three hot spots in the Po Valley, Benelux and south Poland. The coefficient of variation of model results is rather high over the seas and arid areas as well as over areas influenced by biogenic emissions as in Scandinavia. This coefficient is generally the lowest over the western Europe. The best RMSEs of the en-Atmos. Chem. Phys., 16, 12667-12701, 2016 www.atmos-chem-phys.net/16/12667/2016/ Figure 9. Mean diurnal cycles of nitrogen dioxide for all campaigns simulated by the models compared with observations. Averaged concentrations are provided on the right side of the charts.  (Table 5), EMEP captures the high concentrations better in the south of the domains, whereas CHIMERE performs better over the Benelux region (Supplement S5). In 2008, RCG has particularly good spatial correlation compared to the other models. The missing sea salt emission for CAMx is Figure 11. Mean SO 2 diurnal cycles for all campaigns simulated by the models compared with observations. Averaged concentrations are provided on the right side of the charts. clearly observed over the ocean with very low PM 10 concentrations impairing the spatial correlations.
As shown in the Supplement S5, most of models underestimate the highest PM 10 concentrations observed in 2008 and 2009 by a factor of 2. For the 10 % highest PM 10 concentrations, MINNI has the lowest underestimations for these two campaigns, whereas EMEP behaves rather well for the 2006 campaign regarding the bias and the correlation. As shown in Bessagnet et al. (2014) the large underestimation in 2009 is driven by the underestimation of organic species.
The observed diurnal cycles of PM 10 are very flat for all campaigns with a small peak in the evening (Fig. 13). The Atmos. Chem. Phys., 16, 12667-12701, 2016 www.atmos-chem-phys.net/16/12667/2016/ Figure 13. Mean diurnal cycles of PM 10 for all campaigns simulated by the models compared with observations. Averaged concentrations are provided on the right side of the charts.
systematic underestimation of PM 10 can be clearly observed but the shape of the cycle is not very well captured, the evening peak is not reproduced. The models simulate low concentrations in the afternoon mainly driven by the elevation of the PBL. For the 2009 campaign, MINNI reproduces the diurnal cycle very well until 16:00 GMT. As shown in Fig. 14, dust concentrations are higher for MINNI in the centre of the domain. MINNI uses a parameterization for wind-blown dust very productively over any land cover types (Vautard et al., 2005). EMEP mainly produces dust by traffic resuspension and a little over arable land. This higher production of dust by MINNI in Europe certainly improves the negative bias for PM usually observed in chemistry transport models, particularly in the afternoon when the wind speed is higher and the soil moisture content lower.
Most of the underestimation of PM 10 by the models is driven by daytime PM 10 concentrations that were too low. It is noteworthy that MINNI calculates the lowest PBL that could explain its relatively higher PM 10 concentrations. For the summer campaign in 2006, the PM 10 observations show an increase of concentrations in the afternoon while all other models tend to predict a decrease, indicating that all models are too sensitive to dynamical processes (meteorology) and not sufficiently sensitive to the chemical formation.

PM 2.5
Performances on PM 2.5 concentrations are rather different compared to PM 10 (Fig. 4). MINNI generally shows a slight positive bias while all models underestimate the averaged concentrations, with CMAQ showing the highest negative bias. The performance of CHIMERE on the correlation is very good for all campaigns, with its RMSE being the lowest for three campaigns. As for PM 10 , the ensemble captures the spatial patterns of PM 2.5 rather well. The concentrations in the south of Europe (Fig. 15 and Supplement S6) are not specifically underestimated except in Cyprus where dust events also contribute to increase the PM 2.5 concentrations. For all campaigns, the coefficient of variation for PM 2.5 is the lowest in Spain but the RMSE of the ensemble is not particularly low in this region. The coefficient of variation is generally high over the northeast part of the domain. For all campaigns, the models simulate a hot spot over the north of Italy. As shown in the Supplement S6, CMAQ captures the PM 2.5 concentrations in Ispra (IT04) for 2007 and 2008 campaigns better than the other models. This station, located at the border of the Po Valley hot spot, is usually underestimated by the models due to the very stably stratified meteorological conditions in this region. The spatial correlations are www.atmos-chem-phys.net/16/12667/2016/ Atmos. Chem. Phys., 16, 12667-12701, 2016 usually better for PM 2.5 for all models except for the summer campaign (Table 5).
As for the PM 10 concentrations, the diurnal cycle of PM 2.5 is rather flat with very small morning and evening peaks (Fig. 16). The models have a different behaviour; they simulate a sharp decrease of concentrations in the afternoon consistent with PM 10 diurnal cycles. This confirms the lack of secondary production during daytime. The chemical schemes for the production of organic matter are still incomplete for one main reason. As suggested by Jathar et al. (2014) a large part of the unspeciated fraction of organic species reacts and produces secondary organic matter and gasoline vehicles could be an important contributor, as well as wood-burning emissions, according to Denier van der . This unspeciated fraction is not included in our emission inventory, explaining a part of the negative bias of models observed either in winter or summer campaigns, particularly during the afternoon. This suggests that models with negative biases for PM 2.5 concentrations are consistent with the level of the completeness of our inventory and the state-ofthe-art knowledge on SOA modelling.
7 Impact of meteorology on pollutant concentrations 7.1 Impact of the PBL parameterization with MINNI results for the 2009 campaign As shown in the previous section, MINNI underestimates the PBL heights calculated at 12:00 GMT from measurements but it is in better agreement with hourly data available at SIRTA (Fig. 2). In order to test the effect of PBL heights on air quality predictions, the MINNI model has been run using the PBL from IFS instead of its own parameterization for PBL heights. As shown by Curci et al. (2015), processes in the PBL can greatly affect the PM 2.5 ground concentrations; for instance, temperature and relative humidity can favour the production of ammonium nitrate in the upper PBL. Figure 17 shows the average PBL heights and the average concentrations of O 3 , NO 2 and PM 10 using MINNI's parameterizations (left graphs) and the percentage difference between the average concentrations calculated with PBL heights given by IFS (PBL IFS ) and by MINNI's parameterizations (PBL MINNI ) (right graphs) using the following formula: (PBL IFS − PBL MINNI )/ PBL MINNI .
It can be seen that over the seas, on average, PBL heights calculated with MINNI's parameterizations (PBL MINNI ) are lower than PBL heights given by IFS (PBL IFS ) but over land PBL MINNI is higher than PBL IFS in coastal areas, north Atmos. Chem. Phys., 16, 12667-12701, 2016 www.atmos-chem-phys.net/16/12667/2016/  Africa, Scandinavian mountains and the middle of the Russian plains, and lower over the rest. Over the sea, PBL IFS are higher than PBL MINNI more than 50 % while over the land the differences are between −30 and +30 %. Figure 17 also shows that the O 3 concentrations increase in correspondence with the increase of PBL heights up to 10 % and more, and decrease where the PBL heights decrease. This behaviour is explained by the fact that with a higher PBL more O 3 is entrained from high altitudes where O 3 concentrations are higher than at surface. Since the NO 2 sources are mainly at surface, the NO 2 concentrations generally decrease with the increase of PBL heights and increase with the decrease of PBL heights as a consequence of more or less effective dilution, respectively. Over most of Europe, the NO 2 concentrations decrease up to 8 % when PBL IFS heights are used. The PM 10 concentrations respond to PBL height variation in the same way as NO 2 . The use of PBL IFS heights produces a 4 % decrease of PM 10 concentrations in most parts of Europe but an increase of 6-8 % in coastal areas and Russian plains.
In terms of statistics, the use of the PBL from IFS in MINNI slightly improves the correlations mainly driven by an improvement of time correlations. PM 10 , PM 2.5 and NO 2 concentrations are decreased by less than 0.5 µg m −3 , improving all error statistics reported in Fig. 4f. An increase of 2.75 µg m −3 is observed for O 3 concentrations. It is also worth mentioning that the variations in pollutant concentrations are small (over the land below 10 % generally) in comparison to the variations of PBL height; therefore, other factors such as emissions spatial distribution, meteorology (e.g. advection and vertical dispersion, especially in low-wind areas), gas phase chemistry, aerosol physics and chemistry have to be investigated for improving model performances.
These results clearly show the importance of having good estimates of PBL heights but they also demonstrate that more investigations are necessary in order to identify the best parameterization of PBL heights as well as vertical diffusivities and vertical advection schemes which improve the simulated concentrations over all of Europe.

Influence of meteorology on NO 2 concentrations with CAMx results
Pollutant concentrations are strongly influenced by the reconstruction of meteorological fields. In this section, a comparison of model performances in reproducing wind speed and NO 2 concentrations is presented and discussed. Further- Figure 16. Mean diurnal cycles of PM 2.5 for all campaigns simulated by the models compared with observations. Averaged concentrations are provided on the right side of the charts. more, PBL height data, collected at SIRTA site (Paris) have been used, too. Being mainly related to emission processes, NO 2 has been selected as a tracer of the influence of dispersion on pollutant concentrations. The analysis has been performed over the Paris area since the hourly variation of the PBL has been available. Two other limited areas, namely all of Germany (DE) and the Po Valley (POV) have been selected to complement the analysis. The NO 2 observed dataset has been set up from AirBase database (Mol and de Leeuw, 2005), selecting just background stations having more than 75 % valid data over the whole year 2009. Modelled concentrations have been derived from the CAMx simulation results, while modelled meteorological fields have been derived from IFS.
In the case of the Paris area, the meteorological model showed a very good performance in reproducing the observed wind speed, whose temporal evolution clearly influences the corresponding temporal variability of NO 2 concentrations (Fig. 18). Also, the PBL height is reproduced quite well by the model, though the model tends to underestimate the nighttime minima and, conversely, overestimate some diurnal peaks.
Within the Paris area, NO 2 observations are quite well reproduced by CAMx, showing a low bias of the median value lower than 2 ppb, corresponding to less than 20 % of the observed median concentration (Fig. 18). The availability of both wind speed and PBL height observations allow the influence of both processes to be clearly detected. For example, on 3-4, 10 and 25 March, the underestimation showed by CAMx seems well related to a corresponding overestimation of the PBL rather than the wind speed (Fig. 19). Conversely during the nighttime hours of 5 March, CAMx results were more influenced by the wind speed.
The analysis has been completed comparing the diurnal cycle of both NO 2 and meteorological variables, reported in Figs. 20 and 21. At German sites, NO 2 concentrations are slightly overestimated during nighttime and underestimated during daytime. This behaviour does not seem strictly related to wind speed, particularly during nighttime, thus being probably more related to vertical turbulence. At Po Valley sites, NO 2 values are systematically underestimated, while wind speed is correctly reproduced, even partially underestimated during daytime hours. NO 2 modelled concentrations show a clear low bias during nighttime, probably related to an imprecise reconstruction of the strong stable conditions that characterize this area during the cold season. The model discrepancies are enhanced during the morning hours, when the model is not able to capture the magnitude of the observed   peak. The discrepancy is probably caused by a growth of the PBL that was too rapid during the initial daytime hours. Late in the afternoon, the NO 2 bias tends to decrease, probably thanks to a very quick collapse of PBL height after sunset. At Paris sites, NO 2 modelled concentrations show a behaviour similar to the Po Valley area. The availability of both wind speed and PBL height observations, allows most of the previous comments to be confirmed. Particularly, it is worth noting that at SIRTA site, PBL height shows an increase during morning hours that is too rapid, followed by a decrease just after sunset that is too strong. However, underestimation of NO x emissions cannot be ruled out as depicted in Vaughan et al. (2016) or Chen and Borken-Kleefeld (2016); these works highlight the potential underestimation of NO x traffic emissions.

Discussion
The results from a mathematical model depend on three main factors: the model formulation (in terms of its assumptions, sub-models, numerical methods and their implementation in computer code); the model input data including boundary conditions; the skill of the model user particularly with respect to use of default values for certain inputs and parameters. When comparing results from a modelling exercise, the performance, assessed from comparison between modelling results and data, is influenced by all three of these factors. It is therefore difficult to make judgments on the performance of a model without understanding the importance of configuration and use. In this model intercomparison exercise, we have tried to achieve a greater focus on the effects of model formulation by standardizing the model input data as far as possible and running models for specific time periods having different meteorology and season (emissions and meteorology) to test responses over a range of input data. The comparison of results with observations has also been done in a standard way. CMAQ shows the largest RMSEs between predicted and observed values for NO 2 over all campaigns; LOTOS-EUROS shows the lowest RMSEs for SO 2 over all campaigns; CAMx always exhibits the highest RMSEs for SO 2 over all periods. This means that in several cases either the model formulation or the input setup influence the model per-formances more than specific features of the meteorological season.
For all pollutants and campaigns, there is not a strong correlation between the performances of the ensemble (through the RMSEs of the difference between predictions of the ensemble and observed values) with the variability of models (through the coefficient of variation between individual model predictions and the ensemble predictions). This means if models are close to each other (low coefficient of variation), the mean of models can be far or close to the observed values; there are no specific rules. However, for SO 2 and PM 2.5 a correlation of −0.2 to −0.3 is observed for three campaigns meaning that a large variability tends to improve the performance of the ensemble for these compounds. For the other compounds, O 3 , NO 2 and PM 10 , the correlation is close to zero. The coefficient of variation is the lowest for ozone (below 10 %) particularly in the afternoon hours (see Supplement S7) and for the summer period 2006, while for SO 2 this coefficient is the highest generally between 30 and 40 %. For PM, this coefficient is about 10 to 20 % over several countries; the coefficient of variation is higher in the afternoon, highlighting the difference between chemical schemes for the aerosol chemistry that is more active during daytime. Conversely, the low coefficient of variability for O 3 Atmos. Chem. Phys., 16,2016 www.atmos-chem-phys.net/16/12667/2016/ confirms a coherence of ozone chemistry scheme between models. The intercomparison proved that CTMs are able to reproduce ozone concentrations, showing an average RMSE of individual models corresponding to 30 % of the mean observed concentration for daily values. Modelled daily cycles are generally more spread during nighttime than daytime hours. This means that even though most models shared the same meteorology, including PBL height, they proved to be very sensitive to vertical dispersion and deposition parameterization, the two key processes governing O 3 concentration during nighttime. During daytime, modelled concentrations are more similar; they show a different ranking with respect to night hours. This means, as expected during daytime, vertical mixing reconstruction is more similar among models, and chemical schemes exhibit a different efficiency in ozone production. This behaviour is not detectable in 2007, which was a cold and windy period, hampering the development of photochemical processes.
NO 2 performances are less robust than for O 3 . The RMSE represents about 70 % of the observed mean concentration, but the value is even higher in case of CMAQ. Bias is negative for most models, except CMAQ, adopting a different meteorology and MINNI, characterized by lower PBL heights. CHIMERE biases are closer to 0 than other models sharing the same meteorology, such as CAMx.
As for ozone, most of the discrepancies among models and with respect to observations take place during nighttime, when the atmosphere is more stable. As most models share the same wind fields, the modelled spread in nighttime concentrations can be related to vertical dispersion. Such spread for primary species and particularly for CO can be considered a measure of the uncertainty related to vertical mixing and qualitatively corresponds to 80-100 % of the observed mean concentration. The height of the first level is also very important for the mixing and deposition processes, it ranges from 20 m for CAMx and CHIMERE to 90 m for EMEP. To be more representative of surface concentrations a correction is implemented for models having a coarse first surface layer (LOTOS-EUROS and EMEP). Daytime modelled concentrations are more similar among models and generally underestimated, though the modelled PBL field at noon seemed lower than the observed one. As already mentioned, such a systematic discrepancy could be related to a measurement artefact, but also to photochemistry that could give rise to an excess of nitric acid. More accurate observations of nitric acid and nitrate would be required. SO 2 concentrations show the worst performance, with RMSE values corresponding to 130-160 % of the observed mean concentrations. The highest RMSEs are shown by CAMx, MINNI and RCG. CAMx. It is worth noting that the modelled diurnal cycles show a weak morning peak, more typical of surface sources not observed in observations. Conversely, measured data present a diurnal peak, usually related to enhancing downward mixing of aloft sources, where most SO 2 is emitted. Discrepancies among models and with respect to observations can also be due to chemistry. For example, in 2009, Bessagnet et al. (2014) reported for CHIMERE an underestimation of SO 2 concentrations on an hourly basis, while sulfate was overestimated; conversely RCG, adopting a more simplified approach for sulfur chemistry than CHIMERE, overestimated SO 2 , while it underestimated sulfates.
PM 10 model performances are less homogenous within the 4 years than other pollutants. The campaigns in 2006 and 2007 that were characterized by a more dispersive atmosphere show a mean RMSE around 10 µg m −3 , representing 55-65 % of the mean observed concentration. Differently, the RMSE rises up to 15 µg m −3 for 2008 and 2009 campaigns representing more than 80 % of the observed mean. The bias is better reproduced by EMEP and MINNI, while CAMx and CMAQ show the strongest underestimation. The analysis of each PM compound for the 2009 period (Bessagnet et al., 2014) revealed that MINNI and EMEP were characterized by rather different scores, suggesting that their overall performance is influenced in a different way by both chemistry and meteorology. Particularly MINNI performance seems more driven by a reduced dispersion often giving rise to higher concentrations than other models, while EMEP seems more able to capture the evolution of the single PM compound as shown in Bessagnet et al. (2014). CAMx and CMAQ often show the strongest negative bias. As for CAMx, this result is probably driven by the combined effect of meteorology (also NO 2 is underestimated by CAMx) as well as the absence of some key processes such as sea salt and dust resuspension and a PM coarse chemistry. Differently, the CMAQ model was characterized by very high NO 2 concentrations stressing a less dispersive capability than other models. As for CMAQ, the low PM 10 values are probably related to deposition processes. Indeed, for the 2009 episode (Bessagnet et al., 2014) CMAQ proved to be more efficient than the other models for dry deposition of both NO x and SO x compounds.
The observed diurnal cycles of PM 10 are very flat for all campaigns with a small peak in the evening. The PM 10 observations show an increase of concentrations in the afternoon while all models predict a decrease, indicating that all models are too sensitive to dynamical process (meteorology) and not sufficiently sensitive to the chemical formation. The analysis of individual compounds of PM will bring more details, it will be investigated in a companion paper.
Model performance for PM 2.5 is on average slightly better than PM 10 , both in terms of bias and correlation. PM 2.5 concentration is less affected by natural processes, which are more relevant for coarse PM; therefore, the obtained results suggest that modelling natural processes still present some relevant weaknesses (Bessagnet et al., 2014). Modelled diurnal cycles show improved performance in terms of bias, but not with respect to the daily evolution. Firstly, this result confirms that there are processes mainly affecting the coarse fraction that are still missing in some state-of-the-art CTMs, highlighted by the different biases between PM 10 and PM 2.5 . Secondly, the differences in the daily pattern, particularly evident in 2006 where photochemistry is at its maximum, confirm that dilution processes during daytime hours are too efficient with respect to chemical processes, thus preventing the increase of modelled concentrations during afternoon hours.
Even if the meteorology was prescribed in the exercise, some variables related to dispersion modelling such as the vertical diffusion and the PBL height are often diagnosed in the model preprocessing. This step involves important differences in the dispersion as was shown for a tracer species like CO. Although most models used the same PBL from IFS (CHIMERE, CAMx, LOTOS-EUROS, RCG), the variability of model PBL (including other PBL parameterization as used in EMEP, CMAQ and MINNI) shows important differences of PBL calculations over the ocean and the Mediterranean Sea. IFS wind speeds are overestimated with a bias reaching 1 m s −1 , which can have a dramatic effect at low wind speed conditions.
The comparison of the meteorological fields pointed out that the reconstruction of the meteorological variables is still affected by relevant uncertainties. Wind speed simulated by IFS and COSMO showed a systematic difference along the whole day, with IFS providing an average wind speed that in 2007 and 2009 was 12 % higher than COSMO. PBL reconstruction showed an even higher variability with a spread among the models corresponding to 27-29 % of the mean midday PBL value of each campaign.
A comparison of modelling performances in reproducing wind speed and NO 2 concentrations was performed too, also including some analysis of the influence of PBL height estimation. The comparison of modelled concentrations against wind speed and PBL heights confirmed that meteorology strongly influences CTM performance. Particularly the temporal evolution of wind speed is the most responsible for model skillfulness in reproducing the daily variability of pollutant concentrations (e.g. the development of peak episodes), while the reconstruction of the PBL diurnal cycle seems more influential in driving the corresponding pollutant diurnal cycle and hence the presence of systematic positive and negative biases detectable on a daily basis.

Conclusions
One of the main outcomes of such a multi-seasonal intercomparison is that in most cases model performances are more influenced by the model setup than the season. Another general outcome stemming from the whole exercise is that model performances are more different from one pollutant to another than for the same pollutant within the different seasons. This confirms once again that on average and for the limited dataset used in this exercise, the model formulation (parameterization of chemical/physical processes, calculation of meteorological diagnosed variables) and setup (number of verti-cal levels, value of key parameters, etc.) are more influencing than raw meteorological conditions on model performance. One of the few exceptions is shown by O 3 in 2009 where model results were characterized by RMSE values very similar to the other years, whereas bias was negative instead of positive as in the three previous years. But, as already pointed out, such a result was mainly driven by a relevant underestimation in the ozone boundary concentrations from MACC.
Even if the meteorology was prescribed, some variables like the PBL height and the vertical diffusion coefficient are diagnosed in the model preprocessors and explain the spread of model results. For ozone, this study shows the importance of boundary conditions on model calculations and then on the regime of the gas and particle chemistry. The worst performances are observed for sulfur dioxide concentrations that are poorly captured by the models. The performances of models are rather good and similar for the nitrogen dioxide. On average, the models provide a rather good picture of the particulate matter concentrations over Europe even if the highest concentrations are underestimated. For the PM, the mean diurnal cycles show a general tendency to overestimate the effect of the PBL height rise while the afternoon chemistry (formation of secondary species) is certainly underestimated. PM observations show very flat diurnal profiles for all seasons. In general, the daytime PBL height is underestimated by all models, the largest variability of predicted PBL is observed over the ocean and seas. The temporal evolution of wind speed is the most responsible for model skillfulness in reproducing the daily variability of pollutant concentrations (e.g. the development of peak episodes), while the reconstruction of the PBL diurnal cycle seems more influential in driving the corresponding pollutant diurnal cycle and hence the presence of systematic positive and negative biases detectable on daily basis.
The study stresses the importance of emission sources particularly in wintertime. Wood-burning emissions are likely the most underestimated source, through the missing species called semi-volatile organic compounds. Road traffic emissions could also be underestimated; gasoline and diesel vehicles are both concerning, and more generally all activity sectors involving combustion processes can be of concern. In this study, the importance of meteorological data is highlighted, the difficulties for meteorological models to simulate meteorological variables like wind speed and PBL height during stable conditions can lead to dramatic consequences for air quality modelling. Developments in air quality modelling have to not only focus on processes but also on emissions and meteorological input data. To complement the analysis, companion papers will focus on depositions of sulfur/nitrogen compounds and on the behaviour of models for particulate matter species. This ensemble of analyses will help to prioritize the improvement of air quality models used in the frame of the CLRTAP.
The Supplement related to this article is available online at doi:10.5194/acp-16-12667-2016-supplement.
Acknowledgements. The EBAS database has largely been funded by the CLRTAP-EMEP program, AMAP and by NILU internal resources. Specific developments have been possible due to projects like EUSAAR (EBAS web interface), EBAS-Online (upgrading of the database platform) and HTAP (import and export routines to build a secondary repository for support of http://www.htap.org. A large number of specific projects have supported development of data and metadata reporting schemes in dialog with data providers (CREATE, ACTRIS and others). For a complete list of programs and projects for which EBAS serves as a database, please consult the information box in the framework filter of the web interface. These are all highly acknowledged for their support, particularly Anne-G.