The POLARCAT Model Intercomparison Project (POLMIP): overview and evaluation with observations

L. K. Emmons1, S. R. Arnold2, S. A. Monks2, V. Huijnen3, S. Tilmes1, K. S. Law4, J. L. Thomas4, J.-C. Raut4, I. Bouarar4,*, S. Turquety5, Y. Long5, B. Duncan6, S. Steenrod6, S. Strode6,**, J. Flemming7, J. Mao8, J. Langner9, A. M. Thompson6, D. Tarasick10, E. C. Apel1, D. R. Blake11, R. C. Cohen12, J. Dibb13, G. S. Diskin14, A. Fried15, S. R. Hall1, L. G. Huey16, A. J. Weinheimer1, A. Wisthaler17, T. Mikoviny17, J. Nowak18,***, J. Peischl18, J. M. Roberts18, T. Ryerson18, C. Warneke18, and D. Helmig19


Introduction
Observations show that the Arctic has warmed much more rapidly over the past few decades than the global mean. Arctic temperatures are affected by increased heat transport from lower latitudes and by the local in situ response to radiative forcing due to changes in greenhouse gases and aerosols (Shindell, 2007). Model calculations suggest that in addition to warming induced by increases in global atmospheric CO2 concentrations, changes in short-lived climate pollutants (SLCPs), such as tropospheric ozone and aerosols in the Northern Hemisphere (NH), have contributed substantially to this Arctic warming since 1890 (Shindell and Faluvegi, 2009). This contribution from SLCPs to Arctic heating and efficient local amplification mechanisms (e.g., ice-albedo feedback) put a high priority on understanding the sources and sinks of SLCPs at high latitudes and their climatic effects. Despite the remoteness of the Arctic region, anthropogenic sources in Europe, North America and Asia have been shown to contribute substantially to Arctic tropospheric burdens of SLCPs (e.g., Fisher et al., 2010; Sharma et al., 2013; Monks et al., 2014; Law et al., 2014). The Arctic troposphere is more polluted in winter and spring as a result of long-range transport from northern mid-latitude continents and the lack of efficient photochemical activity or wet scavenging needed to cleanse the atmosphere (Barrie, 1986).
Large forest fires in boreal Eurasia and North America also impact the Arctic in the spring and summer seasons. Our understanding of contributions from SLCP sources to present-day Arctic heating is sensitive to the ability of models to simulate the transport and processing of SLCPs en route to the Arctic from lower latitude sources. This model skill has implications for our confidence in predictions of Arctic climate response to future changes in mid-latitude anthropogenic and wildfire emissions.
Comparisons of model results to long-term surface observations have shown that global chemical transport models (CTMs) have notable limitations in accurately simulating Arctic tropospheric composition, and that the models differ significantly from one another (e.g., Shindell et al., 2008). Transport of emissions from lower latitudes to the Arctic is mainly facilitated by rapid poleward transport in warm conveyor belt airstreams associated with frontal systems of mid-latitude cyclones (Stohl, 2006). The result is that although the Arctic is remote from source regions, Arctic enhancements in trace gas and aerosol pollution are far from homogeneous. They are instead characterized by episodic import of pollution-enhanced air masses, exported from the mid-latitude boundary layer by large-scale advection in frontal systems (Schmale et al., 2011; Quennehen et al., 2012). Polluted air uplifted from warmer, southerly latitudes (Asia and North America) tends to enter the Arctic at higher altitude, while air near the surface is influenced mainly by low-level flow from colder, more northerly source regions, particularly Europe (e.g., Klonecki et al., 2003; Stohl, 2006; Helmig et al., 2007b; Tilmes et al., 2011; Wespes et al., 2012). The stratification of the large-scale advection and slow vertical mixing lead to fine-scale layering and filamentary air mass structure through the Arctic troposphere, where air masses from different source origins produce distinct layers, and are stirred together only in close proximity, while mainly retaining their own chemical signatures (Engvall et al., 2008; Schmale et al., 2011). Air masses are eventually homogenized by turbulent mixing and radiative cooling, but usually on timescales longer than the rapid advection along isentropic surfaces that is characteristic of intercontinental transport in these systems (Methven et al., 2003, 2006; Stohl, 2006; Arnold et al., 2007; Real et al., 2007).
For global Eulerian models, representing this fine-scale structure, subsequent slow mixing and chemical processing is challenging, particularly with characteristic coarse grid sizes of 100 km or more. (IPY 2008) as part of the international POLARCAT (Polar Study using Aircraft, Remote Sensing, Surface Measurements and Models, of Climate, Chemistry, Aerosols and Transport) activity. Numerous papers have been written on these observations and corresponding model simulations (many are in a special issue of Atmospheric Chemistry and Physics; http://www.atmos-chem-phys.net/special_issue182.html).
The POLARCAT Model Intercomparison Project (POLMIP) was organized with the goal of exploiting this large data set to comprehensively evaluate several global chemistry models and to better understand the causes of model deficiencies in the Arctic. While aerosols are an important component of the Arctic atmospheric composition, POLMIP focuses on gas-phase chemistry, primarily carbon monoxide (CO), reactive nitrogen (NOy), ozone (O3) and their precursors. This paper provides an overview of the POLMIP models and their evaluation against observations, as well as an evaluation of the emissions inventories used by the models. Two additional papers present more detailed analyses (Monks et al., 2014; Arnold et al., 2014). Monks et al. (2014) comprehensively evaluate the model CO and O3 distributions with surface, aircraft and satellite observations and compare the effects of chemistry and transport using synthetic tracers. In general, the models are found to underestimate both CO and O3, while the modeled global mean OH amounts are slightly higher than estimates constrained by methyl chloroform observations and its emission estimates, suggesting the model errors are not entirely due to low emissions. The comparison of fixed-lifetime tracers to idealized OH-loss CO-like species shows that the differences in OH concentrations among models have a greater impact on CO than differences in transport in the models. The tracer analysis also shows a very strong influence of fire emissions on the atmospheric composition of the Arctic. Ozone production in air influenced by biomass burning is evaluated by Arnold et al. (2014). Using tracers of anthropogenic and fire emissions, fire-dominated air was found to have enhanced ozone in the POLMIP models, with the enhancement increasing with air mass age. Differences in NOy partitioning are seen among models, likely due to model differences in efficiency of vertical transport as well as hydrocarbon oxidation schemes.
The next section of this paper gives an overview of the POLARCAT aircraft campaigns that prompted this intercomparison (Sect. 2). Section 3 presents a summary of the models that participated, along with a description of the model experiment design and the emissions used. Comparisons of the model results with observations are shown in Sect. 4, including vertical profiles of ozone from sondes, surface layer non-methane hydrocarbons (NMHCs), satellite observations of NO2, and the numerous compounds measured from research aircraft. Finally, an evaluation of fire emission factors is shown in Sect. 5, using aircraft observations of fire-influenced air masses.

POLARCAT observations
POLARCAT is a consortium of tropospheric chemistry experiments performed during the IPY 2008. A wealth of data on tropospheric ozone and its photochemical precursors were obtained through the depth of the Arctic troposphere during spring and summer. These observations provide an opportunity to evaluate model representations of processes controlling tropospheric ozone in imported pollutant layers above the surface. The NASA Arctic Research of the Composition of the Troposphere from Aircraft and Satellites (ARCTAS) mission was grouped into three parts: ARCTAS-A, ARCTAS-B and ARCTAS-CARB. Three research aircraft took part in this campaign, with a slightly different goal for each mission. ARCTAS-A and ARCTAS-B targeted mid-latitude pollution layers and wildfire plumes, respectively, transported to the Arctic. ARCTAS-CARB was focused on California air quality, targeting fresh fire plumes in northern California as well as various anthropogenic sources (Huang et al., 2010; Pfister et al., 2011).
The NOAA Aerosol, Radiation, and Cloud Processes affecting Arctic Climate (ARCPAC) mission was conducted in spring between the end of March and 21 April using the NOAA P3 aircraft (Brock et al., 2011). It was designed to understand the radiative impacts of anthropogenic pollution and biomass burning. The campaign was based in Fairbanks, Alaska, and frequently targeted wildfire plumes that were transported from Siberia (e.g., Warneke et al., 2009).
POLARCAT-France, using the French ATR-42 aircraft, was based in Kiruna, Sweden, and took place between 30 March and 11 April (de Villiers et al., 2010; Merlaud et al., 2011). The summer mission was based in western Greenland in Kangerlussuaq and took place between the end of June and mid-July (Quennehen et al., 2011).
The POLARCAT-Greenland Aerosol and Chemistry Experiment (GRACE) mission was conducted during the same time (1-17 July), based also at Kangerlussuaq, Greenland, using the DLR Falcon research aircraft (Roiger et al., 2011). Flights covered latitudes from 57 to 81° N and targeted anthropogenic and fire emissions in the troposphere and lower stratosphere.
YAK-AEROSIB (Airborne Extensive Regional Observations in Siberia) was conducted in July 2008 covering parts of northern and central Siberia using an Antonov-30 research aircraft, operated by the Tomsk Institute of Atmospheric Optics (Paris et al., 2008). This campaign was also performed in collaboration with the POLARCAT program (Paris et al., 2009).

Design of model intercomparison
Simulations were run for each model over the same time period, from 1 January 2007 to 31 December 2008. This includes a 1-year spin-up period leading into a full 12-month simulation (January-December 2008) used in the analysis. A single emissions inventory was specified for use by all of the models, as described below. Each global model was run at its standard resolution with its standard chemistry scheme, meteorological forcing and other parameterizations. The requested model output included monthly mean species distributions and diagnostics to allow evaluation of the seasonal cycles of the models using surface and satellite observations. Hourly instantaneous output of a smaller number of species was requested for 30 March-23 April and 18 June-18 July (20-90° N) to allow for comparison to the aircraft observations and NO2 satellite retrievals.

Description of emissions
All modeling groups were asked to use the same emissions inventory. The anthropogenic emissions are from the inventory provided by D. were provided with this inventory, so the emissions for specific hydrocarbons were based on the VOC speciation of the RETRO inventory as in Lamarque et al. (2010). The anthropogenic emissions do not include any seasonal variation. Daily biomass burning emissions are from the Fire INventory of NCAR (FINN), which are based on MODIS fire counts. Other emissions (biogenic, ocean, volcano) were derived from the POET inventory (Granier et al., 2005). A preliminary comparison of the ARCTAS emissions to the MACCity inventory (Granier et al., 2011) showed that the ARCTAS inventory has higher emissions and produced results closer to the observations in a MOZART simulation. Because the models speciate VOCs differently, there are slight differences in emission totals. Details of these different VOC treatments are given with the model descriptions below. Table 1 gives the emission totals for each species provided, by sector, while Table 2 gives totals calculated from the supplied output. Each model determined lightning emissions based on its usual formulation, as described below.

Artificial tracers
One goal of the POLMIP intercomparison was to separately compare tracer transport and chemistry among the models. Synthetic tracers with a fixed lifetime are a valuable tool for comparing transport only. Artificial fixed-lifetime tracers emitted from anthropogenic and wildfire sources of CO were specified for three regions: Europe (30-90° N, 30° W-60° E), Asia (0-90° N, 60-180° E), and North America (25-90° N, 180-30° W). Each model (except GEOS-Chem) included these tracers in its simulation with a set lifetime of 25 days. Figure 1 shows the anthropogenic CO emissions, with fire emissions overlaid, for April and July monthly averages. The highest fire emissions are generally far removed from the anthropogenic emissions (e.g., northern Canada and Siberia) within each region. Since anthropogenic and biomass burning emissions have different relative amounts of CO, NOx and VOCs, the offset in location of the two source types leads to significant differences in atmospheric composition within these regions. These differences have particular relevance in the analyses of Monks et al. (2014) and Arnold et al. (2014) that use these tracers. Figure 1 also shows the daily variation of the fire emissions averaged over each tracer region. Asia had high fire emissions from March through July, but at different locations through that period (e.g., farther north in July than April). Biomass burning in eastern Europe began in April, with stronger fires in August. The North America fire emissions were significantly less on average, but were locally important in California and Saskatchewan in June and July.
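The tracer setup described above can be sketched in a few lines. This is only an illustration, not code from any POLMIP model: the names `tracer_region` and `decay_fraction` are hypothetical, longitudes are assumed to lie in [-180, 180) with west negative, longitude intervals are treated as half-open so that each point falls in at most one box (a choice the text does not specify), and the 25-day lifetime is assumed to be an e-folding time for first-order decay.

```python
import math

# Region boxes as (lat_min, lat_max, lon_min, lon_max) in degrees, taken from
# the text. Longitudes use the assumed convention west < 0, range [-180, 180).
REGIONS = {
    "Europe": (30.0, 90.0, -30.0, 60.0),
    "Asia": (0.0, 90.0, 60.0, 180.0),
    "North America": (25.0, 90.0, -180.0, -30.0),
}

LIFETIME_DAYS = 25.0  # fixed tracer lifetime, assumed to be an e-folding time


def tracer_region(lat, lon):
    """Return the name of the tracer region containing (lat, lon), or None."""
    for name, (lat0, lat1, lon0, lon1) in REGIONS.items():
        # Half-open in longitude so shared box edges are not double-counted.
        if lat0 <= lat <= lat1 and lon0 <= lon < lon1:
            return name
    return None


def decay_fraction(age_days, lifetime_days=LIFETIME_DAYS):
    """Fraction of emitted tracer remaining after age_days of first-order decay."""
    return math.exp(-age_days / lifetime_days)
```

Under these assumptions, a tracer parcel that is 25 days old retains a fraction exp(-1) of its emitted amount, regardless of the chemistry scheme of the host model, which is what makes such tracers useful for isolating transport.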

Description of POLMIP models
Nine global and two regional models participated in POLMIP. The resolution and origin of meteorological analyses of each model are given in Table 3. Additional details are given below.

CAM-chem
The Community Atmosphere Model with Chemistry (CAM-chem) is a component of the NCAR Community Earth System Model (CESM). Two versions were used in POLMIP, based on versions 4 and 5 of CAM. The CAM4-chem results shown here are slightly updated from those described in Lamarque et al. (2012), while CAM5-chem includes expanded microphysics and modal aerosols (Liu et al., 2012). Both versions of CAM-chem use the MOZART-4 tropospheric chemistry scheme (see MOZART description below), along with stratospheric chemistry, and are evaluated in . For POLMIP, CAM-chem was run in the "specified dynamics" mode, where the meteorology (temperature, winds, surface heat and water fluxes) is nudged to meteorological fields from GEOS-5, using the lowest 56 levels. Lightning NO emissions are determined according to the cloud height parameterization of Price and Rind (1992) and Price et al. (1997). The vertical distribution follows DeCaria et al. (2005) and the strengths of intra-cloud and cloud-ground strikes are assumed equal, as recommended by Ridley et al. (2005).

C-IFS (Composition-IFS)
The Integrated Forecasting System (IFS) of the European Centre for Medium-Range Weather Forecasts (ECMWF) has been extended for the simulation of atmospheric composition in recent years. For the POLMIP runs, the CB05 chemical scheme as implemented in the TM5 chemical transport model (Huijnen et al., 2010) has been used (Flemming, 2015). C-IFS uses a semi-Lagrangian advection scheme. The POLMIP runs are a sequence of 24 h forecasts, initialized with the operational meteorological analysis. Lightning emissions in C-IFS are based on the model convective precipitation (Meijer et al., 2001) and use the C-shaped profile suggested by Pickering et al. (1998). They follow the same implementation as TM5, except that the lightning emissions are scaled to give a global annual total of 4.9 Tg N yr^-1.
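Scaling emissions to a prescribed global annual total, as mentioned for the lightning source, can be illustrated with a short sketch. It assumes the scaling is a single multiplicative factor applied uniformly to all grid cells, which the text does not state; the function name `scale_to_global_total` is hypothetical and the input is a plain list of per-cell annual emissions in Tg N yr^-1.

```python
TARGET_TG_N_PER_YR = 4.9  # global annual lightning total quoted in the text


def scale_to_global_total(cell_emissions, target=TARGET_TG_N_PER_YR):
    """Rescale per-cell annual emissions (Tg N/yr) so that they sum to target.

    Assumes one uniform factor for all cells, so the spatial pattern of the
    unscaled emissions is preserved.
    """
    total = sum(cell_emissions)
    if total <= 0.0:
        raise ValueError("unscaled emissions must have a positive total")
    factor = target / total
    return [e * factor for e in cell_emissions]
```

A uniform factor changes only the magnitude of the source; the relative distribution produced by the underlying parameterization is left untouched.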

GEOS-Chem
GEOS-Chem is a global chemical transport model driven by assimilated meteorological observations from the Goddard Earth Observing System (GEOS-5) of the NASA Global Modeling and Assimilation Office (GMAO) (Bey et al., 2001). GEOS-Chem version 9-01-03 (http://www.geos-chem.org) was used for this study. The standard GEOS-Chem simulation of ozone-NOx-HOx-VOC chemistry is described by Mao et al. (2010), with more recent implementation of bromine chemistry (Parrella et al., 2012). The chemical mechanism includes updated recommendations from the Jet Propulsion Laboratory (Sander et al., 2011) and the International Union of Pure and Applied Chemistry (http://www.iupac-kinetic.ch.cam.ac.uk). In addition, this simulation includes HO2 aerosol reactive uptake with a coefficient of γ(HO2) = 1 producing H2O as suggested by Mao et al. (2013). Lightning NO emissions are computed with the algorithm of Price and Rind (1992).

GMI
GMI (Global Modeling Initiative; http://gmi.gsfc.nasa.gov/) is a NASA offline global CTM, with a comprehensive representation of tropospheric and stratospheric chemistry (Strahan et al., 2007). The simulations for POLMIP were driven by MERRA meteorology. The GMI chemical mechanism treats explicitly the lower hydrocarbons (ethane, propane, isoprene) and has two lumped species for larger alkanes and alkenes following Bey et al. (2001). Several oxygenated hydrocarbons (e.g., formaldehyde, acetaldehyde) are simulated, including direct emissions and chemical production; acetone is specified from a fixed field. The mechanism includes 131 species and over 400 chemical reactions. Flash rates are parameterized in terms of upper tropospheric convective mass flux but scaled so that the seasonally averaged flash rate in each grid box matches the v2.2 OTD/LIS climatology.

LMDz-INCA
The LMDz-OR-INCA model consists of the coupling of three individual models. The Interaction between Chemistry and Aerosol (INCA) model is coupled online to the Laboratoire de Météorologie Dynamique (LMDz) general circulation model (GCM) (Hourdin et al., 2006). LMDz as used for the POLMIP exercise is coupled with the ORCHIDEE (Organizing Carbon and Hydrology in Dynamic Ecosystems) dynamic global vegetation model for soil-atmosphere exchanges of water and energy (Krinner et al., 2005), but not for biogenic CO2 or VOC fluxes. INCA is used to simulate the distribution of aerosols and gaseous reactive species in the troposphere. The oxidation scheme was initially described in Hauglustaine et al. (2004), including inorganic and non-methane hydrocarbon chemistry. INCA includes 85 tracers and 264 gas-phase reactions. For aerosols, the INCA model simulates the distribution of anthropogenic aerosols such as sulfate, black carbon and particulate organic matter, as well as natural aerosols such as sea salt and dust. LMDz-OR-INCA is forced with horizontal winds from 6-hourly ECMWF ERA Interim reanalysis. Lightning NO emissions are computed interactively during the simulations depending on the convective clouds, according to Price and Rind (1992), with a vertical distribution based on Pickering et al. (1998) as described in Jourdain and Hauglustaine (2001). The global annual lightning emissions total is 5 Tg N yr^-1.

MOZART-4
MOZART-4 (Model for Ozone and Related chemical Tracers, version 4) is an offline global CTM, with a comprehensive representation of tropospheric chemistry. While MOZART-4 includes the capability to calculate biogenic isoprene and terpenes using the MEGAN algorithms, the specified monthly mean emissions were used for POLMIP. Simulations were run with both an online photolysis calculation (FTUV) and using a lookup table (LUT), which is the same as that used in CAM-chem. Unless otherwise stated, the results shown here are from the LUT simulation. The MOZART-4 chemical mechanism treats explicitly the lower hydrocarbons (C2H6, C3H8, C2H4, C3H6, C2H2, isoprene) and has four lumped species for larger alkanes, alkenes, aromatics and monoterpenes. A number of oxygenated hydrocarbons (including formaldehyde, acetaldehyde, acetone, methanol, ethanol) are also explicitly treated with direct emissions and chemical production. The mechanism includes 100 species and 200 chemical reactions and the tropospheric gas-phase chemistry is the same as that used in the CAM-chem simulations for this study. Emissions of NO from lightning are parameterized as described above for CAM-chem.

TM5
TM5 (Tracer Model 5) is an offline global chemical transport model (Huijnen et al., 2010), where tropospheric chemistry is described by a modified carbon bond chemistry mechanism (Williams et al., 2013). The TM5 chemical mechanism includes explicit treatment of the lower hydrocarbons (C2H6, C3H8, C3H6) and acetone, while other VOCs are treated in bulk. The mechanism is based on the CB05 scheme with modifications to the ROOH oxidation rate and HO2 production efficiency from the isoprene + OH oxidation reaction (Williams et al., 2012). Photolysis is modeled by the modified band approach (Williams et al., 2012). In total, the TM5 chemical mechanism includes 55 species and 104 chemical reactions. Stratospheric O3 is constrained using ozone columns from the Multi-Sensor Reanalysis (van der A et al., 2010). NOx production from lightning is calculated using a linear relationship between lightning flashes and convective precipitation (Meijer et al., 2001), using a C-shaped profile suggested by Pickering et al. (1998). Marine lightning is assumed to be 10 times less active than lightning over land. The fraction of cloud-to-ground over total flashes is determined by a fourth-order polynomial function of the cold cloud thickness (Price and Rind, 1992). The NOx production for intra-cloud flashes is 10 times less than that for cloud-to-ground flashes, according to Price et al. (1997).

TOMCAT
The TOMCAT model is a Eulerian global CTM (Chipperfield, 2006). This study uses an extended VOC degradation chemistry scheme, which incorporates the oxidation of monoterpenes based on the MOZART-3 scheme and the oxidation of C2-C4 alkanes, toluene, ethene, propene, acetone, methanol and acetaldehyde based on the ExTC (Extended Tropospheric Chemistry) scheme (Folberth et al., 2006). Heterogeneous N2O5 hydrolysis is included using offline size-resolved aerosol from the GLOMAP model (Mann et al., 2010). The implementation of these two chemistry schemes into TOMCAT is described by Monks (2011) and Richards et al. (2013) and has 82 tracers and 229 gas-phase reactions. All anthropogenic, biomass burning and natural emissions were provided by POLMIP, with the exception of lightning emissions, which are coupled to the amount of convection in the model and therefore vary in space and time (Stockwell et al., 1999).

SMHI-MATCH
SMHI-MATCH (Multiple-scale Atmospheric Transport and Chemistry Modeling System) is an offline CTM developed at the Swedish Meteorological and Hydrological Institute (Robertson et al., 1999). SMHI-MATCH can be run on both global and regional domains, but the POLMIP model runs were performed for the 20-90° N region. The chemical scheme in MATCH considers 61 species using 130 chemical reactions and is based on Simpson (1992) but with extended isoprene chemistry and updated reactions and reaction rates. Information about the implementation of the chemical scheme can be found in Andersson et al. (2007), where evaluation of standard simulations for the European domain is also given. ERA-Interim reanalysis data from ECMWF were used to drive SMHI-MATCH for the years 2007 and 2008. The 6-hourly data (3-hourly for precipitation) were extracted from the ECMWF archives on a 0.75° × 0.75° rotated latitude-longitude grid. The original data had 60 levels, but only the 35 lowest levels, reaching to about 16 km in the Arctic, were used in SMHI-MATCH. Monthly average results for 2007 and 2008 from global model runs using MOZART at ECMWF in the MACC (Monitoring Atmospheric Chemical Composition) project were used as both upper and 20° N chemical boundary conditions. In addition to the standard daily POLMIP emissions, NO emissions from lightning were included using monthly data from the GEIAv1 data set, which has an annual global total of 12.2 Tg N yr^-1. DMS (dimethylsulfide) emissions were simulated using monthly DMS ocean concentrations and the flux parameterization from Lana et al. (2011).

WRF-Chem
The Weather Research and Forecasting model with Chemistry (WRF-Chem) is a regional CTM, which calculates online chemistry and meteorology (Grell et al., 2005; Fast et al., 2006). For the POLMIP runs the meteorology parameterizations are as described in the WRF-Chem (version 3.4.1) simulations of Thomas et al. (2013). Briefly, the initial and boundary conditions for meteorology are taken from the NCEP Final Analyses (FNL), with nudging applied to wind, temperature, and humidity every 6 h. The MOZART-4 POLMIP run is used for both initial and boundary conditions for gases and aerosols. The POLMIP emissions were used; however, the FINN fire emissions were processed using the WRF-Chem FINN processor, so the fire emissions are at finer resolution than 1° (used by the global models). In addition, an online fire plume rise model was employed (Freitas et al., 2007). Lightning emissions were included using the Price and Rind (1992) parameterization as described in Wong et al. (2013). WRF-Chem was run at two model resolutions (50 and 100 km) during the summer POLARCAT campaigns, with 65 levels from the surface to 50 hPa. Selected chemical species (e.g., ozone) are set to climatological values above 50 hPa and relaxed to a climatology down to the tropopause. For the POLMIP runs, WRF-Chem employs the MOZART-4 gas-phase chemical scheme described in Emmons et al. (2010) and the bulk aerosol scheme GOCART (Goddard Chemistry Aerosol Radiation and Transport model, Chin et al., 2002), together referred to as MOZCART. The model was run from 28 June 2008 to 18 July 2008 using a polar-stereographic grid over a domain encompassing both boreal fires and anthropogenic emission regions in N. America to include the ARCTAS-B, POLARCAT-GRACE, and POLARCAT-France flights. Because of the limited temporal and spatial extent of the WRF-Chem results, they could not be included in some of the plots and analysis below.

Overview of model characteristics and differences
In order to better understand the differences among models shown later in the comparison to observations, some general model characteristics are illustrated. Figure 2 shows zonal averages of temperature and water vapor for each of the models. As the models are driven or nudged by assimilated meteorology fields, their temperatures are in close agreement. One exception is LMDz-INCA, which has only horizontal winds nudged to ECMWF winds, with the remaining meteorological fields calculated by the model. The temperature differences seen here between LMDz-INCA and the other models are comparable to those seen in the ACCMIP (Atmospheric Chemistry and Climate Model Intercomparison Project) comparisons (Lamarque et al., 2013). The models show some variation in water vapor, particularly in the tropics and in the upper troposphere at high northern latitudes. Some models (e.g., MOZART-4, CAM-chem) calculate water vapor and clouds based on surface fluxes, while others use the GEOS-5- or ECMWF-provided specific humidity values. Significant differences in cloud distributions are seen among the models, as shown in Fig. 3, and these lead to significant differences in photolysis rates, as shown in Fig. 4. For example, CAM5-chem has greater cloud fractions in the tropical upper troposphere than CAM4-chem, which leads to lower photolysis rates, particularly noticeable in J(O3 → O(1D)). MOZART-FTUV simulations used the online Fast-TUV photolysis scheme that includes the impact of aerosols on photolysis but also has some outdated cross-sections that are the larger source of the differences with the MOZART-4 results.
All of these inter-model differences in physical parameters, along with differing transport schemes, lead to differences, to varying degrees, in the modeled ozone and OH distributions. Figure 5 summarizes these model differences by plotting the pressure-latitude location of the 50 and 100 ppb ozone contours of the April and July zonal averages. The 100 ppb O3 contour line is one method used to estimate the location of the tropopause. The model results shown here generally agree on the location of the 100 ppb contour, with two exceptions indicating a lower Arctic tropopause height: MATCH in April and July, and TOMCAT in July. The models vary widely in the distribution of tropospheric ozone. In April at high northern latitudes, the 50 ppb O3 contour for GEOS-Chem is at the highest altitude (500 hPa at 50° N) while GMI is at the lowest (900 hPa). Great variability is also seen in the tropics in both April and July. Some model differences in the lower troposphere could be due to different ozone dry deposition velocities, which can have a significant impact on ozone in the boundary layer (Helmig et al., 2007a). However, ozone deposition rates were not provided for this intercomparison, so this impact cannot be assessed. In the upper troposphere, model differences are more likely driven by differences in stratosphere-troposphere exchange. In addition, ozone chemical production and loss rates determine model ozone distributions, as indicated in the comparison of ozone precursors, below. Figure 6 similarly shows the zonal averages of OH, illustrating the large differences among models in the magnitude of OH concentration. In April, most of the models have values above 2 × 10^6 molecules cm^-3 in the northern tropics from the surface to 500 hPa. GMI is the only model to show a maximum greater than 2 × 10^6 molecules cm^-3 also in the upper troposphere. About half of the models have OH concentrations of at least 1 × 10^6 molecules cm^-3 throughout the troposphere between latitudes 20° S and 50° N. In July, even greater variability among models is seen in the shapes of both contour levels.
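Locating an ozone contour in a vertical profile, as used above for the 100 ppb tropopause estimate, can be sketched as follows. This is an illustration for a single column rather than the zonal-mean procedure used for Fig. 5: the function name `ozone_contour_pressure` is hypothetical, and linear interpolation in log-pressure between the bracketing levels is an assumption, not a method stated in the text.

```python
import math


def ozone_contour_pressure(pressure_hpa, ozone_ppb, threshold=100.0):
    """Pressure (hPa) at which ozone first reaches threshold, going upward.

    pressure_hpa must be ordered from the surface upward (decreasing values).
    Returns None if the profile never reaches the threshold.
    """
    levels = list(zip(pressure_hpa, ozone_ppb))
    if levels and levels[0][1] >= threshold:
        return levels[0][0]
    for (p_lo, o_lo), (p_hi, o_hi) in zip(levels, levels[1:]):
        if o_lo < threshold <= o_hi:
            # Interpolate linearly in log-pressure between the two levels.
            w = (threshold - o_lo) / (o_hi - o_lo)
            return math.exp(math.log(p_lo) + w * (math.log(p_hi) - math.log(p_lo)))
    return None
```

Calling the same function with `threshold=50.0` gives the position of the 50 ppb contour, so both contour levels discussed above can be extracted in the same way.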
Atmos. Chem. Phys., 15, 6721-6744, 2015 www.atmos-chem-phys.net/15/6721/2015/
To further illustrate and understand the differences in the modeled ozone, the time series of ozone and its precursors are plotted in Fig. 7 as monthly zonal averages at 700 hPa over 50-70° N, the latitude range of most of the aircraft observations. As in Fig. 5, wide variation among models is seen for ozone. Here we see disagreement in even the shape of the seasonal cycle. The mixing ratios of CO differ among models by 50 %, largely due to the differences in OH but also affected by concentrations of hydrocarbons that are precursors of CO. The differences in CO among models are discussed in detail in Monks et al. (2014). Ethane (C2H6) is only directly emitted, without any secondary chemical production, so the differences among models are due to OH or emissions. GEOS-Chem used slightly different emissions (see Table 2) and MATCH included acetone (CH3COCH3) and acetylene (C2H2) emissions in the ethane emissions, as it does not simulate those species. The differences in H2O2 (hydrogen peroxide) are likely a result of different washout mechanisms in the models but are also related to the HO2 differences. In addition, the heterogeneous uptake of HO2 on aerosols may differ significantly among models (e.g., Mao et al., 2013) but was not investigated in this comparison. LMDz-INCA and TOMCAT have higher NO2, PAN and HNO3 than others. GEOS-Chem has low PAN but relatively high HNO3. TM5 and C-IFS have lower formaldehyde (CH2O) than other models. High variability is seen among the models for acetaldehyde (CH3CHO) and acetone, with some disagreement in the seasonal cycle. The models have varying complexity in the hydrocarbon oxidation schemes, which contributes to the differences in these oxygenated VOCs, as discussed in Arnold et al. (2014). The differences among models are further explained below with regard to comparisons to observations.
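An average over a latitude band such as 50-70° N can be sketched as below, starting from zonal-mean values on a set of grid latitudes. The cosine-of-latitude weighting, which accounts for the shrinking area of grid cells toward the pole, is an assumption here; the text does not say how the band averages were weighted. The function name `band_mean` is hypothetical.

```python
import math


def band_mean(lats_deg, zonal_means, lat_min=50.0, lat_max=70.0):
    """Cosine-latitude-weighted mean of zonal-mean values within a latitude band."""
    weighted_sum = 0.0
    weight_total = 0.0
    for lat, value in zip(lats_deg, zonal_means):
        if lat_min <= lat <= lat_max:
            w = math.cos(math.radians(lat))  # proportional to grid-cell area
            weighted_sum += w * value
            weight_total += w
    if weight_total == 0.0:
        raise ValueError("no grid latitudes fall inside the band")
    return weighted_sum / weight_total
```

Applying this to each month of 700 hPa zonal-mean output would yield one time series per model and species, comparable in form to the curves described for Fig. 7.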

Comparison to observations
An overall evaluation of the models is presented here through comparison to ozonesondes, surface network NMHC measurements, satellite retrievals of NO 2 , and simultaneous observations of ozone and its precursors from aircraft. A comprehensive evaluation of the CO distributions in the POLMIP models is presented by Monks et al. (2014).

Ozonesondes
Coincident with the NASA ARCTAS aircraft experiment, daily ozonesondes were launched at a number of sites across North America (Fig. 8; Thompson et al., 2011). Ozonesondes, with their high vertical resolution and absolute accuracy of ±(5-10) %, are extremely valuable for model evaluation. The hourly POLMIP model output was matched to the time and location of each ozonesonde. Since the models, with roughly 0.5-1 km vertical layer spacing in the free troposphere, cannot reproduce all of the observed structure, the ozonesonde data and model profiles were binned to 100 hPa layers. The mean of each bin between the surface and 300 hPa was used to calculate the bias between model and measurements for each profile. Figures 9 and 10 show the mean and standard deviation of the observed ozone profiles at each site, along with the mean bias for each model, determined by averaging the difference between each model profile and the corresponding sonde profile. Only a small number of sondes were launched from Narragansett (four in April; three in June-July), so they have not been used here. In April, the models generally underestimate the observed ozone throughout the troposphere (negative bias). One consistent exception is SMHI-MATCH, which is higher than observed in the middle and upper troposphere, perhaps indicating that this model has too-strong transport of ozone from the stratosphere. At all sites, GEOS-Chem has the lowest ozone values at all altitudes above the boundary layer. TOMCAT also has among the largest negative biases in the lower and mid-troposphere, but is higher than most of the models at 300 hPa. All the other models have a fairly uniform (across altitude and sites) negative bias of about 5-10 ppb. The models have slightly lower biases in June-July on average. At Kelowna and Goose Bay, the model biases fall within ±10 ppb; however, at several other sites (e.g., Churchill Lake and Bratt's Lake), the model mean biases are as much as 20 ppb below the observations.
These comparisons are consistent with the ozone evaluation using aircraft observations presented by Monks et al. (2014).
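The layer binning and bias calculation described above can be sketched as follows. This is a minimal illustration, not the code used for Figs. 9 and 10: the function names, the keying of layers by their lower pressure bound, and the sample values are all assumptions made for the example.

```python
from statistics import fmean


def layer_means(pressure_hpa, ozone_ppb, top_hpa=300.0, width_hpa=100.0):
    """Average a profile into pressure layers of width_hpa, surface to top_hpa.

    Returns {lower pressure bound of layer (hPa): mean ozone (ppb)}.
    Layer edges at multiples of width_hpa are an assumption of this sketch.
    """
    layers = {}
    for p, o3 in zip(pressure_hpa, ozone_ppb):
        if p < top_hpa:
            continue  # ignore levels above the 300 hPa limit
        edge = int(p // width_hpa) * int(width_hpa)  # 742 hPa -> 700-800 hPa layer
        layers.setdefault(edge, []).append(o3)
    return {edge: fmean(vals) for edge, vals in layers.items()}


def profile_bias(sonde, model):
    """Model minus sonde for each layer present in both profiles."""
    s = layer_means(*sonde)
    m = layer_means(*model)
    return {edge: m[edge] - s[edge] for edge in s.keys() & m.keys()}


def site_mean_bias(profile_pairs):
    """Average the per-profile layer biases over all (sonde, model) pairs at a site."""
    collected = {}
    for sonde, model in profile_pairs:
        for edge, bias in profile_bias(sonde, model).items():
            collected.setdefault(edge, []).append(bias)
    return {edge: fmean(vals) for edge, vals in collected.items()}


# Hypothetical profiles: (pressure in hPa, ozone in ppb)
sonde = ([950, 910, 850, 810, 450, 420], [30, 34, 40, 44, 60, 64])
model = ([925, 825, 430], [28, 36, 55])
bias = profile_bias(sonde, model)
```

With these made-up values the model is 4, 6 and 7 ppb below the sonde in the 900, 800 and 400 hPa layers, respectively; layers sampled by only one of the two profiles are skipped.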

Surface network ethane and propane
The NOAA Global Monitoring Division/INSTAAR network of surface sites provides weekly observations of light NMHCs around the globe (Helmig et al., 2009). The model results for ethane and propane are compared to the data over a range of northern mid- to high latitudes in Fig. 11. Monthly mean model output is used and the nearest grid point (longitude, latitude, altitude) is selected for each site. All models (except GEOS-Chem, which used higher ethane emissions and has lower OH concentrations) significantly underestimate the winter-spring observations, indicating the POLMIP emissions are much too low for both C 2 H 6 and C 3 H 8 , consistent with the conclusion that CO emissions are too low (as discussed in Sect. 5.4 and in Monks et al., 2014).

Evaluation of NO 2
Satellite observations of NO 2 have been used to evaluate the individual model distributions of NO 2 across the Northern Hemisphere, as well as to evaluate the NO x emissions used for all the models. Each model was compared to OMI (Ozone Monitoring Instrument) DOMINO-v2 NO 2 tropospheric column densities (Boersma et al., 2011), matching the times of overpasses for each day and filtering out the pixels with satellite-observed radiance fraction originating from clouds greater than 50 %. In order to make a quantitative comparison between model results and satellite retrievals (of any kind), the sensitivity of the retrievals to the true atmospheric profile must be taken into account. This is done by transforming each model profile with the corresponding retrieval averaging kernel and a priori information (e.g., Eskes and Boersma, 2003), hence making the evaluation independent of the a priori NO 2 profiles used in DOMINO-v2. The transformation of the model profiles with the averaging kernels gives model levels in the free troposphere relatively greater weight in the column calculation. For instance, depending on the surface albedo the sensitivity to the upper free troposphere compared to the surface layer may increase by roughly a factor of 3 (Eskes and Boersma, 2003). This means that errors in the shape of the NO 2 profile can contribute to biases in the total column.
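As a minimal sketch of the averaging-kernel step, assume the kernel has already been interpolated to the model layers and that the model profile is expressed as partial columns per layer; the smoothed column is then the kernel-weighted sum of those partial columns. The function name and all numbers below are hypothetical and chosen only to show how the profile shape affects the result.

```python
def kernel_weighted_column(partial_columns, averaging_kernel):
    """Return sum_i A_i * x_i for partial columns x_i and kernel values A_i.

    Both sequences are ordered from the surface layer upward and must
    have the same length (kernel already on the model layers).
    """
    if len(partial_columns) != len(averaging_kernel):
        raise ValueError("profile and kernel must have the same number of layers")
    return sum(a * x for a, x in zip(averaging_kernel, partial_columns))


# Hypothetical kernel: sensitivity increases by a factor of 3 from the
# surface layer to the upper free troposphere.
kernel = [0.5, 1.0, 1.5]

# Two hypothetical model profiles with the same plain column (4.5, in
# units of 1e15 molecules cm^-2) but different vertical distributions.
surface_heavy = [3.0, 1.0, 0.5]
well_mixed = [1.5, 1.5, 1.5]

col_surface_heavy = kernel_weighted_column(surface_heavy, kernel)  # 3.25
col_well_mixed = kernel_weighted_column(well_mixed, kernel)        # 4.5
```

The two profiles have identical unweighted columns but different kernel-weighted columns, which illustrates why an error in the modeled profile shape can appear as a column bias.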
The statistics of the biases between the model results and the OMI NO 2 tropospheric columns are used to evaluate the NO x emissions inventory used in this study. Figure 12 shows the NO 2 tropospheric column from OMI for 18 June-15 July (the period for which hourly model output was provided) and the median of the model biases for that period. The models generally underestimate NO 2 over continental regions with high levels of anthropogenic pollution (e.g., California, northeastern United States, Europe, China); however, a few models overestimate NO 2 over North America (not shown). All models overestimate NO 2 over northeast Asia, in the region of fires (quantified below). OMI NO 2 retrievals have low signal-to-noise ratios over oceans and continental regions with low pollutant levels; therefore, conclusions should not be drawn from the model comparisons for those regions.
In Fig. 13 the OMI tropospheric columns are screened on a daily basis for pixels where at least 90 % of the total NO x emissions, based on the emissions inventory, originate from anthropogenic or biomass burning emissions, respectively. Only pixels with significant emission levels are shown. In this way, dominant source regions that are either primarily anthropogenic or biomass burning can be identified. Boxes are drawn around the highest concentrations in Fig. 13, and the regional means for each model, along with the mean observed columns, are summarized in Fig. 14. The model results are lower than the observations for most of the anthropogenic region comparisons. For instance, NO 2 columns over South Korea are considerably underestimated. Also, the inter-quartile range is relatively large for the Europe and eastern China regions, indicating a large uncertainty introduced by the models. Since all models used the same NO emissions, the large variation among models (as seen in Fig. 14) indicates differences in the chemistry and transport processes affecting NO and NO 2 . The large region of biomass burning in western Asia (April) is well captured, but over eastern Asia the models are typically too high. Also, the NO 2 from Siberian fires in July is greatly overestimated. The NO 2 column amounts are much lower for the fires in Canada than in Asia, but the models also overestimate the concentrations in this region, suggesting the NO x emission factor is too high for forests in the FINN emissions.
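The pixel screening described above can be sketched as a simple classification by emission share. The 90 % threshold is taken from the text; the function name, the separate "other" source term, and the minimum total emission used to stand in for "significant emission levels" are assumptions of this sketch, since the actual cutoff is not stated here.

```python
def classify_pixel(anthro, fire, other=0.0, threshold=0.9, min_total=0.0):
    """Label a pixel by its dominant NOx source in the emissions inventory.

    anthro, fire, other: inventory NOx emissions for the pixel (same units).
    Returns "anthropogenic", "biomass_burning", or None if neither source
    supplies at least `threshold` of the total or the total is too small.
    min_total is a hypothetical stand-in for the significance cutoff.
    """
    total = anthro + fire + other
    if total <= min_total:
        return None  # emissions too low to be considered
    if anthro / total >= threshold:
        return "anthropogenic"
    if fire / total >= threshold:
        return "biomass_burning"
    return None  # mixed sources, excluded from both categories


# Hypothetical pixels (arbitrary emission units)
labels = [
    classify_pixel(9.5, 0.3, 0.2, min_total=1.0),   # anthropogenic share 95 %
    classify_pixel(0.4, 9.2, 0.4, min_total=1.0),   # fire share 92 %
    classify_pixel(5.0, 4.0, 1.0, min_total=1.0),   # mixed sources
    classify_pixel(0.05, 0.0, 0.0, min_total=1.0),  # below the cutoff
]
```

Pixels labeled None would be left out of both the anthropogenic and the biomass burning comparisons.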
Figure 9. Comparison of models to ozonesondes for April, showing mean and standard deviation of the observations (black line) and the mean bias (colored lines) for each model at each site (Tarasick et al., 2010; Thompson et al., 2011). Results shown for only surface to 300 hPa for clarity. The number of sondes for each site is indicated in the lower right corner of each panel.
Figure 10. As Fig. 9, but for June-July.

Comparison to aircraft observations
For each aircraft campaign, the hourly output from each model was interpolated to the location and time of the flight tracks. These results have been compared directly to the corresponding observations for as many compounds as available. Figure 15 shows the flight tracks of the campaigns, which have been colored to indicate the grouping used in the following comparisons. The ARCTAS-A and ARCPAC (A1, A2, AP) campaigns took place in April and were based in Alaska. The A1 group of flights surveyed the Arctic between Alaska and Greenland at the beginning of April, while A2 and AP were primarily over Alaska in mid-April, which was after significant wildfires began in Siberia and influenced the observations (e.g., Warneke et al., 2009). The ARCTAS-CARB flights focused on characterizing urban and agricultural emissions in California, but also sampled the wildfire emissions present in the state. ARCTAS-B, based in central Canada, sampled fresh and aged fire emissions over Canada and into the Arctic. The POLARCAT-France and GRACE experiments, based in southern Greenland, sampled downwind of anthropogenic and fire emissions regions and included observations of air masses from North America, Asia, and Europe. Figures 16, 17 and 18 show vertical profiles of the observations with model results for the flights during ARCTAS-A1, ARCPAC and ARCTAS-B, respectively. For these plots the observations and the model results along the flight tracks were treated in the same way: each group of flights was binned according to altitude and the median value of each 1 km bin has been plotted. The thick error bars represent the measurement uncertainty (determined by applying the fractional uncertainty reported in each measurement data file to the median binned value), while the thinner horizontal lines show the variation (25th-75th percentiles) in the observations over the flights.
In general, the measurement uncertainty is much less than the atmospheric variability; however, for ARCPAC, several measurements have relatively large uncertainties (such as SO 2 , NO 2 and HNO 3 ).
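The altitude binning applied to both the observations and the flight-track model output can be sketched as below. The function name, the use of None to mark missing data, and the sample values are assumptions for illustration; only the 1 km bin width and the use of the median come from the text.

```python
from statistics import median


def binned_medians(altitude_km, values, bin_width_km=1.0):
    """Median of `values` in each altitude bin of width bin_width_km.

    Missing data are marked with None and skipped (an assumption of this
    sketch). Returns {bin lower edge (km): median value}.
    """
    bins = {}
    for z, v in zip(altitude_km, values):
        if v is None:
            continue
        index = int(z // bin_width_km)  # 1.5 km falls in the 1-2 km bin
        bins.setdefault(index, []).append(v)
    return {index * bin_width_km: median(vals) for index, vals in sorted(bins.items())}


# Hypothetical flight-track samples: altitude (km) and CO (ppb)
altitude = [0.2, 0.8, 1.5, 1.6, 1.9, 3.2]
observed = [120, 140, 100, None, 110, 90]
modeled = [110, 120, 95, 96, 100, 85]

obs_profile = binned_medians(altitude, observed)
mod_profile = binned_medians(altitude, modeled)
```

Applying the same function to the observed and modeled series keeps the two profiles directly comparable, although a point missing from the observations is still counted in the model bin unless it is masked beforehand.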
In addition, the difference between each model and the observations was determined for each data point along the flight tracks and then an average bias was determined for the altitude range 3-7 km, as shown in Fig. 19. In the cases where a compound was measured by more than one instrument, the differences between the model and each observation were averaged over all the measurement techniques. The uncertainties shown in Figs. 16-18 need to be kept in mind when considering the biases shown in Fig. 19.
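A minimal sketch of this bias calculation follows. The 3-7 km range and the averaging over instruments are taken from the text; the function names, the treatment of missing observations as None, and the example numbers are hypothetical. The sketch returns an absolute bias; any normalization to a percentage is left out because the exact definition used for Fig. 19 is not given here.

```python
from statistics import fmean


def midtrop_bias(altitude_km, model, observed, z_min=3.0, z_max=7.0):
    """Mean of (model - observed) over points with z_min <= altitude <= z_max.

    Points where the observation is None are skipped. Returns None if no
    valid points fall in the altitude range.
    """
    diffs = [
        m - o
        for z, m, o in zip(altitude_km, model, observed)
        if o is not None and z_min <= z <= z_max
    ]
    return fmean(diffs) if diffs else None


def multi_instrument_bias(altitude_km, model, observed_by_instrument):
    """Average the per-instrument biases for a compound measured several ways."""
    biases = [
        midtrop_bias(altitude_km, model, obs)
        for obs in observed_by_instrument.values()
    ]
    biases = [b for b in biases if b is not None]
    return fmean(biases) if biases else None


# Hypothetical flight-track data (ppb) with two instruments for one compound
altitude = [2.0, 3.5, 5.0, 6.5, 8.0]
model = [50, 52, 55, 58, 70]
observations = {
    "instrument_a": [60, 60, 61, None, 80],
    "instrument_b": [58, 62, None, 66, 82],
}
bias = multi_instrument_bias(altitude, model, observations)
```

In this example the per-instrument biases are -7 and -9 ppb, giving an averaged bias of -8 ppb; the points at 2 and 8 km lie outside the altitude range and do not contribute.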
Several models, but not all, underpredict ozone in spring by more than 10 %, consistent with the ozonesonde comparison shown in Fig. 9. All models (except GEOS-Chem) underpredict CO and hydrocarbons in spring and summer, likely indicating that the emissions used for POLMIP are too low. NO and NO 2 are generally underestimated in spring, with NO 2 biases ranging from 20 to 90 % too low. In summer, all of the models match the NO and NO 2 observations well in the mid-troposphere, but NO 2 is generally overestimated in the boundary layer (ARCTAS-B; Fig. 18), consistent with the OMI NO 2 comparisons for the Canada fire regions (Fig. 14).
NO y partitioning between PAN and HNO 3 is vastly different among the models (see Arnold et al., 2014). Many models significantly overestimate HNO 3 (by a factor of 10 in some cases), which could be primarily due to differences in washout and missing loss processes. A new version of LMDz-INCA includes the uptake of nitric acid on sea salt and dust, accounting for 25 % of the total sink of nitric acid (Hauglustaine et al., 2014). GMI includes otherwise unaccounted for nitrogen species in HNO 3 , partially explaining its overestimate. The simulated PAN values also vary significantly across models, which may be due to the differences in PAN precursors (NO x and acetaldehyde) at anthropogenic and fire source regions. Alkyl nitrates were found to be a significant contribution to the NO y budget of the ARCTAS observations, particularly in low-NO x environments, and their poor (or missing) representation in the models could also lead to model errors in NO y partitioning (Browne et al., 2013). The PAN measurements during ARCPAC are only available for the last half of the campaign, during which numerous fire plumes were sampled that were of a scale too fine to be reproduced in the models, resulting in an apparent underestimate by all of the models in the free troposphere (Fig. 17). The observed PAN values during ARCPAC are significantly higher than the ARCTAS-A1 observations, which were made before the Siberian fire plumes began influencing the Alaskan region. The models show very different concentrations of various oxygenated VOCs and very little agreement with observations. Methanol and ethanol are generally underestimated by the models. The models do a poor job of simulating formaldehyde in spring, but are much closer to the observations in summer (during ARCTAS-B and -CARB).
In April, acetaldehyde is underestimated by all of the models throughout the troposphere but with large differences among the models (10-95 % biases). In summer the models are more uniformly far below (80-100 %) the observations. Acetone is also poorly simulated by the models, with large differences among models in both spring and summer. Acetone in TM5 is particularly low, likely due to excessive dry deposition.
For ARCTAS (Figs. 16 and 18), the comparison to OH observations is shown. The distribution of OH is strongly affected by clouds and their impact on photolysis, which coarse-grid models cannot be expected to reproduce; however, these differences are likely averaged out in the binned vertical profiles. The average biases indicate that in April most of the models underestimate OH, particularly in the lower troposphere. The underestimate of ozone by some of the models will also lead to lower OH. In summer, the biases are smaller (see Fig. 18). The wide range of results in comparison to H 2 O 2 indicates that there is great uncertainty in the simulation of the HO x budget.
Figure 14. Summary of the regional means from each model and the OMI NO 2 tropospheric columns for each region indicated in Fig. 13. (top) Anthropogenic emissions in April and (middle) June-July, (bottom) biomass burning in both seasons. Red circles are mean OMI NO 2 observations for the region; box plots show median, 25th and 75th percentiles, whiskers to 5th and 95th percentiles of the model means.
Photolysis rates, calculated from actinic flux measurements on the NASA DC-8, are available for the ARCTAS flights. The photolysis rates J (O 3 → O 1 D) and J (NO 2 ) from a few models are compared to the observations in Figs. 16 and 18. MOZART and CAM-Chem, which use the same photolysis parameterization (fully described in Lamarque et al., 2012;Kinnison et al., 2007), agree fairly well with observations, while TOMCAT and SMHI-MATCH generally underestimate the photolysis rates. Some differences are expected due to the difficulty of representing clouds in the models.

Enhancement ratios of VOCs in fires
The measurements of numerous compounds and the frequent sampling of air masses influenced by wildfires by the DC-8 aircraft during ARCTAS allowed for a derivation of enhancement factors of VOCs relative to CO for several sets of fires, as cataloged and summarized by Hornbrook et al. (2011). In that analysis, Hornbrook et al. (2011) used a variety of parameters to identify fire-influenced air masses and their origin and age, including acetonitrile and hydrogen cyanide (CH 3 CN and HCN, which have primarily biomass burning sources), back trajectories from the aircraft flight tracks, and NMHC ratios (to determine photochemical age). During ARCTAS-B, numerous observations were made of fresh plumes from the fires burning in Saskatchewan, providing good statistics of the enhancement ratios. Since the photochemical age of these sampled plumes was generally less than 2 days, the error introduced due to chemical processing of the plumes is much less than for the older plumes from Asia, for example. The sampling of fresh plumes from the fires in Saskatchewan, with little influence of local anthropogenic sources, makes this a good period and location for the evaluation of fire emissions in the models.
Due to the coarse resolution of the models, along with the uncertainties in location, vertical distribution and strength of the sources, it is not expected that the models will capture the magnitude or exact location of plumes that were sampled by the aircraft. Therefore, instead of using the model results interpolated to the flight tracks, all of the grid points with CO mixing ratios greater than 150 ppb within the region of the fires (54-58° N, 252-258° E, model levels between the surface and 850 hPa) were used from the hourly output from each model. This model output was used to derive enhancement ratios of VOCs relative to CO, comparable to those derived by Hornbrook et al. (2011) (given in their Table 2 and Fig. 7). Figure 20 shows the enhancement ratios derived from the aircraft measurements, giving the mean and standard deviation of all observed Saskatchewan fire plumes. Also shown in Fig. 20 are the emission factors (EF) determined from the emissions inventory used by the models, averaged over 28 June-5 July and 54-58° N, 252-258° E. For each model, the enhancement ratio was determined as the slope of a linear fit to the correlation of each VOC to CO.
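The model enhancement ratio calculation can be sketched as an ordinary least-squares slope over the CO-screened grid points. The 150 ppb CO threshold comes from the text; the function name and sample values are hypothetical, the input is assumed to be already restricted to the fire region and to levels below 850 hPa, and ordinary least squares is an assumption since the type of linear fit is not specified here.

```python
def enhancement_ratio(co_ppb, voc_ppt, co_threshold_ppb=150.0):
    """Slope of an ordinary least-squares fit of VOC against CO.

    Only points with CO above co_threshold_ppb are used. The result has
    units of ppt VOC per ppb CO for the inputs assumed here.
    """
    pairs = [(c, v) for c, v in zip(co_ppb, voc_ppt) if c > co_threshold_ppb]
    if len(pairs) < 2:
        raise ValueError("need at least two points above the CO threshold")
    n = len(pairs)
    mean_c = sum(c for c, _ in pairs) / n
    mean_v = sum(v for _, v in pairs) / n
    sxx = sum((c - mean_c) ** 2 for c, _ in pairs)
    sxy = sum((c - mean_c) * (v - mean_v) for c, v in pairs)
    if sxx == 0:
        raise ValueError("CO values are all identical; slope is undefined")
    return sxy / sxx


# Hypothetical grid-point values: CO (ppb) and a VOC (ppt)
co = [120, 160, 200, 300]
voc = [600, 800, 1000, 1500]
slope = enhancement_ratio(co, voc)  # the 120 ppb point is screened out
```

Because the slope is insensitive to a constant background offset in either species, it can be compared with the observed enhancement ratios and with the inventory emission factors once the units are made consistent.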
For the VOCs with direct emissions and little or no secondary production (ethane, propane, methanol, ethanol), the VOC / CO ratios of the model mixing ratios are very close to the emission factors of the inventory used by the models. This indicates the chemical processing in the vicinity of the fires is slow enough that the observations are a good indicator of the actual fire emission factors. This also means the model ratios can be quantitatively compared to the observations. Thus, we can conclude for the Saskatchewan fires that the fire emission factors used are too high for ethane, too low for propane, about right for methanol and much too low for ethanol. However, the compounds that have significant chemical production in addition to emissions (i.e., formaldehyde, acetaldehyde and acetone) have VOC / CO ratios that differ considerably from the emission ratios. The model enhancement ratios of CH 2 O and CH 3 CHO are significantly higher than the inventory emission factors due to chemical production, but they agree well with the observations. The model ratios for acetone, however, are lower than the observations but not very different from the emission factor, implying the emission factors are too low.

Conclusions
Eleven global or regional chemistry models participated in the POLARCAT Model Intercomparison Project (POLMIP), allowing for an assessment of our current understanding of the chemical and transport processes affecting the distributions of ozone and its precursors in the Arctic. To limit the differences among models, a standard emissions inventory was used. All of the models were driven, to at least some degree, by observed meteorology (GEOS-5, NCEP or ECMWF) and therefore represented the dynamics of the study year (2008).
While the extensive suite of aircraft observations in 2008 at high northern latitudes is extremely valuable for evaluating the models, it cannot uniquely identify the source of emissions errors, as the Arctic is influenced by many sources at lower latitudes. However, several conclusions can be drawn about the emissions inventory used in this study. Based on the comparisons to aircraft observations and the NOAA surface network data, emissions of CO, ethane and propane are clearly too low. The comparisons to satellite retrievals of OMI NO 2 show a few regions of consistent model errors that indicate that anthropogenic NO x emissions are underestimated in East Asia, while fire emissions are overestimated in Siberia. Large differences are seen among the model NO 2 tropospheric columns over Europe and China, thus limiting the conclusions that can be drawn regarding the accuracy of the emissions inventory. The large range in modeled NO 2 (where NO x emissions were the same) also indicates that model chemistry and boundary layer parameterizations can significantly impact NO x chemistry. More accurate emissions inventories might greatly improve many of the model deficiencies identified in this study. Emissions inventories modified based on inverse modeling results, as well as results of this study, will be used in future work as one step in improving model simulations of Arctic atmospheric composition.
Figure 20. Correlations of VOCs to CO for the POLMIP models, compared to those derived from the DC-8 observations for the fires in Saskatchewan. The filled circle shows the enhancement ratio derived from DC-8 observations (Hornbrook et al., 2011). The asterisk shows the EF of the model emissions. The colored diamonds are the enhancement ratios determined for each model.
The simultaneous evaluation of the models with observations of reactive nitrogen species and VOCs has illustrated that large differences exist in the model chemical mechanisms, especially in their representation of VOCs and their oxidation. Most of the models showed a negative bias in comparison to ozone observations from sondes and aircraft, with a slightly larger difference in April than in summer. The models frequently underestimated ozone in the free troposphere by 10-20 ppb in the comparison with ozonesondes. In addition, 10-30 % negative model biases were seen in comparison to the mid-troposphere aircraft ozone measurements. Comparisons for ozone precursors such as NO x , PAN, and VOCs show much greater biases and differences among models. It appears numerous factors are the causes of these model differences. The differences among model photolysis rates and cloud distributions indicate some of the possible causes for differences in modeled OH, which leads to differences in numerous species and ozone production and loss rates.
Some differences among the simulated results are likely due to different physical parameterizations such as convection, boundary layer mixing and ventilation, and wet and dry deposition. Additional model diagnostics are required to better understand the differences among models. For example, comparison of the wet deposition rates and fluxes of a number of compounds could be informative in understanding the budgets of NO y , HO x and VOCs.
Evaluation of chemical transport models with numerous simultaneous observations, such as those of the POLARCAT aircraft experiments, can assist in a critical assessment of ozone simulations and identify model components in need of improvement. Model representation of the oxidation of VOCs and the NO y budget can have a significant impact on ozone distributions. Future chemical model comparisons should consider evaluation of VOCs and reactive nitrogen species as an important component of the evaluation of ozone simulations.
Acknowledgements. The numerous individuals who provided observations used in this study are gratefully acknowledged, including William H. Brune, Jingqiu Mao, Xinrong Ren, and David Shelow of Pennsylvania State University for the ARCTAS DC8 LIF OH measurements; Paul Wennberg and John Crounse of California Institute of Technology for the ARCTAS DC8 CIT-CIMS data (supported by NASA award NNX08AD29G); Steve Montzka of NOAA/ESRL/GMD for NOAA P3 flask samples of propane during ARCPAC; Joost de Gouw of NOAA/ESRL/CSD for ARCPAC PTRMS VOC observations; and John Holloway of NOAA/ESRL/CSD for ARCPAC CO and SO 2 (UV fluorescence) measurements.
The GEOS-5 data used with CAM-chem in this study have been provided by the Global Modeling and Assimilation Office (GMAO) at NASA Goddard Space Flight Center. We acknowledge the free use of tropospheric NO 2 column data from the OMI sensor from www.temis.nl. French co-authors acknowledge funding from the French Agence National de Recherche (ANR) CLIMSLIP project and CNRS-LEFE. POLARCAT-France was supported by ANR, CNRS-LEFE and CNES. This work was performed in part using HPC resources from GENCI-IDRIS (grant 2014-017141). V. Huijnen acknowledges funding from the European Commission under the Seventh Framework Programme (contract number 218793). Contributions by SMHI were funded by the Swedish Environmental Protection Agency under contract NV-09414-12 and through the Swedish Climate and Clean Air research program, SCAC. A. Wisthaler acknowledges support from BMVIT-FFG/ALR. ARCPAC was supported in part by the NOAA Climate and Health of the Atmosphere programs. J. Mao acknowledges the NOAA Climate Program Office's grant NA13OAR4310071. L. K. Emmons acknowledges support from the National Aeronautics and Space Administration under award no. NNX08AD22G issued through the Science Mission Directorate, Tropospheric Composition Program. The CESM project is supported by the National Science Foundation and the Office of Science (BER) of the US Department of Energy. The National Center for Atmospheric Research is funded by the National Science Foundation.
Edited by: M. Schulz