Radiative forcing and climate metrics for ozone precursor emissions: the impact of multi-model averaging

Multi-model ensembles are frequently used to assess understanding of the response of ozone and methane lifetime to changes in emissions of ozone precursors such as NO x , VOCs (volatile organic compounds) and CO. When these ozone changes are used to calculate radiative forcing (RF) (and climate metrics such as the global warming potential (GWP) and global temperature-change potential (GTP)) there is a methodological choice, determined partly by the available computing resources, as to whether the mean ozone (and methane) concentration changes are input to the radiation code, or whether each model’s ozone and methane changes are used as input, with the average RF computed from the individual model RFs. We use data from the Task Force on Hemispheric Transport of Air Pollution source–receptor global chemical transport model ensemble to assess the impact of this choice for emission changes in four regions (East Asia, Europe, North America and South Asia). We conclude that using the multi-model mean ozone and methane responses is accurate for calculating the mean RF, with differences up to 0.6 % for CO, 0.7 % for VOCs and 2 % for NO x . Differences of up to 60 % for NO x , 7 % for VOCs and 3 % for CO are introduced into the 20-year GWP.


Introduction
One method for characterising uncertainty in the climate sciences is to perform large, multi-model ensemble studies. This approach, provided that the range of models does indeed capture the range of climate responses to an applied perturbation, provides far more information, not only on the most likely climate response, but also on the likelihood of a range of possible responses, i.e. the uncertainty associated with the mean response. However, if further downstream analysis is performed on such a large model ensemble study, then methodological choices, which may be constrained by pragmatic concerns such as data processing time, must be made.
One common example of such an application of a model ensemble is in the calculation of climate metrics and their associated uncertainty. Climate metrics provide an important method of comparing the mean climate effects of emissions of various forcing agents. It is therefore desirable to be able to compute such metrics quickly and efficiently from input ensembles, but where possible without compromising on the quality of the reported values and, crucially, their associated measurements of model spread. Metrics such as the global warming potential (GWP) and global temperature-change potential (GTP, Shine et al., 2005) introduce additional uncertainty and depend strongly on the time horizon, H , that is under investigation, but also on the spatial distribution of the forcing agent, and its lifetime in the atmosphere. These last two properties can vary strongly with model.
It would therefore seem reasonable to ask: what is the minimum volume of data processing and input information that can be used to provide meaningful estimates of climate metrics from large multi-model studies, without compromising the quality of the reported metrics and the representativeness of the associated spread?
The Hemispheric Transport of Air Pollution (HTAP) study provides a useful test case for the present work (Task Force on Hemispheric Transport of Air Pollution, 2010). A part of this project perturbed, by 20 %, emissions of species which are known to affect atmospheric ozone concentrations (in this case, NO x , VOCs (volatile organic compounds) and CO). An ensemble of 11 chemistry transport models (CTMs) took part, and each perturbed the 3 precursors in 4 pre-defined source regions. Subsequent work by Fry et al. (2012) and Collins et al. (2013) assessed the radiative forcing (RF), GWP and GTP for the precursor species. Computational limitations prevented the analysis of the variability between models in the RF, GWP and GTP in Fry et al. (2012); instead, the ensemble mean fields ± 1 standard deviation were deemed to provide the minimum subset of fields which could be used to represent the mean and standard deviation in the derived metrics.
In the present work, we calculate the RF, GWP and GTP using output from each individual model in the HTAP ensemble. We then compare our results to those obtained with the ensemble-mean subsetting method of Fry et al. (2012). Hence, we can quantitatively assess the extent to which the RF calculated with the mean fields accurately represents the mean of the RF calculated using the ozone and methane fields from each model individually. Further, by comparing the estimates of model and metric uncertainty (as represented by the standard deviations) in RF, and in GWP and GTP, we can assess whether such a representative subset can be used to accurately convey the spread in derived climate metrics. The result of this assessment will then guide the extent to which the use of the computationally less expensive ensemble-mean fields can be used, without compromising the quality of information.
The particular case of NO x is interesting because cancellation between RF due to different components of the total RF (and hence the GWP and GTP) can substantially reduce model spread (Holmes et al., 2011), if individual components are correlated. Using values drawn from the aviation NO x literature, Holmes et al. (2011) found that, in general, a large (positive) RF due to the short-lived ozone forcing (driven directly by the NO x ) in any one model was associated with a large (negative) long-lived ozone forcing (driven indirectly by the effect of NO x on methane concentrations) in the same model. Hence the uncertainty in the net RF, derived from considering the uncertainty in each component on its own, was found to be almost double the uncertainty in the net RF when the correlation was taken into account. Our work builds on Holmes et al. (2011) by exploiting results from a single multi-model intercomparison, and investigating the effects of different timescales on the cancellation, for emissions from a number of different regions, and extends it to CO and VOCs (where the cancellation present in the NO x case does not occur).
Section 2 introduces the HTAP data and scenarios, and describes the radiation code used to perform the radiative transfer calculations. The method of Fry et al. (2012) to generate the subset of fields for input to the radiation code is briefly described, together with a description of further preparing this output for generation of the GWP and GTP metrics. Section 3 presents the initial ozone and methane fields that serve as input to the radiation code for both methodologies, and briefly discusses their differences. Sections 4 and 5 discuss the effect of the different methodologies on the reported RF and GWP and GTP, respectively, and conclusions are given in Sect. 6.

Models
The HTAP perturbation scenarios reduced emissions of the short-lived ozone precursor gases NO x , CO and VOCs by 20 % in four different regions (North America, Europe, South Asia and East Asia); in a further run, methane concentrations were perturbed globally. There are therefore 13 scenarios in addition to one control simulation. The models each ran for a period of 12 months after a spinup time of at least 3 months. The output of interest to this study is the set of tropospheric ozone fields, which are provided on each model grid at monthly mean resolution. Auxiliary information on methane lifetime changes for each scenario is used to calculate the change in methane and long-lived ozone concentrations as described in Sect. 2.3. Table 1 shows the HTAP nomenclature for the experiments, and the locations of the source regions. 11 CTMs (see Table 2) produced results for these scenarios. For comparison with previous literature, the 11 models used in our study are the same as those used in Fry et al. (2012) and Collins et al. (2013) (Table 2).
Of the 11 CTMs used in this study, 9 use meteorological background fields from reanalyses to drive the model, while 2 (STOC-HadAM3-v01 and UM-CAM-v01) are coupled to global climate models (GCMs) and use 2001 sea ice and sea surface temperature data to drive the GCM. The models also use a variety of sources for the baseline emissions data, with the result that a 20 % decrease in emissions is not equivalent in mass terms between models. Therefore, the model spread accounts for not only the uncertainties associated with transport and atmospheric chemistry, but also those in background emissions, which can be a substantial source of uncertainty. As input to the radiation code, however, it is the absolute mass change of the species which is important for the radiative transfer calculations. The model output is re-gridded to a common resolution of 2.75° latitude × 3.75° longitude, with 24 vertical levels, which is comparable to the resolution of the models on average. A common tropopause was identified as the level at which the lapse rate falls below 2 K km −1 . As many of the models do not include stratospheric chemistry, stratospheric changes in all species are neglected, and, above the tropopause, the models share a common climatology. Given the relatively coarse vertical resolution of the models, and that the data are monthly mean, any definition of tropopause is necessarily imperfect; however, this method ensures clarity when averaging monthly mean fields to form ensemble means, and minimises the aliasing of stratospheric ozone into the troposphere as part of the averaging process.
For each model, January, April, July and October are used as input to the code, in order to reduce run-time constraints whilst remaining sufficient to reasonably sample the annual cycle in transport and RF. Sensitivity tests have shown that the long-lived ozone and methane RFs are almost completely insensitive to increasing the number of months included (less than 1 part in 1000), and the short-lived ozone RFs have a sensitivity of the order of 0.5 % to increasing the number of months. Table S4 in the Supplement provides a brief outline of the sensitivity tests.

Radiation code
This study uses the Edwards-Slingo radiation code (Edwards and Slingo, 1996). The code uses the two-stream approximation to calculate radiative transfer through the atmosphere. Clouds are included in the code. Nine broadband channels in the longwave and six channels in the shortwave are used. Incoming solar radiation at mid-month and Gaussian integration over six intervals are used to simulate variation in the diurnal cycle. A common background climatology supplying temperature and humidity is taken from the European Centre for Medium-Range Weather Forecasts reanalyses (Dee et al., 2011). Mean cloud properties from the International Satellite Cloud Climatology Project (ISCCP) are also used for all RF simulations (Rossow and Schiffer, 1999). RF is calculated as the difference in the net flux at the tropopause after the stratospheric temperature has been allowed to adjust using the standard fixed dynamical heating method (e.g. Fels et al., 1980).
Table 2. Methane lifetime (α), feedback factor (f ), and the methane lifetime change due to a 20 % global reduction in methane, for each of the 11 CTMs, and the ensemble mean and standard deviation, as calculated in Fiore et al. (2009). Model abbreviations are explained in Fiore et al. (2009).

Construction of input metrics
The necessary inputs to the radiation code are the changes in atmospheric concentration of any radiatively active species. In this case, the relevant species are short-lived ozone, methane, and long-lived ozone, which is perturbed as a result of the influence of methane on the abundance of the OH radical.
The CTMs produce [OH], [O 3 ] and associated atmospheric loss rates as 3-D output fields. Short-lived ozone can be used directly as input to the radiation code. Methane fields for each model and each simulation were globally homogeneous, and fixed at 1760 ppbv, except in the CH 4 scenario, when they were reduced to 1408 ppbv. Equilibrium methane concentrations for each scenario have been calculated in Collins et al. (2013); the methane lifetimes are calculated in Fiore et al. (2009). These lifetimes include loss terms such as those due to soil processes and stratospheric loss; however, all those except the atmospheric term are assumed to be constant. The change in methane lifetime is calculated in Collins et al. (2013) from the change in [OH] (which accounts for around 90 % of the loss of atmospheric CH 4 ; all other sinks are considered constant). Finally, the feedback factor, f , is determined in Fiore et al. (2009) from the change in loss rates between the control and the CH 4 perturbation scenarios, and accounts for the effect of methane change on its own lifetime (Prather, 1996). Further, long-lived changes also arise from the change in ozone resulting from a change in methane, which in turn depends on the change in methane lifetime for a given scenario. The long-lived ozone changes for each model and scenario are calculated as described in West et al. (2009) by scaling the ozone change in the CH 4 perturbation simulation by the relative change in methane concentration in each scenario as given in Fiore et al. (2009).
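As an illustration of how a lifetime change and the feedback factor combine into an equilibrium methane concentration, the sketch below assumes the power-law scaling commonly used for this purpose, in which the control concentration is multiplied by the lifetime ratio raised to the power f. Both this functional form and the numerical inputs are assumptions made for illustration; they are not taken from Table 2 or confirmed by the text above.

```python
def equilibrium_methane(ch4_control_ppbv, lifetime_control_yr,
                        lifetime_change_yr, feedback_factor):
    """Steady-state CH4 after a change in methane lifetime.

    Assumed scaling: [CH4]_new = [CH4]_control * (tau_new / tau_control) ** f,
    where f amplifies the response because methane affects its own lifetime.
    """
    lifetime_ratio = (lifetime_control_yr + lifetime_change_yr) / lifetime_control_yr
    return ch4_control_ppbv * lifetime_ratio ** feedback_factor


# Illustrative inputs only: 1760 ppbv control value from the text,
# hypothetical lifetime (9.0 yr), lifetime change (+0.05 yr) and f (1.3).
ch4_new = equilibrium_methane(1760.0, 9.0, 0.05, 1.3)
ch4_change = ch4_new - 1760.0
print(f"equilibrium CH4: {ch4_new:.1f} ppbv (change {ch4_change:+.1f} ppbv)")
```

A longer lifetime gives a higher equilibrium concentration, and a feedback factor above 1 makes the fractional concentration change larger than the fractional lifetime change.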
For each individual model, the inputs to the radiation code are the control and scenario 3-D monthly mean short-lived ozone, methane and long-lived ozone fields. Radiative transfer calculations are performed separately on each of these fields, so that the individual contributions can be separated out. The RF is the difference between the scenario and control fields for each species, and the total RF is taken to be the sum of these components. Sensitivity tests have shown that the total RF is very close (within 0.5 %) to the sum of the individual contributions from the component gases. The mean of the resulting RF ensemble is denoted RF.
This full model ensemble is contrasted with the method used in Fry et al. (2012). This method first constructs a representative subset of model input fields for input into the radiation code. This subset comprises the ensemble mean control fields, plus the ensemble mean ± standard deviation short-lived ozone, methane and long-lived ozone perturbations. This subset of fields is constructed as follows: firstly, each model field for each month is regridded to a common resolution; secondly, the mean and standard deviation of the ozone field is calculated for each month, for each pixel at each level. The standard deviation is then added to or subtracted from the mean field to give a 3-D representative field for each month.
These fields are grouped into four cases: the first comprises the control fields; the second the mean total ozone change (i.e. the sum of the short- and long-lived mean ozone fields) together with the mean methane change; the third the mean plus standard deviation total ozone and methane change; and the final case the mean minus the standard deviation changes. Therefore the radiation code must run only 3 times for each HTAP scenario (plus once for the control run), relative to 33 (11 models, 3 gaseous species) plus 11 control runs for the complete case.
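The construction of the representative subset described above can be sketched with NumPy as follows. The grid size, the synthetic data and the use of the population standard deviation are assumptions made for illustration only; the sketch covers a single month and a single species.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_lev, n_lat, n_lon = 11, 24, 8, 12  # toy grid, much coarser than the real one

# Synthetic ozone perturbations, already regridded to a common resolution.
fields = rng.normal(loc=-1.0, scale=0.3, size=(n_models, n_lev, n_lat, n_lon))

# Mean and standard deviation across models, for each pixel at each level.
mean_field = fields.mean(axis=0)
std_field = fields.std(axis=0)  # population std (ddof=0) is an assumption

subset = {
    "mean": mean_field,
    "mean_plus_sd": mean_field + std_field,
    "mean_minus_sd": mean_field - std_field,
}

# Radiation-code runs per scenario, using the counts given in the text.
runs_subset = len(subset) + 1         # 3 perturbation cases + 1 control
runs_full = n_models * 3 + n_models   # 33 species runs + 11 control runs
print(runs_subset, runs_full)
```

The two run counts make the computational saving of the subsetting approach explicit for one scenario.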
The subsetting method of calculation used in Fry et al. (2012) gives only the total RF for each scenario as output.
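The contributions to the total RF from each of the short-lived ozone, methane and long-lived ozone are then calculated from this total. First, the methane RF is calculated from the change in concentration using the simple formula of Myhre et al. (1998):
$$\mathrm{RF}_{\mathrm{CH_4}} = \alpha\left(\sqrt{M} - \sqrt{M_0}\right) - \left[f(M, N_0) - f(M_0, N_0)\right],$$
$$f(M, N) = 0.47 \ln\left[1 + 2.01 \times 10^{-5} (MN)^{0.75} + 5.31 \times 10^{-15} M (MN)^{1.52}\right],$$
where α is a constant, 0.036, N is N 2 O in ppb (constant at 315 ppb), M is CH 4 in ppb, and the subscript 0 indicates the unperturbed case.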
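A minimal sketch of this simple methane formula is given below. It assumes the methane coefficient (0.036) and the methane and N 2 O overlap function as they are commonly quoted from Myhre et al. (1998), with concentrations in ppb and RF in W m −2 ; the function names are hypothetical.

```python
import math


def overlap(m_ppb, n_ppb):
    """CH4-N2O band overlap term (assumed form, after Myhre et al., 1998)."""
    mn = m_ppb * n_ppb
    return 0.47 * math.log(1.0 + 2.01e-5 * mn ** 0.75
                           + 5.31e-15 * m_ppb * mn ** 1.52)


def methane_rf(m_ppb, m0_ppb, n0_ppb=315.0, alpha=0.036):
    """RF (W m-2) for a CH4 change from m0_ppb to m_ppb at fixed N2O."""
    direct = alpha * (math.sqrt(m_ppb) - math.sqrt(m0_ppb))
    overlap_change = overlap(m_ppb, n0_ppb) - overlap(m0_ppb, n0_ppb)
    return direct - overlap_change


# 20 % reduction used in the CH4 perturbation scenario: 1760 -> 1408 ppb.
rf = methane_rf(1408.0, 1760.0)
print(f"methane RF: {1000.0 * rf:.1f} mW m-2")
```

Under these assumptions the result is negative and of the order of −0.14 W m −2 , the same order as the methane RF reported for the CH 4 perturbation scenario.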
The difference between the total RF and this methane RF is then attributed to ozone. For the calculation of the GWP and GTP metrics, it is further necessary to separate the ozone RF between the short- and long-lived components. This is achieved by scaling the RF due to the (purely long-lived) ozone perturbations in the SR2 scenario by the ratio of the long-lived ozone change in any given scenario to that in the SR2 scenario. This RF is attributed to the long-lived ozone, with the final residual being attributed to the short-lived ozone. The mean and standard deviation of the RF calculated using this subset of fields are denoted RF EN .

Climate metrics
The methodology for calculation of the climate metrics (GWPs and GTPs) follows that described in Fuglestvedt et al. (2010), including the same impulse-response function for carbon dioxide, and the climate impulse-response function sensitivities from Boucher and Reddy (2008), which are needed for the GTP calculation. The metric calculations require the RF per unit emission per year, for each precursor and for the short-lived ozone, long-lived ozone and methane changes individually.
The calculation of GWP and GTP for each individual model is straightforward, as is the subsequent calculation of the ensemble mean and standard deviation. The implied change in methane emissions in the SR2 scenario must be calculated, as the scenario itself perturbed the atmospheric methane concentrations directly. This is done following the method in Collins et al. (2013) for each individual model.
The GWP and GTP are both the sum of a short-lived ozone component, which depends only on the ozone RF, and a long-lived component, which depends on the methane and long-lived ozone RF, and the change in the methane lifetime.
For the Fry-method subset, the ensemble-mean GWP and GTP are first calculated, and then a separate standard deviation due to each of these four variables is calculated. The total mean and standard deviation due to ozone changes are calculated, and then the total standard deviation is calculated in standard fashion as the square root of the sum of the variances. Note that this assumes independence between the variables. This is not necessarily the case because of correlations between the different perturbations (e.g. Wild et al., 2001); however, for the purposes of this evaluation this provides a useful upper bound, and is consistent with the published literature (Collins et al., 2013). Table 2 shows the control-run methane lifetimes, the feedback factor and the change in methane lifetime between the control and the CH 4 perturbation experiment for each model and the ensemble mean. The methane lifetime varies by about 20 %, from around 8 to 10 years, with the exception of the LLNL-IMPACT-T5a model, which has a much shorter lifetime of around 5.5 years. The feedback factor has a variability of around 10 %, with no substantial outliers.
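The effect of the independence assumption in the quadrature sum described above can be illustrated with two components. The sketch uses the standard identity for the variance of a sum of two correlated variables; the standard deviations and the correlation coefficient are illustrative values, not results from the ensemble.

```python
import math


def sd_of_sum(sd_a, sd_b, corr=0.0):
    """Standard deviation of a + b for components with correlation corr."""
    variance = sd_a ** 2 + sd_b ** 2 + 2.0 * corr * sd_a * sd_b
    return math.sqrt(max(variance, 0.0))


# Illustrative spreads for a short-lived and a long-lived component.
sd_short, sd_long = 1.0, 0.8

independent = sd_of_sum(sd_short, sd_long)                 # quadrature sum
anticorrelated = sd_of_sum(sd_short, sd_long, corr=-0.9)   # NOx-like case

print(f"independent: {independent:.2f}, anti-correlated: {anticorrelated:.2f}")
```

When the components are anti-correlated the quadrature sum overestimates the spread of the total, which is why it acts as an upper bound in that case.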

Ozone and methane input fields
To test whether the model ensemble-mean and standard deviation input fields can be used to generate climate metrics that are representative of the model ensemble, we must first establish the extent to which the ensemble mean and standard deviation represents the input fields. Figure 1 shows the short-lived ozone annual-average mass changes for the 10 or 11 individual models used in this study (note that INCA VOCs, SA region, and LLNL NO x , all regions, are missing in the input fields).
The ensemble-mean and standard deviation short-lived ozone mass change, and the true mean and standard deviation are shown in red and blue, respectively. The mean values are identical in both cases. The two sets of bars represent the spread in the model ensemble and denote the model standard deviation calculated in two different ways. Those in blue show the standard deviation calculated from the global-average burden change for each individual model. Those in red show the area-average of the 3-D grid-point-level standard deviation fields, as in the subsetting method used by Fry et al. (2012). Here, the bars are calculated as the global annual-mean ± 1 standard deviation ozone field. The global average of the grid-point level standard deviation fields is not equal to the standard deviation calculated after the global mean for each model has been calculated; i.e. the order of operations in this case makes a substantial difference to the ±1 standard deviation bars. For any set of fields, the true standard deviation will always be overestimated by the area average of the 3-D standard deviation.
This effect is purely mathematical in origin, and its size is related to the degree of inhomogeneity of the initial fields. The short-lived ozone mass change fields are spatially inhomogeneous in both the horizontal and the vertical. Of the three precursor species, NO x is the most short-lived, and has the highest degree of spatial inhomogeneity. Therefore the difference between the two methods of standard deviation calculations is largest in the ozone fields for the NO x case. For a completely homogeneous field (in kg m −3 ), there would be no difference in standard deviation between the two methods.
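The order-of-operations effect described above can be reproduced with synthetic data. In the sketch below the fields, the area weights and the grid size are all illustrative assumptions; only the comparison between the two ways of computing the spread matters.

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_cells = 11, 500

# Normalised area weights for a flattened grid (illustrative).
weights = rng.random(n_cells)
weights /= weights.sum()


def both_spreads(fields, weights):
    """Return (std of per-model global means, global mean of grid-point std)."""
    sd_of_global_means = (fields @ weights).std()
    global_mean_of_sd = fields.std(axis=0) @ weights
    return sd_of_global_means, global_mean_of_sd


# Inhomogeneous case: every model has its own spatial pattern.
patchy = rng.normal(size=(n_models, n_cells))
true_sd, gridpoint_sd = both_spreads(patchy, weights)

# Homogeneous case: every model is spatially uniform.
offsets = rng.normal(size=(n_models, 1))
uniform = np.repeat(offsets, n_cells, axis=1)
true_sd_u, gridpoint_sd_u = both_spreads(uniform, weights)

print(true_sd, gridpoint_sd)      # grid-point version is larger
print(true_sd_u, gridpoint_sd_u)  # the two agree
```

Averaging before taking the standard deviation lets positive and negative deviations within a model cancel, so the grid-point-level calculation can only match or exceed the spread of the global means.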
The largest standard deviations relative to the mean are found for the VOC case, in part due to large differences between the models in terms of VOC speciation and chemistry schemes (e.g. Collins et al., 2002). Since each model defines its own VOC class within the chemistry scheme, the initial burden and the atmospheric lifetime can vary substantially between models.
It should also be noted that the spatial distribution of the short-lived ozone mean and standard deviation fields is not necessarily representative of any single, individual model. Figure 2 shows the deviation from the ensemble-mean column-integrated ozone field for the NO x NA case. The top three rows show the deviation from the ensemble mean for each ensemble member, and the bottom row shows the same for the ensemble mean and standard deviation fields. By construction, in the bottom row, the deviation from the mean is everywhere positive for the positive case, and always negative for the negative case. However, for any individual model, there can be both positive and negative deviations and for only a few models do their deviations resemble the ensemble-mean case. Therefore the resulting RF fields from the ensemble-mean calculation may not be expected to provide a realistic representation of the spread of forcings about the mean, when individual model ozone fields are used to calculate the forcing.
Figure 2 caption (beginning missing): ... Table 1) for each individual model (top three rows). The bottom row shows the ensemble mean deviation (centre; by definition this is zero everywhere) and the plus (left) and minus (right) 1 standard deviation from this mean.

Radiative forcing
The major part of this section discusses the effect of the two averaging methods on the mean and spread of RF estimates. However, the RFs for the individual models in the HTAP ensemble have not previously been presented, and may be of some interest. A brief discussion of the complete ensemble also serves to frame the subsequent discussion around appropriate averaging methods.

Results for individual ensemble members
Figure 3 shows the RF for all 11 models, per unit mass emission (mW m −2 (Tg year −1 ) −1 N, C and CO for the SR3, SR4 and SR5 scenarios, respectively), and the RF in mW m −2 for the 20 % reduction in methane for the SR2 scenario. RF due to short-lived ozone, methane and long-lived ozone is in general largest in SA and smallest in EU for any given model and scenario, largely due to an increased RF per unit radiatively active species due to warmer background temperatures in SA relative to EU, and also a greater impact of oxidant changes on methane lifetime in the tropics. For VOCs and CO, the methane and ozone RF act in the same direction, in contrast to NO x , where methane is suppressed and therefore it, and the long-lived ozone, act to oppose the RF due to short-lived ozone. The global-mean RF for any given model is less dependent on the location of the emission for the CO case than for the VOCs or NO x case, as CO has a much longer atmospheric residence time of 3 months, which is of the same order as the hemispheric atmospheric mixing time. The differences between the regions are therefore more pronounced for NO x than for VOCs or CO, as a result of the greater inhomogeneity in the input fields.
Figure 3 caption (beginning missing): ... Table 1. Units are mW m −2 (Tg year −1 ) −1 for the NO x , VOCs and CO cases. For the CH 4 case, results are presented un-normalised, in mW m −2 . Colours show RF due to short-lived ozone (light blue), methane (red) and long-lived ozone (dark blue).
The forcing for the CH 4 perturbation scenario (bottom panel of Fig. 3) comprises only the methane and long-lived ozone contributions, since there is no short-lived ozone forcing arising from a change in methane. The absolute methane RF is identical (−141 mW m −2 ) across all models, as they all have the same mixing ratio change, but they differ in the size of the long-lived ozone response to the change in methane.
For a particular precursor species, models with a large response in one region will tend to have a large response in all regions, i.e. the models all agree on the order of the regional responses. These depend on the relative size of emissions change in each region and the mass-normalised RF. This is a good indicator of consistency across different emissions data sets and in transport in models, information which cannot be gained by using the model ensemble mean alone. Therefore differences between regions are more robust than suggested by the standard deviation.
Table 3. Total RF per unit mass emission (mW m −2 (Tg year −1 ) −1 ) for each scenario. The standard deviation values given for RF EN are the RF resulting from the Fry-method subset mean and standard deviation short-lived ozone, methane and long-lived ozone fields, as described in Sect. 3. The true standard deviation values for RF are calculated after the total RF for each model in each scenario has been calculated; therefore they are not equal to the sum of the standard deviation for each component gas. For the CH 4 case, results are presented unnormalised, in mW m −2 , since the perturbation was a 20 % reduction in atmospheric concentration of methane, rather than a reduction in precursor emissions.
For NO x , there is substantial anti-correlation between the short-lived ozone and methane responses, and hence the short-lived and long-lived ozone responses, with r 2 values between 0.70 (EA) and 0.86 (NA and SA, Table S2). This will result in a smaller standard deviation than if the quantities were truly independent of each other, as found by Holmes et al. (2011) for the case of aviation NO x emissions.

Ensemble-mean RF measures
Table 3 compares RF, ±1 standard deviation per unit mass emissions change, with the mean and standard deviation of the computationally much less intensive RF EN (the case in which the subsetting approach used in Fry et al. (2012) has been followed). Differences between the means are only of the order of a few percent, with the largest differences found for the NO x NA case of 2 %. For VOCs and CO, the differences are essentially negligible. The larger fractional difference in the case of NO x is due to the fact that the means are a small residual of two much larger components. Hence RF EN is representative of the true ensemble mean, RF. By contrast the standard deviation in the RF case is smaller for each regional scenario relative to RF EN . This is largely associated with the inability of the pre-calculated ensemble mean fields to represent the true model spread, as described in Sect. 3. Figure 4 separates the total RF into components due to the long-lived ozone, methane, and short-lived ozone contributions, for each scenario and gas, for the RF EN and RF and their associated standard deviations. The difference in the size of the standard deviation is in general much larger for the short-lived ozone RF estimates (light blue bars) than for the long-lived ozone or methane components. This difference is, in effect, a direct transform of the mathematical averaging effect in the input fields (see Sect. 2.3), and the standard deviation divided by the mean is the same in the input fields as it is after the radiative transfer calculations.
In the CH 4 perturbation case, the absolute methane RFs (red bars) have no uncertainty associated with inter-model differences because the methane concentration change is fixed. The RF calculated using the formula of Myhre et al. (1998) is −139.6 mW m −2 for RF EN , whereas the value calculated by the Edwards-Slingo radiation code for RF is slightly more negative at −141 mW m −2 . It should be noted that some uncertainty is introduced into subsequent metric calculations, arising from the variability in the implied methane emission change, which in turn arises from variability in the methane lifetime and change in methane lifetime.

Global warming potentials
The results above suggest that the subsetting approach to reduce the volume of calculations that must be performed may indeed be a useful method for quickly calculating ensemble mean RF; however, it is also apparent that estimates of the model spread might not be most appropriately calculated in this fashion. Metrics that are further downstream in terms of the impact chain, such as GWP and GTP, introduce further nonlinearities which must be considered when discussing the validity of this subsetting approach. Estimates of the GWP using the ensemble mean subsetting method are denoted GWP EN , while the true values are denoted GWP. GWPs for each individual model are calculated as described in Sect. 2.4 using the method of Fuglestvedt et al. (2010). Tables 4 and 5 give the values of the 20- and 100-year GWP, respectively, in each case for the two methods under consideration, with the associated standard deviations. As previously, the mean values resulting from both methods remain very similar, with differences of the order of 2-3 % for CO, 5 % for VOCs and up to 60 % for NO x , once again as it is a small residual of the opposing short- and long-lived terms.
Atmos. Chem. Phys., 15, 3957-3969, 2015 www.atmos-chem-phys.net/15/3957/2015/
Estimates of the standard deviation using the subsetting method described in Fry et al. (2012), consistent with the previous section, are larger than the full model ensemble; however, the difference between the two standard deviation estimates is no longer simply related to the differences in the input fields.
The total GWP at time horizon H is the sum of contributions from short- and long-lived components (i.e. from RF due to short-lived ozone, and due to long-lived ozone, methane concentration and methane lifetime, respectively). The difference between this estimate of the standard deviation and the full ensemble estimate therefore depends on the size of each of these terms and their relative contribution to the total estimate of the standard deviation.
The absolute GWP of the short-lived ozone component does not depend on the time horizon under consideration, and it is still in effect directly proportional to the RF. Therefore the standard deviation divided by the mean of the short-lived ozone GWP remains the same as that for the RF and indeed for the input ozone fields, as does the relative difference in the size of the standard deviation estimates from the two methods. Table S3 gives the GWPs and GTPs, together with their associated standard deviation estimates for the total and for each contributing component.
Table 4. Ensemble-mean 20 year GWP. The true mean GWP is denoted GWP. The GWP calculated using the method described in Fry et al. (2012) is denoted GWP EN . Average methane lifetimes used in the metric construction are given in Table 2.
The time-evolving components of the GWP, however, do not preserve this relationship, although the calculated standard deviations for each component remain larger using the subsetting method than calculating the true spread from the individual model GWPs. The total GWP is the sum of these components, and the relative difference in the calculated standard deviations from the two methods depends on the relative size of the contributions from the long- and short-lived components. At 20 years, the short-lived ozone contributes proportionately more to the total GWP than at 100 years. This results in the relative differences between the standard deviation estimates from the two methods being proportionately larger at 20 than at 100 years for CO, VOCs and NO x .
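The dependence of the short-lived share on the time horizon can be sketched with a simplified pulse-emission model. The sketch assumes that the short-lived ozone forcing is realised entirely within the first year, that the long-lived terms decay with a single methane perturbation lifetime, and that both forcings are expressed as steady-state RF per unit emission rate. The function names and all numerical values are illustrative assumptions, not ensemble results.

```python
import math


def agwp_components(f_short, f_long, tau_pert_yr, horizon_yr):
    """Absolute GWP terms (short-lived, long-lived) for a one-year pulse.

    f_short, f_long: steady-state RF per unit emission rate (assumed inputs).
    tau_pert_yr: methane perturbation lifetime in years (assumed single value).
    """
    short_term = f_short  # independent of the time horizon
    long_term = f_long * (1.0 - math.exp(-horizon_yr / tau_pert_yr))
    return short_term, long_term


def short_lived_share(f_short, f_long, tau_pert_yr, horizon_yr):
    """Fraction of the summed absolute contributions due to short-lived ozone."""
    short_term, long_term = agwp_components(f_short, f_long, tau_pert_yr, horizon_yr)
    return abs(short_term) / (abs(short_term) + abs(long_term))


# NOx-like illustration: positive short-lived and negative long-lived forcing.
F_SHORT, F_LONG, TAU_PERT = 1.0, -0.8, 12.0

share_20 = short_lived_share(F_SHORT, F_LONG, TAU_PERT, 20.0)
share_100 = short_lived_share(F_SHORT, F_LONG, TAU_PERT, 100.0)
print(f"short-lived share at 20 yr: {share_20:.2f}, at 100 yr: {share_100:.2f}")
```

Because the long-lived term has not fully accumulated after 20 years, the short-lived component carries more of the total at that horizon, so any difference in its standard deviation between the two methods carries more weight there.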

Global temperature-change potentials
The 20- and 100-year GTP means and standard deviations for the two methods are given in Tables 6 and 7. In common with most of the GWP calculations, the mean GTPs for both methods differ by only a few percent. The standard deviation estimates resulting from the subsetting method are once again almost always larger than the true value obtained from the complete ensemble.
Similar principles apply to the relationship between the uncertainty estimates for the GTP as for the GWP. One important difference from the GWP in the 20-year case is that the long-lived terms contribute much more, relative to the short-lived ozone terms. This means that, in contrast to the 20-year GWP, the 20-year NO x GTP is robustly negative in all cases.
For the 100-year GTP, the short-lived ozone contribution generally makes up a larger share of the total than in the 20-year case. The relative contributions of each species and the methane lifetime to the total standard deviation estimates for both methods are given in Table S3 in the Supplement. This interplay between the various timescales associated with the GWP and GTP evolves with time, with the result that the difference between the two methods also evolves with time.

Comparison of GWP and GTP time evolution for NO x
Figure 5 shows the time evolution of the GWP (top) and GTP (bottom) for the NO x SA region. Coloured lines show the evolution of each model, with the solid black line and dotted lines giving the true mean and standard deviation. The dashed lines and grey shading give 1 standard deviation about the mean GWP EN . Models which have a longer methane lifetime have a steeper GWP gradient at 20 years than models with a short methane lifetime; however, this is not necessarily a good indicator of a more negative NO x GWP at 20 years. Of the four longest lifetime models, three (CAMCHEM-3311m13, UM-CAM-v01 and MOZECH-v16) have GWP values that are more positive than the mean, with the fourth (GISS-PUCCINI-modelE) lying well within 1 standard deviation. This indicates that these models also have a large short-lived ozone forcing.
The GWP has its largest standard deviation between 10 and 30 years, when both short-lived ozone and methane forcings are important. The GWP EN overestimates the true standard deviation everywhere, but particularly around 10-30 years. At these timescales, the standard deviations produced in this way lie outside the range of the ensemble members, and are therefore not a good estimate of the uncertainty of the ensemble.
The GTP (lower panel in Fig. 5) does not have the same "memory" of early forcing as the GWP, so that the model spread decreases substantially after about 30 years. The separate effects of a long methane lifetime and a large short-lived ozone forcing can be more clearly seen here for UM-CAM-v01 (red line), which has a very negative minimum GTP value of less than −200, several years after the other ensemble members. The largest uncertainty in the GTP is also around 20 years, when both the short-lived ozone and the methane and long-lived ozone RF are important. Again, the GTP EN substantially overestimates the uncertainty between 10 and 30 years. At times greater than about 35 years, however, the GTP EN begins to agree better with the true GTP. The GTP EN may even slightly underestimate the uncertainty at these longer times due to the slightly smaller methane RF estimate calculated in Sect. 4.
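The different "memory" of the two metrics can be illustrated for a forcing that has fully decayed before the time horizon. The following Python sketch assumes a one-box climate model with a single response timescale; this is not the climate response function used in this study, and the `sensitivity` and `timescale` values are hypothetical. Under that assumption the time-integrated forcing (absolute GWP) stays constant with horizon, whereas the end-point temperature response (absolute GTP) decays.

```python
import math

def agwp_short_lived(integrated_rf, horizon):
    """Absolute GWP of a forcing that has fully decayed before `horizon`.

    The time integral of the forcing is already complete, so the result
    does not depend on the horizon.
    """
    return integrated_rf

def agtp_short_lived(integrated_rf, horizon, sensitivity=0.8, timescale=10.0):
    """Temperature change at `horizon` after an impulse of forcing.

    Uses a one-box impulse response (sensitivity / timescale) * exp(-t / timescale).
    `sensitivity` (K per W m-2) and `timescale` (years) are assumed values.
    """
    return integrated_rf * (sensitivity / timescale) * math.exp(-horizon / timescale)

integrated_rf = 1.0  # hypothetical time-integrated forcing, arbitrary units
for horizon in (20, 50, 100):
    agwp = agwp_short_lived(integrated_rf, horizon)
    agtp = agtp_short_lived(integrated_rf, horizon)
    print(f"H = {horizon:3d} yr  AGWP = {agwp:.3f}  AGTP = {agtp:.3e}")
```

In this simplified picture, any early spread in short-lived forcing is carried forward unchanged in the absolute GWP but fades from the absolute GTP, which is consistent with the narrowing of the GTP model spread at longer time horizons.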

Discussion and conclusions
This study has investigated the derivation of RF and climate emission metrics (GWP and GTP at various time horizons) for emissions of short-lived climate forcing agents from multi-model assessments, using the results of the HTAP ozone precursor emission experiments as an example. Multi-model means and their associated standard deviations of the ozone perturbations can be used as input to radiative transfer codes, which is clearly more computationally efficient than calculating the radiative forcing for each model individually and averaging the results. Overall, our results indicate that the order of averaging does not have a major impact on the mean values. It does, however, have a larger impact on estimates of the uncertainties.
The global-mean RF from emissions of ozone precursors is only mildly sensitive to using the ensemble-mean input fields, with differences in the mean not exceeding 3 %. However, the standard deviation of the RF differs markedly between the two cases. The true standard deviation (using the RF derived from each model individually) is always smaller than the standard deviation obtained when calculating the RF with the ensemble-mean ozone change. This effect is mostly due to the construction of the input ozone fields overestimating the true ensemble spread. In the case of the long-lived ozone, the RF EN standard deviation is about 30 % larger than the true value. For the more spatially inhomogeneous short-lived ozone, the overestimate varies from 20 % for the VOC EA scenario to 90 % for the NO x EA case.
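A toy calculation can illustrate how the order of averaging affects the mean and the spread differently. The Python sketch below makes several assumptions that are not part of the study: the radiation code is replaced by a linear weighted sum over grid cells (`toy_rf`), the ozone values and weights are invented, and the ensemble-mean input is assumed to be built as a mean field and a mean-plus-one-standard-deviation field, cell by cell.

```python
import statistics

# Hypothetical ozone changes: rows are models, columns are grid cells.
ozone = [
    [1.0, 2.0, 0.5],
    [1.4, 1.6, 0.9],
    [0.8, 2.3, 0.4],
    [1.2, 1.9, 0.7],
]
# Hypothetical non-negative RF efficiency for each grid cell.
weights = [0.5, 0.3, 0.2]

def toy_rf(field):
    """Linear stand-in for a radiation code: weighted sum over grid cells."""
    return sum(w * x for w, x in zip(weights, field))

# Method 1: compute the RF for each model, then take ensemble statistics.
rf_each = [toy_rf(model) for model in ozone]
true_mean = statistics.mean(rf_each)
true_std = statistics.stdev(rf_each)

# Method 2: build mean and mean + 1 sigma fields cell by cell, then compute RF.
cells = list(zip(*ozone))
mean_field = [statistics.mean(c) for c in cells]
plus_field = [statistics.mean(c) + statistics.stdev(c) for c in cells]
en_mean = toy_rf(mean_field)
en_std = toy_rf(plus_field) - en_mean

print(f"mean: per-model = {true_mean:.4f}, ensemble-mean input = {en_mean:.4f}")
print(f"std:  per-model = {true_std:.4f}, ensemble-mean input = {en_std:.4f}")
```

For a linear operator the two means coincide exactly, while the cell-by-cell construction treats every grid cell as deviating in the same direction at once and so cannot give a smaller spread than the per-model calculation. A real radiation code is not exactly linear, which is one reason small differences in the mean can still arise.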
The GWP EN and GTP EN mean values agree well with the true mean, as might be expected from the RF estimates, the difference not exceeding 10 % for VOCs and CO, although it can be somewhat larger (up to 60 % in EA) for the 20 year NO x GWP. This approach may therefore be sufficient for some purposes given the computational saving that may be achieved, particularly with larger ensembles.
For estimates of uncertainty, however, there is substantial disagreement between the two methods. The overestimate of uncertainty associated with the short- and long-lived ozone RF propagates to the climate metrics. These terms are the dominant cause of the increased uncertainty, rather than methane lifetime effects. For all time horizons, the uncertainty in GWP EN is not only substantially larger than that in the true GWP, but also lies outside the range covered by the model ensemble itself. Therefore this approach should not be used when deriving the uncertainty in GWP.
There is in general a similar overestimate of the uncertainty in the GTP at short time horizons due mainly to the short-lived ozone; however, at time horizons greater than about 40 years, the ozone forcing becomes relatively less important to the GTP, and the uncertainty in GTP EN is generally more in line with the true uncertainty estimate.
The Supplement related to this article is available online at doi:10.5194/acp-15-3957-2015-supplement.