Using airborne HIAPER Pole-to-Pole Observations (HIPPO) to evaluate model and remote sensing estimates of atmospheric carbon dioxide

Abstract. In recent years, space-borne observations of atmospheric carbon dioxide (CO2) have been increasingly used in global carbon-cycle studies. To add value, space-borne measurements have to meet stringent accuracy and precision requirements, with the latter being less crucial as it can be improved simply by increasing the sample size. Validation of CO2 column-averaged dry air mole fractions (XCO2) relies heavily on measurements of the Total Carbon Column Observing Network (TCCON). Owing to the sparseness of the network and the requirements imposed on space-based measurements, independent additional validation is highly valuable. Here, we use observations from the HIAPER Pole-to-Pole Observations (HIPPO) flights from 01/2009 through 09/2011 to validate CO2 measurements from satellites (GOSAT, TES, AIRS) and atmo-

→ Yes, multiple GOSAT soundings are used per HIPPO profile and averaged (as stated before: "For the GOSAT comparison, we require more than 5 co-located GOSAT measurement per HIPPO profile."). We changed that sentence to "For the GOSAT comparison, we require at least 5 co-located GOSAT measurement per HIPPO profile, all of which are subsequently averaged before comparison against HIPPO". It was also stated before that "For each match, the standard error in the GOSAT XCO2 average is computed using the standard deviation of all corresponding GOSAT colocations divided by the square root of the number of colocations."
→ Added, thanks.
Typos:
Page 10, line 6: "are usually 162253"?
→ Typical LaTeX typing error (accidentally copying something without noticing it), apologies.
Caption Tab. 1: "of different compared to" -> "of the difference compared to"
→ Done, thanks.
We thank Anonymous Referee #2 for a positive and thoughtful review. In the following, we respond to the Reviewer's comments step by step.
Minor revisions:
p5, l8-10: This enables ... denoted XCO2
I think this statement would be clearer to the reader if a line were added to indicate that an extension is needed above 14 km. Then you can indeed state that this extension is of limited consequence since most of the variability in XCO2 stems from the troposphere, which is covered by the HIPPO profiles.
→ Good point. We changed to: "This enables a comparison of individual sub-columns of air but also of column-averaged mixing ratios of CO$_2$, denoted XCO$_2$, if the profile can be reliably extended above 14\,km. As the troposphere dominates the variability in XCO$_2$, errors induced by extending profiles are expected to be small."
p7, l11-12: As most ... analysis.
Add a line explaining why SCIAMACHY does not provide data over oceans.
→ Added "…because it lacks a dedicated Glint measurement mode" and explained it better later as well, as requested by Rev. #1.
p9, l9-11: Validation ... (Olsen and Licata, 2014).
If Olsen and Licata have already compared IR/MW L2 and IR-Only L2 against HIPPO, then I would expect a sentence explaining how the current study differs from and/or extends the cited paper.
→ We rephrased and extended that sentence to reflect the main differences (using models to fill up the profile):
Olsen and Licata (2014) compare the IR/MW based and IR-Only based CO2 retrievals over the globe for 2010-2011 and for collocations with the deep-dip HIPPO-2, HIPPO-3, HIPPO-4 and HIPPO-5 profiles. Their global analysis reveals that the zonal monthly average difference rarely exceeds 0.5 ppm save at the high northern latitudes in January and October, where fluctuations resulting from small number statistics dominate. Their analysis against HIPPO employs only the deep-dip measured profiles, i.e. those in which the aircraft reached the 190 hPa pressure level, to ensure good in situ measurement coverage of the AIRS sensitivity profile and to minimize the error introduced by their simple approximation of extending the aircraft profile into the stratosphere by replicating the highest altitude measurement. During the HIPPO-2 and HIPPO-3 campaigns, the AMSU channel 5 noise figure was acceptable, whereas during the HIPPO-4 and HIPPO-5 campaigns it degraded rapidly. For all campaigns, the two sets of collocations, averaging AIRS retrievals within ±24 hours and 500 km of the aircraft profile, exhibit the same bias and RMS to within 1 ppm for |lat| ≤ 60°. The current study extends the in situ measurements to higher altitude by means of CarbonTracker and MACC model output, thereby allowing use of all HIPPO profiles rather than only the deep-dip profiles. Our results are statistically consistent with the latitude-dependent biases reported by Olsen and Licata (2014) and give a more detailed view of the scatter as a function of latitude.
p10, l14-23: Figure 4 ... potential biases.
HIPPO 3 is nicely explained in this paragraph, but HIPPO 5 is depicted in the Figure but not mentioned. Any comment that the authors can make on the MACC and CT differences/similarities?
→ Added "In HIPPO 5, at the end of the growing season, the situation is reversed as the profile slopes change sign after the large CO$_2$ uptake during summer." and "For HIPPO 5, the deviations for CT2013B are somewhat smaller but it can be seen that most models suffer from these potential biases if large vertical gradients exist."
p11, l2-10: Here, we look ... in the future.
This paragraph is mostly about measurements and campaigns that are not treated in the paper. I understand why the authors like to mention this, but maybe the conclusion, which includes a future outlook mentioning OCO-2, is the better spot for this.
→ This is indeed better; we moved this to the Conclusions.
p11, l11-19: For the comparison ... were the truth).
This is my strongest comment on the paper: since the requirements on XCO2 are so stringent, it matters for the comparisons in this paper how exactly 1) the HIPPO profiles are extended, 2) the averaging kernel is applied, and 3) the null-space is attributed. I would recommend incorporating a small section/paragraph explaining the mathematical details. Questions that come to mind: Is the model information just attached to the HIPPO profile? If a jump appeared in such a profile, how would it be treated? Is the smoothed (extended) HIPPO profile compared to the GOSAT profile without null-space contribution, or is there also a null-space contribution to the smoothed HIPPO profile? If the latter, which reference is used? The same as in the GOSAT retrievals, or the model?
→ This is a good point, even though we prefer to keep this short in the paper. Re 1): The HIPPO profiles are extended with the model data before applying the averaging kernel correction. Re 2): The AK-corrected HIPPO values are computed as xa + A(xt - xa) with the a priori profile xa and the "true" profile xt (HIPPO + model). For GOSAT, the column averaging kernel was used; for TES and AIRS, the averaging kernel for the respective retrieval layer. We have not tested the impact of a jump in a profile; in the manuscript, a simple profile extension was performed without testing smoothness. In most cases, the impact should be relatively small. The null-space contribution in GOSAT comparisons should be small as the column averaging kernels are relatively large throughout the entire column. In general, HIPPO data has always been filled in with model data, not satellite priors. We added:
For GOSAT: "For the HIPPO comparison against GOSAT data, we take the instrument sensitivity into account by applying the averaging kernel to the difference of the true profile (using the model-extended HIPPO dataset as truth) and the respective a priori profile. We perform this correction using both model extensions independently and then use the average of the two."
For TES: "For the comparison with TES, we use the 510\,hPa retrieval layer and apply averaging kernel corrections using model-extended HIPPO data as {\em truth}, using both models independently and averaging results after averaging kernel correction."
For AIRS: "For the comparison with AIRS (Fig.\,\ref{fig:HIPPO_AIRS}), the sensitivity maximum varies around 300\,hPa and we apply the averaging kernels similarly to TES."
We hope this will clarify the issue.
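The correction quoted in this response, xa + A(xt - xa), can be sketched as follows. This is a minimal illustration and not the code used in the study: the function names, the list-based profile layout and the square kernel are assumptions made for the example. Only the formula and the averaging of the two model extensions come from the text.

```python
def apply_averaging_kernel(x_true, x_prior, kernel):
    """Return x_prior + A (x_true - x_prior) for a square averaging kernel A.

    x_true  : "true" profile (HIPPO extended with model data), one value per level
    x_prior : a priori profile of the retrieval on the same levels
    kernel  : averaging kernel A as a list of rows (hypothetical layout)
    """
    n = len(x_true)
    if len(x_prior) != n or len(kernel) != n or any(len(row) != n for row in kernel):
        raise ValueError("profiles and kernel must share the same number of levels")
    diff = [t - p for t, p in zip(x_true, x_prior)]
    return [p + sum(a * d for a, d in zip(row, diff))
            for p, row in zip(x_prior, kernel)]


def mean_of_model_extensions(x_true_ct, x_true_macc, x_prior, kernel):
    """Apply the correction with each model extension, then average the results."""
    ct = apply_averaging_kernel(x_true_ct, x_prior, kernel)
    macc = apply_averaging_kernel(x_true_macc, x_prior, kernel)
    return [0.5 * (a + b) for a, b in zip(ct, macc)]
```

Two limiting cases are a useful sanity check: an identity kernel returns the true profile unchanged, and a zero kernel returns the a priori profile.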
p11, l22-24: Even after ... for MACC.
Please refer to Figs 5 and 6.
→ Done.
p11, l22: Even after normalization
It is clear how the HIPPO data is corrected, but how is the other data corrected? With the HIPPO value, or with the average value of the particular model?
→ With the HIPPO value. We added a sentence: "For each campaign, we also normalize all data with the respective campaign average of the HIPPO dataset."
p13, l23: lower left quadrant
Maybe the authors would like to note that these points are also outliers in the CT comparison. Not as strong as in the case of MACC, but still in the same quadrant, which may be an indication that the transport errors in both models are roughly equal and/or the GFED data is somewhat off.
→ We mentioned that "both models" show that feature.
p24, Fig 3: There are some strong excursions in the HIPPO profiles close to the surface; any explanation for these?
→ These might be caused by dips close to the surface with HIPPO, potentially coming from the land data. This should not affect XCO2 much as it only affects a small sub-column.
HIPPO-1, 3, and 4 (and possibly 5), the differences between HIPPO and MACC or CT differ significantly for > 70N. Any explanation for this behaviour?
→ We agree, there seem to be substantial differences, but we do not yet have an explanation for this and would not like to speculate too much.
Please reposition the legend box; CT-HIPPO 5 is barely visible.
→ Done.
p26-p28, Fig 5-
spheric CHartography SCIAMACHY (e.g. Schneising et al., 2014) and the Greenhouse Gases Observing Satellite GOSAT (Lindqvist et al., 2015) have been shown to reproduce the seasonal cycle as well as the secular trend of total column CO2 abundances reasonably well (Kulawik et al., 2015). However, accuracy requirements are very stringent (Miller et al., 2007), requiring large-scale biases of less than 0.5-1 ppm, i.e. less than 0.3% of the global background concentration. This is one of the most challenging remote sensing measurements from space, as we want to reproduce not only known average seasonal cycles and trends but also small inter-annual deviations, resolved to subcontinental scales. There have been successes in doing so (e.g. Basu et al. (2014); Guerlet et al. (2013)), but controversies regarding overall retrieval accuracy on the global scale remain (Chevallier, 2015) and can be neither fully refuted nor confirmed with validations against the Total Carbon Column Observing Network (TCCON) (e.g. Kulawik et al., 2015). In addition, total uncertainties might be a mix of measurement and modeling biases (Houweling et al., 2015), for which uncertainties in vertical transport can play a crucial role (Stephens et al., 2007; Deng et al., 2015).
In this manuscript, we use the term accuracy to refer to systematic errors that remain after infinite averaging and can vary in space and time. Globally constant systematic errors are easy to correct for but those with spatio-temporal dependencies can have a potentially large impact on flux inversions.
Given the importance of the underlying scientific questions regarding the global carbon cycle and the challenges posed by both the remote sensing and the atmospheric inversion, every additional independent validation beyond ground-based data can be crucial. Here, we use measurements from the HIAPER Pole-to-Pole Observations (HIPPO) programme (Wofsy, 2011) to evaluate both atmospheric models and remotely sensed estimates of atmospheric CO2.
2 Data description

For the comparison of HIPPO against model data as well as for a more robust comparison of HIPPO against total column satellite CO2 observations, we use two independent atmospheric models that both provide 4D CO2 fields (space and time) that are consistent with in-situ measurements of atmospheric CO2. The main differences between them are the inversion scheme and the underlying transport model. In addition, both models were used to extend individual HIPPO profiles from the highest flight altitude to the top of the atmosphere when comparing to total column estimates from the satellite.
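The profile extension described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure used in the study: profiles are represented as hypothetical lists of (pressure in hPa, CO2 in ppm) pairs, model levels are simply appended above the aircraft ceiling without any smoothing, and the column average uses a plain trapezoidal pressure weighting that ignores water vapour and other refinements.

```python
def extend_profile(hippo, model):
    """Append model levels above the aircraft ceiling (lower pressure) to the
    measured profile. Both inputs are lists of (pressure_hPa, co2_ppm)."""
    ceiling = min(p for p, _ in hippo)
    above = [(p, c) for p, c in model if p < ceiling]
    # Sort from the surface (high pressure) to the top of the atmosphere.
    return sorted(hippo + above, key=lambda level: -level[0])


def column_average(profile):
    """Pressure-weighted column mean via trapezoidal integration (assumption:
    dry-air and gravity corrections are neglected in this sketch)."""
    levels = sorted(profile, key=lambda level: -level[0])
    if len(levels) < 2:
        raise ValueError("need at least two levels to integrate")
    weighted_sum = 0.0
    total_dp = 0.0
    for (p0, c0), (p1, c1) in zip(levels, levels[1:]):
        dp = p0 - p1
        weighted_sum += 0.5 * (c0 + c1) * dp
        total_dp += dp
    return weighted_sum / total_dp
```

Running the extension once with each model and comparing the two column averages gives a rough measure of how much the result depends on the unmeasured upper part of the profile.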

CarbonTracker CT2013B
CarbonTracker (Peters et al. (2007) with updates documented at http://carbontracker.noaa.gov) is a CO2 modeling system developed by the NOAA Earth System Research Laboratory. CarbonTracker (CT) estimates surface emissions of carbon dioxide by assimilating in situ data from NOAA observational programs, monitoring stations operated by Environment Canada, and numerous other international partners using an ensemble Kalman filter optimization scheme built around the TM5 atmospheric transport model (Krol et al. (2005); http://www.phys.uu.nl/~tm5/). Here we use the latest release of CarbonTracker, CT2013B, which provides CO2 mole fraction fields globally from 2000-2012. In this study, we interpolate modeled CO2 mole fractions to the times and locations of individual HIPPO observations.

MACC v13r1
Monitoring Atmospheric Composition and Climate (MACC, http://www.copernicus-atmosphere.eu/) is the European Union-funded project responsible for the development of the pre-operational Copernicus atmosphere monitoring service. Its CO2 atmospheric inversion product relies on a variational Bayesian formulation, developed by LSCE (Le Laboratoire des Sciences du Climat et de l'Environnement), that estimates 8-day grid-point daytime/nighttime CO2 fluxes and the grid-point total columns of CO2 at the initial time step of the inversion window. It uses the global tracer transport model LMDZ (Hourdin et al., 2006), driven by the wind analyses from the ECMWF. [...] Tables S1 and S2 of Chevallier 2015). For this study, the model simulation has been interpolated to the time and location of the individual observations using the subgrid parametrization of the LMDZ advection scheme in the 3 dimensions of space (Hourdin and Armengaud, 1999). For the sake of brevity, we refer to MACC version 13r1 simply as MACC.

AIRS (v5)
The AIRS Version 5 (V5) tropospheric CO2 product is a retrieval of the weighted partial-column dry volume mixing ratio characterizing the mid- to upper-tropospheric CO2 concentration. The product is derived by the technique of Vanishing Partial Derivatives (VPD) described in Chahine et al. (2005) and is reported at a nominal nadir resolution of 90 km x 90 km over the globe over the latitude range 60S to 90N and time span September 2002 to present. The VPD method assumes a CO2 profile that is a linearly time-dependent global average constant volume mixing ratio throughout the atmosphere. Using that prior profile, the VPD derives CO2 by shifting the CO2, T, q and O3 profiles and minimizing the residuals between the cloud-cleared radiances and those resulting from the forward calculation for channel subsets selected to avoid contamination by surface emission (except in regions of high topography). Further, it localizes the maximum sensitivity to variations of CO2 concentration to the pressure regime spanning 300 hPa to 700 hPa.

if large vertical gradients exist. Overall, both CT2013B and MACC show good agreement with HIPPO over the oceans. In some cases, MACC seems to compare somewhat better, which might be related to the longer inversion window of MACC, which can have an impact over remote areas such as the Pacific Ocean. However, this statement cannot be generalized as it may be specific to remote areas with low measurement density and be very different elsewhere.
These numbers should not be used to compare the models against each other because, as evident in Fig. 2, there are regions where either one or the other model is in better agreement with HIPPO. In conclusion, one can state that most model mismatches are below 1 ppm in remote areas such as the oceans and can reach 2-3 ppm over the continents, with potentially higher values in under-sampled areas with high CO2 uptake such as the US corn belt. In addition, it should be mentioned that both models ingest a multitude of CO2 measurements at US ground-based stations, and areas further away might be less well modeled. However, the excellent agreement provides a benchmark against which satellite retrievals have to be measured.
Left: Scatterplot of normalized (with campaign average) XCO2 computed from individual HIPPO profiles (x-axis) against corresponding MACC data. Right: Difference plot of XCO2 against latitude. Campaigns as well as North and Southbound tracks are color-coded.

GOSAT
The comparison of GOSAT satellite data against HIPPO is somewhat more complicated because there is not necessarily a matching GOSAT measurement with each HIPPO profile.

For coincidence criteria, we follow exactly Kulawik et al. (2015), based on the dynamic co-location criteria detailed in Wunch et al. (2011); Keppel-Aleks et al. (2011). In addition, we require that the difference between CT2013B sampled at the HIPPO location and at the actual GOSAT location is less than 0.5 ppm, thereby bounding the error introduced by the spatial mismatch between HIPPO and the respective GOSAT soundings. For each match, the standard error in the GOSAT XCO2 average is computed using the standard deviation of all corresponding GOSAT colocations divided by the square root of the number of colocations.
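The averaging and standard-error computation for one HIPPO profile can be sketched as below. The sketch assumes that the dynamic co-location criteria have already been applied upstream and only illustrates the additional 0.5 ppm CarbonTracker filter, a minimum of 5 soundings per profile, and the standard error. The function name and the tuple layout are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not specify which estimator is meant.

```python
import math
import statistics


def average_colocated_gosat(soundings, ct_at_hippo, min_count=5, max_ct_diff=0.5):
    """Average the GOSAT soundings matched to one HIPPO profile.

    soundings   : list of (xco2_ppm, ct_at_gosat_ppm) tuples (hypothetical layout)
    ct_at_hippo : CT2013B XCO2 sampled at the HIPPO location, in ppm
    Returns (mean, standard_error, n), or None if too few soundings remain.
    """
    kept = [xco2 for xco2, ct_at_gosat in soundings
            if abs(ct_at_gosat - ct_at_hippo) < max_ct_diff]
    if len(kept) < min_count:
        return None
    mean = statistics.mean(kept)
    # Sample standard deviation divided by sqrt(n); the estimator is an assumption.
    std_err = statistics.stdev(kept) / math.sqrt(len(kept))
    return mean, std_err, len(kept)
```

Returning None for under-sampled profiles keeps them out of the comparison explicitly instead of reporting an average with a meaningless error bar.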
In Figure 7, the scatterplot of HIPPO vs. GOSAT is depicted. It is obvious that the data density is far lower than for the models because a) HIPPO 1 is not overlapping in time and b) only a subset of HIPPO profiles is matched with enough co-located GOSAT soundings. This gives rise to a reduced dynamic range in XCO2, from about -1.5 to 3 ppm difference to the campaign average. However, both slope and r² are also in excellent agreement with HIPPO and only very few points exceed 1 ppm difference. Those that are < 1 ppm are also associated with larger uncertainties induced by model extrapolation, as seen in the larger error bars for HIPPO in the left panel (esp. for HIPPO 2S). The right panel shows the discrepancies for the models as well, just for the subset that could be compared against GOSAT and using the model sampled at the GOSAT locations.
One can see that it is hard to make a clear statement on whether GOSAT or the models compare better with HIPPO. Figure 8 shows this comparison in more detail, plotting model-HIPPO differences on the x-axis and GOSAT-HIPPO differences on the y-axis. As before, the error bar for GOSAT is derived as the standard error of the mean, and the model error bar from the variability of HIPPO XCO2 when using the two different models to extrapolate to the top of the atmosphere (the average of the two is defined as HIPPO XCO2). The center box spans the range from -0.5 to 0.5 ppm, a strict requirement for systematic biases (GHG-CCI, 2014). The green and red shaded areas indicate regions where either the GOSAT data meet the 0.5 ppm requirement but the models do not (green) or vice versa (red). Given the small number of samples, it is premature to draw strong conclusions, but it appears that somewhat more points lie in the green area. It also has to be pointed out that purely random measurement noise also contributes to the scatter in GOSAT.
C. Frankenberg: HIPPO model-satellite comparison
Same as left but using MACC instead of CT2013B. The inner box represents the area where both model and GOSAT are within 0.5 ppm compared to HIPPO, which corresponds to the very stringent accuracy requirement. The green and red shaded areas correspond to regions where the satellite deviates less than the models and is within 0.5 ppm (green) as well as where the models deviate less than GOSAT (red). The white cells on the outer edges indicate areas where both deviate more than 0.5 ppm overall.
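The regions of this comparison plot can be expressed as a simple classification of each matched point. The sketch below follows the definition given in the running text (requirement met by GOSAT only, by the model only, by both, or by neither). The function names and region labels are hypothetical, and counting a difference of exactly 0.5 ppm as meeting the requirement is an assumption.

```python
from collections import Counter


def classify_point(model_minus_hippo, gosat_minus_hippo, limit=0.5):
    """Assign one matched point to a region of the comparison plot."""
    model_ok = abs(model_minus_hippo) <= limit
    gosat_ok = abs(gosat_minus_hippo) <= limit
    if model_ok and gosat_ok:
        return "inner box"   # both meet the requirement
    if gosat_ok:
        return "green"       # only GOSAT meets the requirement
    if model_ok:
        return "red"         # only the model meets the requirement
    return "white"           # neither meets the requirement


def count_regions(points, limit=0.5):
    """Count points per region; points is a list of (model-HIPPO, GOSAT-HIPPO) pairs."""
    return Counter(classify_point(m, g, limit) for m, g in points)
```

Comparing the "green" and "red" counts puts a number on the visual impression described above, although with so few samples the difference should be read with caution.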
For MACC, there is even a noticeable correlation between MACC-HIPPO and GOSAT-HIPPO with an r² of 0.26. This may point to either small-scale features captured by HIPPO and missed by both GOSAT and the models, or small systematic variability between the exact HIPPO and GOSAT co-locations.

Most of the samples causing the high r² are located in the lower left quadrant, underestimated by GOSAT and both models and apparently all within HIPPO 2S, located between 40S and 20S. Figure 9 depicts the HIPPO 2S campaign in more detail, showing the exact flight patterns and the differences with respect to MACC (MACC-HIPPO) at each measurement point (upper panel [...] from the GFED (Randerson et al., 2013) emission database (which is used by both models) or transport errors in the models. For GOSAT, the mismatch is most likely caused by too lenient coincidence criteria, missing most of the biomass burning plume.
Given the larger standard error in TES data, differences may be purely noise-driven and not necessarily a hint at large-scale biases, even though the clustering of positive anomalies, esp. in HIPPO 3 at higher latitudes, is apparent. As evident from Fig. 3, there are stronger vertical gradients at 15-45N during HIPPO 3 because they are close to the peak CO2 value caused by wintertime respiration. This can cause potential mismatches as gradients can be strong and co-location criteria might have to be stricter. In addition, the HIPPO profiles are extended by models to the top of the atmosphere and are thus not entirely model-independent.

AIRS (~300 hPa)
For the comparison with AIRS (Fig. 11), the sensitivity maximum varies around 300 hPa and we apply the averaging kernels similarly to TES. Owing to the large data density and high single-measurement noise of AIRS, we use a minimum of 50 colocations for a comparison, still leaving many more data points than for the GOSAT and TES comparison. As coincidence criteria, we use data within 5 degrees latitude and longitude and 24 hours time difference. Even though the correlations are significant, a bias dependence on latitude can be observed, which hampers incorporation of AIRS data into flux inversions. The reason for these biases is currently unknown but may be related to changes in peak sensitivity altitude as a function of latitude. A full characterization of averaging kernels per sounding would alleviate these concerns. Given the observed larger model-HIPPO CO2 differences at higher altitudes, a fully characterized AIRS CO2 product could be worthwhile for the flux community. However, requirements for systematic biases in partial columns are even stricter than for the total column (Chevallier, 2015).
Figure 11. Left: Scatterplot of normalized (with campaign average) CO2 from individual HIPPO profiles (x-axis) against corresponding AIRS data. Right: Difference plot of CO2 against latitude. Campaigns as well as North and Southbound tracks are color-coded, model-HIPPO differences are plotted as well. Please refer to Fig. 7 for a detailed legend.
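The AIRS coincidence criteria stated in this section (5 degrees in latitude and longitude, 24 hours, at least 50 colocations) can be sketched as follows. The dictionary layout and function names are hypothetical, the longitude wrap-around at the date line is an assumption not discussed in the text, and averaging the matched soundings is likewise assumed here by analogy with the GOSAT comparison.

```python
from datetime import datetime, timedelta


def is_coincident(sounding, profile, max_deg=5.0, max_hours=24.0):
    """Check the box-and-time criteria for one AIRS sounding.

    sounding, profile: dicts with 'lat', 'lon' (degrees) and 'time' (datetime).
    """
    dlat = abs(sounding["lat"] - profile["lat"])
    dlon = abs(sounding["lon"] - profile["lon"]) % 360.0
    dlon = min(dlon, 360.0 - dlon)  # shortest way around the date line (assumption)
    dt = abs(sounding["time"] - profile["time"])
    return dlat <= max_deg and dlon <= max_deg and dt <= timedelta(hours=max_hours)


def airs_match(soundings, profile, min_count=50):
    """Return (mean CO2, n) of the coincident soundings, or None if fewer than
    min_count soundings match. Each sounding also carries a 'co2' value in ppm."""
    matched = [s["co2"] for s in soundings if is_coincident(s, profile)]
    if len(matched) < min_count:
        return None
    return sum(matched) / len(matched), len(matched)
```

A fixed latitude/longitude box is easy to apply but covers less ground towards the poles. The sketch keeps the box as stated and does not correct for that.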

In this study, we compared atmospheric models as well as satellite data of CO2 against HIPPO profiles. Table 1 provides a high-level overview of the derived statistics. Both atmospheric models compare very similarly, both showing a very high correlation with respect to HIPPO, even after subtracting the campaign average XCO2, as is done throughout all comparisons. The largest discrepancies are found near 300 hPa at higher latitudes during peak wintertime CO2 accumulation as well as the summer uptake period. These may be related to steep vertical gradients poorly resolved by the models. In addition, a biomass burning event in the southern hemisphere seems to have been underestimated by the models, causing discrepancies of around 1 ppm.
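The campaign normalization used throughout the comparisons, i.e. subtracting the HIPPO campaign average from every dataset, can be sketched as follows. The record layout of (campaign, XCO2) pairs and the function names are hypothetical and only serve this illustration.

```python
from collections import defaultdict


def campaign_means(hippo_records):
    """Mean HIPPO XCO2 per campaign; hippo_records is a list of (campaign, xco2)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for campaign, xco2 in hippo_records:
        sums[campaign] += xco2
        counts[campaign] += 1
    return {campaign: sums[campaign] / counts[campaign] for campaign in sums}


def normalize(records, means):
    """Subtract the HIPPO campaign mean from each record of any dataset
    (HIPPO itself, a model, or a satellite product)."""
    return [(campaign, xco2 - means[campaign]) for campaign, xco2 in records]
```

Because the same HIPPO mean is subtracted from every dataset, the differences between datasets within a campaign are unchanged, while the offset between campaigns caused by the secular trend is removed.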
In general, GOSAT compares very well to HIPPO, followed by TES and AIRS. For TES, most deviations can be explained by pure measurement noise, but AIRS appears to exhibit some latitudinal biases that would need to be accounted for if used for source-inversion studies. On the other hand, systematic model transport errors that can affect source inversions (Deng et al., 2015) were confirmed here for both atmospheric models used. Despite initial scepticism towards using remotely sensed CO2 data for global carbon cycle inversion, we are now reaching a state where potential systematic errors in both remote sensing and atmospheric modeling can play an equally crucial part. Innovative methods to characterize and ideally minimize both of these error sources will be needed in the future. One option is to apply flux inversion schemes that co-retrieve systematic biases alongside fluxes, such as in Bergamaschi et al. (2007), using prior physical insight into potential systematic biases, such as aerosol interference, land/ocean biases or air mass factors.