Interactive comment on “ Global temperature response to the major volcanic eruptions in multiple reanalysis datasets ”

Please outline what is the major purpose of the study.

The major purpose of this study is to investigate the global temperature response to volcanic eruptions using all available reanalysis datasets, highlighting common and differing response signals among older and newer reanalysis datasets. An atmospheric reanalysis system provides a best estimate of the true atmospheric state and is an operational analysis system at a particular time (e.g., 1995 for the NCEP-1 system and 2009 for the JRA-55 system; see Table 1, the fourth column, of Mitchell et al. (2015)). The operational analysis system has been continuously improved at each reanalysis centre, with the main motivation to improve tropospheric weather prediction (at least for the ECMWF, JMA, and NOAA). The consistencies and differences among different reanalysis datasets provide a measure of the confidence and uncertainty in our current understanding of the volcanic response. Therefore, the results of this intercomparison study may be useful for validating climate model responses to volcanic forcing and for assessing proposed geoengineering by stratospheric aerosol injection. Finally, the intercomparison results of this paper can also link studies using only a single reanalysis dataset to other studies using a different reanalysis dataset. We have added these points to the Abstract of the revised manuscript.

P 13318, L 17-20: Did you make any conclusions regarding data quality and reanalysis improvements?

The four recent reanalysis datasets, i.e., JRA-55, MERRA, ERA-Interim, and NCEP-CFSR, showed similar signals for the El Chichon and Mount Pinatubo eruptions in the 1979-2009 analysis. Thus, these four reanalysis datasets are equally good for studies of the response to these two eruptions. The NCEP-1, NCEP-2, and JRA-25 showed different tropical stratospheric signals, particularly for the El Chichon eruption; the use of older analysis systems may be the cause of these different signals. For the JRA-25, the known stratospheric cold bias in the radiative scheme of the forecast model should be part of the reason. The 20CR has no QBO because upper-air observations were not assimilated, and it is thus not suitable for a study of this kind; however, the 20CR applied volcanic aerosols in the forecast model and showed volcanic signals at least qualitatively. For the Mount Agung eruption in the 1958-2001 analysis, three of the four reanalysis datasets analysed, i.e., JRA-55, ERA-40, and NCEP-1 (but not 20CR), showed similar stratospheric warming signals with somewhat varied magnitude and spatial extent. The ERA-40 showed unexplained warming signals in the mid-1970s, which are probably not realistic. Considering the discussion of the 1979-2009 analysis above, and because it is the only dataset that employs the most recent reanalysis system, JRA-55 would currently be best for studies of the response to the Mount Agung eruption. We have added these points to the Conclusions of the revised manuscript (the last paragraph).

P 13318, L 27-29: I disagree, there are multiple examples of using reanalysis for comparison with model simulations.

We have rephrased this sentence as: "Investigation of climatic response to individual volcanic eruptions using multiple reanalysis datasets for the purpose of comparison and evaluation of reanalysis datasets is rather limited." There are several studies using one or two reanalysis datasets for comparison with model simulations, but in most cases these are NCEP-1 and/or ERA-40 (e.g., Eyring et al., 2010; Karpechko et al., 2010). More recent studies used ERA-Interim (e.g., Arfeuille et al., 2013; Toohey et al., 2014). However, to the knowledge of the authors, the most recent reanalysis datasets, such as JRA-55, MERRA, and NCEP-CFSR, have not been used for volcanic studies (except for our previous study, Mitchell et al. (2015)).

Arfeuille, F., Luo, B. P., Heckendorn, P., Weisenstein, D., Sheng, J. X., Rozanov, E., Schraner, M., Brönnimann, S., Thomason, L. W., and Peter, T.: Modeling the stratospheric warming following the Mt. Pinatubo eruption: uncertainties in aerosol extinctions, Atmos. Chem. Phys., 13, 11221–11234, doi:10.5194/acp-13-11221-2013, 2013.
Eyring, V., Shepherd, T. G., and Waugh, D. W. (eds.): SPARC CCMVal report on the evaluation of chemistry-climate models, SPARC Rep. 5, World Meteorol. Soc., Geneva, Switzerland, 2010.
Karpechko, A. Yu., Gillett, N. P., Dall'Amico, M., and Gray, L. J.: Southern Hemisphere atmospheric circulation response to the El Chichón and Pinatubo eruptions in coupled climate models, Q. J. R. Meteorol. Soc., 136, 1813–1822, doi:10.1002/qj.683, 2010.
Toohey, M., Krüger, K., Bittner, M., Timmreck, C., and Schmidt, H.: The impact of volcanic aerosol on the Northern Hemisphere stratospheric polar vortex: mechanisms and sensitivity to forcing structure, Atmos. Chem. Phys., 14, 13063–13079, doi:10.5194/acp-14-13063-2014, 2014.

P 13320, L 1-5: Please discuss your corresponding findings in the conclusion section.

We assume that you are referring to P 13319, L 1-5. We have done this; please see our answer to your question at P 13318, L 17-20. In the revised manuscript, we have clearly described this in the Conclusions section.

P 13321, L 1-21: There are a number of other indices, e.g., NAO, Indian Monsoon; why are they not included? Could you comment on this?

This is because we focused on climate indices that represent forcing, not response, and that are relevant to the zonal mean response rather than the regional response. In an early phase of this study, we tested including the Arctic Oscillation (AO) index, the Antarctic Oscillation (AAO) index, and the Indian Ocean dipole mode index (Saji et al., 1999), but the obtained volcanic response was quite similar to the one without these indices. Also, there was discussion among the coauthors that the AO and AAO should be considered response, not forcing. We did not consider the Indian Monsoon index, but we think that it is more related to the regional response than to the zonal mean response. We have added a note on this in the revised manuscript (Section 2, the 3rd paragraph).

Saji, N. H., Goswami, B. N., Vinayachandran, P. N., and Yamagata, T.: A dipole mode in the tropical Indian Ocean, Nature, 401, 360-363, 1999.

P 13322, L 9: It is really not clear and has to be explained.

This means that the regions evaluated as statistically significant are smaller than those in Mitchell et al. (2015), particularly for the solar and ENSO signals in the tropical lower stratosphere, but the general features are quite similar to those in Mitchell et al. In the revised manuscript, we have added this explanation (Section 3.1, the first paragraph).

P 13323, L 15: In linear approximation, bias should not affect a response to external forcing.

The cold bias of the forecast model was "not fully" corrected by the observations. This means that, depending on the situation (e.g., during large volcanic eruptions or during a specific period of time), the correction by the observations might be worse (or better) than in other periods. It is possible that the bias was not constant over time, in particular when unusual, volcanically affected temperature measurements came into the JRA-25 system. So, we think this could be part of the reason. We have added this explanation (Section 3.1, the second paragraph). However, this is only speculation, and thus we have rephrased this sentence as follows: "may" has been changed to "might", and "due to" has been changed to "related to".

P 13324, L 18-19: Repetition

We have inserted the word "again."

P 13326, L 15-25: It is most important that the Agung period is not covered by satellite observations. Could you please comment on this?

The weakness of the radiosonde dataset in comparison with the microwave and infrared sounders on operational satellites is its inhomogeneous spatial distribution and limited height range. Radiosonde stations are very limited in the Southern Hemisphere, and the typical balloon burst altitude is ~30 hPa (e.g., Seidel et al., 2011, their Figures 1 and 2). Also, only four reanalysis datasets are available for studies of the Mount Agung eruption, much fewer than the nine available for studies of the Mount Pinatubo and El Chichon eruptions. Therefore, the uncertainty is greater for the Agung signals than for the Mount Pinatubo and El Chichon signals, although we cannot quantify it easily. We have added these points in the revised manuscript (Section 3.2, the 4th paragraph).

Seidel, D. J., Gillett, N. P., Lanzante, J. R., Shine, K. P., and Thorne, P. W.: Stratospheric temperature trends: our evolving understanding, WIREs Clim. Change, 2, 592-616, doi:10.1002/wcc.125, 2011.

P 13326, L 27-28: Why is the surface temperature response good then?

We did not say anything about the surface temperature response. To clarify, we have rephrased this sentence as: "The modelled aerosol loading was probably too weak to simulate the lower stratospheric warming signals."

P 13327, L 9-10: Disagree, the El Chichon plume was mostly in the northern hemisphere.

We have rephrased the sentence as: "The aerosol loading due to the Mount Agung eruption extended primarily to the Southern Hemisphere, that due to the El Chichón eruption was very large in the tropics and extended primarily to the Northern Hemisphere, and that due to the Mount Pinatubo eruption was very large in the tropics and extended to both hemispheres." and moved it to the last paragraph of Section 3.2.

P 13328, L 15: Could you compare the optical depth of the small eruptions with that of Mt. Pinatubo?

We have completely removed the discussion of the temperature response to the three smaller-scale eruptions. See also our response to the comments by Reviewer #3.

P 13328, L 28: There are no physical reasons for small eruptions to produce a qualitatively different response. It is probably an artifact of your signal-extracting procedure.

Aircraft temperature observations were also assimilated in most reanalysis datasets (except for JRA-25 and 20CR), but their impacts are limited to the region around 200-300 hPa and mostly to the Northern Hemisphere (see, e.g., the discussion by Rienecker et al. (2011) of their Fig. 16). Also, the ERA-Interim, NCEP-CFSR, and JRA-55 assimilated data from GNSS/GPS radio occultation measurements.

There are three components that differ among reanalysis systems: (1) the detailed "bias correction" (in other words, quality control) methods applied to the original radiosonde and microwave/infrared sounder data before the assimilation, (2) the assimilation scheme, and (3) the forecast model. Therefore, we can say that the main causes of the overall temperature differences among the reanalysis datasets (except for 20CR) are these three factors rather than the choice of original observations, and the same can be said for the temperature response to the volcanic eruptions. The reanalysis system is an operational analysis system at a particular time (see Table 1, the fourth column, of Mitchell et al. (2015)), and the operational analysis system has been continuously improved over time with the main motivation to improve tropospheric weather prediction (at least at the ECMWF, JMA, and NOAA). Therefore, in principle, newer reanalysis datasets are considered better in all three components, and this would explain the differences shown in our study between the older (e.g., NCEP-1, NCEP-2, and JRA-25) and the newer (ERA-Interim, NCEP-CFSR, MERRA, and JRA-55) reanalysis datasets. The smaller differences among the newer reanalysis datasets are also due to differences in these three components.
We have added a paragraph discussing these points in Introduction (the 4th paragraph) of the revised manuscript.
page 13320, line 5: Same question as above. Do all of the reanalyses assimilate the datasets?
Please see above.
page 13320, line 7: It is mentioned here and elsewhere in the text that 20CR uses annual average volcanic aerosols. Is there a reference how this is done? It is not clear in Compo et al., 2011 or Saha et al., 2010. Could this affect your analysis applied to this reanalysis? For example, if we assume that an annual average is for the period January to December of a given year, then for Pinatubo the model erupted in January rather than June of 1991. Given the method to determine the volcanic signal (Page 13321, line 25), won't the pre-eruption period be affected?
This is a very good point.
We communicated with Gilbert Compo and Craig Long again and found that the descriptions in Compo et al. (2011) need to be revised. The following is the correct description, which has been included in the revised manuscript (Section 2, the first paragraph): The atmospheric forecast model of the 20CR v2 is nearly the same as that used in the NCEP-CFSR but with a lower resolution. For both reanalysis datasets, monthly latitudinally varying distributions of volcanic aerosols (averaged over 4 bands, i.e., 90N-45N, 45N-equator, equator-45S, and 45S-90S) were specified based on data from Sato et al. (1993), and a monthly climatological global distribution of aerosol vertical profiles on a 5° grid was specified based on data from Koepke et al. (1997) (G. Compo and C. Long, personal communication).

Response to Referee #3.
Thank you very much for reviewing our manuscript.
Interactive comment on "Global temperature response to the major volcanic eruptions in multiple reanalysis datasets" by M. Fujiwara

The idea behind this study is interesting and worth exploring, but I think that the analysis of the reanalysis datasets should be more detailed. Most of the manuscript is a description of the figures and does not address the reasons for the discrepancies, which makes it impossible to assess which reanalysis system is doing a better job for specific periods.
In general, we found three groups: (1) the newer reanalysis datasets, JRA-55, MERRA, ERA-Interim, and NCEP-CFSR; (2) the older reanalysis datasets, JRA-25, ERA-40, NCEP-1, and NCEP-2; and (3) the 20CR, which assimilated no atmospheric (upper-air) observations. For (1) and (2), the original observations that have the major impact on the reanalysis temperature are common, namely radiosondes and microwave and infrared sounders on several operational satellites. Therefore, the causes of the differences between (1) and (2), within (1), and within (2) should lie not in the original observations assimilated but in the bias correction (i.e., quality control) methods applied to observational data before the assimilation, in the assimilation scheme, and in the forecast model. The newer reanalysis datasets use newer, and thus basically better, assimilation schemes and forecast models, with improved data quality control procedures. Even within the newer reanalysis datasets (1), we found some quantitative differences in the volcanic temperature response. At the moment, the exact causes of these differences are unknown, and thus what we can do is regard these differences as uncertainty information, i.e., the uncertainty of our knowledge of the global temperature response to the major volcanic eruptions.
We have added these points in the revised manuscript (i.e., Abstract, Introduction, and Conclusions).  There is a technical difference for the satellite data assimilation. The NCEP-1 and NCEP-2 assimilated retrieved temperature profiles, while the other reanalysis systems (except the 20CR) directly assimilated radiance data using a radiative transfer model. The radiance assimilation is considered better than the retrieved data assimilation because the retrieval model can be an additional source of uncertainty.
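The idea of treating inter-dataset differences as uncertainty information can be illustrated with a minimal numerical sketch. The warming values below are made up for illustration (they are not the actual responses diagnosed in the paper): the multi-reanalysis mean serves as the best estimate and the spread across datasets as a rough uncertainty measure.

```python
import numpy as np

# Hypothetical lower-stratospheric warming responses (K) diagnosed from
# four reanalysis datasets for the same eruption -- illustrative values only.
responses = {
    "JRA-55": 1.1,
    "MERRA": 1.3,
    "ERA-Interim": 1.2,
    "NCEP-CFSR": 1.0,
}

values = np.array(list(responses.values()))
best_estimate = values.mean()     # multi-reanalysis mean response
uncertainty = values.std(ddof=1)  # inter-dataset spread as the uncertainty

print(f"response = {best_estimate:.2f} +/- {uncertainty:.2f} K")
```

The sample standard deviation (`ddof=1`) is used here because the four datasets are treated as a small ensemble of estimates of the same unknown response.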
In addition, two other types of temperature measurements, aircraft observations and GNSS/GPS radio occultation data, were assimilated in some of the reanalysis systems, as discussed above.

Given the change in temperatures simply due to the inclusion of additional datasets, would it be more appropriate to divide the data record into periods with a specific set of instruments (i.e. no instrument is added/dropped) and perform separate regression analyses for each period?
As explained above, except for the 20CR, there is basically no difference in terms of the original observations assimilated. The year 1979 is the key year, and that is the reason why many reanalysis datasets start in 1979. Considering this fact, we made two separate data analyses, one for the period 1979-2009 and one for the period 1958-2001.

Thank you for pointing this out.
The stratospheric warming for the El Chichon eruption in the NCEP-1 and NCEP-2 is located at 10 hPa, the top boundary of these reanalysis systems, while that in the other reanalysis systems (including the 20CR) is located around 50 hPa. The major differences of the NCEP-1 and NCEP-2 from the other reanalysis systems include the lower model top height (3 hPa), the older forecast model and assimilation scheme (of the 1990s; see Table 1, the fourth column, of Mitchell et al. (2015)), and the use of retrieved temperature data for the assimilation of SSU, MSU, and AMSU-A data. It is possible that these factors are responsible for the different signals of the El Chichon eruption in the NCEP-1 and NCEP-2.
However, this is not the case for the Mount Pinatubo eruption: all the reanalysis systems except the 20CR show a lower stratospheric warming signal centred around 50 hPa. For reasons that are not clear, the NCEP-1 and NCEP-2 systems captured the Mount Pinatubo signals much better. The 20CR did not assimilate any upper-air observations but took the volcanic aerosols into account in the forecast model, and these facts should be responsible for its different response.
We have added these discussions in the revised manuscript (Section 3.1).
page 13325 L 11: 20CR shows "unknown warming signals" in 1989/1990. There is no hypothesis about the origin of these signals?
As written above, the 20CR did not assimilate any upper-air observations but took the volcanic aerosols into account in the forecast model. In practice, the 20CR uses the same monthly-mean aerosol index data shown in Figure 3 (i.e., taken from Sato et al. (1993)), which were, for the case of the 20CR, averaged into 4 evenly spaced latitude bins (i.e., 90S-45S, 45S-equator, equator-45N, and 45N-90N). This discussion has been added in the revised manuscript (Section 3.1, the second last paragraph).

Before the introduction of horizontally dense satellite measurements in 1979, the upper-air temperature is constrained basically only by horizontally inhomogeneous, relatively sparse radiosonde data (see, e.g., Fig. 2 of Uppala et al., 2005). Also, the ERA-40 system is a relatively old system (the 2001 version of the ECMWF analysis system). These two facts are possible reasons why some unrealistic meandering occurred in the upper tropospheric and lower stratospheric temperature during this period in the ERA-40 system. A stream change in the reanalysis execution could also be a potential reason: for the ERA-40, there were three production streams, i.e., 1989 onwards, 1957-1972, and 1972-1988 (Uppala et al., 2005). However, the stream change point of 1972 probably cannot explain the anomalous warming starting around the end of 1974. This discussion has been added in the revised manuscript (Section 3.2, the second paragraph).
 page 13327 L 21: "the former MAY correspond: : :" Why MAY? It should be possible to check in the lat-lon data, correct?
We have completely removed the discussion on the temperature response to the three smaller-scale eruptions. Please see below.
 page 13328 L2: Could the opposite response in the case of Fernandina be due to lingering effects of Agung in the three years before the Fernandina eruption?
As you pointed out, one possibility is the lingering effects of the Mount Agung eruption. For the three smaller-scale eruptions, we may need a different definition for each (e.g., a different base period).
However, doing this would take time and make the paper complicated. Therefore, we decided to completely remove the discussion of the temperature response to the three smaller-scale eruptions from this paper.
 page 13328 L5: are aerosol heating rates included in the reanalyses output? If so, the cause of the warming could be checked.
The 20CR and the NCEP-CFSR are the only reanalysis systems that considered volcanic aerosols in their forecast model. Therefore, there is no volcanic signal in the heating rate data for the other reanalysis datasets. Any temperature changes in association with the volcanic eruptions came from the temperature observations in the reanalysis systems except for the 20CR and NCEP-CFSR.
 page 13328 L 9: the structure in the residuals similar to the QBO response could be due to aerosol-induced effects in dynamics (e.g. Aquila et al. (2014) in the case of a tropical geoengineering aerosol injection). However, why would it be present only in the case of Fernandina? Any hypothesis?
Again, we decided to completely remove the discussion of the temperature response to the three smaller-scale eruptions from this paper. Thank you very much for pointing us to this very interesting paper.

The temperature response to the Mount Agung eruption in the 1960s and 1970s is also investigated. Comparison of the results from several different reanalysis datasets confirms the atmospheric temperature response to these major eruptions qualitatively, but also shows quantitative differences even among the most recent reanalysis datasets. The consistencies and differences among different reanalysis datasets provide a measure of the confidence and uncertainty in our current understanding of the volcanic response. The results of this intercomparison study may be useful for validation of climate model responses to volcanic forcing and for assessing proposed geoengineering by stratospheric aerosol injection, as well as to link studies using only a single reanalysis dataset to other studies using a different reanalysis dataset.

Introduction
Explosive volcanic eruptions inject sulphur species to the stratosphere in the form of SO 2 and H 2 S 25 which convert to H 2 SO 4 aerosols. These aerosols are then transported both vertically and horizontally into the stratosphere by the Brewer-Dobson circulation (Butchart, 2014), stay there to perturb the radiative budget on a timescale of a few years, and thus affect global climate (Robock, 2000).
The stratospheric volcanic aerosol layer is heated by absorption of near-infrared solar radiation and upward longwave radiation from the troposphere and surface. In the troposphere, the reduced near-30 infrared solar radiation is compensated by the additional downward longwave radiation from the aerosol layer. At the surface the large reduction in direct shortwave radiation due to the aerosol layer mainly contributes to is the main cause of net cooling there.
Stratospheric aerosol optical depth (AOD) is an indicator of volcanic eruptions that affect global climate and has been estimated from various information (e.g., Sato et al., 1993;Robock, 2000; On the other hand, Extending over a longer period, the global radiosonde network that provides global atmospheric (upper-air) temperature data has been operating since the 1940s, with improved 40 spatial resolution since the late 1950s (Gaffen, 1994) An atmospheric reanalysis dataset is constructed as system provides a best estimate of the past state of the atmosphere using atmospheric observations with a fixed assimilation scheme and a fixed 60 global forecast model (Trenberth and Olson, 1988;Bengtsson and Shukla, 1988). It is an operational analysis system at a particular time (e.g., 1995 for the NCEP-1 system and 2009 for the JRA-55 system), which has been continuously improved with the main motivation being to improve the tropospheric weather prediction. Using a fixed assimilation-forecast model to produce analyses of observational data that were previously analysed in the context of operational forecasting -hence the 65 "re" in "reanalysis" -prevents artificial changes being produced in the analysed fields due to system changes. But, as described above, the observational data inputs still vary over the period of the reanalysis. Currently, there are about 10 global atmospheric reanalysis datasets available worldwide. Table 1 lists the reanalysis datasets considered in this study. It is known that different reanalysis datasets give different results for the same diagnostic. Depending on the diagnostic, the different re-70 sults may be due to differences either in the observational data assimilated, the assimilation scheme or forecast model, or any combination of these (see, e.g., Fujiwara et al., 2012 for a list of some examples). 
It is therefore necessary to compare all (or at least the newer) reanalysis datasets for various key diagnostics, both for understanding of data quality and for future reanalysis improvements (Fujiwara and Jackson, 2013). More specifically to the current study, the major observational sources of upper-air temperature are basically common to all the reanalysis datasets in Table 1 (except for the 20CR, which assimilated only surface pressure reports): radiosondes and satellite microwave and infrared sounders (i.e., MSU, SSU, and AMSU-A). Three components do differ among reanalysis systems: (1) the detailed bias-correction and quality-control methods applied to the original observations before assimilation, (2) the assimilation scheme, and (3) the forecast model.

Investigation of the climatic response to individual volcanic eruptions using multiple reanalysis datasets, for the purpose of comparison and evaluation of the reanalyses themselves, is rather limited. For example, Harris and Highwood (2011) showed global mean surface temperature changes following the Pinatubo eruption using NCEP-1 and ERA-40 reanalysis data for comparison with their model experiments. Analysing all available reanalysis datasets for the three major 20th-century eruptions separately, over a region covering both the troposphere and the stratosphere, will provide valuable information for model validation as well as on the quality of current reanalysis data for capturing volcanic signals. Such an analysis would also be valuable when assessing one of the proposed geoengineering options, i.e., stratospheric aerosol injection to counteract global surface warming (e.g., Crutzen, 2006; Robock et al., 2013).
In the present study, we analyse zonal and monthly mean temperature data from nine reanalysis datasets.

Data and Method
Monthly mean pressure-level temperature data from the nine reanalysis datasets listed in Table 1 were downloaded from each reanalysis-centre website or from the US National Center for Atmospheric Research (NCAR) Research Data Archive (http://rda.ucar.edu/). Zonal means were derived for each dataset before the analysis. All the reanalysis datasets except 20CR assimilated upper-air temperature measurements from radiosondes and from the SSU, MSU, and AMSU-A satellite instruments, with assimilation techniques that varied between systems. 20CR assimilated only surface pressure reports and used observed monthly sea-surface temperature and sea-ice distributions as boundary conditions for the forecast model. Note also that for the 20CR, annual averages of volcanic aerosols were specified in the forecast model.
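As a minimal sketch of this preprocessing step, zonal averaging of a gridded monthly temperature field can be written as follows. The grid dimensions and synthetic values here are illustrative only, not those of any particular reanalysis:

```python
import numpy as np

def zonal_mean(temp):
    """Average a (time, level, lat, lon) temperature array over longitude."""
    # nanmean tolerates masked (e.g., below-ground) grid points stored as NaN
    return np.nanmean(temp, axis=-1)

# Synthetic example: 12 months, 17 pressure levels, 73 latitudes,
# 144 longitudes (a common 2.5-degree grid).
temp = 250.0 + np.random.randn(12, 17, 73, 144)
tz = zonal_mean(temp)
print(tz.shape)  # (12, 17, 73)
```

All subsequent regression diagnostics in the paper operate on such zonal-mean (time, level, latitude) arrays.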

A multiple regression technique is applied to extract volcanic signals (e.g., Randel and Cobb, 1994; Randel, 2010; von Storch and Zwiers, 1999, Chapt. 8.4). First, all major variabilities except volcanic effects were evaluated and subtracted from the original zonal and monthly mean temperature data. The major variabilities include seasonal harmonics of the form a1 sin(ωt) + a2 cos(ωt) + a3 sin(2ωt) + a4 cos(2ωt) + a5 sin(3ωt) + a6 cos(3ωt), with ω = 2π/(12 months), a linear trend, two QBO indices, ENSO, and the solar cycle. For each of the latter five climatic indices, the six seasonal harmonics and a constant are further applied to construct seven regressors per index, as was done by Randel and Cobb (1994). For the two QBO indices, we use 20 and 50 hPa monthly mean zonal wind data taken at equatorial radiosonde stations, provided by the Freie Universität Berlin. The cross-correlation coefficient for these two QBO indices is −0.24 for 1979–2009 and −0.21 for 1958–2001. For the ENSO index, we use the Niño 3.4 index, a standardized sea surface temperature anomaly in the Niño 3.4 region (5°N–5°S, 170–120°W), provided by the NOAA Climate Prediction Center. As is often done, a time lag for the atmospheric response is applied to the ENSO index.
We chose 4 months for the lag, following Free and Lanzante (2009). We confirmed that changing the ENSO lag from 0 to 6 months gives somewhat different ENSO signals, particularly in the tropical stratosphere, but does not significantly alter other signals, including the volcanic signals. For the solar cycle index, we use solar 10.7 cm flux data provided by the NOAA Earth System Research Laboratory. (Note that we do not consider other indices, e.g., the North Atlantic Oscillation index and the Indian Monsoon index, because the former is considered to be a response rather than a forcing and both are considered to be related more to regional than to zonal mean response.) The multiple regression model that we use in this study is therefore

Y(t) = Σ_l a_l x_l(t) + R(t),

where Y(t) is the zonal and monthly mean temperature time series at a particular latitude and pressure grid point, and a_l is the least squares solution for the parameter of climatic index time series x_l(t). R(t) is the residual of this model, which is assumed to be composed of volcanic signals and random variations (Randel, 2010). Then, following Randel (2010), the volcanic signal for each eruption is defined as the difference between the 12 month averaged R(t) after the eruption and the 36 month averaged R(t) before the eruption. There are several other possible minor variations in the methodological details, i.e., in the multiple regression model, the choice of particular index datasets, and the volcanic signal definition. The use of a consistent methodology is important for comparisons of different datasets. Where possible, however, we discuss the methodological dependence below.

Figure 4 shows the temperature signals for the El Chichón eruption from the eight reanalysis datasets.
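A minimal sketch of the regression and signal-extraction steps for a single grid point might look like the following. The index series here are random stand-ins (real QBO winds, Niño 3.4, and 10.7 cm flux would be used in practice, with the 4 month ENSO lag applied by shifting that index before building the regressors):

```python
import numpy as np

def regressors(n, indices, nharm=3):
    """Constant, linear trend, seasonal harmonics, and each climatic index
    modulated by a constant plus the six harmonics (Randel and Cobb, 1994),
    i.e., seven regressors per index."""
    t = np.arange(n)
    w = 2.0 * np.pi / 12.0
    cols = [np.ones(n), t]
    for k in range(1, nharm + 1):
        cols += [np.sin(k * w * t), np.cos(k * w * t)]
    for x in indices:
        cols.append(x)  # unmodulated (constant) term
        for k in range(1, nharm + 1):
            cols += [x * np.sin(k * w * t), x * np.cos(k * w * t)]
    return np.column_stack(cols)

def volcanic_signal(y, indices, i_erupt):
    """Fit Y(t) = sum_l a_l x_l(t) + R(t) by least squares, then take the
    12 month mean of R(t) after the eruption minus the 36 month mean
    before it (Randel, 2010)."""
    X = regressors(len(y), indices)
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ a
    return r[i_erupt:i_erupt + 12].mean() - r[i_erupt - 36:i_erupt].mean()

rng = np.random.default_rng(0)
n = 372                                               # monthly, 1979-2009
indices = [rng.standard_normal(n) for _ in range(4)]  # QBO x2, ENSO, solar
y = 0.002 * np.arange(n) + rng.standard_normal(n)     # trend plus noise
print(f"volcanic signal estimate: {volcanic_signal(y, indices, 160):.3f} K")
```

In the paper the fit is performed independently at every latitude and pressure grid point of the zonal-mean field.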

The 1979-2009 Analysis
As described in Sect. 2, the volcanic signal is defined as the difference between the 12 month averaged R(t) after each eruption and the 36 month averaged R(t) before each eruption. The coloured regions are also defined following Randel (2010), i.e., locations where the signal exceeds twice the standard deviation of the annual mean residual R(t). NCEP-1 and NCEP-2 are based on older analysis systems (see Table 1) and used retrieved temperature data for the assimilation of SSU, MSU, and AMSU-A data. These factors may be responsible for the different signals of the El Chichón eruption in NCEP-1 and NCEP-2. (See also the discussion of the results for the Mount Pinatubo eruption below.) For 20CR, tropical stratospheric warming is present, but again, this is due to the specified volcanic aerosols in the forecast model.

For the Mount Pinatubo eruption, the tropical tropospheric and lower stratospheric features are also captured, though for the latter the warming magnitude is smaller and the "random" variability becomes large above the 50 hPa level because of the reason described above (i.e., the cold bias and its disappearance in 1998). For NCEP-1 and NCEP-2, the tropical tropospheric and stratospheric features are similar to those for the latest four reanalysis datasets, though the lower stratospheric warming magnitude is slightly smaller than in most of the other reanalyses. Compared with the El Chichón case, the NCEP-1 and NCEP-2 systems captured the Mount Pinatubo signals much better, for reasons that are not clear. For 20CR, the tropical stratospheric warming is not detected. This is because of the unknown warming signals in 20CR in 1989 and in 1990 (see Fig. 3), which raised the 36 month averaged base in the volcanic signal definition. As seen in Fig. 3, there are no relevant signals in AOD around 1989–1990. Thus, the unknown warming signals are likely due to unrealistic (unforced) variations in the 20CR system.
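The shading criterion (a signal positive and greater, or negative and smaller, than twice the SD of the annual mean residual) can be sketched as below; the array shapes and the synthetic residual are illustrative only:

```python
import numpy as np

def significance_masks(signal, residual):
    """signal: (nlat,) volcanic signal; residual: (nmonths, nlat) R(t).
    Returns masks where the signal is positive and greater (warming) or
    negative and smaller (cooling) than twice the SD of annual-mean R(t)."""
    nyears = residual.shape[0] // 12
    annual = residual[:nyears * 12].reshape(nyears, 12, -1).mean(axis=1)
    sd = annual.std(axis=0, ddof=1)
    return signal > 2.0 * sd, signal < -2.0 * sd

rng = np.random.default_rng(1)
resid = rng.standard_normal((31 * 12, 73))   # 31 years of monthly residuals
sig = np.zeros(73)
sig[30:40] = 1.5                             # imposed tropical warming band
warm, cool = significance_masks(sig, resid)
print(int(warm.sum()), int(cool.sum()))      # only the imposed band is flagged
```

In the figures, the warming and cooling masks correspond to the orange and blue shading, respectively.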
The temperature signals for the Mount Pinatubo eruption shown in Randel (2010) are similar to the present results in both the tropical-midlatitude stratosphere and the troposphere, though Randel's stratospheric warming peak value is somewhat greater (∼3 K) and his upper tropospheric cooling is somewhat greater (0.5–1 K) and more uniform in latitude. On the other hand, Free and Lanzante (2009) show that the lower stratospheric warming signal is split near the equator, with two maxima (1.6–2 K at 10°N and >2 K at 15°S, both at 70–50 hPa), and that the upper tropospheric cooling signal peaks (0.9–1.2 K) around 20°S. In summary, the four recent reanalysis datasets (i.e., JRA-55, MERRA, ERA-Interim, and NCEP-CFSR) give more consistent signals for both eruptions compared to the two radiosonde data analyses using different homogenized datasets by Free and Lanzante (2009) and Randel (2010).

The 1958-2001 Analysis
The multiple regression analysis is applied to the four reanalysis datasets covering this period, namely, JRA-55, ERA-40, NCEP-1, and 20CR. Before the satellite era, the analyses rely on temporally inhomogeneous, relatively sparse radiosonde data (see, e.g., Fig. 2 of Uppala et al., 2005). Also, the ERA-40 system is a relatively old system (the 2001 version of the ECMWF analysis system).
These two facts are possible reasons for ERA-40's anomalous warming in the mid-1970s. A stream change in the reanalysis execution could also be a potential reason: for ERA-40, there were three execution streams, i.e., 1989 onwards, 1957–1972, and 1972–1988 (Uppala et al., 2005). For all four reanalysis datasets, the 300 hPa cooling shown in Fig. 7 is not captured with the current volcanic-signal definition (i.e., the 12 month average after the eruption started).
Free and Lanzante (2009) showed a very similar Southern Hemisphere midlatitude lower stratospheric warming signal (>2 K) associated with the Mount Agung eruption using a homogenized radiosonde dataset. Sato et al. (1993) showed that the aerosols emitted by the Mount Agung eruption were transported primarily to the Southern Hemisphere.

The same volcanic-signal definition, i.e., the difference between the 12 month averaged R(t) after each eruption and the 36 month averaged R(t) before each eruption, was applied. Interestingly, for the Fernandina Island case (Fig. 10), JRA-55 and NCEP-1 show tropical upper tropospheric warming (peak value of 0.4–0.8 K at 300 hPa) and tropical 100–50 hPa cooling (1.2–1.6 K for JRA-55 and 1.6–2.0 K for NCEP-1), which is opposite to the response following the three major eruptions examined previously. ERA-40 shows a similar tropical lower stratospheric cooling, and ERA-40 and 20CR show much weaker tropical tropospheric warming. It is possible that the upper tropospheric warming signal is a radiative response to aerosols that did not penetrate so deeply into the stratosphere.
In addition, despite the inclusion of QBO indices in the regression analysis, the residual signal (interpreted as the volcanic response) has a structure in the stratosphere similar to the QBO response, with a tropical signal whose sign alternates in the vertical plus a weaker subtropical response of opposite sign. For the Mount Fuego case (not shown), ERA-40 showed very different signals from the other three reanalysis datasets, as can be inferred from Fig. 7, and all four datasets basically showed no substantial signal exceeding twice the SD of the annual mean residual. An eruption of Mount Awu, Indonesia, also occurred in August 1966 (Sato et al., 1993), but the AOD time series do not show any substantial signal (Fig. 7). The same volcanic analysis for the Mount Awu eruption showed cooling (0.8–1.6 K) in the Southern Hemisphere midlatitude lower stratosphere for all four datasets (not shown).

The Mount Agung signal was asymmetric about the equator and very different from the El Chichón and Pinatubo signals. We suggest that this may be due to differences in the transport of volcanic aerosols (Sato et al., 1993). The minor eruptions examined here showed a different climatic response, such as lower stratospheric cooling rather than warming, although the cases are limited. The characteristics of the temperature response are related to the transport of stratospheric aerosols together with the amount of sulphur species injected into the stratosphere. Depending on the location, season, and magnitude of the eruption, the climatic response can be very different (e.g., Trepte and Hitchman, 1992). This needs to be taken into account when evaluating stratospheric sulphur injection as a geoengineering option, and thus accurate estimation of stratospheric circulation and transport is essential for assessing the climate impacts. Also, it should be noted that accurate evaluation of naturally induced variability such as the QBO, solar cycle, and ENSO is necessary to detect the effects of artificial injection.
Finally, the reanalysis intercomparison for this case gave us some more confidence in the volcanic and other naturally induced effects, even though there are several known issues (e.g., inhomogeneity of observational data) in the current reanalysis systems.

This work was supported in part by a grant (26287117) and by the UK Natural Environment Research Council (NERC). We thank Tetsu Nakamura, Koji Yamazaki, and Fumio Hasebe for valuable discussion on earlier versions of the work. We also thank the three reviewers for valuable comments and suggestions. The Linear Algebra PACKage (LAPACK) was used for the matrix operations. The figures were produced using the GFD-DENNOU Library.

[Figure caption fragments: Solid and dashed lines denote positive and negative values, respectively. The contour interval is 0.2 K for the QBO, and 0.1 K for the solar cycle and ENSO. Coloured regions denote values greater (orange) or smaller (blue) than random variations at the 95 % confidence interval at each location; for the volcanic-signal figures, coloured regions denote values positive and greater (orange) or negative and smaller (blue) than twice the SD of the annual mean residual R(t) at each location.]