Cluster analysis of European surface ozone observations for evaluation of MACC reanalysis data

The high density of European surface ozone monitoring sites provides unique opportunities for the investigation of regional ozone representativeness and for the evaluation of chemistry climate models. The regional representativeness of European ozone measurements is investigated through a cluster analysis (CA) of 4 years of three-hourly ozone data from 1492 European surface monitoring stations in the Airbase database; the time resolution corresponds to the output frequency 10 of the model that is compared to the data in this study. K-means clustering is implemented for seasonal-diurnal variations (i) in absolute mixing ratio units, and (ii) normalized by the overall mean ozone mixing ratio at each site. Statistical tests suggest that each CA can distinguish between 4 and 5 different ozone pollution regimes. The individual clusters reveal differences in seasonal-diurnal cycles, showing typical patterns of the ozone behavior for more polluted stations or more rural background. The robustness of the clustering was tested with a series of k-means runs decreasing randomly the size of 15 the initial data set or lengths of the timeseries. Except for the Po Valley, the clustering does not provide a regional differentiation, as the member stations within each cluster are generally distributed all over Europe. The typical seasonal, diurnal, and weekly cycles of each cluster are compared to the output of the multi-year global reanalysis produced within the Monitoring of Atmospheric Composition and Climate (MACC) project. While the MACC reanalysis generally captures the shape of the diurnal cycles and the diurnal amplitudes it is not able to reproduce the seasonal cycles very well and it exhibits 20 a high bias up to 12 nmol/mol. The bias decreases from more polluted clusters to cleaner ones. Also, the seasonal and weekly cycles and frequency distributions of ozone mixing ratios are better described for clusters with relatively clean signatures. Due to relative sparsity of CO and NOx measurements these were not included in the cluster analysis. However, simulated CO and NOx mixing ratios are consistent with the general classification into more polluted and more background sites. Mean CO mixing ratios are ≈ 140-145 nmol/mol (CL1 – CL3) and ≈ 130-135 nmol/mol (CL4 and CL5), and NOx mixing ratios are 25 ≈ 4-6 nmol/mol and ≈ 2-3 nmol/mol, respectively. These results confirm that relatively coarse scale global models are more suitable for simulation of regional background concentrations, which are less variable in space and time. We conclude that cluster analysis of surface ozone observations provides a powerful and robust way to stratify sets of stations being thus more suitable for model evaluation. Atmos. Chem. Phys. Discuss., doi:10.5194/acp-2015-971, 2016 Manuscript under review for journal Atmos. Chem. Phys. Published: 8 February 2016 c © Author(s) 2016. CC-BY 3.0 License.


Introduction
Tropospheric ozone is a strong oxidant affecting people's health (Touloumi et al., 1997;Bell et al., 2006;Schwartz et al., 1994) and reducing yields of agricultural plants (Emberson et al., 2003;Ashmore, 2005;Pang et al., 2009). Furthermore, it is responsible for a significant fraction of global warming (IPCC, 2013). Ozone is photo-chemically produced in the troposphere in a chain of chemical reactions from precursors which concentrations are strongly influenced by anthropogenic 5 activities. Maximum ozone concentrations are therefore often found in or near large urban agglomerations during summer (National Research Council, 1991) giving rise to the summer smog episodes. Since the 1990s tropospheric ozone has been continuously monitored at many ground sites across Europe. Numerical models of atmospheric transport and chemistry (CTMs) have become indispensable tools for the interpretation of measurement data, the analysis of sensitivities towards, for example, emission changes, and the evaluation of potential future air quality changes in the context of climate change.
10 Since 2005, a major European effort is under way to establish an operational system for monitoring and predicting global and European air quality with the help of data assimilation and numerical models (Hollingsworth et al., 2008). This Copernicus Atmosphere Monitoring Service (http://www.copernicus-atmosphere.eu/) has been developed in a series of projects funded by the European Commission under the acronym of Monitoring Atmospheric Composition and Climate (MACC). One of the products from MACC is a global reanalysis of atmospheric chemical composition covering the period 15 (Inness et al., 2013. The quality of all model based estimates of atmospheric composition and its changes has to be assessed by in-depth model evaluations against observations. Currently model evaluation is often performed either on individual observations, or on the average of the set of measurements, selected from specific geographical regions. This is done for evaluation of global (Stevenson et al., 2006;Fiore et al., 2009;J.-F. Lamarque et al., 2012;Katragkou et al., 2015) as well as regional (van Loon disadvantage of being based on subjective assessments by the different station maintainers or regional agencies. Moreover the station information may become outdated: for example due to newly built industries, residential areas, roads or changes to forest areas. Such changes would transform stations from "background" to "urban", which would impede objective ozone analysis. Thus, a static category label as given in Airbase may not provide an objective and reproducible classification for use in further statistical analyses. Instead, we suggest applying cluster analysis (CA) to the measurement data as a data 5 driven classification. The main goal of this study is to identify typical European air quality ozone regimes, determine their indicative patterns with respect to the temporal behavior of ozone mixing ratios, to assess how well the classification works, and to apply the categorization to the evaluation of a global chemistry transport model. Analysis of groups separation was presented in Lyapina (2014) and won't be discussed here.
The output from the MACC reanalysis was sampled at all station locations, and the results were grouped into the same clusters as the measurement data. Through comparison of the mean seasonal, weekly, and diurnal cycles and analysis of the variability of clusters, we can identify how well the MACC reanalysis can reproduce the ozone mixing ratios and seasonaldiurnal features of each regime, and as a consequence which regime is most consistent with the model results, and thus representative for the scale of the model. The paper is structured as follows: section 2 describes the process of data filtering from the full Airbase database. The 15 extraction conditions for the MACC model data are given as well as further steps of the preparation of both data sets. Section 3 provides details about the applied k-means algorithm and Earth Mover's distance method. In section 4 the results of the two CAs are presented and compared to the MACC model data. Section 5 discusses the robustness of the cluster analyses and their application for the evaluation of models. Section 6 contains the conclusions.

Airbase
Airbase provides hourly integrated ground-based ozone data records, measured by UV photometric analyzers.
Geographically, the station network covers all countries from the European Union and the European Environment Agency (EEA) member countries (http://acm.eionet.europa.eu/databases/airbase/), albeit with varying density. Station altitudes vary from 0 to about 3100 m above sea level. In this study, Airbase version 6 data from 2007 to 2010 were used. Atmospheric 25 ozone content was recorded as ozone density in μg•m -3 units. For the analysis presented here these were converted to number densities (nmol/mol or ppb) using the density of dry air at T 0 = 20 ºС and pressure P 0 = 101325 Pa. This T 0 corresponds to a conversion factor of 2 (i.e. 0.5 nmol/mol correspond to 1 μg•m -3 of ozone). T 0 = 20 ºС and P 0 = 101325 Pa correspond to the standard settings of commercial ozone analyzers, which automatically convert measurements at actual temperature and pressure to these standard conditions.

30
Several datasets in Airbase contain incomplete data and some ozone records appear unreliable. Therefore a 4-step filtering procedure was applied to each dataset in order to identify suitable timeseries and to remove individual outliers which could corrupt the timeseries statistics. First, all data less than zero were eliminated, because they represent non-physical values.
Next, data above either 2.83 times the value of the 95 th percentile of the data or twice the value of the 99 th percentile were eliminated. For a Gaussian distributed random variable both values should be approximately identical. Even though the ozone probability density functions are generally not Gaussian (see Figure 9), this test can be used to define a reasonable upper limit value, because deviations from the normal distribution are mainly at the lowest percentile range of data. In a third 5 step those data points were removed which show erratic behavior near a missing value. The rationale behind this test is that a visual inspection of measurement timeseries sometimes indicates that data reporting stopped too late or resumed too early after a calibration procedure, an instrument maintenance or malfunction. On each side of the missing value, the five nearest measurements are tested if they lie in the range of the surrounding values or exhibit abnormal variability. Finally, another outlier test (multi-step low-pass filter) was performed using the 240 data points moving average in the first pass, which 10 removed data points exceeding 8 times the standard deviation within the moving sample. In the next two passes with a varying width between 10 and 72 points, thresholds of 8 and 6 standard deviations are applied.
The data filtering was tested extensively on many different ozone timeseries and found to reliably detect obvious errors while removing only very few valid data points. In order to retain a timeseries in the analysis it had to fulfill the following data capture criterion: in every year, at least 9 out of 12 months had to contain at least 2/3 of the theoretical maximum hourly 15 values. After application of this criterion, the original Airbase data set of more than 4000 stations was reduced to 1525 stations (see supplementary tables S1 and S2). Their timeseries were then visually inspected for sudden changes in the baseline (this phenomenon is not captured by the automated data quality filter; see also Solberg et al. (2009)). We adopted a conservative approach and flagged only those stations, where baseline shifts of 5 nmol/mol or greater occurred. The 33 stations which were filtered out at this step are presented in Table S2. Finally 1492 sites were used in the CA and model 20 evaluation (Table S1).
As input for the CA multi-annual monthly mean diurnal variations averaged over the four year period 2007 -2010 of the individual ozone timeseries were used. Seasonal -diurnal ozone variations appear as typical cycles and represent the concentrations resulting from many factors influencing the particular stations. We used 3-hourly resolution rather than the original hourly resolution in order to match the frequency of the MACC model output (see section 2.2). Thus each station is 25 represented by a vector of dimension 96 (12 months times 8 time steps per day). The time averaged data at all stations were arranged as a data matrix of dimension 1492 by 96.
Two different input matrices for the CA were constructed leading to two different types of CA runs (1 st CA and 2 nd CA from here on). At first, seasonal -diurnal ozone variations in absolute values are used as a set of properties. Second, we used normalized seasonal -diurnal ozone variations in order to avoid the influence of actual ozone concentrations on the results.

30
Each normalized variation had zero mean and unit standard deviation. This second CA produces different clusters than the first step, but allocates stations to clusters being based on seasonal and diurnal variations themselves regardless of absolute concentrations. Since the data generally exhibit no trend during the 2007-2010 period and interannual variability is much smaller than the diurnal or seasonal variability, we did not detrend the data prior to the CA.

MACC
The model data were taken from the MACC reanalysis (Inness et al., 2013). The reanalysis invoked data assimilation of meteorological variables, and trace gas columns of O 3 , CO, NO and NO 2 as well as ozone profile information from various satellite instruments. The model system was the European Centre for Medium Range Weather Forecasts (ECMWF) Integrated Forecasting System which was coupled to the Model for Ozone and Related Tracers (MOZART) (Flemming et 5 al., 2009;Stein et al., 2012). The model grid resolution was about 80 by 80 km 2 , and 60 hybrid sigma-pressure levels covering the atmosphere from the surface to about 60 km altitude. Output was stored every 3 hours. We extracted gridded timeseries for the years 2007-2010. The model data at the 1492 Airbase stations used in the cluster analysis were obtained by a horizontal as well as vertical bi-linear interpolation to the locations and heights of the 1492 Airbase stations from the eight nearest neighboring grid points. Similar to O 3 , also CO, and NO x were extracted and provided as mole fractions.

Cluster Analysis
Cluster Analysis (CA) is a data driven technique for classifying objects into groups whereby each object is described through a set of input parameters (properties or variables) which are used as criteria for grouping. Clusters are formed such that the intra cluster similarity between objects inside a cluster and the inter cluster dissimilarity between objects of different clusters 20 are jointly maximized. Initially the concept of "Cluster Analysis" was suggested by Tryon in 1939. Since then it has found applications in statistical processing of large data sets in biology, medicine, computer science, meteorology and atmospheric sciences (Zhang et al., 2007;Lee and Feldstein, 2013;Camargo et al., 2007;Christiansen, 2007;Beaver and Palazoglu, 2006;Dorling and Davies, 1995;Marzban and Sandgathe, 2006) as well as in other fields.
Several cluster algorithms have been developed and different choices can be made for the computation of distances between 25 objects or groups of objects. The most commonly used types of clustering are hierarchical and partitional (aka. centroidbased clustering, or k-means clustering). Hierarchical clustering progressively splits the data set into more and finer clusters, whereas partitional clustering groups the data into a pre-determined number of clusters. Clusters are non-overlapping groups, such that at the end of the computation each object will belong to exactly one cluster. In the present study we applied partitional clustering, because it allows for estimating the robustness of results and is less sensitive to outlier values than where x A and x B are two objects of the data set each having M properties (i.e. variables); A and B are two different stations.
In our case an object is a station timeseries of monthly averaged diurnal variations of three hourly ozone concentrations such that the Euclidean distance is evaluated from M=96 dimensions and is identical to the root mean square error between the two objects. The 1 st CA uses absolute mixing ratio values, while in the 2 nd CA the mixing ratios at each station are 5 normalized by the mean so, that each object had zero ozone mean and unit standard deviation. The k-means algorithm minimizes the average Euclidean distances between individual objects and the given number of cluster centroids. A centroid C is an artificial object that represents its cluster and is the arithmetic mean of all properties of cluster members: where n inumber of objects in i th cluster, c icentroid of the i th cluster, x ijj th object of the i th cluster. Minimization is 10 achieved iteratively in an analysis cycle of three steps. At the initial step of each k-means run, k centroids are defined randomly from the data array. The second step assigns each object to the closest centroid by sorting in ascending order the distances ( , ). By this an initial seed of clusters is formed. In the third step, each centroid C is recalculated, as the mean of the current cluster members. Steps 2 and 3 are then repeated until the centroid coordinates don't change anymore.
The goodness of the clustering can be assessed with the sum of squared distances (SSD) between all objects and their 15 corresponding centroids: where k is the number of clusters, and n k the number of objects inside the k th cluster. K-means requires that the number of clusters k is known for initialization of the algorithm, so prior to the CA we applied a method to determine the optimum value of k. Due to the random initialization, repetition of a k-means run with the same number of clusters will generate a 20 sample of different SSD values as a function of the number of allowed clusters. Figure 1 shows an "elbow" -curve (SSD versus number of clusters k), derived from 50 • 100 independent k-means runs of the 1 st set of properties (96 absolute seasonal-diurnal variations) with varying number of k from 1 to 100. The idea is to find the largest number of k where the SSD from the independent runs are consistent with each other, i.e. the curves in Figure  The "elbow" plots not only give the appropriate number of clusters to run k-means, but they also provide a preliminary answer on the question of stability of the CA run for the chosen k. For the presentation of results in sections 4 and 5 we picked the k-means run with the lowest SSD out of the 100 independent realizations shown in Figures 20 and 21, respectively for each kind of CA. Further details on the stability (i.e. reproducibility) of k-means runs are given in section 5.

Earth Mover's Distance
In order to quantitatively evaluate the model's ability to reproduce the observed frequency distributions in each cluster, we calculated the Earth Mover's Distance (EMD). Initially the EMD was suggested by Rubner et al. (1998). EMD provides an objective distance measure between two frequency distributions or estimates of probability density functions. It is a true distance measure in the sense that it is positive semidefinite and symmetric and fulfills the triangle inequality. Additionally it 5 has the property of being (asymptotically) proper meaning that the smallest distance is only achieved when the two probability densities are identical. The formula for EMD according to Rabin et al. (2008) is: where n b is the number of bins, F X(xi) and G X(xi) are two cumulative distribution functions of f and g, which themselves are the two corresponding estimated probability densities obtained from the normalization of the respective frequency 10 distribution histograms over the n b bins.  Table 2 presents a qualitative interpretation of the five clusters and shows the distribution of station altitudes for each cluster.

25
The cluster descriptions were derived based on the geographical and altitude distribution together with a contingency analysis of the station type and station type of area attributes in the Airbase metadata. A contingency table with Airbase station attributes is provided in Table 1 (a,b). According to the Airbase classification (see "Introduction") stations are marked as either "urban", "suburban" or "rural" depending on the area type and as "traffic", "industrial" or "background" according to the station type. Each row in Table 1 corresponds to one of the Airbase clusters and shows the number of stations related to each of 9 Airbase classification pairs. Most of the stations that we retained in our data filtering procedure (section "Data") are background stations, which could indicate that there are no local pollution sources in their vicinity.
Measured concentrations should ideally be representative for a larger area (and hence suitable for the evaluation of numerical models), except when local effects from orography, land use or land-sea contrast confound the analysis. There is a relatively even split between rural, suburban, and urban background stations. Industrial and traffic stations constitute about 10-15% 5 each and are concentrated in the suburban and urban environments, respectively. Table 3 presents the same information as Table 2

Frequency distributions of ozone in clusters
Above the comparison of ozone concentrations among the clusters and between the observations and the simulations was based upon quantiles characterizing the cumulative probability distribution. Another way are estimated probability density functions or normalized frequency distributions computed by binning all available 3 hourly observations from both the 25 Airbase and MACC data. Those frequency distributions are presented in Figure 9 for each cluster of the 1 st CA and distinguished between summer and winter.
In the Airbase winter time data the three clusters with more urban characteristics (CL1, CL2 and CL3) contain a significant MACC exhibits quite a good fit to CL4 and CL5 winter ozone concentrations and in general shows a greater similarity with the frequency distributions of the observations in winter compared to summer. During summer the measured ozone data are almost normally distributed (except for CL1), which is not seen for the MACC summer values. The model summer curves exhibit a high bias and contain two maxima for CL2 and CL4 (Figure 9).
In order to quantitatively evaluate the model's ability to reproduce the observed frequency distributions in each cluster, we 5 calculated the Earth Mover's Distance (EMD, described in section 3) ( Table 4). As expected from Figure 9, the largest EMD is found for CL1 and CL2 in summer, while the model shows greater skill in capturing the frequency distributions of CL4 and CL5 and to a lesser extent also CL3 (Table 4). This is again consistent with the previous characterizations of CL3 as "background, moderately polluted" and of CL4 and CL5 as (mostly rural) background stations (  Figure 9).
Frequency distributions of the 3-hourly surface ozone values of Airbase and MACC for each cluster of the 2 nd CA are 15 presented in Figure 10. As anticipated from the previous discussion, clusters with urban signatures CL1 and CL2 are expected to show a peak at low ozone concentrations, related to their higher pollution level. Indeed, the peaks of Airbase probabilities of zero ozone concentrations are pronounced for both clusters in comparison to the moderately polluted CL3, for example, where "zero" ozone occurs only half as often and the ozone maximum appears in the range 25-30 nmol/mol.
The shape of the relatively clean CL4 curve resembles a Gaussian distribution with maximum probability at ≈ 35 nmol/mol.

20
EMD calculated for comparison of observations to modeled frequency distributions (Table 5) show the strongest disagreement for CL1, followed by CL2 and CL3 with quite similar values, and finally is the smallest EMD value for CL4. values in CL2 exhibited the second highest bias (12 nmol/mol, Figure 6).

Analysis of seasonal
The seasonal cycles of the 1 st CA cluster centroids are displayed in Figure 12. In the observations CL1 and CL3 run almost parallel and show a broad maximum extending from April to July for CL1 and a slight maximum in April for CL3. More prominent spring maxima are evident in CL4 and CL5, but CL5 also exhibits a second small peak in July. The only cluster with a single pronounced maximum in summer (July) is CL2. The spring maximum is typical for seasonal cycles of western European sites and considered as Northern Hemispheric phenomenon (Monks, 2000). Indeed, a substantial subset of stations in CL3, CL4 and CL5 are situated along the western edge of the continent (see map, Figure 3). The decline of ozone mixing 5 ratios from spring till autumn in CL3 and CL4 suggests that summer photochemical ozone formation plays only a minor role at these sites. On the other hand, the double peak of CL5 suggests a superposition of the "natural" spring maximum with the "anthropogenic" summertime photochemical ozone production. The stations in CL5 are more elevated, therefore they can be influenced by ozone from the stratosphere-troposphere exchange, which is considered as a possible reason for the ozone spring maximum on high mountains (Elbern et al., 1997;Harris et al., 1998;Stohl et al., 2000;Monks, 2000;Zanis et al, 10 2003).
In contrast to the seasonal cycles of the Airbase cluster centroids, the cluster mean seasonal cycles of the MACC data all show a summer maximum of similar shape with peak in June. This suggests that either the summertime chemical ozone formation is exaggerated in the model, or the largely transport-driven springtime maximum is underestimated. A potential influence from inconsistencies in the data assimilation (see Inness et al., 2013) is unlikely, but cannot be excluded.

15
The seasonal cycles in Figure 12 indicate that the MACC model performs better during winter than during the summer. This is particularly evident for clusters 3, 4, and 5, whereas a significant bias persists throughout the year for CL1 and CL2. In the Validation Report of the MACC reanalysis (2013)  Diurnal amplitudes were calculated from averaged diurnal cycles of each station as an absolute difference between daily maximum and minimum, and then gathered into distributions for each cluster. Box and whisker plots of ozone average diurnal amplitudes ( Figure 13) show a clear signature that appears to be correlated with the ozone precursor concentrations 25 as simulated by the MACC model (see Figure 7). The largest diurnal amplitudes (mean 27 nmol/mol) are obtained for CL2 (Po Valley), followed by CL1 (mean 18 nmol/mol), CL3 (mean 18 nmol/mol) and CL4 (mean 17 nmol/mol). CL5 (relatively clean elevated) stations exhibit the lowest diurnal amplitude (mean 9 nmol/mol). This is consistent with earlier findings by

5
Weekly amplitudes are shown in Figure 15. These were calculated as the absolute difference between maximum and minimum ozone mixing ratios of averaged weekly cycles for each station and then grouped into clusters accordingly.
Weekly amplitudes were not used as initial parameters in the CA, but interestingly the classification of Airbase data shows a clear tendency of the weekly amplitudes decreasing from CL1 to CL5, even though there is considerable overlap between the various box-whisker plots. The weekly cycles of all cluster centroids show growth from Friday till Sunday, but no significant change during the week (not shown). This confirms our characterization of the clusters from more to less polluted, meaning that the less polluted sites are less influenced by local precursor emissions with distinct weekday cycles, notably traffic emissions (Beirle et al., 2003). As for the MACC model, the boundary conditions of its chemical equation system don't contain weekly variations of ozone precursor emissions, therefore simulated ozone has no significant weekly cycle.

20
The large seasonal and diurnal amplitudes in the Airbase data of CL2 are consistent with the relatively large emissions and active photochemistry in the Po Valley region (Bigi et al., 2012). While ozone precursor concentrations at stations in CL1 may be as large as those in CL2 (based on emission inventories and the MACC simulation results for CO and NOx, see Figure 7), the mean ozone concentrations at these stations are lower. As can be seen from the frequency distributions in Figure 9, there are a lot more incidents with very low ozone concentrations at the stations in CL1, and these occur both in 25 winter as well as in summer. In the Northern and Central part of Europe, where the majority of CL1 stations are located, the photochemistry is slow especially during winter, so that not much NO 2 is converted back to NO and ozone via photolysis.
CL2 also exhibits ozone titration, but in summer to a lesser extent than for CL1 (Figure 9). For CL2 ozone destruction by NO and dry deposition still occur during night time but the prevalence of the daily ozone production over the ozone titration is more obvious here than for CL1. Indeed, the seasonal and diurnal cycles of CL2 are more pronounced than for CL1 30 (Figures 12 and 14), and are indicative of the intensive photochemistry in the Po Valley region. This may be explained by the basin type of the Po Valley region and by its partly sub-tropical climate with plenty of available UV light, which is favorable for summer diurnal photochemical ozone production.

Second CA
The mean seasonal amplitudes for clusters of the 2 nd CA are presented in normalized units in Figure 16. MACC data was normalized in the same way as the Airbase data, and then grouped according to the clustering results. We notice narrowness of seasonal amplitudes distributions and the decrease of their average in order CL1 → CL2 → CL3 → CL4. MACC seasonal amplitudes follow the same dependence, but in a more "smoothed" way, and they have broader distributions. The means of 5 modeled amplitudes slightly overestimate average observed amplitudes for CL3 and CL4, are nearly equal for CL2 and underestimate CL1.
The seasonal cycles in normalized values of the cluster centroids from the 2 nd CA are depicted in Figure 17. In general the model performs better for the description of diurnal cycles rather than seasonal. The diurnal cycles ( Figure 19) show similar dependence on cluster number as seasonal cycles: the smoothest for CL4 and most pronounced for CL1. As 25 expected from the 1 st CA, all clusters exhibit diurnal minima at 6 am and maxima between midday and 3 pm, except for CL1, which maximizes in the late afternoon -after 3 pm, similarly to CL2 of the 1 st CA. Modeled diurnal minima and maxima are in accordance with the observations, except for CL1, where MACC shows daily maxima in between 12 and 3 pm like for other modeled groups.
Clustering based on the normalized set of properties shows a clear division of stations relevant to amplitudes of seasonal and 30 diurnal cycles (Figures 16, 18). Further analysis (not presented here) of the 2 nd CA clusters have shown that they are also distinguished by the short-term variability, expressed as the difference between 95 th and 5 th percentiles of ozone mixing ratios (Lyapina, 2014). Both these amplitudes, as well as variability, decrease uniformly and gradually from CL1 to CL4 in accordance with the level of pollution of these clusters. In contrast, there are no substantial differences of variability between clusters of the 1 st CA (Lyapina, 2014). And as mentioned earlier, the dominant clustering criteria of the 1 st CA are the average ozone concentrations (Figure 6), and only to lesser extent the seasonal-diurnal amplitudes.

Stability and robustness of the cluster analyses
As described in section 3.1 "Cluster analysis", repeated k-means runs do not necessarily lead to the same allocation of 5 stations to clusters due to the random assignment of the initial centroids. As explained there, different initialization may lead to somewhat better or worse separation of clusters as expressed by the SSD values. Here we analyze the reproducibility of results from many independent k-means runs. We call this the stability of the CA. Another important aspect investigated here is the robustness of the analysis, i.e. the reproducibility of the station classification when random subsets of stations or are excluded from the analysis or when the input data are shortened in time. 10

Stability of the CA
As mentioned in section 3 "Method", 100 independent k-means runs were carried out for each CA and from these runs the one with the smallest SSD was chosen for further analysis. These runs (one for the first and one for the second CA) will be referred to as reference runs.  Figure 20) with more than 99% identity to the reference run and 70 runs with more than 95% of stations are grouped into the same categories as in the reference case which is marked with a black diamond in Figure 20. The stability decreases when the SSD values become larger, but in all of the runs at least 89% of the stations are always classified 20 in the same way. Exemplary checks of how the stations are redistributed when the results differ indicate that we usually find CL3 stations from the reference run in CL1 and CL2, while some CL4 stations are moved to CL3. This indicates that the distinctions between these clusters may be less obvious if we base our analysis on mean concentrations as we did in this study.
Similar to Figure 20, Figure 21 shows the SSD values of the 100 k-means runs from the second CA. From the first look at 25 Figure 21 we notice that the SSD curve of 100 k-means runs based on 2 nd set of properties is less structured and exhibits no "stable states". However, the scale of SSD values is also very narrow here, and every run generates a classification which is at least 95% similar to the reference run of the second CA.

Robustness with respect to number of stations considered
Besides the 100 k-means runs with all 1492 stations we performed another 100 sets of 100 k-means runs each where we randomly reduced the number of stations to 90, 80, 70, 60 and 50% of the initial data set, respectively. For each of these sets we selected the run with the minimum SSD and compared the classification results with our reference run. The robustness of the CA results was then obtained from contingency tables, where diagonal elements reveal the number of stations that are classified to the same cluster as in the reference run. Table 6 summarizes the results of all of these tests by grouping the contingency results into three categories: better than 99% 5 agreement, 95-99% agreement, and less than 95% agreement of cluster allocations (in this case there were no cases with less than 89% agreement for k-means runs of the 1 st set of properties). Each row in Table 6 represents the results for one particular dataset size. As Table 6 shows, the CA classification is very robust (more than 95% agreement in 99 runs out of 100) even if only 60% of the stations remain in the dataset. Out of the 100 randomly selected subsets for each row, at least 25 yield a classification which is 99% consistent with the reference run. Only if we remove 50% of the stations from the 10 input data, this similarity starts to decline. Note again that each count in Table 6 is already the minimum SSD run out of 100 for a given random sample. Had we performed only one realization of each subset, the CA would appear much less robust because of the stability issues discussed above. Table 7 shows the robustness results of the second CA. Though the reproducibility of 2 nd CA runs with the full data set is higher (see Figure 21) than runs based on the 1 st set of properties (Figure 20), the reduced data sets give the opposite results.

15
Reduced till 70% data set delivers most of 2 nd CA runs into the second category (95-99% of similarity), what happened only for the half-size reduced data set of the 1 st CA runs. Nevertheless, in case of the 2 nd set of properties no single run produces less than 91% agreement with the reference run, which is slightly better than for the 1 st set of properties (89% of similarity).
But as there are very few such runs (maximum 8 runs out of 100) in both CAs, we can conclude that the most of runs with any reduction result in clustering with 95% and higher similarity to the reference runs. 20

Robustness with respect to the length of the timeseries
Obviously it is desirable to obtain a station classification which is independent of the precise time period that is chosen for the analysis. We therefore performed additional robustness tests of the two CAs by repeating the analysis for subsets of 3 years out of the total 4 years which we had available. Each CA was re-calculated in 4 sets of 100 realizations excluding all data from 2007, 2008, 2009 and 2010, respectively. As before, from each set the run with minimum SSD was selected and 25 compared to the reference runs. The similarities of the station classification were again taken from the diagonals of contingency tables and are given in Table 8. There are small differences depending on which year is removed from the analysis, and on average both CAs yield a classification which is 95% similar to the analysis of the complete dataset.

Conclusions
Starting from more than 4000 European Airbase surface stations monitoring ozone concentration for the period 2007 till 30 2010 finally 1492 were selected after filtering for incomplete timeseries and erroneous data. The classification of stations based on k-means cluster analysis is broadly consistent with the Airbase intrinsic description of area types, which divides station types into background, industrial and traffic and station area types into urban, suburban and rural. The consistency between this Airbase characterization and our classification is mainly reflecting the pollution levels in the individual clusters.

25
The ozone variability (expressed as difference between 95 th and 5 th percentiles) was not included as an input parameter for any of the CAs. As an outcome there are no substantial differences of variability between clusters of the 1 st CA. In contrast, for the CAs based on the normalized properties the variability reduces from CL1 to CL4 (Lyapina, 2014). This implies that the short-term variability of ozone concentrations at European stations is generally correlated with the seasonal and diurnal amplitudes at these sites.

30
Comparison of the model with observations for individual clusters reveals MACC pros and cons. First of all there are different overestimation biases for the 1 st CA (from ≈ 5 to ≈ 15 nmol/mol), and secondly differences mainly in seasonal behavior rather than diurnal for both 1 st and 2 nd CAs. The biases are mostly driven by summertime ozone rather than wintertime, where ozone is generally well predicted (biases less than 5 nmol/mol on average). The biases decrease when going from clusters indicative of higher pollution to cleaner ones. Also, the seasonal cycles are described better for clusters with relatively clean air signatures. The best fit between the MACC reanalysis and the observations is observed for CL5 of the 1 st CA as well as for CL4 of the 2 nd CA and is explained by the fact that these stations are influenced more by regional rather than local factors.
When applying the k-means clustering technique it is important to ensure that the results are stable and robust against spatial 5 and temporal subsampling of the data array. We analyzed the reproducibility of the clustering results based on an extensive number of repetitions and found that in general, more than 95% of stations are almost always grouped into the same category, even if the total number of stations is reduced to 60% of the total, or if one year is excluded from the analysis.
However, this robustness is only obtained if one performs several k-means runs for each subset and selects the run with minimum SSD for further analysis. We therefore conclude that k-means clustering presents a suitable analysis of ozone 10 mixing ratio data if applied in the described manner.
The robustness and clarity of the cluster analysis might be further improved by adding observations of other compounds (ozone precursor concentrations) and/or meteorological variables. Unfortunately, such data are only available for very few of the Airbase measurement sites. Inclusion of such data might also allow separation into more clusters where one might begin to see regional differences of the ozone behavior. As the robustness analysis indicates, our results should remain valid even 15 if the analysis were to be repeated with longer timeseries or with an extended or reduced set of stations. It would be interesting to perform similar analyses in other world regions and to find out if the clusters obtained there are related to the broad classification into pollution regimes which we found for Europe.