Mass spectrometric measurements commonly yield data on hundreds of variables
over thousands of points in time. Refining and synthesizing this raw data
into chemical information necessitates the use of advanced, statistics-based
data analytical techniques. In the field of analytical aerosol chemistry,
statistical, dimensionality reductive methods have become widespread in the
last decade, yet comparable advanced chemometric techniques for data
classification and identification remain marginal. Here we present an example
of combining data dimensionality reduction (factorization) with exploratory
classification (clustering), and show that the results cannot only reproduce and
corroborate earlier findings, but also complement and broaden our current
perspectives on aerosol chemical classification. We find that applying
positive matrix factorization to extract spectral characteristics of the
organic component of air pollution plumes, together with an unsupervised
clustering algorithm,
The field of chemometrics, i.e. “using mathematical and statistical methods
[
Yet, there is a considerable amount of work still to be done in the field of aerosol data chemometric analysis – a significant part of more advanced AMS data analysis is still done manually and thus inevitably limited by the expertise and capacity of the human analyst. In particular, the classification and interpretation of the AMS spectra are still largely based on a dozen or so mass spectral variables called “marker signals” and their ratios (Aiken et al., 2007, 2008; Cubison et al., 2011; Farmer et al., 2010; Mohr et al., 2012). Exploring the logical follow-up to the automatic data dimensionality reduction by applying similar mathematical, computer-aided tools also for un-/semi-supervised classification and identification of AMS spectra, has not been performed outside of the specific application of single-particle mass spectrometric studies (e.g. Freutel et al., 2013; Liu et al., 2013; Murphy et al., 2003; Rebotier and Prather, 2007). Automatic or machine-learning-based classification tools would likely prove invaluable for consistent and objective analysis of a much wider range of AMS data, lessening the outcomes' dependence on the subjective views of analysts or reliant on their years of expertise in mass spectral interpretation (Ausloos et al., 1999; Ulbrich et al., 2009). Even for an experienced analyst, exploratory data analysis has the potential to uncover previously unknown underlying mathematical structures within the data (Tukey, 1977), offering invaluable clues for the correct selection of solutions and their interpretations.
Sometimes individual, classical analysis methods are used at the edge of their appropriate domain. For example, factor analysis assumes constant profiles over long time frames, when chemical processes actually modify the profiles (Zhang et al., 2011), and hard classification methods can be used for data reduction when observations are non-discrete and form a continuum (Marcolli et al., 2006). More robust results may be obtained by sticking more closely to the core applicability area of each method, where their core assumptions are better expected to hold, and by combining multiple separate methods instead.
More diverse, combined techniques can also overcome inherent limitations of data models. They may, for example, enable the uncovering of the minor, “outlier” aerosol types that often go unnoticed in a long-term factor analysis because of their low relative contribution to total aerosol mass (Ulbrich et al., 2009). Such additional, quantitative information on aerosol chemotypes is widely beneficial for many types of studies involving, e.g. receptor modelling with chemical mass balance or constrained factor analysis models (Belis et al., 2013; Canonaco et al., 2013).
In this work we explore the possibility of complementing the current techniques for AMS data reduction with some analytical and processing methods found to be useful in similar mass spectrometric applications. We apply data dimensionality reduction to deconvolve ambient air pollution events, extracting the characteristic pollution spectra, and subsequently use and optimize unsupervised classification to resolve the pollution types. Our motivations are to (1) test whether a simple unsupervised data clustering method can be used to classify discrete aerosol mass spectral samples without a priori information provided by a human analyst, (2) explore the effect of similarity metric selection and data weighting to optimize data preprocessing (Horai et al., 2010; Kim et al., 2012; Stein and Scott, 1994) and thus enhance the structures in data, leading to improved classification (Anderberg, 1973; Spath, 1980). Finally we will (3) compare which classification measures best capture the differences between different atmospheric aerosol chemotypes, often referred to by the AMS scientific community, and examine how the observed structures in data translate to information on pollution types.
The results on the performance of distance metrics along with the relative importance of variables (optimized weight) for correctly grouping mass spectra can also lead the way towards an automated classification of AMS results against relevant library spectra, as well as providing ways to evaluate and classify (discrete or deconvolved) spectral results in large numbers, such as those often seen in sensitivity analysis or bootstrapping analyses in factor analytical model verification. Indirectly, the information may also help in any manual classification and identification tasks.
We will exemplify the functionality (and hopefully usefulness) of such a machine-learning approach in an analysis of an extensive set of AMS ambient air pollution spectra. We propose that, in this case, the methodology offers an improved way to derive not only reliable, reference spectra for the local, archetypal anthropogenic pollution types, but also quantitative estimates for their expected natural variation. This information has an immediate and direct use, e.g. in constrained factor analytical models that require an input for reference spectra variability (Canonaco et al., 2013).
From an aerosol chemical viewpoint, solid, quantitative knowledge on local pollution types and their chemical characteristics can help decipher and understand a variety of physicochemical observations during times when a site experiences pollution events. Concentrating on individual pollution events also enables the observation and identification of often overlooked minor or rare emission sources, which can provide new information on local or regional atmospheric aerosols.
Although the focus of this study is on statistical analysis of aerosol mass spectrometric measurements, any such venture is inherently affected by the nature and quality of the experimental data, as well as its preprocessing. We will first provide a short overview of the main features, advantages and shortcomings of the aerosol mass spectrometer instrument (Sect. 2.1) and describe the specific set of data we used (Sect. 2.2) in the testing of the methodology to serve as a background and put into context the choices made when selecting our specific statistical methods and algorithms.
The statistical methods themselves are described in a brief manner in Sect. 2.3, since their in-depth review or commentary is beyond the scope of this article. We encourage the interested reader to follow the references provided for a more comprehensive account and additional background information on the specifics and inner workings of these data analytical techniques and algorithms.
The data analysed in this study are acquired with an aerosol mass spectrometer featuring a compact time-of-flight (C-ToF) mass analyser. The instrument is developed and manufactured by Aerodyne Research Inc. (Billerica, MA, US). While outclassed mass resolution-wise by subsequent high-resolution (HR) ToF-AMS variants (Canagaratna et al., 2007; DeCarlo et al., 2006), the C-ToF-AMS mass analyser comes with the highest sensitivity, and thus signal-to-noise ratio (SNR) of any AMS instrument (DeCarlo et al., 2006; Drewnick et al., 2009). This high precision clearly benefits any statistical analysis and partly offsets the smaller number of variables available in unit mass resolution (UMR) data compared to high resolution.
The C-ToF-AMS, described in a thorough fashion by Drewnick et al. (2005), shares many common characteristics with most ToF-AMS
instruments – an aerodynamic lens to concentrate the sample aerosol into a
tight beam of particles upon entering through the inlet of the instruments, a beam
chopper to enable particle size measurement based on their flight time
through a vacuum chamber, a thermal vapourizer set at 600
In the particular C-ToF-AMS we used, the particle time-of-flight (PToF) chamber is considerably shortened (10 cm versus the normal 40 cm chamber). The advantage of this modification is an increased sample transmission from the inlet up to the vapourization and ionization region, at the cost of an increased signal from aerosols' carrier gas, due to the reduction in time and distance for air molecules to diverge from the beam. To combat the effect of an increased number of air molecules passing through the system, a helium flow is introduced to the PToF chamber to increase the pressure and thus the diffusion of air molecules in the beam passing through the chamber. With this arrangement the air signal is reduced by a factor of 10 to 100, depending on the setting, while only affecting the aerosol signal to a much lesser extent. The helium ions are later removed by an additional high pass mass filter located before the analyser. While not affecting this work, an additional negative effect of this geometry is the reduced resolution in particle time-of-flight sizing caused by the shortened PToF flight track.
The AMS data need a number of steps and corrections to first compute the per ion mass loading from the raw mass analyser signal, and then to estimate and propagate the errors arising from several sources along the way to the final results. As the general methodology of AMS standard data processing has fortunately been amply described (Allan et al., 2003; Jimenez et al., 2003), as has the derivation of mass spectral matrices and their error estimates (Ulbrich et al., 2009; Zhang et al., 2005), we will only summarize the procedure here and concentrate on the particulars that deviate from the standard approach in our preprocessing.
The AMS data were automatically corrected for changes in
The instrument signal response degrade over time was accounted for by using a
time period after instrument calibration as a reference point and normalizing
the measured air induced signal (airbeam, AB) to the mean value of the
reference time (AB
Airbeam non-linearity in the C-ToF-AMS is discussed in more detail by Hings et al. (2007), who attribute it to signal thresholding affecting low (< 1 ion per MS extraction) signals differently from air ions with multiple ions detected in each extraction.
The AMS fragmentation table was slightly modified to accommodate for the
issues arising from the modified PToF chamber and the increased airbeam.
Namely some air-related signal ratios used in the fragmentation calculations,
such as relative air contribution to observed signals at 15, 29 and 30 Th,
were recalculated based on exact molecular ratios obtained from filter runs,
automatically performed every 3 h. After this recalibration all the minor,
artificial, non-air signals originally seen during filter runs (e.g. in
organics and nitrates) at the aforementioned
Efficiency of the ionization process was determined via ammonium nitrate response factor calibration outlined by Jimenez (2003) and Allan et al. (2003). Finally, collection efficiency (CE; Huffman et al., 2005) was evaluated in relation to mass derived from a (twin) differential mobility particle sizer (DMPS; Aalto et al., 2001) after the subtraction of black carbon (BC) given by an aerosol aethalometer (Hansen et al., 1984). The CE correction proposed by Middlebrook et al. (2012) was only applied to one (March 2009) out of the three data sets used in this analysis (see Sect. 2.2), with a small modification to the base CE suggested by the DMPS comparison, as applying it was actually found to slightly weaken the correlation between DMPS and AMS data instead of improving it. For the other two data sets (May 2008, September 2008) a constant CE was applied based on a linear least squares fit to best match the mass observed by the AMS against that derived from the DMPS.
Finally matrices consisting of organic mass spectra and their estimated
errors at all measurement points, averaged in data acquisition phase to 5 min intervals, were extracted from the preanalysis software. The error of
the organic signals at each time and for each
The two matrices, one for organic mass spectra and one for the error estimate
were additionally preprocessed using the PMF Evaluation Tool (PET;
Ulbrich et al., 2009). The preprocessing features mainly take into account
an additional correction to the error matrix from electric noise of the
instrument, and allow the down-weighting of certain signals with poor SNR or
those derived directly from
The experimental data used in this statistical analysis exercise originate from long-term ambient air observations at the SMEAR II station in Hyytiälä, Juupajoki, Finland. The particulars of the measurement campaigns, environment and pollution events are described below.
As the practical example of the study is about applying our exploratory data analytical techniques to resolve the archetypal air pollution classes typical of a non-urban field site, we obviously strive for a high-quality, variable and comprehensive set of data to refine and test the methodology. Our data sets of choice originate from the comprehensively documented and characterized station of SMEAR II in Hyytiälä, during the well-covered intensive measurement EUCAARI campaigns of 2008 to 2009 (Kulmala et al., 2009). This already rather familiar data will offer a good test bed for the proposed methodology. Below, both the station and the intensive measurement campaigns are described.
Time frames of measurements and numbers of successfully extracted pollution spectra per data set.
Situated in southern Finland (61
A notable exception to the absence of major anthropogenic pollution sources in the local environment are the two lumber mills and the wood pellet factory in the small village of Korkeakoski, 7 km east-south-east from Hyytiälä, and the other small sawmills further away, which do have a marked influence on the volatile organic compounds (VOC) concentrations and aerosol population of Hyytiälä when the incoming air mass is advected over these mills (Eerdekens et al., 2009; Liao et al., 2011). Additionally, although the local aerosol sources such as passing vehicles, cooking emissions at the forestry station, nearby cottage, household or sauna heating are negligible on a large scale, they can momentarily affect the local air quality if emitted from close enough not to be diluted below the detection limits, and are thus detected by many of the more sensitive, high-time-resolution instruments at the SMEAR II stations.
Nevertheless, the preeminent cause of degraded air quality (relative to the background) at the station is the medium-to-long-range air convection from industrialized areas of southern Finland, the St. Petersburg region in Russia in particular (Kulmala et al., 2000; Patokoski et al., 2015; Riuttanen et al., 2013) and even all the way from the industrial heartlands of (mostly eastern) continental Europe (Niemi et al., 2009; Sogacheva et al., 2005).
As it is independent of these anthropogenic components, the local atmosphere is always influenced by the ever-present biogenic background aerosol and biogenic volatile organic compounds (BVOC). These exhibit their own seasonal and diurnal variations. BVOC concentrations are generally high, both during the afternoon due to maximum emissions and at night-time due to the trapping of emissions in the shallow, unmixed boundary layer (Rinne et al., 2005). The aerosol biogenic particle mass was also higher at night due to thermally driven condensation of semi-volatile species into existing seed particles. Due to the biological origin of the natural aerosol, the biogenic aerosol background is obviously higher in warmer months (e.g. Patokoski et al., 2014).
The relative lack of local anthropogenic sources and their pronounced dependence on air mass origins, manifesting as observations of isolated aerosol and gas phase plumes, make Hyytiälä an ideal natural laboratory for studying the effects and characteristics of local and transported air pollution on the otherwise clean atmosphere over the expansive subarctic biomes.
The EUCAARI study, conducted from 2007 to 2010, aimed at examining the interactions between air pollution and climate change (Kulmala et al., 2009). The results have been widely published (see Kulmala et al., 2011, for a summary of findings) and include discussions, e.g. on aerosol source apportionment and chemical ageing (Kulmala et al., 2011; Ng et al., 2010b). The intensive observation periods of the project took place in spring 2008, autumn 2008 and late winter 2009. The exact time frames of the AMS measurements are available in Sect. 3.1, Table 1.
An especially comprehensive analysis of the intensive EUCAARI AMS data is written by Crippa et al. (2014), providing a reference point to compare our results with. Their analysis provides plausible estimates of aerosol speciation using consistent methodology. Due to the obvious limitations induced by the very large number of data sets and the somewhat rigid methodology of using general reference spectra from quite different types of environments for all the sites and by applying rather strict constraints to their allowed variability, it is possible for the analysis to miss out on some divergent, locally relevant phenomena. As an additional motive for this work, we aim to provide considerably enhanced prerequisite information for applying such a factor analytical methodology to an individual site, by observing the local anthropogenic aerosol characteristics and variation, and tailoring the input reference spectra and variation estimates accordingly.
The term “air pollution event” is in the context of this study defined loosely as a period of significantly increased concentrations relative to a stable background aerosol. This implies that the pollution episode has a distinct beginning, a point in time when a relatively stable background aerosol is first complemented with a specific pollution aerosol, and an end, when the pollution vanishes, leaving a background aerosol similar in composition and mass to that observed before the event. During the pollution episode we assume the observed total aerosol is a two- (or in some cases multi-) component superposition of background and chemically invariable “pollution plume” aerosols. Although there are undoubtedly aerosol dynamical processes going on between the two aerosol types, the background and the plume, we assume these to be of minor importance due to the generally short time frames of the events. Some examples of typical pollution event types and durations are given in the results (Fig. 1).
Examples of time series of extracted PMF factors for four different
pollution cases, ranging from simple to very complex cases.
When basing the pollution event definition on aerosol mass, while we can pick
out the clearest instances of pollution with confidence, we run into problems
when the increase in mass concentration approaches the magnitude of noise in
the instrument. Also, after pollution events that last a better part of a day
or longer, it is questionable whether the background aerosol has remained the
same and whether it is even possible to describe the changing pollution
unambiguously. To address these issues of demarcation, we needed to define
further conditions to separate real pollution from short instrument noise
peaks on one hand and long periods of increased concentrations of gradually
changing or evolving pollution aerosol on the other. Hence, the qualitative
requirements of a pollution episode to be accepted for our analysis were set
as follows.
A distinct temporary rise of organic aerosol mass concentration above
background level. The ability to unambiguously separate the pollution plume from the
background. The supposed pollution spectra needs to be physically reasonable (i.e.
not a minor spike of instrument noise).
The above criteria were taken as qualitative, and their fulfilment evaluated
visually from the time series and mass spectra. Still, some rough guideline
values for required plume features are given below to convey the magnitudes
of the thresholds used in our manual event screening.
A rough estimate of a 10 % rise of plume peak mass versus surrounding
background aerosol mass was taken as an approximate threshold for criterion 1
selection. The timescales considered ranges of pollution events from minutes
(two data points equalling 10 min) to several days (maximum was 5 days, in
the case shown in Fig. 1d). The first criterion can thus be considered lax
and likely overly inclusive. Criterion two was much stricter and in practice
required the deconvolved pollution time series to exhibit a small enough
correlation between the temporal behaviours of plume and background aerosols: in the
background time series a low level of jitter around mean background mass
during the plume was accepted in mild cases, as long as it did not contain
marked dips (anti-correlation) or hills (correlation) matching or mirroring
the plume behaviour. The approximate limit was that the (anti-)correlation
should not affect the pollution factor mass by more than
We note that stricter, fully qualitative and preferably statistics-based limits for event screening would certainly be preferential, and might enable automatic event identification. From a more quantitative analysis the exact uncertainty induced in this sample selection process could be also derived. Our rough estimate for the magnitude of maximum uncertainty is the 20 % deriving mainly from the effect of mass misapportionment allowed in the correlation examination phase (criterion 2). Unfortunately, due to the chemometric focus of this study, a more exhaustive time-series-based analysis of the phenomenon of air pollution plumes was not achieved here.
It should be noted that, while the first and third conditions could be examined using a simpler non-statistical analysis method like background subtraction, there are many pollution episodes where this would not suffice. While we could define the pollution-plus-background spectrum from the event period and the background spectrum as an average of background before and after the event, and subtract the latter from the previous, yielding the characteristic pollution spectrum (e.g. in Fig. 1a) the second condition is much more problematic if the background varies, not even necessarily in a linear way, or several pollution plumes overlap (e.g. shown in Fig. 1d). This especially applies to events with a long duration, from several hours to even a couple of days, and to events with multiple consecutive peaks, which still represent the same event (case in Fig. 1b). To account for these complications we feel a more advanced data reduction method, such as the factor analytical approach presented in Sect. 2.3.1, is indeed required to be able to thoroughly evaluate which of the events satisfy our selection criteria.
As numerous studies have already been conducted on feature extraction and data dimensionality reduction in connection with AMS results (Ng et al., 2010a; Zhang et al., 2011), in this work we will focus more on classification and identification techniques suitable for AMS data.
Based on a brief review of suitable data analytical methods, we selected the
specific methods and algorithms used in this work: for pollution feature
extraction we selected a model already tried and tested for AMS data, the
positive matrix factorization (PMF; applying the ME-2 algorithm; Paatero,
1999) and for feature classification we use the elegantly simple and
long-established
In this work we will use PMF in a non-classical way, to extract
characteristic air pollution spectra from air pollution plumes, which are
often considered to be anomalous data and discarded from long time series
analyses. As there often exists considerable variation among mathematically
equally good PMF solutions, termed ”rotational ambiguity” (see Supplement
Sect. S1), the issue of selecting the correct solution needs to be resolved.
We propose that in the context of this work, selecting the PMF solution with
non-correlating time series of plume and background can be used to identify
the rotation that best separates the characteristic pollution factor, and
thus largely avoids the factor analytical models' inherent weakness of
rotational ambiguity, which also afflicts PMF (Paatero et al., 2014). We will
then demonstrate classification of the extracted spectra to aerosol types
using
In the analysis of aerosol mass spectrometric results, data reductive methods are put to good use for reasons explained in the introduction. To respond to the challenges and requirements posed by the AMS data, experts in statistics and modelling have updated many traditional analysis tools and developed new ones to answer the specific needs of this type of environmental data analysis. Perhaps the best-known technique developed specifically for feature extraction from environmental, multidimensional data is the use of the positive matrix factorization (PMF) model to deconvolve and interpret the enigmatic organic aerosol chemistry reflected by the often complex AMS mass spectra.
The PMF technique developed by Paatero and Tapper (Paatero, 1997; Paatero and
Tapper, 1993, 1994) is an iterative, factor analytical model to explain
observations at a receptor site, i.e. time series (
One of the noteworthy improvements, summarized in detail by Paatero and
Tapper (1994), over previous feature extraction methods such as principal
component analysis (PCA; Hotelling, 1933; Jolliffe, 1986; Pearson, 1901) is
that the PMF model does not blindly minimize
Therefore, for the objective function to be minimized,
PMF analysis is performed for each air pollution event (defined in Sect. 2.2.2) individually. The time window of the analysis period is selected around the event to include both the pollution episode and some background before and after the event. The advantage of studying this type of relatively short-term phenomena is that we can easily evaluate the fulfilment of the criteria outlined in Sect. 2.2.2, and we can additionally discriminate between mathematically equal solutions, mostly evading the issue of rotational ambiguity. Essentially, knowing beforehand what the (qualitative) temporal behaviour of a pollution and background factors should be like (i.e. the time series of the factors should be uncorrelated), we can explore a number of factors and the solution space to select the solution best fulfilling our criteria of physical correctness. By adhering to these criteria, we strive to minimize the ambiguity related to our selection of solutions, as well as considerably reduce the effect of subjectivity with regard to selecting solutions.
The uncertainties and limitations in PMF are related to measurement errors, uncertainty in data and modelling errors of the PMF bilinear model and rotational ambiguity of the PMF results (Paatero et al., 2014). Rotational uncertainty is inherent to all linear algebraic, factor analytical models (Henry, 1987) and arises from the existence of several model solutions of mathematically equal rates of explanations of the observations. Rotational ambiguity was decreased or even eliminated as the number of zero values available in the data increased (Anderson, 1984; Paatero and Hopke, 2009), making this type of plume modelling by PMF propitious (as the background regions contain ample observations close to zero concentrations of plume chemical constituents). The topic of rotational ambiguity is discussed in more detail, and further references are available in the Supplement (Sect. S1). The modelling errors in PMF relate to simplifications and unrealistic assumptions (e.g. unchanging factor profiles in PMF, translating to neglecting the effect of atmospheric chemistry on mass spectra). Paatero et al. (2014) note that the effect of measurement (random) errors along with rotational uncertainty decrease (in cases where there are more zero values available in larger data) with increasing data set size, while the modelling errors are exacerbated.
Regarding the PMF uncertainties overall, we propose that aerosol plume and pollution event modelling of AMS data by PMF generally decreases analysis uncertainty compared to analysing the full time series, specifically in our case of high SNR data with small random errors. The higher rotational ambiguity induced by having a smaller set of data is offset by the abundance of zero values in concentrations, while the modelling errors from the neglect of chemistry are decreased due to shorter analysis time windows. Additionally, the selection of solution and rotation is facilitated by the external, physical criterion of minimal correlation of factor time series. We propose that using the [modulus of] correlation minimum as a guideline for rotation selection resolves the physically correct rotation (source-wise differentiation of factors) from among the ones available in the solution space.
An inherent feature of factor analytical receptor models is also that they are unable to separate components with a high degree of correlation in profiles (spectra) or loadings (mass concentration) (Henry et al., 1984). This problem of multicollinearity hinders or even prevents the extraction of spectra from individual sources that are collocated. In such a case a mixed air mass, e.g. one transported from an urban location with multiple sources (traffic, cooking and combustion) would be resolved as a single factor, the characteristic spectra which would be a linear sum of the actual, single sources. A similar effect, profile-wise, occurs with extensive oxidation of organic aerosol components – the spectral similarity of organic aerosol increases upon atmospheric oxidation, leading to difficulties in differentiating between highly oxidized aerosol types, even if they originate from chemically distinct emissions (Ng et al., 2010b; Zhang et al., 2011). Both of these effects are likely to impact negatively on data reduction achieved by PMF and can carry over to clustering, which by default assumes discrete or deconvolved samples.
Nevertheless, we should be warranted in expecting that not all the sources are similarly collocated; thus some extracted samples indeed represent “pure” spectra from single sources (e.g. a passing car or an individual aerosol plume from a single smokestack close by). This would allow identification and proper handling of the mixed plumes within the framework of sample classification and weighting, covered in the next Sect. 2.3.2 and 2.3.4). We will evaluate and address these issues further in light of results in Sect. 3.4 and 3.6.
K-means clustering is one of the most popular, widely used and well-known
classification algorithms developed as far back as the 1950s and 1960s (Ball
and Hall, 1965; MacQueen, 1967; Steinhaus, 1956). It is a simple iterative
partitioning clustering algorithm that partitions a set of objects in a
multidimensional space into a preset number (
Further discussion on the selection of user parameters for
Among the most important questions in clustering is the selection of a measure for distance or (dis)similarity between two objects, a topic for which there are both theoretical (Anderberg, 1973) and experimental (e.g. Stein and Scott, 1994) considerations to be taken into account. Fortunately for the choice of metric we have plenty of recommendations available for our selection: there are several guidelines and recommendations (e.g. Cormack, 1971; Gordon, 1999; Kaufman and Rousseeuw, 2009) available of which similarity metric best to apply for various types of problems, including problems related to identification, comparison and classification of mass spectra similar to ours. As an experimental basis for the metric comparison we cite the informative and thorough study by Stein and Scott (1994) of NIST Mass Spectrometry Data Center, the conclusions of whose are covered in wider detail further below. Importantly, the distance metric selected needs to be mathematically compatible with the type of variable on hand. This point in question is addressed in Sect. S3.
Some common approaches available for and often used as distance ( the squared Euclidian distance the city block distance (or “Manhattan distance”; Johnson and Wall,
1969; Carmichael and Sneath, 1969) the cosine distance (Sokal and Sneath, 1963) the correlation distance (Fortier and Solomon, 1966; McQuitty, 1966;
Sokal, 1958)
where
Additionally to experimentally evaluating the metrics, Stein and Scott (1994)
recommend data-weighting methods such as signal intensity scaling and mass
scaling to be examined. They find modest improvement of a couple of percent
accuracy in the dot product (cosine) and Euclidean-based matching when
scaling the signal intensities by their square root to emphasize smaller
signals, or when scaling all the signals by a power of their “mass” (i.e.
Although the theory and literature seem to favour the cosine (dis)similarity
as a measure of mass spectral objects' associations with each other, we ran
several comparisons using different parameters for
To evaluate and, to an extent, validate the clustering analysis we need an objective, diagnostic metric to compare different results. There are several alternatives available, four of which we tested in relation to this work. We considered the four evaluation criteria available in the Matlab software statistics toolbox (R2015a), namely the silhouette (Rousseeuw 1987), Calinski–Harabasz (Caliński and Harabasz, 1974), Davies–Bouldin (Davies and Bouldin, 1979) and gap (Tibshirani et al., 2001) criteria, the evaluation results of which are presented in Sect. S4, Figs. S2 to S4.
The downside of the three latter methods is that they do not (at least
unmodified) accept all non-distance dissimilarity metrics such as the cosine
(dis)similarity. For squared Euclidean distance, which was compatible with
the evaluation functions, the methods yield mixed results. Upon examining the
Rousseeuw (1987) defines for each object
The silhouette value has some convenient properties for interpreting the
quality of the clustering assignments that can be applied on single-point,
cluster and total solution levels. When
The average
The
Nevertheless, we do have additional, diagnostic information available to us
outside of the simple list returned by the
Assuming the objects nearer the cluster centre are a better representation of
the class than the ones on the edges, or indeed the ones likely
misclassified, they should carry more weight when a typical representative of
the class is selected or constructed. In this work we construct
characteristic centroid objects, i.e. spectra, by taking a weighted
arithmetic mean of the cluster members instead of the original, unweighted
sample mean. As a weight we use the silhouette values indicating the confidence
we have on the representability of the object. Any likely misclassified
objects with negative silhouette values have their weight set to zero. The
weighted cluster centroid
Overall the variabilities represented by the weighted standard deviations are generally smaller than the unweighted ones, due to the down-weighting of outlier objects' influence. The aim of this post-processing is to derive more representative characteristic mass spectra for the pollution types, and to decrease the error from the ambiguity of the classification. This should allow us to derive plausible estimates for the actual natural variation within a specific aerosol type.
In the following chapter, we present some examples of pollution spectrum extraction (Sect. 3.1), and evaluate the similarity and weighting parameters used for their subsequent grouping (Sect. 3.2). We then offer an aerosol chemical interpretation for the different aerosol types (clusters) for the grouping we consider most realistic (Sect. 3.3 and 3.4) and further try to understand and interpret the metastructure of clustering solutions, i.e. how the solutions relate to each other, what drives them and how they are related to divisions in chemical characteristics (Sect. 3.5). Finally some basic estimates of natural variability within the pollution types are given in Sect. 3.6.
Although time consuming, applying the pollution feature extraction approach (described in Sect. 2.3.1) to the identified pollution events (Sect. 2.2.2) allowed us to extract the pollution factors' spectral profiles. Applying our simplistic selection criteria to find the most physically correct rotation among the solutions, we hope to have minimized the rotational ambiguity, as well as the need for subjective choices by the analyst. Following the procedure described in Methods, we managed to extract a total of 81 characteristic mass spectra, corresponding to as many unique pollution plumes. In the Supplement, namely the local time and above-canopy wind direction taken at the time of peak mass concentration was recorded for all plumes. The background spectra were not further considered in this analysis. The per-campaign distribution of the successfully extracted pollution events are presented in Table 1.
Some examples of factor time series of various types of accepted extractions are given in Fig. 1, illustrating the considerable (temporal) variability in the types and conditions of pollution events, e.g. from a single plume with a stable background (Fig. 1a) to very complex events with multiple overlapping plumes (Fig. 1d and e).
Solution silhouette value of clustering solutions for
As discussed in Sect. 2.3.2, there are several options for the standard
We note that using a low number of repetitions (< 10) does not
reliably return the exact same optimal solution, so there seem to be several
similar, but non-identical, local minima as well as the global one for
The clustering initialization method was not found to notably affect our
results in any way, at least with a generally high number of repetitions. Due
to literature recommendations based on comparison (Shindler, 2008), the
We set the additional option “empty cluster action”, i.e. what
would happen if an empty cluster is created in the course of the iterative
process, as a “singleton”, meaning that the point with the highest distance
score to its cluster centre was assigned its own cluster. This forces the
solution to always conform to the original cluster number. Generally, an
empty cluster was produced in much less than 1 % of all the total
iterative processes, so we do not consider this to have affected the overall
result, especially since the
The selection of cluster number
A contour plot for a field of solution silhouette values for the
studied dissimilarity metrics, as a function of cluster size
Having no definitive preconception on the number of clusters, we evaluated
clustering results for a range of
Based on the silhouette value comparison for the unscaled data (Fig. 2) we
conclude the following: the city block distance metric seems to perform
poorly compared to the other three alternatives. The squared Euclidean,
correlation and cosine methods are more or less equal in their silhouette
quality, making the selection based on this test alone a difficult task. We
also find the silhouette values for the latter methods between values of 0.25
and 0.50, suggested by Kaufman and Rousseeuw (1989) as a region of weak
structure in the set of data, and calling for the use of additional methods
to probe whether the implied structure is real or artificial. We additionally
note that there is clear variation in silhouette values as a function of
For additional visualization, similar diagnostics for when (not specific
metric optimization) mass scaling is applied are also presented in Fig. 2. The
example mass scaling factor
As preprocessing options we also tested the two methods recommended by Stein
and Scott (1994), namely intensity and mass scaling of the data variables, as
explained in Sect. 2.3.3, Eq. (10). Similar to Stein and Scott, we also
explore values for
Diagnostics values, clustering parameters and cluster populations
for solutions of 6 to 10 clusters. Oxidation level is described for each
cluster centroid and potential sources are (preliminarily) identified.
Within-cluster silhouette values are colour coded for readability
(< 0.24 orange,
Similarly, intensity scaling was charted for
Finally, we tested the combination of mass and intensity scaling, but found
the results worse than for mass scaling alone. We additionally tested two
methods with similar aims, namely omitting the low end mass spectrum
< 45 Th and down-weighting
In conclusion of the preprocessing methods, we find mass scaling is the only method to consistently (but non-homogeneously) enhance the data cluster structure. Whether the other procedures mentioned above might under certain circumstances or specific combinations also prove beneficial, is a question left for a more detailed future study. In the following, we will overview the silhouette maxima obtained using variable mass scaling, as presented in Fig. 3, and the information it reveals on the general structures within our set of data.
Having utilized the optimized parameters and preprocessing methods, based on the test results, we are yet left with a number of plausible solutions of almost equal mathematical quality. These solutions, shown as the bright silhouette maxima in Fig. 3, are connected to various structures in our set of pollution data. In the following we will try and interpret these data structures both from mathematical and physicochemical viewpoints.
Weighted cluster centroid spectra for solution
Beginning with an overview of the favourable solutions of highest
mathematical quality, we located and tabulated the maximum silhouette values
obtainable for each dissimilarity metric and each number of clusters
A brief overview of the two distinct classes, emerging from the original data without any mass
scaling, or a more complex classification leading to 6 to 10 separable classes when
optimized mass scaling is applied.
We hypothesize that these alternative classifications correspond to different
types of structures present in the data – a two-cluster structure would
imply separation based on a dominant, binary-type variable, or a two-part
division along a single axis (i.e. dimension, property), whereas 6 to 10
clusters likely imply divisions along more than one dimension. In the
following we first investigate and interpret the binary (two-cluster)
structure (Sect. 3.3.1), and subsequently aim to explain the finer,
multidimensional structures and classifications reflected by the 6 to 10
cluster solutions (Sect. 3.3.2).
The two-cluster (
It should be noted that without mass scaling this separation is thus the most
natural one (with silhouette maximum at
As the number of plausible solutions with similar magnitude (0.45 to 0.49) silhouettes for the preprocessed set of data is larger than just a few, we will not thoroughly describe all the solutions here or at this point claim one is superior to the others, but instead try to formulate a synthesis of the results, and identify the common features exhibited by the solutions.
When looking at all highest total silhouette values of the solutions,
excluding the two-cluster solutions covered above and inspecting the mass
spectra derived from the cluster centroids, we can find several analogous
characteristics shared by essentially all the solutions. Presenting an
overview of the
Secondly, there seems to always be a group of two to three outlier clusters
(O-I to O-III), each with very unique individual mass spectra. Here we brand
them outlier groups due to their small cluster populations (
The remaining clusters are much less pronounced (within-silhouettes typically
< 0.4), and much less stable as
Observing the composition of the weak clusters, they seem to form a structure
independent to those of the clear-cut and outlier groups; the sum of
population over the weak clusters is rather invariable (
To examine the effect of outliers in data, we additionally excluded the outlier and/or the strong clustered objects and reran the analysis for the remaining data, but the results were found to revert to an analogous situation with the same problem of silhouette-wise ambiguity and low inner cohesion of the weak subclusters.
From examining the within-cluster silhouette values of the clusters we would
be inclined to look primarily to the Table 2 solution at
As ever, when applying inherently mathematical algorithms such as PMF2/ME-2
and
When interpreting aerosol mass spectra measured from ambient air, it should be kept in mind that the aerosol is not only the product of the primary emission or nucleation, but also the physicochemical processes taking place post-emission. These notably include condensation and evaporation of trace gases, as well as interaction with other aerosol types. In particular, it has been suggested that the interactions between the primary and secondary aerosols, and likewise anthropogenic and natural ones and their precursors, may play a considerable role in forming and transforming the atmospheric aerosols we observe (e.g. Weber et al., 2007; Carlton et al., 2010). These interactions are poorly understood and usually not taken into account when analysing ambient observations. It seems likely, though, that these effects would hinder attempts of classification by smearing out differences between aerosols from various different sources.
Also, the issues caused by collinearities in loadings (i.e. the inability of PMF to separate collocated single sources, but rather produce an extraction containing a linear combination of the single sources) may produce samples with inadequate deconvolution, which would exhibit a superposition of spectral features of many aerosol source types. In the clustering phase, these samples would be expected to show up as between-cluster objects, falling between the “pure” samples, and exhibiting low silhouette values. The posterior processing (Sect. 2.3.5) we applied should thus down-weight these mixed observations, diminishing their influence on the final cluster centroids. The collinearity problems caused by spectral profile similarities, on the other hand, are harder to resolve. In case of high similarities between spectral profiles from various combustion processes (hydrocarbon-like organic aerosol (HOA) vs. cooking organic aerosol (COA) vs. biomass burning organic aerosol (BBOA); Mohr et al., 2012) as well as the tendency for the most highly oxidized organized aerosols to closely resemble each other (Ng et al., 2010b; Zhang et al., 2011), the low dissimilarities between the objects hinder robust classification even if PMF manages to correctly extract these spectra from the background. In the end, differentiation between highly similar spectra comes down to the quality of (a) the classification algorithm, (b) the dissimilarity metric and (c) the data-weighting optimization. Although our algorithm selection was fixed in this work, the parameter optimizations described in Sect. 3.2 should provide us with an improved resolution for examining the aerosol chemical classes and structures in the set of data.
For AMS data we are fortunate to have access to years of research by the
worldwide AMS users' community and the numerous studies reporting
compositions of various aerosol types, which significantly helps us in
understanding the cluster centroid spectra. An especially helpful information
repository relevant for any AMS-related mass spectral identification and
comparison exercises, such as the problem at hand, is the AMS spectral
database (described by Ulbrich et al., 2009), containing a total of 248 unit
resolution AMS spectra from both ambient air, chamber and combustion
experiments. The spectra contain examples of source-attributed aerosol types
obtained in laboratory experiments or ambient aerosol feature extraction,
various averaged mass spectra of ambient aerosols over longer periods and
laboratory standards measured using the AMS. As the spectra are obtained
using many different AMS variants and under very different conditions, not
all the spectra contain the same variables (
To help interpret our obtained clusters we calculated the similarities
between the AMS spectral database specimens and the mass spectra derived from
our cluster centroids. Where needed we would then refer to the specific
publications describing the details of the comparison spectra of interest. As
a measure of similarity we use the metric found to perform best for the
tested data set (as per evaluation in Sect. 3.2), the Pearson product-moment
correlation (Eq. 8), scaled dynamically by (
As mentioned in Sect. 2.2, information on pollution event hourly times and peak wind directions was logged during the feature extraction analysis. The summary of these diagnostics sorted according to clustering results are shown in the Supplement (Sect. S9; Figs. S11, S12). However, due to the small number of objects in most clusters, sample sizes are too low for solid conclusions to be made from this auxiliary data.
Mass spectra derived from [correlation,
The clusters we can identify, quantify and interpret with high confidence are
the strong clusters (S-I and S-II), clearly set apart by the
Looking at the correlations with database spectra, we find the first cluster
(S-I) correlates highly (
The second cluster that can be easily identified is S-II. The absolutely
highest correlation (
Mass spectra corresponding to clusters W-I (A-SV-OOA), W-II (HOA)
and W-III (COA). Error bars denote within-cluster variability
(silhouette-weighted within-cluster standard deviation). Solution for correlation
The mass spectra of biomass burning and the sawmill aerosol groups, derived
from the highest silhouette (0.49) solution (correlation,
Whether due to a very low number of observations, limits imposed by
instrument SNR ratio, high chemical similarity between the weak clusters or
inconclusively resolved PMF extractions due to plumes consisting of multiple
sources, the mass spectral structures separating the weak groups of aerosols
from each other is much less pronounced than the division between the strong
and the outlier cluster characteristic spectra relative to other aerosol
types. Although it is hard to judge based on this set of data alone, we think
the faults lie mostly with the collinearity issues arising from the chemical
similarity and/or source collocation hypotheses, since the number of
observations related to the weak groups overall is quite large (around
40 % of total) and the instrument SNR seems to enable the classification
of other groups without ambiguity. From the general outlook of the weak
clusters' spectra we observe many hints (low
The actual differentiation between AMS aerosol spectra from cooking, and
traffic is notoriously hard for unit mass resolution spectra, as discussed by
Mohr et al. (2012), and is traditionally mostly based on the relative
abundances of signals at
Looking again at the highest silhouette solution (correlation,
The wind direction analysis combined with a potential source survey (available in Sect. S9; Figs. S11, S13 to S16) additionally points to the conclusion that the source of this aerosol is in the sector with a nearby public road. We thus term W-II as hydrocarbon-like organic aerosol (HOA) in accordance with AMS aerosol naming conventions.
The other “fresh” aerosol type, W-III (Fig. 3.6) exhibits highest
similarities (
Separating both of these fresh two subclasses from the weak supercluster
leaves us with the semi-volatiles species in the form of one to three
clusters. The solution with one semi-volatile aerosol pollution type, W-I, in
(correlation,
W-I pollution type exhibits mixed mass spectral characteristics between the
HOA and COA types (Fig. 6). The main difference with the former two-cluster
spectra is the spectra from the remaining part of the weak cluster (W-X)
group implies a considerably higher oxidation state (estimated O : C ratio
of 0.40 compared to 0.18 and 0.15 for HOA and COA; Aiken et al., 2008;
Eq. 14). The library spectra similarity examination brands this aerosol a
general semi-volatile oxidized organic aerosol (SV-OOA), with the closest
similarity to SV-OOA observed in Barcelona (
However, we will also present an on-going interpretation based on the
Mass spectra for the correlation
This interpretation is indeed supported to some extent by the correlation
examination against HOA, COA and BBOA spectra of the AMS spectral database
(Table S3 in the Supplement), and in the case of SV(HOA), the only group with
a moderate number of observations (
While the spectra examined thus far seem interpretable in the “traditional” framework of AMS aerosol-type classification (LV-OOA, SV-OOA, BBOA, HOA, COA), the outlier clusters do not fit these conventional categories. There are also no spectra matching our observations in the AMS spectral library. We therefore additionally examine the spectral features and compare them to observations in other mass spectrometry literature.
To begin with, we note the distinctive feature of all the mass spectra
outlier clusters are rather “exotic”, at least in an AMS context, with
peaks at 58, (72), 86 and 100 Th (Fig. 8). These even molecular masses are
relatively rarely observed in the AMS organic spectra due to the nitrogen
rule implying the presence of a nitrogen atom. The homologous ion series of
amine compounds (C
Ge et al. (2011) calculated in their review article a grand total of 67 aerosol phase amine observations (1972–2009). Amine spectra with some similar features have been observed elsewhere (Aiken et al., 2009; Huffman et al., 2009; Sun et al., 2011; Chang et al., 2011), and amines have been postulated to contribute to the AMS organic signal at 30 Th at the SMEAR II station (Allan et al.,2006). None of the studies cited offer a close match, however, and many of the main spectral features and signal ratios differ markedly from the ones seen here. Additionally, some rather similar spectra have been presented by Murphy et al. (2007), who measured secondary aerosol generated from various aliphatic amines using an AMS, and the 70 eV electron impact ionization spectrum of trimethylamine, available from the US National Institute of Standards and Technology (NIST).
The aerosol phase amine sources have thus far mostly been attributed to either local industrial pollution or marine biological production (see Ge et al., 2011 for a review of observations). In our case both of these sources would be surprising considering the inland location and the scarcity of nearby industrial plants, along with the apparent seasonal dependence of the observations (only observed in the springtime measurements). However, we cannot rule them out at this point. As additional hypotheses for the origin we offer the following: (1) biodegradation-produced volatile aerosol precursors released at snowmelt (Kieloaho et al., 2013; Kuhn et al., 2011), (2) manure application to agricultural fields or amine emissions from nearby cattle farm (Schade and Crutzen, 1995; Ge et al., 2011; Kuhn et al. 2011; Sintermann et al., 2014; Sect. S12, Fig. S14) and (3) biogenic amine emissions (Kieloaho et al., 2013) related to clear-cutting a nearby patch of forest (Virkkula et al., 2014).
An additional discussion on the amines and their potential origins is
available in the Supplement (Sect. S12). This section also includes a
discussion on alternative sources for these signals and, while the final
decision on the sources and origins of the outlier clusters' spectra remains
speculative, we believe the most likely explanation is the amine compounds.
Therefore we name the outlier I-III peaks “amine-58” (O-I), “amine-100”
(O-II) and “amine-86” (O-III) with respect to their major characteristic
peaks, most likely corresponding to fragment ions with elemental composition
CH
Mass spectra corresponding to weighted cluster centroids (correlation,
Below we try to summarize what we consider the most important dimensions or
axes, on which the more complex (
Traditional AMS spectral analysis revolves around studying the process of
oxidation or the ageing of an aerosol particle in the atmosphere. The
oxidation process depends on particle chemical structure, number and type of
oxidant radicals available and the time spent in the atmosphere, so it is
highly variable and difficult to model. From this branch of study and the
connection of volatility to oxidation level (Donahue et al., 2011, 2012;
Kroll et al., 2011; Jimenez et al., 2009), some of the “standard” labels
for atmospheric processed aerosol types (LV-OOA, SV-OOA) also originate. It
has been known for a long time in the AMS community that mass spectra peaks
such as
The other axes for cluster separation seem to relate to their source-specific
fingerprints. The results presented in Sect. 3.1 and 3.2, and particularly
the solution diagnostics values shown in Table 2, suggest that there are one
or more source-related divisions resulting in a fairly clear-cut separation
of clusters. One such clear division seems to be between the anthropogenic
aerosol groups considered primary, and thus usually originated from a
combustion process (such as biomass or fossil fuel burning, combustion
engines exhaust or aerosol formed in high-temperature cooking), and the
secondary aerosol from particle conversion biogenic organic vapours (albeit
in our case from anthropogenic sources in the form of the sawmills). In our
case this distinction separates the sawmill secondary organic aerosol
(cluster S-I) from other aerosols of similar age and oxidation from different
sources (especially W-I) in a clear-cut manner. A short examination on a
potential S-I spectral marker at
The structure that is the most difficult to explain conclusively is the set
of mass spectral features setting apart the various components of the
observed weak cluster structure. The separation of fresh HOA from COA and
BBOA has been discussed and characterized in many studies (e.g. Crippa et
al., 2013; Mohr et al., 2009), but in practise classifying these aerosols in
an unambiguous manner remains troublesome. It does, however, seem clear from
the results presented here that the
As the important separation of the sawmill-SOA cluster (S-I) also happens to
be clearly reflected in the
Correlation
For this set of observations we did not obtain a separate distinct (fresh)
BBOA cluster, so we were unable to evaluate the difference between BBOA and
COA. As for the more controversial classification of A-SV-OOA subtypes, the
separation can be visualized in the (
In addition to these more traditional fingerprints in the AMS spectra, in
this case we also have outlier observations, distinguishable by their unusual
high mass (
Classification of A-SV-OOA (Cluster W-I) types into subgroups W-Ia,
W-Ib and W-Ic, corresponding to SV(HOA), SV(COA) and SV(BBOA). 2-D
visualization is given in
Outlier clusters (O-I to O-III) assumed to contain amines, and their
respective centroids plotted in colour, and any other clusters in grey in
(
Silhouette-weighted standard deviation, as a function of
Finally, we briefly examine the intracluster variabilities, translating to
inferred mass spectral variability within the aerosol types. While we feel it
would be dangerous to claim that the variation within the spectra of a
specific group can be directly understood as the natural variability of that
aerosol type at this site, we propose it can be considered as an upper limit
estimate of this variability, since the within-cluster variation is caused
both by the actual variability in the natural aerosol, and the uncertainty
induced by its measurement and analysis. Overall, the effects of instrument
(white) noise is filtered in the feature extraction (PMF) phase, and the
effects of possible misclassifications or presence of mixed source pollution
events (collocated sources incompletely resolved by PMF) in clustering are
likely limited to borderline, between cluster cases that have minimal
influence on the final spectra due to the silhouette-based posteriori
weighting. The collinearity effects discussed in Sect. 2.3.1 do contribute to
the total analysis error, but their quantitative determination was not
achieved in this work. For the clean-cut, high-silhouette strong and outlier
clusters collinearity is unlikely to introduce significant uncertainty, and
the analysis uncertainties would be expected to afflict mostly the weak
clusters more susceptible to source collocation and high spectral
similarities. However, such a distinct difference seems absent between the
estimated variabilities of strong and weak clusters, lending confidence that
the collinearity effect is not consequential for this set of data. The
rotational ambiguity of PMF remains an issue, and while we have done our best
to find the cleanest possible separation of the pollution and background
spectra, some degree of uncertainty is unavoidable. Although there some tools
have been proposed to assess the rotational sensitivity (e.g. bootstrapping;
Norris et al., 2008; Tibshirani et al., 2001), the exact level of mass
spectral uncertainty arising from the rotational ambiguity remains difficult
to quantify. Also as the standard
We examined the within-cluster variabilities of the aerosol types studied,
and calculated silhouette-weighted standard deviations as a function of
Within-cluster, silhouette-weighted variabilities parameterized using
constant, linear and exponential (quadratic) least squares regressions to the
clustered data (variability [
The variability parameters are especially important for (partially or fully)
constrained factor analysis, such as techniques utilizing the ME-2 algorithm.
For example, in the most commonly approach used in the Source Finder (SoFi), a single
value (
While advanced data analytical techniques, such as PMF, have already been widely adopted for AMS data reduction and feature extraction, the application of similar chemometric methods for AMS spectra identification and classification is as of yet an uncommon sight.
In this study we make a pitch for adopting some of the tried and tested
statistical methods from other mass spectrometric fields into the analysis of
AMS results. As a practical example we present a case of applying simple
clustering to a set of AMS pollution spectra, and show that even a simple
algorithm such as the
Clustering as a method is especially sensitive to certain parameters; the
algorithm used, data preprocessing (scaling) and the dissimilarity measure
(distance metric) used for the objects' spatial representation. In this work
we compared the performance of some of the most basic measures of dissimilarity in
Optimized mass scaling, that is, weighting the mass spectra signals by an
exponential function of their
Without scaling as a preprocessing step,
Limiting the role of PMF to solving short-term air pollution events and plumes with minimal number of factors facilitates identification of the physically meaningful rotation, which best temporally separates the pollution plume from the background aerosol. Using correlation minimum between the time series of the pollution and the background as a selection criterion minimizes the need for expert judgement by human analysts when exploring PMF solutions of pollution events. Similarly, applying computer-aided, unsupervised classification, any result should be more or less free of analyst bias when deciding the classifications of mass spectra to organic aerosol subtypes. An appropriately chosen clustering quality metric, such as the silhouette value, can be used to infer the natural number of clusters in data as well as to optimize scaling factors, to magnify the structures present in data. Naturally, this does not excuse the human analyst from the final responsibility of physicochemical interpretation, comparison and evaluation of the mathematical solutions produced by any classification algorithm. It should also be noted that the first phases of the task, of manually identifying pollution events and performing the feature extraction step for each of the events individually, is very labour-intensive and some subjectivity remains in deciding the selection criteria for what constitutes a pollution episode – hence the process should ideally be made more automatic and statistics based.
Despite the laboriousness, compared to currently used approaches such as trying to use PMF to directly extract source-specific, anthropogenic organic aerosol subtypes from extended data sets and identify the correct rotations from the resulting solution space, we suggest the methodology presented here has several advantages. (1) Due to limiting the PMF time windows to short pollution episodes explained with fewer factors and variability driven by pollution plume behaviour, the major factor analytical problem of rotational ambiguity is diminished and there is less need for expert judgement in the selection of solutions. Concentrating on short-lived events also better fulfils the factor analytical receptor model's assumption of constant profiles (i.e. aerosol bulk composition is not driven by chemical processes but a result mixing from different sources). (2) From the clustering solution it is possible to derive solid, quantitative estimates of the archetypal pollution spectra along with their uncertainties – information which has direct use and value when applying the reference spectra in e.g. constraining future factor analyses. Detailed chemical knowledge on pollution types may also help in further understanding the physicochemical properties of anthropogenic atmospheric aerosols and their interactions. Finally, (3) by analysing air pollution cases individually we can also identify and extract minor sources and identify outlier aerosol types, which fall way under the PMF's limit of detection, of explaining approximately 5 % of the variability of the total aerosol mass (Ulbrich et al., 2009). These outlier groups may ultimately prove important and offer new scientific information, as exemplified by the observation of suspected amine compounds presented in our results.
In our example of applying feature extraction (PMF) and unsupervised
classification (
We propose that optimizing the similarity metrics, both via correctly selecting the algorithm and data weighting, provides not only a basis for exploratory classification but also a means for identifying AMS spectra by comparing them with references available in the AMS spectral database. We can also see prospective use for exploratory classification, with clustering as an obvious example, in evaluating sets of discrete or deconvolved AMS spectral samples, such as the samples often produced in large numbers in bootstrapping or sensitivity analysis exercises when evaluating factor analytical models.
Ultimately, we hope to have demonstrated that statistics-based, computer-aided classification of AMS spectra seems promising, and in that the differences and characteristic features of mass spectra can indeed be parameterized for an increasingly machine-learning-oriented approach to AMS advanced data analysis.
Data used are available upon request from the authors. Cluster centroid
spectra will be made available in the AMS Spectral Database
(
The authors declare that they have no conflict of interest.
We acknowledge the funding from the following programmes: European Commission FP6 projects EUCAARI (036833-2), FP7 ACTRIS (262254), the Horizon 2020 project ACTRIS-2 (654109), ERC Grant COALA (638703), the Finnish COE project CRAICC (272041) and the Academy of Finland COE in Atmospheric Science (2008–2019).
We wish to thank the Hyytiälä and Helsinki technical staff (Pasi Aalto, Erkki Siivola, Frans Korhonen, Heikki Laakso, Toivo Pohja, Veijo Hiltunen and Janne Levula) for support with experimental measurements and the AMS users' community for valuable feedback on the data analytical topics.
The authors are grateful for the comments and suggestions provided by the anonymous referees for improving the manuscript. Edited by: D. Spracklen Reviewed by: two anonymous referees