An introduction to py-smps#

py-smps is a general-purpose library designed to make analyzing size-resolved aerosol data as easy and repeatable as possible. It is not meant to cover all possible instruments (by default, at least) or use cases (at least, not yet!). This guide should give you an overview of the capabilities of the software, but it is not completely comprehensive. Please read through the API documentation if you have any questions or post to the discussions section of the GitHub repository.

Working with data#

The data format for each make and model of particle-sizing instrument differs depending on the manufacturer’s specifications and how you have gone about logging the data. This library includes a few helper functions for common instruments (e.g., the TSI SMPS), but there are some general requirements that all instruments and datasets must meet in order to work with this library:

  1. The raw data must be a pandas DataFrame (pd.DataFrame). If you’re unfamiliar with pandas, it may be a good idea to work through one of the many great online tutorials or read through the pandas documentation.

  2. The index of the DataFrame should be a time series (though this is not explicitly required)

  3. There must be a unique column (and column name) for each particle size bin

So long as the above requirements are met, you should be able to use this library to analyze your data.
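For reference, here is a minimal sketch (with made-up values) of a DataFrame that satisfies these requirements:

import pandas as pd

# A made-up example: a datetime index plus one uniquely-named
# column per particle size bin
data = pd.DataFrame(
    {
        "bin0": [1.53, 1.66, 1.61],
        "bin1": [0.22, 0.18, 0.16],
        "bin2": [0.06, 0.08, 0.08],
    },
    index=pd.to_datetime([
        "2022-02-15 21:25:30",
        "2022-02-15 21:26:30",
        "2022-02-15 21:27:30",
    ]),
)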

[1]:
import smps
import warnings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl

# Filter out the warnings for this guide
warnings.simplefilter(action='ignore')

sns.set('notebook', style='ticks', font_scale=1.25, palette='colorblind')

# Set the default smps visuals
smps.set()

Example datasets#

The library includes a few example datasets to help you get up and running with examples and understand the desired structure of the data. To import a sample SMPS dataset from Boston, you can use the load_sample helper function:

[2]:
bos = smps.io.load_sample('boston')
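The returned object is ready to use right away; for example, you can peek at the number concentration by bin via the dn attribute (covered in more detail later in this guide):

bos.dn.head(3)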

Working with SMPS data from AIM#

TSI’s AIM software allows you to export data to analyze offline. There is a loader function (smps.io.load_file) to help you load and work with data specifically from AIM; however, each version of AIM is slightly different and isn’t always labeled with the correct version, making it difficult to maintain a general-purpose loader for these files. If your attempt to load data from AIM using the helper function is unsuccessful, please open an issue on the GitHub repository so that we can add support for your version of AIM or otherwise help you navigate the process.
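As a rough sketch (the file path here is hypothetical, and additional keyword arguments may be required depending on your AIM export format):

# A sketch only: the path is hypothetical; extra keyword
# arguments may be needed for your particular AIM version
obj = smps.io.load_file("path/to/your/aim-export.txt")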

Working with data via a CSV file#

If your data is in CSV format, you can use native pandas to load your data and ensure it is in the correct format. Below, we use a sample file from a QuantAQ MODULAIR-PM device, available in the GitHub repository, to demonstrate how you can load the data, convert the timestamp column to a Python datetime object, and get it ready for analysis.

[3]:
# Load the file into a pandas dataframe
df = pd.read_csv(
    "https://raw.githubusercontent.com/quant-aq/py-smps/master/tests/datafiles/MOD-PM-SAMPLE.csv",
    parse_dates=['timestamp']
).set_index('timestamp')

df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1009 entries, 2022-02-15 21:27:30+00:00 to 2022-02-15 04:39:32+00:00
Data columns (total 54 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               1009 non-null   int64
 1   timestamp_local  1009 non-null   object
 2   sn               1009 non-null   object
 3   sample_rh        1009 non-null   float64
 4   sample_temp      1009 non-null   float64
 5   sample_pres      1009 non-null   int64
 6   bin0             1009 non-null   float64
 7   bin1             1009 non-null   float64
 8   bin2             1009 non-null   float64
 9   bin3             1009 non-null   float64
 10  bin4             1009 non-null   float64
 11  bin5             1009 non-null   float64
 12  bin6             1009 non-null   float64
 13  bin7             1009 non-null   float64
 14  bin8             1009 non-null   float64
 15  bin9             1009 non-null   float64
 16  bin10            1009 non-null   float64
 17  bin11            1009 non-null   float64
 18  bin12            1009 non-null   float64
 19  bin13            1009 non-null   float64
 20  bin14            1009 non-null   float64
 21  bin15            1009 non-null   float64
 22  bin16            1009 non-null   float64
 23  bin17            1009 non-null   float64
 24  bin18            1009 non-null   float64
 25  bin19            1009 non-null   float64
 26  bin20            1009 non-null   float64
 27  bin21            1009 non-null   int64
 28  bin22            1009 non-null   int64
 29  bin23            1009 non-null   float64
 30  opcn3_temp       1009 non-null   int64
 31  opcn3_rh         1009 non-null   int64
 32  opcn3_pm1        1009 non-null   float64
 33  opcn3_pm25       1009 non-null   float64
 34  opcn3_pm10       1009 non-null   float64
 35  pm1_env          1009 non-null   float64
 36  pm25_env         1009 non-null   float64
 37  pm10_env         1009 non-null   float64
 38  neph_bin0        1009 non-null   float64
 39  neph_bin1        1009 non-null   float64
 40  neph_bin2        1009 non-null   float64
 41  neph_bin3        1009 non-null   float64
 42  neph_bin4        1009 non-null   int64
 43  neph_bin5        1009 non-null   int64
 44  flag             1009 non-null   int64
 45  lat              1009 non-null   float64
 46  lon              1009 non-null   float64
 47  device_state     1009 non-null   object
 48  pm10             1009 non-null   float64
 49  pm1_model_id     1009 non-null   int64
 50  pm25_model_id    1009 non-null   int64
 51  pm25             1009 non-null   float64
 52  pm10_model_id    1009 non-null   int64
 53  pm1              1009 non-null   float64
dtypes: float64(39), int64(12), object(3)
memory usage: 433.6+ KB

The data above, from a QuantAQ MODULAIR-PM, contains a total of 54 columns in addition to the timestamp column that has been set as the index. The ‘binned’ data corresponding to the OPC’s bins are labeled with the format “binx”, where “x” is an integer between 0 and 23 (24 total size bins). Each bin contains the particle concentration (pp/cc) for a given size range and is the raw data we need to perform the analysis; the other columns are considered metadata and are not explicitly needed. There is no need to remove them prior to continuing on.
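If you’d like to preview just the binned data, a simple pandas filter works (purely for inspection; the model classes introduced below handle this separation for you):

# Preview only the binned particle concentration columns (bin0 through bin23)
df.filter(regex=r"^bin\d+$").head(3)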

Now that we have the data in the proper format, we can go ahead and discuss the GenericParticleSizer object.

The GenericParticleSizer object#

The heart of the py-smps library is the GenericParticleSizer object. The GenericParticleSizer object is the base class for all available particle-sizing instruments and contains all of the basic functionality and methods used for making calculations and/or figures. A GenericParticleSizer assumes the following about any instrument:

  1. Raw data can be represented as a dataframe where there is one column for each particle size bin

  2. The instrument sizes particles into ‘bins’ which have a left boundary, midpoint, and right boundary

That’s it. Those are the only two things that must be true for the instrument and instrument data to work with this library.

Note

If you are creating a custom GenericParticleSizer object, there is a helper function for creating the n×3 array needed to properly define size bins (smps.utils.make_bins).
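If you would rather define the array by hand, here is a minimal numpy sketch of the expected shape (the values are illustrative, taken from the first few MODULAIR-PM bins shown later in this guide):

import numpy as np

# Each row is one size bin: [left boundary, midpoint, right boundary], in µm
my_bins = np.array([
    [0.35, 0.405, 0.46],
    [0.46, 0.56, 0.66],
    [0.66, 0.83, 1.00],
])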

In addition to the GenericParticleSizer object, which is the parent class, there are a number of child classes for individual instruments that make working with them easier. These include:

  • SMPS

  • Grimm11D

  • POPS

  • ParticlesPlus

  • AlphasenseOPCN2

  • AlphasenseOPCN3

  • Modulair

  • ModulairPM

All of these classes have the same methods and capabilities as their parent class, and it is recommended that you use one of these classes where appropriate. If you use the parent class, you will need to define the size bins yourself, which is only recommended when you have either changed the default parameters of the instrument or are working with an instrument that is not supported by default.
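If you do go the parent-class route, a rough sketch might look like the following (assuming df holds your data and my_bins is an n×3 array as described above; treat the exact keyword arguments as an assumption and check the API documentation):

# A sketch only: construct the parent class directly with custom bins
obj = smps.models.GenericParticleSizer(data=df, bins=my_bins)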

For the rest of this tutorial, we will use the data we loaded above for the QuantAQ MODULAIR-PM device. We will walk through the basic functionality of the GenericParticleSizer class as shown via the ModulairPM class.

Initializing an instance of the GenericParticleSizer class#

To begin, we need to create an instance of the object with the data. Since we are working with data from a QuantAQ MODULAIR-PM, we will use the ModulairPM class:

[4]:
# Load the data into a ModulairPM object
obj = smps.models.ModulairPM(data=df)

Explore the attributes of the GenericParticleSizer#

Let’s begin by exploring all of the attributes available to us. As mentioned above, all GenericParticleSizer objects are defined by their data and bins. To view the bin definition, you can use the bins attribute:

[5]:
obj.bins
[5]:
array([[ 0.35 ,  0.405,  0.46 ],
       [ 0.46 ,  0.56 ,  0.66 ],
       [ 0.66 ,  0.83 ,  1.   ],
       [ 1.   ,  1.15 ,  1.3  ],
       [ 1.3  ,  1.5  ,  1.7  ],
       [ 1.7  ,  2.   ,  2.3  ],
       [ 2.3  ,  2.65 ,  3.   ],
       [ 3.   ,  3.5  ,  4.   ],
       [ 4.   ,  4.6  ,  5.2  ],
       [ 5.2  ,  5.85 ,  6.5  ],
       [ 6.5  ,  7.25 ,  8.   ],
       [ 8.   ,  9.   , 10.   ],
       [10.   , 11.   , 12.   ],
       [12.   , 13.   , 14.   ],
       [14.   , 15.   , 16.   ],
       [16.   , 17.   , 18.   ],
       [18.   , 19.   , 20.   ],
       [20.   , 21.   , 22.   ],
       [22.   , 23.5  , 25.   ],
       [25.   , 26.5  , 28.   ],
       [28.   , 29.5  , 31.   ],
       [31.   , 32.5  , 34.   ],
       [34.   , 35.5  , 37.   ],
       [37.   , 38.5  , 40.   ]])

We see an n×3 array where each bin is defined by its left boundary, midpoint, and right boundary. The units for these values are µm.

If you want to grab just the midpoints, you can use the midpoints attribute, which returns a 1-dimensional array of length n:

[6]:
obj.midpoints
[6]:
array([ 0.405,  0.56 ,  0.83 ,  1.15 ,  1.5  ,  2.   ,  2.65 ,  3.5  ,
        4.6  ,  5.85 ,  7.25 ,  9.   , 11.   , 13.   , 15.   , 17.   ,
       19.   , 21.   , 23.5  , 26.5  , 29.5  , 32.5  , 35.5  , 38.5  ])

We mentioned above that any additional columns are treated as metadata; those can be accessed via the scan_stats attribute:

[7]:
obj.scan_stats.head(3)
[7]:
id sample_temp pm25_env pm1 pm1_model_id lon timestamp_local neph_bin1 flag pm25 ... pm25_model_id neph_bin0 pm10_env pm10 device_state opcn3_rh neph_bin3 neph_bin4 neph_bin2 pm10_model_id
timestamp
2022-02-15 21:27:30+00:00 87893809 7.00 0.0 0.3809 3380 -103.757 2022-02-15T14:27:30Z 40.625 0 0.6173 ... 3381 47.625 0.0 2.5584 ACTIVE 0 0.0 0 0.0 3382
2022-02-15 21:26:30+00:00 87893808 6.99 0.0 0.3383 3380 -103.757 2022-02-15T14:26:30Z 40.875 0 0.7299 ... 3381 42.250 0.0 0.8970 ACTIVE 0 0.0 0 0.0 3382
2022-02-15 21:25:30+00:00 87893807 6.98 0.0 0.3091 3380 -103.757 2022-02-15T14:25:30Z 36.125 0 0.6270 ... 3381 38.625 0.0 2.0459 ACTIVE 0 0.0 0 0.0 3382

3 rows × 30 columns

You will notice that the above call returns all columns in the original data that are not “binned” particle concentrations.

Next, there are a number of different ways in which you can grab and view the raw data, all available as attributes. These include:

  • dn: particle number concentration by bin

  • ds: surface area by bin

  • dv: volume by bin

  • dndlogdp: normalized particle number concentration by bin

  • dsdlogdp: normalized surface area by bin

  • dvdlogdp: normalized volume by bin

As an example, here is the number concentration by bin:

[8]:
obj.dn.head(3)
[8]:
bin0 bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 ... bin14 bin15 bin16 bin17 bin18 bin19 bin20 bin21 bin22 bin23
timestamp
2022-02-15 21:27:30+00:00 1.5300 0.2194 0.0579 0.0064 0.0097 0.0129 0.0129 0.0032 0.0033 0.0000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-02-15 21:26:30+00:00 1.6633 0.1843 0.0811 0.0225 0.0228 0.0227 0.0163 0.0000 0.0000 0.0000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-02-15 21:25:30+00:00 1.6052 0.1605 0.0767 0.0134 0.0132 0.0133 0.0201 0.0033 0.0033 0.0033 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 24 columns

And here is the normalized number concentration by bin:

[9]:
obj.dndlogdp.head(3)
[9]:
bin0 bin1 bin2 bin3 bin4 bin5 bin6 bin7 bin8 bin9 ... bin14 bin15 bin16 bin17 bin18 bin19 bin20 bin21 bin22 bin23
timestamp
2022-02-15 21:27:30+00:00 12.890747 1.399359 0.320854 0.056168 0.083258 0.098264 0.111791 0.025613 0.028962 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-02-15 21:26:30+00:00 14.013843 1.175487 0.449417 0.197467 0.195699 0.172914 0.141256 0.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2022-02-15 21:25:30+00:00 13.524331 1.023688 0.425034 0.117602 0.113299 0.101311 0.174187 0.026413 0.028962 0.034052 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

3 rows × 24 columns

Explore the methods of the GenericParticleSizer#

There are several key methods available under the GenericParticleSizer class that allow you to quickly and easily compute various parameters. They can be summarized as follows:

  • copy: create a copy of the model

  • dump: save a local copy of the model to disk

  • resample: resample data onto a new time basis

  • slice: slice the data between two datetimes

  • stats: compute the aerosol size distribution statistics (e.g., number of particles, GM, GSD, etc.)

  • integrate: compute the integrated number, surface area, volume, or mass between two diameters

Below, we will walk through each of these methods to demonstrate their capabilities and document their use cases.

copy#

The copy method is used to create a copy of the model so that you can slice it or modify it in some non-reversible way without affecting the original:

[10]:
obj2 = obj.copy()

obj2
[10]:
<smps.models.ModulairPM at 0x1502104f0>

dump#

The dump method is used to pickle the model and save a copy locally at the given filepath:

[11]:
obj.dump("smps-obj-1.sav")
[11]:
['smps-obj-1.sav']

Note

Currently, this file must be saved in the directory you're currently working in; otherwise, joblib will be unhappy.
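To restore a saved model, you can load the file with joblib directly (a minimal sketch; dump relies on joblib under the hood):

import joblib

# Load a previously dumped model back into memory
obj_restored = joblib.load("smps-obj-1.sav")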

resample#

The resample method allows you to resample your data onto a new time base. For example, if you are originally working with 1-minute data but want to work with hourly data, you can resample:

[12]:
rs = obj.resample(rs='1h', inplace=False)

slice#

The slice method allows you to quickly remove all data that is not between the start and end timestamps. You can use the inplace argument to determine whether a new copy is returned or the current data is overwritten:

[13]:
obj2 = obj.slice(start="2022-02-15 22:00", end="2022-02-15 04:00", inplace=False)

stats#

The stats method allows you to easily calculate common aerosol distribution statistics as weighted by one of (‘number’, ‘surface’, ‘volume’, or ‘mass’). The values returned include:

  • number: the total number of particles

  • surface_area: the total surface area of particles

  • volume: the total volume of particles

  • mass: the total mass of particles

  • AM: the arithmetic mean particle diameter

  • GM: the geometric mean particle diameter

  • Mode: the mode particle diameter

  • GSD: the geometric standard deviation

[14]:
# Compute the number-weighted stats for the entire size distribution
stats = obj.stats(weight='number')

stats.head(3)
[14]:
number surface_area volume mass AM GM Mode GSD
timestamp
2022-02-15 21:27:30+00:00 1.8589 2.542669 1.171222 1.932516 496.016999 451.248888 405.0 1.381490
2022-02-15 21:26:30+00:00 2.0130 2.113701 0.411202 0.678483 493.207899 455.856515 405.0 1.383312
2022-02-15 21:25:30+00:00 1.9123 2.612000 0.967307 1.596057 504.500863 454.677924 405.0 1.415795

You can also limit the diameter range over which statistics are computed using the dmin and dmax arguments:

[15]:
# Compute the number-weighted stats for just PM2.5
stats = obj.stats(weight='number', dmin=0, dmax=2.5)

stats.head(3)
[15]:
number surface_area volume mass AM GM Mode GSD
timestamp
2022-02-15 21:27:30+00:00 1.839986 1.468447 0.202913 0.334807 460.899308 441.685395 405.0 1.274885
2022-02-15 21:26:30+00:00 2.001357 1.856839 0.297754 0.491294 480.660802 451.212560 405.0 1.344825
2022-02-15 21:25:30+00:00 1.888043 1.594091 0.239221 0.394714 466.448930 443.238476 405.0 1.303166

We can also go ahead and compute the volume-weighted stats:

[16]:
# Compute the volume-weighted stats for PM2.5
stats = obj.stats(weight='volume', dmin=0., dmax=2.5)

stats.head(3)
[16]:
number surface_area volume mass AM GM Mode GSD
timestamp
2022-02-15 21:27:30+00:00 1.839986 1.468447 0.202913 0.334807 1390.018045 1087.859802 405.0 2.104975
2022-02-15 21:26:30+00:00 2.001357 1.856839 0.297754 0.491294 1492.979396 1232.860263 2000.0 1.968189
2022-02-15 21:25:30+00:00 1.888043 1.594091 0.239221 0.394714 1491.962402 1185.952258 2650.0 2.080376

integrate#

The integrate method is certainly the most interesting, complex, and most-used method. It can compute integrated mass values and number concentrations, and it can make adjustments for density and hygroscopic growth! We’ll go over many of these options below.

First, we will begin with the most common use case: integrating the number of particles between two diameters. If we want to calculate the total number of particles between 0 and 1 µm, we can do so as follows:

[17]:
obj.integrate(weight='number', dmin=0, dmax=1)
[17]:
timestamp
2022-02-15 21:27:30+00:00    1.8073
2022-02-15 21:26:30+00:00    1.9287
2022-02-15 21:25:30+00:00    1.8424
2022-02-15 21:24:30+00:00    2.0080
2022-02-15 21:23:30+00:00    1.8171
                              ...
2022-02-15 04:43:32+00:00    1.2142
2022-02-15 04:42:32+00:00    1.3960
2022-02-15 04:41:32+00:00    1.2146
2022-02-15 04:40:32+00:00    1.3789
2022-02-15 04:39:32+00:00    1.3231
Length: 1009, dtype: float64

If we want to compute the PM1 value (integrated mass between 0 and 1 µm):

[18]:
obj.integrate(weight='mass', dmin=0., dmax=1., rho=1.65)
[18]:
timestamp
2022-02-15 21:27:30+00:00    0.149699
2022-02-15 21:26:30+00:00    0.163484
2022-02-15 21:25:30+00:00    0.154365
2022-02-15 21:24:30+00:00    0.155946
2022-02-15 21:23:30+00:00    0.143755
                               ...
2022-02-15 04:43:32+00:00    0.093075
2022-02-15 04:42:32+00:00    0.106044
2022-02-15 04:41:32+00:00    0.098637
2022-02-15 04:40:32+00:00    0.111538
2022-02-15 04:39:32+00:00    0.109767
Length: 1009, dtype: float64

Above, we assumed a constant density of 1.65 g/cm³; however, the integrate method also supports custom functions to compute density as a function of particle diameter. To do so, you simply need to provide a callable function that returns the desired density for a given particle diameter in µm. For example, if we want to compute PM10 and assume that density is 1.65 g/cm³ below 3 µm and 2.5 g/cm³ above 3 µm:

[19]:
def custom_density(dp):
    """Calculate density as a function of particle diameter in µm.

    Parameters
    ----------
    dp : float
        Particle diameter in microns.
    """
    return 1.65 if dp < 3 else 2.5

# Compute PM10
obj.integrate(weight='mass', dmin=0., dmax=10., rho=custom_density)
[19]:
timestamp
2022-02-15 21:27:30+00:00    2.679262
2022-02-15 21:26:30+00:00    0.678483
2022-02-15 21:25:30+00:00    2.096019
2022-02-15 21:24:30+00:00    1.577014
2022-02-15 21:23:30+00:00    1.631535
                               ...
2022-02-15 04:43:32+00:00    0.849622
2022-02-15 04:42:32+00:00    1.532512
2022-02-15 04:41:32+00:00    2.868572
2022-02-15 04:40:32+00:00    5.942367
2022-02-15 04:39:32+00:00    2.217525
Length: 1009, dtype: float64

One additional very cool feature is the ability to correct measurements for hygroscopic growth! While this likely isn’t relevant for research-grade or EPA FEM instrumentation, hygroscopic growth due to water uptake is a big challenge for optical instruments operating in ambient environments without inline dryers. While there are several peer-reviewed approaches to correcting for such growth, we follow an approach originally proposed by Di Antonio et al. (2018), which corrects the size of the underlying bins prior to computing the step-wise integration.
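For intuition, the growth factor in this family of corrections follows κ-Köhler theory; here is a minimal sketch of the general parameterization (not necessarily the library’s exact implementation):

def growth_factor(kappa, rh):
    """Approximate the diameter growth factor at a given RH (in %).

    A sketch of the standard kappa-Kohler parameterization; the
    library's internal implementation may differ in detail.
    """
    return (1 + kappa * rh / (100.0 - rh)) ** (1.0 / 3.0)

# e.g., at 85% RH with kappa=0.3, diameters grow by roughly 1.4x
print(growth_factor(kappa=0.3, rh=85))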

To compute integrated mass values post hygroscopic-growth correction, you must provide a few pieces of additional information:

  1. You must provide an estimate for the hygroscopic growth correction factor, kappa. This can be either a constant or a callable function

  2. You must provide the column name where relative humidity is stored

For example, if we want to compute the corrected PM1 value with a fixed kappa of 0.3:

[20]:
obj.integrate(weight='mass', dmin=0, dmax=1, kappa=0.3, rh='sample_rh')
[20]:
timestamp
2022-02-15 21:27:30+00:00    0.127272
2022-02-15 21:26:30+00:00    0.143090
2022-02-15 21:25:30+00:00    0.133221
2022-02-15 21:24:30+00:00    0.133941
2022-02-15 21:23:30+00:00    0.124912
                               ...
2022-02-15 04:43:32+00:00    0.081037
2022-02-15 04:42:32+00:00    0.095688
2022-02-15 04:41:32+00:00    0.085639
2022-02-15 04:40:32+00:00    0.098006
2022-02-15 04:39:32+00:00    0.093921
Length: 1009, dtype: float64

If we want to define a complex callable function for kappa alongside a callable function for density:

[21]:
def custom_density(dp):
    """Calculate density as a function of particle diameter in µm.

    Parameters
    ----------
    dp : float
        Particle diameter in microns.
    """
    return 1.65 if dp < 3 else 2.5

def custom_kappa(dp):
    """Calculate kappa as a function of particle diameter in µm.

    Parameters
    ----------
    dp : float
        Particle diameter in microns.
    """
    return 0.3 if dp < 1 else 0.05

# Compute PM10
obj.integrate(
    weight='mass', dmin=0., dmax=10.,
    rho=custom_density, kappa=custom_kappa, rh='sample_rh'
)
[21]:
timestamp
2022-02-15 21:27:30+00:00    2.578059
2022-02-15 21:26:30+00:00    0.638148
2022-02-15 21:25:30+00:00    2.014149
2022-02-15 21:24:30+00:00    1.508324
2022-02-15 21:23:30+00:00    1.562393
                               ...
2022-02-15 04:43:32+00:00    0.811724
2022-02-15 04:42:32+00:00    1.471832
2022-02-15 04:41:32+00:00    2.768024
2022-02-15 04:40:32+00:00    5.742915
2022-02-15 04:39:32+00:00    2.510176
Length: 1009, dtype: float64

That about sums up the basic functionality of the py-smps library. For more detailed information, please check out the API documentation and/or the additional guides that cover more specific topics.