An introduction to py-smps#
py-smps is a general-purpose library designed to make analyzing size-resolved aerosol data as easy and repeatable as possible. It is not meant to cover all possible instruments (by default, at least) or use cases (at least, not yet!). This guide should give you an overview of the capabilities of the software, but it is not completely comprehensive. Please read through the API documentation if you have any questions or post to the discussions section of the GitHub repository.
Working with data#
The data format for each make and model of particle-sizer instrument differs depending on the manufacturer’s specifications and how you have gone about logging the data. This library includes a few helper functions for common instruments (e.g., TSI SMPS), but there are some general requirements that all instruments and datasets need to meet in order to work with this library:
- The raw data must be a pandas DataFrame (pd.DataFrame). If you’re unfamiliar with pandas, it may be a good idea to read one of the many great online tutorials or read through the docs here.
- The index of the DataFrame should be a time series (though this is not explicitly required)
- There must be a unique column (and column name) for each particle size bin
So long as the above requirements are met, you should be able to use this library to analyze your data.
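As a concrete illustration, here is a minimal, entirely synthetic DataFrame that satisfies these requirements (the bin names and values here are made up for illustration):

import pandas as pd

# A minimal synthetic example: a datetime index plus one uniquely
# named column per particle size bin
data = pd.DataFrame(
    {
        "bin0": [1.53, 1.66, 1.61],
        "bin1": [0.22, 0.18, 0.16],
        "bin2": [0.06, 0.08, 0.08],
    },
    index=pd.to_datetime(
        ["2022-02-15 21:25", "2022-02-15 21:26", "2022-02-15 21:27"]
    ),
)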
[1]:
import smps
import warnings
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
# Filter out the warnings for this guide
warnings.simplefilter(action='ignore')
sns.set('notebook', style='ticks', font_scale=1.25, palette='colorblind')
# Set the default smps visuals
smps.set()
Example datasets#
The library includes a few example datasets to help you get up and running and to illustrate the expected structure of the data. To import a sample SMPS dataset from Boston, you can use the load_sample helper function:
[2]:
bos = smps.io.load_sample('boston')
Working with SMPS data from AIM#
TSI’s AIM software allows you to export data to analyze offline. There is a loader function (smps.io.load_file) to help you load and work with data specifically from AIM; however, each version of AIM is slightly different and isn’t always labeled with the correct version, making it difficult to maintain a general-purpose loader for these files. If your attempt to load data from AIM using the helper function is unsuccessful, please open an issue on the GitHub repository so that we can add support for your version of AIM or otherwise help you navigate the process.
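A minimal sketch of what a call might look like (the filename here is hypothetical, and depending on your AIM version you may need additional keyword arguments):

# Attempt to load an AIM export file (hypothetical path)
sample = smps.io.load_file("aim_export.txt")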
Working with data via a csv file#
If your data is in csv format, you can use pandas directly to load your data and ensure it is in the correct format. Below, we use a sample file from a QuantAQ MODULAIR-PM device, available in the GitHub repository, to demonstrate how to load the data, convert the timestamp column to a datetime object, and get it ready for analysis.
[3]:
# Load the file into a pandas dataframe
df = pd.read_csv(
"https://raw.githubusercontent.com/quant-aq/py-smps/master/tests/datafiles/MOD-PM-SAMPLE.csv",
parse_dates=['timestamp']
).set_index('timestamp')
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1009 entries, 2022-02-15 21:27:30+00:00 to 2022-02-15 04:39:32+00:00
Data columns (total 54 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 1009 non-null int64
1 timestamp_local 1009 non-null object
2 sn 1009 non-null object
3 sample_rh 1009 non-null float64
4 sample_temp 1009 non-null float64
5 sample_pres 1009 non-null int64
6 bin0 1009 non-null float64
7 bin1 1009 non-null float64
8 bin2 1009 non-null float64
9 bin3 1009 non-null float64
10 bin4 1009 non-null float64
11 bin5 1009 non-null float64
12 bin6 1009 non-null float64
13 bin7 1009 non-null float64
14 bin8 1009 non-null float64
15 bin9 1009 non-null float64
16 bin10 1009 non-null float64
17 bin11 1009 non-null float64
18 bin12 1009 non-null float64
19 bin13 1009 non-null float64
20 bin14 1009 non-null float64
21 bin15 1009 non-null float64
22 bin16 1009 non-null float64
23 bin17 1009 non-null float64
24 bin18 1009 non-null float64
25 bin19 1009 non-null float64
26 bin20 1009 non-null float64
27 bin21 1009 non-null int64
28 bin22 1009 non-null int64
29 bin23 1009 non-null float64
30 opcn3_temp 1009 non-null int64
31 opcn3_rh 1009 non-null int64
32 opcn3_pm1 1009 non-null float64
33 opcn3_pm25 1009 non-null float64
34 opcn3_pm10 1009 non-null float64
35 pm1_env 1009 non-null float64
36 pm25_env 1009 non-null float64
37 pm10_env 1009 non-null float64
38 neph_bin0 1009 non-null float64
39 neph_bin1 1009 non-null float64
40 neph_bin2 1009 non-null float64
41 neph_bin3 1009 non-null float64
42 neph_bin4 1009 non-null int64
43 neph_bin5 1009 non-null int64
44 flag 1009 non-null int64
45 lat 1009 non-null float64
46 lon 1009 non-null float64
47 device_state 1009 non-null object
48 pm10 1009 non-null float64
49 pm1_model_id 1009 non-null int64
50 pm25_model_id 1009 non-null int64
51 pm25 1009 non-null float64
52 pm10_model_id 1009 non-null int64
53 pm1 1009 non-null float64
dtypes: float64(39), int64(12), object(3)
memory usage: 433.6+ KB
The data above, from a QuantAQ MODULAIR-PM, contain a total of 53 columns in addition to the timestamp column that has been set as the index. The ‘binned’ data corresponding to the OPC’s bins are labeled with the format “binX”, where “X” is an integer between 0 and 23 (24 total size bins). Each bin contains the particle concentration (particles/cc) in a given size range and is the raw data we need to perform the analysis; the other columns are considered metadata and are not explicitly needed. There is no need to remove these columns before continuing.
Now that we have the data in the proper format, we can go ahead and discuss the GenericParticleSizer object.
The GenericParticleSizer object#
The heart of the py-smps library is the GenericParticleSizer object. The GenericParticleSizer object is the base class for all available particle sizing instruments and contains all of the basic functionality and methods used for making calculations and/or figures. A GenericParticleSizer assumes the following about any instrument:
- Raw data can be represented as a dataframe where there is one column for each particle size bin
- The instrument sizes particles into ‘bins’, each of which has a left boundary, midpoint, and right boundary
That’s it. Those are the only two things that must be true for the instrument and instrument data to work with this library.
Note
If you are creating a custom GenericParticleSizer object, there is a helper function (smps.utils.make_bins) for creating the n×3 array needed to properly define the size bins.
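For example, here is one way to construct such an array by hand with numpy (a sketch; the boundaries below are arbitrary):

import numpy as np

# Define bins by their boundaries (in µm); each row of the result
# is [left boundary, midpoint, right boundary] for one bin
boundaries = np.array([0.35, 0.46, 0.66, 1.0])
bins = np.vstack([
    boundaries[:-1],
    (boundaries[:-1] + boundaries[1:]) / 2,
    boundaries[1:],
]).T

# Hypothetical: hand the array to the parent class along with a
# DataFrame that has one column per bin
# obj = smps.models.GenericParticleSizer(data=my_df, bins=bins)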
In addition to the GenericParticleSizer object, which is the parent class, there are a number of child classes for individual instruments that make working with them easier. These include:
- SMPS
- Grimm11D
- POPS
- ParticlesPlus
- AlphasenseOPCN2
- AlphasenseOPCN3
- Modulair
- ModulairPM
All of these classes have the same methods and capabilities as their parent class, and it is recommended that you use one of them where appropriate. If you use the parent class directly, you will need to define the size bins yourself; this is only recommended when you have either changed the default parameters of the instrument or are working with an instrument that is not supported by default.
For the rest of this tutorial, we will use the data we loaded above for the QuantAQ MODULAIR-PM device. We will walk through the basic functionality of the GenericParticleSizer class as shown via the ModulairPM class.
Initializing an instance of the GenericParticleSizer class#
To begin, we need to create an instance of the object with the data. Since we are working with data from a QuantAQ MODULAIR-PM, we will use the ModulairPM class:
[4]:
# Load the data into a ModulairPM object
obj = smps.models.ModulairPM(data=df)
Explore the attributes of the GenericParticleSizer#
Let’s begin by exploring all of the attributes available to us. As mentioned above, all GenericParticleSizer objects are defined by their data and bins. To view the bin definition, you can use the bins attribute:
[5]:
obj.bins
[5]:
array([[ 0.35 , 0.405, 0.46 ],
[ 0.46 , 0.56 , 0.66 ],
[ 0.66 , 0.83 , 1. ],
[ 1. , 1.15 , 1.3 ],
[ 1.3 , 1.5 , 1.7 ],
[ 1.7 , 2. , 2.3 ],
[ 2.3 , 2.65 , 3. ],
[ 3. , 3.5 , 4. ],
[ 4. , 4.6 , 5.2 ],
[ 5.2 , 5.85 , 6.5 ],
[ 6.5 , 7.25 , 8. ],
[ 8. , 9. , 10. ],
[10. , 11. , 12. ],
[12. , 13. , 14. ],
[14. , 15. , 16. ],
[16. , 17. , 18. ],
[18. , 19. , 20. ],
[20. , 21. , 22. ],
[22. , 23.5 , 25. ],
[25. , 26.5 , 28. ],
[28. , 29.5 , 31. ],
[31. , 32.5 , 34. ],
[34. , 35.5 , 37. ],
[37. , 38.5 , 40. ]])
We see an n×3 array where each bin is defined by its left boundary, midpoint, and right boundary. The units for these values are µm.
If you want to grab just the midpoints, you can use the midpoints attribute, which returns a 1-D array of length n:
[6]:
obj.midpoints
[6]:
array([ 0.405, 0.56 , 0.83 , 1.15 , 1.5 , 2. , 2.65 , 3.5 ,
4.6 , 5.85 , 7.25 , 9. , 11. , 13. , 15. , 17. ,
19. , 21. , 23.5 , 26.5 , 29.5 , 32.5 , 35.5 , 38.5 ])
We mentioned above that any additional columns are available as metadata; those can be accessed via the scan_stats attribute:
[7]:
obj.scan_stats.head(3)
[7]:
id | sample_temp | pm25_env | pm1 | pm1_model_id | lon | timestamp_local | neph_bin1 | flag | pm25 | ... | pm25_model_id | neph_bin0 | pm10_env | pm10 | device_state | opcn3_rh | neph_bin3 | neph_bin4 | neph_bin2 | pm10_model_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | |||||||||||||||||||||
2022-02-15 21:27:30+00:00 | 87893809 | 7.00 | 0.0 | 0.3809 | 3380 | -103.757 | 2022-02-15T14:27:30Z | 40.625 | 0 | 0.6173 | ... | 3381 | 47.625 | 0.0 | 2.5584 | ACTIVE | 0 | 0.0 | 0 | 0.0 | 3382 |
2022-02-15 21:26:30+00:00 | 87893808 | 6.99 | 0.0 | 0.3383 | 3380 | -103.757 | 2022-02-15T14:26:30Z | 40.875 | 0 | 0.7299 | ... | 3381 | 42.250 | 0.0 | 0.8970 | ACTIVE | 0 | 0.0 | 0 | 0.0 | 3382 |
2022-02-15 21:25:30+00:00 | 87893807 | 6.98 | 0.0 | 0.3091 | 3380 | -103.757 | 2022-02-15T14:25:30Z | 36.125 | 0 | 0.6270 | ... | 3381 | 38.625 | 0.0 | 2.0459 | ACTIVE | 0 | 0.0 | 0 | 0.0 | 3382 |
3 rows × 30 columns
You will notice that the above call returns all columns in the original data that are not “binned” particle concentrations.
Next, there are a number of different ways in which you can grab and view the raw data, all available as attributes. These include:
- dn: particle number concentration by bin
- ds: surface area by bin
- dv: volume by bin
- dndlogdp: normalized particle number concentration by bin
- dsdlogdp: normalized surface area by bin
- dvdlogdp: normalized volume by bin
As an example, here is the number concentration by bin:
[8]:
obj.dn.head(3)
[8]:
bin0 | bin1 | bin2 | bin3 | bin4 | bin5 | bin6 | bin7 | bin8 | bin9 | ... | bin14 | bin15 | bin16 | bin17 | bin18 | bin19 | bin20 | bin21 | bin22 | bin23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | |||||||||||||||||||||
2022-02-15 21:27:30+00:00 | 1.5300 | 0.2194 | 0.0579 | 0.0064 | 0.0097 | 0.0129 | 0.0129 | 0.0032 | 0.0033 | 0.0000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2022-02-15 21:26:30+00:00 | 1.6633 | 0.1843 | 0.0811 | 0.0225 | 0.0228 | 0.0227 | 0.0163 | 0.0000 | 0.0000 | 0.0000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2022-02-15 21:25:30+00:00 | 1.6052 | 0.1605 | 0.0767 | 0.0134 | 0.0132 | 0.0133 | 0.0201 | 0.0033 | 0.0033 | 0.0033 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 rows × 24 columns
And here is the normalized number concentration by bin:
[9]:
obj.dndlogdp.head(3)
[9]:
bin0 | bin1 | bin2 | bin3 | bin4 | bin5 | bin6 | bin7 | bin8 | bin9 | ... | bin14 | bin15 | bin16 | bin17 | bin18 | bin19 | bin20 | bin21 | bin22 | bin23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
timestamp | |||||||||||||||||||||
2022-02-15 21:27:30+00:00 | 12.890747 | 1.399359 | 0.320854 | 0.056168 | 0.083258 | 0.098264 | 0.111791 | 0.025613 | 0.028962 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2022-02-15 21:26:30+00:00 | 14.013843 | 1.175487 | 0.449417 | 0.197467 | 0.195699 | 0.172914 | 0.141256 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2022-02-15 21:25:30+00:00 | 13.524331 | 1.023688 | 0.425034 | 0.117602 | 0.113299 | 0.101311 | 0.174187 | 0.026413 | 0.028962 | 0.034052 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 rows × 24 columns
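The normalization divides each bin’s concentration by the bin’s base-10 logarithmic width, which you can verify directly from the bins attribute (a quick sanity check, not part of the library API):

import numpy as np

# dN/dlogDp is dN divided by the log10 width of each bin
dlogdp = np.log10(obj.bins[:, -1]) - np.log10(obj.bins[:, 0])
manual = obj.dn / dlogdp

# For bin0 in the first row: 1.53 / log10(0.46 / 0.35) ≈ 12.89,
# matching the dndlogdp output above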
Explore the methods of the GenericParticleSizer#
There are several key methods available on the GenericParticleSizer class that allow you to quickly and easily compute various parameters. They can be summarized as follows:
- copy: create a copy of the model
- dump: save a local copy of the model to disk
- resample: resample the data onto a new time basis
- slice: slice the data between two datetimes
- stats: compute the aerosol size distribution statistics (e.g., number of particles, GM, GSD, etc.)
- integrate: compute the integrated number, surface area, volume, or mass between two diameters
Below, we will walk through each of these methods to demonstrate their capabilities and document their use cases.
copy#
The copy method is used to create a copy of the model so that you can slice it or modify it in some irreversible way:
[10]:
obj2 = obj.copy()
obj2
[10]:
<smps.models.ModulairPM at 0x1502104f0>
dump#
The dump method is used to pickle the model and save a copy locally at the given filepath:
[11]:
obj.dump("smps-obj-1.sav")
[11]:
['smps-obj-1.sav']
Note
Currently, this file must be saved in the directory you are currently working in; otherwise, joblib will be unhappy.
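Since the model is pickled with joblib, you can restore it later with joblib directly (a minimal sketch):

import joblib

# Load the previously dumped model back from disk
obj_restored = joblib.load("smps-obj-1.sav")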
resample#
The resample method allows you to resample your data onto a new time base. For example, if you are originally working with 1-minute data but want to work with hourly data, you can resample:
[12]:
rs = obj.resample(rs='1h', inplace=False)
slice#
The slice method allows you to quickly remove all data that is not between the start and end timestamps. You can use the inplace argument to determine whether a new copy is returned or the current data is overwritten:
[13]:
obj2 = obj.slice(start="2022-02-15 22:00", end="2022-02-15 04:00", inplace=False)
stats#
The stats method allows you to easily calculate common aerosol distribution statistics, weighted by one of (‘number’, ‘surface’, ‘volume’, or ‘mass’). The values returned include:
- number: the total number of particles
- surface_area: the total surface area of particles
- volume: the total volume of particles
- mass: the total mass of particles
- AM: the arithmetic mean particle diameter
- GM: the geometric mean particle diameter
- Mode: the mode particle diameter
- GSD: the geometric standard deviation
[14]:
# Compute the number-weighted stats for the entire size distribution
stats = obj.stats(weight='number')
stats.head(3)
[14]:
number | surface_area | volume | mass | AM | GM | Mode | GSD | |
---|---|---|---|---|---|---|---|---|
timestamp | ||||||||
2022-02-15 21:27:30+00:00 | 1.8589 | 2.542669 | 1.171222 | 1.932516 | 496.016999 | 451.248888 | 405.0 | 1.381490 |
2022-02-15 21:26:30+00:00 | 2.0130 | 2.113701 | 0.411202 | 0.678483 | 493.207899 | 455.856515 | 405.0 | 1.383312 |
2022-02-15 21:25:30+00:00 | 1.9123 | 2.612000 | 0.967307 | 1.596057 | 504.500863 | 454.677924 | 405.0 | 1.415795 |
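For reference, you can approximately reproduce the GM column by hand (a sketch assuming the conventional number-weighted definition; note that stats reports diameters in nm, which is why the Mode of 405.0 matches the first bin midpoint of 0.405 µm):

import numpy as np

# Number-weighted geometric mean diameter, computed manually
# (midpoints are in µm, so multiply by 1e3 to express in nm)
weights = obj.dn.div(obj.dn.sum(axis=1), axis=0)
gm_manual = np.exp((weights * np.log(obj.midpoints * 1e3)).sum(axis=1))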
You can also limit the diameter range over which you compute statistics using the dmin and dmax arguments:
[15]:
# Compute the number-weighted stats for just PM2.5
stats = obj.stats(weight='number', dmin=0, dmax=2.5)
stats.head(3)
[15]:
number | surface_area | volume | mass | AM | GM | Mode | GSD | |
---|---|---|---|---|---|---|---|---|
timestamp | ||||||||
2022-02-15 21:27:30+00:00 | 1.839986 | 1.468447 | 0.202913 | 0.334807 | 460.899308 | 441.685395 | 405.0 | 1.274885 |
2022-02-15 21:26:30+00:00 | 2.001357 | 1.856839 | 0.297754 | 0.491294 | 480.660802 | 451.212560 | 405.0 | 1.344825 |
2022-02-15 21:25:30+00:00 | 1.888043 | 1.594091 | 0.239221 | 0.394714 | 466.448930 | 443.238476 | 405.0 | 1.303166 |
We can also go ahead and compute the volume-weighted stats:
[16]:
# Compute the volume-weighted stats for PM2.5
stats = obj.stats(weight='volume', dmin=0., dmax=2.5)
stats.head(3)
[16]:
number | surface_area | volume | mass | AM | GM | Mode | GSD | |
---|---|---|---|---|---|---|---|---|
timestamp | ||||||||
2022-02-15 21:27:30+00:00 | 1.839986 | 1.468447 | 0.202913 | 0.334807 | 1390.018045 | 1087.859802 | 405.0 | 2.104975 |
2022-02-15 21:26:30+00:00 | 2.001357 | 1.856839 | 0.297754 | 0.491294 | 1492.979396 | 1232.860263 | 2000.0 | 1.968189 |
2022-02-15 21:25:30+00:00 | 1.888043 | 1.594091 | 0.239221 | 0.394714 | 1491.962402 | 1185.952258 | 2650.0 | 2.080376 |
integrate#
The integrate method is certainly the most interesting, complex, and most used method. It can compute integrated mass values and number concentrations, and make adjustments for density and hygroscopic growth! We’ll go over many such options below.
First, we will begin with the most common use case: integrating the number of particles between two diameters. If we want to calculate the total number of particles between two diameters, we can do so as follows:
[17]:
obj.integrate(weight='number', dmin=0, dmax=1)
[17]:
timestamp
2022-02-15 21:27:30+00:00 1.8073
2022-02-15 21:26:30+00:00 1.9287
2022-02-15 21:25:30+00:00 1.8424
2022-02-15 21:24:30+00:00 2.0080
2022-02-15 21:23:30+00:00 1.8171
...
2022-02-15 04:43:32+00:00 1.2142
2022-02-15 04:42:32+00:00 1.3960
2022-02-15 04:41:32+00:00 1.2146
2022-02-15 04:40:32+00:00 1.3789
2022-02-15 04:39:32+00:00 1.3231
Length: 1009, dtype: float64
If we want to compute the PM1 value (integrated mass between 0 and 1 µm):
[18]:
obj.integrate(weight='mass', dmin=0., dmax=1., rho=1.65)
[18]:
timestamp
2022-02-15 21:27:30+00:00 0.149699
2022-02-15 21:26:30+00:00 0.163484
2022-02-15 21:25:30+00:00 0.154365
2022-02-15 21:24:30+00:00 0.155946
2022-02-15 21:23:30+00:00 0.143755
...
2022-02-15 04:43:32+00:00 0.093075
2022-02-15 04:42:32+00:00 0.106044
2022-02-15 04:41:32+00:00 0.098637
2022-02-15 04:40:32+00:00 0.111538
2022-02-15 04:39:32+00:00 0.109767
Length: 1009, dtype: float64
Above, we assumed a constant density of 1.65 g/cm3; however, the integrate method also supports custom functions to compute density as a function of particle diameter. To do so, you simply need to provide a callable that returns the desired density as a function of particle diameter in µm. For example, if we want to compute PM10 and assume that the density is 1.65 g/cm3 below 3 µm and 2.5 g/cm3 above 3 µm:
[19]:
def custom_density(dp):
"""Calculate density as a function of particle diameter in µm.
Parameters
----------
dp : float
Particle diameter in microns.
"""
return 1.65 if dp < 3 else 2.5
# Compute PM10
obj.integrate(weight='mass', dmin=0., dmax=10., rho=custom_density)
[19]:
timestamp
2022-02-15 21:27:30+00:00 2.679262
2022-02-15 21:26:30+00:00 0.678483
2022-02-15 21:25:30+00:00 2.096019
2022-02-15 21:24:30+00:00 1.577014
2022-02-15 21:23:30+00:00 1.631535
...
2022-02-15 04:43:32+00:00 0.849622
2022-02-15 04:42:32+00:00 1.532512
2022-02-15 04:41:32+00:00 2.868572
2022-02-15 04:40:32+00:00 5.942367
2022-02-15 04:39:32+00:00 2.217525
Length: 1009, dtype: float64
One additional very cool feature is the ability to correct measurements for hygroscopic growth! While this likely isn’t relevant for research-grade or EPA FEM instrumentation, hygroscopic growth due to water uptake is a big challenge for optical instruments operating in ambient environments without inline dryers. While there are several peer-reviewed approaches to correcting for such growth, we follow an approach originally proposed by Di Antonio et al. (2018), which corrects the size of the underlying bins prior to computing the step-wise integration.
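Conceptually, the correction relates the wet (measured) diameter to the dry diameter via the κ-Köhler growth factor; here is a sketch of the underlying relationship (not necessarily the library’s exact implementation):

def growth_factor(kappa, rh):
    """κ-Köhler growth factor (D_wet / D_dry) at relative humidity rh (%)."""
    aw = rh / 100.0  # approximate water activity by RH
    return (1 + kappa * aw / (1 - aw)) ** (1 / 3)

# e.g., at 80% RH with kappa=0.3, diameters grow by a factor of ~1.3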
To compute integrated mass values with the hygroscopic-growth correction, you must provide a few additional pieces of information:
- An estimate of the hygroscopic growth correction factor, kappa; this can be either a constant or a callable function
- The column name where relative humidity is stored
For example, if we want to compute the corrected PM1 value with a fixed kappa of 0.3:
[20]:
obj.integrate(weight='mass', dmin=0, dmax=1, kappa=0.3, rh='sample_rh')
[20]:
timestamp
2022-02-15 21:27:30+00:00 0.127272
2022-02-15 21:26:30+00:00 0.143090
2022-02-15 21:25:30+00:00 0.133221
2022-02-15 21:24:30+00:00 0.133941
2022-02-15 21:23:30+00:00 0.124912
...
2022-02-15 04:43:32+00:00 0.081037
2022-02-15 04:42:32+00:00 0.095688
2022-02-15 04:41:32+00:00 0.085639
2022-02-15 04:40:32+00:00 0.098006
2022-02-15 04:39:32+00:00 0.093921
Length: 1009, dtype: float64
If we want to define a complex callable function for kappa alongside a callable function for density:
[21]:
def custom_density(dp):
"""Calculate density as a function of particle diameter in µm.
Parameters
----------
dp : float
Particle diameter in microns.
"""
return 1.65 if dp < 3 else 2.5
def custom_kappa(dp):
"""Calculate kappa as a function of particle diameter in µm.
Parameters
----------
dp : float
Particle diameter in microns.
"""
return 0.3 if dp < 1 else 0.05
# Compute PM10
obj.integrate(
weight='mass', dmin=0., dmax=10.,
rho=custom_density, kappa=custom_kappa, rh='sample_rh'
)
[21]:
timestamp
2022-02-15 21:27:30+00:00 2.578059
2022-02-15 21:26:30+00:00 0.638148
2022-02-15 21:25:30+00:00 2.014149
2022-02-15 21:24:30+00:00 1.508324
2022-02-15 21:23:30+00:00 1.562393
...
2022-02-15 04:43:32+00:00 0.811724
2022-02-15 04:42:32+00:00 1.471832
2022-02-15 04:41:32+00:00 2.768024
2022-02-15 04:40:32+00:00 5.742915
2022-02-15 04:39:32+00:00 2.510176
Length: 1009, dtype: float64
That about sums up the basic functionality of the py-smps library. For more detailed information, please check out the API documentation and/or the additional guides that cover more specific topics.