.. highlight:: sh

Usage
=====

The primary purpose of the *quantaq-cli* is to make it easier for you, the user, to munge and analyze your sensor data. A quick overview of the available commands is below, with more detailed documentation on their complete functionality in the :doc:`api`.

* **concat** enables you to concatenate large groups of files together into one
* **merge** allows you to combine a number of files together based on their timestamp
* **resample** helps you up- or down-sample your data
* **expunge** sets all flagged data to NaN
* **flag** allows you to flag rows based on specific criteria (statistical methods are **coming soon**)

Overview of Available Commands
------------------------------

Concatenate Files
^^^^^^^^^^^^^^^^^

The purpose of this command is to take a set of files of the same type and concatenate them together. To use it, you must provide either a list of files or a wildcard argument that will glob all of the files together. **FILES** is the only required argument, as can be seen in the :doc:`api`.

Below is an example of a wildcard argument that will grab all files in the directory that begin with **data** and have a **.csv** extension. We would expect this command to concatenate all of those files and write the result to **path/output.csv**. Additionally, the verbosity flag (**-v**) has been set, which prints additional debugging information to the console.

.. code-block:: bash

    $ quantaq-cli concat -v -o path/output.csv path/data*.csv

If you want to explicitly list the individual files to concatenate, you can do that as well:

.. code-block:: bash

    $ quantaq-cli concat -v path/file-1.csv path/file-2.csv

This time, we didn't define the output path (**-o**), so the default is used, which saves the file to your current working directory.

Additionally, there is support for concatenating files from the on-board µSD card log files, which are fairly hard-to-parse txt files with a ton of embedded information.
By adding the **-l, --logs** flag, you can easily convert the entire directory to a single csv file that is usable and makes sense!

.. code-block:: bash

    $ quantaq-cli concat -v -l -o final-logs.csv path/to/logs/*.txt

.. warning::

    Arguments must come at the **end** of the command. For this CLI, this usually means the filepath for the files being read in. However, you can always check the :doc:`api` for complete documentation.

Merge Files
^^^^^^^^^^^

Often, there is a need to merge two (or more) files on their timestamp. For example, if we have **raw** and **final** data files (or sensor data and reference station data), we need to merge them into a single file to make analysis easier. We can leverage the **merge** command to easily accomplish this.

The only required argument is the file(s) to merge together. Additionally, you can override the name of the timestamp column (**-ts, --tscol**) as well as define the output file destination (**-o, --output**).

To merge together two files with the default timestamp column and output destination:

.. code-block:: bash

    $ quantaq-cli merge -v path/file-1.csv path/file-2.csv

If we want to override the timestamp column to one named **tstamp**:

.. code-block:: bash

    $ quantaq-cli merge -v -ts tstamp path/file-1.csv path/file-2.csv

If we want to override the output file destination:

.. code-block:: bash

    $ quantaq-cli merge -v -o dest-path/final-file.csv path/file-1.csv path/file-2.csv

.. warning::

    The timestamp column name must be the same in all files.

Flag Data
^^^^^^^^^

While all raw data files contain a **flag** column, the **flag** command provides an easy way to set additional flags. This method **WILL NOT** remove the data, but it will set a flag so that the flagged data can be removed with the **expunge** command detailed below.

There are four required arguments: the file path, the column name, the comparator, and the value. Additionally, you can set the device model using the **model** keyword argument.
The goal is to make it easy to flag all data that falls outside some threshold range based on your domain knowledge and intuition. The column must be named identically to a column in the file; otherwise, an exception will be raised. The comparators that can be used are:

* **lt** : less than ( < )
* **le** : less than or equal to ( <= )
* **eq** : equals ( == )
* **gt** : greater than ( > )
* **ge** : greater than or equal to ( >= )

In addition to the required arguments, there are a few optional arguments, including the **verbosity** (-v, --verbosity) and **output** (-o, --output) flags prevalent throughout this library. Last is the **flag** (-f, --flag) option, which lets you choose the flag that is set. The default is the **FLAG_ROW** flag, which will NaN the entire row of data. Flags are specific to each sensor, and you should look up the options for your sensor in the sensor's documentation. However, there are several flags that can be used and are (as of June 2020) the same for all sensors:

* **FLAG_OPC** will NaN all particle data
* **FLAG_CO** will NaN all CO data
* **FLAG_CO2** will NaN all CO2 data
* **FLAG_NO** will NaN all NO data
* **FLAG_NO2** will NaN all NO2 data
* **FLAG_O3** will NaN all O3 data
* **FLAG_NEPH** will NaN all nephelometer data (MODULAIR-PM only)
* **FLAG_RHTP** will NaN all relative humidity, temperature, and pressure data (MODULAIR-PM only)

Examples:

If we want to flag all rows where the **co_ae** column is less than 500 mV:

.. code-block:: bash

    $ quantaq-cli flag -v file-1.csv co_ae lt 500

If we want to eliminate only the CO data under the same condition, we just need to change the flag we use:

.. code-block:: bash

    $ quantaq-cli flag -v -f FLAG_CO file-1.csv co_ae lt 500

It is quite possible that you will want to apply multiple filters but only save one file.
The **flag** command only allows one condition at a time for now, but you can easily accomplish this by using the previous output file path as the input to the second command. Here, we flag the entire row where **co_ae** is either less than 500 mV or greater than 3300 mV:

.. code-block:: bash

    $ quantaq-cli flag -v -o output.csv file-1.csv co_ae lt 500
    $ quantaq-cli flag -v -o final.csv output.csv co_ae gt 3300

Using this approach, complex workflows can be built.

.. note::

    There are plans to support various statistical methods for flagging outliers. If you have recommendations or thoughts, please add an issue to the GitHub repository.

Expunge Data
^^^^^^^^^^^^

All raw data files have a **flag** column that contains a single integer with several flag values combined as a bitmask. To clean this data, we use the **expunge** command. When we say *clean*, what we mean is that the columns associated with a given flag are set to NaN whenever that flag is set. For more information on the sensor-specific flags, please check out your sensor's documentation.

There are a few additional options available for this command, including **-d, --dry-run**, which will generate the flag report and print it to the terminal screen without saving the final data file, as well as the same **-o, --output** flag to define the output file path as in other commands. The model of the device can be set with the **-m, --model** flag, where the available options are [**v100**, **v200**, and **modulair_pm**]. Last, if you are using your own files and have renamed the **flag** column, you can override the name of that column with the **-f, --flag** option.

If you are running with the verbose flag set (**-v, --verbose**) or with the dry-run (**-d, --dry-run**) flag set, a table with the flag report will be output to the terminal screen. For example, we can run the default **expunge** command in dry-run mode:

.. code-block:: bash

    $ quantaq-cli expunge --dry-run -m v200 path/file-1.csv

When you run this, you will see a report that looks something like:

.. image:: flag-output.png

It contains the name of each possible flag, the flag's value, the number of occurrences, and the percentage of time the flag was set.

To run normally with all defaults:

.. code-block:: bash

    $ quantaq-cli expunge -v -m v200 path/file-1.csv

Resample Data
^^^^^^^^^^^^^

The **resample** command makes it easy to up- or down-sample your data (e.g., converting your secondly data into 5-minutely data). The only required arguments are the **FILE** and the **INTERVAL**. The **INTERVAL** should be a string that contains both the number and the sampling interval, where the available sampling interval definitions are below:

* **M** : month
* **W** : week
* **d** : day
* **h** : hour
* **min** : minute
* **s** : second
* **ms** : millisecond

So, if you wanted to resample your data from 1-second frequency to 5-minute frequency, your **INTERVAL** would be **5min**.

In addition to the required arguments, there are a few options, including the **method** (-m, --method) and the **tscol** (-ts, --tscol). The **tscol** option allows you to override the name of the timestamp column, which is **timestamp** by default. The **method** option allows you to override the method by which you resample, which defaults to **mean**. Available options for **method** are **mean**, **median**, **sum**, **min**, and **max**.

Now, for some examples!

If we want to take our data file, which is at 10-second frequency, and output a file that is 5-minute averaged:

.. code-block:: bash

    $ quantaq-cli resample -v path/file-1.csv 5min

If we want to do the same, but get the median of each 5-minute interval instead of the mean:

.. code-block:: bash

    $ quantaq-cli resample -v -m median path/file-1.csv 5min

What if we have a different timestamp column named **col_time** and want the 24-hour average?

.. code-block:: bash

    $ quantaq-cli resample -v -ts col_time path/file-1.csv 24h

.. warning::

    When resampling your data, any non-numeric columns will be dropped.

Playbook
--------

.. note::

    Feather-format data. Feather is a fast, lightweight, easy-to-use binary file format for storing data frames that is programming-language agnostic and extremely efficient when working with time-series data. The process of converting strings to Python datetime objects is fairly inefficient, especially for large data files. Thus, if you are working with large files and want to manipulate time-series data, it is highly recommended that you use the feather file format! This CLI supports it: simply give the output file a **.feather** file extension.

This playbook contains an example of a common workflow for QuantAQ users: you have a ton of raw and final data files, and you need to concatenate them, merge them together, and then expunge them. We will also throw in an optional flagging step just to show you how it could be incorporated. This entire workflow could be automated using a tool such as Snakemake or via bespoke bash commands/files.

How to munge and clean your data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, we will assume there is some directory containing all files with two subdirectories called **raw** and **final**. Additionally, we have an extra folder to hold our munged data:

.. code-block:: text

    dir/
    dir/raw/*
    dir/final/*
    dir/munged/

We begin by concatenating all raw files into a single file called **dir/munged/concat-raw.feather**, and then do the same for the final data files and save to **dir/munged/concat-final.feather**. We assume that all files in the respective directories are csv files and that we are using all of them.

.. code-block:: bash

    $ quantaq-cli concat -v -o dir/munged/concat-raw.feather dir/raw/*.csv
    $ quantaq-cli concat -v -o dir/munged/concat-final.feather \
        dir/final/*.csv

At this point, we have two large files.
Next, we will **merge** the two files together into a single file called **dir/munged/merged.feather**:

.. code-block:: bash

    $ quantaq-cli merge -v -o dir/munged/merged.feather \
        dir/munged/concat-raw.feather dir/munged/concat-final.feather

Next, let's (optionally) flag the data based on temperature to throw out any periods that have truly ridiculous values (which likely means the sensor was misbehaving):

.. code-block:: bash

    $ quantaq-cli flag -v -o dir/munged/tmp.feather dir/munged/merged.feather \
        temp_manifold ge 100

Next, we will **expunge** the data and set the flagged data to NaN:

.. code-block:: bash

    $ quantaq-cli expunge -v -o dir/munged/expunged.feather dir/munged/tmp.feather

At this point, we could stop, as we have a file (**dir/munged/expunged.feather**) that contains the final, de-flagged data. However, it is likely still at a 10-second sample frequency, which is a lot of data! Let's go ahead and **resample** it to both 1min and 5min intervals:

.. code-block:: bash

    $ quantaq-cli resample -v -o dir/munged/final-1min.csv dir/munged/expunged.feather 1min
    $ quantaq-cli resample -v -o dir/munged/final-5min.csv dir/munged/expunged.feather 5min

And that's it! Just 7 bash commands and you've gone from two directories full of data to 2 files that contain the final 1min and 5min sampled data!
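As one possible way to automate the playbook with a bespoke bash file, the steps above could be collected into a single script. This is only a sketch: the ``DATA_DIR`` and ``DRY_RUN`` variables and the ``run`` and ``munge`` helper functions are hypothetical names chosen for illustration and are not part of *quantaq-cli*. The script only reuses the commands shown in this playbook and assumes the **dir/raw**, **dir/final**, and **dir/munged** layout described earlier. By default it only prints the commands; set ``DRY_RUN=0`` to actually execute them.

```shell
#!/usr/bin/env bash
# Sketch of the playbook as one script (hypothetical helper, not shipped with quantaq-cli).
set -euo pipefail

DATA_DIR="${DATA_DIR:-dir}"       # hypothetical: root folder holding raw/ and final/
OUT_DIR="${DATA_DIR}/munged"      # folder for all intermediate and final outputs
DRY_RUN="${DRY_RUN:-1}"           # hypothetical: 1 prints each command, 0 executes it

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Run the playbook steps in order: concat, merge, flag, expunge, resample.
munge() {
    run mkdir -p "${OUT_DIR}"

    # 1. Concatenate the raw and final csv files separately.
    run quantaq-cli concat -v -o "${OUT_DIR}/concat-raw.feather" "${DATA_DIR}"/raw/*.csv
    run quantaq-cli concat -v -o "${OUT_DIR}/concat-final.feather" "${DATA_DIR}"/final/*.csv

    # 2. Merge the two concatenated files on their timestamp column.
    run quantaq-cli merge -v -o "${OUT_DIR}/merged.feather" \
        "${OUT_DIR}/concat-raw.feather" "${OUT_DIR}/concat-final.feather"

    # 3. (Optional) flag rows with implausible manifold temperatures.
    run quantaq-cli flag -v -o "${OUT_DIR}/tmp.feather" "${OUT_DIR}/merged.feather" \
        temp_manifold ge 100

    # 4. Set all flagged data to NaN.
    run quantaq-cli expunge -v -o "${OUT_DIR}/expunged.feather" "${OUT_DIR}/tmp.feather"

    # 5. Resample to 1-minute and 5-minute intervals.
    run quantaq-cli resample -v -o "${OUT_DIR}/final-1min.csv" "${OUT_DIR}/expunged.feather" 1min
    run quantaq-cli resample -v -o "${OUT_DIR}/final-5min.csv" "${OUT_DIR}/expunged.feather" 5min
}

munge
```

Because ``set -e`` is enabled, the script stops at the first failing step when run with ``DRY_RUN=0``, so a later command never reads a missing or partial intermediate file. Reviewing the printed commands first is a cheap way to confirm the paths before processing a large directory.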