EasyGC

This is a high-throughput, command-line GC-MS analysis pipeline. It is built on top of a modified version of the PyMS Python library (http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-13-115).

This pipeline makes it easy to analyse a large set of GC-MS runs. The input is a directory of runs (in CDF or JDX format), and the output is a matrix of peaks aligned across all your samples, including their retention times, TIC areas, and a mini mass spectrum for each called peak. You can tune how peaks are called, filtered and aligned with a range of parameters. The key point is that this pipeline makes it very easy to quantitate many peaks across many samples with minimal fuss. It is especially useful for population-level analyses where relative peak size or presence matters most, rather than highly accurate peak identification and quantitation.

version: 0.0.1 (very early). There are currently bugs in the multithreading on Linux.

prerequisites

Python 2.7
matplotlib
netCDF
pycdf 0.6-3 (this will only work on Linux)
scipy.ndimage package

(See PyMS_UserGuide.pdf for dependency installation instructions.)
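
Before running anything, it can help to check that the prerequisites are importable. The snippet below is a minimal sketch of such a check. The module names (matplotlib, scipy.ndimage, pycdf) are assumptions taken from the list above, so adjust them if your installation exposes different names.

```python
import importlib

# Module names are assumptions based on the prerequisites list;
# change them if your installation uses different import names.
REQUIRED_MODULES = ["matplotlib", "scipy.ndimage", "pycdf"]


def find_missing(modules):
    """Return the subset of `modules` that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All prerequisite modules can be imported.")
```

If pycdf is reported missing on a non-Linux machine, that is expected, and JDX input is the alternative.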

The PyMS library in this repository (/pyms) should be used instead of the original PyMS Python library, as it has a few extra features and bug fixes.

manual

usage: python easyGC.py [command] [options]

peakcall

Run the peak caller on a directory of GC-MS files. This produces one .expr file per GC-MS run, a binary file holding the called peaks and their mass spectra. The .expr files are used later by the aligner.

usage: easyGC peakcall [-h] -i INDIR -f FTYPE [-TS TRIMSTART] [-TE TRIMEND]
                       [-W WINDOW] [-S SCANS] [-N MINIONS] [-R MININTENSITY]
                       [-M NOISEMULT] [-I TOPIONS]

optional arguments:
  -h, --help            show this help message and exit
  -i INDIR, --indir INDIR
                        directory containing your GC-MS files to be processed
                        (default: None)
  -f FTYPE, --ftype FTYPE
                        CDF or JDX. This is the type of input GC-MS files you
                        have. CDF is not supported on Windows (default: None)
  -TS TRIMSTART, --trimstart TRIMSTART
                        time in minutes (X.XX) in the chromatogram from where
                        the analysis should begin. Helps to cut out junk at
                        the start (default: 0.0)
  -TE TRIMEND, --trimend TRIMEND
                        time in minutes (X.XX) in the chromatogram where the
                        analysis should end. Helps to cut out junk at the end
                        (default: 20.0)
  -W WINDOW, --window WINDOW
                        peak calling: width (in scans) of window over which
                        local ion maxima are detected. Should be similar to
                        the width of your peaks. (default: 9)
  -S SCANS, --scans SCANS
                        peak calling: distance (in scans) at which locally
                        apexing ions can be combined into one peak (default:
                        3)
  -N MINIONS, --minions MINIONS
                        peak calling: min number of apexing ions with
                        intensity above a threshold required for a peak to be
                        called. Higher = fewer peaks called (default: 4)
  -R MININTENSITY, --minintensity MININTENSITY
                        peak calling: min intensity (percent) of an ion
                        relative to max peak intensity for that ion to be
                        included in the peak (default: 5)
  -M NOISEMULT, --noisemult NOISEMULT
                        peak calling: total peak intensity must be at least
                        this multiple of the base noise level to be called.
                        Higher multiple means fewer peaks called (default: 4)
  -I TOPIONS, --topions TOPIONS
                        from the list of most important ions in a peak, how
                        many should be outputted as a mini mass-spec?
                        (default: 10)
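
To get a feel for what -W and -N control, here is a toy sketch of windowed apex detection. It is not the PyMS or EasyGC implementation: the function names and the example traces are made up for illustration, and the real caller also merges nearby apexes (-S) and applies the intensity and noise thresholds (-R, -M).

```python
def local_apexes(trace, window):
    """Return scan indices where `trace` is the unique maximum within `window` scans."""
    half = window // 2
    apexes = []
    for i, value in enumerate(trace):
        lo = max(0, i - half)
        hi = min(len(trace), i + half + 1)
        neighbourhood = trace[lo:hi]
        # An apex must be non-zero and strictly higher than everything around it.
        if value > 0 and value == max(neighbourhood) and neighbourhood.count(value) == 1:
            apexes.append(i)
    return apexes


def call_peaks(ion_traces, window=9, min_ions=4):
    """Toy caller: report scans where at least `min_ions` ions apex together."""
    counts = {}
    for trace in ion_traces.values():
        for scan in local_apexes(trace, window):
            counts[scan] = counts.get(scan, 0) + 1
    return sorted(scan for scan, n in counts.items() if n >= min_ions)


# Hypothetical intensity traces (one list per ion, one value per scan).
TRACES = {
    57: [0, 1, 4, 9, 4, 1, 0, 0, 1, 3, 1, 0],
    71: [0, 2, 5, 8, 5, 2, 0, 0, 0, 0, 0, 0],
    85: [0, 1, 3, 7, 3, 1, 0, 0, 2, 6, 2, 0],
    99: [0, 1, 2, 6, 2, 1, 0, 0, 0, 0, 0, 0],
}

strict = call_peaks(TRACES, window=9, min_ions=4)  # only the four-ion peak
loose = call_peaks(TRACES, window=9, min_ions=2)   # also the two-ion peak
```

In this toy data, four ions apex at scan 3 and only two at scan 9, so lowering the minimum ion count lets the smaller peak through. This is the same trade-off that -N makes.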

align

Run the peak aligner on a directory of .expr files that were produced by peakcall. This will produce three CSV files as output:

aligned_rt.csv
aligned_area.csv
aligned_ions.csv

This command is especially useful if you are not happy with the aligned output from 'pipeline' and want to tweak the alignment parameters to see how they affect your output matrix, without having to call peaks all over again.

usage: easyGC align [-h] -e EXPRDIR [-D DISTANCE] [-G GAP] [-C MINCOMMON]
                    [-T THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -e EXPRDIR, --exprdir EXPRDIR
                        the path to the .expr files from a previous peak
                        calling run. These will be aligned. (default: None)
  -D DISTANCE, --distance DISTANCE
                        local alignment: distance in retention time (seconds)
                        over which the local peak aligner should search for
                        similar peaks to this one (default: 2.5)
  -G GAP, --gap GAP     local alignment: gap penalty. Lower G results in more
                        peaks in the output. Higher G results in fewer output
                        peaks, but some of them may contain multiple merged
                        peaks (default: 0.4)
  -C MINCOMMON, --mincommon MINCOMMON
                        local alignment: minimum number of samples that an
                        aligned peak must be called in for it to be outputted
                        (default: 1)
  -T THREADS, --threads THREADS
                        number of threads to use. Currently only multithreaded
                        on Linux (default: 1)
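
If you want to re-apply a stricter minimum-samples filter to an existing output matrix without re-running the aligner, something like the sketch below could work. It assumes a layout that you should check against your own files: one header row, one row per aligned peak, the first column holding a peak identifier, one column per sample, and "NA" marking samples where the peak was not called. The function name and the example data are hypothetical.

```python
import csv


def peaks_in_min_samples(csv_text, min_common, id_columns=1, missing="NA"):
    """Return the IDs of peaks called in at least `min_common` samples.

    Assumes the first `id_columns` columns are not samples and that
    uncalled peaks are marked with `missing` or left empty.
    """
    reader = csv.reader(csv_text.splitlines())
    next(reader)  # skip the header row
    kept = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        values = row[id_columns:]
        called = sum(1 for v in values if v.strip() not in ("", missing))
        if called >= min_common:
            kept.append(row[0])
    return kept


# Hypothetical contents in the assumed layout.
EXAMPLE = """peak,sample_a,sample_b,sample_c
p1,1200.5,1180.0,1250.2
p2,NA,310.4,NA
p3,95.1,NA,102.7
"""

common = peaks_in_min_samples(EXAMPLE, min_common=2)
```

To use it on a real file, read the file contents into a string first and pass them in place of EXAMPLE.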

pipeline

Run the whole shebang on a directory of GC-MS runs, from peak calling through to aligned output.

usage: easyGC pipeline [-h] -i INDIR -f FTYPE [-TS TRIMSTART] [-TE TRIMEND]
                       [-W WINDOW] [-S SCANS] [-N MINIONS] [-R MININTENSITY]
                       [-M NOISEMULT] [-I TOPIONS] [-D DISTANCE] [-G GAP]
                       [-C MINCOMMON] [-T THREADS]
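
If you drive the pipeline from another script, a small wrapper can assemble the command line. This is a sketch only: the helper names and the "runs" directory are hypothetical, and it assumes easyGC.py is in the current working directory and is invoked as shown in the usage line at the top of the manual.

```python
import subprocess


def build_pipeline_command(indir, ftype, trimstart=None, trimend=None, threads=None):
    """Build the argument list for 'python easyGC.py pipeline ...'."""
    cmd = ["python", "easyGC.py", "pipeline", "-i", indir, "-f", ftype]
    if trimstart is not None:
        cmd += ["-TS", str(trimstart)]  # minutes
    if trimend is not None:
        cmd += ["-TE", str(trimend)]  # minutes
    if threads is not None:
        cmd += ["-T", str(threads)]
    return cmd


def run_pipeline(*args, **kwargs):
    """Run the pipeline and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(build_pipeline_command(*args, **kwargs))


# Example: JDX files in a hypothetical 'runs' directory, skipping the first 1.5 minutes.
example_cmd = build_pipeline_command("runs", "JDX", trimstart=1.5, threads=4)
```

Options that are left out fall back to the defaults listed in the peakcall and align sections.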

TO DO

include dependencies in repository
produce Excel file output with error-checking results shown by cell colour
