viral-ngs: genomic analysis pipelines for viral sequencing

Contents

Description of the methods

_images/viral-ngs-overview.png

Taxonomic read filtration

Human, contaminant, and duplicate read removal

The assembly pipeline begins by depleting paired-end reads from each sample of human and other contaminants using BMTAGGER and BLASTN, and removing PCR duplicates using M-Vicuna (a custom version of Vicuna).

Taxonomic selection

Reads are then filtered to to a genus-level database using LASTAL, quality-trimmed with Trimmomatic, and further deduplicated with PRINSEQ.

Viral genome analysis

Viral genome assembly

The filtered and trimmed reads are subsampled to at most 100,000 pairs. de novo assemby is performed using Trinity. Reference-assisted assembly improvements follow (contig scaffolding, orienting, etc.) with MUMMER and MAFFT.

Each sample’s reads are aligned to its de novo assembly using Novoalign and any remaining duplicates were removed using Picard MarkDuplicates. Variant positions in each assembly were identified using GATK IndelRealigner and UnifiedGenotyper on the read alignments. The assembly was refined to represent the major allele at each variant site, and any positions supported by fewer than three reads were changed to N.

This align-call-refine cycle is iterated twice, to minimize reference bias in the assembly.

Intrahost variant identification

Intrahost variants (iSNVs) were called from each sample’s read alignments using V-Phaser2 and subjected to an initial set of filters: variant calls with fewer than five forward or reverse reads or more than a 10-fold strand bias were eliminated. iSNVs were also removed if there was more than a five-fold difference between the strand bias of the variant call and the strand bias of the reference call. Variant calls that passed these filters were additionally subjected to a 0.5% frequency filter. The final list of iSNVs contains only variant calls that passed all filters in two separate library preparations. These files infer 100% allele frequencies for all samples at an iSNV position where there was no intra-host variation within the sample, but a clear consensus call during assembly. Annotations are computed with snpEff.

Taxonomic read identification

Nothing here at the moment. That comes later, but we will later integrate it when it’s ready.

Cloud compute implementation

This assembly pipeline is also available via the DNAnexus cloud platform. RNA paired-end reads from either HiSeq or MiSeq instruments can be securely uploaded in FASTQ or BAM format and processed through the pipeline using graphical and command-line interfaces. Instructions for the cloud analysis pipeline are available at https://github.com/dnanexus/viral-ngs/wiki

Installation

Manual Installation

Install Conda

To use viral-ngs, you need to install the Conda package manager which is most easily obtained via the Miniconda Python distribution. Miniconda can be installed on your system without admin priviledges.

After installing Miniconda for your platform, be sure to update it:

conda update -y conda
Configure Conda

The viral-ngs software and its dependencies are distributed through the bioconda channel for the conda package manager. It is necessary to add this channel to the conda config:

conda config --add channels bioconda
Make a conda environment and install viral-ngs

It is recommended to install viral-ngs into its own conda environment. This ensures its dependencies do not interfere with other conda packages installed on your system. A new conda environment can be created with the following command, which will also install conda:

conda create -n viral-ngs-env viral-ngs
Activate the viral-ngs environment and complete the install

In order to finish installing viral-ngs, you will need to activate its conda environment:

source activate viral-ngs-env

Due to license restrictions, the viral-ngs conda package cannot distribute and install GATK directly. To fully install GATK, you must download a licensed copy of GATK from the Broad Institute, and call “gatk-register,” which will copy GATK into your viral-ngs conda environment:

# (download licensed copy of GATK)
gath-register /path/to/GenomeAnalysisTK.jar

The single-threaded version of Novoalign is installed by default. If you have a license for Novoalign to enable multi-threaded operation, viral-ngs will copy it to the viral-ngs conda environment if the NOVOALIGN_LICENSE_PATH environment variable is set. Alternatively, the conda version of Novoalign can be overridden if the NOVOALIGN_PATH environment variable is set. If you obtain a Novoalign license after viral-ngs has already been installed, it can be added to the conda environment by calling:

# obtain a Novoalign license file: novoalign.lic
novoalign-register-license /path/to/novoalign.lic
Activating viral-ngs once installed

After viral-ngs has been installed, only one command is needed to load the environment and all of its dependencies. This is the command that must be run each time before using viral-ngs:

source activate viral-ngs-env

To deactivate the conda environment:

source deactivate

Easy deployment script for viral-ngs

viral-ngs can be deployed with help from the script, easy-deploy/easy-deploy-viral-ngs.sh. This script will install an independent copy of viral-ngs from the latest source, install all dependencies, and make it simple to activate the viral-ngs environment and create projects.

One-line install command

This one-line command will install viral-ngs on a 64-bit macOS or Linux system:

./easy-deploy-script/easy-deploy-viral-ngs.sh setup
One-line install command for Broad Institute users

This one-line command will download the easy-deploy-viral-ngs.sh script and setup viral-ngs in the current working directory. Simply ssh to one of the Broad login nodes and paste this command:

wget https://raw.githubusercontent.com/broadinstitute/viral-ngs/master/easy-deploy-script/easy-deploy-viral-ngs.sh && chmod a+x ./easy-deploy-viral-ngs.sh && reuse UGER && qrsh -cwd -N "viral-ngs_deploy" -q interactive ./easy-deploy-viral-ngs.sh setup

Note: The script will run the install on a UGER interactive node, so you must have the ability to create to start a new interactive session. A project can be specified via qrsh -P "<project_name>"

Usage

Installation

  • easy-deploy-viral-ngs.sh setup Installs a fresh copy of viral-ngs, installs all dependencies, and creates a directory, viral-ngs-etc/, in the current working directory.

Resulting directories:

viral-ngs-etc/
    venv/
    viral-ngs/

Activating the environment

  • source easy-deploy-viral-ngs.sh load Loads the dotkits needed by viral-ngs and activates the Python virtual environment

Creating a project directory

  • easy-deploy-viral-ngs.sh create-project <project_name> Creates a directory for a new Snakemake-compatible project, with data directories and symlinked run scripts. Copies in the files Snakefile and config.yaml

Resulting directories:

viral-ngs-analysis-software/
    projects/
        <project_name>/
            Snakefile
            bin/ (symlink)
            config.yaml
            data/
            log/
            reports/
            run-pipe_LSF.sh (symlink)
            run-pipe_UGER.sh (symlink)
            samples-assembly-failures.txt
            samples-assembly.txt
            samples-depletion.txt
            samples-runs.txt
            tmp/
            venv/ (symlink)
            [...other project files...]

Virtualized Installation (Easy Deploy)

The viral-ngs package includes a script that can be used to set up a complete virtualized environment for running viral-ngs either on a local machine via VirtualBox, or on AWS EC2. This is an easiesr way to get the software up and running, as it sets up most dependencies automatically within an environment known to work.

Requirements

As noted above, GATK and NovoAlign cannot be installed automatically due to licensing restrictions. In order to run the easy deployment script, you will first need to license and download these tools, and set the GATK_PATH and NOVOALIGN_LICENSE_PATH environment variables.

The easy deployment script has been tested to run on OS X 10.11 (El Capitan) and Ubuntu 15.04 (Vivid Vervet).

Requirements for running on AWS EC2

In order to deploy a virtualized viral-ngs environment to AWS EC2, you will first need to set up the appropriate credentials for creating EC2 instances. AWS credentials and SSH keypairs are passed in as environment variables, and run.sh will prompt for the values if the environment variables are not set (though the values given interactively are ephemeral).

The following environment variables are needed:

  • EC2_ACCESS_KEY_ID
  • EC2_SECRET_ACCESS_KEY
  • EC2_REGION (ex. “us-west-2”)
  • EC2_KEYPAIR_NAME (ex. “my-ssh-keypair”)
  • EC2_PRIVATE_KEY_PATH (ex. “my-ssh-keypair.pem”)
  • EC2_SECURITY_GROUP (ex. “ssh-only-group”)

For more information, see the following AWS pages:

Note that the EC2 instance created by the easy-deploy script is currently configured to be an m4.2xlarge, which costs ~$0.55/hour to run. It is suggested that the instance be terminated via the AWS web console once processing with viral-ngs is complete. See the AWS page for current pricing .

Limitations

As viral-ngs does not currently build a depletion database for BMTagger or BLAST automatically, it is the responsibility of the user to create a depletion database for use within the virtualized viral-ngs environment. It can be created within the virtual machine (VM), or uploaded after the fact via rsync.

Running Easy Deploy

Running Easy Deploy to create a virtualized viral-ngs environment is as simple as running easy-deploy-virtualized/run.sh. Before running this script, copy any data you wish to have in the vm to the easy-deploy-virtualized/data directory on your local machine. During setup, the files will be copied into the ~/data/ directory of virtual machine.

To start, the script run.sh installs the necessary dependencies on the user’s machine (ansible, vagrant, virtualbox, and virtualbox-aws). The provisioning is handled by Ansible, with Vagrant handling creation of the VMs and EC2 instances. On OSX it depends on Homebrew, and will install it if it is not present. It depends on having apt on linux. Ruby >=2.0 is required for vagrant-aws, so versions of Ubuntu older than 15.04 (notably 14.04 LTS) will need to have ruby >=2.0 installed and made default.

Details on Easy Deploy

Per the Vagrantfile, local VM RAM usage is set to 8GB. On EC2 it currently uses an m4.2xlarge instance with 32GB of RAM and 8 vCPUs.

Ansible clones the master branch of viral-ngs from GitHub, creates a Python 3 virtual environment, and installs the viral-ngs Python dependencies. The viral-ngs tool unit tests are run to download, install, and build all of the viral-ngs tools. A Snakefile for viral-ngs is copied to the home directory of the VM (locally: /home/vagrant/, on EC2: /home/ubuntu/), along with an associated config.yaml file. Files to contain sample names (sample-depletion.txt, etc.) are also created. A directory is created within the VM, ~/data/, to store data to be processed. This directory on the VM is synced to the ./data/ directory on the host machine, relative to the location of the easy-deploy-virtualized/Vagrantfile. On local VMs, syncing of the directory is two-way and fast. On EC2 instances, the syncing is currently one way (local->EC2) due to Vagrant limitations.

Command line tools

metagenomics.py - metagenomic analyses

This script contains a number of utilities for metagenomic analyses.

usage: metagenomics.py subcommand
Sub-commands:
kraken

Classify reads by taxon using Kraken

usage: metagenomics.py kraken [-h] [--outReport OUTREPORT]
                              [--outReads OUTREADS]
                              [--filterThreshold FILTERTHRESHOLD]
                              [--numThreads NUMTHREADS]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              inBam db
Positional arguments:
inBam Input unaligned reads, BAM format.
db Kraken database directory.
Options:
--outReport Kraken report output file.
--outReads Kraken per read output file.
--filterThreshold=0.05
 Kraken filter threshold (default %(default)s)
--numThreads=1 Number of threads to run. (default %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
diamond

Classify reads by the taxon of the Lowest Common Ancestor (LCA)

usage: metagenomics.py diamond [-h] [--outM8 OUTM8] [--outLca OUTLCA]
                               [--numThreads NUMTHREADS]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               inBam db taxDb outReport
Positional arguments:
inBam Input unaligned reads, BAM format.
db Diamond database directory.
taxDb Taxonomy database directory.
outReport Output taxonomy report.
Options:
--outM8 Blast m8 formatted output file.
--outLca Output LCA assignments for each read.
--numThreads=1 Number of threads (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
krona

Create an interactive HTML report from a tabular metagenomic report

usage: metagenomics.py krona [-h] [--queryColumn QUERYCOLUMN]
                             [--taxidColumn TAXIDCOLUMN]
                             [--scoreColumn SCORECOLUMN] [--noHits] [--noRank]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version]
                             inTsv db outHtml
Positional arguments:
inTsv Input tab delimited file.
db Krona taxonomy database directory.
outHtml Output html report.
Options:
--queryColumn=2
 Column of query id. (default %(default)s)
--taxidColumn=3
 Column of taxonomy id. (default %(default)s)
--scoreColumn Column of score. (default %(default)s)
--noHits=False Include wedge for no hits.
--noRank=False Include no rank assignments.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
report_merge

Merge multiple metegenomic reports into a single metagenomic report. Any Krona input files created by this

usage: metagenomics.py report_merge [-h]
                                    [--outSummaryReport OUT_KRAKEN_SUMMARY]
                                    [--krakenDB KRAKEN_DB]
                                    [--outByQueryToTaxonID OUT_KRONA_INPUT]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    metagenomic_reports
                                    [metagenomic_reports ...]
Positional arguments:
metagenomic_reports
 Input metagenomic reports with the query ID and taxon ID in the 2nd and 3rd columns (Kraken format)
Options:
--outSummaryReport
 Path of human-readable metagenomic summary report, created by kraken-report
--krakenDB Kraken database (needed for outSummaryReport)
--outByQueryToTaxonID
 Output metagenomic report suitable for Krona input.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

taxon_filter.py - tools for taxonomic removal or filtration of reads

This script contains a number of utilities for filtering NGS reads based on membership or non-membership in a species / genus / taxonomic grouping.

usage: taxon_filter.py subcommand
Sub-commands:
deplete_human

Run the entire depletion pipeline: bmtagger, mvicuna, blastn. Optionally, use lastal to select a specific taxon of interest.

usage: taxon_filter.py deplete_human [-h] [--taxfiltBam TAXFILTBAM]
                                     --bmtaggerDbs BMTAGGERDBS
                                     [BMTAGGERDBS ...] --blastDbs BLASTDBS
                                     [BLASTDBS ...] [--lastDb LASTDB]
                                     [--threads THREADS]
                                     [--JVMmemory JVMMEMORY]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inBam revertBam bmtaggerBam rmdupBam
                                     blastnBam
Positional arguments:
inBam Input BAM file.
revertBam Output BAM: read markup reverted with Picard.
bmtaggerBam Output BAM: depleted of human reads with BMTagger.
rmdupBam Output BAM: bmtaggerBam run through M-Vicuna duplicate removal.
blastnBam Output BAM: rmdupBam run through another depletion of human reads with BLASTN.
Options:
--taxfiltBam Output BAM: blastnBam run through taxonomic selection via LASTAL.
--bmtaggerDbs Reference databases (one or more) to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
--blastDbs One or more reference databases for blast to deplete from input.
--lastDb One reference database for last (required if –taxfiltBam is specified).
--threads=4 The number of threads to use in running blastn.
--JVMmemory=4g JVM virtual memory size for Picard FilterSamReads (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
trim_trimmomatic

Trim read sequences with Trimmomatic.

usage: taxon_filter.py trim_trimmomatic [-h]
                                        [--leadingQCutoff LEADING_Q_CUTOFF]
                                        [--trailingQCutoff TRAILING_Q_CUTOFF]
                                        [--minlengthToKeep MINLENGTH_TO_KEEP]
                                        [--slidingWindowSize SLIDING_WINDOW_SIZE]
                                        [--slidingWindowQCutoff SLIDING_WINDOW_Q_CUTOFF]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmp_dir TMP_DIR]
                                        [--tmp_dirKeep]
                                        inFastq1 inFastq2 pairedOutFastq1
                                        pairedOutFastq2 clipFasta
Positional arguments:
inFastq1 Input reads 1
inFastq2 Input reads 2
pairedOutFastq1
 Paired output 1
pairedOutFastq2
 Paired output 2
clipFasta Fasta file with adapters, PCR sequences, etc. to clip off
Options:
--leadingQCutoff=15
 minimum quality required to keep a base on the leading end (default: %(default)s)
--trailingQCutoff=15
 minimum quality required to keep a base on the trailing end (default: %(default)s)
--minlengthToKeep=30
 minimum length of reads to be kept (default: %(default)s)
--slidingWindowSize=4
 the number of bases to average across (default: %(default)s)
--slidingWindowQCutoff=25
 specifies the average quality required in the sliding window (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
filter_lastal_bam

Restrict input reads to those that align to the given reference database using LASTAL.

usage: taxon_filter.py filter_lastal_bam [-h]
                                         [-n MAX_GAPLESS_ALIGNMENTS_PER_POSITION]
                                         [-l MIN_LENGTH_FOR_INITIAL_MATCHES]
                                         [-L MAX_LENGTH_FOR_INITIAL_MATCHES]
                                         [-m MAX_INITIAL_MATCHES_PER_POSITION]
                                         [--JVMmemory JVMMEMORY]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmp_dir TMP_DIR]
                                         [--tmp_dirKeep]
                                         inBam db outBam
Positional arguments:
inBam Input reads
db Database of taxa we keep
outBam Output reads, filtered to refDb
Options:
-n=1 maximum gapless alignments per query position (default: %(default)s)
-l=5 minimum length for initial matches (default: %(default)s)
-L=50 maximum length for initial matches (default: %(default)s)
-m=100 maximum initial matches per query position (default: %(default)s)
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
filter_lastal

Restrict input reads to those that align to the given reference database using LASTAL. Also, remove duplicates with prinseq.

usage: taxon_filter.py filter_lastal [-h]
                                     [-n MAX_GAPLESS_ALIGNMENTS_PER_POSITION]
                                     [-l MIN_LENGTH_FOR_INITIAL_MATCHES]
                                     [-L MAX_LENGTH_FOR_INITIAL_MATCHES]
                                     [-m MAX_INITIAL_MATCHES_PER_POSITION]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inFastq refDb outFastq
Positional arguments:
inFastq Input fastq file
refDb Reference database to retain from input
outFastq Output fastq file
Options:
-n=1 maximum gapless alignments per query position (default: %(default)s)
-l=5 minimum length for initial matches (default: %(default)s)
-L=50 maximum length for initial matches (default: %(default)s)
-m=100 maximum initial matches per query position (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
partition_bmtagger

Use bmtagger to partition input reads into ones that match at least one of several databases and ones that don’t match any of the databases.

usage: taxon_filter.py partition_bmtagger [-h] [--outMatch OUTMATCH OUTMATCH]
                                          [--outNoMatch OUTNOMATCH OUTNOMATCH]
                                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                          [--version] [--tmp_dir TMP_DIR]
                                          [--tmp_dirKeep]
                                          inFastq1 inFastq2 refDbs
                                          [refDbs ...]
Positional arguments:
inFastq1 Input fastq file; 1st end of paired-end reads.
inFastq2 Input fastq file; 2nd end of paired-end reads. Must have same names as inFastq1
refDbs Reference databases (one or more) to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
Options:
--outMatch Filenames for fastq output of matching reads.
--outNoMatch Filenames for fastq output of unmatched reads.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
deplete_bam_bmtagger

Use bmtagger to deplete input reads against several databases.

usage: taxon_filter.py deplete_bam_bmtagger [-h] [--threads THREADS]
                                            [--JVMmemory JVMMEMORY]
                                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                            [--version] [--tmp_dir TMP_DIR]
                                            [--tmp_dirKeep]
                                            inBam refDbs [refDbs ...] outBam
Positional arguments:
inBam Input BAM file.
refDbs Reference databases (one or more) to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
outBam Output BAM file.
Options:
--threads=4 The number of threads to use in running blastn.
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
deplete_blastn

Use blastn to remove reads that match at least one of the databases.

usage: taxon_filter.py deplete_blastn [-h] [--threads THREADS]
                                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                      [--version] [--tmp_dir TMP_DIR]
                                      [--tmp_dirKeep]
                                      inFastq outFastq refDbs [refDbs ...]
Positional arguments:
inFastq Input fastq file.
outFastq Output fastq file with matching reads removed.
refDbs One or more reference databases for blast.
Options:
--threads=4 The number of threads to use in running blastn.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
deplete_blastn_paired

Use blastn to remove reads that match at least one of the databases.

usage: taxon_filter.py deplete_blastn_paired [-h] [--threads THREADS]
                                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                             [--version] [--tmp_dir TMP_DIR]
                                             [--tmp_dirKeep]
                                             infq1 infq2 outfq1 outfq2 refDbs
                                             [refDbs ...]
Positional arguments:
infq1 Input fastq file.
infq2 Input fastq file.
outfq1 Output fastq file with matching reads removed.
outfq2 Output fastq file with matching reads removed.
refDbs One or more reference databases for blast.
Options:
--threads=4 The number of threads to use in running blastn.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
deplete_blastn_bam

Use blastn to remove reads that match at least one of the specified databases.

usage: taxon_filter.py deplete_blastn_bam [-h] [--threads THREADS]
                                          [--chunkSize CHUNKSIZE]
                                          [--JVMmemory JVMMEMORY]
                                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                          [--version] [--tmp_dir TMP_DIR]
                                          [--tmp_dirKeep]
                                          inBam refDbs [refDbs ...] outBam
Positional arguments:
inBam Input BAM file.
refDbs One or more reference databases for blast.
outBam Output BAM file with matching reads removed.
Options:
--threads=4 The number of threads to use in running blastn.
--chunkSize=1000000
 FASTA chunk size (default: %(default)s)
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
lastal_build_db

build a database for use with last based on an input fasta file

usage: taxon_filter.py lastal_build_db [-h]
                                       [--outputFilePrefix OUTPUTFILEPREFIX]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inputFasta outputDirectory
Positional arguments:
inputFasta Location of the input FASTA file
outputDirectory
 Location for the output files (default is cwd: %(default)s)
Options:
--outputFilePrefix
 Prefix for the output file name (default: inputFasta name, sans ”.fasta” extension)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
blastn_build_db

Create a database for use with blastn from an input reference FASTA file

usage: taxon_filter.py blastn_build_db [-h]
                                       [--outputFilePrefix OUTPUTFILEPREFIX]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inputFasta outputDirectory
Positional arguments:
inputFasta Location of the input FASTA file
outputDirectory
 Location for the output files
Options:
--outputFilePrefix
 Prefix for the output file name (default: inputFasta name, sans ”.fasta” extension)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
bmtagger_build_db

Create a database for use with Bmtagger from an input FASTA file.

usage: taxon_filter.py bmtagger_build_db [-h]
                                         [--outputFilePrefix OUTPUTFILEPREFIX]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmp_dir TMP_DIR]
                                         [--tmp_dirKeep]
                                         inputFasta outputDirectory
Positional arguments:
inputFasta Location of the input FASTA file
outputDirectory
 Location for the output files (Where *.bitmask and *.srprism files will be stored)
Options:
--outputFilePrefix
 Prefix for the output file name (default: inputFasta name, sans ”.fasta” extension)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

assembly.py - de novo assembly

This script contains a number of utilities for viral sequence assembly
from NGS reads. Primarily used for Lassa and Ebola virus analysis in the Sabeti Lab / Broad Institute Viral Genomics.

usage: assembly.py subcommand
Sub-commands:
trim_rmdup_subsamp

Take reads through Trimmomatic, Prinseq, and subsampling. This should probably move over to read_utils or taxon_filter.

usage: assembly.py trim_rmdup_subsamp [-h] [--n_reads N_READS]
                                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                      [--version] [--tmp_dir TMP_DIR]
                                      [--tmp_dirKeep]
                                      inBam clipDb outBam
Positional arguments:
inBam Input reads, unaligned BAM format.
clipDb Trimmomatic clip DB.
outBam Output reads, unaligned BAM format (currently, read groups and other header information are destroyed in this process).
Options:
--n_reads=100000
 Subsample reads to no more than this many pairs. (default %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
assemble_trinity

This step runs the Trinity assembler. First trim reads with trimmomatic, rmdup with prinseq, and random subsample to no more than 100k reads.

usage: assembly.py assemble_trinity [-h] [--n_reads N_READS]
                                    [--outReads OUTREADS] [--always_succeed]
                                    [--JVMmemory JVMMEMORY]
                                    [--threads THREADS]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    inBam clipDb outFasta
Positional arguments:
inBam Input unaligned reads, BAM format.
clipDb Trimmomatic clip DB.
outFasta Output assembly.
Options:
--n_reads=100000
 Subsample reads to no more than this many pairs. (default %(default)s)
--outReads Save the trimmomatic/prinseq/subsamp reads to a BAM file
--always_succeed=False
 If Trinity fails (usually because insufficient reads to assemble), emit an empty fasta file as output. Default is to throw a DenovoAssemblyError.
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--threads=1 Number of threads (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
order_and_orient

This step cleans up the de novo assembly with a known reference genome. Uses MUMmer (nucmer or promer) to create a reference-based consensus sequence of aligned contigs (with runs of N’s in between the de novo contigs).

usage: assembly.py order_and_orient [-h] [--breaklen BREAKLEN]
                                    [--maxgap MAXGAP] [--minmatch MINMATCH]
                                    [--mincluster MINCLUSTER]
                                    [--min_pct_id MIN_PCT_ID]
                                    [--min_contig_len MIN_CONTIG_LEN]
                                    [--min_pct_contig_aligned MIN_PCT_CONTIG_ALIGNED]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    inFasta inReference outFasta
Positional arguments:
inFasta Input de novo assembly/contigs, FASTA format.
inReference Reference genome for ordering, orienting, and merging contigs, FASTA format.
outFasta Output assembly, FASTA format, with the same number of chromosomes as inReference, and in the same order.
Options:
--breaklen, -b Amount to extend alignment clusters by (if –extend). nucmer default 200, promer default 60.
--maxgap=200, -g=200
 Maximum gap between two adjacent matches in a cluster. Our default is %(default)s. nucmer default 90, promer default 30. Manual suggests going to 1000.
--minmatch=10, -l=10
 Minimum length of an maximal exact match. Our default is %(default)s. nucmer default 20, promer default 6.
--mincluster, -c
 Minimum cluster length. nucmer default 65, promer default 20.
--min_pct_id=0.6, -i=0.6
 show-tiling: minimum percent identity for contig alignment (0.0 - 1.0, default: %(default)s)
--min_contig_len=200
 show-tiling: reject contigs smaller than this (default: %(default)s)
--min_pct_contig_aligned=0.6, -v=0.6
 show-tiling: minimum percent of contig length in alignment (0.0 - 1.0, default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
impute_from_reference

This takes a de novo assembly, aligns against a reference genome, and imputes all missing positions (plus some of the chromosome ends) with the reference genome. This provides an assembly with the proper structure (but potentially wrong sequences in areas) from which we can perform further read-based refinement. Two steps: filter_short_seqs: We then toss out all assemblies that come out to < 15kb or < 95% unambiguous and fail otherwise. modify_contig: Finally, we trim off anything at the end that exceeds the length of the known reference assembly. We also replace all Ns and everything within 55bp of the chromosome ends with the reference sequence. This is clearly incorrect consensus sequence, but it allows downstream steps to map reads in parts of the genome that would otherwise be Ns, and we will correct all of the inferred positions with two steps of read-based refinement (below), and revert positions back to Ns where read support is lacking. FASTA indexing: output assembly is indexed for Picard, Samtools, Novoalign.

usage: assembly.py impute_from_reference [-h] [--newName NEWNAME]
                                         [--minLengthFraction MINLENGTHFRACTION]
                                         [--minUnambig MINUNAMBIG]
                                         [--replaceLength REPLACELENGTH]
                                         [--aligner {muscle,mafft,mummer}]
                                         [--index]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmp_dir TMP_DIR]
                                         [--tmp_dirKeep]
                                         inFasta inReference outFasta
Positional arguments:
inFasta Input assembly/contigs, FASTA format, already ordered, oriented and merged with inReference.
inReference Reference genome to impute with, FASTA format.
outFasta Output assembly, FASTA format.
Options:
--newName rename output chromosome (default: do not rename)
--minLengthFraction=0.5
 minimum length for contig, as fraction of reference (default: %(default)s)
--minUnambig=0.5
 minimum percentage unambiguous bases for contig (default: %(default)s)
--replaceLength=0
 length of ends to be replaced with reference (default: %(default)s)
--aligner=muscle
 

which method to use to align inFasta to inReference. “muscle” = MUSCLE, “mafft” = MAFFT, “mummer” = nucmer. [default: %(default)s]

Possible choices: muscle, mafft, mummer

--index=False Index outFasta for Picard/GATK, Samtools, and Novoalign.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
refine_assembly

This a refinement step where we take a crude assembly, align all reads back to it, and modify the assembly to the majority allele at each position based on read pileups. This step considers both SNPs as well as indels called by GATK and will correct the consensus based on GATK calls. Reads are aligned with Novoalign, then PCR duplicates are removed with Picard (in order to debias the allele counts in the pileups), and realigned with GATK’s IndelRealigner (in order to call indels). Output FASTA file is indexed for Picard, Samtools, and Novoalign.

usage: assembly.py refine_assembly [-h] [--outBam OUTBAM] [--outVcf OUTVCF]
                                   [--min_coverage MIN_COVERAGE]
                                   [--novo_params NOVO_PARAMS]
                                   [--chr_names [CHR_NAMES [CHR_NAMES ...]]]
                                   [--keep_all_reads] [--JVMmemory JVMMEMORY]
                                   [--threads THREADS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   inFasta inBam outFasta
Positional arguments:
inFasta Input assembly, FASTA format, pre-indexed for Picard, Samtools, and Novoalign.
inBam Input reads, unaligned BAM format.
outFasta Output refined assembly, FASTA format, indexed for Picard, Samtools, and Novoalign.
Options:
--outBam Reads aligned to inFasta. Unaligned and duplicate reads have been removed. GATK indel realigned.
--outVcf GATK genotype calls for genome in inFasta coordinate space.
--min_coverage=3
 Minimum read coverage required to call a position unambiguous.
--novo_params=-r Random -l 40 -g 40 -x 20 -t 100
 Alignment parameters for Novoalign.
--chr_names=[] Rename all output chromosomes (default: retain original chromosome names)
--keep_all_reads=False
 Retain all reads in BAM file? Default is to remove unaligned and duplicate reads.
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--threads=1 Number of threads (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
filter_short_seqs

Check sequences in inFile, retaining only those that are at least minLength

usage: assembly.py filter_short_seqs [-h] [-f FORMAT] [-of OUTPUT_FORMAT]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version]
                                     inFile minLength minUnambig outFile
Positional arguments:
inFile input sequence file
minLength minimum length for contig
minUnambig minimum percentage unambiguous bases for contig
outFile output file
Options:
-f=fasta, --format=fasta
 Format for input sequence (default: %(default)s)
-of=fasta, --output-format=fasta
 Format for output sequence (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
modify_contig

Modifies an input contig. Depending on the options selected, can replace N calls with reference calls, replace ambiguous calls with reference calls, trim to the length of the reference, replace contig ends with reference calls, and trim leading and trailing Ns. Author: rsealfon.

usage: assembly.py modify_contig [-h] [-n NAME] [-cn] [-t] [-r5] [-r3]
                                 [-l REPLACE_LENGTH] [-f FORMAT] [-r] [-rn]
                                 [-ca] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version]
                                 input output ref
Positional arguments:
input input alignment of reference and contig (should contain exactly 2 sequences)
output Destination file for modified contigs
ref reference sequence name (exact match required)
Options:
-n, --name fasta header output name (default: existing header)
-cn=False, --call-reference-ns=False
 should the reference sequence be called if there is an N in the contig and a more specific base in the reference (default: %(default)s)
-t=False, --trim-ends=False
 should ends of contig.fasta be trimmed to length of reference (default: %(default)s)
-r5=False, --replace-5ends=False
 should the 5’-end of contig.fasta be replaced by reference (default: %(default)s)
-r3=False, --replace-3ends=False
 should the 3’-end of contig.fasta be replaced by reference (default: %(default)s)
-l=10, --replace-length=10
 length of ends to be replaced (if replace-ends is yes) (default: %(default)s)
-f=fasta, --format=fasta
 Format for input alignment (default: %(default)s)
-r=False, --replace-end-gaps=False
 Replace gaps at the beginning and end of the sequence with reference sequence (default: %(default)s)
-rn=False, --remove-end-ns=False
 Remove leading and trailing N’s in the contig (default: %(default)s)
-ca=False, --call-reference-ambiguous=False
 should the reference sequence be called if the contig seq is ambiguous and the reference sequence is more informative & consistant with the ambiguous base (ie Y->C) (default: %(default)s)
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
vcf_to_fasta

Take input genotypes (VCF) and construct a consensus sequence (fasta) by using majority-read-count alleles in the VCF. Genotypes in the VCF will be ignored–we will use the allele with majority read support (or an ambiguity base if there is no clear majority). Uncalled positions will be emitted as N’s. Author: dpark.

usage: assembly.py vcf_to_fasta [-h] [--trim_ends] [--min_coverage MIN_DP]
                                [--major_cutoff MAJOR_CUTOFF]
                                [--min_dp_ratio MIN_DP_RATIO]
                                [--name [NAME [NAME ...]]]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version]
                                inVcf outFasta
Positional arguments:
inVcf Input VCF file
outFasta Output FASTA file
Options:
--trim_ends=False
 If specified, we will strip off continuous runs of N’s from the beginning and end of the sequences before writing to output. Interior N’s will not be changed.
--min_coverage=3
 Specify minimum read coverage (with full agreement) to make a call. [default: %(default)s]
--major_cutoff=0.5
 If the major allele is present at a frequency higher than this cutoff, we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position. [default: %(default)s]
--min_dp_ratio=0.0
 The input VCF file often reports two read depth values (DP)–one for the position as a whole, and one for the sample in question. We can optionally reject calls in which the sample read count is below a specified fraction of the total read count. This filter will not apply to any sites unless both DP values are reported. [default: %(default)s]
--name=[] output sequence names (default: reference names in VCF file)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
trim_fasta

Take input sequences (fasta) and trim any continuous sections of N’s from the ends of them. Write trimmed sequences to an output fasta file.

usage: assembly.py trim_fasta [-h]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version]
                              inFasta outFasta
Positional arguments:
inFasta Input fasta file
outFasta Output (trimmed) fasta file
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
deambig_fasta

Take input sequences (fasta) and replace any ambiguity bases with a random unambiguous base from among the possibilities described by the ambiguity code. Write output to fasta file.

usage: assembly.py deambig_fasta [-h]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version]
                                 inFasta outFasta
Positional arguments:
inFasta Input fasta file
outFasta Output fasta file
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
dpdiff

Take input VCF files (all with only one sample each) and report on the discrepancies between the two DP fields (one in INFO and one in the sample’s genotype column).

usage: assembly.py dpdiff [-h]
                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                          [--version]
                          inVcfs [inVcfs ...] outFile
Positional arguments:
inVcfs Input VCF file
outFile Output flat file
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit

interhost.py - species and population-level genetic variation

This script contains a number of utilities for SNP calling, multi-alignment,
phylogenetics, etc.

usage: interhost.py subcommand
Sub-commands:
snpEff

Annotate variants in VCF file with translation consequences using snpEff.

usage: interhost.py snpEff [-h] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version]
                           inVcf genomes [genomes ...] outVcf emailAddress
Positional arguments:
inVcf Input VCF file
genomes genome name
outVcf Output VCF file
emailAddress Your email address. To access the Genbank CoreNucleotide database, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access.
Options:
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
align_mafft

Run the mafft alignment on the input FASTA file.

usage: interhost.py align_mafft [-h] [--localpair | --globalpair]
                                [--preservecase] [--reorder]
                                [--gapOpeningPenalty GAPOPENINGPENALTY]
                                [--ep EP] [--verbose] [--outputAsClustal]
                                [--maxiters MAXITERS] [--threads THREADS]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inFastas [inFastas ...] outFile
Positional arguments:
inFastas Input FASTA files.
outFile Output file containing alignment result (default format: FASTA)
Options:
--localpair All pairwise alignments are computed with the Smith-Waterman algorithm.
--globalpair All pairwise alignments are computed with the Needleman-Wunsch algorithm.
--preservecase Preserve base or aa case, as well as symbols.
--reorder Output is ordered aligned rather than in the order of the input (default: %(default)s).
--gapOpeningPenalty=1.53
 Gap opening penalty (default: %(default)s).
--ep Offset (works like gap extension penalty).
--verbose=False
 Full output (default: %(default)s).
--outputAsClustal
 Write output file in Clustal format rather than FASTA
--maxiters=0 Maximum number of refinement iterations (default: %(default)s). Note: if “–localpair” or “–globalpair” is specified this defaults to 1000.
--threads=-1 Number of processing threads (default: %(default)s, where -1 indicates use of all available cores).
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
multichr_mafft

Run the mafft alignment on a series of chromosomes provided in sample-partitioned FASTA files. Output as FASTA. (i.e. file1.fasta would contain chr1, chr2, chr3; file2.fasta would also contain chr1, chr2, chr3)

usage: interhost.py multichr_mafft [-h] [--localpair | --globalpair]
                                   [--preservecase] [--reorder]
                                   [--gapOpeningPenalty GAPOPENINGPENALTY]
                                   [--ep EP] [--verbose] [--outputAsClustal]
                                   [--maxiters MAXITERS] [--threads THREADS]
                                   [--outFilePrefix OUTFILEPREFIX]
                                   [--sampleRelationFile SAMPLERELATIONFILE]
                                   [--sampleNameListFile SAMPLENAMELISTFILE]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   inFastas [inFastas ...] outDirectory
Positional arguments:
inFastas Input FASTA files.
outDirectory Location for the output files (default is cwd: %(default)s)
Options:
--localpair All pairwise alignments are computed with the Smith-Waterman algorithm.
--globalpair All pairwise alignments are computed with the Needleman-Wunsch algorithm.
--preservecase Preserve base or aa case, as well as symbols.
--reorder Output is ordered aligned rather than in the order of the input (default: %(default)s).
--gapOpeningPenalty=1.53
 Gap opening penalty (default: %(default)s).
--ep Offset (works like gap extension penalty).
--verbose=False
 Full output (default: %(default)s).
--outputAsClustal
 Write output file in Clustal format rather than FASTA
--maxiters=0 Maximum number of refinement iterations (default: %(default)s). Note: if “–localpair” or “–globalpair” is specified this defaults to 1000.
--threads=-1 Number of processing threads (default: %(default)s, where -1 indicates use of all available cores).
--outFilePrefix=aligned
 Prefix for the output file name (default: %(default)s)
--sampleRelationFile
 If the parameter sampleRelationFile is specified (as a file path), a JSON file will be written mapping sample name to sequence position in the output.
--sampleNameListFile
 If the parameter sampleRelationFile is specified (as a file path), a file will be written mapping sample names in the order of their sequence positions in the output.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

intrahost.py - within-host genetic variation (iSNVs)

This script contains a number of utilities for intrahost variant calling and annotation for viral genomes.

usage: intrahost.py subcommand
Sub-commands:
vphaser_one_sample

Input: a single BAM file, representing reads from one sample, mapped to its own consensus assembly. It may contain multiple read groups and libraries. Output: a tab-separated file with no header containing filtered V Phaser-2 output variants with additional column for sequence/chrom name, and library counts and p-values appended to the counts for each allele.

usage: intrahost.py vphaser_one_sample [-h]
                                       [--vphaserNumThreads VPHASERNUMTHREADS]
                                       [--minReadsEach MINREADSEACH]
                                       [--maxBias MAXBIAS]
                                       [--removeDoublyMappedReads]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version]
                                       inBam inConsFasta outTab
Positional arguments:
inBam Input Bam file.
inConsFasta Consensus assembly fasta.
outTab Tab-separated headerless output file.
Options:
--vphaserNumThreads
 Number of threads in call to V-Phaser 2.
--minReadsEach=5
 Minimum number of reads on each strand (default: %(default)s).
--maxBias=10 Maximum allowable ratio of number of reads on the two strands (default: %(default)s). Ignored if minReadsEach = 0.
--removeDoublyMappedReads=False
 When calling V-Phaser, keep reads mapping to more than one contig.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
vphaser

Run V-Phaser 2 on the input file without any additional filtering. Combine the non-header lines of the CHROM.var.raw.txt files it produces, adding CHROM as the first field on each line.

usage: intrahost.py vphaser [-h] [--numThreads NUMTHREADS]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version]
                            inBam outTab
Positional arguments:
inBam Input Bam file.
outTab Tab-separated headerless output file.
Options:
--numThreads Number of threads in call to V-Phaser 2.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
tabfile_rename

Take input tab file and copy to an output file while changing the values in a specific column based on a mapping file. The first line will pass through untouched (it is assumed to be a header).

usage: intrahost.py tabfile_rename [-h] [--col_idx COL]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version]
                                   inFile mapFile outFile
Positional arguments:
inFile Input flat file
mapFile Map file. Two-column headerless file that maps input values to output values. This script will error if there are values in inFile that do not exist in mapFile.
outFile Output flat file
Options:
--col_idx=0 Which column number to replace (0-based index). [default: %(default)s]
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
merge_to_vcf

Combine and convert vPhaser2 parsed filtered output text files into VCF format. Assumption: consensus assemblies used in creating alignments do not extend beyond ends of reference. the number of alignment files equals the number of chromosomes / segments

usage: intrahost.py merge_to_vcf [-h] --samples SAMPLES [SAMPLES ...] --isnvs
                                 ISNVS [ISNVS ...] --alignments ALIGNMENTS
                                 [ALIGNMENTS ...] [--strip_chr_version]
                                 [--naive_filter] [--parse_accession]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version]
                                 refFasta outVcf
Positional arguments:
refFasta The target reference genome. outVcf will use these chromosome names, coordinate spaces, and reference alleles
outVcf Output VCF file containing all variants
Options:
--samples A list of sample names
--isnvs A list of file names from the output of vphaser_one_sample These must be in the SAME ORDER as samples.
--alignments a list of fasta files containing multialignment of input assemblies, with one file per chromosome/segment. Each alignment file will contain a line for each sample, as well as the reference genome to which they were aligned.
--strip_chr_version=False
 If set, strip any trailing version numbers from the chromosome names. If the chromosome name ends with a period followed by integers, this is interepreted as a version number to be removed. This is because Genbank accession numbers are often used by SnpEff databases downstream, but without the corresponding version number. Default is false (leave chromosome names untouched).
--naive_filter=False
 If set, keep only the alleles that have at least two independent libraries of support and allele freq > 0.005. Default is false (do not filter at this stage).
--parse_accession=False
 If set, parse only the accession for the chromosome name. Helpful if snpEff has to create its own database
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
Fws

Compute the Fws statistic on iSNV data. See Manske, 2012 (Nature)

usage: intrahost.py Fws [-h]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version]
                        inVcf outVcf
Positional arguments:
inVcf Input VCF file
outVcf Output VCF file
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
iSNV_table

Convert VCF iSNV data to tabular text

usage: intrahost.py iSNV_table [-h]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version]
                               inVcf outFile
Positional arguments:
inVcf Input VCF file
outFile Output text file
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
iSNP_per_patient

Aggregate tabular iSNP data per patient x position (all time points averaged)

usage: intrahost.py iSNP_per_patient [-h]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version]
                                     inFile outFile
Positional arguments:
inFile Input text file
outFile Output text file
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit

read_utils.py - utilities that manipulate bam and fastq files

Utilities for working with sequence reads, such as converting formats and fixing mate pairs.

usage: read_utils.py subcommand
Sub-commands:
purge_unmated

Use mergeShuffledFastqSeqs to purge unmated reads, and put corresponding reads in the same order. Corresponding sequences must have sequence identifiers of the form SEQID/1 and SEQID/2.

usage: read_utils.py purge_unmated [-h] [--regex REGEX]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   inFastq1 inFastq2 outFastq1 outFastq2
Positional arguments:
inFastq1 Input fastq file; 1st end of paired-end reads.
inFastq2 Input fastq file; 2nd end of paired-end reads.
outFastq1 Output fastq file; 1st end of paired-end reads.
outFastq2 Output fastq file; 2nd end of paired-end reads.
Options:
--regex=^@(\S+)/[1|2]$
 Perl regular expression to parse paired read IDs (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
fastq_to_fasta

Convert from fastq format to fasta format. Warning: output reads might be split onto multiple lines.

usage: read_utils.py fastq_to_fasta [-h]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    inFastq outFasta
Positional arguments:
inFastq Input fastq file.
outFasta Output fasta file.
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
index_fasta_samtools

Index a reference genome for Samtools.

usage: read_utils.py index_fasta_samtools [-h]
                                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                          [--version]
                                          inFasta
Positional arguments:
inFasta Reference genome, FASTA format.
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
index_fasta_picard

Create an index file for a reference genome suitable for Picard/GATK.

usage: read_utils.py index_fasta_picard [-h] [--JVMmemory JVMMEMORY]
                                        [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmp_dir TMP_DIR]
                                        [--tmp_dirKeep]
                                        inFasta
Positional arguments:
inFasta Input reference genome, FASTA format.
Options:
--JVMmemory=512m
 JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard’s CreateSequenceDictionary, OPTIONNAME=value ...
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
mkdup_picard

Mark or remove duplicate reads from BAM file.

usage: read_utils.py mkdup_picard [-h] [--outMetrics OUTMETRICS] [--remove]
                                  [--JVMmemory JVMMEMORY]
                                  [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inBams [inBams ...] outBam
Positional arguments:
inBams Input reads, BAM format.
outBam Output reads, BAM format.
Options:
--outMetrics Output metrics file. Default is to dump to a temp file.
--remove=False Instead of marking duplicates, remove them entirely (default: %(default)s)
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard’s MarkDuplicates, OPTIONNAME=value ...
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
revert_bam_picard

Revert BAM to raw reads

usage: read_utils.py revert_bam_picard [-h] [--JVMmemory JVMMEMORY]
                                       [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inBam outBam
Positional arguments:
inBam Input reads, BAM format.
outBam Output reads, BAM format.
Options:
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard’s RevertSam, OPTIONNAME=value ...
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
picard

Generic Picard runner.

usage: read_utils.py picard [-h] [--JVMmemory JVMMEMORY]
                            [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            command
Positional arguments:
command picard command
Options:
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard, OPTIONNAME=value ...
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
sort_bam

Sort BAM file

usage: read_utils.py sort_bam [-h] [--index] [--md5] [--JVMmemory JVMMEMORY]
                              [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              inBam outBam {unsorted,queryname,coordinate}
Positional arguments:
inBam Input bam file.
outBam Output bam file, sorted.
sortOrder

How to sort the reads. [default: %(default)s]

Possible choices: unsorted, queryname, coordinate

Options:
--index=False Index outBam (default: %(default)s)
--md5=False MD5 checksum outBam (default: %(default)s)
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard’s SortSam, OPTIONNAME=value ...
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
merge_bams

Merge multiple BAMs into one

usage: read_utils.py merge_bams [-h] [--JVMmemory JVMMEMORY]
                                [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inBams [inBams ...] outBam
Positional arguments:
inBams Input bam files.
outBam Output bam file.
Options:
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard’s MergeSamFiles, OPTIONNAME=value ...
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
filter_bam

Filter BAM file by read name

usage: read_utils.py filter_bam [-h] [--exclude] [--JVMmemory JVMMEMORY]
                                [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inBam readList outBam
Positional arguments:
inBam Input bam file.
readList Input file of read IDs.
outBam Output bam file.
Options:
--exclude=False
 If specified, readList is a list of reads to remove from input. Default behavior is to treat readList as an inclusion list (all unnamed reads are removed).
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard’s FilterSamReads, OPTIONNAME=value ...
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
bam_to_fastq

Convert a bam file to a pair of fastq paired-end read files and optional text header.

usage: read_utils.py bam_to_fastq [-h] [--outHeader OUTHEADER]
                                  [--JVMmemory JVMMEMORY]
                                  [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inBam outFastq1 outFastq2
Positional arguments:
inBam Input bam file.
outFastq1 Output fastq file; 1st end of paired-end reads.
outFastq2 Output fastq file; 2nd end of paired-end reads.
Options:
--outHeader Optional text file name that will receive bam header.
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard’s SamToFastq, OPTIONNAME=value ...
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
fastq_to_bam

Convert a pair of fastq paired-end read files and optional text header to a single bam file.

usage: read_utils.py fastq_to_bam [-h]
                                  (--sampleName SAMPLENAME | --header HEADER)
                                  [--JVMmemory JVMMEMORY]
                                  [--picardOptions [PICARDOPTIONS [PICARDOPTIONS ...]]]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inFastq1 inFastq2 outBam
Positional arguments:
inFastq1 Input fastq file; 1st end of paired-end reads.
inFastq2 Input fastq file; 2nd end of paired-end reads.
outBam Output bam file.
Options:
--sampleName Sample name to insert into the read group header.
--header Optional text file containing header.
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--picardOptions=[]
 Optional arguments to Picard’s FastqToSam, OPTIONNAME=value ... Note that header-related options will be overwritten by HEADER if present.
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
split_reads

Split fasta or fastq file into chunks of maxReads reads or into numChunks chunks named outPrefix01, outPrefix02, etc. If both maxReads and numChunks are None, use defaultMaxReads. The number of characters in file names after outPrefix is indexLen; if not specified, use defaultIndexLen.

usage: read_utils.py split_reads [-h]
                                 [--maxReads MAXREADS | --numChunks NUMCHUNKS]
                                 [--indexLen INDEXLEN]
                                 [--format {fastq,fasta}]
                                 [--outSuffix OUTSUFFIX]
                                 inFileName outPrefix
Positional arguments:
inFileName Input fastq or fasta file.
outPrefix Output files will be named ${outPrefix}01${outSuffix}, ${outPrefix}02${outSuffix}...
Options:
--maxReads Maximum number of reads per chunk (default 1000 if neither maxReads nor numChunks is specified).
--numChunks Number of output files, if maxReads is not specified.
--indexLen=2 Number of characters to append to outputPrefix for each output file (default %(default)s). Number of files must not exceed 10^INDEXLEN.
--format=fastq

Input fastq or fasta file (default: %(default)s).

Possible choices: fastq, fasta

--outSuffix= Output filename suffix (e.g. .fastq or .fastq.gz). A suffix ending in .gz will cause the output file to be gzip compressed. Default is no suffix.
split_bam

Split BAM file equally into several output BAM files.

usage: read_utils.py split_bam [-h]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               inBam outBams [outBams ...]
Positional arguments:
inBam Input BAM file.
outBams Output BAM files
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
reheader_bam

Copy a BAM file (inBam to outBam) while renaming elements of the BAM header. The mapping file specifies which (key, old value, new value) mappings. For example: LB lib1 lib_one SM sample1 Sample_1 SM sample2 Sample_2 SM sample3 Sample_3 CN broad BI

usage: read_utils.py reheader_bam [-h]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inBam rgMap outBam
Positional arguments:
inBam Input reads, BAM format.
rgMap Tabular file containing three columns: field, old, new.
outBam Output reads, BAM format.
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
reheader_bams

Copy BAM files while renaming elements of the BAM header. The mapping file specifies which (key, old value, new value) mappings. For example: LB lib1 lib_one SM sample1 Sample_1 SM sample2 Sample_2 SM sample3 Sample_3 CN broad BI FN in1.bam out1.bam FN in2.bam out2.bam

usage: read_utils.py reheader_bams [-h]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   rgMap
Positional arguments:
rgMap Tabular file containing three columns: field, old, new.
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
rmdup_mvicuna_bam

Remove duplicate reads from BAM file using M-Vicuna. The primary advantage to this approach over Picard’s MarkDuplicates tool is that Picard requires that input reads are aligned to a reference, and M-Vicuna can operate on unaligned reads.

usage: read_utils.py rmdup_mvicuna_bam [-h] [--JVMmemory JVMMEMORY]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inBam outBam
Positional arguments:
inBam Input reads, BAM format.
outBam Output reads, BAM format.
Options:
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
dup_remove_mvicuna

Run mvicuna’s duplicate removal operation on paired-end reads.

usage: read_utils.py dup_remove_mvicuna [-h]
                                        [--unpairedOutFastq UNPAIREDOUTFASTQ]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmp_dir TMP_DIR]
                                        [--tmp_dirKeep]
                                        inFastq1 inFastq2 pairedOutFastq1
                                        pairedOutFastq2
Positional arguments:
inFastq1 Input fastq file; 1st end of paired-end reads.
inFastq2 Input fastq file; 2nd end of paired-end reads.
pairedOutFastq1
 Output fastq file; 1st end of paired-end reads.
pairedOutFastq2
 Output fastq file; 2nd end of paired-end reads.
Options:
--unpairedOutFastq
 File name of output unpaired reads
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
rmdup_prinseq_fastq

Run prinseq-lite’s duplicate removal operation on paired-end reads. Also removes reads with more than one N.

usage: read_utils.py rmdup_prinseq_fastq [-h]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmp_dir TMP_DIR]
                                         [--tmp_dirKeep]
                                         inFastq1 inFastq2 outFastq1 outFastq2
Positional arguments:
inFastq1 Input fastq file; 1st end of paired-end reads.
inFastq2 Input fastq file; 2nd end of paired-end reads.
outFastq1 Output fastq file; 1st end of paired-end reads.
outFastq2 Output fastq file; 2nd end of paired-end reads.
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
filter_bam_mapped_only

Samtools to reduce a BAM file to only reads that are aligned (-F 4) with a non-zero mapping quality (-q 1) and are not marked as a PCR/optical duplicate (-F 1024).

usage: read_utils.py filter_bam_mapped_only [-h]
                                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                            [--version] [--tmp_dir TMP_DIR]
                                            [--tmp_dirKeep]
                                            inBam outBam
Positional arguments:
inBam Input aligned reads, BAM format.
outBam Output sorted indexed reads, filtered to aligned-only, BAM format.
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
novoalign

Align reads with Novoalign. Sort and index BAM output.

usage: read_utils.py novoalign [-h] [--options OPTIONS] [--min_qual MIN_QUAL]
                               [--JVMmemory JVMMEMORY]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               inBam refFasta outBam
Positional arguments:
inBam Input reads, BAM format.
refFasta Reference genome, FASTA format, pre-indexed by Novoindex.
outBam Output reads, BAM format (aligned).
Options:
--options=-r Random
 Novoalign options (default: %(default)s)
--min_qual=0 Filter outBam to minimum mapping quality (default: %(default)s)
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
novoindex

Index a FASTA file (reference genome) for use with Novoalign. The input file name must end in ”.fasta”. This will create a new ”.nix” file in the same directory. If it already exists, it will be deleted and regenerated.

usage: read_utils.py novoindex [-h]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version]
                               refFasta
Positional arguments:
refFasta Reference genome, FASTA format.
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
gatk_ug

Call genotypes using the GATK UnifiedGenotyper.

usage: read_utils.py gatk_ug [-h] [--options OPTIONS] [--JVMmemory JVMMEMORY]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             inBam refFasta outVcf
Positional arguments:
inBam Input reads, BAM format.
refFasta Reference genome, FASTA format, pre-indexed by Picard.
outVcf Output calls in VCF format. If this filename ends with .gz, GATK will BGZIP compress the output and produce a Tabix index file as well.
Options:
--options=--min_base_quality_score 15 -ploidy 4
 UnifiedGenotyper options (default: %(default)s)
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
gatk_realign

Local realignment of BAM files with GATK IndelRealigner.

usage: read_utils.py gatk_realign [-h] [--JVMmemory JVMMEMORY]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep] [--threads THREADS]
                                  inBam refFasta outBam
Positional arguments:
inBam Input reads, BAM format, aligned to refFasta.
refFasta Reference genome, FASTA format, pre-indexed by Picard.
outBam Realigned reads.
Options:
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--threads=1 Number of threads (default: %(default)s)
align_and_fix

Take reads, align to reference with Novoalign, mark duplicates with Picard, realign indels with GATK, and optionally filter final file to mapped/non-dupe reads.

usage: read_utils.py align_and_fix [-h] [--outBamAll OUTBAMALL]
                                   [--outBamFiltered OUTBAMFILTERED]
                                   [--novoalign_options NOVOALIGN_OPTIONS]
                                   [--JVMmemory JVMMEMORY] [--threads THREADS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   inBam refFasta
Positional arguments:
inBam Input unaligned reads, BAM format.
refFasta Reference genome, FASTA format, pre-indexed by Picard and Novoalign.
Options:
--outBamAll Aligned, sorted, and indexed reads. Unmapped reads are retained and duplicate reads are marked, not removed.
--outBamFiltered
 Aligned, sorted, and indexed reads. Unmapped reads and duplicate reads are removed from this file.
--novoalign_options=-r Random
 Novoalign options (default: %(default)s)
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--threads=1 Number of threads (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
align_and_count_hits

Take reads, align to reference with Novoalign and return aligned read counts for each reference sequence.

usage: read_utils.py align_and_count_hits [-h] [--includeZeros]
                                          [--JVMmemory JVMMEMORY]
                                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                          [--version] [--tmp_dir TMP_DIR]
                                          [--tmp_dirKeep]
                                          inBam refFasta outCounts
Positional arguments:
inBam Input unaligned reads, BAM format.
refFasta Reference genome, FASTA format, pre-indexed by Picard and Novoalign.
outCounts Output counts file
Options:
--includeZeros=False
 Output lines with no hits (default: %(default)s)
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

reports.py - produce various metrics and reports

Functions to create reports from genomics pipeline data.

usage: reports.py subcommand
Sub-commands:
assembly_stats

Fetch assembly-level statistics for a given sample

usage: reports.py assembly_stats [-h]
                                 [--cov_thresholds COV_THRESHOLDS [COV_THRESHOLDS ...]]
                                 [--assembly_dir ASSEMBLY_DIR]
                                 [--assembly_tmp ASSEMBLY_TMP]
                                 [--align_dir ALIGN_DIR]
                                 [--reads_dir READS_DIR]
                                 [--raw_reads_dir RAW_READS_DIR]
                                 samples [samples ...] outFile
Positional arguments:
samples Sample names.
outFile Output report file.
Options:
--cov_thresholds=(1, 5, 20, 100)
 Genome coverage thresholds to report on. (default: %(default)s)
--assembly_dir=data/02_assembly
 Directory with assembly outputs. (default: %(default)s)
--assembly_tmp=tmp/02_assembly
 Directory with assembly temp files. (default: %(default)s)
--align_dir=data/02_align_to_self
 Directory with reads aligned to own assembly. (default: %(default)s)
--reads_dir=data/01_per_sample
 Directory with unaligned filtered read BAMs. (default: %(default)s)
--raw_reads_dir=data/00_raw
 Directory with unaligned raw read BAMs. (default: %(default)s)
alignment_summary

Write or print pairwise alignment summary information for sequences in two FASTA files, including SNPs, ambiguous bases, and indels.

usage: reports.py alignment_summary [-h] [--outfileName OUTFILENAME]
                                    [--printCounts]
                                    inFastaFileOne inFastaFileTwo
Positional arguments:
inFastaFileOne First fasta file for an alignment
inFastaFileTwo First fasta file for an alignment
Options:
--outfileName Output file for counts in TSV format
--printCounts=False
 Undocumented
consolidate_fastqc

Consolidate multiple FASTQC reports into one.

usage: reports.py consolidate_fastqc [-h] inDirs [inDirs ...] outFile
Positional arguments:
inDirs Input FASTQC directories.
outFile Output report file.
consolidate_spike_count

Consolidate multiple spike count reports into one.

usage: reports.py consolidate_spike_count [-h] inDir outFile
Positional arguments:
inDir Input spike count directory.
outFile Output report file.

illumina.py - for raw Illumina outputs

Utilities for demultiplexing Illumina data.

usage: illumina.py subcommand
Sub-commands:
illumina_demux

Demultiplex Illumina runs & produce BAM files, one per sample. Wraps together Picard’s ExtractBarcodes and IlluminaBasecallsToSam while handling the various required input formats. Also can read Illumina BCL directories, tar.gz BCL directories. TO DO: read BCL or tar.gz BCL directories from S3 / object store.

usage: illumina.py illumina_demux [-h] [--outMetrics OUTMETRICS]
                                  [--commonBarcodes COMMONBARCODES]
                                  [--sampleSheet SAMPLESHEET]
                                  [--flowcell FLOWCELL]
                                  [--read_structure READ_STRUCTURE]
                                  [--max_mismatches MAX_MISMATCHES]
                                  [--minimum_base_quality MINIMUM_BASE_QUALITY]
                                  [--min_mismatch_delta MIN_MISMATCH_DELTA]
                                  [--max_no_calls MAX_NO_CALLS]
                                  [--minimum_quality MINIMUM_QUALITY]
                                  [--compress_outputs COMPRESS_OUTPUTS]
                                  [--sequencing_center SEQUENCING_CENTER]
                                  [--adapters_to_check [ADAPTERS_TO_CHECK [ADAPTERS_TO_CHECK ...]]]
                                  [--platform PLATFORM]
                                  [--max_reads_in_ram_per_tile MAX_READS_IN_RAM_PER_TILE]
                                  [--max_records_in_ram MAX_RECORDS_IN_RAM]
                                  [--num_processors NUM_PROCESSORS]
                                  [--apply_eamss_filter APPLY_EAMSS_FILTER]
                                  [--force_gc FORCE_GC]
                                  [--first_tile FIRST_TILE]
                                  [--tile_limit TILE_LIMIT]
                                  [--include_non_pf_reads INCLUDE_NON_PF_READS]
                                  [--run_start_date RUN_START_DATE]
                                  [--read_group_id READ_GROUP_ID]
                                  [--JVMmemory JVMMEMORY]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inDir lane outDir
Positional arguments:
inDir Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.
lane Lane number.
outDir Output directory for BAM files.
Options:
--outMetrics Output ExtractIlluminaBarcodes metrics file. Default is to dump to a temp file.
--commonBarcodes
 Write a TSV report of all barcode counts, in descending order.
--sampleSheet Override SampleSheet. Input tab or CSV file w/header and four named columns: barcode_name, library_name, barcode_sequence_1, barcode_sequence_2. Default is to look for a SampleSheet.csv in the inDir.
--flowcell Override flowcell ID (default: read from RunInfo.xml).
--read_structure
 Override read structure (default: read from RunInfo.xml).
--max_mismatches=0
 Picard ExtractIlluminaBarcodes MAX_MISMATCHES (default: %(default)s)
--minimum_base_quality=25
 Picard ExtractIlluminaBarcodes MINIMUM_BASE_QUALITY (default: %(default)s)
--min_mismatch_delta
 Picard ExtractIlluminaBarcodes MIN_MISMATCH_DELTA (default: %(default)s)
--max_no_calls Picard ExtractIlluminaBarcodes MAX_NO_CALLS (default: %(default)s)
--minimum_quality
 Picard ExtractIlluminaBarcodes MINIMUM_QUALITY (default: %(default)s)
--compress_outputs
 Picard ExtractIlluminaBarcodes COMPRESS_OUTPUTS (default: %(default)s)
--sequencing_center
 Picard IlluminaBasecallsToSam SEQUENCING_CENTER (default: %(default)s)
--adapters_to_check=('PAIRED_END', 'NEXTERA_V1', 'NEXTERA_V2')
 Picard IlluminaBasecallsToSam ADAPTERS_TO_CHECK (default: %(default)s)
--platform Picard IlluminaBasecallsToSam PLATFORM (default: %(default)s)
--max_reads_in_ram_per_tile=100000
 Picard IlluminaBasecallsToSam MAX_READS_IN_RAM_PER_TILE (default: %(default)s)
--max_records_in_ram=100000
 Picard IlluminaBasecallsToSam MAX_RECORDS_IN_RAM (default: %(default)s)
--num_processors=4
 Picard IlluminaBasecallsToSam NUM_PROCESSORS (default: %(default)s)
--apply_eamss_filter
 Picard IlluminaBasecallsToSam APPLY_EAMSS_FILTER (default: %(default)s)
--force_gc=False
 Picard IlluminaBasecallsToSam FORCE_GC (default: %(default)s)
--first_tile Picard IlluminaBasecallsToSam FIRST_TILE (default: %(default)s)
--tile_limit Picard IlluminaBasecallsToSam TILE_LIMIT (default: %(default)s)
--include_non_pf_reads=False
 Picard IlluminaBasecallsToSam INCLUDE_NON_PF_READS (default: %(default)s)
--run_start_date
 Picard IlluminaBasecallsToSam RUN_START_DATE (default: %(default)s)
--read_group_id
 Picard IlluminaBasecallsToSam READ_GROUP_ID (default: %(default)s)
--JVMmemory=54g
 JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
common_barcodes

Extract Illumina barcodes for a run and write a TSV report of the barcode counts in descending order

usage: illumina.py common_barcodes [-h] [--truncateToLength TRUNCATETOLENGTH]
                                   [--omitHeader] [--includeNoise]
                                   [--outMetrics OUTMETRICS]
                                   [--sampleSheet SAMPLESHEET]
                                   [--flowcell FLOWCELL]
                                   [--read_structure READ_STRUCTURE]
                                   [--max_mismatches MAX_MISMATCHES]
                                   [--minimum_base_quality MINIMUM_BASE_QUALITY]
                                   [--min_mismatch_delta MIN_MISMATCH_DELTA]
                                   [--max_no_calls MAX_NO_CALLS]
                                   [--minimum_quality MINIMUM_QUALITY]
                                   [--compress_outputs COMPRESS_OUTPUTS]
                                   [--JVMmemory JVMMEMORY]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   inDir lane outSummary
Positional arguments:
inDir Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.
lane Lane number.
outSummary Path to the summary file (.tsv format). It includes two columns: (barcode, count)
Options:
--truncateToLength
 If specified, only this number of barcodes will be returned. Useful if you only want the top N barcodes.
--omitHeader=False
 If specified, a header will not be added to the outSummary tsv file.
--includeNoise=False
 If specified, barcodes with periods (”.”) will be included.
--outMetrics Output ExtractIlluminaBarcodes metrics file. Default is to dump to a temp file.
--sampleSheet Override SampleSheet. Input tab or CSV file w/header and four named columns: barcode_name, library_name, barcode_sequence_1, barcode_sequence_2. Default is to look for a SampleSheet.csv in the inDir.
--flowcell Override flowcell ID (default: read from RunInfo.xml).
--read_structure
 Override read structure (default: read from RunInfo.xml).
--max_mismatches=0
 Picard ExtractIlluminaBarcodes MAX_MISMATCHES (default: %(default)s)
--minimum_base_quality=25
 Picard ExtractIlluminaBarcodes MINIMUM_BASE_QUALITY (default: %(default)s)
--min_mismatch_delta
 Picard ExtractIlluminaBarcodes MIN_MISMATCH_DELTA (default: %(default)s)
--max_no_calls Picard ExtractIlluminaBarcodes MAX_NO_CALLS (default: %(default)s)
--minimum_quality
 Picard ExtractIlluminaBarcodes MINIMUM_QUALITY (default: %(default)s)
--compress_outputs
 Picard ExtractIlluminaBarcodes COMPRESS_OUTPUTS (default: %(default)s)
--JVMmemory=54g
 JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
miseq_fastq_to_bam

Convert fastq read files to a single bam file. Fastq file names must conform to patterns emitted by Miseq machines. Sample metadata must be provided in a SampleSheet.csv that corresponds to the fastq filename. Specifically, the _S##_ index in the fastq file name will be used to find the corresponding row in the SampleSheet

usage: illumina.py miseq_fastq_to_bam [-h] [--inFastq2 INFASTQ2]
                                      [--runInfo RUNINFO]
                                      [--sequencing_center SEQUENCING_CENTER]
                                      [--JVMmemory JVMMEMORY]
                                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                      [--version] [--tmp_dir TMP_DIR]
                                      [--tmp_dirKeep]
                                      outBam sampleSheet inFastq1
Positional arguments:
outBam Output BAM file.
sampleSheet Input SampleSheet.csv file.
inFastq1 Input fastq file; 1st end of paired-end reads if paired.
Options:
--inFastq2 Input fastq file; 2nd end of paired-end reads.
--runInfo Input RunInfo.xml file.
--sequencing_center
 Name of your sequencing center (default is the sequencing machine ID from the RunInfo.xml)
--JVMmemory=2g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
extract_fc_metadata

Extract RunInfo.xml and SampleSheet.csv from the provided Illumina directory

usage: illumina.py extract_fc_metadata [-h]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       flowcell outRunInfo outSampleSheet
Positional arguments:
flowcell Illumina directory (possibly tarball)
outRunInfo Output RunInfo.xml file.
outSampleSheet Output SampleSheet.csv file.
Options:
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

broad_utils.py - for data generated at the Broad Institute

Utilities for getting sequences out of the Broad walk-up sequencing pipeline. These utilities are probably not of much use outside the Broad.

usage: broad_utils.py subcommand
Sub-commands:
get_bustard_dir

Find the basecalls directory from a Picard directory

usage: broad_utils.py get_bustard_dir [-h]
                                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                      inDir
Positional arguments:
inDir Picard directory
Options:
--loglevel=ERROR
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

get_run_date

Find the sequencing run date from a Picard directory

usage: broad_utils.py get_run_date [-h]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   inDir
Positional arguments:
inDir Picard directory
Options:
--loglevel=ERROR
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

get_all_names

Get all samples

usage: broad_utils.py get_all_names [-h]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    {samples,libraries,runs} runfile
Positional arguments:
type

Type of name

Possible choices: samples, libraries, runs

runfile File with seq run information
Options:
--loglevel=ERROR
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Using the Snakemake pipelines

Rather than chaining together viral-ngs pipeline steps as series of tool commands called in isolation, it is possible to execute them as a complete automated pipeline, from processing raw sequencer output to creating files suitable for GenBank submission. This utilizes Snakemake, which is documented at: https://bitbucket.org/snakemake/snakemake/wiki/Home

Here is an overview of the Snakemake rule graph:

_images/rulegraph.png

Setting up the Python 3 virtual environment

Note that Python 3.4 is required to use these tools with Snakemake. It is recommended to create a virtual environment within which all of the viral-ngs dependencies can be installed:

pyvenv-3.4 venv-viral-ngs
cd venv-viral-ngs
source bin/activate

Once the virtual environment has been created and activated, the viral-ngs dependencies can be installed via pip:

pip install -r requirements.txt
pip install -r requirements-pipes.txt

Note: To resume normal use of the system installation of python, call the “deactivate” command in your shell. See the official venv documentation for more information on Python3 virtual environments.

In addition to the dependencies installed via pip, the pipline needs the standard dependencies described in the main viral-ngs installation section.

Note: If running on the Broad Institute UGER cluster environment, import the following dotkits prior to activating the virtualenv:

use .python-3.4.3
use .oracle-java-jdk-1.7.0-51-x86-64
use .bzip2-1.0.6
use .zlib-1.2.6
use .gcc-4.5.3

Setting up an analysis directory

Copying and creating project directories and files

The Snakemake pipline is intended to be run on an input one or more sequencer bam files, each having a filename represending a sample name. The output files are named with the same sample names, and are organized into folders corresponding to the steps of the pipeline in which they were created.

To get started, create an analysis directory somewhere in your compute environment to contain the pipeline input and output files.

Into this directory, copy the following file from the viral-ngs/pipes directory:

config.yaml
Snakefile

Since the file config.yaml is project-specific, you will need to make changes to it as approprate for your usage. The config file changes are described in greater detail below.

Next, cd to the analysis directory and create symbolic links to the following:

  • The viral-ngs virtual environment:

    ln -s /path/to/venv-viral-ngs venv

  • The viral-ngs project, checked out from GitHub or extracted from a version-tagged archive:

    ln -s /path/to/viral-ngs bin

Within the analysis directory, create the directories and files used by the Snakemake pipeline:

data/
    00_raw/
    01_cleaned/
    01_per_sample/
    02_align_to_self/
    02_assembly/
    03_align_to_ref/
    03_interhost/
    04_intrahost/
log/
reports/
tmp/

The directory structure created needs to match the locations specified in config.yaml.

Adding input data
  • Copy each of the raw sample bam files to the 00_raw/ directory and ensure the file names follow the convention of {sample}.bam.

  • Create a file, samples-depletion.txt, to list all of the samples that should be run through the depletion pipeline, with one samplename per line as {sample}, following the format of the input bam file: {sample}.bam. For example, if you copied a file called “G1190.bam” into 00_raw/, then then samples-depletion.txt would contain the line:

    G1190

  • Create a file, samples-assembly.txt, to list all of the samples that should be run through the assembly pipeline.

  • Create a file, samples-runs.txt, to list all of the samples that should be run through the interhost analysis pipeline.

  • Create a blank file, samples-assembly-failures.txt, that may be filled in later.

Modifying the config.yaml file

Minimal modification to the config file is necessary, though there are a few things you need to specify:

An email address for when the pipeline fetches reference data from the NCBI via their Entrez API:

email_point_of_contact_for_ncbi: "someone@example.com"

The path to the depletion databases to be used by BMTagger, along with the file prefixes of the specific databases to use. The process for creating BMTagger depletion databases is described in the NIH BMTagger docs.

bmtagger_db_dir: "/path/to/depletion_databases"
bmtagger_dbs_remove:
  - "hg19"
  - "GRCh37.68_ncRNA-GRCh37.68_transcripts-HS_rRNA_mitRNA"
  - "metagenomics_contaminants_v3"

Pre-built depletion databases are available in both *.tar.gz and *.lz4 format, for removing human reads and common metagenomic contaminants:

Note that these databases must be extracted prior to use.

In addition to the databases used by BMTagger, you will need to specify the location and file prefix of the BLAST database to be used for depletion. The process for creating the BLAST database is described in the NIH BLAST docs, and on this website from the University of Oxford.

blast_db_dir: "/path/to/depletion_databases"
blast_db_remove: "metag_v3.ncRNA.mRNA.mitRNA.consensus"

A pre-built depletion database is also available for BLAST:

Note that this database must be extracted prior to use.

Additional databases are needed to perform metagenomic classification using Kraken, Diamond, or Krona.

kraken_db: "/path/to/kraken_full_20150910"

diamond_db: "/path/to/diamond_db/nr"

krona_db: "/path/to/krona"

Pre-built databases for Kraken, Diamond, and Krona are available:

Note that these databases must be extracted prior to use.

An array of the NCBI GenBank CoreNucleotide accessions for the sequences comprising the reference genome to be used for contig assembly as well as for interhost and intrahost variant analysis. In addition, you will need to specify a file prefix to be used to represent the full reference genome file used downstream.

accessions_for_ref_genome_build:
  - "KJ660346.2"

An optional file containing a list of accessions may be specified for filtering reads via LAST. This is intended to narrow to a genus. If this file is not provided, viral-ngs defaults to using the accessions specified for the reference genome.

accessions_file_for_lastal_db_build: "/path/to/lastal_accessions.txt"

A FASTA file to be used by Trimmomatic during assembly to remove contaminents from reads:

trim_clip_db: "/path/to/depletion_databases/contaminants.fasta"

Pre-built databases for Trimmomatic:

A FASTA file containing spike-ins to be reported:

spikeins_db: "/path/to/references/ercc_spike-ins.fasta"
Modifying the Snakefile

Depending on the state of your input data, and where in the pipeline it may enter, it may be necessary to omit certain processing steps. For example, if your sequencing center has already demultiplexed your data and no demultiplexing is needed, you can comment out the following line in the Snakefile:

include: os.path.join(pipesDir, 'demux.rules’)

Running the pipeline

Configuring for your compute platform
Running the pipeline directly

After the above setup is complete, run the pipeline directly by calling snakemake within the analysis directory.

Running the pipeline on GridEngine (UGER)

Within config.yaml, set the “project” to one that exists on the cluster system.

Inside the analysis directory, run the job submission command. Ex.:

use UGER
qsub -cwd -b y -q long -l m_mem_free=4G ./bin/pipes/Broad_UGER/run-pipe.sh

To kill all jobs that exited (qstat status “Eqw”) with an error:

qdel $(qstat | grep Eqw | awk '{print $1}')
Running the pipeline on LSF

Inside the analysis directory, run the job submission command. Ex.:

bsub -o log/run.out -q forest ./bin/pipes/Broad_LSF/run-pipe.sh

If you notice jobs hanging in the PEND state, an upstream job may have failed. Before killing such jobs, verify that the jobs are pending due to their dependency:

bjobs -al | grep -A 1 "PENDING REASONS" | grep -v "PENDING REASONS" | grep -v '^--$'

To kill all PENDing jobs:

bkill `bjobs | grep PEND | awk '{print $1}'` > /dev/null
When things go wrong

The pipeline may fail with errors during execution, usually while generating assemblies with Trinity. If this occurs, examine the output, add the failing sample names to samples-assembly-failures.txt, keeping the good ones in samples-assembly.txt, and re-run the pipeline. Due to sample degradation prior to sequencing in the wet lab, not all samples have the integrity to complete the pipeline, and it may necessary to skip over these samples by adding them to the samples-assembly-failures.txt.

Assembly of pre-filtered reads

Taxonomic filtration of raw reads

Starting from Illumina BCL directories

Developing viral-ngs

Testing with tox and pyenv

Using pyenv with tox can simplify testing locally against multiple different Python versions and dependency environments. To setup your development environment for testing, you first need to perform the following steps:

  1. Install pyenv for your OS https://github.com/yyuu/pyenv#installation
  2. Install the appropriate Python versions:
$ pyenv install 2.7.10 3.4.3 3.5.0
  1. Change directory to the viral-ngs git checkout.
  2. Set the local pyenv versions:
$ pyenv local 3.5.0 3.4.3 2.7.10
  1. Install tox and tox-pyenv
$ pip3 install tox tox-pyenv
  1. Run tox to run tests and check building docs.
$ tox

Testing easy_deploy with Vagrant and Ansible

By default, easy_deploy installs a version of viral-ngs by cloning the github repository, which makes it difficult to test local changes to the playbook. Testing the easy_deploy setup locally can be done using Vagrant and the Ansible playbook with some custom commands. To do so, we need to take the following steps:

  1. Install Vagrant
  2. Install Ansible
  3. Change to the easy_deploy directory
$ cd easy_deploy
  1. Run the easy deploy script to bootstrap the VM and answer the prompts
$ ./run.sh
  1. From now on, run Ansible playbook manually for provisioning: http://docs.ansible.com/ansible/guide_vagrant.html#running-ansible-manually.
  2. Add deploy=sync or deploy=archive to the –extra-vars of the Ansible playbook command
$ ansible-playbook ... --extra-vars=deploy=sync

To test playbook changes in a local repository, the choice of deploy=sync or deploy=archive changes the strategy used to “deploy” viral-ngs into the Vagrant VM.

The deploy=sync option creates a symlink to the synced folder containing the root of your viral-ngs git repo. Therefore, any tool installation or other changes will be reflected on both the host and the guest machine. This is desirable for fast iteration of changes, but makes it difficult to isolate the host’s viral-ngs installation from the installation on the guest VM.

The deploy=archive option performs a git archive on the host’s viral-ngs repository and untars it into the project directory. This is a clean install each time, which can be time consuming due to the need to reinstall all dependencies, and only the current HEAD commit will be reflected in the guest. No uncommitted/dirty changes will be picked up using this method. This is more ideally suited for a finalized clean test of the playbook.