3.2. taxon_filter.py - tools for taxonomic removal or filtration of readsΒΆ

This script contains a number of utilities for filtering NGS reads based on membership or non-membership in a species / genus / taxonomic grouping.

usage: taxon_filter.py subcommand
Sub-commands:
deplete_human

Run the entire depletion pipeline: bmtagger, mvicuna, blastn. Optionally, use lastal to select a specific taxon of interest.

usage: taxon_filter.py deplete_human [-h] [--taxfiltBam TAXFILTBAM]
                                     --bmtaggerDbs BMTAGGERDBS
                                     [BMTAGGERDBS ...] --blastDbs BLASTDBS
                                     [BLASTDBS ...] [--lastDb LASTDB]
                                     [--threads THREADS]
                                     [--JVMmemory JVMMEMORY]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inBam [revertBam] bmtaggerBam rmdupBam
                                     blastnBam
Positional arguments:
inBam Input BAM file.
revertBam Output BAM: read markup reverted with Picard.
bmtaggerBam Output BAM: depleted of human reads with BMTagger.
rmdupBam Output BAM: bmtaggerBam run through M-Vicuna duplicate removal.
blastnBam Output BAM: rmdupBam run through another depletion of human reads with BLASTN.
Options:
--taxfiltBam Output BAM: blastnBam run through taxonomic selection via LASTAL.
--bmtaggerDbs Reference databases (one or more) to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
--blastDbs One or more reference databases for blast to deplete from input.
--lastDb One reference database for last (required if –taxfiltBam is specified).
--threads=4 The number of threads to use in running blastn.
--JVMmemory=4g JVM virtual memory size for Picard FilterSamReads (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
filter_lastal_bam

Restrict input reads to those that align to the given reference database using LASTAL.

usage: taxon_filter.py filter_lastal_bam [-h]
                                         [-n MAX_GAPLESS_ALIGNMENTS_PER_POSITION]
                                         [-l MIN_LENGTH_FOR_INITIAL_MATCHES]
                                         [-L MAX_LENGTH_FOR_INITIAL_MATCHES]
                                         [-m MAX_INITIAL_MATCHES_PER_POSITION]
                                         [--JVMmemory JVMMEMORY]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmp_dir TMP_DIR]
                                         [--tmp_dirKeep]
                                         inBam db outBam
Positional arguments:
inBam Input reads
db Database of taxa we keep
outBam Output reads, filtered to refDb
Options:
-n=1 maximum gapless alignments per query position (default: %(default)s)
-l=5 minimum length for initial matches (default: %(default)s)
-L=50 maximum length for initial matches (default: %(default)s)
-m=100 maximum initial matches per query position (default: %(default)s)
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
deplete_bam_bmtagger

Use bmtagger to deplete input reads against several databases.

usage: taxon_filter.py deplete_bam_bmtagger [-h] [--threads THREADS]
                                            [--JVMmemory JVMMEMORY]
                                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                            [--version] [--tmp_dir TMP_DIR]
                                            [--tmp_dirKeep]
                                            inBam refDbs [refDbs ...] outBam
Positional arguments:
inBam Input BAM file.
refDbs Reference databases (one or more) to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
outBam Output BAM file.
Options:
--threads=4 The number of threads to use in running blastn.
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
deplete_blastn_bam

Use blastn to remove reads that match at least one of the specified databases.

usage: taxon_filter.py deplete_blastn_bam [-h] [--threads THREADS]
                                          [--chunkSize CHUNKSIZE]
                                          [--JVMmemory JVMMEMORY]
                                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                          [--version] [--tmp_dir TMP_DIR]
                                          [--tmp_dirKeep]
                                          inBam refDbs [refDbs ...] outBam
Positional arguments:
inBam Input BAM file.
refDbs One or more reference databases for blast.
outBam Output BAM file with matching reads removed.
Options:
--threads=4 The number of threads to use in running blastn.
--chunkSize=1000000
 FASTA chunk size (default: %(default)s)
--JVMmemory=4g JVM virtual memory size (default: %(default)s)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
lastal_build_db

build a database for use with last based on an input fasta file

usage: taxon_filter.py lastal_build_db [-h]
                                       [--outputFilePrefix OUTPUTFILEPREFIX]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inputFasta outputDirectory
Positional arguments:
inputFasta Location of the input FASTA file
outputDirectory
 Location for the output files (default is cwd: %(default)s)
Options:
--outputFilePrefix
 Prefix for the output file name (default: inputFasta name, sans ”.fasta” extension)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
blastn_build_db

Create a database for use with blastn from an input reference FASTA file

usage: taxon_filter.py blastn_build_db [-h]
                                       [--outputFilePrefix OUTPUTFILEPREFIX]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inputFasta outputDirectory
Positional arguments:
inputFasta Location of the input FASTA file
outputDirectory
 Location for the output files
Options:
--outputFilePrefix
 Prefix for the output file name (default: inputFasta name, sans ”.fasta” extension)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
bmtagger_build_db

Create a database for use with Bmtagger from an input FASTA file.

usage: taxon_filter.py bmtagger_build_db [-h]
                                         [--outputFilePrefix OUTPUTFILEPREFIX]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmp_dir TMP_DIR]
                                         [--tmp_dirKeep]
                                         inputFasta outputDirectory
Positional arguments:
inputFasta Location of the input FASTA file
outputDirectory
 Location for the output files (Where *.bitmask and *.srprism files will be stored)
Options:
--outputFilePrefix
 Prefix for the output file name (default: inputFasta name, sans ”.fasta” extension)
--loglevel=DEBUG
 

Verboseness of output. [default: %(default)s]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: %(default)s]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.