3.2. taxon_filter.py - tools for taxonomic removal or filtration of reads

This script contains a number of utilities for filtering NGS reads based on membership or non-membership in a species / genus / taxonomic grouping.

usage: taxon_filter.py subcommand

Sub-commands:

deplete

Run the entire depletion pipeline: bwa, bmtagger, mvicuna, blastn.

usage: taxon_filter.py deplete [-h] [--bwaDbs [BWADBS [BWADBS ...]]]
                               [--bmtaggerDbs [BMTAGGERDBS [BMTAGGERDBS ...]]]
                               [--blastDbs [BLASTDBS [BLASTDBS ...]]]
                               [--srprismMemory SRPRISM_MEMORY]
                               [--chunkSize CHUNKSIZE] [--JVMmemory JVMMEMORY]
                               [--clearTags]
                               [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                               [--doNotSanitize] [--threads THREADS]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               inBam [revertBam] bwaBam bmtaggerBam rmdupBam
                               blastnBam

Positional arguments:

`inBam`	Input BAM file.
`revertBam`	Output BAM: read markup reverted with Picard.
`bwaBam`	Output BAM: depleted of reads with BWA.
`bmtaggerBam`	Output BAM: depleted of reads with BMTagger.
`rmdupBam`	Output BAM: bmtaggerBam run through M-Vicuna duplicate removal.
`blastnBam`	Output BAM: rmdupBam run through another depletion of reads with BLASTN.

Options:

`--bwaDbs=()`	Reference databases for bwa to deplete from input.
`--bmtaggerDbs=()`
	Reference databases to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
`--blastDbs=()`	Reference databases for blast to deplete from input.
`--srprismMemory=7168`
	Memory for srprism.
`--chunkSize=1000000`
	blastn chunk size (default: 1000000)
`--JVMmemory=4g`	JVM virtual memory size for Picard FilterSamReads (default: 4g)
`--clearTags=False`
	When supplying an aligned input file, clear the per-read attribute tags
`--tagsToClear=['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']`
	A space-separated list of tags to remove from all reads in the input bam file (default: XT X0 X1 XA AM SM BQ CT XN OC OP)
`--doNotSanitize=False`
	Undocumented
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
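
As a minimal sketch of how the positional arguments and database options above fit together, the following Python snippet assembles a `deplete` command line. The helper `build_deplete_cmd`, the output file-naming scheme, and the database paths are hypothetical and used only for illustration. Because the database options accept several values each, the sketch places them after the positional BAM paths so that they cannot absorb those paths as extra database names.

```python
import shlex


def build_deplete_cmd(in_bam, out_prefix, bwa_dbs=(), bmtagger_dbs=(),
                      blast_dbs=(), threads=None):
    """Return an argv list for `taxon_filter.py deplete` (hypothetical helper)."""
    cmd = ["taxon_filter.py", "deplete", in_bam]
    # Output BAMs in usage-line order: revertBam, bwaBam, bmtaggerBam,
    # rmdupBam, blastnBam. The file-naming scheme is an assumption.
    for stage in ("revert", "bwa", "bmtagger", "rmdup", "blastn"):
        cmd.append(f"{out_prefix}.{stage}.bam")
    # Multi-value options come last so they cannot absorb the BAM paths.
    for flag, dbs in (("--bwaDbs", bwa_dbs),
                      ("--bmtaggerDbs", bmtagger_dbs),
                      ("--blastDbs", blast_dbs)):
        if dbs:
            cmd += [flag, *dbs]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Hypothetical sample and database paths.
deplete_cmd = build_deplete_cmd(
    "sample.bam", "sample",
    bwa_dbs=["db/human_bwa"],
    bmtagger_dbs=["db/human_bmtagger", "db/rrna_bmtagger"],
    blast_dbs=["db/human_blast"],
    threads=4,
)
print(shlex.join(deplete_cmd))
```

The resulting list can be handed to `subprocess.run(deplete_cmd, check=True)` on a system where `taxon_filter.py` is on the `PATH`.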

deplete_human

A wrapper around ‘deplete’; deprecated but preserved for legacy compatibility.

usage: taxon_filter.py deplete_human [-h] [--bwaDbs [BWADBS [BWADBS ...]]]
                                     [--bmtaggerDbs [BMTAGGERDBS [BMTAGGERDBS ...]]]
                                     [--blastDbs [BLASTDBS [BLASTDBS ...]]]
                                     [--srprismMemory SRPRISM_MEMORY]
                                     [--chunkSize CHUNKSIZE]
                                     [--JVMmemory JVMMEMORY] [--clearTags]
                                     [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                     [--doNotSanitize] [--threads THREADS]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inBam [revertBam] bwaBam bmtaggerBam
                                     rmdupBam blastnBam

Positional arguments:

`inBam`	Input BAM file.
`revertBam`	Output BAM: read markup reverted with Picard.
`bwaBam`	Output BAM: depleted of reads with BWA.
`bmtaggerBam`	Output BAM: depleted of reads with BMTagger.
`rmdupBam`	Output BAM: bmtaggerBam run through M-Vicuna duplicate removal.
`blastnBam`	Output BAM: rmdupBam run through another depletion of reads with BLASTN.

Options:

`--bwaDbs=()`	Reference databases for bwa to deplete from input.
`--bmtaggerDbs=()`
	Reference databases to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
`--blastDbs=()`	Reference databases for blast to deplete from input.
`--srprismMemory=7168`
	Memory for srprism.
`--chunkSize=1000000`
	blastn chunk size (default: 1000000)
`--JVMmemory=4g`	JVM virtual memory size for Picard FilterSamReads (default: 4g)
`--clearTags=False`
	When supplying an aligned input file, clear the per-read attribute tags
`--tagsToClear=['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']`
	A space-separated list of tags to remove from all reads in the input bam file (default: XT X0 X1 XA AM SM BQ CT XN OC OP)
`--doNotSanitize=False`
	Undocumented
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

filter_lastal_bam

Restrict input reads to those that align to the given reference database using LASTAL.

usage: taxon_filter.py filter_lastal_bam [-h]
                                         [-n MAX_GAPLESS_ALIGNMENTS_PER_POSITION]
                                         [-l MIN_LENGTH_FOR_INITIAL_MATCHES]
                                         [-L MAX_LENGTH_FOR_INITIAL_MATCHES]
                                         [-m MAX_INITIAL_MATCHES_PER_POSITION]
                                         [--JVMmemory JVMMEMORY]
                                         [--threads THREADS]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmp_dir TMP_DIR]
                                         [--tmp_dirKeep]
                                         inBam db outBam

Positional arguments:

`inBam`	Input reads
`db`	Database of taxa we keep
`outBam`	Output reads, restricted to those that align to db

Options:

`-n=1`	maximum gapless alignments per query position (default: 1)
`-l=5`	minimum length for initial matches (default: 5)
`-L=50`	maximum length for initial matches (default: 50)
`-m=100`	maximum initial matches per query position (default: 100)
`--JVMmemory=4g`	JVM virtual memory size (default: 4g)
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
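
The sketch below shows one way to assemble a `filter_lastal_bam` command that overrides a single LASTAL tuning option and leaves the others at their documented defaults. The helper name `build_filter_lastal_cmd`, its keyword names, and all file and database paths are hypothetical.

```python
import shlex


def build_filter_lastal_cmd(in_bam, db, out_bam, max_gapless=None,
                            min_seed_len=None, max_seed_len=None,
                            max_seeds=None):
    """Return an argv list for `taxon_filter.py filter_lastal_bam`."""
    cmd = ["taxon_filter.py", "filter_lastal_bam"]
    # Only emit a flag when the caller overrides the documented default
    # (-n 1, -l 5, -L 50, -m 100).
    for flag, value in (("-n", max_gapless), ("-l", min_seed_len),
                        ("-L", max_seed_len), ("-m", max_seeds)):
        if value is not None:
            cmd += [flag, str(value)]
    # Positional arguments in usage-line order: inBam db outBam.
    cmd += [in_bam, db, out_bam]
    return cmd


# Hypothetical paths; raises the minimum initial match length to 8.
lastal_cmd = build_filter_lastal_cmd(
    "sample.depleted.bam", "db/lassa", "sample.filtered.bam", min_seed_len=8)
print(shlex.join(lastal_cmd))
```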

deplete_bam_bmtagger

Use bmtagger to deplete input reads against several databases.

usage: taxon_filter.py deplete_bam_bmtagger [-h]
                                            [--srprismMemory SRPRISM_MEMORY]
                                            [--JVMmemory JVMMEMORY]
                                            [--clearTags]
                                            [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                            [--doNotSanitize]
                                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                            [--version] [--tmp_dir TMP_DIR]
                                            [--tmp_dirKeep]
                                            inBam refDbs [refDbs ...] outBam

Positional arguments:

`inBam`	Input BAM file.
`refDbs`	Reference databases (one or more) to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
`outBam`	Output BAM file.

Options:

`--srprismMemory=7168`
	Memory for srprism.
`--JVMmemory=4g`	JVM virtual memory size (default: 4g)
`--clearTags=False`
	When supplying an aligned input file, clear the per-read attribute tags
`--tagsToClear=['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']`
	A space-separated list of tags to remove from all reads in the input bam file (default: XT X0 X1 XA AM SM BQ CT XN OC OP)
`--doNotSanitize=False`
	Undocumented
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
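
Since each BMTagger database must already have its bitmask and srprism index files, a pre-flight check can catch a missing file before a long run starts. The following is a minimal sketch, assuming the files sit next to each other under a common prefix as the `refDbs` description suggests. The function `missing_bmtagger_files` is hypothetical, and it checks only the three suffixes named above, not the further files implied by "etc.".

```python
import os
import tempfile

# Suffixes named in the refDbs description; srprism may need further
# index files that this check does not cover.
REQUIRED_SUFFIXES = (".bitmask", ".srprism.idx", ".srprism.map")


def missing_bmtagger_files(db_prefix):
    """Return the expected database files that do not exist for db_prefix."""
    return [db_prefix + suffix for suffix in REQUIRED_SUFFIXES
            if not os.path.isfile(db_prefix + suffix)]


# Demonstration against a throwaway directory holding an incomplete database.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "human")
    open(prefix + ".bitmask", "w").close()
    missing = [os.path.basename(p) for p in missing_bmtagger_files(prefix)]
    print(missing)
```

An empty list means all three named files are present for that prefix.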

deplete_blastn_bam

Use blastn to remove reads that match at least one of the specified databases.

usage: taxon_filter.py deplete_blastn_bam [-h] [--chunkSize CHUNKSIZE]
                                          [--JVMmemory JVMMEMORY]
                                          [--clearTags]
                                          [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                          [--doNotSanitize]
                                          [--threads THREADS]
                                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                          [--version] [--tmp_dir TMP_DIR]
                                          [--tmp_dirKeep]
                                          inBam refDbs [refDbs ...] outBam

Positional arguments:

`inBam`	Input BAM file.
`refDbs`	One or more reference databases for blast. An ephemeral database will be created if a fasta file is provided.
`outBam`	Output BAM file with matching reads removed.

Options:

`--chunkSize=1000000`
	FASTA chunk size (default: 1000000)
`--JVMmemory=4g`	JVM virtual memory size (default: 4g)
`--clearTags=False`
	When supplying an aligned input file, clear the per-read attribute tags
`--tagsToClear=['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']`
	A space-separated list of tags to remove from all reads in the input bam file (default: XT X0 X1 XA AM SM BQ CT XN OC OP)
`--doNotSanitize=False`
	Undocumented
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

deplete_bwa_bam

Use BWA to remove reads that match at least one of the specified databases.

usage: taxon_filter.py deplete_bwa_bam [-h] [--clearTags]
                                       [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                       [--doNotSanitize]
                                       [--JVMmemory JVMMEMORY]
                                       [--threads THREADS]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inBam refDbs [refDbs ...] outBam

Positional arguments:

`inBam`	Input BAM file.
`refDbs`	One or more reference databases for bwa. An ephemeral database will be created if a fasta file is provided.
`outBam`	Output BAM file with matching reads removed.

Options:

`--clearTags=False`
	When supplying an aligned input file, clear the per-read attribute tags
`--tagsToClear=['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']`
	A space-separated list of tags to remove from all reads in the input bam file (default: XT X0 X1 XA AM SM BQ CT XN OC OP)
`--doNotSanitize=False`
	Undocumented
`--JVMmemory=2g`	JVM virtual memory size (default: 2g)
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
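
The three single-stage subcommands (`deplete_bam_bmtagger`, `deplete_blastn_bam`, `deplete_bwa_bam`) share the positional shape `inBam refDbs [refDbs ...] outBam`, so the last positional argument is always the output BAM. The sketch below illustrates that shape and chains two stages by feeding one stage's output to the next. The helper `build_stage_cmd` and every path are hypothetical.

```python
import shlex

# Short tool name -> subcommand, as listed in this section.
STAGE_SUBCOMMANDS = {
    "bmtagger": "deplete_bam_bmtagger",
    "blastn": "deplete_blastn_bam",
    "bwa": "deplete_bwa_bam",
}


def build_stage_cmd(tool, in_bam, ref_dbs, out_bam):
    """Return an argv list for one single-stage depletion subcommand."""
    if not ref_dbs:
        raise ValueError("at least one reference database is required")
    # Shape: inBam refDbs [refDbs ...] outBam
    return ["taxon_filter.py", STAGE_SUBCOMMANDS[tool], in_bam, *ref_dbs, out_bam]


# Two chained stages: the BWA output becomes the blastn input.
stage_cmds = [
    build_stage_cmd("bwa", "sample.bam", ["db/human_bwa"], "sample.bwa.bam"),
    build_stage_cmd("blastn", "sample.bwa.bam",
                    ["db/human_blast", "db/rrna_blast"], "sample.blastn.bam"),
]
for stage_cmd in stage_cmds:
    print(shlex.join(stage_cmd))
```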

lastal_build_db

Build a database for use with LAST from an input FASTA file.

usage: taxon_filter.py lastal_build_db [-h]
                                       [--outputFilePrefix OUTPUTFILEPREFIX]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inputFasta outputDirectory

Positional arguments:

`inputFasta`	Location of the input FASTA file
`outputDirectory`
	Location for the output files (default is cwd)

Options:

`--outputFilePrefix`
	Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

bwa_build_db

Create a database for use with bwa from an input reference FASTA file

usage: taxon_filter.py bwa_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    inputFasta outputDirectory

Positional arguments:

`inputFasta`	Location of the input FASTA file
`outputDirectory`
	Location for the output files

Options:

`--outputFilePrefix`
	Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

blastn_build_db

Create a database for use with blastn from an input reference FASTA file

usage: taxon_filter.py blastn_build_db [-h]
                                       [--outputFilePrefix OUTPUTFILEPREFIX]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       inputFasta outputDirectory

Positional arguments:

`inputFasta`	Location of the input FASTA file
`outputDirectory`
	Location for the output files

Options:

`--outputFilePrefix`
	Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

bmtagger_build_db

Create a database for use with BMTagger from an input FASTA file.

usage: taxon_filter.py bmtagger_build_db [-h]
                                         [--outputFilePrefix OUTPUTFILEPREFIX]
                                         [--word_size WORD_SIZE]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmp_dir TMP_DIR]
                                         [--tmp_dirKeep]
                                         inputFasta outputDirectory

Positional arguments:

`inputFasta`	Location of the input FASTA file
`outputDirectory`
	Location for the output files (Where .bitmask and .srprism files will be stored)

Options:

`--outputFilePrefix`
	Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)
`--word_size=18`	Database word size (default: 18)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
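
The four `*_build_db` subcommands take the same two positional arguments and the same `--outputFilePrefix` option. The sketch below assembles such a command and mimics the documented default prefix (the input file name without its ".fasta" extension). The helpers `default_prefix` and `build_db_cmd` are hypothetical. Dropping the directory part of the path, and leaving other extensions such as ".fa" untouched, are assumptions rather than confirmed behaviour of the tool.

```python
import os

BUILD_SUBCOMMANDS = ("lastal_build_db", "bwa_build_db",
                     "blastn_build_db", "bmtagger_build_db")


def default_prefix(input_fasta):
    """Approximate the documented default: file name without `.fasta`."""
    name = os.path.basename(input_fasta)
    # Assumption: only a literal ".fasta" suffix is stripped.
    return name[:-len(".fasta")] if name.endswith(".fasta") else name


def build_db_cmd(subcommand, input_fasta, output_dir, prefix=None):
    """Return an argv list for one of the *_build_db subcommands."""
    if subcommand not in BUILD_SUBCOMMANDS:
        raise ValueError(f"unknown build subcommand: {subcommand}")
    cmd = ["taxon_filter.py", subcommand, input_fasta, output_dir]
    if prefix is not None:
        cmd += ["--outputFilePrefix", prefix]
    return cmd


# Hypothetical paths.
build_cmd = build_db_cmd("bmtagger_build_db", "refs/human.fasta", "db",
                         prefix="human_bmtagger")
print(build_cmd)
print(default_prefix("refs/human.fasta"))
```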