2.4. taxon_filter.py - filter reads by taxonomic membership
This script contains a number of utilities for filtering NGS reads based on membership or non-membership in a species / genus / taxonomic grouping.
usage: taxon_filter.py subcommand
2.4.2. Sub-commands
2.4.2.1. deplete
Run the entire depletion pipeline: minimap2, bwa, bmtagger, blastn.
taxon_filter.py deplete [-h] [--minimapDbs [MINIMAPDBS ...]]
[--bwaDbs [BWADBS ...]]
[--bmtaggerDbs [BMTAGGERDBS ...]]
[--blastDbs [BLASTDBS ...]]
[--srprismMemory SRPRISM_MEMORY]
[--chunkSize CHUNKSIZE] [--JVMmemory JVMMEMORY]
[--clearTags]
[--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
[--doNotSanitize] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam revertBam minimapBam bwaBam bmtaggerBam
blastnBam
2.4.2.1.1. Positional Arguments
- inBam
Input BAM file.
- revertBam
Output BAM: read markup reverted with Picard.
- minimapBam
Output BAM: depleted of reads with minimap2.
- bwaBam
Output BAM: depleted of reads with BWA.
- bmtaggerBam
Output BAM: depleted of reads with BMTagger.
- blastnBam
Output BAM: bmtaggerBam run through another depletion of reads with BLASTN.
2.4.2.1.2. Named Arguments
- --minimapDbs
Reference FASTA databases for minimap2 to deplete from input.
Default:
()
- --bwaDbs
Reference databases for BWA to deplete from input.
Default:
()
- --bmtaggerDbs
Reference databases for BMTagger to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
Default:
()
- --blastDbs
Reference databases for BLASTN to deplete from input.
Default:
()
- --srprismMemory
Memory for srprism.
Default:
7168
- --chunkSize
blastn chunk size (default: 1000000)
Default:
1000000
- --JVMmemory
JVM virtual memory size for Picard RevertSam (default: ‘2g’)
Default:
'2g'
- --clearTags
When supplying an aligned input file, clear the per-read attribute tags
Default:
False
- --tagsToClear
A space-separated list of tags to remove from all reads in the input BAM file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])
Default:
['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']
- --doNotSanitize
When reads are reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so that the BAM file is consistent. From the Picard documentation:
‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’
For more information see:
https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam
Default:
False
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
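As an illustration of how the positional and named arguments of deplete fit together, the following minimal sketch assembles an argument list in Python without running anything. The helper name build_deplete_cmd and the output file naming scheme (out_prefix plus a stage suffix) are hypothetical and not part of taxon_filter.py; only the subcommand name, positional order, and flags are taken from the usage text above. The sketch assumes an argparse-style parser, where an option that accepts a variable number of values can absorb following positional arguments, so it places the positionals first.

```python
def build_deplete_cmd(in_bam, out_prefix, minimap_dbs=(), bwa_dbs=(),
                      bmtagger_dbs=(), blast_dbs=(), threads=None):
    """Assemble an argv list for `taxon_filter.py deplete` (nothing is executed)."""
    # Hypothetical naming scheme: one output BAM per pipeline stage.
    stage_outputs = [
        f"{out_prefix}.{stage}.bam"
        for stage in ("revert", "minimap", "bwa", "bmtagger", "blastn")
    ]
    # Positional order from the usage text:
    # inBam revertBam minimapBam bwaBam bmtaggerBam blastnBam
    cmd = ["taxon_filter.py", "deplete", in_bam, *stage_outputs]
    # Each database option takes zero or more values; emit it only when non-empty.
    for flag, dbs in (("--minimapDbs", minimap_dbs), ("--bwaDbs", bwa_dbs),
                      ("--bmtaggerDbs", bmtagger_dbs), ("--blastDbs", blast_dbs)):
        if dbs:
            cmd += [flag, *dbs]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to subprocess.run
    # only in an environment where taxon_filter.py is installed.
    print(" ".join(build_deplete_cmd("sample.bam", "sample", bwa_dbs=["hg19"], threads=4)))
```

Building the command as a list rather than a single string avoids shell quoting problems when database paths contain spaces.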
2.4.2.2. filter_lastal_bam
Restrict input reads to those that align to the given reference database using LASTAL.
taxon_filter.py filter_lastal_bam [-h]
[-n MAX_GAPLESS_ALIGNMENTS_PER_POSITION]
[-l MIN_LENGTH_FOR_INITIAL_MATCHES]
[-L MAX_LENGTH_FOR_INITIAL_MATCHES]
[-m MAX_INITIAL_MATCHES_PER_POSITION]
[--errorOnReadsInNegControl]
[--negativeControlReadsThreshold NEGATIVE_CONTROL_READS_THRESHOLD]
[--negControlPrefixes [NEG_CONTROL_PREFIXES ...]]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam db outBam
2.4.2.2.1. Positional Arguments
- inBam
Input reads
- db
Database of taxa we keep
- outBam
Output reads, filtered to refDb
2.4.2.2.2. Named Arguments
- -n
maximum gapless alignments per query position (default: 1)
Default:
1
- -l
minimum length for initial matches (default: 5)
Default:
5
- -L
maximum length for initial matches (default: 50)
Default:
50
- -m
maximum initial matches per query position (default: 100)
Default:
100
- --errorOnReadsInNegControl
If specified, return an error if any reads remain after filtering in samples whose names contain: (water, neg, ntc) (default: False)
Default:
False
- --negativeControlReadsThreshold
maximum number of reads (single-end) or read pairs (paired-end) to tolerate in samples identified as negative controls (default: 0)
Default:
0
- --negControlPrefixes
BAM file name prefixes to interpret as negative controls, space-separated (default: [‘neg’, ‘water’, ‘NTC’])
Default:
['neg', 'water', 'NTC']
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
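The interaction between --errorOnReadsInNegControl, --negativeControlReadsThreshold, and --negControlPrefixes can be sketched as follows. This is not the tool's implementation: the function names are hypothetical, and the sketch assumes a case-insensitive match against the start of the BAM file name, which the option descriptions above suggest but do not spell out.

```python
import os

DEFAULT_PREFIXES = ("neg", "water", "NTC")


def is_negative_control(bam_path, prefixes=DEFAULT_PREFIXES):
    """Assumed rule: the file name starts with one of the prefixes, ignoring case."""
    name = os.path.basename(bam_path).lower()
    return any(name.startswith(prefix.lower()) for prefix in prefixes)


def check_negative_control(bam_path, reads_after_filtering, threshold=0,
                           prefixes=DEFAULT_PREFIXES):
    """Raise if a negative-control sample keeps more reads than the threshold."""
    if is_negative_control(bam_path, prefixes) and reads_after_filtering > threshold:
        raise ValueError(
            f"{bam_path}: {reads_after_filtering} reads remain after filtering, "
            f"which exceeds the negative-control threshold of {threshold}"
        )


if __name__ == "__main__":
    check_negative_control("sample1.bam", 500)   # not a control, never raises
    check_negative_control("water_01.bam", 0)    # control with no reads left
```

With the default threshold of 0, a single surviving read in a negative control is enough to trigger the error.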
2.4.2.3. deplete_bam_bmtagger
Use bmtagger to deplete input reads against several databases.
taxon_filter.py deplete_bam_bmtagger [-h] [--srprismMemory SRPRISM_MEMORY]
[--clearTags]
[--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
[--doNotSanitize] [--JVMmemory JVMMEMORY]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam refDbs [refDbs ...] outBam
2.4.2.3.1. Positional Arguments
- inBam
Input BAM file.
- refDbs
Reference databases (one or more) to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.
- outBam
Output BAM file.
2.4.2.3.2. Named Arguments
- --srprismMemory
Memory for srprism.
Default:
7168
- --clearTags
When supplying an aligned input file, clear the per-read attribute tags
Default:
False
- --tagsToClear
A space-separated list of tags to remove from all reads in the input BAM file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])
Default:
['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']
- --doNotSanitize
When being reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the BAM file is consistent. From the Picard documentation:
‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’
For more information see:
https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam
Default:
False
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
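Because each BMTagger reference database must already have its bmtool and srprism files in place, a quick pre-flight check can catch a missing index before a long run starts. The sketch below is an assumption-laden helper, not part of taxon_filter.py: the function name is hypothetical, and it checks only the three files named explicitly in the refDbs description (the "etc." there means srprism creates further files that this check does not cover).

```python
from pathlib import Path

# Only the suffixes named explicitly in the refDbs description; not exhaustive.
REQUIRED_SUFFIXES = (".bitmask", ".srprism.idx", ".srprism.map")


def missing_bmtagger_files(db_prefix):
    """Return the expected companion files of `db_prefix` that do not exist."""
    return [
        db_prefix + suffix
        for suffix in REQUIRED_SUFFIXES
        if not Path(db_prefix + suffix).exists()
    ]


if __name__ == "__main__":
    # "refs/hg19" is a placeholder database prefix.
    for path in missing_bmtagger_files("refs/hg19"):
        print("missing:", path)
```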
2.4.2.4. deplete_blastn_bam
Use blastn to remove reads that match at least one of the specified databases.
taxon_filter.py deplete_blastn_bam [-h] [--chunkSize CHUNKSIZE] [--clearTags]
[--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
[--doNotSanitize] [--JVMmemory JVMMEMORY]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam refDbs [refDbs ...] outBam
2.4.2.4.1. Positional Arguments
- inBam
Input BAM file.
- refDbs
One or more reference databases for blast. An ephemeral database will be created if a fasta file is provided.
- outBam
Output BAM file with matching reads removed.
2.4.2.4.2. Named Arguments
- --chunkSize
FASTA chunk size (default: 1000000)
Default:
1000000
- --clearTags
When supplying an aligned input file, clear the per-read attribute tags
Default:
False
- --tagsToClear
A space-separated list of tags to remove from all reads in the input BAM file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])
Default:
['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']
- --doNotSanitize
When being reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the BAM file is consistent. From the Picard documentation:
‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’
For more information see:
https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam
Default:
False
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.4.2.5. deplete_bwa_bam
Use BWA to remove reads that match at least one of the specified databases.
taxon_filter.py deplete_bwa_bam [-h] [--JVMmemory JVMMEMORY] [--clearTags]
[--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
[--doNotSanitize] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam refDbs [refDbs ...] outBam
2.4.2.5.1. Positional Arguments
- inBam
Input BAM file.
- refDbs
One or more reference databases for bwa. An ephemeral database will be created if a fasta file is provided.
- outBam
Output BAM file with matching reads removed.
2.4.2.5.2. Named Arguments
- --JVMmemory
JVM virtual memory size for Picard RevertSam (default: ‘2g’)
Default:
'2g'
- --clearTags
When supplying an aligned input file, clear the per-read attribute tags
Default:
False
- --tagsToClear
A space-separated list of tags to remove from all reads in the input BAM file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])
Default:
['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']
- --doNotSanitize
When being reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the BAM file is consistent. From the Picard documentation:
‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’
For more information see:
https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam
Default:
False
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
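To make the effect of --clearTags and --tagsToClear concrete, the sketch below removes the listed optional tags from a single SAM-format text record. It only illustrates the outcome on one alignment line and says nothing about how taxon_filter.py performs the operation on BAM files internally; the function name clear_tags is hypothetical. It relies on the SAM layout of 11 mandatory tab-separated fields followed by optional TAG:TYPE:VALUE fields.

```python
DEFAULT_TAGS_TO_CLEAR = ("XT", "X0", "X1", "XA", "AM", "SM", "BQ", "CT", "XN", "OC", "OP")


def clear_tags(sam_line, tags_to_clear=DEFAULT_TAGS_TO_CLEAR):
    """Drop the given optional tags from one SAM record; headers pass through."""
    line = sam_line.rstrip("\n")
    if line.startswith("@"):
        return line  # header line, nothing to clear
    fields = line.split("\t")
    # The first 11 fields are mandatory; everything after is TAG:TYPE:VALUE.
    mandatory, optional = fields[:11], fields[11:]
    kept = [field for field in optional if field.split(":", 1)[0] not in tags_to_clear]
    return "\t".join(mandatory + kept)


if __name__ == "__main__":
    record = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tXT:A:U\tNM:i:0\tX0:i:1"
    print(clear_tags(record))
```

In this example the XT and X0 tags are removed while NM, which is not in the default list, is kept.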
2.4.2.6. deplete_minimap2_bam
Use minimap2 to remove reads that match at least one of the specified databases.
taxon_filter.py deplete_minimap2_bam [-h] [--clearTags]
[--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
[--doNotSanitize] [--JVMmemory JVMMEMORY]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam refDbs [refDbs ...] outBam
2.4.2.6.1. Positional Arguments
- inBam
Input BAM file.
- refDbs
One or more reference FASTA files to deplete from input.
- outBam
Output BAM file with matching reads removed.
2.4.2.6.2. Named Arguments
- --clearTags
When supplying an aligned input file, clear the per-read attribute tags
Default:
False
- --tagsToClear
A space-separated list of tags to remove from all reads in the input BAM file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])
Default:
['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']
- --doNotSanitize
When being reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the BAM file is consistent. From the Picard documentation:
‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’
For more information see:
https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam
Default:
False
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.4.2.7. lastal_build_db
Build a database for use with LAST from an input FASTA file.
taxon_filter.py lastal_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inputFasta outputDirectory
2.4.2.7.1. Positional Arguments
- inputFasta
Location of the input FASTA file
- outputDirectory
Location for the output files (default is cwd: None)
2.4.2.7.2. Named Arguments
- --outputFilePrefix
Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.4.2.8. bwa_build_db
Create a database for use with bwa from an input reference FASTA file
taxon_filter.py bwa_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inputFasta outputDirectory
2.4.2.8.1. Positional Arguments
- inputFasta
Location of the input FASTA file
- outputDirectory
Location for the output files
2.4.2.8.2. Named Arguments
- --outputFilePrefix
Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.4.2.9. blastn_build_db
Create a database for use with blastn from an input reference FASTA file
taxon_filter.py blastn_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inputFasta outputDirectory
2.4.2.9.1. Positional Arguments
- inputFasta
Location of the input FASTA file
- outputDirectory
Location for the output files
2.4.2.9.2. Named Arguments
- --outputFilePrefix
Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.4.2.10. bmtagger_build_db
Create a database for use with BMTagger from an input FASTA file.
taxon_filter.py bmtagger_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
[--word_size WORD_SIZE]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inputFasta outputDirectory
2.4.2.10.1. Positional Arguments
- inputFasta
Location of the input FASTA file
- outputDirectory
Location for the output files
2.4.2.10.2. Named Arguments
- --outputFilePrefix
Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)
- --word_size
Database word size (default: 18)
Default:
18
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
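The four *_build_db subcommands share the same positional arguments and the same --outputFilePrefix default, which the sketch below tries to capture. Both function names are hypothetical helpers rather than part of taxon_filter.py. The prefix function strips only a trailing ".fasta", as the option description states; how the tool treats other extensions such as ".fa" is not documented here, so the sketch leaves them untouched. The commands are assembled as lists and not executed.

```python
import os

BUILD_SUBCOMMANDS = ("lastal_build_db", "bwa_build_db", "blastn_build_db", "bmtagger_build_db")


def default_output_prefix(input_fasta):
    """File name of `input_fasta` without its directory or a trailing '.fasta'."""
    name = os.path.basename(input_fasta)
    return name[:-len(".fasta")] if name.endswith(".fasta") else name


def build_db_cmd(subcommand, input_fasta, output_directory,
                 output_file_prefix=None, word_size=None):
    """Assemble an argv list for one of the *_build_db subcommands."""
    if subcommand not in BUILD_SUBCOMMANDS:
        raise ValueError(f"unknown build subcommand: {subcommand}")
    if word_size is not None and subcommand != "bmtagger_build_db":
        # --word_size appears only in the bmtagger_build_db usage text.
        raise ValueError("--word_size is only documented for bmtagger_build_db")
    cmd = ["taxon_filter.py", subcommand, input_fasta, output_directory]
    if output_file_prefix is not None:
        cmd += ["--outputFilePrefix", output_file_prefix]
    if word_size is not None:
        cmd += ["--word_size", str(word_size)]
    return cmd


if __name__ == "__main__":
    # Placeholder paths; prints one command per database type.
    for sub in BUILD_SUBCOMMANDS:
        print(" ".join(build_db_cmd(sub, "refs/hg19.fasta", "dbs")))
    print(default_output_prefix("refs/hg19.fasta"))
```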