2.4. taxon_filter.py - filter reads by taxonomic membership

This script contains a number of utilities for filtering NGS reads based on membership or non-membership in a species / genus / taxonomic grouping.

usage: taxon_filter.py subcommand

2.4.1. subcommands



Possible choices: deplete, filter_lastal_bam, deplete_bam_bmtagger, deplete_blastn_bam, deplete_bwa_bam, deplete_minimap2_bam, lastal_build_db, bwa_build_db, blastn_build_db, bmtagger_build_db

2.4.2. Sub-commands

2.4.2.1. deplete

Run the entire depletion pipeline: minimap2, bwa, bmtagger, blastn.

taxon_filter.py deplete [-h] [--minimapDbs [MINIMAPDBS ...]]
                        [--bwaDbs [BWADBS ...]]
                        [--bmtaggerDbs [BMTAGGERDBS ...]]
                        [--blastDbs [BLASTDBS ...]]
                        [--srprismMemory SRPRISM_MEMORY]
                        [--chunkSize CHUNKSIZE] [--JVMmemory JVMMEMORY]
                        [--clearTags]
                        [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                        [--doNotSanitize] [--threads THREADS]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                        inBam revertBam minimapBam bwaBam bmtaggerBam
                        blastnBam

2.4.2.1.1. Positional Arguments

inBam

Input BAM file.

revertBam

Output BAM: read markup reverted with Picard.

minimapBam

Output BAM: depleted of reads with minimap2.

bwaBam

Output BAM: depleted of reads with BWA.

bmtaggerBam

Output BAM: depleted of reads with BMTagger.

blastnBam

Output BAM: bmtaggerBam run through another depletion of reads with BLASTN.

2.4.2.1.2. Named Arguments

--minimapDbs

Reference FASTA databases for minimap2 to deplete from input.

Default: ()

--bwaDbs

Reference databases for BWA to deplete from input.

Default: ()

--bmtaggerDbs

Reference databases to deplete from input.

For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.

Default: ()

--blastDbs

Reference databases for BLASTN to deplete from input.

Default: ()

--srprismMemory

Memory for srprism.

Default: 7168

--chunkSize

blastn chunk size (default: 1000000)

Default: 1000000

--JVMmemory

JVM virtual memory size for Picard RevertSam (default: ‘2g’)

Default: '2g'

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When being reverted, picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the bam file is consistent. From the picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see:

https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
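
As an illustration of the synopsis above, the following Python sketch assembles an argument list for the deplete subcommand. The helper name build_deplete_argv, the output naming scheme, and all file names are hypothetical; only the flag names and the order of the six positional arguments come from this reference. The positional arguments are placed first because options such as --minimapDbs accept several values and could otherwise absorb them. The list is only built, not executed, so it can be inspected before being passed to something like subprocess.run.

```python
def build_deplete_argv(in_bam, out_prefix, minimap_dbs=(), bwa_dbs=(),
                       bmtagger_dbs=(), blast_dbs=(), threads=None):
    """Assemble argv for `taxon_filter.py deplete` (hypothetical helper)."""
    argv = ["taxon_filter.py", "deplete", in_bam]
    # Five output BAMs in the order of the usage synopsis:
    # revertBam, minimapBam, bwaBam, bmtaggerBam, blastnBam.
    for stage in ("revert", "minimap", "bwa", "bmtagger", "blastn"):
        argv.append(f"{out_prefix}.{stage}.bam")
    # Multi-value database options go after the positionals.
    for flag, dbs in (("--minimapDbs", minimap_dbs), ("--bwaDbs", bwa_dbs),
                      ("--bmtaggerDbs", bmtagger_dbs), ("--blastDbs", blast_dbs)):
        if dbs:
            argv.append(flag)
            argv.extend(dbs)
    if threads is not None:
        argv += ["--threads", str(threads)]
    return argv


# Example with made-up file names.
argv = build_deplete_argv("sample.bam", "sample",
                          minimap_dbs=["human.fasta"],
                          blast_dbs=["rrna_db"], threads=4)
print(" ".join(argv))
```

Stages whose database list is left empty simply omit the corresponding option, matching the documented default of ().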

2.4.2.2. filter_lastal_bam

Restrict input reads to those that align to the given reference database using LASTAL.

taxon_filter.py filter_lastal_bam [-h]
                                  [-n MAX_GAPLESS_ALIGNMENTS_PER_POSITION]
                                  [-l MIN_LENGTH_FOR_INITIAL_MATCHES]
                                  [-L MAX_LENGTH_FOR_INITIAL_MATCHES]
                                  [-m MAX_INITIAL_MATCHES_PER_POSITION]
                                  [--errorOnReadsInNegControl]
                                  [--negativeControlReadsThreshold NEGATIVE_CONTROL_READS_THRESHOLD]
                                  [--negControlPrefixes [NEG_CONTROL_PREFIXES ...]]
                                  [--threads THREADS]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inBam db outBam

2.4.2.2.1. Positional Arguments

inBam

Input reads

db

Database of taxa we keep

outBam

Output reads, filtered to those matching db

2.4.2.2.2. Named Arguments

-n

maximum gapless alignments per query position (default: 1)

Default: 1

-l

minimum length for initial matches (default: 5)

Default: 5

-L

maximum length for initial matches (default: 50)

Default: 50

-m

maximum initial matches per query position (default: 100)

Default: 100

--errorOnReadsInNegControl

If specified, the function will return an error if there are reads after filtering for samples with names containing: (water,neg,ntc) (default: False)

Default: False

--negativeControlReadsThreshold

maximum number of reads (single-end) or read pairs (paired-end) to tolerate in samples identified as negative controls (default: 0)

Default: 0

--negControlPrefixes

Bam file name prefixes to interpret as negative controls, space-separated (default: [‘neg’, ‘water’, ‘NTC’])

Default: ['neg', 'water', 'NTC']

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
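
The sketch below shows one way to assemble a filter_lastal_bam command in Python, emitting the LASTAL seeding options only when they differ from the defaults listed above. The helper name and file names are hypothetical. It passes --errorOnReadsInNegControl together with --negativeControlReadsThreshold, on the assumption (not stated in this reference) that the threshold is only consulted when the error flag is set.

```python
# Defaults copied from the Named Arguments list for filter_lastal_bam.
LASTAL_DEFAULTS = {"-n": 1, "-l": 5, "-L": 50, "-m": 100}


def build_filter_lastal_argv(in_bam, db, out_bam, overrides=None,
                             neg_control_threshold=None):
    """Assemble argv for `taxon_filter.py filter_lastal_bam` (hypothetical helper)."""
    overrides = overrides or {}
    unknown = set(overrides) - set(LASTAL_DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")
    argv = ["taxon_filter.py", "filter_lastal_bam", in_bam, db, out_bam]
    # Only pass seeding options that differ from the documented defaults.
    for flag, default in LASTAL_DEFAULTS.items():
        value = overrides.get(flag, default)
        if value != default:
            argv += [flag, str(value)]
    # Assumption: the threshold is paired with the error flag.
    if neg_control_threshold is not None:
        argv += ["--errorOnReadsInNegControl",
                 "--negativeControlReadsThreshold", str(neg_control_threshold)]
    return argv


print(build_filter_lastal_argv("sample.bam", "lassa_db", "sample.lassa.bam",
                               overrides={"-l": 7}, neg_control_threshold=10))
```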

2.4.2.3. deplete_bam_bmtagger

Use bmtagger to deplete input reads against several databases.

taxon_filter.py deplete_bam_bmtagger [-h] [--srprismMemory SRPRISM_MEMORY]
                                     [--clearTags]
                                     [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                     [--doNotSanitize] [--JVMmemory JVMMEMORY]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inBam refDbs [refDbs ...] outBam

2.4.2.3.1. Positional Arguments

inBam

Input BAM file.

refDbs

Reference databases (one or more) to deplete from input.

For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.

outBam

Output BAM file.
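
Because the refDbs description names files that must exist before depletion, a small pre-flight check can be useful. The following Python sketch assumes that each database is given as a path prefix and that the files are named by appending the suffixes quoted above; the helper name is hypothetical, and srprism may create further files (the "etc." in the description) that this check does not cover.

```python
import os
import tempfile

# Suffixes named in the refDbs description; not necessarily exhaustive.
REQUIRED_SUFFIXES = (".bitmask", ".srprism.idx", ".srprism.map")


def missing_bmtagger_files(db_prefix):
    """Return the expected index files for db_prefix that do not exist."""
    expected = [db_prefix + suffix for suffix in REQUIRED_SUFFIXES]
    return [path for path in expected if not os.path.exists(path)]


# Demonstration in a throwaway directory with only the bitmask present.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "hg19")
    open(prefix + ".bitmask", "w").close()
    for path in missing_bmtagger_files(prefix):
        print("missing:", os.path.basename(path))
```

An empty return value means all three named files were found for that prefix.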

2.4.2.3.2. Named Arguments

--srprismMemory

Memory for srprism.

Default: 7168

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When being reverted, picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the bam file is consistent. From the picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see:

https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.4. deplete_blastn_bam

Use blastn to remove reads that match at least one of the specified databases.

taxon_filter.py deplete_blastn_bam [-h] [--chunkSize CHUNKSIZE] [--clearTags]
                                   [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                   [--doNotSanitize] [--JVMmemory JVMMEMORY]
                                   [--threads THREADS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   inBam refDbs [refDbs ...] outBam

2.4.2.4.1. Positional Arguments

inBam

Input BAM file.

refDbs

One or more reference databases for blast. An ephemeral database will be created if a fasta file is provided.

outBam

Output BAM file with matching reads removed.

2.4.2.4.2. Named Arguments

--chunkSize

FASTA chunk size (default: 1000000)

Default: 1000000

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When being reverted, picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the bam file is consistent. From the picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see:

https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.5. deplete_bwa_bam

Use BWA to remove reads that match at least one of the specified databases.

taxon_filter.py deplete_bwa_bam [-h] [--JVMmemory JVMMEMORY] [--clearTags]
                                [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                [--doNotSanitize] [--threads THREADS]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inBam refDbs [refDbs ...] outBam

2.4.2.5.1. Positional Arguments

inBam

Input BAM file.

refDbs

One or more reference databases for bwa. An ephemeral database will be created if a fasta file is provided.

outBam

Output BAM file with matching reads removed.

2.4.2.5.2. Named Arguments

--JVMmemory

JVM virtual memory size for Picard RevertSam (default: ‘2g’)

Default: '2g'

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When being reverted, picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the bam file is consistent. From the picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see:

https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
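
To make the effect of --clearTags and --tagsToClear concrete, the Python sketch below drops the listed optional fields from a single SAM-format text record. It is a conceptual illustration only and is not taken from taxon_filter.py, whose own mechanism is not described in this reference. It relies on the SAM convention that a record has eleven mandatory tab-separated fields followed by optional TAG:TYPE:VALUE fields; the function name and the example read are made up.

```python
# Default tag list as shown for --tagsToClear.
DEFAULT_TAGS_TO_CLEAR = ["XT", "X0", "X1", "XA", "AM", "SM",
                         "BQ", "CT", "XN", "OC", "OP"]


def clear_tags(sam_line, tags=DEFAULT_TAGS_TO_CLEAR):
    """Remove the given optional tags from one SAM text record."""
    fields = sam_line.rstrip("\n").split("\t")
    mandatory, optional = fields[:11], fields[11:]
    # Optional fields look like TAG:TYPE:VALUE; compare on the TAG part.
    kept = [f for f in optional if f.split(":", 1)[0] not in tags]
    return "\t".join(mandatory + kept)


# Made-up aligned read carrying three optional fields.
line = ("r1\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII"
        "\tNM:i:0\tXT:A:U\tX0:i:1")
print(clear_tags(line))
```

With the default list, the XT and X0 fields of the example are removed while NM, which is not in the list, is kept.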

2.4.2.6. deplete_minimap2_bam

Use minimap2 to remove reads that match at least one of the specified databases.

taxon_filter.py deplete_minimap2_bam [-h] [--clearTags]
                                     [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                     [--doNotSanitize] [--JVMmemory JVMMEMORY]
                                     [--threads THREADS]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inBam refDbs [refDbs ...] outBam

2.4.2.6.1. Positional Arguments

inBam

Input BAM file.

refDbs

One or more reference FASTA files to deplete from input.

outBam

Output BAM file with matching reads removed.

2.4.2.6.2. Named Arguments

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When being reverted, picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the bam file is consistent. From the picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see:

https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
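
The deplete_blastn_bam, deplete_bwa_bam, and deplete_minimap2_bam subcommands share the positional layout inBam refDbs [refDbs ...] outBam and all accept --threads and --JVMmemory. A single Python helper can therefore build the command line for any of the three. This is a sketch with a hypothetical helper name and made-up file names; deplete_bam_bmtagger is left out because its synopsis does not list --threads.

```python
# Map a short tool name to the subcommand documented in this section.
SUBCOMMANDS = {
    "blastn": "deplete_blastn_bam",
    "bwa": "deplete_bwa_bam",
    "minimap2": "deplete_minimap2_bam",
}


def build_single_deplete_argv(tool, in_bam, ref_dbs, out_bam,
                              threads=None, jvm_memory=None):
    """Assemble argv for one single-aligner depletion step (hypothetical helper)."""
    if tool not in SUBCOMMANDS:
        raise ValueError(f"unknown tool: {tool!r}")
    if not ref_dbs:
        raise ValueError("at least one reference database is required")
    # Positional order: inBam, one or more refDbs, outBam.
    argv = ["taxon_filter.py", SUBCOMMANDS[tool], in_bam, *ref_dbs, out_bam]
    if threads is not None:
        argv += ["--threads", str(threads)]
    if jvm_memory is not None:
        argv += ["--JVMmemory", jvm_memory]
    return argv


print(build_single_deplete_argv("minimap2", "sample.bam",
                                ["human.fasta", "phix.fasta"],
                                "sample.depleted.bam", threads=8))
```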

2.4.2.7. lastal_build_db

Build a database for use with LAST from an input FASTA file.

taxon_filter.py lastal_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inputFasta outputDirectory

2.4.2.7.1. Positional Arguments

inputFasta

Location of the input FASTA file

outputDirectory

Location for the output files (default is cwd: None)

2.4.2.7.2. Named Arguments

--outputFilePrefix

Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.8. bwa_build_db

Create a database for use with bwa from an input reference FASTA file

taxon_filter.py bwa_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             inputFasta outputDirectory

2.4.2.8.1. Positional Arguments

inputFasta

Location of the input FASTA file

outputDirectory

Location for the output files

2.4.2.8.2. Named Arguments

--outputFilePrefix

Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.9. blastn_build_db

Create a database for use with blastn from an input reference FASTA file

taxon_filter.py blastn_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inputFasta outputDirectory

2.4.2.9.1. Positional Arguments

inputFasta

Location of the input FASTA file

outputDirectory

Location for the output files

2.4.2.9.2. Named Arguments

--outputFilePrefix

Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.10. bmtagger_build_db

Create a database for use with BMTagger from an input FASTA file.

taxon_filter.py bmtagger_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                                  [--word_size WORD_SIZE]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inputFasta outputDirectory

2.4.2.10.1. Positional Arguments

inputFasta

Location of the input FASTA file

outputDirectory

Location for the output files (Where *.bitmask and *.srprism files will be stored)

2.4.2.10.2. Named Arguments

--outputFilePrefix

Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)

--word_size

Database word size (default: 18)

Default: 18

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
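
The four *_build_db subcommands take the same positional arguments (inputFasta outputDirectory) and the same --outputFilePrefix option, so their command lines can be assembled by one helper. The Python sketch below also mirrors the documented default prefix by stripping a trailing ".fasta" from the input file name; whether the tool treats other extensions such as ".fa" the same way is not stated here, so the sketch leaves them untouched. Both function names and the file names are hypothetical.

```python
import os

BUILD_TOOLS = ("lastal", "bwa", "blastn", "bmtagger")


def default_output_prefix(input_fasta):
    """File name without a trailing ".fasta", mirroring the documented default."""
    name = os.path.basename(input_fasta)
    return name[:-len(".fasta")] if name.endswith(".fasta") else name


def build_db_argv(tool, input_fasta, output_dir, prefix=None, word_size=None):
    """Assemble argv for `taxon_filter.py <tool>_build_db` (hypothetical helper)."""
    if tool not in BUILD_TOOLS:
        raise ValueError(f"unknown tool: {tool!r}")
    if word_size is not None and tool != "bmtagger":
        raise ValueError("--word_size is only listed for bmtagger_build_db")
    argv = ["taxon_filter.py", f"{tool}_build_db", input_fasta, output_dir]
    if prefix is not None:
        argv += ["--outputFilePrefix", prefix]
    if word_size is not None:
        argv += ["--word_size", str(word_size)]
    return argv


print(default_output_prefix("/refs/human.fasta"))
print(build_db_argv("bmtagger", "/refs/human.fasta", "/dbs", word_size=18))
```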