2.4. taxon_filter.py - filter reads by taxonomic membership

This script contains a number of utilities for filtering NGS reads based on membership or non-membership in a species / genus / taxonomic grouping.

usage: taxon_filter.py subcommand

2.4.1. subcommands

Possible choices: deplete, filter_lastal_bam, deplete_bam_bmtagger, deplete_blastn_bam, deplete_bwa_bam, deplete_minimap2_bam, lastal_build_db, bwa_build_db, blastn_build_db, bmtagger_build_db

2.4.2. Sub-commands

2.4.2.1. deplete

Run the entire depletion pipeline: minimap2, bwa, bmtagger, blastn.

taxon_filter.py deplete [-h] [--minimapDbs [MINIMAPDBS ...]]
                        [--bwaDbs [BWADBS ...]]
                        [--bmtaggerDbs [BMTAGGERDBS ...]]
                        [--blastDbs [BLASTDBS ...]]
                        [--srprismMemory SRPRISM_MEMORY]
                        [--chunkSize CHUNKSIZE] [--JVMmemory JVMMEMORY]
                        [--clearTags]
                        [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                        [--doNotSanitize] [--threads THREADS]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                        inBam revertBam minimapBam bwaBam bmtaggerBam
                        blastnBam

2.4.2.1.1. Positional Arguments

inBam: Input BAM file.
revertBam: Output BAM: read markup reverted with Picard.
minimapBam: Output BAM: depleted of reads with minimap2.
bwaBam: Output BAM: depleted of reads with BWA.
bmtaggerBam: Output BAM: depleted of reads with BMTagger.
blastnBam: Output BAM: bmtaggerBam run through another depletion of reads with BLASTN.
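
As a rough illustration of how the six positional arguments line up with the optional database lists, the sketch below assembles an argument vector for deplete without running it. The helper name build_deplete_cmd, the output file naming scheme, and the database names in the usage line are hypothetical; only the flags and the argument order are taken from the usage text above.

```python
import shlex
from typing import List, Sequence


def build_deplete_cmd(
    in_bam: str,
    out_prefix: str,
    minimap_dbs: Sequence[str] = (),
    bwa_dbs: Sequence[str] = (),
    bmtagger_dbs: Sequence[str] = (),
    blast_dbs: Sequence[str] = (),
    threads: int = 2,
) -> List[str]:
    """Assemble (but do not run) a `taxon_filter.py deplete` command line."""
    cmd = ["taxon_filter.py", "deplete"]
    # Positional arguments, in the order given by the usage text.
    # The "<prefix>.<stage>.bam" naming is an assumption made for this sketch.
    cmd += [
        in_bam,
        f"{out_prefix}.revert.bam",
        f"{out_prefix}.minimap.bam",
        f"{out_prefix}.bwa.bam",
        f"{out_prefix}.bmtagger.bam",
        f"{out_prefix}.blastn.bam",
    ]
    # Each database option takes zero or more values; omit the flag when empty.
    for flag, dbs in (
        ("--minimapDbs", minimap_dbs),
        ("--bwaDbs", bwa_dbs),
        ("--bmtaggerDbs", bmtagger_dbs),
        ("--blastDbs", blast_dbs),
    ):
        if dbs:
            cmd.append(flag)
            cmd.extend(dbs)
    cmd += ["--threads", str(threads)]
    return cmd


if __name__ == "__main__":
    # Hypothetical inputs, printed as a shell-quoted command line.
    print(shlex.join(build_deplete_cmd("in.bam", "sample", bwa_dbs=["hg19"])))
```

Placing the positional arguments before the multi-value options is a choice made here on the assumption that the script uses argparse-style parsing, where an option that accepts several values would otherwise absorb the positional paths that follow it.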

2.4.2.1.2. Named Arguments

--minimapDbs

Reference FASTA databases for minimap2 to deplete from input.

Default: ()

--bwaDbs

Reference databases for BWA to deplete from input.

Default: ()

--bmtaggerDbs

Reference databases to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.

Default: ()

--blastDbs

Reference databases for BLASTN to deplete from input.

Default: ()

--srprismMemory

Memory for srprism.

Default: 7168

--chunkSize

blastn chunk size (default: 1000000)

Default: 1000000

--JVMmemory

JVM virtual memory size for Picard RevertSam (default: ‘2g’)

Default: '2g'

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When the input is reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so that the BAM file is consistent. From the Picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see: https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
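
To make the effect of --clearTags and --tagsToClear concrete, the sketch below removes the default tag set from a single text-format SAM record. It relies only on the SAM convention that a record has 11 mandatory tab-separated fields followed by optional TAG:TYPE:VALUE fields. The function name clear_tags is hypothetical, and this is not how the script itself performs the step, which this page does not describe.

```python
from typing import Sequence

# Default list copied from the --tagsToClear description above.
DEFAULT_TAGS_TO_CLEAR = (
    "XT", "X0", "X1", "XA", "AM", "SM", "BQ", "CT", "XN", "OC", "OP",
)


def clear_tags(sam_line: str, tags: Sequence[str] = DEFAULT_TAGS_TO_CLEAR) -> str:
    """Return the SAM record with the listed optional tags removed."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns are mandatory; everything after is TAG:TYPE:VALUE.
    mandatory, optional = fields[:11], fields[11:]
    kept = [f for f in optional if f.split(":", 1)[0] not in tags]
    return "\t".join(mandatory + kept)


if __name__ == "__main__":
    # A made-up aligned record carrying one tag from the default list (XA).
    record = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tXA:Z:chr2,+5,4M,0;"
    print(clear_tags(record))
```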

2.4.2.2. filter_lastal_bam

Restrict input reads to those that align to the given reference database using LASTAL.

taxon_filter.py filter_lastal_bam [-h]
                                  [-n MAX_GAPLESS_ALIGNMENTS_PER_POSITION]
                                  [-l MIN_LENGTH_FOR_INITIAL_MATCHES]
                                  [-L MAX_LENGTH_FOR_INITIAL_MATCHES]
                                  [-m MAX_INITIAL_MATCHES_PER_POSITION]
                                  [--errorOnReadsInNegControl]
                                  [--negativeControlReadsThreshold NEGATIVE_CONTROL_READS_THRESHOLD]
                                  [--negControlPrefixes [NEG_CONTROL_PREFIXES ...]]
                                  [--threads THREADS]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inBam db outBam

2.4.2.2.1. Positional Arguments

inBam: Input reads
db: Database of taxa we keep
outBam: Output reads, filtered to refDb

2.4.2.2.2. Named Arguments

-n

maximum gapless alignments per query position (default: 1)

Default: 1

-l

minimum length for initial matches (default: 5)

Default: 5

-L

maximum length for initial matches (default: 50)

Default: 50

-m

maximum initial matches per query position (default: 100)

Default: 100

--errorOnReadsInNegControl

If specified, return an error if reads remain after filtering for samples whose names contain: (water,neg,ntc) (default: False)

Default: False

--negativeControlReadsThreshold

maximum number of reads (single-end) or read pairs (paired-end) to tolerate in samples identified as negative controls (default: 0)

Default: 0

--negControlPrefixes

Bam file name prefixes to interpret as negative controls, space-separated (default: [‘neg’, ‘water’, ‘NTC’])

Default: ['neg', 'water', 'NTC']

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
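
The interaction between --errorOnReadsInNegControl, --negativeControlReadsThreshold and --negControlPrefixes can be pictured with the sketch below. It is one plausible reading of the option descriptions, not the script's actual logic: it assumes that the prefixes are matched against the BAM file's base name, that matching is case-insensitive, and that an error is raised only when the read count exceeds the threshold. Both function names are hypothetical.

```python
import os
from typing import Sequence

# Default list copied from the --negControlPrefixes description above.
DEFAULT_NEG_CONTROL_PREFIXES = ("neg", "water", "NTC")


def is_negative_control(
    bam_path: str, prefixes: Sequence[str] = DEFAULT_NEG_CONTROL_PREFIXES
) -> bool:
    """Assumed rule: the file's base name starts with one of the prefixes."""
    name = os.path.basename(bam_path).lower()
    return any(name.startswith(p.lower()) for p in prefixes)


def check_negative_control(
    bam_path: str,
    reads_after_filter: int,
    threshold: int = 0,
    prefixes: Sequence[str] = DEFAULT_NEG_CONTROL_PREFIXES,
) -> None:
    """Raise ValueError if a negative control keeps more reads than tolerated."""
    if is_negative_control(bam_path, prefixes) and reads_after_filter > threshold:
        raise ValueError(
            f"{bam_path}: {reads_after_filter} reads remain after filtering, "
            f"more than the tolerated {threshold} for a negative control"
        )
```

With the default threshold of 0, any surviving read in a file such as a hypothetical NTC_plate1.bam would trigger the error under these assumptions, while a sample file with an unrelated name would pass regardless of its read count.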

2.4.2.3. deplete_bam_bmtagger

Use bmtagger to deplete input reads against several databases.

taxon_filter.py deplete_bam_bmtagger [-h] [--srprismMemory SRPRISM_MEMORY]
                                     [--clearTags]
                                     [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                     [--doNotSanitize] [--JVMmemory JVMMEMORY]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inBam refDbs [refDbs ...] outBam

2.4.2.3.1. Positional Arguments

inBam

Input BAM file.

refDbs

Reference databases (one or more) to deplete from input. For each db, requires prior creation of db.bitmask by bmtool, and db.srprism.idx, db.srprism.map, etc. by srprism mkindex.

outBam

Output BAM file.
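
Because each refDbs entry depends on files created beforehand, a quick pre-flight check can save a failed run. The sketch below looks only for the three files named explicitly in the refDbs description (db.bitmask, db.srprism.idx, db.srprism.map); the "etc." in that description means further srprism files may be needed that are not checked here. The function name missing_bmtagger_files is hypothetical.

```python
from pathlib import Path
from typing import List

# Only the suffixes named in the refDbs description; others may also be required.
REQUIRED_SUFFIXES = (".bitmask", ".srprism.idx", ".srprism.map")


def missing_bmtagger_files(db_prefix: str) -> List[str]:
    """Return the expected database files that do not exist for this prefix."""
    return [
        db_prefix + suffix
        for suffix in REQUIRED_SUFFIXES
        if not Path(db_prefix + suffix).is_file()
    ]


if __name__ == "__main__":
    # Hypothetical database prefix; prints whatever is missing on this machine.
    for path in missing_bmtagger_files("/refs/bmtagger/hg19"):
        print("missing:", path)
```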

2.4.2.3.2. Named Arguments

--srprismMemory

Memory for srprism.

Default: 7168

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When the input is reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so that the BAM file is consistent. From the Picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see: https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.4. deplete_blastn_bam

Use blastn to remove reads that match at least one of the specified databases.

taxon_filter.py deplete_blastn_bam [-h] [--chunkSize CHUNKSIZE] [--clearTags]
                                   [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                   [--doNotSanitize] [--JVMmemory JVMMEMORY]
                                   [--threads THREADS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   inBam refDbs [refDbs ...] outBam

2.4.2.4.1. Positional Arguments

inBam: Input BAM file.
refDbs: One or more reference databases for blast. An ephemeral database will be created if a fasta file is provided.
outBam: Output BAM file with matching reads removed.

2.4.2.4.2. Named Arguments

--chunkSize

FASTA chunk size (default: 1000000)

Default: 1000000

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When the input is reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so that the BAM file is consistent. From the Picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see: https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
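
The --chunkSize option suggests that the input is processed in fixed-size pieces. This page does not say what unit the chunk size counts, so the sketch below simply assumes it is a number of sequence records and shows the general chunking pattern. The function name chunked is hypothetical and is not taken from the script.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(records: Iterable[T], chunk_size: int = 1000000) -> Iterator[List[T]]:
    """Yield consecutive lists of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    # Five placeholder read names split into chunks of two.
    for chunk in chunked(["read1", "read2", "read3", "read4", "read5"], chunk_size=2):
        print(len(chunk), chunk)
```

Under this assumption, a smaller chunk size would mean more and smaller pieces, and the last chunk may hold fewer records than the others.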

2.4.2.5. deplete_bwa_bam

Use BWA to remove reads that match at least one of the specified databases.

taxon_filter.py deplete_bwa_bam [-h] [--JVMmemory JVMMEMORY] [--clearTags]
                                [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                [--doNotSanitize] [--threads THREADS]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inBam refDbs [refDbs ...] outBam

2.4.2.5.1. Positional Arguments

inBam: Input BAM file.
refDbs: One or more reference databases for bwa. An ephemeral database will be created if a fasta file is provided.
outBam: Output BAM file with matching reads removed.

2.4.2.5.2. Named Arguments

--JVMmemory

JVM virtual memory size for Picard RevertSam (default: ‘2g’)

Default: '2g'

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When the input is reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so that the BAM file is consistent. From the Picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see: https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.6. deplete_minimap2_bam

Use minimap2 to remove reads that match at least one of the specified databases.

taxon_filter.py deplete_minimap2_bam [-h] [--clearTags]
                                     [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                     [--doNotSanitize] [--JVMmemory JVMMEMORY]
                                     [--threads THREADS]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inBam refDbs [refDbs ...] outBam

2.4.2.6.1. Positional Arguments

inBam: Input BAM file.
refDbs: One or more reference FASTA files to deplete from input.
outBam: Output BAM file with matching reads removed.

2.4.2.6.2. Named Arguments

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']

--doNotSanitize

When the input is reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so that the BAM file is consistent. From the Picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see: https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.7. lastal_build_db

Build a database for use with LAST from an input FASTA file.

taxon_filter.py lastal_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inputFasta outputDirectory

2.4.2.7.1. Positional Arguments

inputFasta: Location of the input FASTA file
outputDirectory: Location for the output files (default: current working directory)

2.4.2.7.2. Named Arguments

--outputFilePrefix

Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
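
The default for --outputFilePrefix described above (the inputFasta name without its “.fasta” extension) can be sketched as follows. The function name default_output_prefix is hypothetical. Leaving other extensions such as “.fa” untouched is an assumption based on the wording of the option, since this page only mentions “.fasta”.

```python
import os


def default_output_prefix(input_fasta: str) -> str:
    """Derive the default prefix: the file's base name minus a '.fasta' suffix."""
    name = os.path.basename(input_fasta)
    if name.endswith(".fasta"):
        name = name[: -len(".fasta")]
    # Assumption: any other extension is kept as part of the prefix.
    return name


if __name__ == "__main__":
    # Hypothetical input path.
    print(default_output_prefix("/refs/human_rrna.fasta"))
```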

2.4.2.8. bwa_build_db

Create a database for use with bwa from an input reference FASTA file

taxon_filter.py bwa_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             inputFasta outputDirectory

2.4.2.8.1. Positional Arguments

inputFasta: Location of the input FASTA file
outputDirectory: Location for the output files

2.4.2.8.2. Named Arguments

--outputFilePrefix

Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.9. blastn_build_db

Create a database for use with blastn from an input reference FASTA file

taxon_filter.py blastn_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inputFasta outputDirectory

2.4.2.9.1. Positional Arguments

inputFasta: Location of the input FASTA file
outputDirectory: Location for the output files

2.4.2.9.2. Named Arguments

--outputFilePrefix

Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.4.2.10. bmtagger_build_db

Create a database for use with BMTagger from an input FASTA file.

taxon_filter.py bmtagger_build_db [-h] [--outputFilePrefix OUTPUTFILEPREFIX]
                                  [--word_size WORD_SIZE]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inputFasta outputDirectory

2.4.2.10.1. Positional Arguments

inputFasta: Location of the input FASTA file
outputDirectory: Location for the output files (Where *.bitmask and *.srprism files will be stored)

2.4.2.10.2. Named Arguments

--outputFilePrefix

Prefix for the output file name (default: inputFasta name, sans “.fasta” extension)

--word_size

Database word size (default: 18)

Default: 18

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False