2.2. read_utils.py - utilities that manipulate bam and fastq files

Utilities for working with sequence reads, such as converting formats and fixing mate pairs.

usage: read_utils.py subcommand

2.2.1. subcommands



Possible choices: index_fasta_samtools, index_fasta_picard, mkdup_picard, revert_bam_picard, picard, sort_bam, downsample_bams, merge_bams, filter_bam, fastq_to_bam, join_paired_fastq, split_bam, reheader_bam, reheader_bams, rmdup_cdhit_bam, rmdup_mvicuna_bam, rmdup_bbnorm_bam, rmdup_prinseq_fastq, filter_bam_mapped_only, novoalign, novoindex, gatk_ug, gatk_realign, align_and_fix, minimap2_idxstats, bwamem_idxstats, extract_tarball, read_names, trim_rmdup_subsamp

2.2.2. Sub-commands

2.2.2.1. index_fasta_samtools

Index a reference genome for Samtools.

read_utils.py index_fasta_samtools [-h]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version]
                                   inFasta

2.2.2.1.1. Positional Arguments

inFasta

Reference genome, FASTA format.

2.2.2.1.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.2.2.2. index_fasta_picard

Create an index file for a reference genome suitable for Picard/GATK.

read_utils.py index_fasta_picard [-h] [--JVMmemory JVMMEMORY]
                                 [--picardOptions [PICARDOPTIONS ...]]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmp_dir TMP_DIR]
                                 [--tmp_dirKeep]
                                 inFasta

2.2.2.2.1. Positional Arguments

inFasta

Input reference genome, FASTA format.

2.2.2.2.2. Named Arguments

--JVMmemory

JVM virtual memory size (default: ‘512m’)

Default: '512m'

--picardOptions

Optional arguments to Picard’s CreateSequenceDictionary, OPTIONNAME=value …

Default: []

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.3. mkdup_picard

Mark or remove duplicate reads from BAM file.

read_utils.py mkdup_picard [-h] [--outMetrics OUTMETRICS] [--remove]
                           [--JVMmemory JVMMEMORY]
                           [--picardOptions [PICARDOPTIONS ...]]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           inBams [inBams ...] outBam

2.2.2.3.1. Positional Arguments

inBams

Input reads, BAM format.

outBam

Output reads, BAM format.

2.2.2.3.2. Named Arguments

--outMetrics

Output metrics file. Default is to dump to a temp file.

--remove

Instead of marking duplicates, remove them entirely (default: False)

Default: False

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--picardOptions

Optional arguments to Picard’s MarkDuplicates, OPTIONNAME=value …

Default: []

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.4. revert_bam_picard

Revert BAM to raw reads

read_utils.py revert_bam_picard [-h] [--JVMmemory JVMMEMORY]
                                [--picardOptions [PICARDOPTIONS ...]]
                                [--clearTags]
                                [--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
                                [--doNotSanitize]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inBam outBam

2.2.2.4.1. Positional Arguments

inBam

Input reads, BAM format.

outBam

Output reads, BAM format.

2.2.2.4.2. Named Arguments

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--picardOptions

Optional arguments to Picard’s RevertSam, OPTIONNAME=value …

Default: []

--clearTags

When supplying an aligned input file, clear the per-read attribute tags

Default: False

--tagsToClear

A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])

Default: ['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']
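To illustrate what clearing per-read tags means, here is a minimal Python sketch that operates on a single SAM-format text line. It is not how read_utils.py or Picard performs the operation; the helper name clear_tags and the sample record are hypothetical, and only the default tag list is taken from the option above.

```python
# Default list copied from the --tagsToClear option description.
DEFAULT_TAGS_TO_CLEAR = ["XT", "X0", "X1", "XA", "AM", "SM", "BQ", "CT", "XN", "OC", "OP"]


def clear_tags(sam_line, tags_to_clear=DEFAULT_TAGS_TO_CLEAR):
    """Drop optional TAG:TYPE:VALUE fields whose TAG is in tags_to_clear.

    Hypothetical helper for illustration only.
    """
    fields = sam_line.rstrip("\n").split("\t")
    # A SAM alignment line has 11 mandatory fields followed by optional tags.
    mandatory, optional = fields[:11], fields[11:]
    drop = set(tags_to_clear)
    kept = [f for f in optional if f.split(":", 1)[0] not in drop]
    return "\t".join(mandatory + kept)


# Made-up record: XT is in the default list, NM and RG are not.
line = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tXT:A:U\tNM:i:0\tRG:Z:lib1"
print(clear_tags(line))
```
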

--doNotSanitize

When reverting, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so the bam file is consistent. From the picard documentation:

‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’

For more information see:

https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.5. picard

Generic Picard runner.

read_utils.py picard [-h] [--JVMmemory JVMMEMORY]
                     [--picardOptions [PICARDOPTIONS ...]]
                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                     [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                     command

2.2.2.5.1. Positional Arguments

command

picard command

2.2.2.5.2. Named Arguments

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--picardOptions

Optional arguments to Picard, OPTIONNAME=value …

Default: []

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.6. sort_bam

Sort BAM file

read_utils.py sort_bam [-h] [--index] [--md5] [--JVMmemory JVMMEMORY]
                       [--picardOptions [PICARDOPTIONS ...]]
                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                       [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                       inBam outBam {unsorted,queryname,coordinate}

2.2.2.6.1. Positional Arguments

inBam

Input bam file.

outBam

Output bam file, sorted.

sortOrder

Possible choices: unsorted, queryname, coordinate

How to sort the reads. [default: ‘coordinate’]

Default: 'coordinate'
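As a hedged illustration of how the positional arguments and flags from the usage line combine, the following Python sketch assembles a sort_bam command line. The helper build_sort_bam_cmd and the file names are hypothetical and not part of read_utils.py; only the subcommand name, the positional order, and the --index and --md5 flags come from the usage text.

```python
import shlex

# Choices listed for the sortOrder positional argument.
SORT_ORDERS = ("unsorted", "queryname", "coordinate")


def build_sort_bam_cmd(in_bam, out_bam, sort_order="coordinate", index=False, md5=False):
    """Assemble an argv list for `read_utils.py sort_bam` (illustrative helper)."""
    if sort_order not in SORT_ORDERS:
        raise ValueError(f"sort_order must be one of {SORT_ORDERS}")
    cmd = ["read_utils.py", "sort_bam"]
    if index:
        cmd.append("--index")
    if md5:
        cmd.append("--md5")
    # Positional arguments in the order shown in the usage line.
    cmd += [in_bam, out_bam, sort_order]
    return cmd


# Hypothetical file names; the command is only printed, not executed.
cmd = build_sort_bam_cmd("reads.bam", "reads.sorted.bam", index=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the file paths are real.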

2.2.2.6.2. Named Arguments

--index

Index outBam (default: False)

Default: False

--md5

MD5 checksum outBam (default: False)

Default: False

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--picardOptions

Optional arguments to Picard’s SortSam, OPTIONNAME=value …

Default: []

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.7. downsample_bams

Downsample multiple bam files to the smallest read count in common, or to the specified count.

read_utils.py downsample_bams [-h] [--outPath OUT_PATH]
                              [--readCount SPECIFIED_READ_COUNT]
                              [--deduplicateBefore | --deduplicateAfter]
                              [--JVMmemory JVMMEMORY]
                              [--picardOptions [PICARDOPTIONS ...]]
                              [--threads THREADS]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              in_bams [in_bams ...]

2.2.2.7.1. Positional Arguments

in_bams

Input bam files.
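The rule in the description ("the smallest read count in common, or the specified count") can be sketched in Python as follows. This is an assumption-laden illustration, not the read_utils.py implementation: the function name, the dictionary of per-file counts, and the counts themselves are made up, and the sketch does not address what happens when the specified count exceeds the size of an input.

```python
def target_read_count(read_counts, specified=None):
    """Return the read count each BAM would be downsampled to.

    read_counts maps a file name to its number of reads (hypothetical input).
    """
    if not read_counts:
        raise ValueError("at least one input BAM is required")
    if specified is not None:
        # Corresponds to passing --readCount.
        return specified
    # Otherwise use the smallest read count across all inputs.
    return min(read_counts.values())


# Made-up read counts for three hypothetical input files.
counts = {"a.bam": 120_000, "b.bam": 45_000, "c.bam": 98_000}
print(target_read_count(counts))
print(target_read_count(counts, specified=10_000))
```
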

2.2.2.7.2. Named Arguments

--outPath

Output path. If not provided, downsampled bam files will be written to the same paths as each source bam file.

--readCount

The number of reads to downsample to.

--deduplicateBefore

de-duplicate reads before downsampling.

Default: False

--deduplicateAfter

de-duplicate reads after downsampling.

Default: False

--JVMmemory

JVM virtual memory size (default: ‘4g’)

Default: '4g'

--picardOptions

Optional arguments to Picard’s DownsampleSam, OPTIONNAME=value …

Default: []

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.8. merge_bams

Merge multiple BAMs into one

read_utils.py merge_bams [-h] [--JVMmemory JVMMEMORY]
                         [--picardOptions [PICARDOPTIONS ...]]
                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                         [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                         inBams [inBams ...] outBam

2.2.2.8.1. Positional Arguments

inBams

Input bam files.

outBam

Output bam file.

2.2.2.8.2. Named Arguments

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--picardOptions

Optional arguments to Picard’s MergeSamFiles, OPTIONNAME=value …

Default: []

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.9. filter_bam

Filter BAM file by read name

read_utils.py filter_bam [-h] [--exclude]
                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                         [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                         inBam readList outBam

2.2.2.9.1. Positional Arguments

inBam

Input bam file.

readList

Input file of read IDs.

outBam

Output bam file.

2.2.2.9.2. Named Arguments

--exclude

If specified, readList is a list of reads to remove from input. Default behavior is to treat readList as an inclusion list (all unnamed reads are removed).

Default: False
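The inclusion versus exclusion behavior of readList can be sketched in Python as below. This works on plain read names rather than BAM records and is only a model of the semantics described for --exclude; the function name and the sample names are hypothetical.

```python
def filter_by_name(read_names, read_list, exclude=False):
    """Model the readList semantics on a list of read names (illustrative only)."""
    listed = set(read_list)
    if exclude:
        # --exclude: drop every read named in readList.
        return [name for name in read_names if name not in listed]
    # Default: keep only reads named in readList.
    return [name for name in read_names if name in listed]


# Made-up read names.
reads = ["r1", "r2", "r3", "r4"]
print(filter_by_name(reads, ["r2", "r4"]))
print(filter_by_name(reads, ["r2", "r4"], exclude=True))
```
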

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.10. fastq_to_bam

Convert a pair of fastq paired-end read files and optional text header to a single bam file.

Uses samtools import for multi-threaded FASTQ to BAM conversion. The JVMmemory parameter is kept for backwards compatibility but ignored. The picardOptions parameter is parsed for Picard-style RG tag options.

read_utils.py fastq_to_bam [-h] (--sampleName SAMPLENAME | --header HEADER)
                           [--JVMmemory JVMMEMORY]
                           [--picardOptions [PICARDOPTIONS ...]]
                           [--threads THREADS]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           inFastq1 inFastq2 outBam

2.2.2.10.1. Positional Arguments

inFastq1

Input fastq file; 1st end of paired-end reads.

inFastq2

Input fastq file; 2nd end of paired-end reads.

outBam

Output bam file.

2.2.2.10.2. Named Arguments

--sampleName

Sample name to insert into the read group header.

--header

Optional text file containing header.

--JVMmemory

Deprecated: kept for backwards compatibility, ignored. (Was for Picard)

--picardOptions

Read group options in Picard format (OPTIONNAME=value). Supported: LIBRARY_NAME, PLATFORM, PLATFORM_UNIT, SEQUENCING_CENTER, RUN_DATE, READ_GROUP_NAME. Note that header-related options will be overwritten by HEADER if present.

Default: []
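A minimal Python sketch of parsing OPTIONNAME=value strings into read group fields follows. The translation to two-letter tags uses the @RG field names from the SAM specification and is an assumption about how the options are applied, not a description of the read_utils.py code; the function name and sample values are hypothetical.

```python
# Supported option names from the text, mapped to SAM @RG tags (assumed mapping).
PICARD_TO_RG_TAG = {
    "READ_GROUP_NAME": "ID",
    "LIBRARY_NAME": "LB",
    "PLATFORM": "PL",
    "PLATFORM_UNIT": "PU",
    "SEQUENCING_CENTER": "CN",
    "RUN_DATE": "DT",
}


def parse_rg_options(picard_options):
    """Turn ['NAME=value', ...] into a dict of @RG tags, skipping unsupported names."""
    rg = {}
    for opt in picard_options:
        name, sep, value = opt.partition("=")
        if not sep:
            raise ValueError(f"expected OPTIONNAME=value, got {opt!r}")
        tag = PICARD_TO_RG_TAG.get(name)
        if tag is not None:
            rg[tag] = value
    return rg


# Made-up option values.
opts = ["LIBRARY_NAME=lib1", "PLATFORM=illumina", "SEQUENCING_CENTER=broad"]
print(parse_rg_options(opts))
```
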

--threads

Number of threads for BAM compression (default: auto)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.11. join_paired_fastq

Join paired fastq reads into single reads with Ns

read_utils.py join_paired_fastq [-h] [--outFormat OUTFORMAT]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                output inFastqs [inFastqs ...]

2.2.2.11.1. Positional Arguments

output

Output file.

inFastqs

Input fastq file (2 if paired, 1 if interleaved)

2.2.2.11.2. Named Arguments

--outFormat

Output file format.

Default: 'fastq'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.12. split_bam

Split BAM file equally into several output BAM files.

read_utils.py split_bam [-h]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                        inBam outBams [outBams ...]

2.2.2.12.1. Positional Arguments

inBam

Input BAM file.

outBams

Output BAM files

2.2.2.12.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.13. reheader_bam

Copy a BAM file (inBam to outBam) while renaming elements of the BAM header.

The mapping file specifies which (key, old value, new value) mappings. For example:

LB lib1 lib_one
SM sample1 Sample_1
SM sample2 Sample_2
SM sample3 Sample_3
CN broad BI

read_utils.py reheader_bam [-h]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           inBam rgMap outBam

2.2.2.13.1. Positional Arguments

inBam

Input reads, BAM format.

rgMap

Tabular file containing three columns: field, old, new.

outBam

Output reads, BAM format.
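As a sketch of preparing the rgMap input, the following Python snippet writes the example mappings as a three-column file. It assumes the tabular file is tab-delimited, which the text does not state explicitly; the helper name write_rg_map is hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# (field, old value, new value) rows taken from the example mapping.
MAPPINGS = [
    ("LB", "lib1", "lib_one"),
    ("SM", "sample1", "Sample_1"),
    ("SM", "sample2", "Sample_2"),
    ("SM", "sample3", "Sample_3"),
    ("CN", "broad", "BI"),
]


def write_rg_map(rows, handle):
    """Write rows as tab-separated text (delimiter is an assumption)."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


# In-memory buffer used here; pass an open file handle to write to disk.
buf = io.StringIO()
write_rg_map(MAPPINGS, buf)
print(buf.getvalue(), end="")
```
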

2.2.2.13.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.14. reheader_bams

Copy BAM files while renaming elements of the BAM header.

The mapping file specifies which (key, old value, new value) mappings. For example:

LB lib1 lib_one
SM sample1 Sample_1
SM sample2 Sample_2
SM sample3 Sample_3
CN broad BI
FN in1.bam out1.bam
FN in2.bam out2.bam

read_utils.py reheader_bams [-h]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            rgMap

2.2.2.14.1. Positional Arguments

rgMap

Tabular file containing three columns: field, old, new.

2.2.2.14.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.15. rmdup_cdhit_bam

Remove duplicate reads from BAM file using cd-hit-dup.

read_utils.py rmdup_cdhit_bam [-h] [--JVMmemory JVM_MEMORY]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              inBam outBam

2.2.2.15.1. Positional Arguments

inBam

Input reads, BAM format.

outBam

Output reads, BAM format.

2.2.2.15.2. Named Arguments

--JVMmemory

JVM virtual memory size (default: ‘4g’)

Default: '4g'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.16. rmdup_mvicuna_bam

Remove duplicate reads from BAM file using M-Vicuna. The primary advantage to this approach over Picard’s MarkDuplicates tool is that Picard requires that input reads are aligned to a reference, and M-Vicuna can operate on unaligned reads.

read_utils.py rmdup_mvicuna_bam [-h] [--threads THREADS]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inBam outBam

2.2.2.16.1. Positional Arguments

inBam

Input reads, BAM format.

outBam

Output reads, BAM format.

2.2.2.16.2. Named Arguments

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.17. rmdup_bbnorm_bam

Remove duplicate/normalize reads from BAM file using BBNorm.

Convert BAM to interleaved FASTQ, run bbnorm, extract kept read IDs, and filter original BAM to those IDs using pysam (O(1) memory).

Args:

inBam: Input BAM file
outBam: Output BAM file
target: BBNorm target normalization depth (default: bbnorm default of 100)
k: Kmer length (default: bbnorm default of 31)
passes: Number of passes (default: bbnorm default of 2)
memory: Java memory for bbnorm (e.g., “4g”)
threads: Number of threads for bbnorm
min_input_reads: Skip processing if input has fewer reads (copy input to output)
max_output_reads: Randomly downsample keep-list if larger than this
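The min_input_reads and max_output_reads behavior described above can be sketched in Python as follows. This is a hedged model of the two rules only, not the actual code; the function names, the seed parameter, and the read IDs are hypothetical, and the real tool may sample differently.

```python
import random


def should_skip(n_input_reads, min_input_reads=None):
    """True when the input has fewer reads than min_input_reads (input is copied to output)."""
    return min_input_reads is not None and n_input_reads < min_input_reads


def select_kept_ids(kept_ids, max_output_reads=None, seed=None):
    """Randomly downsample the keep-list when it exceeds max_output_reads."""
    kept_ids = list(kept_ids)
    if max_output_reads is None or len(kept_ids) <= max_output_reads:
        return kept_ids
    rng = random.Random(seed)  # seed is an illustrative addition for reproducibility
    return rng.sample(kept_ids, max_output_reads)


# Made-up read IDs standing in for the reads kept by bbnorm.
ids = [f"read{i}" for i in range(10)]
print(should_skip(len(ids), min_input_reads=50))
print(len(select_kept_ids(ids, max_output_reads=4, seed=0)))
```
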

read_utils.py rmdup_bbnorm_bam [-h] [--target TARGET] [--kmerLength K]
                               [--passes PASSES] [--memory MEMORY]
                               [--minInputReads MIN_INPUT_READS]
                               [--maxOutputReads MAX_OUTPUT_READS]
                               [--threads THREADS]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               inBam outBam

2.2.2.17.1. Positional Arguments

inBam

Input reads, BAM format.

outBam

Output reads, BAM format.

2.2.2.17.2. Named Arguments

--target

BBNorm target normalization depth (default: bbnorm default of 100)

--kmerLength

Kmer length for bbnorm (default: bbnorm default of 31)

--passes

Number of bbnorm passes (default: bbnorm default of 2)

--memory

Java memory for bbnorm (e.g., “4g”, “8g”)

--minInputReads

Skip processing if input has fewer than this many reads

--maxOutputReads

Randomly downsample output to at most this many read IDs

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.18. rmdup_prinseq_fastq

Run prinseq-lite’s duplicate removal operation on paired-end reads. Also removes reads with more than one N.

read_utils.py rmdup_prinseq_fastq [-h] [--includeUnmated]
                                  [--unpairedOutFastq1 UNPAIREDOUTFASTQ1]
                                  [--unpairedOutFastq2 UNPAIREDOUTFASTQ2]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inFastq1 inFastq2 outFastq1 outFastq2

2.2.2.18.1. Positional Arguments

inFastq1

Input fastq file; 1st end of paired-end reads.

inFastq2

Input fastq file; 2nd end of paired-end reads.

outFastq1

Output fastq file; 1st end of paired-end reads.

outFastq2

Output fastq file; 2nd end of paired-end reads.

2.2.2.18.2. Named Arguments

--includeUnmated

Include unmated reads in the main output fastq files (default: False)

Default: False

--unpairedOutFastq1

File name of output unpaired reads from 1st end of paired-end reads (independent of --includeUnmated)

--unpairedOutFastq2

File name of output unpaired reads from 2nd end of paired-end reads (independent of --includeUnmated)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.19. filter_bam_mapped_only

Use Samtools to reduce a BAM file to only reads that are aligned (-F 4) with a non-zero mapping quality (-q 1) and are not marked as a PCR/optical duplicate (-F 1024).

read_utils.py filter_bam_mapped_only [-h]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inBam outBam

2.2.2.19.1. Positional Arguments

inBam

Input aligned reads, BAM format.

outBam

Output sorted indexed reads, filtered to aligned-only, BAM format.
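The three filter criteria named in the description (-F 4, -q 1, -F 1024) correspond to a simple predicate on each read's FLAG and MAPQ fields. The Python sketch below models that predicate for illustration; it does not call Samtools, and the function name is hypothetical.

```python
FLAG_UNMAPPED = 0x4      # excluded by -F 4
FLAG_DUPLICATE = 0x400   # excluded by -F 1024 (PCR/optical duplicate)
MIN_MAPQ = 1             # -q 1 skips alignments with MAPQ below 1


def keep_read(flag, mapq):
    """Return True if a read with this FLAG and MAPQ would pass the filter."""
    if flag & FLAG_UNMAPPED:
        return False
    if flag & FLAG_DUPLICATE:
        return False
    return mapq >= MIN_MAPQ


# Made-up (FLAG, MAPQ) pairs.
for flag, mapq in [(0, 60), (4, 0), (1024, 60), (16, 0), (99, 30)]:
    print(flag, mapq, keep_read(flag, mapq))
```
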

2.2.2.19.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.20. novoalign

Align reads with Novoalign. Sort and index BAM output.

read_utils.py novoalign [-h] [--options OPTIONS] [--min_qual MIN_QUAL]
                        [--JVMmemory JVMMEMORY]
                        [--NOVOALIGN_LICENSE_PATH NOVOALIGN_LICENSE_PATH]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                        inBam refFasta outBam

2.2.2.20.1. Positional Arguments

inBam

Input reads, BAM format.

refFasta

Reference genome, FASTA format, pre-indexed by Novoindex.

outBam

Output reads, BAM format (aligned).

2.2.2.20.2. Named Arguments

--options

Novoalign options (default: ‘-r Random’)

Default: '-r Random'

--min_qual

Filter outBam to minimum mapping quality (default: 0)

Default: 0

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--NOVOALIGN_LICENSE_PATH

A path to the novoalign.lic file. This overrides the NOVOALIGN_LICENSE_PATH environment variable. (default: None)
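The stated precedence (the command-line value overrides the NOVOALIGN_LICENSE_PATH environment variable) can be sketched in Python as below. The helper resolve_license_path and the example paths are hypothetical; this models only the override rule, not how read_utils.py locates or validates the license file.

```python
import os


def resolve_license_path(cli_value=None, environ=None):
    """Prefer the command-line value; otherwise fall back to the environment variable."""
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    return environ.get("NOVOALIGN_LICENSE_PATH")


# A plain dict stands in for os.environ so the example has no side effects.
fake_env = {"NOVOALIGN_LICENSE_PATH": "/opt/novocraft/novoalign.lic"}
print(resolve_license_path(environ=fake_env))
print(resolve_license_path("/home/user/novoalign.lic", environ=fake_env))
```
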

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.21. novoindex

read_utils.py novoindex [-h] [--NOVOALIGN_LICENSE_PATH NOVOALIGN_LICENSE_PATH]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version]
                        refFasta

2.2.2.21.1. Positional Arguments

refFasta

Reference genome, FASTA format.

2.2.2.21.2. Named Arguments

--NOVOALIGN_LICENSE_PATH

A path to the novoalign.lic file. This overrides the NOVOALIGN_LICENSE_PATH environment variable. (default: None)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.2.2.22. gatk_ug

Call genotypes using the GATK UnifiedGenotyper.

read_utils.py gatk_ug [-h] [--options OPTIONS] [--JVMmemory JVMMEMORY]
                      [--GATK_PATH GATK_PATH]
                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                      [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                      inBam refFasta outVcf

2.2.2.22.1. Positional Arguments

inBam

Input reads, BAM format.

refFasta

Reference genome, FASTA format, pre-indexed by Picard.

outVcf

Output calls in VCF format. If this filename ends with .gz, GATK will BGZIP compress the output and produce a Tabix index file as well.

2.2.2.22.2. Named Arguments

--options

UnifiedGenotyper options (default: ‘--min_base_quality_score 15 -ploidy 4’)

Default: '--min_base_quality_score 15 -ploidy 4'

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--GATK_PATH

A path containing the GATK jar file. This overrides the GATK_ENV environment variable or the GATK conda package. (default: None)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.23. gatk_realign

Local realignment of BAM files with GATK IndelRealigner.

read_utils.py gatk_realign [-h] [--JVMmemory JVMMEMORY]
                           [--GATK_PATH GATK_PATH]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           [--threads THREADS]
                           inBam refFasta outBam

2.2.2.23.1. Positional Arguments

inBam

Input reads, BAM format, aligned to refFasta.

refFasta

Reference genome, FASTA format, pre-indexed by Picard.

outBam

Realigned reads.

2.2.2.23.2. Named Arguments

--JVMmemory

JVM virtual memory size (default: ‘2g’)

Default: '2g'

--GATK_PATH

A path containing the GATK jar file. This overrides the GATK_ENV environment variable or the GATK conda package. (default: None)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--threads

Number of threads (default: all available cores)

2.2.2.24. align_and_fix

Take reads, align to reference with Novoalign, minimap2, or BWA-MEM.

Optionally mark duplicates with Picard or sambamba, optionally realign indels with GATK, and optionally filter final file to mapped/non-dupe reads.

read_utils.py align_and_fix [-h] [--outBamAll OUTBAMALL]
                            [--outBamFiltered OUTBAMFILTERED]
                            [--aligner_options ALIGNER_OPTIONS]
                            [--aligner {novoalign,minimap2,bwa}]
                            [--bwa_min_score BWA_MIN_SCORE]
                            [--novoalign_amplicons_bed NOVOALIGN_AMPLICONS_BED]
                            [--amplicon_window AMPLICON_WINDOW]
                            [--JVMmemory JVMMEMORY] [--threads THREADS]
                            [--skipMarkDupes] [--skipRealign]
                            [--dupMarker {sambamba,picard}]
                            [--GATK_PATH GATK_PATH]
                            [--NOVOALIGN_LICENSE_PATH NOVOALIGN_LICENSE_PATH]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            inBam refFasta

2.2.2.24.1. Positional Arguments

inBam

Input unaligned reads, BAM format.

refFasta

Reference genome, FASTA format; will be indexed by Picard and Novoalign.

2.2.2.24.2. Named Arguments

--outBamAll

Aligned, sorted, and indexed reads. Unmapped and duplicate reads are retained. By default, duplicate reads are marked. If “--skipMarkDupes” is specified, duplicate reads are included in the output without being marked.

--outBamFiltered

Aligned, sorted, and indexed reads. Unmapped reads are removed from this file, as well as any marked duplicate reads. Note that if “--skipMarkDupes” is provided, duplicates will not be marked and will be included in the output.

--aligner_options

aligner options (default for novoalign: “-r Random”, bwa: “-T 30”)

--aligner

Possible choices: novoalign, minimap2, bwa

aligner (default: ‘novoalign’)

Default: 'novoalign'

--bwa_min_score

BWA mem on paired reads ignores the -T parameter. Set a value here (e.g. 30) to invoke a custom post-alignment filter (default: no filtration)

--novoalign_amplicons_bed

Novoalign only: amplicon primer file (BED format) to soft clip

--amplicon_window

Novoalign only: amplicon primer window size (default: 4)

Default: 4

--JVMmemory

JVM virtual memory size (default: ‘4g’)

Default: '4g'

--threads

Number of threads (default: all available cores)

--skipMarkDupes

If specified, duplicate reads will not be marked in the resulting output file.

Default: False

--skipRealign

If specified, GATK local realignment will be skipped. Recommended for viral genomes where indel realignment provides minimal benefit.

Default: False

--dupMarker

Possible choices: sambamba, picard

Tool to use for marking duplicates. Sambamba is multi-threaded and faster. Picard is the legacy option. (default: ‘sambamba’)

Default: 'sambamba'

--GATK_PATH

A path containing the GATK jar file. This overrides the GATK_ENV environment variable or the GATK conda package. (default: None)

--NOVOALIGN_LICENSE_PATH

A path to the novoalign.lic file. This overrides the NOVOALIGN_LICENSE_PATH environment variable. (default: None)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
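
As an illustration of how the options above combine, here is a minimal sketch that assembles an align_and_fix command line from Python. The helper name build_align_and_fix_cmd and the file names are hypothetical; only flags listed in the usage above are used, and nothing is executed unless you pass the list to subprocess yourself.

```python
import shlex

def build_align_and_fix_cmd(in_bam, ref_fasta, out_bam_filtered,
                            aligner="minimap2", threads=None,
                            skip_realign=False):
    """Assemble an argv list for `read_utils.py align_and_fix` (hypothetical helper)."""
    # The three aligner choices come from the --aligner option documented above.
    if aligner not in ("novoalign", "minimap2", "bwa"):
        raise ValueError("unsupported aligner: " + aligner)
    cmd = ["read_utils.py", "align_and_fix", in_bam, ref_fasta,
           "--outBamFiltered", out_bam_filtered,
           "--aligner", aligner]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if skip_realign:
        cmd.append("--skipRealign")
    return cmd

# Placeholder file names, for illustration only.
cmd = build_align_and_fix_cmd("reads.bam", "ref.fasta", "aligned.filtered.bam",
                              aligner="bwa", threads=4, skip_realign=True)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.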

2.2.2.25. minimap2_idxstats

Align reads to reference with minimap2 and produce idxstats-like counts.

This uses PAF output format (no SAM/BAM generation) for faster alignment and streams the output directly without intermediate files.

Args:

inBam: Input reads (BAM format)
refFasta: Reference genome (FASTA format)
outStats: Output file in samtools idxstats format
outReadlist: Optional output file with read IDs that mapped (or None to skip)
threads: Number of threads for alignment (default: auto-detect)

read_utils.py minimap2_idxstats [-h] [--outReadlist OUTREADLIST]
                                [--threads THREADS]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                inBam refFasta outStats

2.2.2.25.1. Positional Arguments

inBam

Input unaligned reads, BAM format.

refFasta

Reference genome, FASTA format.

outStats

Output idxstats file (tab-separated: ref_name, ref_length, mapped_count, 0).
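
The four-column layout described above can be read back with a few lines of Python. This is a minimal sketch that assumes the file follows exactly the tab-separated layout stated here (ref_name, ref_length, mapped_count, 0); the function name parse_idxstats and the sample reference names are made up for illustration.

```python
import io

def parse_idxstats(handle):
    """Yield (ref_name, ref_length, mapped_count) from idxstats-style lines."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        ref_name, ref_length, mapped, _fourth = line.split("\t")
        yield ref_name, int(ref_length), int(mapped)

# Made-up sample content standing in for an outStats file.
example = io.StringIO("segment_L\t7000\t120\t0\nsegment_S\t3400\t45\t0\n")
rows = list(parse_idxstats(example))
total_mapped = sum(mapped for _, _, mapped in rows)
print(rows, total_mapped)
```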

2.2.2.25.2. Named Arguments

--outReadlist

Optional output file listing read IDs that mapped.

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.26. bwamem_idxstats

Take reads, align to reference with BWA-MEM and perform samtools idxstats.

Optionally filter reads after alignment, prior to reporting idxstats, to include only those flagged as properly paired.

read_utils.py bwamem_idxstats [-h] [--outBam OUTBAM] [--outStats OUTSTATS]
                              [--minScoreToFilter MIN_SCORE_TO_FILTER]
                              [--alignerOptions ALIGNER_OPTIONS]
                              [--filterReadsAfterAlignment]
                              [--doNotRequirePairsToBeProper]
                              [--keepSingletons] [--keepDuplicates]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              inBam refFasta

2.2.2.26.1. Positional Arguments

inBam

Input unaligned reads, BAM format.

refFasta

Reference genome, FASTA format, pre-indexed by Picard and Novoalign.

2.2.2.26.2. Named Arguments

--outBam

Output aligned, indexed BAM file

--outStats

Output idxstats file

--minScoreToFilter

Filter bwa alignments using this value as the minimum allowed alignment score. Specifically, sum the alignment scores across all alignments for each query (including reads in a pair, supplementary and secondary alignments) and then only include, in the output, queries whose summed alignment score is at least this value. This is only applied when the aligner is ‘bwa’. The filtering on a summed alignment score is sensible for reads in a pair and supplementary alignments, but may not be reasonable if bwa outputs secondary alignments (i.e., if ‘-a’ is in the aligner options). (default: not set - i.e., do not filter bwa’s output)
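
The summed-score rule described above can be sketched as follows. This is not the tool's implementation: it is a simplified model in which each alignment is a (query_name, alignment_score) tuple, and the function name filter_by_summed_score is hypothetical.

```python
from collections import defaultdict

def filter_by_summed_score(alignments, min_score):
    """Keep every alignment whose query has a summed score >= min_score.

    alignments: iterable of (query_name, alignment_score) tuples, one per
    alignment record (mates, supplementary, and secondary alike).
    """
    alignments = list(alignments)
    totals = defaultdict(int)
    for query_name, score in alignments:
        totals[query_name] += score
    return [(q, s) for q, s in alignments if totals[q] >= min_score]

records = [
    ("readA", 20), ("readA", 15),  # two mates, summed score 35
    ("readB", 10), ("readB", 12),  # two mates, summed score 22
    ("readC", 31),                 # single alignment, score 31
]
kept = filter_by_summed_score(records, 30)
print(kept)
```

In this example readA passes even though neither mate reaches 30 alone, which is the behavior the option describes for pairs. It also shows why secondary alignments can inflate a query's total.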

--alignerOptions

bwa options (default: bwa defaults)

--filterReadsAfterAlignment

If specified, reads will be filtered after alignment to include only those flagged as properly paired. This excludes secondary and supplementary alignments.

Default: False

--doNotRequirePairsToBeProper

Do not require reads to be properly paired when filtering (default: False)

Default: False

--keepSingletons

Keep singleton reads when filtering (default: False)

Default: False

--keepDuplicates

When filtering, do not exclude reads due to being flagged as duplicates (default: False)

Default: False
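
One plausible reading of how the four filtering options interact, expressed with standard SAM flag bits, is sketched below. The flag constants are from the SAM specification; the combination logic, the treatment of a "singleton" as a read whose mate is unmapped, and the function name keep_read are assumptions, not the tool's confirmed behavior.

```python
# SAM flag bits (from the SAM specification).
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

def keep_read(flag, require_proper=True, keep_singletons=False,
              keep_duplicates=False):
    """Decide whether a read survives filtering (assumed semantics)."""
    # Unmapped, secondary, and supplementary records are always dropped.
    if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
        return False
    if (flag & DUPLICATE) and not keep_duplicates:
        return False
    # Assumption: a singleton is a mapped read whose mate is unmapped.
    if flag & MATE_UNMAPPED:
        return keep_singletons
    if require_proper and not (flag & PROPER_PAIR):
        return False
    return True

print(keep_read(67))                        # paired, proper pair, first in pair
print(keep_read(65))                        # paired but not a proper pair
print(keep_read(65, require_proper=False))  # as with --doNotRequirePairsToBeProper
```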

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.27. extract_tarball

Extract an input .tar, .tgz, .tar.gz, .tar.bz2, .tar.lz4, or .zip file to a given directory (or we will choose one on our own). Emit the resulting directory path to stdout.

read_utils.py extract_tarball [-h] [--compression {gz,bz2,lz4,zip,none,auto}]
                              [--pipe_hint PIPE_HINT] [--threads THREADS]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              tarfile out_dir

2.2.2.27.1. Positional Arguments

tarfile

Input tar file. May be “-” for stdin.

out_dir

Output directory

2.2.2.27.2. Named Arguments

--compression

Possible choices: gz, bz2, lz4, zip, none, auto

Compression type (default: ‘auto’). Auto-detect is incompatible with stdin input unless pipe_hint is specified.

Default: 'auto'

--pipe_hint

If tarfile is stdin, you can provide a file-like URI string for pipe_hint which ends with a common compression file extension if you want to use compression=auto.
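
The interaction between --compression, --pipe_hint, and stdin input can be sketched as below. The extension-to-type mapping is inferred from the file types and choices listed in this section, and the function name guess_compression is hypothetical; the tool's real detection logic may differ.

```python
def guess_compression(tarfile, compression="auto", pipe_hint=None):
    """Resolve the compression type the way the options above suggest."""
    if compression != "auto":
        return compression  # an explicit choice always wins
    # With stdin ("-") there is no file name, so fall back to pipe_hint.
    name = pipe_hint if tarfile == "-" else tarfile
    if name is None:
        raise ValueError("compression=auto with stdin input needs pipe_hint")
    lowered = name.lower()
    # Longer suffixes are listed before the bare ".tar" so they match first.
    suffix_map = [
        (".tar.gz", "gz"), (".tgz", "gz"), (".tar.bz2", "bz2"),
        (".tar.lz4", "lz4"), (".zip", "zip"), (".tar", "none"),
    ]
    for suffix, kind in suffix_map:
        if lowered.endswith(suffix):
            return kind
    raise ValueError("cannot infer compression from: " + name)

print(guess_compression("run.tar.gz"))
print(guess_compression("-", pipe_hint="s3://bucket/run.tar.lz4"))
```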

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.28. read_names

Extract read names from a sequence file

read_utils.py read_names [-h] [--threads THREADS]
                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                         [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                         in_reads out_read_names

2.2.2.28.1. Positional Arguments

in_reads

the input reads ([compressed] fasta or bam)

out_read_names

the read names

2.2.2.28.2. Named Arguments

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.2.2.29. trim_rmdup_subsamp

Take reads through Trimmomatic, Prinseq, and subsampling.

This function performs three main operations:

1. Trimmomatic: adapter trimming and quality filtering
2. Prinseq: duplicate read removal
3. Subsampling: reduce to target read count

Args:

inBam: Input unaligned BAM file
clipDb: Trimmomatic adapter clip database (FASTA)
outBam: Output unaligned BAM file
n_reads: Target number of individual reads (default: 100000)
trim_opts: Optional dict of Trimmomatic options
threads: Number of threads for Trimmomatic (default: all available)

Returns:

Tuple of (n_input, n_trim, n_rmdup, n_output, n_paired_subsamp, n_unpaired_subsamp) where all counts are individual reads (not pairs).

read_utils.py trim_rmdup_subsamp [-h] [--n_reads N_READS] [--threads THREADS]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmp_dir TMP_DIR]
                                 [--tmp_dirKeep]
                                 inBam clipDb outBam

2.2.2.29.1. Positional Arguments

inBam

Input reads, unaligned BAM format.

clipDb

Trimmomatic clip DB (FASTA with adapter sequences).

outBam

Output reads, unaligned BAM format (currently, read groups and other header information are destroyed in this process).

2.2.2.29.2. Named Arguments

--n_reads

Subsample to no more than this many individual reads. Note that paired reads are given priority, and unpaired reads are included to reach the count if there are too few paired reads. (default: 100000)

Default: 100000
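
The paired-first priority described for --n_reads can be illustrated with a small calculation. This is a sketch of the stated rule only: the function name plan_subsample is hypothetical, and rounding the paired count down to whole pairs is an assumption rather than documented behavior.

```python
def plan_subsample(n_paired_reads, n_unpaired_reads, n_reads=100000):
    """Return (paired_to_take, unpaired_to_take), counted as individual reads."""
    # Paired reads get priority, up to the target.
    paired_take = min(n_paired_reads, n_reads)
    # Assumption: only whole pairs are taken, so keep the count even.
    paired_take -= paired_take % 2
    # Unpaired reads fill whatever room is left.
    unpaired_take = min(n_unpaired_reads, n_reads - paired_take)
    return paired_take, unpaired_take

print(plan_subsample(150000, 5000))  # enough paired reads to fill the target
print(plan_subsample(60000, 50000))  # unpaired reads top up the remainder
```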

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False