2.2. read_utils.py - utilities that manipulate bam and fastq files
Utilities for working with sequence reads, such as converting formats and fixing mate pairs.
usage: read_utils.py subcommand
2.2.2. Sub-commands
2.2.2.1. index_fasta_samtools
Index a reference genome for Samtools.
read_utils.py index_fasta_samtools [-h]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
inFasta
2.2.2.1.1. Positional Arguments
- inFasta
Reference genome, FASTA format.
2.2.2.1.2. Named Arguments
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
2.2.2.2. index_fasta_picard
Create an index file for a reference genome suitable for Picard/GATK.
read_utils.py index_fasta_picard [-h] [--JVMmemory JVMMEMORY]
[--picardOptions [PICARDOPTIONS ...]]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inFasta
2.2.2.2.1. Positional Arguments
- inFasta
Input reference genome, FASTA format.
2.2.2.2.2. Named Arguments
- --JVMmemory
JVM virtual memory size (default: ‘512m’)
Default:
'512m'
- --picardOptions
Optional arguments to Picard’s CreateSequenceDictionary, OPTIONNAME=value …
Default:
[]
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.3. mkdup_picard
Mark or remove duplicate reads from BAM file.
read_utils.py mkdup_picard [-h] [--outMetrics OUTMETRICS] [--remove]
[--JVMmemory JVMMEMORY]
[--picardOptions [PICARDOPTIONS ...]]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBams [inBams ...] outBam
2.2.2.3.1. Positional Arguments
- inBams
Input reads, BAM format.
- outBam
Output reads, BAM format.
2.2.2.3.2. Named Arguments
- --outMetrics
Output metrics file. Default is to dump to a temp file.
- --remove
Instead of marking duplicates, remove them entirely (default: False)
Default:
False
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --picardOptions
Optional arguments to Picard’s MarkDuplicates, OPTIONNAME=value …
Default:
[]
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
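As a rough illustration of how the options above combine, the sketch below assembles an argument list for mkdup_picard without running it. The helper name build_mkdup_command and the file names are hypothetical; only the flags and the positional order are taken from the usage shown in this section.

```python
from typing import List, Optional, Sequence


def build_mkdup_command(
    in_bams: Sequence[str],
    out_bam: str,
    remove: bool = False,
    out_metrics: Optional[str] = None,
    jvm_memory: str = "2g",
) -> List[str]:
    """Build (but do not execute) a read_utils.py mkdup_picard argument list."""
    if not in_bams:
        raise ValueError("at least one input BAM is required")
    cmd = ["read_utils.py", "mkdup_picard"]
    if out_metrics is not None:
        cmd += ["--outMetrics", out_metrics]
    if remove:
        # Remove duplicates instead of only marking them.
        cmd.append("--remove")
    cmd += ["--JVMmemory", jvm_memory]
    # Positional arguments: one or more inputs, then a single output.
    cmd += list(in_bams) + [out_bam]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(build_mkdup_command(["a.bam", "b.bam"], "dedup.bam", remove=True)))
```

The resulting list can be handed to subprocess.run once the tool is installed and on the PATH.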
2.2.2.4. revert_bam_picard
Revert BAM to raw reads
read_utils.py revert_bam_picard [-h] [--JVMmemory JVMMEMORY]
[--picardOptions [PICARDOPTIONS ...]]
[--clearTags]
[--tagsToClear TAGS_TO_CLEAR [TAGS_TO_CLEAR ...]]
[--doNotSanitize]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam outBam
2.2.2.4.1. Positional Arguments
- inBam
Input reads, BAM format.
- outBam
Output reads, BAM format.
2.2.2.4.2. Named Arguments
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --picardOptions
Optional arguments to Picard’s RevertSam, OPTIONNAME=value …
Default:
[]
- --clearTags
When supplying an aligned input file, clear the per-read attribute tags
Default:
False
- --tagsToClear
A space-separated list of tags to remove from all reads in the input bam file (default: [‘XT’, ‘X0’, ‘X1’, ‘XA’, ‘AM’, ‘SM’, ‘BQ’, ‘CT’, ‘XN’, ‘OC’, ‘OP’])
Default:
['XT', 'X0', 'X1', 'XA', 'AM', 'SM', 'BQ', 'CT', 'XN', 'OC', 'OP']
- --doNotSanitize
When a BAM file is reverted, Picard’s SANITIZE=true is set unless --doNotSanitize is given. Sanitization is a destructive operation that removes reads so that the BAM file is consistent. From the Picard documentation:
‘Reads discarded include (but are not limited to) paired reads with missing mates, duplicated records, records with mismatches in length of bases and qualities.’
For more information see:
https://broadinstitute.github.io/picard/command-line-overview.html#RevertSam
Default:
False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
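To make the effect of --clearTags and --tagsToClear concrete, the following sketch strips optional tags from a single text-format SAM record. It only illustrates the outcome on one line and is not how the tool itself is implemented; the function name clear_tags and the sample record are hypothetical.

```python
DEFAULT_TAGS_TO_CLEAR = ["XT", "X0", "X1", "XA", "AM", "SM", "BQ", "CT", "XN", "OC", "OP"]


def clear_tags(sam_line, tags_to_clear=DEFAULT_TAGS_TO_CLEAR):
    """Return a SAM record with the listed optional tags removed."""
    fields = sam_line.rstrip("\n").split("\t")
    # A SAM record has 11 mandatory fields; everything after is TAG:TYPE:VALUE.
    mandatory, optional = fields[:11], fields[11:]
    drop = set(tags_to_clear)
    kept = [field for field in optional if field.split(":", 1)[0] not in drop]
    return "\t".join(mandatory + kept)


if __name__ == "__main__":
    record = "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tXT:A:U\tRG:Z:grp"
    print(clear_tags(record))
```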
2.2.2.5. picard
Generic Picard runner.
read_utils.py picard [-h] [--JVMmemory JVMMEMORY]
[--picardOptions [PICARDOPTIONS ...]]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
command
2.2.2.5.1. Positional Arguments
- command
picard command
2.2.2.5.2. Named Arguments
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --picardOptions
Optional arguments to Picard, OPTIONNAME=value …
Default:
[]
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.6. sort_bam
Sort BAM file
read_utils.py sort_bam [-h] [--index] [--md5] [--JVMmemory JVMMEMORY]
[--picardOptions [PICARDOPTIONS ...]]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam outBam {unsorted,queryname,coordinate}
2.2.2.6.1. Positional Arguments
- inBam
Input bam file.
- outBam
Output bam file, sorted.
- sortOrder
Possible choices: unsorted, queryname, coordinate
How to sort the reads. [default: ‘coordinate’]
Default:
'coordinate'
2.2.2.6.2. Named Arguments
- --index
Index outBam (default: False)
Default:
False
- --md5
MD5 checksum outBam (default: False)
Default:
False
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --picardOptions
Optional arguments to Picard’s SortSam, OPTIONNAME=value …
Default:
[]
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.7. downsample_bams
Downsample multiple bam files to the smallest read count in common, or to the specified count.
read_utils.py downsample_bams [-h] [--outPath OUT_PATH]
[--readCount SPECIFIED_READ_COUNT]
[--deduplicateBefore | --deduplicateAfter]
[--JVMmemory JVMMEMORY]
[--picardOptions [PICARDOPTIONS ...]]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
in_bams [in_bams ...]
2.2.2.7.1. Positional Arguments
- in_bams
Input bam files.
2.2.2.7.2. Named Arguments
- --outPath
Output path. If not provided, downsampled bam files will be written to the same paths as each source bam file.
- --readCount
The number of reads to downsample to.
- --deduplicateBefore
de-duplicate reads before downsampling.
Default:
False
- --deduplicateAfter
de-duplicate reads after downsampling.
Default:
False
- --JVMmemory
JVM virtual memory size (default: ‘4g’)
Default:
'4g'
- --picardOptions
Optional arguments to Picard’s DownsampleSam, OPTIONNAME=value …
Default:
[]
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
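The choice of target depth described for downsample_bams can be sketched as below: with no --readCount, the smallest read count among the inputs is used, otherwise the specified count. The function name choose_target and the counts are hypothetical, and how the real tool treats inputs that are already smaller than a specified count is not covered here.

```python
from typing import Dict, Optional


def choose_target(read_counts: Dict[str, int], specified: Optional[int] = None) -> int:
    """Pick the read count that every input BAM would be downsampled to."""
    if not read_counts:
        raise ValueError("no input BAM files given")
    if specified is not None:
        if specified <= 0:
            raise ValueError("the specified read count must be positive")
        return specified
    # No explicit count: fall back to the smallest input.
    return min(read_counts.values())


if __name__ == "__main__":
    counts = {"a.bam": 500, "b.bam": 120, "c.bam": 300}  # hypothetical counts
    print(choose_target(counts))
    print(choose_target(counts, specified=100))
```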
2.2.2.8. merge_bams
Merge multiple BAMs into one
read_utils.py merge_bams [-h] [--JVMmemory JVMMEMORY]
[--picardOptions [PICARDOPTIONS ...]]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBams [inBams ...] outBam
2.2.2.8.1. Positional Arguments
- inBams
Input bam files.
- outBam
Output bam file.
2.2.2.8.2. Named Arguments
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --picardOptions
Optional arguments to Picard’s MergeSamFiles, OPTIONNAME=value …
Default:
[]
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.9. filter_bam
Filter BAM file by read name
read_utils.py filter_bam [-h] [--exclude]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam readList outBam
2.2.2.9.1. Positional Arguments
- inBam
Input bam file.
- readList
Input file of read IDs.
- outBam
Output bam file.
2.2.2.9.2. Named Arguments
- --exclude
If specified, readList is a list of reads to remove from input. Default behavior is to treat readList as an inclusion list (all unnamed reads are removed).
Default:
False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
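The inclusion and exclusion semantics of readList can be illustrated with a small in-memory sketch. Records are modelled here as (name, payload) tuples rather than real BAM records, and the function name filter_reads is hypothetical.

```python
def filter_reads(records, read_list, exclude=False):
    """Yield records whose names are in read_list, or not in it when exclude is set."""
    names = set(read_list)
    for name, payload in records:
        # Keep listed reads by default; with exclude=True keep the unlisted ones.
        if (name in names) != exclude:
            yield name, payload


if __name__ == "__main__":
    records = [("r1", "ACGT"), ("r2", "CCGT"), ("r3", "GGGT")]
    print(list(filter_reads(records, ["r1", "r3"])))
    print(list(filter_reads(records, ["r1", "r3"], exclude=True)))
```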
2.2.2.10. fastq_to_bam
Convert a pair of fastq paired-end read files and optional text header to a single bam file.
Uses samtools import for multi-threaded FASTQ to BAM conversion. The JVMmemory parameter is kept for backwards compatibility but ignored. The picardOptions parameter is parsed for Picard-style RG tag options.
read_utils.py fastq_to_bam [-h] (--sampleName SAMPLENAME | --header HEADER)
[--JVMmemory JVMMEMORY]
[--picardOptions [PICARDOPTIONS ...]]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inFastq1 inFastq2 outBam
2.2.2.10.1. Positional Arguments
- inFastq1
Input fastq file; 1st end of paired-end reads.
- inFastq2
Input fastq file; 2nd end of paired-end reads.
- outBam
Output bam file.
2.2.2.10.2. Named Arguments
- --sampleName
Sample name to insert into the read group header.
- --header
Optional text file containing header.
- --JVMmemory
Deprecated: kept for backwards compatibility, ignored. (Was for Picard)
- --picardOptions
Read group options in Picard format (OPTIONNAME=value). Supported: LIBRARY_NAME, PLATFORM, PLATFORM_UNIT, SEQUENCING_CENTER, RUN_DATE, READ_GROUP_NAME. Note that header-related options will be overwritten by HEADER if present.
Default:
[]
- --threads
Number of threads for BAM compression (default: auto)
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
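One way to picture the parsing of Picard-style read group options is the sketch below, which turns OPTIONNAME=value strings into two-letter read group tags. The mapping to tag names follows the @RG tags of the SAM specification and is an assumption about how the options are interpreted; skipping unsupported options is likewise an assumption, and the function name picard_options_to_rg is hypothetical.

```python
from typing import Dict, Iterable

# Assumed mapping from the supported Picard option names to SAM @RG tags.
PICARD_TO_RG_TAG = {
    "READ_GROUP_NAME": "ID",
    "LIBRARY_NAME": "LB",
    "PLATFORM": "PL",
    "PLATFORM_UNIT": "PU",
    "SEQUENCING_CENTER": "CN",
    "RUN_DATE": "DT",
}


def picard_options_to_rg(options: Iterable[str], sample_name: str) -> Dict[str, str]:
    """Translate OPTIONNAME=value strings into a dict of read group tags."""
    tags = {"SM": sample_name}
    for option in options:
        name, sep, value = option.partition("=")
        if not sep:
            raise ValueError(f"expected OPTIONNAME=value, got {option!r}")
        tag = PICARD_TO_RG_TAG.get(name)
        if tag is not None:  # assumption: unsupported options are skipped
            tags[tag] = value
    return tags


if __name__ == "__main__":
    print(picard_options_to_rg(["LIBRARY_NAME=lib1", "PLATFORM=illumina"], "sample1"))
```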
2.2.2.11. join_paired_fastq
Join paired fastq reads into single reads with Ns
read_utils.py join_paired_fastq [-h] [--outFormat OUTFORMAT]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
output inFastqs [inFastqs ...]
2.2.2.11.1. Positional Arguments
- output
Output file.
- inFastqs
Input fastq file (2 if paired, 1 if interleaved)
2.2.2.11.2. Named Arguments
- --outFormat
Output file format.
Default:
'fastq'
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.12. split_bam
Split BAM file equally into several output BAM files.
read_utils.py split_bam [-h]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam outBams [outBams ...]
2.2.2.12.1. Positional Arguments
- inBam
Input BAM file.
- outBams
Output BAM files
2.2.2.12.2. Named Arguments
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.13. reheader_bam
Copy a BAM file (inBam to outBam) while renaming elements of the BAM header.
The mapping file specifies the (key, old value, new value) mappings to apply. For example:
LB lib1 lib_one
SM sample1 Sample_1
SM sample2 Sample_2
SM sample3 Sample_3
CN broad BI
read_utils.py reheader_bam [-h]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam rgMap outBam
2.2.2.13.1. Positional Arguments
- inBam
Input reads, BAM format.
- rgMap
Tabular file containing three columns: field, old, new.
- outBam
Output reads, BAM format.
2.2.2.13.2. Named Arguments
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
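The (field, old, new) mapping can be pictured with a short sketch that parses a three-column mapping and applies it to a single @RG header line. It assumes the columns are tab-separated and operates on header text rather than on a BAM file; the function names parse_rg_map and rename_rg_line are hypothetical.

```python
def parse_rg_map(text):
    """Parse 'field<TAB>old<TAB>new' rows into a {(field, old): new} dict."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        field, old, new = line.split("\t")
        mapping[(field, old)] = new
    return mapping


def rename_rg_line(header_line, mapping):
    """Apply the mapping to one @RG header line; other lines pass through."""
    if not header_line.startswith("@RG"):
        return header_line
    parts = header_line.rstrip("\n").split("\t")
    renamed = [parts[0]]
    for part in parts[1:]:
        key, value = part.split(":", 1)
        # Replace the value only when (key, old value) appears in the mapping.
        renamed.append(f"{key}:{mapping.get((key, value), value)}")
    return "\t".join(renamed)


if __name__ == "__main__":
    rg_map = parse_rg_map("LB\tlib1\tlib_one\nSM\tsample1\tSample_1\nCN\tbroad\tBI\n")
    print(rename_rg_line("@RG\tID:x\tLB:lib1\tSM:sample1\tCN:broad", rg_map))
```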
2.2.2.14. reheader_bams
Copy BAM files while renaming elements of the BAM header.
The mapping file specifies the (key, old value, new value) mappings to apply. For example:
LB lib1 lib_one
SM sample1 Sample_1
SM sample2 Sample_2
SM sample3 Sample_3
CN broad BI
FN in1.bam out1.bam
FN in2.bam out2.bam
read_utils.py reheader_bams [-h]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
rgMap
2.2.2.14.1. Positional Arguments
- rgMap
Tabular file containing three columns: field, old, new.
2.2.2.14.2. Named Arguments
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.15. rmdup_cdhit_bam
Remove duplicate reads from BAM file using cd-hit-dup.
read_utils.py rmdup_cdhit_bam [-h] [--JVMmemory JVM_MEMORY]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam outBam
2.2.2.15.1. Positional Arguments
- inBam
Input reads, BAM format.
- outBam
Output reads, BAM format.
2.2.2.15.2. Named Arguments
- --JVMmemory
JVM virtual memory size (default: ‘4g’)
Default:
'4g'
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.16. rmdup_mvicuna_bam
Remove duplicate reads from BAM file using M-Vicuna. The primary advantage of this approach over Picard’s MarkDuplicates tool is that Picard requires input reads to be aligned to a reference, whereas M-Vicuna can operate on unaligned reads.
read_utils.py rmdup_mvicuna_bam [-h] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam outBam
2.2.2.16.1. Positional Arguments
- inBam
Input reads, BAM format.
- outBam
Output reads, BAM format.
2.2.2.16.2. Named Arguments
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.17. rmdup_bbnorm_bam
Remove duplicate/normalize reads from BAM file using BBNorm.
Convert BAM to interleaved FASTQ, run bbnorm, extract kept read IDs, and filter original BAM to those IDs using pysam (O(1) memory).
Args:
- inBam: Input BAM file
- outBam: Output BAM file
- target: BBNorm target normalization depth (default: bbnorm default of 100)
- k: Kmer length (default: bbnorm default of 31)
- passes: Number of passes (default: bbnorm default of 2)
- memory: Java memory for bbnorm (e.g., “4g”)
- threads: Number of threads for bbnorm
- min_input_reads: Skip processing if input has fewer reads (copy input to output)
- max_output_reads: Randomly downsample keep-list if larger than this
read_utils.py rmdup_bbnorm_bam [-h] [--target TARGET] [--kmerLength K]
[--passes PASSES] [--memory MEMORY]
[--minInputReads MIN_INPUT_READS]
[--maxOutputReads MAX_OUTPUT_READS]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam outBam
2.2.2.17.1. Positional Arguments
- inBam
Input reads, BAM format.
- outBam
Output reads, BAM format.
2.2.2.17.2. Named Arguments
- --target
BBNorm target normalization depth (default: bbnorm default of 100)
- --kmerLength
Kmer length for bbnorm (default: bbnorm default of 31)
- --passes
Number of bbnorm passes (default: bbnorm default of 2)
- --memory
Java memory for bbnorm (e.g., “4g”, “8g”)
- --minInputReads
Skip processing if input has fewer than this many reads
- --maxOutputReads
Randomly downsample output to at most this many read IDs
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
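The two read-count guards, --minInputReads and --maxOutputReads, can be sketched as follows. This is a minimal illustration of the documented behavior on plain Python lists; the function names should_skip and cap_keep_list are hypothetical, and the actual sampling method used by the tool may differ.

```python
import random


def should_skip(n_input_reads, min_input_reads=None):
    """True when the input is too small and would simply be copied to the output."""
    return min_input_reads is not None and n_input_reads < min_input_reads


def cap_keep_list(kept_ids, max_output_reads=None, rng=None):
    """Randomly downsample the kept read IDs when there are more than the cap."""
    ids = list(kept_ids)
    if max_output_reads is None or len(ids) <= max_output_reads:
        return ids
    rng = rng or random.Random()
    # Sample without replacement so no read ID is chosen twice.
    return rng.sample(ids, max_output_reads)


if __name__ == "__main__":
    print(should_skip(50, min_input_reads=100))
    print(len(cap_keep_list([f"read{i}" for i in range(1000)], max_output_reads=10)))
```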
2.2.2.18. rmdup_prinseq_fastq
Run prinseq-lite’s duplicate removal operation on paired-end reads. Also removes reads with more than one N.
read_utils.py rmdup_prinseq_fastq [-h] [--includeUnmated]
[--unpairedOutFastq1 UNPAIREDOUTFASTQ1]
[--unpairedOutFastq2 UNPAIREDOUTFASTQ2]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inFastq1 inFastq2 outFastq1 outFastq2
2.2.2.18.1. Positional Arguments
- inFastq1
Input fastq file; 1st end of paired-end reads.
- inFastq2
Input fastq file; 2nd end of paired-end reads.
- outFastq1
Output fastq file; 1st end of paired-end reads.
- outFastq2
Output fastq file; 2nd end of paired-end reads.
2.2.2.18.2. Named Arguments
- --includeUnmated
Include unmated reads in the main output fastq files (default: False)
Default:
False
- --unpairedOutFastq1
File name of output unpaired reads from 1st end of paired-end reads (independent of --includeUnmated)
- --unpairedOutFastq2
File name of output unpaired reads from 2nd end of paired-end reads (independent of --includeUnmated)
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.19. filter_bam_mapped_only
Use Samtools to reduce a BAM file to only reads that are aligned (-F 4), have a non-zero mapping quality (-q 1), and are not marked as a PCR/optical duplicate (-F 1024).
read_utils.py filter_bam_mapped_only [-h]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam outBam
2.2.2.19.1. Positional Arguments
- inBam
Input aligned reads, BAM format.
- outBam
Output sorted indexed reads, filtered to aligned-only, BAM format.
2.2.2.19.2. Named Arguments
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
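The three filters named in the description (-F 4, -q 1, -F 1024) amount to a simple test on each record's FLAG and MAPQ fields. The sketch below expresses that test in plain Python for illustration; the function name keep_read is hypothetical and no BAM file is read.

```python
UNMAPPED = 0x4      # SAM flag bit 4: read is unmapped
DUPLICATE = 0x400   # SAM flag bit 1024: PCR or optical duplicate


def keep_read(flag, mapq):
    """Mirror -F 4 -F 1024 -q 1: mapped, not a duplicate, MAPQ of at least 1."""
    if flag & (UNMAPPED | DUPLICATE):
        return False
    return mapq >= 1


if __name__ == "__main__":
    for flag, mapq in [(0, 60), (4, 0), (1024, 60), (16, 0), (99, 1)]:
        print(flag, mapq, keep_read(flag, mapq))
```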
2.2.2.20. novoalign
Align reads with Novoalign. Sort and index BAM output.
read_utils.py novoalign [-h] [--options OPTIONS] [--min_qual MIN_QUAL]
[--JVMmemory JVMMEMORY]
[--NOVOALIGN_LICENSE_PATH NOVOALIGN_LICENSE_PATH]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam refFasta outBam
2.2.2.20.1. Positional Arguments
- inBam
Input reads, BAM format.
- refFasta
Reference genome, FASTA format, pre-indexed by Novoindex.
- outBam
Output reads, BAM format (aligned).
2.2.2.20.2. Named Arguments
- --options
Novoalign options (default: ‘-r Random’)
Default:
'-r Random'
- --min_qual
Filter outBam to minimum mapping quality (default: 0)
Default:
0
- --JVMmemory
JVM virtual memory size (default: ‘2g’)
Default:
'2g'
- --NOVOALIGN_LICENSE_PATH
A path to the novoalign.lic file. This overrides the NOVOALIGN_LICENSE_PATH environment variable. (default: None)
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.21. novoindex
read_utils.py novoindex [-h] [--NOVOALIGN_LICENSE_PATH NOVOALIGN_LICENSE_PATH]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
refFasta
2.2.2.21.1. Positional Arguments
- refFasta
Reference genome, FASTA format.
2.2.2.21.2. Named Arguments
- --NOVOALIGN_LICENSE_PATH
A path to the novoalign.lic file. This overrides the NOVOALIGN_LICENSE_PATH environment variable. (default: None)
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
2.2.2.22. align_and_fix
Take reads, align to reference with Novoalign, minimap2, or BWA-MEM. Optionally mark duplicates with Picard or sambamba, and optionally filter final file to mapped/non-dupe reads.
read_utils.py align_and_fix [-h] [--outBamAll OUTBAMALL]
[--outBamFiltered OUTBAMFILTERED]
[--aligner_options ALIGNER_OPTIONS]
[--aligner {novoalign,minimap2,bwa}]
[--bwa_min_score BWA_MIN_SCORE]
[--novoalign_amplicons_bed NOVOALIGN_AMPLICONS_BED]
[--amplicon_window AMPLICON_WINDOW]
[--JVMmemory JVMMEMORY] [--threads THREADS]
[--skipMarkDupes] [--dupMarker {sambamba,picard}]
[--NOVOALIGN_LICENSE_PATH NOVOALIGN_LICENSE_PATH]
[--skipRealign]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam refFasta
2.2.2.22.1. Positional Arguments
- inBam
Input unaligned reads, BAM format.
- refFasta
Reference genome, FASTA format; will be indexed by Picard and Novoalign.
2.2.2.22.2. Named Arguments
- --outBamAll
Aligned, sorted, and indexed reads. Unmapped and duplicate reads are retained. By default, duplicate reads are marked. If “--skipMarkDupes” is specified, duplicate reads are included in the output without being marked.
- --outBamFiltered
Aligned, sorted, and indexed reads. Unmapped reads are removed from this file, as well as any marked duplicate reads. Note that if “--skipMarkDupes” is provided, duplicates will not be marked and will be included in the output.
- --aligner_options
Aligner options (default for novoalign: “-r Random”, bwa: “-T 30”)
- --aligner
Possible choices: novoalign, minimap2, bwa
aligner (default: ‘novoalign’)
Default:
'novoalign'
- --bwa_min_score
BWA mem on paired reads ignores the -T parameter. Set a value here (e.g. 30) to invoke a custom post-alignment filter (default: no filtration)
- --novoalign_amplicons_bed
Novoalign only: amplicon primer file (BED format) to soft clip
- --amplicon_window
Novoalign only: amplicon primer window size (default: 4)
Default:
4
- --JVMmemory
JVM virtual memory size (default: ‘4g’)
Default:
'4g'
- --threads
Number of threads (default: all available cores)
- --skipMarkDupes
If specified, duplicate reads will not be marked in the resulting output file.
Default:
False
- --dupMarker
Possible choices: sambamba, picard
Tool to use for marking duplicates. Sambamba is multi-threaded and faster. Picard is the legacy option. (default: ‘sambamba’)
Default:
'sambamba'
- --NOVOALIGN_LICENSE_PATH
A path to the novoalign.lic file. This overrides the NOVOALIGN_LICENSE_PATH environment variable. (default: None)
- --skipRealign
Deprecated no-op. GATK realignment has been removed.
Default:
False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.23. minimap2_idxstats
Align reads to reference with minimap2 and produce idxstats-like counts.
This uses PAF output format (no SAM/BAM generation) for faster alignment and streams the output directly without intermediate files.
Args:
- inBam: Input reads (BAM format)
- refFasta: Reference genome (FASTA format)
- outStats: Output file in samtools idxstats format
- outReadlist: Optional output file with read IDs that mapped (or None to skip)
- threads: Number of threads for alignment (default: auto-detect)
read_utils.py minimap2_idxstats [-h] [--outReadlist OUTREADLIST]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam refFasta outStats
2.2.2.23.1. Positional Arguments
- inBam
Input unaligned reads, BAM format.
- refFasta
Reference genome, FASTA format.
- outStats
Output idxstats file (tab-separated: ref_name, ref_length, mapped_count, 0).
2.2.2.23.2. Named Arguments
- --outReadlist
Optional output file listing read IDs that mapped.
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
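The conversion from PAF alignments to idxstats-like rows can be sketched as below. The column positions (query name in column 1, target name in column 6) follow the PAF format; counting each read at most once per reference is an assumption made for this illustration, and the function name paf_to_idxstats and the sample data are hypothetical.

```python
def paf_to_idxstats(paf_lines, ref_lengths):
    """Return rows of 'ref_name<TAB>ref_length<TAB>mapped_count<TAB>0'."""
    mapped = {name: set() for name in ref_lengths}
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # not a complete PAF record
        query_name, target_name = cols[0], cols[5]
        if target_name in mapped:
            # Assumption: a read counts once per reference it maps to.
            mapped[target_name].add(query_name)
    return [
        "\t".join([name, str(length), str(len(mapped[name])), "0"])
        for name, length in ref_lengths.items()
    ]


if __name__ == "__main__":
    paf = [
        "r1\t100\t0\t100\t+\tchrA\t1000\t10\t110\t100\t100\t60",
        "r2\t100\t0\t100\t-\tchrA\t1000\t300\t400\t98\t100\t60",
    ]
    print("\n".join(paf_to_idxstats(paf, {"chrA": 1000, "chrB": 500})))
```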
2.2.2.24. bwamem_idxstats
Take reads, align to reference with BWA-MEM and perform samtools idxstats. Optionally filter reads after alignment, prior to reporting idxstats, to include only those flagged as properly paired.
read_utils.py bwamem_idxstats [-h] [--outBam OUTBAM] [--outStats OUTSTATS]
[--minScoreToFilter MIN_SCORE_TO_FILTER]
[--alignerOptions ALIGNER_OPTIONS]
[--filterReadsAfterAlignment]
[--doNotRequirePairsToBeProper]
[--keepSingletons] [--keepDuplicates]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam refFasta
2.2.2.24.1. Positional Arguments
- inBam
Input unaligned reads, BAM format.
- refFasta
Reference genome, FASTA format, pre-indexed by Picard and Novoalign.
2.2.2.24.2. Named Arguments
- --outBam
Output aligned, indexed BAM file
- --outStats
Output idxstats file
- --minScoreToFilter
Filter bwa alignments using this value as the minimum allowed alignment score. Specifically, sum the alignment scores across all alignments for each query (including reads in a pair, supplementary and secondary alignments) and then only include, in the output, queries whose summed alignment score is at least this value. This is only applied when the aligner is ‘bwa’. The filtering on a summed alignment score is sensible for reads in a pair and supplementary alignments, but may not be reasonable if bwa outputs secondary alignments (i.e., if ‘-a’ is in the aligner options). (default: not set - i.e., do not filter bwa’s output)
- --alignerOptions
bwa options (default: bwa defaults)
- --filterReadsAfterAlignment
If specified, reads will be filtered after alignment to include only those flagged as properly paired. This excludes secondary and supplementary alignments.
Default:
False
- --doNotRequirePairsToBeProper
Do not require reads to be properly paired when filtering (default: False)
Default:
False
- --keepSingletons
Keep singleton reads when filtering (default: False)
Default:
False
- --keepDuplicates
When filtering, do not exclude reads due to being flagged as duplicates (default: False)
Default:
False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
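The summed-score rule described for --minScoreToFilter can be illustrated on (query name, alignment score) pairs. This is a sketch of the rule as worded above, not the tool's code; the function name queries_passing_min_score and the sample scores are hypothetical.

```python
from collections import defaultdict


def queries_passing_min_score(alignments, min_score):
    """Return query names whose alignment scores sum to at least min_score."""
    totals = defaultdict(int)
    for query_name, score in alignments:
        # Both mates and any supplementary or secondary alignments share a name,
        # so their scores accumulate under the same key.
        totals[query_name] += score
    return {name for name, total in totals.items() if total >= min_score}


if __name__ == "__main__":
    alignments = [("r1", 20), ("r1", 15), ("r2", 25), ("r3", 30)]
    print(sorted(queries_passing_min_score(alignments, 30)))
```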
2.2.2.25. extract_tarball
Extract an input .tar, .tgz, .tar.gz, .tar.bz2, .tar.lz4, or .zip file to a given directory (or to one chosen automatically). Emit the resulting directory path to stdout.
read_utils.py extract_tarball [-h] [--compression {gz,bz2,lz4,zip,none,auto}]
[--pipe_hint PIPE_HINT] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
tarfile out_dir
2.2.2.25.1. Positional Arguments
- tarfile
Input tar file. May be “-” for stdin.
- out_dir
Output directory
2.2.2.25.2. Named Arguments
- --compression
Possible choices: gz, bz2, lz4, zip, none, auto
Compression type (default: ‘auto’). Auto-detect is incompatible with stdin input unless pipe_hint is specified.
Default:
'auto'
- --pipe_hint
If tarfile is stdin, you can provide a file-like URI string for pipe_hint which ends with a common compression file extension if you want to use compression=auto.
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
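The interaction between --compression auto, stdin input, and --pipe_hint can be sketched as a small lookup on the file name. The extension table mirrors the formats listed in the description; the function name detect_compression and the example URI are hypothetical, and the real detection logic may differ.

```python
# Longer extensions are listed before ".tar" so they win the suffix match.
EXTENSIONS = [
    (".tar.gz", "gz"),
    (".tgz", "gz"),
    (".tar.bz2", "bz2"),
    (".tar.lz4", "lz4"),
    (".zip", "zip"),
    (".tar", "none"),
]


def detect_compression(tarfile, compression="auto", pipe_hint=None):
    """Resolve the compression type, using pipe_hint when reading from stdin."""
    if compression != "auto":
        return compression
    name = pipe_hint if tarfile == "-" else tarfile
    if name is None:
        raise ValueError("compression=auto with stdin input requires pipe_hint")
    lowered = name.lower()
    for extension, kind in EXTENSIONS:
        if lowered.endswith(extension):
            return kind
    raise ValueError(f"cannot infer compression from {name!r}")


if __name__ == "__main__":
    print(detect_compression("run.tar.gz"))
    print(detect_compression("-", pipe_hint="gs://bucket/run.tar.lz4"))
```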
2.2.2.26. read_names
Extract read names from a sequence file
read_utils.py read_names [-h] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
in_reads out_read_names
2.2.2.26.1. Positional Arguments
- in_reads
the input reads ([compressed] fasta or bam)
- out_read_names
the read names
2.2.2.26.2. Named Arguments
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.2.2.27. trim_rmdup_subsamp
Take reads through Trimmomatic, Prinseq, and subsampling.
This function performs three main operations:
1. Trimmomatic: adapter trimming and quality filtering
2. Prinseq: duplicate read removal
3. Subsampling: reduce to target read count
Args:
- inBam: Input unaligned BAM file
- clipDb: Trimmomatic adapter clip database (FASTA)
- outBam: Output unaligned BAM file
- n_reads: Target number of individual reads (default: 100000)
- trim_opts: Optional dict of Trimmomatic options
- threads: Number of threads for Trimmomatic (default: all available)
Returns:
Tuple of (n_input, n_trim, n_rmdup, n_output, n_paired_subsamp, n_unpaired_subsamp) where all counts are individual reads (not pairs).
read_utils.py trim_rmdup_subsamp [-h] [--n_reads N_READS] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inBam clipDb outBam
2.2.2.27.1. Positional Arguments
- inBam
Input reads, unaligned BAM format.
- clipDb
Trimmomatic clip DB (FASTA with adapter sequences).
- outBam
Output reads, unaligned BAM format (currently, read groups and other header information are destroyed in this process).
2.2.2.27.2. Named Arguments
- --n_reads
Subsample to no more than this many individual reads. Note that paired reads are given priority, and unpaired reads are included to reach the count if there are too few paired reads. (default: 100000)
Default:
100000
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
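The priority rule for --n_reads (paired reads first, unpaired reads only to fill the remainder) can be sketched as simple arithmetic on read counts. Keeping pairs whole by rounding the paired budget down to an even number is an assumption made for this illustration, and the function name plan_subsample is hypothetical.

```python
def plan_subsample(n_paired_reads, n_unpaired_reads, n_reads=100000):
    """Return (paired reads to keep, unpaired reads to keep), as individual reads."""
    if n_paired_reads % 2:
        raise ValueError("paired read count must be even (two reads per pair)")
    # Assumption: pairs are kept whole, so the paired budget is an even number.
    paired_budget = n_reads - (n_reads % 2)
    take_paired = min(n_paired_reads, paired_budget)
    # Unpaired reads only fill whatever room the paired reads leave.
    take_unpaired = min(n_unpaired_reads, n_reads - take_paired)
    return take_paired, take_unpaired


if __name__ == "__main__":
    print(plan_subsample(200000, 5000))
    print(plan_subsample(60000, 50000))
```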