3.2. assembly.py - de novo assembly

This script contains a number of utilities for viral sequence assembly from NGS reads. It is primarily used for Lassa and Ebola virus analysis in the Sabeti Lab / Broad Institute Viral Genomics.

usage: assembly.py subcommand
Sub-commands:
assemble_trinity

Undocumented

usage: assembly.py assemble_trinity [-h] [--n_reads N_READS]
                                    [--outReads OUTREADS]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmpDir TMPDIR]
                                    [--tmpDirKeep]
                                    inBam clipDb outFasta
Positional arguments:
inBam Input reads, BAM format.
clipDb Trimmomatic clip DB.
outFasta Output assembly.
Options:
--n_reads=100000
 Subsample reads to no more than this many pairs. (default: 100000)
--outReads Save the trimmomatic/prinseq/subsamp reads to a BAM file
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
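
As an illustration of how the positional arguments and options above fit together, here is a minimal sketch that builds an assemble_trinity command line in Python. The helper name and the file names (sample.bam, adapters.fasta, contigs.fasta) are hypothetical placeholders, not part of the tool; only the flags come from the usage text above.

```python
import shlex


def assemble_trinity_cmd(in_bam, clip_db, out_fasta, n_reads=None, out_reads=None):
    """Build an argument list for `assembly.py assemble_trinity`.

    Hypothetical helper: it only arranges the documented flags and
    positional arguments; it does not run anything.
    """
    cmd = ["assembly.py", "assemble_trinity"]
    if n_reads is not None:
        cmd += ["--n_reads", str(n_reads)]
    if out_reads is not None:
        cmd += ["--outReads", out_reads]
    # Positional arguments go last, in the documented order.
    cmd += [in_bam, clip_db, out_fasta]
    return cmd


# File names below are placeholders for illustration only.
cmd = assemble_trinity_cmd("sample.bam", "adapters.fasta", "contigs.fasta", n_reads=50000)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the script and its dependencies are installed.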
order_and_orient

Undocumented

usage: assembly.py order_and_orient [-h] [--inReads INREADS]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmpDir TMPDIR]
                                    [--tmpDirKeep]
                                    inFasta inReference outFasta
Positional arguments:
inFasta Input assembly/contigs, FASTA format.
inReference Reference genome, FASTA format.
outFasta Output assembly, FASTA format.
Options:
--inReads Input reads in BAM format.
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
impute_from_reference

Undocumented

usage: assembly.py impute_from_reference [-h] [--newName NEWNAME]
                                         [--minLength MINLENGTH]
                                         [--minUnambig MINUNAMBIG]
                                         [--replaceLength REPLACELENGTH]
                                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                         [--version] [--tmpDir TMPDIR]
                                         [--tmpDirKeep]
                                         inFasta inReference outFasta
Positional arguments:
inFasta Input assembly/contigs, FASTA format.
inReference Reference genome, FASTA format.
outFasta Output assembly, FASTA format.
Options:
--newName rename output chromosome (default: do not rename)
--minLength=0 minimum length for contig (default: 0)
--minUnambig=0.0
 minimum percentage unambiguous bases for contig (default: 0.0)
--replaceLength=0
 length of ends to be replaced with reference (default: 0)
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
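
The --replaceLength option describes replacing the ends of a contig with reference sequence. The sketch below shows one way that idea could work on two already-aligned, equal-length strings. This is an assumption for illustration: the function name is hypothetical and the actual script may align and splice the sequences differently.

```python
def replace_ends(aligned_contig, aligned_ref, replace_length):
    """Overwrite the first and last `replace_length` columns of the contig
    with the corresponding reference columns.

    Assumes both inputs come from the same pairwise alignment, so they
    have equal length and matching column positions.
    """
    if len(aligned_contig) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")
    # Never let the two replaced ends overlap in the middle.
    n = min(replace_length, len(aligned_contig) // 2)
    if n <= 0:
        return aligned_contig
    middle = aligned_contig[n:len(aligned_contig) - n]
    return aligned_ref[:n] + middle + aligned_ref[-n:]


patched = replace_ends("NNGGGGNN", "ACTTTTGT", 2)
print(patched)
```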
refine_assembly

Undocumented

usage: assembly.py refine_assembly [-h] [--outBam OUTBAM] [--outVcf OUTVCF]
                                   [--min_coverage MIN_COVERAGE]
                                   [--novo_params NOVO_PARAMS]
                                   [--chr_names [CHR_NAMES [CHR_NAMES ...]]]
                                   [--keep_all_reads] [--JVMmemory JVMMEMORY]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmpDir TMPDIR]
                                   [--tmpDirKeep]
                                   inFasta inBam outFasta
Positional arguments:
inFasta Input assembly, FASTA format, pre-indexed for Picard, Samtools, and Novoalign.
inBam Input reads, BAM format.
outFasta Output refined assembly, FASTA format, indexed for Picard, Samtools, and Novoalign.
Options:
--outBam Reads aligned to inFasta. Unaligned and duplicate reads have been removed. GATK indel realigned.
--outVcf GATK genotype calls for genome in inFasta coordinate space.
--min_coverage=3
 Minimum read coverage required to call a position unambiguous.
--novo_params=-r Random -l 40 -g 40 -x 20 -t 100
 Alignment parameters for Novoalign.
--chr_names=[] Rename all output chromosomes (default: retain original chromosome names)
--keep_all_reads=False
 Retain all reads in BAM file? Default is to remove unaligned and duplicate reads.
--JVMmemory=2g JVM virtual memory size (default: 2g)
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
filter_short_seqs

Undocumented

usage: assembly.py filter_short_seqs [-h] [-f FORMAT] [-of OUTPUT_FORMAT]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version]
                                     inFile minLength minUnambig outFile
Positional arguments:
inFile input sequence file
minLength minimum length for contig
minUnambig minimum percentage unambiguous bases for contig
outFile output file
Options:
-f=fasta, --format=fasta
 Format for input sequence (default: fasta)
-of=fasta, --output-format=fasta
 Format for output sequence (default: fasta)
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
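
The minLength and minUnambig criteria can be pictured with the following minimal sketch. It assumes that minUnambig is given as a fraction between 0 and 1, that "unambiguous" means A, C, G, or T, and that both thresholds are inclusive; none of these details is confirmed by the help text, and the function name is hypothetical.

```python
def passes_filter(seq, min_length, min_unambig):
    """Return True if `seq` is long enough and has a large enough share
    of unambiguous bases.

    Assumptions (not confirmed by the help text): `min_unambig` is a
    fraction in [0, 1], and only A/C/G/T count as unambiguous.
    """
    if len(seq) < min_length:
        return False
    if not seq:
        # Empty sequence: only passes when no unambiguous bases are required.
        return min_unambig <= 0
    unambig = sum(1 for base in seq.upper() if base in "ACGT")
    return unambig / len(seq) >= min_unambig


contigs = ["ACGTNNNN", "ACNNNNNN", "ACG"]
kept = [s for s in contigs if passes_filter(s, min_length=5, min_unambig=0.5)]
print(kept)
```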
modify_contig

Undocumented

usage: assembly.py modify_contig [-h] [-n NAME] [-cn] [-t] [-r5] [-r3]
                                 [-l REPLACE_LENGTH] [-f FORMAT] [-r] [-rn]
                                 [-ca] [--tmpDir TMPDIR] [--tmpDirKeep]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version]
                                 input output ref
Positional arguments:
input input alignment of reference and contig (should contain exactly 2 sequences)
output Destination file for modified contigs
ref reference sequence name (exact match required)
Options:
-n, --name fasta header output name (default: existing header)
-cn=False, --call-reference-ns=False
 Call the reference base where the contig has an N and the reference has a more specific base (default: False)
-t=False, --trim-ends=False
 Trim the ends of contig.fasta to the length of the reference (default: False)
-r5=False, --replace-5ends=False
 Replace the 5’-end of contig.fasta with the reference (default: False)
-r3=False, --replace-3ends=False
 Replace the 3’-end of contig.fasta with the reference (default: False)
-l=10, --replace-length=10
 Length of the ends to be replaced (used with --replace-5ends or --replace-3ends) (default: 10)
-f=fasta, --format=fasta
 Format for input alignment (default: fasta)
-r=False, --replace-end-gaps=False
 Replace gaps at the beginning and end of the sequence with reference sequence (default: False)
-rn=False, --remove-end-ns=False
 Remove leading and trailing N’s in the contig (default: False)
-ca=False, --call-reference-ambiguous=False
 Call the reference base where the contig base is ambiguous and the reference base is more informative and consistent with the ambiguous base (e.g. Y->C) (default: False)
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False
 Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
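
The --call-reference-ns and --call-reference-ambiguous options both decide, column by column, whether a reference base should replace a less specific contig base. The sketch below illustrates that decision using the standard IUPAC nucleotide codes. The function name is hypothetical, and the assumption that N is governed only by --call-reference-ns (not by --call-reference-ambiguous) is a guess for illustration rather than documented behavior.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def call_reference(contig_base, ref_base, call_ns=False, call_ambiguous=False):
    """Pick the output base for one aligned column.

    Assumption: N is handled only by `call_ns`; other ambiguity codes
    are handled only by `call_ambiguous`.
    """
    c, r = contig_base.upper(), ref_base.upper()
    if r not in {"A", "C", "G", "T"}:
        return c  # reference is not a specific base, keep the contig
    if c == "N":
        return r if call_ns else c
    if len(IUPAC.get(c, "")) > 1 and r in IUPAC[c]:
        # Reference is consistent with the ambiguous contig base (e.g. Y -> C).
        return r if call_ambiguous else c
    return c


print(call_reference("Y", "C", call_ambiguous=True))
print(call_reference("Y", "A", call_ambiguous=True))
```

In the second call the reference base A is not one of the bases that Y (C or T) can represent, so the contig base is kept.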
vcf_to_fasta

Undocumented

usage: assembly.py vcf_to_fasta [-h] [--trim_ends] [--min_coverage MIN_DP]
                                [--major_cutoff MAJOR_CUTOFF]
                                [--min_dp_ratio MIN_DP_RATIO]
                                [--name [NAME [NAME ...]]]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version]
                                inVcf outFasta
Positional arguments:
inVcf Input VCF file
outFasta Output FASTA file
Options:
--trim_ends=False
 If specified, we will strip off continuous runs of N’s from the beginning and end of the sequences before writing to output. Interior N’s will not be changed.
--min_coverage=3
 Specify minimum read coverage (with full agreement) to make a call. [default: 3]
--major_cutoff=0.5
 If the major allele is present at a frequency higher than this cutoff, we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position. [default: 0.5]
--min_dp_ratio=0.0
 The input VCF file often reports two read depth values (DP): one for the position as a whole, and one for the sample in question. We can optionally reject calls in which the sample read count is below a specified fraction of the total read count. This filter will not apply to any sites unless both DP values are reported. [default: 0.0]
--name=[] output sequence names (default: reference names in VCF file)
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
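
The --major_cutoff and --trim_ends descriptions above can be sketched as follows. This is a simplified illustration, not the script's implementation: it works on a plain dictionary of per-allele read counts for single-base alleles, treats --min_coverage as a total-depth threshold (the "with full agreement" condition is not modeled), and uses hypothetical function names.

```python
# Standard IUPAC codes, inverted so a set of bases maps to its code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}


def call_base(allele_counts, min_coverage=3, major_cutoff=0.5):
    """Call one position from {base: read_count} (single-base alleles only)."""
    total = sum(allele_counts.values())
    if total == 0 or total < min_coverage:
        return "N"  # assumption: too little coverage yields N
    major, major_count = max(allele_counts.items(), key=lambda kv: kv[1])
    if major_count / total > major_cutoff:
        return major  # strictly above the cutoff: unambiguous call
    # At or below the cutoff: ambiguity code covering every observed allele.
    observed = frozenset(b for b, n in allele_counts.items() if n > 0)
    return CODE_FOR[observed]


def trim_end_ns(seq):
    """Strip runs of N from both ends; interior N's are left unchanged."""
    return seq.strip("Nn")


print(call_base({"A": 9, "G": 1}))
print(call_base({"C": 5, "T": 5}))
print(trim_end_ns("NNACGNTNN"))
```

With the default cutoff of 0.5, an exact 50/50 split between C and T is not strictly above the cutoff, so the position is called as the ambiguity code for C or T.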
trim_fasta

Undocumented

usage: assembly.py trim_fasta [-h]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version]
                              inFasta outFasta
Positional arguments:
inFasta Input fasta file
outFasta Output (trimmed) fasta file
Options:
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
deambig_fasta

Undocumented

usage: assembly.py deambig_fasta [-h]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version]
                                 inFasta outFasta
Positional arguments:
inFasta Input fasta file
outFasta Output fasta file
Options:
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
dpdiff

Undocumented

usage: assembly.py dpdiff [-h]
                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                          [--version]
                          inVcfs [inVcfs ...] outFile
Positional arguments:
inVcfs Input VCF files (one or more)
outFile Output flat file
Options:
--loglevel=DEBUG
 Verboseness of output. [default: DEBUG] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit