3.2. assembly.py - de novo assembly
This script contains a number of utilities for viral sequence assembly from NGS reads. It is primarily used for Lassa and Ebola virus analysis in the Sabeti Lab / Broad Institute Viral Genomics.
usage: assembly.py subcommand
- Sub-commands:
- assemble_trinity
Undocumented
usage: assembly.py assemble_trinity [-h] [--n_reads N_READS] [--outReads OUTREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inBam clipDb outFasta
- Positional arguments:
inBam Input reads, BAM format.
clipDb Trimmomatic clip DB.
outFasta Output assembly.
- Options:
--n_reads=100000 Subsample reads to no more than this many pairs. (default: 100000)
--outReads Save the Trimmomatic/PRINSEQ/subsampled reads to a BAM file.
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
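As an illustration of how the positional arguments and options above fit together, here is a minimal Python sketch that only builds the argument list for an assemble_trinity call and does not run anything. The helper name `assemble_trinity_argv` and the file names are hypothetical, and the sketch assumes the script is invoked as `assembly.py`; adjust to however it is installed on your system.

```python
import shlex


def assemble_trinity_argv(in_bam, clip_db, out_fasta, n_reads=None, out_reads=None):
    """Build an argument list for `assembly.py assemble_trinity`.

    Hypothetical helper: it mirrors the usage line above and nothing more.
    """
    argv = ["assembly.py", "assemble_trinity"]
    if n_reads is not None:
        argv += ["--n_reads", str(n_reads)]
    if out_reads is not None:
        argv += ["--outReads", out_reads]
    # Positional arguments come last, in the documented order.
    argv += [in_bam, clip_db, out_fasta]
    return argv


# File names below are placeholders, not files shipped with the tool.
argv = assemble_trinity_argv("sample.bam", "clipDb.fasta", "contigs.fasta", n_reads=50000)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` once the paths point at real files.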
- order_and_orient
Undocumented
usage: assembly.py order_and_orient [-h] [--inReads INREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inFasta inReference outFasta
- Positional arguments:
inFasta Input assembly/contigs, FASTA format.
inReference Reference genome, FASTA format.
outFasta Output assembly, FASTA format.
- Options:
--inReads Input reads in BAM format.
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
- impute_from_reference
Undocumented
usage: assembly.py impute_from_reference [-h] [--newName NEWNAME] [--minLength MINLENGTH] [--minUnambig MINUNAMBIG] [--replaceLength REPLACELENGTH] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inFasta inReference outFasta
- Positional arguments:
inFasta Input assembly/contigs, FASTA format.
inReference Reference genome, FASTA format.
outFasta Output assembly, FASTA format.
- Options:
--newName rename output chromosome (default: do not rename)
--minLength=0 minimum length for contig (default: 0)
--minUnambig=0.0 minimum percentage unambiguous bases for contig (default: 0.0)
--replaceLength=0 length of ends to be replaced with reference (default: 0)
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
- refine_assembly
Undocumented
usage: assembly.py refine_assembly [-h] [--outBam OUTBAM] [--outVcf OUTVCF] [--min_coverage MIN_COVERAGE] [--novo_params NOVO_PARAMS] [--chr_names [CHR_NAMES [CHR_NAMES ...]]] [--keep_all_reads] [--JVMmemory JVMMEMORY] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmpDir TMPDIR] [--tmpDirKeep] inFasta inBam outFasta
- Positional arguments:
inFasta Input assembly, FASTA format, pre-indexed for Picard, Samtools, and Novoalign.
inBam Input reads, BAM format.
outFasta Output refined assembly, FASTA format, indexed for Picard, Samtools, and Novoalign.
- Options:
--outBam Reads aligned to inFasta. Unaligned and duplicate reads have been removed. GATK indel realigned.
--outVcf GATK genotype calls for genome in inFasta coordinate space.
--min_coverage=3 Minimum read coverage required to call a position unambiguous.
--novo_params="-r Random -l 40 -g 40 -x 20 -t 100" Alignment parameters for Novoalign.
--chr_names=[] Rename all output chromosomes (default: retain original chromosome names)
--keep_all_reads=False Retain all reads in BAM file? Default is to remove unaligned and duplicate reads.
--JVMmemory=2g JVM virtual memory size (default: 2g)
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
- filter_short_seqs
Undocumented
usage: assembly.py filter_short_seqs [-h] [-f FORMAT] [-of OUTPUT_FORMAT] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] inFile minLength minUnambig outFile
- Positional arguments:
inFile input sequence file
minLength minimum length for contig
minUnambig minimum percentage unambiguous bases for contig
outFile output file
- Options:
-f=fasta, --format=fasta Format for input sequence (default: fasta)
-of=fasta, --output-format=fasta Format for output sequence (default: fasta)
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
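The minLength and minUnambig criteria can be pictured with a short Python sketch. This is not the script's own code: the function name `passes_filter` is hypothetical, and it assumes that "unambiguous" means A, C, G or T and that minUnambig is given as a fraction between 0 and 1 (the help text does not say whether a 0-100 percentage is expected).

```python
def passes_filter(seq, min_length, min_unambig):
    """Return True if seq meets both the length and unambiguity thresholds.

    Assumptions (not confirmed by the help text): unambiguous bases are
    A, C, G, T (case-insensitive) and min_unambig is a fraction in [0, 1].
    """
    if len(seq) < min_length:
        return False
    if not seq:
        # An empty sequence has no defined unambiguous fraction; reject it.
        return False
    unambiguous = sum(1 for base in seq.upper() if base in {"A", "C", "G", "T"})
    return unambiguous / len(seq) >= min_unambig


contigs = {"good": "ACGTACGTNN", "short": "ACG", "mostly_n": "ACNNNNNNNN"}
kept = [name for name, seq in contigs.items() if passes_filter(seq, 5, 0.5)]
print(kept)
```

With these thresholds only the first contig survives: the second is shorter than 5 bases and the third is only 20% unambiguous.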
- modify_contig
Undocumented
usage: assembly.py modify_contig [-h] [-n NAME] [-cn] [-t] [-r5] [-r3] [-l REPLACE_LENGTH] [-f FORMAT] [-r] [-rn] [-ca] [--tmpDir TMPDIR] [--tmpDirKeep] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] input output ref
- Positional arguments:
input input alignment of reference and contig (should contain exactly 2 sequences)
output Destination file for modified contigs
ref reference sequence name (exact match required)
- Options:
-n, --name fasta header output name (default: existing header)
-cn=False, --call-reference-ns=False Call the reference sequence if there is an N in the contig and a more specific base in the reference (default: False)
-t=False, --trim-ends=False Trim the ends of contig.fasta to the length of the reference (default: False)
-r5=False, --replace-5ends=False Replace the 5’-end of contig.fasta with the reference (default: False)
-r3=False, --replace-3ends=False Replace the 3’-end of contig.fasta with the reference (default: False)
-l=10, --replace-length=10 length of ends to be replaced (applies when --replace-5ends or --replace-3ends is set) (default: 10)
-f=fasta, --format=fasta Format for input alignment (default: fasta)
-r=False, --replace-end-gaps=False Replace gaps at the beginning and end of the sequence with reference sequence (default: False)
-rn=False, --remove-end-ns=False Remove leading and trailing N’s in the contig (default: False)
-ca=False, --call-reference-ambiguous=False Call the reference sequence if the contig base is ambiguous and the reference base is more informative and consistent with the ambiguous base (e.g. Y->C) (default: False)
--tmpDir=/tmp Base directory for temp files. [default: /tmp]
--tmpDirKeep=False Keep the tmpDir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
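The idea behind --call-reference-ambiguous (the Y->C example above) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script's implementation: the names `IUPAC`, `call_reference_ambiguous` and `apply_to_alignment` are hypothetical, the table is the standard IUPAC nucleotide code, and N is deliberately left alone here on the assumption that it is governed by the separate --call-reference-ns option.

```python
# Standard IUPAC ambiguity codes and the bases each one can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def call_reference_ambiguous(contig_base, ref_base):
    """Return the reference base when it resolves an ambiguous contig base.

    The contig base is replaced only if it is an ambiguity code (N excluded,
    by assumption) and the reference base is one of the bases it allows.
    """
    c, r = contig_base.upper(), ref_base.upper()
    if c in IUPAC and r in {"A", "C", "G", "T"} and r in IUPAC[c]:
        return r
    return contig_base


def apply_to_alignment(contig, ref):
    """Apply the rule column by column to two aligned, equal-length strings."""
    return "".join(call_reference_ambiguous(c, r) for c, r in zip(contig, ref))


print(apply_to_alignment("ACYTR", "ACCTT"))  # ACCTR
```

In the example, Y is resolved to C because C is consistent with Y, while R (A or G) stays as it is because the reference T is not consistent with it.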
- vcf_to_fasta
Undocumented
usage: assembly.py vcf_to_fasta [-h] [--trim_ends] [--min_coverage MIN_DP] [--major_cutoff MAJOR_CUTOFF] [--min_dp_ratio MIN_DP_RATIO] [--name [NAME [NAME ...]]] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] inVcf outFasta
- Positional arguments:
inVcf Input VCF file
outFasta Output FASTA file
- Options:
--trim_ends=False If specified, we will strip off continuous runs of N’s from the beginning and end of the sequences before writing to output. Interior N’s will not be changed.
--min_coverage=3 Specify minimum read coverage (with full agreement) to make a call. [default: 3]
--major_cutoff=0.5 If the major allele is present at a frequency higher than this cutoff, we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position. [default: 0.5]
--min_dp_ratio=0.0 The input VCF file often reports two read depth values (DP): one for the position as a whole, and one for the sample in question. We can optionally reject calls in which the sample read count is below a specified fraction of the total read count. This filter will not apply to any sites unless both DP values are reported. [default: 0.0]
--name=[] output sequence names (default: reference names in VCF file)
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
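The --major_cutoff and --trim_ends rules described above can be sketched as follows. This is a simplified illustration, not the script's code: `call_base` and `trim_ends` are hypothetical names, alleles are assumed to be single-nucleotide (no indels), per-allele read counts are assumed to be already extracted from the VCF, and the --min_coverage and --min_dp_ratio filters are not modeled.

```python
# Map each set of observed bases to its IUPAC code (single bases map to themselves).
IUPAC_CODES = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_base(allele_counts, major_cutoff=0.5):
    """Call one position from a dict of {allele: read count}."""
    observed = {a: n for a, n in allele_counts.items() if n > 0}
    total = sum(observed.values())
    if total == 0:
        return "N"
    major = max(observed, key=observed.get)
    # Strictly higher than the cutoff: call the major allele outright.
    if observed[major] / total > major_cutoff:
        return major
    # Otherwise emit an ambiguity code covering every observed allele.
    return IUPAC_CODES.get(frozenset(observed), "N")


def trim_ends(seq):
    """Strip runs of N from both ends; interior N's are left untouched."""
    return seq.strip("Nn")


print(call_base({"A": 9, "G": 1}))  # A
print(call_base({"A": 5, "G": 5}))  # R
print(trim_ends("NNACNGTNN"))       # ACNGT
```

Note that a 50/50 split is called as ambiguous at the default cutoff, because the help text requires the major allele frequency to be higher than, not equal to, the cutoff.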
- trim_fasta
Undocumented
usage: assembly.py trim_fasta [-h] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] inFasta outFasta
- Positional arguments:
inFasta Input fasta file
outFasta Output (trimmed) fasta file
- Options:
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
- deambig_fasta
Undocumented
usage: assembly.py deambig_fasta [-h] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] inFasta outFasta
- Positional arguments:
inFasta Input fasta file
outFasta Output fasta file
- Options:
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
- dpdiff
Undocumented
usage: assembly.py dpdiff [-h] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] inVcfs [inVcfs ...] outFile
- Positional arguments:
inVcfs Input VCF files (one or more)
outFile Output flat file
- Options:
--loglevel=DEBUG Verboseness of output. [default: DEBUG]
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit