2.3. assembly.py - viral sequence assembly from NGS reads
This script contains a number of utilities for viral sequence assembly from NGS reads. Primarily used for Lassa and Ebola virus analysis in the Sabeti Lab / Broad Institute Viral Genomics.
usage: assembly.py subcommand
2.3.1. subcommands
Possible choices: assemble_spades, gapfill_gap2seq, cluster_references_ani, skani_contigs_to_refs, order_and_orient, impute_from_reference, refine_assembly, normalize_coverage, filter_short_seqs, modify_contig, vcf_to_fasta, trim_fasta, deambig_fasta, alignment_summary, simulate_illumina_reads
2.3.2. Sub-commands
2.3.2.1. assemble_spades
De novo RNA-seq assembly with the SPAdes assembler.
assembly.py assemble_spades [-h] [--contigsTrusted CONTIGS_TRUSTED]
[--contigsUntrusted CONTIGS_UNTRUSTED]
[--nReads N_READS] [--kmerSizes KMER_SIZES]
[--outReads OUTREADS] [--filterContigs]
[--alwaysSucceed] [--minContigLen MIN_CONTIG_LEN]
[--spadesOpts SPADES_OPTS]
[--memLimitGb MEM_LIMIT_GB] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
in_bam clip_db out_fasta
2.3.2.1.1. Positional Arguments
- in_bam
Input unaligned reads, BAM format. May include both paired and unpaired reads.
- clip_db
Trimmomatic clip db
- out_fasta
Output assembled contigs. Note that, since this is RNA-seq assembly, for each assembled genomic region there may be several contigs representing different variants of that region.
2.3.2.1.2. Named Arguments
- --contigsTrusted
Optional input contigs of high quality, previously assembled from the same sample
- --contigsUntrusted
Optional input contigs of medium quality, previously assembled from the same sample
- --nReads
Before assembly, subsample the reads to at most this many
Default:
10000000
- --kmerSizes
Comma-separated list of odd k-mer sizes to attempt, in ascending order
- --outReads
Save the trimmomatic/prinseq/subsamp reads to a BAM file
- --filterContigs
only output contigs SPAdes is sure of (drop lesser-quality contigs from output)
Default:
False
- --alwaysSucceed
if assembly fails for any reason, output an empty contigs file, rather than failing with an error code
Default:
True
- --minContigLen
only output contigs longer than this many bp
Default:
0
- --spadesOpts
(advanced) Extra flags to pass to the SPAdes assembler
Default:
'--rnaviral'
- --memLimitGb
Max memory to use, in GB (default: 4)
Default:
4
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
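As an illustration of how the positional and named arguments above fit together, the following minimal sketch builds an argument list for this sub-command. The helper name `build_assemble_spades_cmd` and the file names are hypothetical; only the flags listed above are used, and the sketch does not run the tool.

```python
import shlex


def build_assemble_spades_cmd(in_bam, clip_db, out_fasta,
                              n_reads=None, min_contig_len=None,
                              filter_contigs=False, threads=None):
    """Return an argv list for `assembly.py assemble_spades` (hypothetical helper)."""
    cmd = ["assembly.py", "assemble_spades"]
    if n_reads is not None:
        cmd += ["--nReads", str(n_reads)]
    if min_contig_len is not None:
        cmd += ["--minContigLen", str(min_contig_len)]
    if filter_contigs:
        cmd.append("--filterContigs")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    # Positional arguments come last, in the documented order.
    cmd += [in_bam, clip_db, out_fasta]
    return cmd


# File names below are placeholders, not files shipped with the tool.
cmd = build_assemble_spades_cmd("reads.bam", "adapters.fasta", "contigs.fasta",
                                n_reads=500000, min_contig_len=300)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` in an environment where `assembly.py` is on the PATH.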
2.3.2.2. gapfill_gap2seq
This step runs the Gap2Seq tool to close gaps between contigs in a scaffold.
assembly.py gapfill_gap2seq [-h] [--memLimitGb MEM_LIMIT_GB]
[--timeSoftLimitMinutes TIME_SOFT_LIMIT_MINUTES]
[--maskErrors] [--gap2seqOpts GAP2SEQ_OPTS]
[--randomSeed RANDOM_SEED] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
in_scaffold in_bam out_scaffold
2.3.2.2.1. Positional Arguments
- in_scaffold
FASTA file containing the scaffold. Each FASTA record corresponds to one segment (for multi-segment genomes). Contigs within each segment are separated by Ns.
- in_bam
Input unaligned reads, BAM format.
- out_scaffold
Output assembly.
2.3.2.2.2. Named Arguments
- --memLimitGb
Max memory to use, in gigabytes (default: 4.0)
Default:
4.0
- --timeSoftLimitMinutes
Stop trying to close more gaps after this many minutes (default: 60.0); this is a soft/advisory limit
Default:
60.0
- --maskErrors
In case of any error, just copy in_scaffold to out_scaffold, emulating a successful run that simply could not fill any gaps.
Default:
False
- --gap2seqOpts
(advanced) Extra command-line options to pass to Gap2Seq
Default:
''
- --randomSeed
Random seed; 0 means use current time
Default:
0
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
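The in_scaffold layout described above (contigs within a segment separated by runs of Ns) can be inspected with a short sketch like the one below. The function name `split_scaffold` is hypothetical, and treating lowercase `n` as a gap character is an assumption; this is not how Gap2Seq itself parses the file.

```python
import re


def split_scaffold(seq):
    """Split one scaffold record into contigs and the lengths of the gaps between them."""
    # Ignore leading/trailing Ns: they are not gaps *between* contigs.
    core = seq.strip("Nn")
    contigs = [c for c in re.split(r"[Nn]+", core) if c]
    gap_lengths = [len(g) for g in re.findall(r"[Nn]+", core)]
    return contigs, gap_lengths


contigs, gaps = split_scaffold("ACGTNNNNGGCCNNTTAA")
print(contigs, gaps)
```

Comparing the gap list before and after a run gives a quick view of how many gaps were closed.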
2.3.2.3. cluster_references_ani
This step uses the skani triangle tool to define clusters of highly-related genomes.
assembly.py cluster_references_ani [-h] [-m M] [-s S] [-c C] [--min_af MIN_AF]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inRefs [inRefs ...] outClusters
2.3.2.3.1. Positional Arguments
- inRefs
FASTA files containing reference genomes
- outClusters
Output file containing clusters of highly-related genomes. Each line contains the filenames of the genomes in one cluster.
2.3.2.3.2. Named Arguments
- -m
marker k-mer compression factor (default: 15)
Default:
15
- -s
screen out pairs with less than this percent identity, estimated by k-mer sketching (default: 50)
Default:
50
- -c
compression factor (k-mer subsampling ratio) (default: 10)
Default:
10
- --min_af
minimum alignment fraction (default: 15)
Default:
15
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
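The outClusters file described above holds one cluster per line. A minimal reader might look like the following sketch; the tab delimiter is an assumption (the documentation only says each line contains the filenames of one cluster), and `parse_clusters` and the example file names are hypothetical.

```python
def parse_clusters(lines, sep="\t"):
    """Turn cluster lines into a list of lists of genome filenames.

    The delimiter is an assumption; adjust `sep` to match the actual output.
    """
    clusters = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        clusters.append(line.split(sep))
    return clusters


example = ["refA.fasta\trefB.fasta\n", "refC.fasta\n"]
clusters = parse_clusters(example)
# Reverse lookup: which cluster does each genome belong to?
member_to_cluster = {name: i for i, members in enumerate(clusters) for name in members}
print(clusters, member_to_cluster)
```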
2.3.2.4. skani_contigs_to_refs
assembly.py skani_contigs_to_refs [-h] [-m M] [-s S] [-c C] [-n N]
[--min_af MIN_AF] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inContigs inRefs [inRefs ...] out_skani_dist
out_skani_dist_filtered
out_clusters_filtered
2.3.2.4.1. Positional Arguments
- inContigs
FASTA file containing contigs
- inRefs
FASTA files containing reference genomes
- out_skani_dist
Output file containing ANI distances between contigs and references
- out_skani_dist_filtered
Output file containing ANI distances between contigs and references, with only the top reference hit per cluster
- out_clusters_filtered
Output file containing clusters of highly-related genomes, with only clusters that have a hit to the contigs
2.3.2.4.2. Named Arguments
- -m
marker k-mer compression factor (default: 15)
Default:
15
- -s
screen out pairs with less than this percent identity, estimated by k-mer sketching (default: 50)
Default:
50
- -c
compression factor (k-mer subsampling ratio) (default: 10)
Default:
10
- -n
maximum number of hits to report (default: unlimited)
- --min_af
minimum alignment fraction (default: 15)
Default:
15
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
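The relationship between the three outputs above (all hits, top hit per cluster, and clusters with at least one hit) can be sketched as follows. This is an illustration only: the in-memory data shapes, the function name `top_hit_per_cluster`, and ranking hits by ANI are assumptions, not the tool's documented file format or selection rule.

```python
def top_hit_per_cluster(hits, clusters):
    """Keep the best-scoring reference per cluster and drop clusters with no hit.

    hits:     list of (reference_name, ani) tuples (assumed shape)
    clusters: list of lists of reference names
    """
    ref_to_cluster = {ref: i for i, members in enumerate(clusters) for ref in members}
    best = {}
    for ref, ani in hits:
        cid = ref_to_cluster.get(ref)
        if cid is None:
            continue  # hit to a reference that is not in any cluster
        if cid not in best or ani > best[cid][1]:
            best[cid] = (ref, ani)
    kept_ids = sorted(best)
    return [best[cid] for cid in kept_ids], [clusters[cid] for cid in kept_ids]


clusters = [["a", "b"], ["c"], ["d"]]
hits = [("a", 95.0), ("b", 98.5), ("d", 91.0)]
kept_hits, kept_clusters = top_hit_per_cluster(hits, clusters)
print(kept_hits, kept_clusters)
```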
2.3.2.5. order_and_orient
This step cleans up the de novo assembly with a known reference genome. Uses MUMmer (nucmer or promer) to create a reference-based consensus sequence of aligned contigs (with runs of N’s in between the de novo contigs).
assembly.py order_and_orient [-h] [--outAlternateContigs OUTALTERNATECONTIGS]
[--nGenomeSegments N_GENOME_SEGMENTS]
[--outReference OUTREFERENCE]
[--outStats OUTSTATS] [--allow_incomplete_output]
[--breaklen BREAKLEN] [--maxgap MAXGAP]
[--minmatch MINMATCH] [--mincluster MINCLUSTER]
[--min_pct_id MIN_PCT_ID]
[--min_contig_len MIN_CONTIG_LEN]
[--min_pct_contig_aligned MIN_PCT_CONTIG_ALIGNED]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inFasta inReference [inReference ...] outFasta
2.3.2.5.1. Positional Arguments
- inFasta
Input de novo assembly/contigs, FASTA format.
- inReference
Reference genome for ordering, orienting, and merging contigs, FASTA format. Multiple filenames may be listed, each containing one reference genome. Alternatively, multiple references may be given by specifying a single filename, and giving the number of reference segments with the nGenomeSegments parameter. If multiple references are given, they must all contain the same number of segments listed in the same order.
- outFasta
Output assembly, FASTA format, with the same number of chromosomes as inReference, and in the same order.
2.3.2.5.2. Named Arguments
- --outAlternateContigs
Output sequences (FASTA format) from alternative contigs that mapped, but were not chosen for the final output.
- --nGenomeSegments
Number of genome segments. If 0 (the default), the inReference parameter is treated as one genome. If positive, the inReference parameter is treated as a list of genomes of nGenomeSegments each.
Default:
0
- --outReference
Output the reference chosen for scaffolding to this file
- --outStats
Output stats used in reference selection
- --allow_incomplete_output
Do not fail with IncompleteAssemblyError if we fail to recover all segments of the desired genome.
Default:
False
- --breaklen, -b
Amount to extend alignment clusters by (if --extend). nucmer default 200, promer default 60.
- --maxgap, -g
Maximum gap between two adjacent matches in a cluster. Our default is 200. nucmer default 90, promer default 30. Manual suggests going to 1000.
Default:
200
- --minmatch, -l
Minimum length of a maximal exact match. Our default is 10. nucmer default 20, promer default 6.
Default:
10
- --mincluster, -c
Minimum cluster length. nucmer default 65, promer default 20.
- --min_pct_id, -i
show-tiling: minimum percent identity for contig alignment (0.0 - 1.0, default: 0.6)
Default:
0.6
- --min_contig_len
show-tiling: reject contigs smaller than this (default: 200)
Default:
200
- --min_pct_contig_aligned, -v
show-tiling: minimum percent of contig length in alignment (0.0 - 1.0, default: 0.3)
Default:
0.3
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
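The --nGenomeSegments behavior described above amounts to chunking a flat list of reference sequences into genomes of equal size. A minimal sketch, assuming the sequences are already loaded as an ordered list of names (the function name `group_references` and the segment names are hypothetical):

```python
def group_references(records, n_genome_segments=0):
    """Group an ordered list of reference sequences into genomes.

    With 0 (the default), all records form a single genome. With a positive
    value, consecutive runs of that many records each form one genome.
    """
    records = list(records)
    if n_genome_segments <= 0:
        return [records]
    if len(records) % n_genome_segments != 0:
        raise ValueError("number of sequences is not a multiple of n_genome_segments")
    return [records[i:i + n_genome_segments]
            for i in range(0, len(records), n_genome_segments)]


# Two hypothetical two-segment genomes stored in one list.
print(group_references(["L_1", "S_1", "L_2", "S_2"], n_genome_segments=2))
```

The divisibility check mirrors the stated requirement that every reference contains the same number of segments in the same order.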
2.3.2.6. impute_from_reference
This takes a de novo assembly, aligns against a reference genome, and imputes all missing positions (plus some of the chromosome ends) with the reference genome. This provides an assembly with the proper structure (but potentially wrong sequences in areas) from which we can perform further read-based refinement. Two steps:
- filter_short_seqs: We then toss out all assemblies that come out to < 15kb or < 95% unambiguous and fail otherwise.
- modify_contig: Finally, we trim off anything at the end that exceeds the length of the known reference assembly. We also replace all Ns and everything within 55bp of the chromosome ends with the reference sequence. This is clearly incorrect consensus sequence, but it allows downstream steps to map reads in parts of the genome that would otherwise be Ns, and we will correct all of the inferred positions with two steps of read-based refinement (below), and revert positions back to Ns where read support is lacking.
FASTA indexing: output assembly is indexed for Picard, Samtools, Novoalign.
assembly.py impute_from_reference [-h] [--newName NEWNAME]
[--minLengthFraction MINLENGTHFRACTION]
[--minUnambig MINUNAMBIG]
[--replaceLength REPLACELENGTH]
[--aligner {muscle,mafft,mummer}] [--index]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
inFasta inReference outFasta
2.3.2.6.1. Positional Arguments
- inFasta
Input assembly/contigs, FASTA format, already ordered, oriented and merged with inReference.
- inReference
Reference genome to impute with, FASTA format.
- outFasta
Output assembly, FASTA format.
2.3.2.6.2. Named Arguments
- --newName
rename output chromosome (default: do not rename)
- --minLengthFraction
minimum length for contig, as fraction of reference (default: 0.5)
Default:
0.5
- --minUnambig
minimum percentage unambiguous bases for contig (default: 0.5)
Default:
0.5
- --replaceLength
length of ends to be replaced with reference (default: 0)
Default:
0
- --aligner
Possible choices: muscle, mafft, mummer
which method to use to align inFasta to inReference. “muscle” = MUSCLE, “mafft” = MAFFT, “mummer” = nucmer. [default: ‘muscle’]
Default:
'muscle'
- --index
Index outFasta for Picard/GATK, Samtools, and Novoalign.
Default:
False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
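The imputation idea described for this sub-command (fill Ns and a fixed number of end positions from the reference) can be illustrated on a toy example. The sketch below assumes the contig and reference are already aligned, equal in length, and gap-free, which is a simplification; the function name `impute_from_aligned` is hypothetical and this is not the tool's implementation.

```python
def impute_from_aligned(contig_aln, ref_aln, replace_length=0):
    """Replace Ns, and `replace_length` positions at each end, with reference bases."""
    if len(contig_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have equal length")
    n = len(contig_aln)
    out = []
    for i, (c, r) in enumerate(zip(contig_aln, ref_aln)):
        near_end = i < replace_length or i >= n - replace_length
        # Take the reference base at chromosome ends and wherever the contig has an N.
        out.append(r if near_end or c in "Nn" else c)
    return "".join(out)


print(impute_from_aligned("NNGTNCAT", "ACGTACGT"))
print(impute_from_aligned("TTGTNCGG", "ACGTACGT", replace_length=2))
```

The imputed positions are placeholders: as the description notes, later read-based refinement is expected to correct them or revert them to Ns.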
2.3.2.7. refine_assembly
This is a refinement step where we take a crude assembly, align all reads back to it, and modify the assembly to the majority allele at each position based on read pileups. This step considers both SNPs and indels called by FreeBayes and will correct the consensus based on variant calls. Reads are aligned with Novoalign, then PCR duplicates are removed with Picard (in order to debias the allele counts in the pileups). Output FASTA file is indexed for Picard, Samtools, and Novoalign.
assembly.py refine_assembly [-h]
[--already_realigned_bam ALREADY_REALIGNED_BAM]
[--outBam OUTBAM] [--outVcf OUTVCF]
[--min_coverage MIN_COVERAGE]
[--major_cutoff MAJOR_CUTOFF]
[--novo_params NOVO_PARAMS]
[--chr_names [CHR_NAMES ...]] [--keep_all_reads]
[--max_coverage MAX_COVERAGE]
[--rasusa_seed RASUSA_SEED]
[--JVMmemory JVMMEMORY]
[--NOVOALIGN_LICENSE_PATH NOVOALIGN_LICENSE_PATH]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inFasta inBam outFasta
2.3.2.7.1. Positional Arguments
- inFasta
Input assembly, FASTA format, pre-indexed for Picard, Samtools, and Novoalign.
- inBam
Input reads, unaligned BAM format.
- outFasta
Output refined assembly, FASTA format, indexed for Picard, Samtools, and Novoalign.
2.3.2.7.2. Named Arguments
- --already_realigned_bam
BAM with reads that are already aligned to inFasta. This bypasses the alignment process by novoalign and instead uses the given BAM to make an assembly. When set, outBam is ignored.
- --outBam
Reads aligned to inFasta. Unaligned and duplicate reads have been removed.
- --outVcf
Variant calls for genome in inFasta coordinate space.
- --min_coverage
Minimum read coverage required to call a position unambiguous.
Default:
3
- --major_cutoff
If the major allele is present at a frequency higher than this cutoff, we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position. [default: 0.5]
Default:
0.5
- --novo_params
Alignment parameters for Novoalign.
Default:
'-r Random -l 40 -g 40 -x 20 -t 100'
- --chr_names
Rename all output chromosomes (default: retain original chromosome names)
Default:
[]
- --keep_all_reads
Retain all reads in BAM file? Default is to remove unaligned and duplicate reads.
Default:
False
- --max_coverage
Maximum read coverage depth for variant calling. If specified, reads are downsampled to this coverage using rasusa before FreeBayes variant calling. The downsampled BAM is internal only and not included in any output. [default: None]
- --rasusa_seed
Random seed for rasusa downsampling reproducibility. Only used when --max_coverage is set. [default: None]
- --JVMmemory
JVM virtual memory size for Picard/Novoalign (default: ‘2g’)
Default:
'2g'
- --NOVOALIGN_LICENSE_PATH
A path to the novoalign.lic file. This overrides the NOVOALIGN_LICENSE_PATH environment variable. (default: None)
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
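The interaction of --min_coverage and --major_cutoff described above can be sketched for a single SNP position. This is a simplified illustration: the function name `call_base` is hypothetical, indels are not handled, and the assumption that every allele with at least one supporting read contributes to the ambiguity code is not confirmed by the documentation.

```python
# IUPAC ambiguity codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def call_base(allele_counts, min_coverage=3, major_cutoff=0.5):
    """Call one consensus base from a dict like {'A': 9, 'G': 1}."""
    observed = {a: n for a, n in allele_counts.items() if n > 0}
    depth = sum(observed.values())
    if depth == 0 or depth < min_coverage:
        return "N"  # not enough reads to make a call
    major, major_n = max(observed.items(), key=lambda kv: kv[1])
    if major_n / depth > major_cutoff:
        return major  # strictly above the cutoff: unambiguous call
    return IUPAC[frozenset(observed)]  # at or below the cutoff: ambiguity code


print(call_base({"A": 9, "G": 1}), call_base({"A": 5, "G": 5}), call_base({"A": 1, "G": 1}))
```

Note the strict inequality: an exact 50/50 split with the default cutoff of 0.5 yields an ambiguity code rather than a base.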
2.3.2.8. normalize_coverage
Downsample an aligned BAM to a maximum coverage depth using rasusa. The output BAM is coordinate-sorted and indexed.
assembly.py normalize_coverage [-h] [--seed SEED] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inBam outBam max_coverage
2.3.2.8.1. Positional Arguments
- inBam
Input aligned BAM file, coordinate-sorted.
- outBam
Output downsampled BAM file, coordinate-sorted and indexed.
- max_coverage
Maximum coverage depth to downsample to.
2.3.2.8.2. Named Arguments
- --seed
Random seed for rasusa reproducibility. [default: None]
- --threads
Number of threads; by default all cores are used
Default:
2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
2.3.2.9. filter_short_seqs
Check sequences in inFile, retaining only those that are at least minLength
assembly.py filter_short_seqs [-h] [-f FORMAT] [-of OUTPUT_FORMAT]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
inFile minLength minUnambig outFile
2.3.2.9.1. Positional Arguments
- inFile
input sequence file
- minLength
minimum length for contig
- minUnambig
minimum percentage unambiguous bases for contig
- outFile
output file
2.3.2.9.2. Named Arguments
- -f, --format
Format for input sequence (default: ‘fasta’)
Default:
'fasta'
- -of, --output-format
Format for output sequence (default: ‘fasta’)
Default:
'fasta'
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
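The two thresholds taken by this sub-command (minLength and minUnambig) can be expressed as a small predicate. In this sketch minUnambig is treated as a fraction between 0 and 1 and only A, C, G, and T count as unambiguous; both are assumptions, and the function name `passes_filter` is hypothetical.

```python
def passes_filter(seq, min_length, min_unambig):
    """Return True if seq is long enough and has enough unambiguous bases."""
    if not seq or len(seq) < min_length:
        return False
    unambig = sum(1 for base in seq.upper() if base in "ACGT")
    return unambig / len(seq) >= min_unambig


seqs = {"good": "ACGTACGTNN", "short": "ACG", "mostly_n": "ACNNNNNNNN"}
kept = [name for name, s in seqs.items() if passes_filter(s, min_length=5, min_unambig=0.5)]
print(kept)
```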
2.3.2.10. modify_contig
Modifies an input contig. Depending on the options selected, can replace N calls with reference calls, replace ambiguous calls with reference calls, trim to the length of the reference, replace contig ends with reference calls, and trim leading and trailing Ns. Author: rsealfon.
assembly.py modify_contig [-h] [-n NAME] [-cn] [-t] [-r5] [-r3]
[-l REPLACE_LENGTH] [-f FORMAT] [-r] [-rn] [-ca]
[--tmp_dir TMP_DIR] [--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
input output ref
2.3.2.10.1. Positional Arguments
- input
input alignment of reference and contig (should contain exactly 2 sequences)
- output
Destination file for modified contigs
- ref
reference sequence name (exact match required)
2.3.2.10.2. Named Arguments
- -n, --name
fasta header output name (default: existing header)
- -cn, --call-reference-ns
should the reference sequence be called if there is an N in the contig and a more specific base in the reference (default: False)
Default:
False
- -t, --trim-ends
should ends of contig.fasta be trimmed to length of reference (default: False)
Default:
False
- -r5, --replace-5ends
should the 5’-end of contig.fasta be replaced by reference (default: False)
Default:
False
- -r3, --replace-3ends
should the 3’-end of contig.fasta be replaced by reference (default: False)
Default:
False
- -l, --replace-length
length of ends to be replaced (if replace-ends is yes) (default: 10)
Default:
10
- -f, --format
Format for input alignment (default: ‘fasta’)
Default:
'fasta'
- -r, --replace-end-gaps
Replace gaps at the beginning and end of the sequence with reference sequence (default: False)
Default:
False
- -rn, --remove-end-ns
Remove leading and trailing N’s in the contig (default: False)
Default:
False
- -ca, --call-reference-ambiguous
should the reference sequence be called if the contig seq is ambiguous and the reference sequence is more informative and consistent with the ambiguous base (i.e. Y->C) (default: False)
Default:
False
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
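The -ca / --call-reference-ambiguous rule above (for example Y->C when the reference has C) can be sketched on two aligned strings. This is an illustration under assumptions: the function name `call_reference_ambiguous` is hypothetical, N is left untouched here on the assumption that it is governed by -cn instead, and gap handling is reduced to "leave the contig base as is".

```python
# Bases represented by each IUPAC ambiguity code (N excluded on purpose, see lead-in).
IUPAC_SETS = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def call_reference_ambiguous(contig_aln, ref_aln):
    """Resolve ambiguous contig bases to the reference base when the two are consistent."""
    out = []
    for c, r in zip(contig_aln.upper(), ref_aln.upper()):
        options = IUPAC_SETS.get(c)
        if options is not None and r in ("A", "C", "G", "T") and r in options:
            out.append(r)  # reference is more specific and compatible with the code
        else:
            out.append(c)  # unambiguous, incompatible, N, or gap: keep the contig base
    return "".join(out)


print(call_reference_ambiguous("AYGRN", "ACGTA"))
```

In the example, Y resolves to C because C is one of the bases Y represents, while R (A or G) stays ambiguous because the reference base T is not compatible with it.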
2.3.2.11. vcf_to_fasta
Take input genotypes (VCF) and construct a consensus sequence (fasta) by using majority-read-count alleles in the VCF. Genotypes in the VCF will be ignored; we will use the allele with majority read support (or an ambiguity base if there is no clear majority). Uncalled positions will be emitted as N’s. Author: dpark.
assembly.py vcf_to_fasta [-h] [--trim_ends] [--min_coverage MIN_DP]
[--major_cutoff MAJOR_CUTOFF]
[--min_dp_ratio MIN_DP_RATIO] [--name [NAME ...]]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
inVcf outFasta
2.3.2.11.1. Positional Arguments
- inVcf
Input VCF file
- outFasta
Output FASTA file
2.3.2.11.2. Named Arguments
- --trim_ends
If specified, we will strip off continuous runs of N’s from the beginning and end of the sequences before writing to output. Interior N’s will not be changed.
Default:
False
- --min_coverage
Specify minimum read coverage (with full agreement) to make a call. [default: 3]
Default:
3
- --major_cutoff
If the major allele is present at a frequency higher than this cutoff, we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position. [default: 0.5]
Default:
0.5
- --min_dp_ratio
The input VCF file often reports two read depth values (DP): one for the position as a whole, and one for the sample in question. We can optionally reject calls in which the sample read count is below a specified fraction of the total read count. This filter will not apply to any sites unless both DP values are reported. [default: 0.0]
Default:
0.0
- --name
output sequence names (default: reference names in VCF file)
Default:
[]
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
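The --min_dp_ratio filter described above compares the sample-level depth with the site-level depth and is skipped when either is missing. A minimal sketch of that rule, with a hypothetical function name `passes_dp_ratio` and `None` standing in for an unreported DP value:

```python
def passes_dp_ratio(sample_dp, total_dp, min_dp_ratio=0.0):
    """Return True if the call should be kept under the DP-ratio filter."""
    if sample_dp is None or total_dp is None:
        return True  # the filter only applies when both DP values are reported
    # Multiplication avoids a division by zero when total_dp is 0.
    return sample_dp >= min_dp_ratio * total_dp


print(passes_dp_ratio(8, 10, 0.5), passes_dp_ratio(2, 10, 0.5), passes_dp_ratio(None, 10, 0.5))
```

With the default ratio of 0.0 every call passes, so the filter is effectively off unless a positive value is supplied.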
2.3.2.12. trim_fasta
Take input sequences (fasta) and trim any continuous sections of N’s from the ends of them. Write trimmed sequences to an output fasta file.
assembly.py trim_fasta [-h]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
inFasta outFasta
2.3.2.12.1. Positional Arguments
- inFasta
Input fasta file
- outFasta
Output (trimmed) fasta file
2.3.2.12.2. Named Arguments
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
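The trimming behavior described for this sub-command (remove runs of N at both ends, leave interior Ns alone) is a one-liner on a plain string. In this sketch the function name `trim_terminal_ns` is hypothetical, and treating lowercase `n` the same as `N` is an assumption.

```python
def trim_terminal_ns(seq):
    """Strip leading and trailing N characters; interior Ns are preserved."""
    return seq.strip("Nn")


print(trim_terminal_ns("NNACGNNTNN"))
```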
2.3.2.13. deambig_fasta
Take input sequences (fasta) and replace any ambiguity bases with a random unambiguous base from among the possibilities described by the ambiguity code. Write output to fasta file.
assembly.py deambig_fasta [-h]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
inFasta outFasta
2.3.2.13.1. Positional Arguments
- inFasta
Input fasta file
- outFasta
Output fasta file
2.3.2.13.2. Named Arguments
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
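The replacement rule described for this sub-command can be sketched with the standard IUPAC table. The function name `deambiguate` is hypothetical, and the choices made here (uniform random pick, N treated as any of A/C/G/T, other characters passed through unchanged) are assumptions rather than documented behavior.

```python
import random

# Unambiguous bases allowed by each IUPAC ambiguity code.
AMBIG = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def deambiguate(seq, rng=None):
    """Replace each ambiguity code with one randomly chosen compatible base."""
    if rng is None:
        rng = random.Random()
    out = []
    for base in seq:
        options = AMBIG.get(base.upper())
        out.append(rng.choice(options) if options else base)
    return "".join(out)


# Passing a seeded generator makes the result repeatable.
result = deambiguate("ACRYN", rng=random.Random(0))
print(result)
```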
2.3.2.14. alignment_summary
Write or print pairwise alignment summary information for sequences in two FASTA files, including SNPs, ambiguous bases, and indels.
assembly.py alignment_summary [-h] [--outfileName OUTFILENAME] [--printCounts]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
inFastaFileOne inFastaFileTwo
2.3.2.14.1. Positional Arguments
- inFastaFileOne
First fasta file for an alignment
- inFastaFileTwo
Second fasta file for an alignment
2.3.2.14.2. Named Arguments
- --outfileName
Output file for counts in TSV format
- --printCounts
Default:
False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
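The kinds of counts named in the description (SNPs, ambiguous bases, indels) can be illustrated on two already-aligned strings. This sketch is not the tool's algorithm: the function name `summarize_alignment` and the category definitions are assumptions, and indels are counted per alignment column rather than per event.

```python
def summarize_alignment(aln_a, aln_b):
    """Count matching, SNP, ambiguous, and indel columns in a pairwise alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    counts = {"same": 0, "snp": 0, "ambiguous": 0, "indel": 0}
    for a, b in zip(aln_a.upper(), aln_b.upper()):
        if a == "-" and b == "-":
            continue  # column is a gap in both sequences; nothing to compare
        if a == "-" or b == "-":
            counts["indel"] += 1
        elif a not in "ACGT" or b not in "ACGT":
            counts["ambiguous"] += 1
        elif a == b:
            counts["same"] += 1
        else:
            counts["snp"] += 1
    return counts


summary = summarize_alignment("ACGT-NA", "ATGTCAA")
print(summary)
```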
2.3.2.15. simulate_illumina_reads
Simulate Illumina paired-end reads using wgsim.
Args:
- in_fasta: Input reference fasta file
- out_bam: Output unaligned BAM file
- coverage: Coverage specification in one of three formats:
  - Single float value (e.g., 20.0) for uniform coverage across all sequences
  - Space-separated string with per-sequence depths (e.g., “chr1:20x chr2:2x”)
  - Path to BED file with 5th column containing depth values
- read_length: Length of each simulated read (default: 150)
- outer_distance: Outer distance between read pairs (default: 500)
- error_rate: Base error rate (default: 0.02)
- mutation_rate: Mutation rate (default: 0.001)
- indel_fraction: Fraction of errors that are indels (default: 0.15)
- indel_extended_prob: Probability an indel is extended (default: 0.3)
- random_seed: Random seed for reproducibility
- sample_name: Sample name for read group
- library_name: Library name for read group
assembly.py simulate_illumina_reads [-h] [--read_length READ_LENGTH]
[--outer_distance OUTER_DISTANCE]
[--error_rate ERROR_RATE]
[--mutation_rate MUTATION_RATE]
[--indel_fraction INDEL_FRACTION]
[--indel_extended_prob INDEL_EXTENDED_PROB]
[--random_seed RANDOM_SEED]
[--sample_name SAMPLE_NAME]
[--library_name LIBRARY_NAME]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
in_fasta out_bam coverage
2.3.2.15.1. Positional Arguments
- in_fasta
Input reference genome (fasta format)
- out_bam
Output unaligned BAM file
- coverage
Coverage specification in one of three formats:
  - Single value (e.g., “20” for 20X across all sequences)
  - Per-sequence string (e.g., “chr1:20x chr2:2x chr3:0.48x”)
  - Path to BED file where 5th column contains depth values
2.3.2.15.2. Named Arguments
- --read_length
Length of each read (default: 150)
Default:
150
- --outer_distance
Outer distance between read pairs (default: 500)
Default:
500
- --error_rate
Base error rate (default: 0.02)
Default:
0.02
- --mutation_rate
Mutation rate (default: 0.001)
Default:
0.001
- --indel_fraction
Fraction of errors that are indels (default: 0.15)
Default:
0.15
- --indel_extended_prob
Probability an indel is extended (default: 0.3)
Default:
0.3
- --random_seed
Random seed for reproducibility (default: current time)
- --sample_name
Sample name for read group (default: ‘sample’)
Default:
'sample'
- --library_name
Library name for read group (default: ‘lib1’)
Default:
'lib1'
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default:
'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default:
'/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default:
False
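The per-sequence coverage string described above (e.g. “chr1:20x chr2:2x”) and its relation to read counts can be sketched as follows. Both helper names are hypothetical, and the pair-count formula is the generic coverage relation (depth × length ÷ bases per pair), assumed here rather than taken from the tool's source.

```python
def parse_coverage_spec(spec):
    """Parse 'chr1:20x chr2:2x' into a dict of per-sequence depths."""
    depths = {}
    for token in spec.split():
        name, sep, depth = token.rpartition(":")
        if not sep or not name:
            raise ValueError(f"expected NAME:DEPTHx, got {token!r}")
        depths[name] = float(depth.rstrip("xX"))
    return depths


def read_pairs_for_depth(seq_length, depth, read_length=150):
    """Number of read pairs needed to reach `depth` over a sequence (generic formula)."""
    # Each pair contributes two reads of read_length bases.
    return round(seq_length * depth / (2 * read_length))


depths = parse_coverage_spec("chr1:20x chr2:2x chr3:0.48x")
pairs = read_pairs_for_depth(30000, depths["chr1"])
print(depths, pairs)
```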