2.3. assembly.py - viral sequence assembly from NGS reads

This script contains a number of utilities for viral sequence assembly
from NGS reads. Primarily used for Lassa and Ebola virus analysis in the Sabeti Lab / Broad Institute Viral Genomics.

usage: assembly.py subcommand

2.3.1. subcommands

[F: Possible choices: assemble_spades, gapfill_gap2seq, cluster_references_ani, skani_contigs_to_refs, order_and_orient, impute_from_reference, refine_assembly, normalize_coverage, filter_short_seqs, modify_contig, vcf_to_fasta, trim_fasta, deambig_fasta, alignment_summary, simulate_illumina_reads

2.3.2. Sub-commands

2.3.2.1. assemble_spades

De novo RNA-seq assembly with the SPAdes assembler.

assembly.py assemble_spades [-h] [--contigsTrusted CONTIGS_TRUSTED]
                            [--contigsUntrusted CONTIGS_UNTRUSTED]
                            [--nReads N_READS] [--kmerSizes KMER_SIZES]
                            [--outReads OUTREADS] [--filterContigs]
                            [--alwaysSucceed] [--minContigLen MIN_CONTIG_LEN]
                            [--spadesOpts SPADES_OPTS]
                            [--memLimitGb MEM_LIMIT_GB] [--threads THREADS]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            in_bam clip_db out_fasta

2.3.2.1.1. Positional Arguments

in_bam: Input unaligned reads, BAM format. May include both paired and unpaired reads.
clip_db: Trimmomatic clip db
out_fasta: Output assembled contigs. Note that, since this is RNA-seq assembly, for each assembled genomic region there may be several contigs representing different variants of that region.

2.3.2.1.2. Named Arguments

--contigsTrusted

Optional input contigs of high quality, previously assembled from the same sample

--contigsUntrusted

Optional input contigs of high medium quality, previously assembled from the same sample

--nReads

Before assembly, subsample the reads to at most this many

Default: 10000000

--kmerSizes

Comma-separated ascending order list of odd-value kmer sizes to attempt

--outReads

Save the trimmomatic/prinseq/subsamp reads to a BAM file

--filterContigs

only output contigs SPAdes is sure of (drop lesser-quality contigs from output)

Default: False

--alwaysSucceed

if assembly fails for any reason, output an empty contigs file, rather than failing with an error code

Default: True

--minContigLen

only output contigs longer than this many bp

Default: 0

--spadesOpts

(advanced) Extra flags to pass to the SPAdes assembler

Default: '--rnaviral'

--memLimitGb

Max memory to use, in GB (default: 4)

Default: 4

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.2. gapfill_gap2seq

This step runs the Gap2Seq tool to close gaps between contigs in a scaffold.

assembly.py gapfill_gap2seq [-h] [--memLimitGb MEM_LIMIT_GB]
                            [--timeSoftLimitMinutes TIME_SOFT_LIMIT_MINUTES]
                            [--maskErrors] [--gap2seqOpts GAP2SEQ_OPTS]
                            [--randomSeed RANDOM_SEED] [--threads THREADS]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            in_scaffold in_bam out_scaffold

2.3.2.2.1. Positional Arguments

in_scaffold: FASTA file containing the scaffold. Each FASTA record corresponds to one segment (for multi-segment genomes). Contigs within each segment are separated by Ns.
in_bam: Input unaligned reads, BAM format.
out_scaffold: Output assemble.

2.3.2.2.2. Named Arguments

--memLimitGb

Max memory to use, in gigabytes 4.0

Default: 4.0

--timeSoftLimitMinutes

Stop trying to close more gaps after this many minutes (default: 60.0); this is a soft/advisory limit

Default: 60.0

--maskErrors

In case of any error, just copy in_scaffold to out_scaffold, emulating a successful run that simply could not fill any gaps.

Default: False

--gap2seqOpts

(advanced) Extra command-line options to pass to Gap2Seq

Default: ''

--randomSeed

Random seed; 0 means use current time

Default: 0

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.3. cluster_references_ani

This step uses the skani triangle tool to define clusters of highly-related genomes.

assembly.py cluster_references_ani [-h] [-m M] [-s S] [-c C] [--min_af MIN_AF]
                                   [--threads THREADS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   inRefs [inRefs ...] outClusters

2.3.2.3.1. Positional Arguments

inRefs: FASTA files containing reference genomes
outClusters: Output file containing clusters of highly-related genomes. Each line contains the filenames of the genomes in one cluster.

2.3.2.3.2. Named Arguments

-m

marker k-mer compression factor (default: 15)

Default: 15

-s

screen out pairs with < percent identity using k-mer sketching (default: 50)

Default: 50

-c

compression factor (k-mer subsampling ratio) (default: 10)

Default: 10

--min_af

minimum alignment fraction (default: 15)

Default: 15

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.4. skani_contigs_to_refs

assembly.py skani_contigs_to_refs [-h] [-m M] [-s S] [-c C] [-n N]
                                  [--min_af MIN_AF] [--threads THREADS]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inContigs inRefs [inRefs ...] out_skani_dist
                                  out_skani_dist_filtered
                                  out_clusters_filtered

2.3.2.4.1. Positional Arguments

inContigs: FASTA file containing contigs
inRefs: FASTA files containing reference genomes
out_skani_dist: Output file containing ANI distances between contigs and references
out_skani_dist_filtered: Output file containing ANI distances between contigs and references, with only the top reference hit per cluster
out_clusters_filtered: Output file containing clusters of highly-related genomes, with only clusters that have a hit to the contigs

2.3.2.4.2. Named Arguments

-m

marker k-mer compression factor (default: 15)

Default: 15

-s

screen out pairs with < percent identity using k-mer sketching (default: 50)

Default: 50

-c

compression factor (k-mer subsampling ratio) (default: 10)

Default: 10

-n

maximum number of hits to report (default: unlimited)

--min_af

minimum alignment fraction (default: 15)

Default: 15

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.5. order_and_orient

This step cleans up the de novo assembly with a known reference genome.
Uses MUMmer (nucmer or promer) to create a reference-based consensus sequence of aligned contigs (with runs of N’s in between the de novo contigs).

assembly.py order_and_orient [-h] [--outAlternateContigs OUTALTERNATECONTIGS]
                             [--nGenomeSegments N_GENOME_SEGMENTS]
                             [--outReference OUTREFERENCE]
                             [--outStats OUTSTATS] [--allow_incomplete_output]
                             [--breaklen BREAKLEN] [--maxgap MAXGAP]
                             [--minmatch MINMATCH] [--mincluster MINCLUSTER]
                             [--min_pct_id MIN_PCT_ID]
                             [--min_contig_len MIN_CONTIG_LEN]
                             [--min_pct_contig_aligned MIN_PCT_CONTIG_ALIGNED]
                             [--threads THREADS]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             inFasta inReference [inReference ...] outFasta

2.3.2.5.1. Positional Arguments

inFasta

Input de novo assembly/contigs, FASTA format.

inReference

Reference genome for ordering, orienting, and merging contigs, FASTA format. Multiple filenames may be listed, each containing one reference genome. Alternatively, multiple references may be given by specifying a single filename, and giving the number of reference segments with the nGenomeSegments parameter. If multiple references are given, they must all contain the same number of segments listed in the same order.

outFasta

Output assembly, FASTA format, with the same number of: chromosomes as inReference, and in the same order.

2.3.2.5.2. Named Arguments

--outAlternateContigs

Output sequences (FASTA format) from alternative contigs that mapped,: but were not chosen for the final output.

--nGenomeSegments

Number of genome segments. If 0 (the default), the inReference parameter is treated as one genome.: If positive, the inReference parameter is treated as a list of genomes of nGenomeSegments each.

Default: 0

--outReference

Output the reference chosen for scaffolding to this file

--outStats

Output stats used in reference selection

--allow_incomplete_output

Do not fail with IncompleteAssemblyError if we fail to recover all segments of the desired genome.

Default: False

--breaklen, -b

Amount to extend alignment clusters by (if –extend).: nucmer default 200, promer default 60.

--maxgap, -g

Maximum gap between two adjacent matches in a cluster.: Our default is 200. nucmer default 90, promer default 30. Manual suggests going to 1000.

Default: 200

--minmatch, -l

Minimum length of an maximal exact match.: Our default is 10. nucmer default 20, promer default 6.

Default: 10

--mincluster, -c

Minimum cluster length.: nucmer default 65, promer default 20.

--min_pct_id, -i

show-tiling: minimum percent identity for contig alignment (0.0 - 1.0, default: 0.6)

Default: 0.6

--min_contig_len

show-tiling: reject contigs smaller than this (default: 200)

Default: 200

--min_pct_contig_aligned, -v

show-tiling: minimum percent of contig length in alignment (0.0 - 1.0, default: 0.3)

Default: 0.3

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.6. impute_from_reference

This takes a de novo assembly, aligns against a reference genome, and imputes all missing positions (plus some of the chromosome ends) with the reference genome. This provides an assembly with the proper structure (but potentially wrong sequences in areas) from which we can perform further read-based refinement. Two steps: filter_short_seqs: We then toss out all assemblies that come out to

< 15kb or < 95% unambiguous and fail otherwise.

modify_contig: Finally, we trim off anything at the end that exceeds
the length of the known reference assemble. We also replace all Ns and everything within 55bp of the chromosome ends with the reference sequence. This is clearly incorrect consensus sequence, but it allows downstream steps to map reads in parts of the genome that would otherwise be Ns, and we will correct all of the inferred positions with two steps of read-based refinement (below), and revert positions back to Ns where read support is lacking.

FASTA indexing: output assembly is indexed for Picard, Samtools, Novoalign.

assembly.py impute_from_reference [-h] [--newName NEWNAME]
                                  [--minLengthFraction MINLENGTHFRACTION]
                                  [--minUnambig MINUNAMBIG]
                                  [--replaceLength REPLACELENGTH]
                                  [--aligner {muscle,mafft,mummer}] [--index]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  inFasta inReference outFasta

2.3.2.6.1. Positional Arguments

inFasta: Input assembly/contigs, FASTA format, already ordered, oriented and merged with inReference.
inReference: Reference genome to impute with, FASTA format.
outFasta: Output assembly, FASTA format.

2.3.2.6.2. Named Arguments

--newName

rename output chromosome (default: do not rename)

--minLengthFraction

minimum length for contig, as fraction of reference (default: 0.5)

Default: 0.5

--minUnambig

minimum percentage unambiguous bases for contig (default: 0.5)

Default: 0.5

--replaceLength

length of ends to be replaced with reference (default: 0)

Default: 0

--aligner

Possible choices: muscle, mafft, mummer

which method to use to align inFasta to: inReference. “muscle” = MUSCLE, “mafft” = MAFFT, “mummer” = nucmer. [default: ‘muscle’]

Default: 'muscle'

--index

Index outFasta for Picard/GATK, Samtools, and Novoalign.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.7. refine_assembly

This a refinement step where we take a crude assembly, align
all reads back to it, and modify the assembly to the majority allele at each position based on read pileups. This step considers both SNPs as well as indels called by FreeBayes and will correct the consensus based on variant calls. Reads are aligned with Novoalign, then PCR duplicates are removed with Picard (in order to debias the allele counts in the pileups). Output FASTA file is indexed for Picard, Samtools, and Novoalign.

assembly.py refine_assembly [-h]
                            [--already_realigned_bam ALREADY_REALIGNED_BAM]
                            [--outBam OUTBAM] [--outVcf OUTVCF]
                            [--min_coverage MIN_COVERAGE]
                            [--major_cutoff MAJOR_CUTOFF]
                            [--novo_params NOVO_PARAMS]
                            [--chr_names [CHR_NAMES ...]] [--keep_all_reads]
                            [--max_coverage MAX_COVERAGE]
                            [--rasusa_seed RASUSA_SEED]
                            [--JVMmemory JVMMEMORY]
                            [--NOVOALIGN_LICENSE_PATH NOVOALIGN_LICENSE_PATH]
                            [--threads THREADS]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            inFasta inBam outFasta

2.3.2.7.1. Positional Arguments

inFasta: Input assembly, FASTA format, pre-indexed for Picard, Samtools, and Novoalign.
inBam: Input reads, unaligned BAM format.
outFasta: Output refined assembly, FASTA format, indexed for Picard, Samtools, and Novoalign.

2.3.2.7.2. Named Arguments

--already_realigned_bam

BAM with reads that are already aligned to inFasta.: This bypasses the alignment process by novoalign and instead uses the given BAM to make an assemble. When set, outBam is ignored.

--outBam

Reads aligned to inFasta. Unaligned and duplicate reads have been removed.

--outVcf

Variant calls for genome in inFasta coordinate space.

--min_coverage

Minimum read coverage required to call a position unambiguous.

Default: 3

--major_cutoff

If the major allele is present at a frequency higher than this cutoff,: we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position. [default: 0.5]

Default: 0.5

--novo_params

Alignment parameters for Novoalign.

Default: '-r Random -l 40 -g 40 -x 20 -t 100'

--chr_names

Rename all output chromosomes (default: retain original chromosome names)

Default: []

--keep_all_reads

Retain all reads in BAM file? Default is to remove unaligned and duplicate reads.

Default: False

--max_coverage

Maximum read coverage depth for variant calling. If specified,: reads are downsampled to this coverage using rasusa before FreeBayes variant calling. The downsampled BAM is internal only and not included in any output. [default: None]

--rasusa_seed

Random seed for rasusa downsampling reproducibility.: Only used when –max_coverage is set. [default: None]

--JVMmemory

JVM virtual memory size for Picard/Novoalign (default: ‘2g’)

Default: '2g'

--NOVOALIGN_LICENSE_PATH

A path to the novoalign.lic file. This overrides the NOVOALIGN_LICENSE_PATH environment variable. (default: None)

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.8. normalize_coverage

Downsample an aligned BAM to a maximum coverage depth using rasusa.: The output BAM is coordinate-sorted and indexed.

assembly.py normalize_coverage [-h] [--seed SEED] [--threads THREADS]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               inBam outBam max_coverage

2.3.2.8.1. Positional Arguments

inBam: Input aligned BAM file, coordinate-sorted.
outBam: Output downsampled BAM file, coordinate-sorted and indexed.
max_coverage: Maximum coverage depth to downsample to.

2.3.2.8.2. Named Arguments

--seed

Random seed for rasusa reproducibility. [default: None]

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.9. filter_short_seqs

Check sequences in inFile, retaining only those that are at least minLength

assembly.py filter_short_seqs [-h] [-f FORMAT] [-of OUTPUT_FORMAT]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version]
                              inFile minLength minUnambig outFile

2.3.2.9.1. Positional Arguments

inFile: input sequence file
minLength: minimum length for contig
minUnambig: minimum percentage unambiguous bases for contig
outFile: output file

2.3.2.9.2. Named Arguments

-f, --format

Format for input sequence (default: ‘fasta’)

Default: 'fasta'

-of, --output-format

Format for output sequence (default: ‘fasta’)

Default: 'fasta'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.3.2.10. modify_contig

Modifies an input contig. Depending on the options
selected, can replace N calls with reference calls, replace ambiguous calls with reference calls, trim to the length of the reference, replace contig ends with reference calls, and trim leading and trailing Ns. Author: rsealfon.

assembly.py modify_contig [-h] [-n NAME] [-cn] [-t] [-r5] [-r3]
                          [-l REPLACE_LENGTH] [-f FORMAT] [-r] [-rn] [-ca]
                          [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                          [--version]
                          input output ref

2.3.2.10.1. Positional Arguments

input: input alignment of reference and contig (should contain exactly 2 sequences)
output: Destination file for modified contigs
ref: reference sequence name (exact match required)

2.3.2.10.2. Named Arguments

-n, --name

fasta header output name (default: existing header)

-cn, --call-reference-ns

should the reference sequence be called if there is an: N in the contig and a more specific base in the reference (default: False)

Default: False

-t, --trim-ends

should ends of contig.fasta be trimmed to length of reference (default: False)

Default: False

-r5, --replace-5ends

should the 5’-end of contig.fasta be replaced by reference (default: False)

Default: False

-r3, --replace-3ends

should the 3’-end of contig.fasta be replaced by reference (default: False)

Default: False

-l, --replace-length

length of ends to be replaced (if replace-ends is yes) (default: 10)

Default: 10

-f, --format

Format for input alignment (default: ‘fasta’)

Default: 'fasta'

-r, --replace-end-gaps

Replace gaps at the beginning and end of the sequence with reference sequence (default: False)

Default: False

-rn, --remove-end-ns

Remove leading and trailing N’s in the contig (default: False)

Default: False

-ca, --call-reference-ambiguous

should the reference sequence be called if the contig seq is ambiguous and: the reference sequence is more informative & consistant with the ambiguous base (ie Y->C) (default: False)

Default: False

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.3.2.11. vcf_to_fasta

Take input genotypes (VCF) and construct a consensus sequence
(fasta) by using majority-read-count alleles in the VCF. Genotypes in the VCF will be ignored–we will use the allele with majority read support (or an ambiguity base if there is no clear majority). Uncalled positions will be emitted as N’s. Author: dpark.

assembly.py vcf_to_fasta [-h] [--trim_ends] [--min_coverage MIN_DP]
                         [--major_cutoff MAJOR_CUTOFF]
                         [--min_dp_ratio MIN_DP_RATIO] [--name [NAME ...]]
                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                         [--version]
                         inVcf outFasta

2.3.2.11.1. Positional Arguments

inVcf: Input VCF file
outFasta: Output FASTA file

2.3.2.11.2. Named Arguments

--trim_ends

If specified, we will strip off continuous runs of N’s from the beginning: and end of the sequences before writing to output. Interior N’s will not be changed.

Default: False

--min_coverage

Specify minimum read coverage (with full agreement) to make a call.: [default: 3]

Default: 3

--major_cutoff

If the major allele is present at a frequency higher than this cutoff,: we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position. [default: 0.5]

Default: 0.5

--min_dp_ratio

The input VCF file often reports two read depth values (DP)–one for: the position as a whole, and one for the sample in question. We can optionally reject calls in which the sample read count is below a specified fraction of the total read count. This filter will not apply to any sites unless both DP values are reported. [default: 0.0]

Default: 0.0

--name

output sequence names (default: reference names in VCF file)

Default: []

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.3.2.12. trim_fasta

Take input sequences (fasta) and trim any continuous sections of
N’s from the ends of them. Write trimmed sequences to an output fasta file.

assembly.py trim_fasta [-h]
                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                       [--version]
                       inFasta outFasta

2.3.2.12.1. Positional Arguments

inFasta: Input fasta file
outFasta: Output (trimmed) fasta file

2.3.2.12.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.3.2.13. deambig_fasta

Take input sequences (fasta) and replace any ambiguity bases with a
random unambiguous base from among the possibilities described by the ambiguity code. Write output to fasta file.

assembly.py deambig_fasta [-h]
                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                          [--version]
                          inFasta outFasta

2.3.2.13.1. Positional Arguments

inFasta: Input fasta file
outFasta: Output fasta file

2.3.2.13.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.3.2.14. alignment_summary

Write or print pairwise alignment summary information for sequences in two FASTA
files, including SNPs, ambiguous bases, and indels.

assembly.py alignment_summary [-h] [--outfileName OUTFILENAME] [--printCounts]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              inFastaFileOne inFastaFileTwo

2.3.2.14.1. Positional Arguments

inFastaFileOne: First fasta file for an alignment
inFastaFileTwo: First fasta file for an alignment

2.3.2.14.2. Named Arguments

--outfileName

Output file for counts in TSV format

--printCounts

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.3.2.15. simulate_illumina_reads

Simulate Illumina paired-end reads using wgsim.

Args:
in_fasta: Input reference fasta file out_bam: Output unaligned BAM file coverage: Coverage specification in one of three formats:

Single float value (e.g., 20.0) for uniform coverage across all sequences

Space-separated string with per-sequence depths (e.g., “chr1:20x chr2:2x”)

Path to BED file with 5th column containing depth values

read_length: Length of each simulated read (default: 150) outer_distance: Outer distance between read pairs (default: 500) error_rate: Base error rate (default: 0.02) mutation_rate: Mutation rate (default: 0.001) indel_fraction: Fraction of errors that are indels (default: 0.15) indel_extended_prob: Probability an indel is extended (default: 0.3) random_seed: Random seed for reproducibility sample_name: Sample name for read group library_name: Library name for read group

assembly.py simulate_illumina_reads [-h] [--read_length READ_LENGTH]
                                    [--outer_distance OUTER_DISTANCE]
                                    [--error_rate ERROR_RATE]
                                    [--mutation_rate MUTATION_RATE]
                                    [--indel_fraction INDEL_FRACTION]
                                    [--indel_extended_prob INDEL_EXTENDED_PROB]
                                    [--random_seed RANDOM_SEED]
                                    [--sample_name SAMPLE_NAME]
                                    [--library_name LIBRARY_NAME]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    in_fasta out_bam coverage

2.3.2.15.1. Positional Arguments

in_fasta

Input reference genome (fasta format)

out_bam

Output unaligned BAM file

coverage

Coverage specification in one of three formats:

Single value (e.g., “20” for 20X across all sequences)
Per-sequence string (e.g., “chr1:20x chr2:2x chr3:0.48x”)
Path to BED file where 5th column contains depth values

2.3.2.15.2. Named Arguments

--read_length

Length of each read (default: 150)

Default: 150

--outer_distance

Outer distance between read pairs (default: 500)

Default: 500

--error_rate

Base error rate (default: 0.02)

Default: 0.02

--mutation_rate

Mutation rate (default: 0.001)

Default: 0.001

--indel_fraction

Fraction of errors that are indels (default: 0.15)

Default: 0.15

--indel_extended_prob

Probability an indel is extended (default: 0.3)

Default: 0.3

--random_seed

Random seed for reproducibility (default: current time)

--sample_name

Sample name for read group (default: ‘sample’)

Default: 'sample'

--library_name

Library name for read group (default: ‘lib1’)

Default: 'lib1'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while: running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False