1.2.1. Viral genome assembly
The filtered and trimmed reads are subsampled to at most 100,000 pairs.
De novo assembly is performed using Trinity. SPAdes is also offered as
an alternative de novo assembler.
Reference-assisted assembly improvements follow (contig scaffolding, orienting, etc.)
with MUMmer and MUSCLE or MAFFT. Gap2Seq is used to seal gaps between scaffolded
de novo contigs with sequencing reads.
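The subsampling step at the start of this stage can be illustrated with a minimal sketch. It assumes the read pairs are already loaded in memory as a list and that pairs are drawn uniformly at random without replacement; the function name `subsample_pairs` and the `seed` parameter are hypothetical and are not part of the pipeline itself.

```python
import random


def subsample_pairs(pairs, max_pairs=100_000, seed=0):
    """Return at most max_pairs read pairs, keeping mates together.

    Assumption: uniform sampling without replacement. The pipeline's
    actual sampling strategy may differ.
    """
    pairs = list(pairs)
    if len(pairs) <= max_pairs:
        # Nothing to do: the sample is already small enough.
        return pairs
    rng = random.Random(seed)
    # Sample indices, then sort them so the original read order is kept.
    keep = sorted(rng.sample(range(len(pairs)), max_pairs))
    return [pairs[i] for i in keep]
```

Sampling whole pairs, rather than individual reads, keeps each forward read with its mate, which paired-end assemblers expect.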
Each sample’s reads are aligned to its de novo assembly using Novoalign,
and any remaining duplicates are removed using Picard MarkDuplicates.
Variant positions in each assembly are identified using GATK IndelRealigner and
UnifiedGenotyper on the read alignments. The assembly is refined to represent the
major allele at each variant site, and any positions supported by fewer than three
reads are changed to N.
This align-call-refine cycle is run twice to minimize reference bias in the assembly.
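The refinement rule (major allele at each site, N below three supporting reads) can be sketched as follows. This is a simplified illustration, not the pipeline's implementation: it assumes the alignment has already been reduced to one string of observed bases per position, it ignores indels, and the name `refine_consensus` is hypothetical.

```python
from collections import Counter


def refine_consensus(pileup, min_depth=3):
    """Build a consensus string from per-position base observations.

    pileup: list where element i holds the bases observed at position i.
    Positions with fewer than min_depth reads become "N"; otherwise the
    most frequent base (the major allele) is used.
    """
    consensus = []
    for bases in pileup:
        if len(bases) < min_depth:
            consensus.append("N")
            continue
        # On a tie, Counter.most_common keeps the first base encountered.
        major_allele = Counter(bases).most_common(1)[0][0]
        consensus.append(major_allele)
    return "".join(consensus)


# Depths 4, 2, 3 and 0: the second and fourth positions are masked.
print(refine_consensus(["AAAA", "AC", "GGT", ""]))  # ANGN
```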
1.2.2. Intrahost variant identification
Intrahost variants (iSNVs) were called from each sample’s read alignments using
and subjected to an initial set of filters:
variant calls with fewer than five forward or reverse reads
or more than a 10-fold strand bias were eliminated.
iSNVs were also removed if there was more than a five-fold difference
between the strand bias of the variant call and the strand bias of the reference call.
Variant calls that passed these filters were additionally subjected
to a 0.5% frequency filter.
The final list of iSNVs contains only variant calls that passed all filters in two
separate library preparations.
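The filters above can be expressed as a small predicate. The sketch below rests on several assumptions that the text does not spell out: strand bias is taken as the fold difference between forward and reverse read counts, the read-count filter is applied to each strand separately, the 0.5% filter is treated as a minimum allele frequency, and only one variant allele and one reference allele are considered per site. All function and parameter names are hypothetical.

```python
def fold(a, b):
    """Fold difference between two positive numbers (always >= 1)."""
    return max(a, b) / min(a, b)


def passes_filters(var_fwd, var_rev, ref_fwd, ref_rev,
                   min_strand_reads=5, max_bias=10.0,
                   max_bias_diff=5.0, min_freq=0.005):
    """Apply the read-count, strand-bias and frequency filters to one call."""
    # Fewer than five reads on either strand.
    if var_fwd < min_strand_reads or var_rev < min_strand_reads:
        return False
    # More than a 10-fold strand bias for the variant allele.
    if fold(var_fwd, var_rev) > max_bias:
        return False
    # More than a five-fold difference between the variant and reference
    # strand ratios (checked only if the reference has reads on both strands).
    if ref_fwd > 0 and ref_rev > 0:
        if fold(var_fwd / var_rev, ref_fwd / ref_rev) > max_bias_diff:
            return False
    # Minimum allele frequency of 0.5%.
    total = var_fwd + var_rev + ref_fwd + ref_rev
    return (var_fwd + var_rev) / total >= min_freq


def concordant_calls(library_a, library_b):
    """Keep only calls that passed all filters in both library preparations."""
    return set(library_a) & set(library_b)
```

For example, `passes_filters(20, 15, 500, 450)` accepts a call with balanced strand support, whereas `passes_filters(60, 5, 500, 450)` rejects one with a 12-fold strand bias.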
These files infer 100% allele frequencies for all samples at an iSNV position where
there was no intrahost variation within the sample, but a clear consensus call during
assembly. Annotations are computed with snpEff.
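The 100% allele frequency inference described above can be sketched for a single sample as follows. The data layout is an assumption made for illustration: iSNV calls and consensus bases are held in dictionaries keyed by position, and an N in the consensus is treated as "no clear consensus call". The name `fill_consensus_frequencies` is hypothetical.

```python
def fill_consensus_frequencies(isnv_positions, sample_calls, consensus):
    """Return allele frequencies for one sample at every iSNV position.

    isnv_positions: positions where at least one sample has an iSNV.
    sample_calls:   {position: {allele: frequency}} for this sample.
    consensus:      {position: base} from this sample's assembly.
    """
    filled = {}
    for pos in isnv_positions:
        if pos in sample_calls:
            # The sample has its own intrahost variant call: keep it.
            filled[pos] = dict(sample_calls[pos])
        elif consensus.get(pos, "N") != "N":
            # No intrahost variation, but a clear consensus base.
            filled[pos] = {consensus[pos]: 1.0}
        # Otherwise the position is left out (no call can be inferred).
    return filled
```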