1. Description of the methods

Much more documentation to come...

TO DO: this section will give a high-level description of the tools that exist here, perhaps with some figures. It will explain why certain tools and approaches were chosen, where other approaches fell short, and what kinds of problems each step is trying to solve, with links to relevant papers. In effect, a mini methods paper.

1.1. Taxonomic read filtration

1.1.1. Human, contaminant, and duplicate read removal

BMTAGGER

BLAST

M-Vicuna

1.1.2. Taxonomic selection

LASTAL

1.2. Viral genome analysis

1.2.1. Viral genome assembly

De novo genome assembly is performed with Trinity. Reference-assisted assembly improvements (scaffolding, orienting, etc.) are made with VFAT, which relies on MUSCLE.

The assembly then goes through two rounds of improvement using Novoalign and GATK.

1.2.2. Intrahost variant identification

Intrahost variants (iSNVs) are identified from deep sequence coverage using V-Phaser2. For each sample, reads are first aligned to their own consensus genome with Novoalign, followed by duplicate read removal with Picard and local realignment with GATK IndelRealigner. V-Phaser2 is called on each sample to produce a set of iSNV calls.
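
To make the idea of an intrahost variant concrete, here is a minimal Python sketch that tallies per-position allele frequencies from reads aligned to a sample's own consensus. It is a toy illustration only and is not V-Phaser2's algorithm: the function name `allele_frequencies`, the `(start, sequence)` read representation, and the gapless-alignment assumption are all hypothetical simplifications.

```python
from collections import Counter


def allele_frequencies(aligned_reads, min_depth=1):
    """Return {position: {base: frequency}} for positions with more than one allele.

    aligned_reads: list of (start, sequence) tuples, where start is the 0-based
    position of the read on the consensus. Alignments are assumed gapless
    (a toy simplification; real aligners emit indels and clipping).
    """
    # Count the bases observed in each consensus column.
    columns = {}
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            columns.setdefault(start + offset, Counter())[base] += 1

    # Keep only columns that are deep enough and show more than one allele.
    variants = {}
    for pos, counts in sorted(columns.items()):
        depth = sum(counts.values())
        if depth >= min_depth and len(counts) > 1:
            variants[pos] = {base: n / depth for base, n in counts.items()}
    return variants


reads = [(0, "ACGT"), (0, "ACGT"), (1, "CGTA"), (1, "CTTA")]
print(allele_frequencies(reads))  # {2: {'G': 0.75, 'T': 0.25}}
```

A real caller must also separate true low-frequency variants from sequencing error, which is why the pipeline relies on a dedicated tool and deep coverage rather than raw counts like these.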

(To be written: the strand bias filter, followed by library counts.)
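
The strand bias filter is not yet described here. As a rough illustration of what such a filter can look like, the sketch below keeps an allele only if it is seen on both strands and the forward and reverse counts are not too lopsided. The function name and both thresholds (`min_reads_each`, `max_bias`) are assumptions chosen for the example, not documented settings of this pipeline.

```python
def passes_strand_filter(fwd, rev, min_reads_each=5, max_bias=10.0):
    """Return True if an allele's read support looks balanced across strands.

    fwd, rev: number of forward- and reverse-strand reads supporting the allele.
    min_reads_each: hypothetical minimum support required on each strand.
    max_bias: hypothetical maximum allowed ratio between the two strand counts.
    """
    # Require support on both strands.
    if fwd < min_reads_each or rev < min_reads_each:
        return False
    # Guard against division by zero when min_reads_each is set to 0.
    if min(fwd, rev) == 0:
        return False
    # Reject alleles supported overwhelmingly by one strand.
    return max(fwd, rev) / min(fwd, rev) <= max_bias


print(passes_strand_filter(40, 35))   # True
print(passes_strand_filter(100, 2))   # False
```

Alleles supported almost entirely by one strand are a common signature of sequencing or alignment artifacts, which is the motivation for filtering on this kind of criterion.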

(To be written: remapping all calls back to the reference assembly’s coordinate space and alleles using MUSCLE, merging calls across all samples, and emitting them in VCF format.)
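
The coordinate remapping step can be pictured as follows: given a pairwise alignment of a sample consensus against the reference (the kind of gapped output an aligner such as MUSCLE produces), walk the columns and record which reference position each sample position lines up with. This is a minimal sketch under that assumption. The function name `coordinate_map` and the choice to map insertions to `None` are hypothetical, and it does not reproduce how the pipeline actually handles alleles or indels.

```python
def coordinate_map(aligned_sample, aligned_ref):
    """Map 1-based sample positions to 1-based reference positions.

    Both arguments are gapped sequences of equal length taken from one
    pairwise alignment, with '-' marking gaps. Sample bases aligned to a
    gap in the reference (insertions relative to the reference) map to None.
    """
    if len(aligned_sample) != len(aligned_ref):
        raise ValueError("aligned sequences must have the same length")

    mapping = {}
    sample_pos = ref_pos = 0
    for sample_base, ref_base in zip(aligned_sample, aligned_ref):
        if ref_base != "-":
            ref_pos += 1
        if sample_base != "-":
            sample_pos += 1
            mapping[sample_pos] = ref_pos if ref_base != "-" else None
    return mapping


sample = "ACGT--TA"
ref = "AC-TGGTA"
print(coordinate_map(sample, ref))
# {1: 1, 2: 2, 3: None, 4: 3, 5: 6, 6: 7}
```

With such a map, a call at sample position 5 would be reported at reference position 6, which is what allows calls from different samples to be merged in a shared coordinate space.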

iSNVs are then annotated with snpEff and provided in both VCF and tabular text formats.

1.3. Taxonomic read identification

Nothing here at the moment. This functionality will be integrated once it is ready.