3.1. metagenomics.py - metagenomic analyses

This script contains a number of utilities for metagenomic analyses.

usage: metagenomics.py subcommand

Sub-commands:

subset_taxonomy

Generate a subset of the taxonomy db files, filtered by a whitelist. The whitelist taxids name specific taxids to add to the taxonomy together with their parents, while whitelistTreeTaxids name specific taxids to add together with both their parents and all of their child taxa. Whitelisted GIs and accessions can only be provided in file form; the resulting gi/accession2taxid files are filtered to include only the entries in the whitelist files. Finally, the taxids (plus parents) for those GIs/accessions are also included.

usage: metagenomics.py subset_taxonomy [-h]
                                       [--whitelistTaxids WHITELISTTAXIDS [WHITELISTTAXIDS ...]]
                                       [--whitelistTaxidFile WHITELISTTAXIDFILE]
                                       [--whitelistTreeTaxids WHITELISTTREETAXIDS [WHITELISTTREETAXIDS ...]]
                                       [--whitelistTreeTaxidFile WHITELISTTREETAXIDFILE]
                                       [--whitelistGiFile WHITELISTGIFILE]
                                       [--whitelistAccessionFile WHITELISTACCESSIONFILE]
                                       [--skipGi] [--skipAccession]
                                       [--skipDeadAccession]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       taxDb outputDb

Positional arguments:

`taxDb`	Taxonomy database directory (containing nodes.dmp, names.dmp etc.)
`outputDb`	Output taxonomy database directory

Options:

`--whitelistTaxids`
	List of taxids to add to taxonomy (with parents)
`--whitelistTaxidFile`
	File containing taxids - one per line - to add to taxonomy with parents.
`--whitelistTreeTaxids`
	List of taxids to add to taxonomy (with parents and children)
`--whitelistTreeTaxidFile`
	File containing taxids - one per line - to add to taxonomy with parents and children.
`--whitelistGiFile`
	File containing GIs - one per line - to add to taxonomy with nodes.
`--whitelistAccessionFile`
	File containing accessions - one per line - to add to taxonomy with nodes.
`--skipGi=False`	Skip GI to taxid mapping files
`--skipAccession=False`
	Skip accession to taxid mapping files
`--skipDeadAccession=False`
	Skip dead accession to taxid mapping files
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
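
The difference between the two whitelist modes can be illustrated on a toy taxonomy. This is a minimal sketch rather than the script's own code: the `PARENTS` map, its taxids, and both function names are hypothetical, and the root is assumed to be a node that is its own parent.

```python
# Toy taxonomy as child -> parent. The taxids are made up for illustration;
# taxid 1 is assumed to be the root and is listed as its own parent.
PARENTS = {1: 1, 10: 1, 20: 10, 21: 10, 30: 20, 31: 20}


def with_parents(taxids, parents=PARENTS):
    """Taxids plus every ancestor up to the root (the --whitelistTaxids idea)."""
    keep = set()
    for taxid in taxids:
        # Walk upward until we reach a node that is already kept.
        while taxid not in keep:
            keep.add(taxid)
            taxid = parents[taxid]
    return keep


def with_parents_and_children(taxids, parents=PARENTS):
    """Taxids plus ancestors and all descendants (the --whitelistTreeTaxids idea)."""
    # Invert the child -> parent map so we can walk downward.
    children = {}
    for child, parent in parents.items():
        if child != parent:
            children.setdefault(parent, []).append(child)
    keep = with_parents(taxids, parents)
    stack = list(taxids)
    while stack:
        taxid = stack.pop()
        keep.add(taxid)
        stack.extend(children.get(taxid, []))
    return keep
```

With this toy map, whitelisting taxid 20 in the first mode keeps 20, 10 and 1, while the tree mode additionally keeps its children 30 and 31. The sibling 21 is kept in neither case.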

kraken

Classify reads by taxon using Kraken

usage: metagenomics.py kraken [-h] [--outReports OUTREPORTS [OUTREPORTS ...]]
                              [--outReads OUTREADS [OUTREADS ...]]
                              [--lockMemory]
                              [--filterThreshold FILTERTHRESHOLD]
                              [--threads THREADS]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              db inBams [inBams ...]

Positional arguments:

`db`	Kraken database directory.
`inBams`	Input unaligned reads, BAM format.

Options:

`--outReports`	Kraken summary report output file. Multiple filenames space separated.
`--outReads`	Kraken per read classification output file. Multiple filenames space separated.
`--lockMemory=False`
	Lock kraken database in RAM. Requires high ulimit -l.
`--filterThreshold=0.05`
	Kraken filter threshold (default: 0.05)
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
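
When the sub-command is driven from another Python program, the argument list can be assembled as below. This is only a sketch: the `kraken_command` helper and all file names are hypothetical, and it assumes that the report and read outputs are listed in the same order as the input BAMs. Only options shown in the usage line above are used.

```python
def kraken_command(db, in_bams, out_reports=(), out_reads=(), threads=None):
    """Build an argument list for `metagenomics.py kraken` (nothing is executed)."""
    # Positionals go first: options that accept several values would otherwise
    # swallow the positional arguments that follow them.
    cmd = ["metagenomics.py", "kraken", db, *in_bams]
    if out_reports:
        cmd += ["--outReports", *out_reports]
    if out_reads:
        cmd += ["--outReads", *out_reads]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Hypothetical inputs and outputs, one report per BAM.
cmd = kraken_command(
    "kraken_db",
    ["sample1.bam", "sample2.bam"],
    out_reports=["sample1.report.txt", "sample2.report.txt"],
    threads=4,
)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where the script is installed and on the `PATH`.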

krona

Create an interactive HTML report from a tabular metagenomic report

usage: metagenomics.py krona [-h] [--queryColumn QUERYCOLUMN]
                             [--taxidColumn TAXIDCOLUMN]
                             [--scoreColumn SCORECOLUMN]
                             [--magnitudeColumn MAGNITUDECOLUMN] [--noHits]
                             [--noRank]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version]
                             inTsv db outHtml

Positional arguments:

`inTsv`	Input tab delimited file.
`db`	Krona taxonomy database directory.
`outHtml`	Output html report.

Options:

`--queryColumn=2`
	Column of query id. (default: 2)
`--taxidColumn=3`
	Column of taxonomy id. (default: 3)
`--scoreColumn`	Column of score.
`--magnitudeColumn`
	Column of magnitude.
`--noHits=False`	Include wedge for no hits.
`--noRank=False`	Include no rank assignments.
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
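
The column options are easiest to read against a concrete input line. The sketch below assumes that column numbers are 1-based and that the input resembles Kraken per-read output; the sample line and the `query_and_taxid` helper are hypothetical and not part of the script.

```python
def query_and_taxid(line, query_column=2, taxid_column=3):
    """Pick the query id and taxonomy id out of one tab-delimited line.

    Column numbers are assumed to be 1-based, matching the defaults
    --queryColumn=2 and --taxidColumn=3.
    """
    fields = line.rstrip("\n").split("\t")
    return fields[query_column - 1], int(fields[taxid_column - 1])


# A made-up line: status, read id, taxid, read length, k-mer hits.
sample_line = "C\tread_0001\t11320\t101\t11320:71\n"
query, taxid = query_and_taxid(sample_line)
```

Here `query` is the read id from the second column and `taxid` is the integer from the third, which is the pairing the default column settings describe.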

diamond

Classify reads by the taxon of the Lowest Common Ancestor (LCA)

usage: metagenomics.py diamond [-h] [--outReads OUTREADS] [--threads THREADS]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               inBam db taxDb outReport

Positional arguments:

`inBam`	Input unaligned reads, BAM format.
`db`	Diamond database directory.
`taxDb`	Taxonomy database directory.
`outReport`	Output taxonomy report.

Options:

`--outReads`	Output LCA assignments for each read.
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
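
For readers unfamiliar with Lowest Common Ancestor (LCA) assignment, the idea is that a read with hits to several taxa is assigned to the deepest taxon shared by all of them. The following is a generic sketch of that idea on a toy child-to-parent map, not the script's own logic; any scoring or filtering the script applies before taking the LCA is not described here. The `PARENTS` map and function names are hypothetical.

```python
# Toy taxonomy as child -> parent; the root (1) is its own parent.
PARENTS = {1: 1, 10: 1, 20: 10, 21: 10, 30: 20, 31: 20}


def lineage(taxid, parents):
    """Path from a taxid up to the root, starting with the taxid itself."""
    path = [taxid]
    while parents[taxid] != taxid:
        taxid = parents[taxid]
        path.append(taxid)
    return path


def lowest_common_ancestor(taxids, parents=PARENTS):
    """Deepest taxon that lies on the lineage of every given taxid."""
    taxids = list(taxids)
    common = set(lineage(taxids[0], parents))
    for taxid in taxids[1:]:
        common &= set(lineage(taxid, parents))
    # Walking up from the first hit, the first shared node is the deepest one.
    for node in lineage(taxids[0], parents):
        if node in common:
            return node
```

In the toy map, hits to 30 and 31 resolve to their shared parent 20, whereas hits to 30 and 21 only agree at 10.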

diamond_fasta

Classify fasta sequences by the taxon of the Lowest Common Ancestor (LCA)

usage: metagenomics.py diamond_fasta [-h] [--memLimitGb MEMLIMITGB]
                                     [--threads THREADS]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     inFasta db taxDb outFasta

Positional arguments:

`inFasta`	Input sequences, FASTA format, optionally gzip compressed.
`db`	Diamond database file.
`taxDb`	Taxonomy database directory.
`outFasta`	Output sequences, same as inFasta, with taxid|###| prepended to each sequence identifier.

Options:

`--memLimitGb`	approximate memory usage in GB
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
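
The effect on the output headers can be pictured as follows. This sketch assumes the prefix is inserted directly after the `>` of each FASTA header, which is one reading of the `outFasta` description above; the `tag_header` helper and the example header are hypothetical.

```python
def tag_header(header, taxid):
    """Prepend 'taxid|<taxid>|' to the identifier of a FASTA header line."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    return ">taxid|{}|{}".format(taxid, header[1:])


# Hypothetical contig header and taxid.
tagged = tag_header(">contig_17 length=2048", 11320)
```

After tagging, the example header reads `>taxid|11320|contig_17 length=2048`, so the original identifier and description are still recoverable by stripping the prefix.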

build_diamond_db

Build a Diamond database from one or more protein fasta files

usage: metagenomics.py build_diamond_db [-h] [--threads THREADS]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmp_dir TMP_DIR]
                                        [--tmp_dirKeep]
                                        protein_fastas [protein_fastas ...] db

Positional arguments:

`protein_fastas`	Input protein fasta files
`db`	Output Diamond database file

Options:

`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

align_rna

Align to metagenomics bwa index, mark duplicates, and generate LCA report

usage: metagenomics.py align_rna [-h] [--dupeReport DUPEREPORT] [--sensitive]
                                 [--outBam OUTBAM] [--outReads OUTREADS]
                                 [--dupeReads DUPEREADS]
                                 [--JVMmemory JVMMEMORY] [--threads THREADS]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmp_dir TMP_DIR]
                                 [--tmp_dirKeep]
                                 inBam db taxDb outReport

Positional arguments:

`inBam`	Input unaligned reads, BAM format.
`db`	Bwa index prefix.
`taxDb`	Taxonomy database directory.
`outReport`	Output taxonomy report.

Options:

`--dupeReport`	Generate report including duplicates.
`--sensitive=False`
	Use sensitive instead of default BWA mem options.
`--outBam`	Output aligned, indexed BAM file. Default is to write to temp.
`--outReads`	Output LCA assignments for each read.
`--dupeReads`	Output LCA assignments for each read including duplicates.
`--JVMmemory=2g`	JVM virtual memory size (default: 2g)
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

report_merge

Merge multiple metagenomic reports into a single metagenomic report. Any Krona input files created by this

usage: metagenomics.py report_merge [-h]
                                    [--outSummaryReport OUT_KRAKEN_SUMMARY]
                                    [--krakenDB KRAKEN_DB]
                                    [--outByQueryToTaxonID OUT_KRONA_INPUT]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    metagenomic_reports
                                    [metagenomic_reports ...]

Positional arguments:

`metagenomic_reports`
	Input metagenomic reports with the query ID and taxon ID in the 2nd and 3rd columns (Kraken format)

Options:

`--outSummaryReport`
	Path of human-readable metagenomic summary report, created by kraken-report
`--krakenDB`	Kraken database (needed for outSummaryReport)
`--outByQueryToTaxonID`
	Output metagenomic report suitable for Krona input.
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

taxlevel_summary

Aggregates taxonomic abundance data from multiple Kraken-format summary files. It is intended to report information on a particular taxonomic level (--taxlevelFocus; e.g. ‘species’) within a higher-level grouping (--taxHeading; e.g. ‘Viruses’). By default, when --taxHeading is at the same level as --taxlevelFocus, a summary with lines for each sample is emitted. Otherwise, a histogram is returned. If per-sample information is desired, --noHist can be specified. In per-sample data, the suffix “-pt” indicates a percentage, so a value of 0.02 is 0.0002 of the total number of reads for the sample. If --topN is specified, only the top N most abundant taxa are included in the histogram count or per-sample output. If a number is specified for --countThreshold, only taxa with at least that number of reads are included. Full data is returned via --jsonOut (filtered by --topN and --countThreshold), whereas --csvOut returns a summary.

usage: metagenomics.py taxlevel_summary [-h] [--jsonOut JSON_OUT]
                                        [--csvOut CSV_OUT]
                                        [--taxHeading TAX_HEADINGS [TAX_HEADINGS ...]]
                                        [--taxlevelFocus {species,genus,family,order,class,phylum,kingdom,superkingdom}]
                                        [--topN TOP_N_ENTRIES]
                                        [--countThreshold COUNT_THRESHOLD]
                                        [--zeroFill] [--noHist]
                                        [--includeRoot]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmp_dir TMP_DIR]
                                        [--tmp_dirKeep]
                                        summary_files_in
                                        [summary_files_in ...]

Positional arguments:

`summary_files_in`
	Kraken-format summary text file with tab-delimited taxonomic levels.

Options:

`--jsonOut`	The path to a json file containing the relevant parsed summary data in json format.
`--csvOut`	The path to a csv file containing sample-specific counts.
`--taxHeading=Viruses`
	The taxonomic heading to analyze (default: Viruses). More than one can be specified.
`--taxlevelFocus=species`
	The taxonomic level to summarize (totals by genus, etc.) (default: species). Possible choices: species, genus, family, order, class, phylum, kingdom, superkingdom
`--topN=100`	Only include the top N most abundant taxa by read count (default: 100)
`--countThreshold=1`
	Minimum number of reads to be included (default: 1)
`--zeroFill=False`
	When absent from a sample, write zeroes (rather than leaving blank).
`--noHist=False`	Write out a report by-sample rather than a histogram.
`--includeRoot=False`
	Include the count of reads at the root level and the unclassified bin.
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
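
The interaction of the read-count threshold, the top-N cut and the “-pt” percentage columns can be sketched as follows. This is an illustration under stated assumptions, not the script's code: the helper names and taxon counts are hypothetical, and the threshold is assumed to be applied before the top-N cut.

```python
def filter_taxa(counts, top_n=100, count_threshold=1):
    """Keep taxa with at least count_threshold reads, then the top_n by count.

    Applying the threshold before the top-N cut is an assumption of this sketch.
    """
    kept = [(name, n) for name, n in counts.items() if n >= count_threshold]
    # Sort by descending read count; ties are broken by name for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:top_n]


def percent_of_total(count, total_reads):
    """Value for a '-pt' column: reads for a taxon as a percentage of all reads."""
    return 100.0 * count / total_reads


# Made-up read counts for one sample.
counts = {"Taxon A": 5, "Taxon B": 1, "Taxon C": 9, "Taxon D": 0}
top_two = filter_taxa(counts, top_n=2, count_threshold=1)
```

With these counts, `top_two` holds Taxon C and Taxon A, and Taxon D is dropped by the threshold. As a check on the percentage convention, 2 reads out of 10,000 give a “-pt” value of 0.02, which is a fraction of 0.0002.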

kraken_build

Builds a kraken database from a library directory of fastas and a taxonomy db directory. The --subsetTaxonomy option allows shrinking the taxonomy to only include taxids associated with the library folders. For this to work, the library fastas must have standard id names such as `>NC1234.1` accessions, `>gi|123456789|ref|XXXX||`, or the custom kraken name `>kraken:taxid|1234|`. Setting --minimizerLen (kraken default: 15) to a small value, such as 10, will drastically shrink the db size for small inputs, which is useful for testing. The built db may include symlinks to the original --library / --taxonomy directories. To build a static, archivable version of the library, use the --clean option, which will also remove any unnecessary files.

usage: metagenomics.py kraken_build [-h] [--library LIBRARY]
                                    [--taxonomy TAXONOMY] [--subsetTaxonomy]
                                    [--minimizerLen MINIMIZERLEN]
                                    [--kmerLen KMERLEN]
                                    [--maxDbSize MAXDBSIZE] [--clean]
                                    [--workOnDisk] [--threads THREADS]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    db

Positional arguments:

`db`	Kraken database output directory.

Options:

`--library`	Input library directory of fasta files. If not specified, it will be read from the “library” subdirectory of “db”.
`--taxonomy`	Taxonomy db directory. If not specified, it will be read from the “taxonomy” subdirectory of “db”.
`--subsetTaxonomy=False`
	Subset taxonomy based on library fastas.
`--minimizerLen`	Minimizer length (kraken default: 15)
`--kmerLen`	k-mer length (kraken default: 31)
`--maxDbSize`	Maximum db size in GB (will shrink if too big)
`--clean=False`	Clean by deleting other database files after build
`--workOnDisk=False`
	Work on disk instead of RAM. This is generally much slower unless the “db” directory lives on a RAM disk.
`--threads`	Number of threads (default: all available cores)
`--loglevel=INFO`
	Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
`--version, -V`	show program’s version number and exit
`--tmp_dir=/tmp`	Base directory for temp files. [default: /tmp]
`--tmp_dirKeep=False`
	Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
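
Before running with --subsetTaxonomy, it can be useful to check that the library headers follow one of the three id styles named in the description. The patterns below are a rough sketch written for this page; they are assumptions about what those styles look like, not the rules kraken or the script actually apply, and `header_id` is a hypothetical helper.

```python
import re

# Assumed patterns for the three header styles mentioned in the description.
PATTERNS = (
    ("taxid", re.compile(r"^>kraken:taxid\|(\d+)\|")),
    ("gi", re.compile(r"^>gi\|(\d+)\|")),
    # Letters/underscores, digits, a dot, then a version number.
    ("accession", re.compile(r"^>([A-Za-z_]+\d+\.\d+)")),
)


def header_id(header):
    """Return (kind, value) for a recognized FASTA header, else None."""
    for kind, pattern in PATTERNS:
        match = pattern.match(header)
        if match:
            return kind, match.group(1)
    return None
```

A header that returns `None` here is one to inspect by hand, since its taxid may not be resolvable from the mapping files.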