3.1. metagenomics.py - metagenomic analyses

This script contains a number of utilities for metagenomic analyses.

usage: metagenomics.py subcommand
Sub-commands:
subset_taxonomy

Generate a subset of the taxonomy db files filtered by the whitelist. The whitelist taxids (--whitelistTaxids) name specific taxids to add to the taxonomy along with their parents, while --whitelistTreeTaxids name specific taxids to add along with both their parents and all of their child taxa. Whitelist GIs and accessions can only be provided in file form, and the resulting gi/accession2taxid files are filtered to include only the entries in the whitelist files. Finally, the taxids (plus parents) of those GIs/accessions are also included.
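The difference between the two whitelist kinds can be illustrated on a toy taxonomy. This is a minimal sketch, not the script's implementation: the `parents` mapping, the taxids, and the helper names are all hypothetical, standing in for what would be read from nodes.dmp.

```python
# Toy child -> parent map with hypothetical taxids. The root is its own
# parent, as in NCBI's nodes.dmp.
parents = {1: 1, 10: 1, 20: 10, 21: 10, 30: 20, 31: 20}


def with_parents(taxid, parents):
    """Return taxid plus all of its ancestors up to the root."""
    lineage = {taxid}
    while parents[taxid] != taxid:
        taxid = parents[taxid]
        lineage.add(taxid)
    return lineage


def with_children(taxid, parents):
    """Return taxid plus every descendant taxid."""
    keep = {taxid}
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in keep and child not in keep:
                keep.add(child)
                changed = True
    return keep


def subset(whitelist, tree_whitelist, parents):
    """Collect the taxids kept by the two whitelist kinds."""
    keep = set()
    for taxid in whitelist:
        keep |= with_parents(taxid, parents)
    for taxid in tree_whitelist:
        keep |= with_parents(taxid, parents) | with_children(taxid, parents)
    return keep


plain = subset([20], [], parents)   # like --whitelistTaxids 20
tree = subset([], [20], parents)    # like --whitelistTreeTaxids 20
print(sorted(plain))  # [1, 10, 20]
print(sorted(tree))   # [1, 10, 20, 30, 31]
```

In this toy tree, whitelisting taxid 20 keeps only its lineage, while tree-whitelisting it also keeps its children 30 and 31.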

usage: metagenomics.py subset_taxonomy [-h]
                                       [--whitelistTaxids WHITELISTTAXIDS [WHITELISTTAXIDS ...]]
                                       [--whitelistTaxidFile WHITELISTTAXIDFILE]
                                       [--whitelistTreeTaxids WHITELISTTREETAXIDS [WHITELISTTREETAXIDS ...]]
                                       [--whitelistTreeTaxidFile WHITELISTTREETAXIDFILE]
                                       [--whitelistGiFile WHITELISTGIFILE]
                                       [--whitelistAccessionFile WHITELISTACCESSIONFILE]
                                       [--skipGi] [--skipAccession]
                                       [--skipDeadAccession]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version] [--tmp_dir TMP_DIR]
                                       [--tmp_dirKeep]
                                       taxDb outputDb
Positional arguments:
taxDb Taxonomy database directory (containing nodes.dmp, parents.dmp etc.)
outputDb Output taxonomy database directory
Options:
--whitelistTaxids
 List of taxids to add to taxonomy (with parents)
--whitelistTaxidFile
 File containing taxids - one per line - to add to taxonomy with parents.
--whitelistTreeTaxids
 List of taxids to add to taxonomy (with parents and children)
--whitelistTreeTaxidFile
 File containing taxids - one per line - to add to taxonomy with parents and children.
--whitelistGiFile
 File containing GIs - one per line - to add to taxonomy with nodes.
--whitelistAccessionFile
 File containing accessions - one per line - to add to taxonomy with nodes.
--skipGi=False Skip GI to taxid mapping files
--skipAccession=False
 Skip accession to taxid mapping files
--skipDeadAccession=False
 Skip dead accession to taxid mapping files
--loglevel=INFO
 Verboseness of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
krakenuniq

Classify reads by taxon using KrakenUniq

usage: metagenomics.py krakenuniq [-h]
                                  [--outReports OUTREPORTS [OUTREPORTS ...]]
                                  [--outReads OUTREADS [OUTREADS ...]]
                                  [--filterThreshold FILTERTHRESHOLD]
                                  [--threads THREADS]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  db inBams [inBams ...]
Positional arguments:
db Kraken database directory.
inBams Input unaligned reads, BAM format.
Options:
--outReports Kraken summary report output file. Multiple filenames space separated.
--outReads Kraken per read classification output file. Multiple filenames space separated.
--filterThreshold=0.05
 Kraken filter threshold (default: 0.05)
--threads Number of threads (default: all available cores)
--loglevel=INFO
 Verboseness of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
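Because inBams, --outReports and --outReads each accept several values, the order of arguments on the command line matters. The following sketch assembles an argument list for this sub-command using only the options listed above. The helper name, the file names, and the one-report-per-BAM check are assumptions made for illustration.

```python
import shlex


def krakenuniq_cmd(db, in_bams, out_reports=None, out_reads=None,
                   filter_threshold=None, threads=None):
    """Assemble an argv list for `metagenomics.py krakenuniq`."""
    # Positionals go first so the multi-value options that follow
    # cannot swallow them.
    cmd = ["metagenomics.py", "krakenuniq", db, *in_bams]
    if out_reports:
        # Assumption: one report filename per input BAM, in the same order.
        if len(out_reports) != len(in_bams):
            raise ValueError("expected one --outReports file per input BAM")
        cmd += ["--outReports", *out_reports]
    if out_reads:
        cmd += ["--outReads", *out_reads]
    if filter_threshold is not None:
        cmd += ["--filterThreshold", str(filter_threshold)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


cmd = krakenuniq_cmd(
    "krakenuniq_db",                  # hypothetical database directory
    ["sample1.bam", "sample2.bam"],   # hypothetical unaligned BAMs
    out_reports=["sample1.report.txt", "sample2.report.txt"],
    threads=4,
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Putting `db` and the BAM files before the options matters with argparse: an option that takes one or more values would otherwise consume the positionals that follow it.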
krona

Create an interactive HTML report from a tabular metagenomic report

usage: metagenomics.py krona [-h] [--queryColumn QUERYCOLUMN]
                             [--taxidColumn TAXIDCOLUMN]
                             [--scoreColumn SCORECOLUMN]
                             [--magnitudeColumn MAGNITUDECOLUMN] [--noHits]
                             [--noRank] [--inputType {tsv,krakenuniq,kaiju}]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version]
                             inReport db outHtml
Positional arguments:
inReport Input report file (default: tsv)
db Krona taxonomy database directory.
outHtml Output html report.
Options:
--queryColumn=2
 Column of query id. (default: 2)
--taxidColumn=3
 Column of taxonomy id. (default: 3)
--scoreColumn Column of score. (default: none)
--magnitudeColumn
 Column of magnitude. (default: none)
--noHits=False Include wedge for no hits.
--noRank=False Include no rank assignments.
--inputType=tsv
 Handling for specialized report types.

Possible choices: tsv, krakenuniq, kaiju

--loglevel=INFO
 Verboseness of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
kaiju

Classify reads by the taxon of the Lowest Common Ancestor (LCA)

usage: metagenomics.py kaiju [-h] [--outReads OUTREADS] [--threads THREADS]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             inBam db taxDb outReport
Positional arguments:
inBam Input unaligned reads, BAM format.
db Kaiju database .fmi file.
taxDb Taxonomy database directory.
outReport Output taxonomy report.
Options:
--outReads Output LCA assignments for each read.
--threads Number of threads (default: all available cores)
--loglevel=INFO
 Verboseness of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
report_merge

Merge multiple metagenomic reports into a single metagenomic report. Any Krona input files created by this

usage: metagenomics.py report_merge [-h]
                                    [--outSummaryReport OUT_KRAKEN_SUMMARY]
                                    [--krakenDB KRAKEN_DB]
                                    [--outByQueryToTaxonID OUT_KRONA_INPUT]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    metagenomic_reports
                                    [metagenomic_reports ...]
Positional arguments:
metagenomic_reports
 Input metagenomic reports with the query ID and taxon ID in the 2nd and 3rd columns (Kraken format)
Options:
--outSummaryReport
 Path of human-readable metagenomic summary report, created by kraken-report
--krakenDB Kraken database (needed for outSummaryReport)
--outByQueryToTaxonID
 Output metagenomic report suitable for Krona input.
--loglevel=INFO
 Verboseness of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
filter_bam_to_taxa

Filter an (already classified) input bam file to only include reads that have been mapped to specified taxonomic IDs or scientific names. This requires a classification file, as produced by tools such as Kraken, as well as the NCBI taxonomy database.
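The selection step described here can be sketched as follows. This is an illustration under stated assumptions rather than the script's code: the Kraken-format rows, the toy `parents` map, and the function names are hypothetical. The column defaults mirror --read_id_col and --tax_id_col, and `without_children` mirrors --without-children.

```python
import csv
import io

# Hypothetical Kraken-format rows: status, read ID, taxID, length, LCA info.
kraken_tsv = (
    "C\tread1\t20\t100\t-\n"
    "C\tread2\t30\t100\t-\n"
    "C\tread3\t21\t100\t-\n"
    "U\tread4\t0\t100\t-\n"
)

# Toy child -> parent map standing in for nodes.dmp (root is its own parent).
parents = {1: 1, 10: 1, 20: 10, 21: 10, 30: 20}


def descends_from(taxid, targets, parents):
    """True if taxid is in targets or has an ancestor in targets."""
    while True:
        if taxid in targets:
            return True
        parent = parents.get(taxid)
        if parent is None or parent == taxid:
            return False
        taxid = parent


def select_reads(tsv_text, targets, parents,
                 read_id_col=1, tax_id_col=2, without_children=False):
    """Return the read IDs whose classification matches the target taxa."""
    keep = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        taxid = int(row[tax_id_col])
        if without_children:
            hit = taxid in targets
        else:
            hit = descends_from(taxid, targets, parents)
        if hit:
            keep.append(row[read_id_col])
    return keep


print(select_reads(kraken_tsv, {20}, parents))
print(select_reads(kraken_tsv, {20}, parents, without_children=True))
```

With taxid 20 as the target, the first call keeps read1 and read2 (taxid 30 is a child of 20), while the second keeps only read1. The read IDs collected this way are what a BAM filtering step would then use to pick records.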

usage: metagenomics.py filter_bam_to_taxa [-h]
                                          [--taxNames TAX_NAMES [TAX_NAMES ...]]
                                          [--taxIDs TAX_IDS [TAX_IDS ...]]
                                          [--without-children]
                                          [--read_id_col READ_ID_COL]
                                          [--tax_id_col TAX_ID_COL]
                                          [--JVMmemory JVMMEMORY]
                                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                          [--version] [--tmp_dir TMP_DIR]
                                          [--tmp_dirKeep]
                                          in_bam read_IDs_to_tax_IDs out_bam
                                          nodes_dmp names_dmp
Positional arguments:
in_bam Input bam file.
read_IDs_to_tax_IDs
 TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes each read ID maps to exactly one tax ID.
out_bam Output bam file, filtered to the taxa specified
nodes_dmp nodes.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
names_dmp names.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
Options:
--taxNames The taxonomic names to include. More than one can be specified. Mapped to Tax IDs by lowercase exact match only. Ex. “Viruses”. This is in addition to any taxonomic IDs provided.
--taxIDs The NCBI taxonomy IDs to include. More than one can be specified. This is in addition to any taxonomic names provided.
--without-children=False
 Omit reads classified more specifically than each taxon specified (without this a taxon and its children are included).
--read_id_col=1
 The (zero-indexed) number of the column in read_IDs_to_tax_IDs containing read IDs. (default: 1)
--tax_id_col=2 The (zero-indexed) number of the column in read_IDs_to_tax_IDs containing Taxonomy IDs. (default: 2)
--JVMmemory=4g JVM virtual memory size (default: 4g)
--loglevel=INFO
 Verboseness of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
taxlevel_summary

Aggregates taxonomic abundance data from multiple Kraken-format summary files. It is intended to report information on a particular taxonomic level (--taxlevelFocus; ex. ‘species’), within a higher-level grouping (--taxHeading; ex. ‘Viruses’). By default, when --taxHeading is at the same level as --taxlevelFocus, a summary with lines for each sample is emitted. Otherwise, a histogram is returned. If per-sample information is desired, --noHist can be specified. In per-sample data, the suffix “-pt” indicates percentage, so a value of 0.02 is 0.0002 of the total number of reads for the sample. If --topN is specified, only the top N most abundant taxa are included in the histogram count or per-sample output. If a number is specified for --countThreshold, only taxa with that number of reads (or greater) are included. Full data is returned via --jsonOut (filtered by --topN and --countThreshold), whereas --csvOut returns a summary.
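The “-pt” arithmetic and the two filters can be made concrete with a small sketch. The taxon names, counts, the `reads-pt` key, and the order in which the filters are applied are assumptions for illustration. The actual output columns of the script may differ.

```python
def summarize(counts, total_reads, top_n=100, count_threshold=1):
    """Filter taxa by read count and attach a percent-of-total value."""
    # Assumption: the count threshold is applied before the top-N cut.
    kept = [(name, n) for name, n in counts.items() if n >= count_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    rows = []
    for name, n in kept[:top_n]:
        rows.append({
            "taxon": name,
            "reads": n,
            # Percentage of all reads in the sample (hypothetical key name).
            "reads-pt": 100.0 * n / total_reads,
        })
    return rows


# Hypothetical per-taxon read counts for one sample of 1,000,000 reads.
counts = {"Virus A": 200, "Virus B": 20, "Virus C": 1}
rows = summarize(counts, total_reads=1_000_000, top_n=2, count_threshold=2)
for row in rows:
    print(row["taxon"], row["reads"], row["reads-pt"])
```

Here “Virus A” has 200 of 1,000,000 reads, which is a fraction of 0.0002 and therefore a “-pt” value of 0.02. “Virus C” is dropped because it falls below the count threshold of 2.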

usage: metagenomics.py taxlevel_summary [-h] [--jsonOut JSON_OUT]
                                        [--csvOut CSV_OUT]
                                        [--taxHeading TAX_HEADINGS [TAX_HEADINGS ...]]
                                        [--taxlevelFocus TAXLEVEL_FOCUS]
                                        [--topN TOP_N_ENTRIES]
                                        [--countThreshold COUNT_THRESHOLD]
                                        [--zeroFill] [--noHist]
                                        [--includeRoot]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmp_dir TMP_DIR]
                                        [--tmp_dirKeep]
                                        summary_files_in
                                        [summary_files_in ...]
Positional arguments:
summary_files_in
 Kraken-format summary text file with tab-delimited taxonomic levels.
Options:
--jsonOut The path to a json file containing the relevant parsed summary data in json format.
--csvOut The path to a csv file containing sample-specific counts.
--taxHeading=Viruses
 The taxonomic heading to analyze (default: Viruses). More than one can be specified.
--taxlevelFocus=species
 The taxonomic level to summarize (totals by genus, etc.) (default: species).
--topN=100 Only include the top N most abundant taxa by read count (default: 100)
--countThreshold=1
 Minimum number of reads to be included (default: 1)
--zeroFill=False
 When absent from a sample, write zeroes (rather than leaving blank).
--noHist=False Write out a report by-sample rather than a histogram.
--includeRoot=False
 Include the count of reads at the root level and the unclassified bin.
--loglevel=INFO
 Verboseness of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
krakenuniq_build

Builds a krakenuniq database from a library directory of FASTA files and a taxonomy db directory. The --subsetTaxonomy option allows shrinking the taxonomy to only include taxids associated with the library folders. For this to work, the library fastas must have the standard accession id names such as `>NC1234.1` or `>NC_01234.1`. Setting --minimizerLen (default: 15) to a small value, such as 10, will drastically shrink the db size for small inputs, which is useful for testing. The built db may include symlinks to the original --library / --taxonomy directories. If you want to build a static, archivable version of the library, use the --clean option, which will also remove any unnecessary files.
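Before running with --subsetTaxonomy, it can be useful to check that the library FASTA headers look like accession IDs. The sketch below uses a regular expression chosen only to match the two example forms given above. The pattern and the helper name are assumptions, and the script's own parsing may accept or reject other forms.

```python
import re

# Assumed pattern: uppercase prefix, optional underscore, digits, ".version".
ACCESSION_RE = re.compile(r"^>([A-Z]{1,6}_?\d+\.\d+)")


def accession_of(header):
    """Return the accession at the start of a FASTA header, or None."""
    match = ACCESSION_RE.match(header)
    return match.group(1) if match else None


# Hypothetical headers: two in the documented forms, one that is not.
headers = [">NC1234.1", ">NC_01234.1 some description", ">contig_7"]
result = [accession_of(h) for h in headers]
print(result)  # ['NC1234.1', 'NC_01234.1', None]
```

Headers that yield `None` here would be worth inspecting before the build, since sequences without a recognizable accession cannot be tied to a taxid.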

usage: metagenomics.py krakenuniq_build [-h] [--library LIBRARY]
                                        [--taxonomy TAXONOMY]
                                        [--subsetTaxonomy]
                                        [--minimizerLen MINIMIZERLEN]
                                        [--kmerLen KMERLEN]
                                        [--maxDbSize MAXDBSIZE] [--clean]
                                        [--workOnDisk] [--threads THREADS]
                                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                        [--version] [--tmp_dir TMP_DIR]
                                        [--tmp_dirKeep]
                                        db
Positional arguments:
db Krakenuniq database output directory.
Options:
--library Input library directory of fasta files. If not specified, it will be read from the “library” subdirectory of “db”.
--taxonomy Taxonomy db directory. If not specified, it will be read from the “taxonomy” subdirectory of “db”.
--subsetTaxonomy=False
 Subset taxonomy based on library fastas.
--minimizerLen Minimizer length (krakenuniq default: 15)
--kmerLen k-mer length (krakenuniq default: 31)
--maxDbSize Maximum db size in GB (will shrink if too big)
--clean=False Clean by deleting other database files after build
--workOnDisk=False
 Work on disk instead of RAM. This is generally much slower unless the “db” directory lives on a RAM disk.
--threads Number of threads (default: all available cores)
--loglevel=INFO
 Verboseness of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.