2.5. metagenomics.py - utilities for metagenomic analyses

This script contains a number of utilities for metagenomic analyses.

usage: metagenomics.py subcommand

2.5.1. subcommands

Possible choices: subset_taxonomy, filter_taxids_to_focal_hits, kraken2, kb, kma, kma_build, centrifuger, centrifuger_build, centrifuger_quant, centrifuger_classification_to_kraken2, centrifuger_kreport, virnucpro_contigs, virnucpro_label_reads_by_contig, krona, report_merge, filter_bam_to_taxa, taxlevel_summary, taxlevel_plurality, kb_extract, kb_top_taxa, kb_merge_h5ads, krona_build, kraken2_build, kb_build, genomad

2.5.2. Sub-commands

2.5.2.1. subset_taxonomy

Generate a subset of the taxonomy db files filtered by the whitelist. The whitelist taxids indicate specific taxids plus their parents to add to the taxonomy, while whitelistTreeTaxids indicate specific taxids plus both their parents and all of their child taxa. Whitelisted GIs and accessions can only be provided in file form, and the resulting gi/accession2taxid files will be filtered to only include those in the whitelist files. Finally, taxids + parents for the GIs/accessions will also be included.

metagenomics.py subset_taxonomy [-h]
                                [--whitelistTaxids WHITELISTTAXIDS [WHITELISTTAXIDS ...]]
                                [--whitelistTaxidFile WHITELISTTAXIDFILE]
                                [--whitelistTreeTaxids WHITELISTTREETAXIDS [WHITELISTTREETAXIDS ...]]
                                [--whitelistTreeTaxidFile WHITELISTTREETAXIDFILE]
                                [--whitelistGiFile WHITELISTGIFILE]
                                [--whitelistAccessionFile WHITELISTACCESSIONFILE]
                                [--skipGi] [--skipAccession]
                                [--skipDeadAccession]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                taxDb outputDb
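
The difference between the two whitelist modes described above can be illustrated with a minimal sketch. It assumes the taxonomy has already been loaded into a child-to-parent dictionary (as nodes.dmp encodes it, with the root listed as its own parent). The function names and the toy tree are hypothetical and are not taken from metagenomics.py.

```python
def ancestors(taxid, parent_of):
    """Yield taxid followed by each of its ancestors up to the root."""
    while True:
        yield taxid
        parent = parent_of[taxid]
        if parent == taxid:  # the root is its own parent
            return
        taxid = parent


def descendants(taxid, children_of):
    """Yield taxid followed by every taxon beneath it."""
    stack = [taxid]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(children_of.get(node, ()))


def select_taxids(parent_of, whitelist=(), whitelist_tree=()):
    """Return the set of taxids that a subset taxonomy would retain."""
    children_of = {}
    for child, parent in parent_of.items():
        if child != parent:
            children_of.setdefault(parent, []).append(child)
    keep = set()
    for taxid in whitelist:  # taxid plus parents
        keep.update(ancestors(taxid, parent_of))
    for taxid in whitelist_tree:  # taxid plus parents and children
        keep.update(ancestors(taxid, parent_of))
        keep.update(descendants(taxid, children_of))
    return keep


# Hypothetical toy tree: 1 (root) -> 10 -> 100 -> {1000, 1001}, and 1 -> 20
PARENT_OF = {1: 1, 10: 1, 20: 1, 100: 10, 1000: 100, 1001: 100}

with_parents = sorted(select_taxids(PARENT_OF, whitelist=[100]))
with_tree = sorted(select_taxids(PARENT_OF, whitelist_tree=[100]))
```

In this toy tree, whitelisting 100 retains 1, 10, and 100, whereas tree-whitelisting 100 additionally retains 1000 and 1001.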

2.5.2.1.1. Positional Arguments

taxDb: Taxonomy database directory (containing nodes.dmp, names.dmp etc.)
outputDb: Output taxonomy database directory

2.5.2.1.2. Named Arguments

--whitelistTaxids

List of taxids to add to taxonomy (with parents)

--whitelistTaxidFile

File containing taxids - one per line - to add to taxonomy with parents.

--whitelistTreeTaxids

List of taxids to add to taxonomy (with parents and children)

--whitelistTreeTaxidFile

File containing taxids - one per line - to add to taxonomy with parents and children.

--whitelistGiFile

File containing GIs - one per line - to add to taxonomy with nodes.

--whitelistAccessionFile

File containing accessions - one per line - to add to taxonomy with nodes.

--skipGi

Skip GI to taxid mapping files

Default: False

--skipAccession

Skip accession to taxid mapping files

Default: False

--skipDeadAccession

Skip dead accession to taxid mapping files

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
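
The cleanup behavior that --tmp_dir and --tmp_dirKeep describe can be sketched as a context manager. This is only an illustration of the documented semantics under stated assumptions: `scratch_dir` and `run_and_fail` are hypothetical names, and the actual implementation in the script may differ.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def scratch_dir(base, keep_on_error=False):
    """Create a temp dir under `base`; it is always removed on success."""
    path = tempfile.mkdtemp(dir=base)
    try:
        yield path
    except Exception:
        # On failure, delete unless the caller asked to keep it for debugging.
        if not keep_on_error:
            shutil.rmtree(path, ignore_errors=True)
        raise
    else:
        shutil.rmtree(path, ignore_errors=True)


def run_and_fail(keep_on_error):
    """Simulate a failing run and report whether the temp dir survived."""
    seen = {}
    try:
        with scratch_dir(tempfile.gettempdir(), keep_on_error) as path:
            seen["path"] = path
            raise RuntimeError("simulated failure")
    except RuntimeError:
        pass
    survived = os.path.isdir(seen["path"])
    shutil.rmtree(seen["path"], ignore_errors=True)  # tidy up after the demo
    return survived
```

Calling `run_and_fail(True)` mirrors passing --tmp_dirKeep: the directory is left in place after the simulated failure so its contents can be inspected.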

2.5.2.2. filter_taxids_to_focal_hits

Generate a subset of the taxids_tsv file filtered by the focal_report_tsv. Only those rows of taxids_tsv are emitted whose taxid is either contained within the focal_report_tsv or is a child/descendant of a node contained within it.

metagenomics.py filter_taxids_to_focal_hits [-h]
                                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                            [--version] [--tmp_dir TMP_DIR]
                                            [--tmp_dirKeep]
                                            taxids_tsv focal_report_tsv
                                            taxdb_dir min_read_count
                                            output_tsv

2.5.2.2.1. Positional Arguments

taxids_tsv: TSV file where first column is a taxid
focal_report_tsv: TSV produced by taxlevel_plurality
taxdb_dir: Taxonomy database directory (containing nodes.dmp, names.dmp etc.)
min_read_count: ignore focal_report_tsv entries below this read count
output_tsv: Output TSV file where first column is a taxid
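
The filtering rule described for this subcommand can be sketched as follows. The sketch assumes the focal report has been reduced to a taxid-to-read-count dictionary and the taxonomy to a child-to-parent dictionary; both representations, the function names, and the inclusive read-count comparison are assumptions made for illustration, not the script's actual code.

```python
def is_at_or_below(taxid, focal, parent_of):
    """True if taxid is in `focal` or has an ancestor in `focal`."""
    while True:
        if taxid in focal:
            return True
        parent = parent_of.get(taxid, taxid)
        if parent == taxid:  # reached the root (or an unknown taxid)
            return False
        taxid = parent


def filter_rows(rows, focal_counts, parent_of, min_read_count):
    """Keep rows whose first column is a focal taxid or a descendant of one."""
    # Focal entries below the read-count threshold are ignored.
    focal = {t for t, reads in focal_counts.items() if reads >= min_read_count}
    return [row for row in rows if is_at_or_below(int(row[0]), focal, parent_of)]


# Hypothetical toy data: 1 (root) -> 10 -> 100 -> 1000, and 1 -> 20
PARENT_OF = {1: 1, 10: 1, 20: 1, 100: 10, 1000: 100}
FOCAL_COUNTS = {100: 50, 20: 2}
ROWS = [["1000", "a"], ["20", "b"], ["10", "c"], ["100", "d"]]

kept = filter_rows(ROWS, FOCAL_COUNTS, PARENT_OF, min_read_count=10)
```

Here taxid 20 is dropped because its focal entry falls below the threshold, and taxid 10 is dropped because it is a parent, not a descendant, of the focal node 100.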

2.5.2.2.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.3. kraken2

Classify reads by taxon using Kraken2

metagenomics.py kraken2 [-h] [--outReports OUTREPORTS [OUTREPORTS ...]]
                        [--outReads OUTREADS [OUTREADS ...]]
                        [--minimum_hit_groups MINIMUM_HIT_GROUPS]
                        [--min_base_qual MIN_BASE_QUAL]
                        [--confidence CONFIDENCE] [--threads THREADS]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                        db inBams [inBams ...]

2.5.2.3.1. Positional Arguments

db: Kraken database directory.
inBams: Input unaligned reads, BAM format.

2.5.2.3.2. Named Arguments

--outReports

Kraken2 summary report output file. Multiple filenames space separated.

--outReads

Kraken2 per read classification output file. Multiple filenames space separated.

--minimum_hit_groups

Minimum hit groups (Kraken2 default: 2)

--min_base_qual

Minimum base quality (default None)

--confidence

Kraken2 confidence score threshold (default None)

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.4. kb

Runs kb count on the input BAM files.

Args:

in_bam (list): List of input BAM files.
out_dir (str): Output directory. Defaults to None.
index (str): Path to the kb index file.
t2g (list|str): Transcript-to-gene mapping file(s).
kmer_len (int, optional): K-mer size for the alignment. Defaults to 31.
parity (str, optional): Library parity. Defaults to ‘single’.
technology (str, optional): Sequencing technology used. Defaults to ‘bulk’.
h5ad (bool, optional): Whether to output HDF5 file. Defaults to False.
loom (bool, optional): Whether to output Loom file. Defaults to False.
protein (bool, optional): Whether the sequence contains amino acids. Defaults to False.
threads (int, optional): Number of threads to use. Defaults to None.

metagenomics.py kb [-h] [--index INDEX] [--t2g T2G] [--kmer_len KMER_LEN]
                   [--parity {single,paired}]
                   [--technology {10xv2,10xv3,10xv3-3prime,10xv3-5prime,dropseq,indrop,celseq,celseq2,smartseq2,bulk}]
                   [--h5ad] [--loom] [--protein] [--out_dir OUT_DIR]
                   [--threads THREADS]
                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                   [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                   in_bam

2.5.2.4.1. Positional Arguments

in_bam: Input unaligned reads, BAM format.

2.5.2.4.2. Named Arguments

--index

kb index file.

--t2g

Transcript-to-gene mapping file(s).

--kmer_len

k-mer size (default: 31bp)

Default: 31

--parity

Possible choices: single, paired

Library parity (default: single)

Default: 'single'

--technology

Possible choices: 10xv2, 10xv3, 10xv3-3prime, 10xv3-5prime, dropseq, indrop, celseq, celseq2, smartseq2, bulk

Technology used to generate the data (default: bulk)

Default: 'bulk'

--h5ad

Output HDF5 file (default: False)

Default: False

--loom

Output Loom file (default: False)

Default: False

--protein

True if sequence contains amino acids (default: False).

Default: False

--out_dir

Output directory (default: kb_out)

Default: 'kb_out'

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.5. kma

metagenomics.py kma [-h] [--outPrefixes OUTPREFIXES [OUTPREFIXES ...]]
                    [--threads THREADS]
                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                    [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                    db inBams [inBams ...]

2.5.2.5.1. Positional Arguments

db: KMA database prefix.
inBams: Input unaligned reads, BAM format.

2.5.2.5.2. Named Arguments

--outPrefixes

KMA output prefixes.

--threads

Number of threads.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.6. kma_build

metagenomics.py kma_build [-h] [--threads THREADS]
                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                          [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                          ref_fasta db_prefix

2.5.2.6.1. Positional Arguments

ref_fasta: Reference FASTA file.
db_prefix: Output database prefix.

2.5.2.6.2. Named Arguments

--threads

Number of threads.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.7. centrifuger

Classify reads by taxon using Centrifuger.

metagenomics.py centrifuger [-h] [--k K]
                            [--unclassified_prefix UNCLASSIFIED_PREFIX]
                            [--classified_prefix CLASSIFIED_PREFIX]
                            [--min_hitlen MIN_HITLEN]
                            [--hitk_factor HITK_FACTOR] [--merge_readpair]
                            [--threads THREADS]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            db in_bam out_classification

2.5.2.7.1. Positional Arguments

db: Centrifuger database prefix.
in_bam: Input unaligned reads, BAM format.
out_classification: Centrifuger per-read classification output file.

2.5.2.7.2. Named Arguments

--k

Report top k classification results for each read. Default: 1.

Default: 1

--unclassified_prefix

Output prefix for unclassified reads.

--classified_prefix

Output prefix for classified reads.

--min_hitlen

Minimum total length of matched segments.

--hitk_factor

Centrifuger hit-k factor.

--merge_readpair

Merge paired reads before classification.

Default: False

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.8. centrifuger_build

Build a Centrifuger database.

metagenomics.py centrifuger_build [-h]
                                  [--ref_fastas REF_FASTAS [REF_FASTAS ...]]
                                  [--ref_list REF_LIST]
                                  [--conversion_table CONVERSION_TABLE]
                                  [--build_mem BUILD_MEM] [--threads THREADS]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  db_prefix taxonomy_tree name_table

2.5.2.8.1. Positional Arguments

db_prefix: Centrifuger database output prefix.
taxonomy_tree: NCBI taxonomy nodes.dmp file.
name_table: NCBI taxonomy names.dmp file.

2.5.2.8.2. Named Arguments

--ref_fastas

Reference FASTA files.

--ref_list

File containing reference FASTA paths, one per line.

--conversion_table

Sequence ID to taxonomy ID mapping file.

--build_mem

Memory target for centrifuger-build.

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.9. centrifuger_quant

Quantify Centrifuger classification output.

metagenomics.py centrifuger_quant [-h] [--min_score MIN_SCORE]
                                  [--min_length MIN_LENGTH]
                                  [--output_format OUTPUT_FORMAT]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  db classification output

2.5.2.9.1. Positional Arguments

db: Centrifuger database prefix.
classification: Centrifuger classification output file.
output: Centrifuger quantification output file.

2.5.2.9.2. Named Arguments

--min_score

Minimum score to include a read.

--min_length

Minimum read length to include.

--output_format

Centrifuger quant output format.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.10. centrifuger_classification_to_kraken2

Convert Centrifuger per-read classification output to Kraken2-style per-read classification output.

metagenomics.py centrifuger_classification_to_kraken2 [-h]
                                                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                                      [--version]
                                                      [--tmp_dir TMP_DIR]
                                                      [--tmp_dirKeep]
                                                      classification output

2.5.2.10.1. Positional Arguments

classification: Centrifuger per-read classification output file.
output: Kraken2-style per-read classification output file.

2.5.2.10.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.11. centrifuger_kreport

Produce a Kraken-style hierarchical report from a Centrifuger classification output file.

metagenomics.py centrifuger_kreport [-h] [--no_lca] [--show_zeros]
                                    [--is_count_table] [--min_score MIN_SCORE]
                                    [--min_length MIN_LENGTH]
                                    [--report_score_data]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version] [--tmp_dir TMP_DIR]
                                    [--tmp_dirKeep]
                                    db classification output

2.5.2.11.1. Positional Arguments

db: Centrifuger database prefix.
classification: Centrifuger classification output file.
output: Kraken-style hierarchical report output file.

2.5.2.11.2. Named Arguments

--no_lca

Do not promote multi-assignment reads to their LCA; report counts at the original taxa.

Default: False

--show_zeros

Include taxa with zero reads in the report.

Default: False

--is_count_table

Input is a taxID<TAB>count table instead of the standard centrifuger output.

Default: False

--min_score

Minimum score for reads to be counted.

--min_length

Minimum alignment length for reads to be counted.

--report_score_data

Append an extra column summarizing classification scores.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
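
For readers unfamiliar with the LCA promotion that --no_lca turns off, the following sketch shows how the lowest common ancestor of a read's candidate taxa can be found. It assumes a child-to-parent dictionary with the root as its own parent; the names and toy tree are hypothetical, and the sketch does not reproduce how the underlying Centrifuger tooling counts reads.

```python
def lineage(taxid, parent_of):
    """Return the path from taxid up to the root, starting with taxid."""
    path = [taxid]
    while parent_of[taxid] != taxid:
        taxid = parent_of[taxid]
        path.append(taxid)
    return path


def lca(taxids, parent_of):
    """Lowest common ancestor of one or more taxids."""
    taxids = list(taxids)
    # Start from the first lineage (deepest node first) and keep only the
    # nodes that also appear in every other lineage.
    common = lineage(taxids[0], parent_of)
    for taxid in taxids[1:]:
        other = set(lineage(taxid, parent_of))
        common = [node for node in common if node in other]
    return common[0]  # deepest node shared by all lineages


# Hypothetical toy tree: 1 (root) -> 10 -> 100 -> {1000, 1001}, and 1 -> 20
PARENT_OF = {1: 1, 10: 1, 20: 1, 100: 10, 1000: 100, 1001: 100}

sibling_lca = lca([1000, 1001], PARENT_OF)
distant_lca = lca([1000, 20], PARENT_OF)
```

A read assigned to both 1000 and 1001 would be promoted to their shared parent 100, while a read assigned to 1000 and 20 would be promoted all the way to the root.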

2.5.2.12. virnucpro_contigs

Classify contigs from VirNucPro highest-score output.

VirNucPro produces two raw prediction tables whose names are not self-explanatory:

prediction_results.txt contains one row per translated sequence/chunk scored by the model. In the original implementation, this has Sequence_ID, Prediction, score1, and score2.
prediction_results_highestscore.csv is the per-input-contig summary derived from prediction_results.txt. Despite the .csv suffix, the original implementation writes it as a tab-delimited table. This highest-score summary is the input expected by this command, often passed through viral-ngs/WDL as highestscore_tsv.

metagenomics.py virnucpro_contigs [-h] [--minViralProp MIN_VIRAL_PROP]
                                  [--minNonviralProp MIN_NONVIRAL_PROP]
                                  [--minChunks MIN_CHUNKS]
                                  [--minConfidentScore MIN_CONFIDENT_SCORE]
                                  [--maxOpposingScore MAX_OPPOSING_SCORE]
                                  [--minAmbiguousScore MIN_AMBIGUOUS_SCORE]
                                  [--minWeightedDelta MIN_WEIGHTED_DELTA]
                                  [--highConfidenceDelta HIGH_CONFIDENCE_DELTA]
                                  [--idCol ID_COL] [--idPattern ID_PATTERN]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  highestscore_tsv output_tsv

2.5.2.12.1. Positional Arguments

highestscore_tsv: VirNucPro highest-score TSV.
output_tsv: Output contig classification TSV.

2.5.2.12.2. Named Arguments

--minViralProp, --min-viral-prop

Minimum confident viral chunk proportion. (default: 0.1)

Default: 0.1

--minNonviralProp, --min-nonviral-prop

Minimum confident non-viral chunk proportion. (default: 0.1)

Default: 0.1

--minChunks, --min-chunks

Minimum chunks for high/moderate confidence tiers. (default: 5)

Default: 5

--minConfidentScore, --min-confident-score

Minimum winning class score for a chunk to count as confident. (default: 0.8)

Default: 0.8

--maxOpposingScore, --max-opposing-score

Maximum opposing class score for a chunk to count as confident. (default: 0.3)

Default: 0.3

--minAmbiguousScore, --min-ambiguous-score

Minimum score in both classes for a chunk to count as ambiguous. (default: 0.7)

Default: 0.7

--minWeightedDelta, --min-weighted-delta

Minimum absolute weighted score delta required for a viral/non-viral call. (default: 0.3)

Default: 0.3

--highConfidenceDelta, --high-confidence-delta

Minimum absolute weighted score delta required for high-confidence tiers. (default: 0.6)

Default: 0.6

--idCol, --id-col

Column containing chunk/contig IDs. (default: ‘Modified_ID’)

Default: 'Modified_ID'

--idPattern, --id-pattern

Regex used to extract contig group IDs. (default: ‘(NODE_\d+)’)

Default: '(NODE_\d+)'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
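
One possible reading of how the chunk-level thresholds above interact is sketched below. The label names, the inclusive comparisons, the order in which the rules are checked, and the example chunk IDs are all assumptions made for illustration; only the default values and the ID regex come from the argument descriptions. The real command's tiering logic (including --minChunks and the weighted deltas) is not reproduced here.

```python
import re
from collections import defaultdict

ID_PATTERN = re.compile(r"(NODE_\d+)")  # default of --idPattern


def label_chunk(viral, nonviral, min_confident=0.8, max_opposing=0.3,
                min_ambiguous=0.7):
    """Assign a hypothetical label to one chunk from its two class scores."""
    if viral >= min_ambiguous and nonviral >= min_ambiguous:
        return "ambiguous"
    if viral >= min_confident and nonviral <= max_opposing:
        return "confident_viral"
    if nonviral >= min_confident and viral <= max_opposing:
        return "confident_nonviral"
    return "unassigned"


def confident_viral_proportion(chunks):
    """Group chunks by contig ID and return the confident-viral fraction.

    `chunks` is an iterable of (chunk_id, viral_score, nonviral_score).
    """
    labels_by_contig = defaultdict(list)
    for chunk_id, viral, nonviral in chunks:
        match = ID_PATTERN.search(chunk_id)
        if match:  # chunks whose ID does not match the pattern are skipped
            labels_by_contig[match.group(1)].append(label_chunk(viral, nonviral))
    return {
        contig: labels.count("confident_viral") / len(labels)
        for contig, labels in labels_by_contig.items()
    }


# Hypothetical chunk IDs and scores
CHUNKS = [
    ("NODE_1_length_500_chunk_1", 0.95, 0.05),
    ("NODE_1_length_500_chunk_2", 0.60, 0.40),
    ("NODE_2_length_300_chunk_1", 0.10, 0.90),
]
proportions = confident_viral_proportion(CHUNKS)
```

Under these assumptions, a contig's confident-viral proportion would then be compared against --minViralProp (default 0.1) when deciding on a viral call.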

2.5.2.13. virnucpro_label_reads_by_contig

Label reads with the VirNucPro classification of their best-mapping contig.

The output is a tab-delimited, one-row-per-read table. Each row preserves the selected BAM alignment context (read length, contig, contig length, strand, mapping quality, percent identity, percent query coverage derived from CIGAR and NM) and appends the VirNucPro contig-level call/tier and supporting chunk summary metrics.

The selected contig is the best primary alignment for each read, ordered by mapping quality, then percent identity, then input order. Input record order is the final deterministic tiebreaker; avoid re-sorting or otherwise reordering the BAM between minimap2 alignment and classification if exact tie reproducibility matters. mapped_well is a boolean flag derived from --min-mapq, --min-identity, and --min-query-cov. Reads with primary alignments to contigs carrying different VirNucPro calls are labeled Multi-mapped. Reads whose selected contig has no VirNucPro classification are labeled Unclassified.

metagenomics.py virnucpro_label_reads_by_contig [-h] [--minMapq MIN_MAPQ]
                                                [--minIdentity MIN_IDENTITY]
                                                [--minQueryCov MIN_QUERY_COV]
                                                [--duckdbMemoryLimit DUCKDB_MEMORY_LIMIT]
                                                [--workDir WORK_DIR]
                                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                                [--version]
                                                [--tmp_dir TMP_DIR]
                                                [--tmp_dirKeep]
                                                aligned_bam
                                                contig_classifications
                                                output_tsv
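
The selection order and the mapped_well flag described above can be sketched as follows. The sketch assumes percent identity and percent query coverage have already been computed per alignment, that an earlier record wins a tie, and that mapped_well requires all three thresholds to be met inclusively. The `Alignment` class and function names are hypothetical and do not reflect the DuckDB-based implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    read_id: str
    contig: str
    mapq: int
    pct_identity: float
    pct_query_cov: float


def best_alignments(alignments):
    """Pick one alignment per read: highest MAPQ, then highest identity.

    Iteration order stands in for BAM record order, so on a full tie the
    record seen first is kept.
    """
    best = {}
    for aln in alignments:
        current = best.get(aln.read_id)
        if current is None or (aln.mapq, aln.pct_identity) > (
            current.mapq, current.pct_identity
        ):
            best[aln.read_id] = aln
    return best


def mapped_well(aln, min_mapq=5, min_identity=90.0, min_query_cov=80.0):
    """Boolean flag using the documented default thresholds (percent units)."""
    return (
        aln.mapq >= min_mapq
        and aln.pct_identity >= min_identity
        and aln.pct_query_cov >= min_query_cov
    )


# Hypothetical alignments; the first two tie on MAPQ and identity.
ALIGNMENTS = [
    Alignment("r1", "NODE_1", 60, 99.0, 100.0),
    Alignment("r1", "NODE_2", 60, 99.0, 100.0),
    Alignment("r1", "NODE_3", 30, 100.0, 100.0),
    Alignment("r2", "NODE_1", 3, 95.0, 90.0),
]
selected = best_alignments(ALIGNMENTS)
```

Because the tie between NODE_1 and NODE_2 is resolved purely by record order, reordering the input would change the selected contig, which is why the description warns against re-sorting the BAM.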

2.5.2.13.1. Positional Arguments

aligned_bam: Minimap2-aligned BAM with NM tags.
contig_classifications: Per-contig VirNucPro classification TSV produced by virnucpro_contigs.
output_tsv: Output per-read classification TSV. Compression is inferred from the extension (.gz/.zst/.lz4/.bz2).

2.5.2.13.2. Named Arguments

--minMapq, --min-mapq

Minimum mapping quality for the mapped_well flag. (default: 5)

Default: 5

--minIdentity, --min-identity

Minimum percent identity for the mapped_well flag. Use percent units, e.g. 90 not 0.9. (default: 90.0)

Default: 90.0

--minQueryCov, --min-query-cov

Minimum percent query coverage for the mapped_well flag. Use percent units, e.g. 80 not 0.8. (default: 80.0)

Default: 80.0

--duckdbMemoryLimit, --duckdb-memory-limit

DuckDB memory cap, e.g. “8GB” (default: None = auto-detect ~75% of the cgroup limit). Empty string disables any cap.

--workDir, --work-dir

Directory for the per-run temp dir / DuckDB spill (default: None = system tmp).

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.14. krona

Create an interactive HTML report from a tabular metagenomic report

metagenomics.py krona [-h] [--sample_name SAMPLE_NAME]
                      [--queryColumn QUERYCOLUMN] [--taxidColumn TAXIDCOLUMN]
                      [--scoreColumn SCORECOLUMN]
                      [--magnitudeColumn MAGNITUDECOLUMN] [--noHits]
                      [--noRank] [--inputType {tsv,kraken2}]
                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                      [--version]
                      inReports [inReports ...] db outHtml

2.5.2.14.1. Positional Arguments

inReports: Input report file (default: tsv)
db: Krona taxonomy database directory.
outHtml: Output html report.

2.5.2.14.2. Named Arguments

--sample_name

Title of dataset (default basename(inReport))

--queryColumn

Column of query id. (default 2)

Default: 2

--taxidColumn

Column of taxonomy id. (default 3)

Default: 3

--scoreColumn

Column of score. (default None)

--magnitudeColumn

Column of magnitude. (default None)

--noHits

Include wedge for no hits.

Default: False

--noRank

Include no rank assignments.

Default: False

--inputType

Possible choices: tsv, kraken2

Handling for specialized report types.

Default: 'tsv'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.5.2.15. report_merge

Merge multiple metagenomic reports into a single metagenomic report suitable for Krona input.

metagenomics.py report_merge [-h]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             metagenomic_reports [metagenomic_reports ...]
                             out_krona_input

2.5.2.15.1. Positional Arguments

metagenomic_reports: Input metagenomic reports with the query ID and taxon ID in the 2nd and 3rd columns (Kraken format)
out_krona_input: Output metagenomic report suitable for Krona input.

2.5.2.15.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.16. filter_bam_to_taxa

Filter an (already classified) input bam file to only include reads that have been mapped to specified taxonomic IDs or scientific names. This requires a classification file, as produced by tools such as Kraken, as well as the NCBI taxonomy database.

metagenomics.py filter_bam_to_taxa [-h] [--exclude]
                                   [--taxNames TAX_NAMES [TAX_NAMES ...]]
                                   [--taxIDs TAX_IDS [TAX_IDS ...]]
                                   [--without-children]
                                   [--read_id_col READ_ID_COL]
                                   [--tax_id_col TAX_ID_COL]
                                   [--out_count OUT_COUNT]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   in_bam read_IDs_to_tax_IDs out_bam
                                   nodes_dmp names_dmp

2.5.2.16.1. Positional Arguments

in_bam: Input bam file.
read_IDs_to_tax_IDs: TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes bijective mapping of read ID to tax ID.
out_bam: Output bam file, filtered to the taxa specified
nodes_dmp: nodes.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
names_dmp: names.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

2.5.2.16.2. Named Arguments

--exclude

Switch filtration to remove all reads falling under matching taxa (and keep all non-matching). Default is the inverse: keep all reads falling under matching taxa (and remove all non-matching).

Default: False

--taxNames

The taxonomic names to include. More than one can be specified. Mapped to Tax IDs by lowercase exact match only. Ex. “Viruses”. This is in addition to any taxonomic IDs provided.

--taxIDs

The NCBI taxonomy IDs to include. More than one can be specified. This is in addition to any taxonomic names provided.

--without-children

Omit reads classified more specifically than each taxon specified (without this a taxon and its children are included).

Default: False

--read_id_col

The (zero-indexed) number of the column in read_IDs_to_tax_IDs containing read IDs. (default: 1)

Default: 1

--tax_id_col

The (zero-indexed) number of the column in read_IDs_to_tax_IDs containing Taxonomy IDs. (default: 2)

Default: 2

--out_count

Write a file with the number of reads matching the specified taxa.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
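
The interplay of --read_id_col, --tax_id_col, --without-children, and --exclude can be sketched on Kraken-format per-read lines, where column 0 is the C/U flag, column 1 the read ID, and column 2 the taxid. The function names, the toy tree, and the example lines are hypothetical; the sketch works on read IDs only and does not touch BAM files.

```python
def parse_assignments(lines, read_id_col=1, tax_id_col=2):
    """Map read ID -> taxid from tab-separated lines (zero-indexed columns)."""
    assignments = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        assignments[fields[read_id_col]] = int(fields[tax_id_col])
    return assignments


def expand_targets(tax_ids, parent_of, without_children=False):
    """Return the requested taxids, plus all descendants unless disabled."""
    targets = set(tax_ids)
    if without_children:
        return targets
    children_of = {}
    for child, parent in parent_of.items():
        if child != parent:
            children_of.setdefault(parent, []).append(child)
    stack = list(targets)
    while stack:
        for child in children_of.get(stack.pop(), ()):
            if child not in targets:
                targets.add(child)
                stack.append(child)
    return targets


def reads_to_keep(assignments, targets, exclude=False):
    """Keep matching reads by default; with exclude=True keep the rest."""
    return {
        read for read, taxid in assignments.items()
        if (taxid in targets) != exclude
    }


# Hypothetical toy tree: 1 (root) -> 10 -> 100 -> 1000, and 1 -> 20
PARENT_OF = {1: 1, 10: 1, 20: 1, 100: 10, 1000: 100}
LINES = [
    "C\tread1\t1000\t150\t-",
    "C\tread2\t20\t150\t-",
    "U\tread3\t0\t150\t-",
]
ASSIGNMENTS = parse_assignments(LINES)
kept = reads_to_keep(ASSIGNMENTS, expand_targets([100], PARENT_OF))
```

With taxID 100 as the target, read1 (classified to the child taxon 1000) is kept by default, is dropped when children are omitted, and is the only read removed when the filter is inverted.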

2.5.2.17. taxlevel_summary

Aggregates taxonomic abundance data from multiple Kraken-format summary files. It is intended to report information on a particular taxonomic level (--taxlevelFocus; ex. ‘species’), within a higher-level grouping (--taxHeading; ex. ‘Viruses’). By default, when --taxHeading is at the same level as --taxlevelFocus, a summary with lines for each sample is emitted. Otherwise, a histogram is returned. If per-sample information is desired, --noHist can be specified. In per-sample data, the suffix “-pt” indicates percentage, so a value of 0.02 is 0.0002 of the total number of reads for the sample. If --topN is specified, only the top N most abundant taxa are included in the histogram count or per-sample output. If a number is specified for --countThreshold, only taxa with that number of reads (or greater) are included. Full data is returned via --jsonOut (filtered by --topN and --countThreshold), whereas --csvOut returns a summary.

metagenomics.py taxlevel_summary [-h] [--jsonOut JSON_OUT] [--csvOut CSV_OUT]
                                 [--taxHeading TAX_HEADINGS [TAX_HEADINGS ...]]
                                 [--taxlevelFocus TAXLEVEL_FOCUS]
                                 [--topN TOP_N_ENTRIES]
                                 [--countThreshold COUNT_THRESHOLD]
                                 [--zeroFill] [--noHist] [--includeRoot]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmp_dir TMP_DIR]
                                 [--tmp_dirKeep]
                                 summary_files_in [summary_files_in ...]
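
The filtering and the “-pt” convention described above can be illustrated with a small sketch. It assumes that the count threshold is applied before the top-N cut and that ties are broken alphabetically; neither detail is confirmed by this reference, and the function names and taxon labels are hypothetical.

```python
def filter_taxa(counts, top_n=100, count_threshold=1):
    """Keep taxa with at least count_threshold reads, then the top_n by count."""
    kept = {name: n for name, n in counts.items() if n >= count_threshold}
    # Sort by descending read count; break ties by name for a stable order.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


def pct_to_fraction(pct):
    """Convert a "-pt" percentage value to a fraction of total reads."""
    return pct / 100.0


# Hypothetical per-taxon read counts for one sample.
counts = {"Taxon A": 40, "Taxon B": 3, "Taxon C": 0, "Taxon D": 12}
print(filter_taxa(counts, top_n=2, count_threshold=1))
print(pct_to_fraction(0.02))
```

Here "Taxon C" is dropped by the threshold and "Taxon B" by the top-N cut, leaving "Taxon A" and "Taxon D".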

2.5.2.17.1. Positional Arguments

summary_files_in: Kraken-format summary text file with tab-delimited taxonomic levels.

2.5.2.17.2. Named Arguments

--jsonOut

The path to a json file containing the relevant parsed summary data in json format.

--csvOut

The path to a csv file containing sample-specific counts.

--taxHeading

The taxonomic heading to analyze (default: ‘Viruses’). More than one can be specified.

Default: 'Viruses'

--taxlevelFocus

The taxonomic level to summarize (totals by genus, etc.) (default: ‘species’).

Default: 'species'

--topN

Only include the top N most abundant taxa by read count (default: 100)

Default: 100

--countThreshold

Minimum number of reads to be included (default: 1)

Default: 1

--zeroFill

When absent from a sample, write zeroes (rather than leaving blank).

Default: False

--noHist

Write out a report by-sample rather than a histogram.

Default: False

--includeRoot

Include the count of reads at the root level and the unclassified bin.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.18. taxlevel_plurality

Identifies the most abundant taxon (of any rank) contributing to a node of interest in the taxonomic tree. It is intended to highlight the primary contributor of taxonomic signal within a taxonomic category of interest, for example, the most abundant virus among all viruses.

metagenomics.py taxlevel_plurality [-h] [--min_reads MIN_READS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   summary_file tax_heading out_report
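
As a rough illustration of the idea, the sketch below scans a Kraken-style report and picks the descendant of a heading with the most directly assigned reads. It assumes the common six-column report layout (percentage, clade reads, direct reads, rank code, taxid, name indented two spaces per level), ranks by direct reads, and treats min_reads as an inclusive minimum. The actual subcommand may rank and filter differently, and the function name and the "Other virus" row are hypothetical.

```python
def plurality_taxon(report_lines, tax_heading, min_reads=1):
    """Return (name, direct_reads) for the top descendant of tax_heading, or None."""
    heading_depth = None
    best = None
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        direct_reads = int(fields[2])
        raw_name = fields[5]
        name = raw_name.strip()
        # Depth in the tree is encoded as two leading spaces per level.
        depth = (len(raw_name) - len(raw_name.lstrip(" "))) // 2
        if heading_depth is None:
            if name == tax_heading:
                heading_depth = depth
            continue
        if depth <= heading_depth:
            break  # We have left the subtree under tax_heading.
        if direct_reads >= min_reads and (best is None or direct_reads > best[1]):
            best = (name, direct_reads)
    return best


report = [
    "50.00\t50\t50\tU\t0\tunclassified",
    "50.00\t50\t0\tR\t1\troot",
    "30.00\t30\t0\tD\t10239\t  Viruses",
    "20.00\t20\t5\tF\t11308\t    Orthomyxoviridae",
    "15.00\t15\t15\tS\t11320\t      Influenza A virus",
    "10.00\t10\t10\tS\t12345\t    Other virus",
    "20.00\t20\t20\tD\t2\t  Bacteria",
]
print(plurality_taxon(report, "Viruses"))
```

Ranking by direct reads rather than clade totals is a design choice of this sketch: with clade totals, an immediate child of the heading would always win, because a parent's total is never smaller than that of its descendants.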

2.5.2.18.1. Positional Arguments

summary_file: input Kraken-format summary text file with tab-delimited taxonomic levels.
tax_heading: The taxonomic heading to analyze.
out_report: tab-delimited output file.

2.5.2.18.2. Named Arguments

--min_reads

Only include hits with more than min_reads (default: 1)

Default: 1

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.19. kb_extract

Runs kb extract on the input BAM file.

Args:
in_bam (str): Input BAM file.
index (str): Path to the kb index file.
t2g (str): Path to the transcript-to-gene mapping file.
targets (str): Comma-separated list of target sequences to extract.
protein (bool): True if sequence contains amino acids. Defaults to False.
out_dir (str): Output directory. Defaults to None.
h5ad (str): Path to the output h5ad file. Can pull IDs to extract from this file. Defaults to None.
threshold (int, optional): Minimum read count threshold for a target to be extracted. Defaults to 1.
threads (int, optional): Number of threads to use. Defaults to None.

metagenomics.py kb_extract [-h] [--index INDEX] [--t2g T2G]
                           [--out_dir OUT_DIR] [--protein] [--targets TARGETS]
                           [--h5ad H5AD] [--threshold THRESHOLD]
                           [--threads THREADS]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           in_bam
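
When calling this subcommand from a script, the argument list can be assembled programmatically. The following is a minimal sketch that uses only the options shown in the usage line above. The helper name and all file names are hypothetical, and the sketch assumes metagenomics.py is on the PATH.

```python
import shlex


def kb_extract_command(in_bam, index, t2g, targets, out_dir="kb_out", threads=None):
    """Build an argv list for `metagenomics.py kb_extract`."""
    cmd = [
        "metagenomics.py", "kb_extract",
        "--index", index,
        "--t2g", t2g,
        # --targets takes a single comma-separated string.
        "--targets", ",".join(targets),
        "--out_dir", out_dir,
    ]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    cmd.append(in_bam)  # The positional input BAM goes last.
    return cmd


cmd = kb_extract_command("reads.bam", "ref.idx", "t2g.txt", ["seqA", "seqB"], threads=4)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.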

2.5.2.19.1. Positional Arguments

in_bam: Input unaligned reads, BAM format.

2.5.2.19.2. Named Arguments

--index

kb index file.

--t2g

Transcript to gene mapping file.

--out_dir

Output directory (default: kb_out)

Default: 'kb_out'

--protein

True if sequence contains amino acids (default: False).

Default: False

--targets

Comma-separated list of target sequences to extract from input sequences.

--h5ad

Path to the output h5ad file. Can pull IDs to extract from this file.

--threshold

Minimum read count threshold for a target to be extracted (only used when extracting IDs from h5ad; default: 1)

Default: 1

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.20. kb_top_taxa

Identifies the most abundant taxon (of any rank) contributing to a taxonomic node of interest in kb count output.

It is intended to highlight the primary contributor of taxonomic signal within a taxonomic category of interest, for example, the most abundant virus among all viruses.

Args:
counts_tar (str): Path to the input kb count tarball (tar.zst format).
out_report (str): Path to the output report file.
id_to_tax_map (str, optional): Path to the ID to taxonomy mapping file (CSV format).
target_taxon (str): The taxonomic category to analyze (default: ‘Viruses’).

metagenomics.py kb_top_taxa [-h] [--id-to-tax-map ID_TO_TAX_MAP]
                            [--target-taxon TARGET_TAXON]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            counts_tar out_report

2.5.2.20.1. Positional Arguments

counts_tar: Input kb count tarball (tar.zst format).
out_report: Tab-delimited output file.

2.5.2.20.2. Named Arguments

--id-to-tax-map

ID to taxonomy mapping file (CSV format).

--target-taxon

Target taxonomic category to analyze (default: Viruses).

Default: 'Viruses'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.21. kb_merge_h5ads

Merge multiple kb count output tarballs into a single h5ad file with sample metadata.

Extracts h5ad files from counts_unfiltered folder and adds sample names from matrix.cells.

Args:
in_count_tars (list): List of input kb count tarballs (tar.zst format).
out_h5ad (str): Path to the output h5ad file.
tmp_dir (str, optional): Temporary directory for extraction.

metagenomics.py kb_merge_h5ads [-h] [--out-h5ad OUT_H5AD]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               in_count_tars [in_count_tars ...]

2.5.2.21.1. Positional Arguments

in_count_tars: Input kb count tarballs to merge (tar.zst format).

2.5.2.21.2. Named Arguments

--out-h5ad

Output merged h5ad file.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.22. krona_build

Builds a Krona taxonomy database.

metagenomics.py krona_build [-h] [--taxdump_tar_gz TAXDUMP_TAR_GZ]
                            [--get_accessions]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            db

2.5.2.22.1. Positional Arguments

db: Krona taxonomy database output directory.

2.5.2.22.2. Named Arguments

--taxdump_tar_gz

NCBI taxdump.tar.gz file

--get_accessions

Fetch NCBI accession to taxid mappings. This is not required for processing kraken1/2/uniq hits, only for BLAST hits, and adds a significant amount of time and database space (default false).

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.23. kraken2_build

Builds a kraken2 database from a library directory of fastas and a taxonomy db directory. The --subsetTaxonomy option allows shrinking the taxonomy to only include taxids associated with the library folders. For this to work, the library fastas must use standard ID names, such as >NC1234.1 accessions, >gi|123456789|ref|XXXX||, or the custom kraken name >kraken:taxid|1234|.

metagenomics.py kraken2_build [-h] [--tax_db TAX_DB]
                              [--taxdump_out TAXDUMP_OUT]
                              [--standard_libraries {archaea,bacteria,plasmid,viral,human,fungi,plant,protozoa,nr,nt,env_nr,env_nt,UniVec,UniVec_Core} [{archaea,bacteria,plasmid,viral,human,fungi,plant,protozoa,nr,nt,env_nr,env_nt,UniVec,UniVec_Core} ...]]
                              [--custom_libraries CUSTOM_LIBRARIES [CUSTOM_LIBRARIES ...]]
                              [--kmerLen KMERLEN]
                              [--minimizerLen MINIMIZERLEN]
                              [--minimizerSpaces MINIMIZERSPACES] [--protein]
                              [--maxDbSize MAXDBSIZE] [--threads THREADS]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              db
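
Before building, it can help to check that custom library headers follow one of the ID styles mentioned above. The sketch below classifies a FASTA header line with regular expressions. The patterns are inferred only from the three examples in the description and are an assumption, not the rules that kraken2 or metagenomics.py actually apply; the function name is hypothetical.

```python
import re

# Patterns inferred from the example headers; checked in this order.
HEADER_PATTERNS = {
    "kraken_taxid": re.compile(r"^>kraken:taxid\|\d+\|"),
    "gi": re.compile(r"^>gi\|\d+\|"),
    "accession": re.compile(r"^>[A-Z]{1,4}_?\d+\.\d+"),
}


def classify_header(header):
    """Return the name of the first matching ID style, or None if none match."""
    for label, pattern in HEADER_PATTERNS.items():
        if pattern.match(header):
            return label
    return None


for header in (">NC1234.1", ">gi|123456789|ref|XXXX||",
               ">kraken:taxid|1234|", ">sample_contig_7"):
    print(header, classify_header(header))
```

Headers that return None here would be worth inspecting by hand before relying on taxonomy subsetting.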

2.5.2.23.1. Positional Arguments

db: Kraken database output directory.

2.5.2.23.2. Named Arguments

--tax_db

Use pre-existing kraken2 taxonomy db structure

--taxdump_out

Save ncbi taxdump.tar.gz file

--standard_libraries

Possible choices: archaea, bacteria, plasmid, viral, human, fungi, plant, protozoa, nr, nt, env_nr, env_nt, UniVec, UniVec_Core

A list of “standard” kraken libraries to download on the fly and add.

--custom_libraries

Custom fasta files with properly formatted headers.

--kmerLen

k-mer length (kraken2 default: 35nt/15aa)

--minimizerLen

Minimizer length (kraken2 default: 31nt/12aa)

--minimizerSpaces

Number of characters in minimizer that are ignored in comparisons (kraken2 default: 7nt/0aa)

--protein

Build protein database (default false/nucleotide).

Default: False

--maxDbSize

Maximum db size in GB (default: none)

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.24. kb_build

Builds a kb index from a reference fasta file.

Args:
ref_fasta (str): Path to the reference sequence fasta file.
index (str): Path to the output kb index file.
workflow (str): Type of index to create. Options are ‘standard’, ‘nac’, ‘kite’, ‘custom’.
kmer_len (int): k-mer length (default: 31).
protein (bool): True if sequence contains amino acids (default: False).
threads (int): Number of threads to use (default: None).

metagenomics.py kb_build [-h] [--index INDEX]
                         [--workflow {standard,nac,kite,custom}]
                         [--kmer_len KMER_LEN] [--protein] [--threads THREADS]
                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                         [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                         ref_fasta

2.5.2.24.1. Positional Arguments

ref_fasta: Reference sequence fasta file.

2.5.2.24.2. Named Arguments

--index

kb output index file.

--workflow

Possible choices: standard, nac, kite, custom

Type of index to create (default: ‘standard’).

Default: 'standard'

--kmer_len

k-mer length (default: 31).

--protein

True if sequence contains amino acids (default: False).

Default: False

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.25. genomad

Classify viral and plasmid sequences using geNomad.

metagenomics.py genomad [-h] [--cleanup] [--restart]
                        [--filterPreset {conservative,relaxed}]
                        [--enableScoreCalibration]
                        [--composition {auto,metagenome,virome}]
                        [--minScore MIN_SCORE] [--maxFdr MAX_FDR]
                        [--minNumberGenes MIN_NUMBER_GENES]
                        [--maxUscg MAX_USCG] [--splits SPLITS]
                        [--threads THREADS]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                        in_fasta database out_dir

2.5.2.25.1. Positional Arguments

in_fasta: Input FASTA file with sequences to classify.
database: Path to geNomad database directory.
out_dir: Output directory for geNomad results.

2.5.2.25.2. Named Arguments

--cleanup

Delete intermediate files after execution.

Default: False

--restart

Overwrite existing intermediate files.

Default: False

--filterPreset, --filter-preset

Possible choices: conservative, relaxed

geNomad summary filtering preset.

--enableScoreCalibration, --enable-score-calibration

Execute geNomad score calibration module.

Default: False

--composition

Possible choices: auto, metagenome, virome

Sample composition for score calibration.

--minScore, --min-score

Minimum score to flag a sequence as virus or plasmid.

--maxFdr, --max-fdr

Maximum accepted false discovery rate.

--minNumberGenes, --min-number-genes

Minimum number of genes required for classification.

--maxUscg, --max-uscg

Maximum allowed universal single copy genes.

--splits

Split the MMseqs2 marker search to reduce memory usage.

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False