2.5. metagenomics.py - utilities for metagenomic analyses

This script contains a number of utilities for metagenomic analyses.

usage: metagenomics.py subcommand

2.5.1. subcommands

Possible choices: subset_taxonomy, filter_taxids_to_focal_hits, kraken2, kb, kma, kma_build, krona, report_merge, filter_bam_to_taxa, taxlevel_summary, taxlevel_plurality, kb_extract, kb_top_taxa, kb_merge_h5ads, krona_build, kraken2_build, kb_build

2.5.2. Sub-commands

2.5.2.1. subset_taxonomy

Generate a subset of the taxonomy db files filtered by the whitelist. Taxids given via --whitelistTaxids are added to the taxonomy together with their parents, while taxids given via --whitelistTreeTaxids are added together with both their parents and all of their children. Whitelist GIs and accessions can only be provided in file form, and the resulting gi/accession2taxid files will be filtered to include only the entries in the whitelist files. Finally, the taxids (plus parents) for those GIs/accessions are also included.
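The difference between the two whitelist modes can be sketched in a few lines of Python. This is an illustration only, not the script's implementation: `parents` is a hypothetical child-to-parent taxid map (such as one could parse from nodes.dmp), and the function names and toy taxonomy are invented for this example.

```python
from collections import defaultdict


def with_parents(taxids, parents):
    """Return the given taxids plus all of their ancestors."""
    keep = set()
    for taxid in taxids:
        # Walk upward until we hit a node already kept (the root is its own parent).
        while taxid not in keep:
            keep.add(taxid)
            taxid = parents.get(taxid, taxid)
    return keep


def with_parents_and_children(taxids, parents):
    """Return the given taxids plus all ancestors and all descendants."""
    children = defaultdict(list)
    for child, parent in parents.items():
        if child != parent:  # skip the root's self-reference
            children[parent].append(child)
    keep = with_parents(taxids, parents)
    stack = list(taxids)
    while stack:
        node = stack.pop()
        keep.add(node)
        stack.extend(children[node])
    return keep


# Hypothetical toy taxonomy: 1 is the root, 10 -> 100 -> {1000, 1001}, 20 is a sibling of 10.
parents = {1: 1, 10: 1, 20: 1, 100: 10, 1000: 100, 1001: 100}
print(sorted(with_parents([100], parents)))
print(sorted(with_parents_and_children([100], parents)))
```

In this toy tree, the first call keeps only taxid 100 and its ancestors, while the second also pulls in the two taxa below 100. The sibling branch (taxid 20) is left out in both cases.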

metagenomics.py subset_taxonomy [-h]
                                [--whitelistTaxids WHITELISTTAXIDS [WHITELISTTAXIDS ...]]
                                [--whitelistTaxidFile WHITELISTTAXIDFILE]
                                [--whitelistTreeTaxids WHITELISTTREETAXIDS [WHITELISTTREETAXIDS ...]]
                                [--whitelistTreeTaxidFile WHITELISTTREETAXIDFILE]
                                [--whitelistGiFile WHITELISTGIFILE]
                                [--whitelistAccessionFile WHITELISTACCESSIONFILE]
                                [--skipGi] [--skipAccession]
                                [--skipDeadAccession]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                taxDb outputDb

2.5.2.1.1. Positional Arguments

taxDb: Taxonomy database directory (containing nodes.dmp, parents.dmp etc.)
outputDb: Output taxonomy database directory

2.5.2.1.2. Named Arguments

--whitelistTaxids

List of taxids to add to taxonomy (with parents)

--whitelistTaxidFile

File containing taxids - one per line - to add to taxonomy with parents.

--whitelistTreeTaxids

List of taxids to add to taxonomy (with parents and children)

--whitelistTreeTaxidFile

File containing taxids - one per line - to add to taxonomy with parents and children.

--whitelistGiFile

File containing GIs - one per line - to add to taxonomy with nodes.

--whitelistAccessionFile

File containing accessions - one per line - to add to taxonomy with nodes.

--skipGi

Skip GI to taxid mapping files

Default: False

--skipAccession

Skip accession to taxid mapping files

Default: False

--skipDeadAccession

Skip dead accession to taxid mapping files

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False
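The --tmp_dir and --tmp_dirKeep behaviour described above (and repeated for most subcommands) can be pictured with the following sketch. It is an assumption-laden illustration rather than the script's actual code: the context manager `tmp_dir` and its `keep_on_error` parameter are hypothetical names chosen for this example.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def tmp_dir(base="/tmp", keep_on_error=False):
    """Create a temp directory under `base`; remove it unless a failure should preserve it."""
    path = tempfile.mkdtemp(dir=base)
    failed = False
    try:
        yield path
    except BaseException:
        failed = True
        raise
    finally:
        # Default: always clean up. With keep_on_error, leave the directory for debugging.
        if not (failed and keep_on_error):
            shutil.rmtree(path, ignore_errors=True)


base = tempfile.gettempdir()  # portable stand-in for '/tmp'

with tmp_dir(base) as d:
    normal = d  # removed when the block exits

try:
    with tmp_dir(base, keep_on_error=True) as d:
        kept = d
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

kept_existed = os.path.exists(kept)
print(os.path.exists(normal), kept_existed)
shutil.rmtree(kept)  # tidy up the directory that was deliberately preserved
```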

2.5.2.2. filter_taxids_to_focal_hits

Generate a subset of the taxids_tsv file filtered by the focal_report_tsv. Only rows from the taxids_tsv are emitted whose taxid is either contained within the focal_report_tsv or is a child/descendant of a node contained within it.
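A minimal sketch of this filtering rule is given below, assuming a hypothetical child-to-parent map `parents` and a hypothetical dictionary `focal_counts` of taxid to read count taken from the focal report. The helper names are invented for illustration and do not correspond to functions in metagenomics.py.

```python
def focal_taxids(focal_counts, min_read_count):
    """Ignore focal entries below min_read_count; keep the rest."""
    return {taxid for taxid, reads in focal_counts.items() if reads >= min_read_count}


def is_under(taxid, focal, parents):
    """True if taxid is a focal node or has a focal node among its ancestors."""
    seen = set()
    while taxid not in seen:
        if taxid in focal:
            return True
        seen.add(taxid)
        taxid = parents.get(taxid, taxid)  # the root (or an unknown taxid) maps to itself
    return False


def filter_rows(rows, focal, parents):
    """Keep rows whose first column is a taxid at or below a focal node."""
    return [row for row in rows if is_under(int(row[0]), focal, parents)]


# Hypothetical toy data.
parents = {1: 1, 10: 1, 20: 1, 100: 10}
focal = focal_taxids({10: 50, 20: 2}, min_read_count=5)
rows = [["100", "a"], ["20", "b"], ["10", "c"]]
print(filter_rows(rows, focal, parents))
```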

metagenomics.py filter_taxids_to_focal_hits [-h]
                                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                            [--version] [--tmp_dir TMP_DIR]
                                            [--tmp_dirKeep]
                                            taxids_tsv focal_report_tsv
                                            taxdb_dir min_read_count
                                            output_tsv

2.5.2.2.1. Positional Arguments

taxids_tsv: TSV file where first column is a taxid
focal_report_tsv: TSV produced by taxlevel_plurality
taxdb_dir: Taxonomy database directory (containing nodes.dmp, parents.dmp etc.)
min_read_count: ignore focal_report_tsv entries below this read count
output_tsv: Output TSV file where first column is a taxid

2.5.2.2.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.3. kraken2

Classify reads by taxon using Kraken2

metagenomics.py kraken2 [-h] [--outReports OUTREPORTS [OUTREPORTS ...]]
                        [--outReads OUTREADS [OUTREADS ...]]
                        [--minimum_hit_groups MINIMUM_HIT_GROUPS]
                        [--min_base_qual MIN_BASE_QUAL]
                        [--confidence CONFIDENCE] [--threads THREADS]
                        [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                        [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                        db inBams [inBams ...]
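As a usage illustration, the command line above can be assembled from Python as sketched here. All file and directory names are hypothetical, and the helper `kraken2_argv` is invented for this example. The positional arguments are placed before the options because --outReports and --outReads accept multiple values and would otherwise absorb the positionals that follow them.

```python
import shlex


def kraken2_argv(db, in_bams, out_reports=None, out_reads=None, threads=None):
    """Build an argument list for `metagenomics.py kraken2` (positionals first)."""
    argv = ["metagenomics.py", "kraken2", db, *in_bams]
    if out_reports:
        argv += ["--outReports", *out_reports]
    if out_reads:
        argv += ["--outReads", *out_reads]
    if threads is not None:
        argv += ["--threads", str(threads)]
    return argv


argv = kraken2_argv(
    "kraken2_db",                      # hypothetical database directory
    ["sample1.bam", "sample2.bam"],    # hypothetical unaligned BAM inputs
    out_reports=["sample1.report.txt", "sample2.report.txt"],
    threads=4,
)
print(shlex.join(argv))
# The list could then be passed to subprocess.run(argv, check=True).
```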

2.5.2.3.1. Positional Arguments

db: Kraken database directory.
inBams: Input unaligned reads, BAM format.

2.5.2.3.2. Named Arguments

--outReports

Kraken2 summary report output file. Multiple filenames space separated.

--outReads

Kraken2 per read classification output file. Multiple filenames space separated.

--minimum_hit_groups

Minimum hit groups (Kraken2 default: 2)

--min_base_qual

Minimum base quality (default None)

--confidence

Kraken2 confidence score threshold (default None)

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.4. kb

Runs kb count on the input BAM files.

Args:
in_bam (list): List of input BAM files.
out_dir (str): Output directory. Defaults to None.
index (str): Path to the kb index file.
t2g (list|str): Transcript-to-gene mapping file(s).
kmer_len (int, optional): K-mer size for the alignment. Defaults to 31.
parity (str, optional): Library parity. Defaults to ‘single’.
technology (str, optional): Sequencing technology used. Defaults to ‘bulk’.
h5ad (bool, optional): Whether to output HDF5 file. Defaults to False.
loom (bool, optional): Whether to output Loom file. Defaults to False.
protein (bool, optional): Whether the sequence contains amino acids. Defaults to False.
threads (int, optional): Number of threads to use. Defaults to None.

metagenomics.py kb [-h] [--index INDEX] [--t2g T2G] [--kmer_len KMER_LEN]
                   [--parity {single,paired}]
                   [--technology {10xv2,10xv3,10xv3-3prime,10xv3-5prime,dropseq,indrop,celseq,celseq2,smartseq2,bulk}]
                   [--h5ad] [--loom] [--protein] [--out_dir OUT_DIR]
                   [--threads THREADS]
                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                   [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                   in_bam

2.5.2.4.1. Positional Arguments

in_bam: Input unaligned reads, BAM format.

2.5.2.4.2. Named Arguments

--index

kb index file.

--t2g

Transcript-to-gene mapping file.

--kmer_len

k-mer size (default: 31bp)

Default: 31

--parity

Possible choices: single, paired

Library parity (default: single)

Default: 'single'

--technology

Possible choices: 10xv2, 10xv3, 10xv3-3prime, 10xv3-5prime, dropseq, indrop, celseq, celseq2, smartseq2, bulk

Technology used to generate the data (default: bulk)

Default: 'bulk'

--h5ad

Output HDF5 file (default: False)

Default: False

--loom

Output Loom file (default: False)

Default: False

--protein

True if sequence contains amino acids (default: False).

Default: False

--out_dir

Output directory (default: kb_out)

Default: 'kb_out'

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.5. kma

metagenomics.py kma [-h] [--outPrefixes OUTPREFIXES [OUTPREFIXES ...]]
                    [--threads THREADS]
                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                    [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                    db inBams [inBams ...]

2.5.2.5.1. Positional Arguments

db: KMA database prefix.
inBams: Input unaligned reads, BAM format.

2.5.2.5.2. Named Arguments

--outPrefixes

KMA output prefixes.

--threads

Number of threads.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.6. kma_build

metagenomics.py kma_build [-h] [--threads THREADS]
                          [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                          [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                          ref_fasta db_prefix

2.5.2.6.1. Positional Arguments

ref_fasta: Reference FASTA file.
db_prefix: Output database prefix.

2.5.2.6.2. Named Arguments

--threads

Number of threads.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.7. krona

Create an interactive HTML report from a tabular metagenomic report

metagenomics.py krona [-h] [--sample_name SAMPLE_NAME]
                      [--queryColumn QUERYCOLUMN] [--taxidColumn TAXIDCOLUMN]
                      [--scoreColumn SCORECOLUMN]
                      [--magnitudeColumn MAGNITUDECOLUMN] [--noHits]
                      [--noRank] [--inputType {tsv,kraken2}]
                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                      [--version]
                      inReports [inReports ...] db outHtml

2.5.2.7.1. Positional Arguments

inReports: Input report file (default: tsv)
db: Krona taxonomy database directory.
outHtml: Output html report.

2.5.2.7.2. Named Arguments

--sample_name

Title of dataset (default basename(inReport))

--queryColumn

Column of query id. (default 2)

Default: 2

--taxidColumn

Column of taxonomy id. (default 3)

Default: 3

--scoreColumn

Column of score. (default None)

--magnitudeColumn

Column of magnitude. (default None)

--noHits

Include wedge for no hits.

Default: False

--noRank

Include no rank assignments.

Default: False

--inputType

Possible choices: tsv, kraken2

Handling for specialized report types.

Default: 'tsv'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.5.2.8. report_merge

Merge multiple metagenomic reports into a single metagenomic report suitable for Krona input.

metagenomics.py report_merge [-h]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             metagenomic_reports [metagenomic_reports ...]
                             out_krona_input

2.5.2.8.1. Positional Arguments

metagenomic_reports: Input metagenomic reports with the query ID and taxon ID in the 2nd and 3rd columns (Kraken format)
out_krona_input: Output metagenomic report suitable for Krona input.

2.5.2.8.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.9. filter_bam_to_taxa

Filter an (already classified) input bam file to only include reads that have been mapped to specified taxonomic IDs or scientific names. This requires a classification file, as produced by tools such as Kraken, as well as the NCBI taxonomy database.
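The selection logic implied by this description and by the --exclude, --without-children, --read_id_col and --tax_id_col options can be sketched as follows. This is a simplified illustration under stated assumptions, not the script's implementation: it works on tab-separated text rather than a BAM file, `parents` is a hypothetical child-to-parent taxid map (as could be derived from nodes.dmp), and all function names are invented.

```python
import csv
import io


def read_taxid_by_read(tsv_text, read_id_col=1, tax_id_col=2):
    """Map read ID -> taxid from tab-separated text, using zero-indexed columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[read_id_col]: int(row[tax_id_col]) for row in reader if row}


def matches(taxid, targets, parents, without_children=False):
    """True if taxid is a target, or (unless without_children) lies below a target."""
    if without_children:
        return taxid in targets
    seen = set()
    while taxid not in seen:
        if taxid in targets:
            return True
        seen.add(taxid)
        taxid = parents.get(taxid, taxid)
    return False


def reads_to_keep(taxid_by_read, targets, parents, exclude=False, without_children=False):
    """Read IDs that survive filtering; `exclude` inverts the selection."""
    return {
        read_id
        for read_id, taxid in taxid_by_read.items()
        if matches(taxid, targets, parents, without_children) != exclude
    }


# Hypothetical classification lines: status, read ID, taxid.
classified = "C\tread1\t100\nC\tread2\t20\nU\tread3\t0\n"
parents = {1: 1, 10: 1, 20: 1, 100: 10}
by_read = read_taxid_by_read(classified)

print(reads_to_keep(by_read, {10}, parents))
print(reads_to_keep(by_read, {10}, parents, exclude=True))
print(reads_to_keep(by_read, {10}, parents, without_children=True))
```

With target taxid 10, the default call keeps read1 (classified to a child of 10), the exclude call keeps the other two reads, and the without-children call keeps nothing because no read is classified to taxid 10 itself.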

metagenomics.py filter_bam_to_taxa [-h] [--exclude]
                                   [--taxNames TAX_NAMES [TAX_NAMES ...]]
                                   [--taxIDs TAX_IDS [TAX_IDS ...]]
                                   [--without-children]
                                   [--read_id_col READ_ID_COL]
                                   [--tax_id_col TAX_ID_COL]
                                   [--out_count OUT_COUNT]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   in_bam read_IDs_to_tax_IDs out_bam
                                   nodes_dmp names_dmp

2.5.2.9.1. Positional Arguments

in_bam: Input bam file.
read_IDs_to_tax_IDs: TSV file mapping read IDs to taxIDs, Kraken-format by default. Assumes bijective mapping of read ID to tax ID.
out_bam: Output bam file, filtered to the taxa specified
nodes_dmp: nodes.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/
names_dmp: names.dmp file from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

2.5.2.9.2. Named Arguments

--exclude

Switch filtration to remove all reads falling under matching taxa (and keep all non-matching). Default is the inverse: keep all reads falling under matching taxa (and remove all non-matching).

Default: False

--taxNames

The taxonomic names to include. More than one can be specified. Mapped to Tax IDs by lowercase exact match only. Ex. “Viruses”. This is in addition to any taxonomic IDs provided.

--taxIDs

The NCBI taxonomy IDs to include. More than one can be specified. This is in addition to any taxonomic names provided.

--without-children

Omit reads classified more specifically than each taxon specified (without this a taxon and its children are included).

Default: False

--read_id_col

The (zero-indexed) number of the column in read_IDs_to_tax_IDs containing read IDs. (default: 1)

Default: 1

--tax_id_col

The (zero-indexed) number of the column in read_IDs_to_tax_IDs containing Taxonomy IDs. (default: 2)

Default: 2

--out_count

Write a file with the number of reads matching the specified taxa.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.10. taxlevel_summary

Aggregates taxonomic abundance data from multiple Kraken-format summary files. It is intended to report information on a particular taxonomic level (--taxlevelFocus; ex. ‘species’), within a higher-level grouping (--taxHeading; ex. ‘Viruses’). By default, when --taxHeading is at the same level as --taxlevelFocus, a summary with lines for each sample is emitted. Otherwise, a histogram is returned. If per-sample information is desired, --noHist can be specified. In per-sample data, the suffix “-pt” indicates percentage, so a value of 0.02 is 0.0002 of the total number of reads for the sample. If --topN is specified, only the top N most abundant taxa are included in the histogram count or per-sample output. If a number is specified for --countThreshold, only taxa with that number of reads (or greater) are included. Full data is returned via --jsonOut (filtered by --topN and --countThreshold), whereas --csvOut returns a summary.
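The "-pt" arithmetic and the effect of --topN and --countThreshold can be illustrated with a short sketch. The function names and the taxon counts are hypothetical, and the order in which the two filters are applied (threshold first, then top N) is an assumption made for this example.

```python
def pt_to_fraction(pt_value):
    """Convert a "-pt" (percentage) value to a fraction of the sample's reads."""
    return pt_value / 100.0


def filter_taxa(counts, top_n=100, count_threshold=1):
    """Keep taxa with at least count_threshold reads, then the top_n most abundant."""
    kept = [(name, n) for name, n in counts.items() if n >= count_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]


# Hypothetical per-taxon read counts for one sample.
counts = {"Virus A": 120, "Virus B": 3, "Virus C": 0, "Virus D": 45}
top = filter_taxa(counts, top_n=2, count_threshold=1)
print(top)
print(f"{pt_to_fraction(0.02):.4f}")
```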

metagenomics.py taxlevel_summary [-h] [--jsonOut JSON_OUT] [--csvOut CSV_OUT]
                                 [--taxHeading TAX_HEADINGS [TAX_HEADINGS ...]]
                                 [--taxlevelFocus TAXLEVEL_FOCUS]
                                 [--topN TOP_N_ENTRIES]
                                 [--countThreshold COUNT_THRESHOLD]
                                 [--zeroFill] [--noHist] [--includeRoot]
                                 [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                 [--version] [--tmp_dir TMP_DIR]
                                 [--tmp_dirKeep]
                                 summary_files_in [summary_files_in ...]

2.5.2.10.1. Positional Arguments

summary_files_in: Kraken-format summary text file with tab-delimited taxonomic levels.

2.5.2.10.2. Named Arguments

--jsonOut

The path to a json file containing the relevant parsed summary data in json format.

--csvOut

The path to a csv file containing sample-specific counts.

--taxHeading

The taxonomic heading to analyze (default: ‘Viruses’). More than one can be specified.

Default: 'Viruses'

--taxlevelFocus

The taxonomic heading to summarize (totals by Genus, etc.) (default: ‘species’).

Default: 'species'

--topN

Only include the top N most abundant taxa by read count (default: 100)

Default: 100

--countThreshold

Minimum number of reads to be included (default: 1)

Default: 1

--zeroFill

When absent from a sample, write zeroes (rather than leaving blank).

Default: False

--noHist

Write out a report by-sample rather than a histogram.

Default: False

--includeRoot

Include the count of reads at the root level and the unclassified bin.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.11. taxlevel_plurality

Identifies the most abundant taxon (of any rank) contributing to a node of interest in the taxonomic tree. It is intended to highlight the primary contributor of taxonomic signal within a taxonomic category of interest, for example, the most abundant virus among all viruses.

metagenomics.py taxlevel_plurality [-h] [--min_reads MIN_READS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   summary_file tax_heading out_report

2.5.2.11.1. Positional Arguments

summary_file: input Kraken-format summary text file with tab-delimited taxonomic levels.
tax_heading: The taxonomic heading to analyze.
out_report: tab-delimited output file.

2.5.2.11.2. Named Arguments

--min_reads

Only include hits with more than min_reads (default: 1)

Default: 1

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.12. kb_extract

Runs kb extract on the input BAM file.

Args:
in_bam (str): Input BAM file.
index (str): Path to the kb index file.
t2g (str): Path to the transcript-to-gene mapping file.
targets (str): Comma-separated list of target sequences to extract.
protein (bool): True if sequence contains amino acids. Defaults to False.
out_dir (str): Output directory. Defaults to None.
h5ad (str): Path to the output h5ad file. Can pull IDs to extract from this file. Defaults to None.
threshold (int, optional): Minimum read count threshold for a target to be extracted. Defaults to 1.
threads (int, optional): Number of threads to use. Defaults to None.

metagenomics.py kb_extract [-h] [--index INDEX] [--t2g T2G]
                           [--out_dir OUT_DIR] [--protein] [--targets TARGETS]
                           [--h5ad H5AD] [--threshold THRESHOLD]
                           [--threads THREADS]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           in_bam

2.5.2.12.1. Positional Arguments

in_bam: Input unaligned reads, BAM format.

2.5.2.12.2. Named Arguments

--index

kb index file.

--t2g

Transcript to gene mapping file.

--out_dir

Output directory (default: kb_out)

Default: 'kb_out'

--protein

True if sequence contains amino acids (default: False).

Default: False

--targets

Comma-separated list of target sequences to extract from input sequences.

--h5ad

Path to the output h5ad file. Can pull IDs to extract from this file.

--threshold

Minimum read count threshold for a target to be extracted (only used when extracting IDs from h5ad; default: 1)

Default: 1

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.13. kb_top_taxa

Identifies the most abundant taxon (of any rank) contributing to a taxa node of interest in kb count output.

It is intended to highlight the primary contributor of taxonomic signal within a taxonomic category of interest, for example, the most abundant virus among all viruses.

Args:
counts_tar (str): Path to the input kb count tarball (tar.zst format).
out_report (str): Path to the output report file.
id_to_tax_map (str, optional): Path to the ID to taxonomy mapping file (CSV format).
target_taxon (str): The taxonomic category to analyze (default: ‘Viruses’).

metagenomics.py kb_top_taxa [-h] [--id-to-tax-map ID_TO_TAX_MAP]
                            [--target-taxon TARGET_TAXON]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            counts_tar out_report

2.5.2.13.1. Positional Arguments

counts_tar: Input kb count tarball (tar.zst format).
out_report: Tab-delimited output file.

2.5.2.13.2. Named Arguments

--id-to-tax-map

ID to taxonomy mapping file (CSV format).

--target-taxon

Target taxonomic category to analyze (default: Viruses).

Default: 'Viruses'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.14. kb_merge_h5ads

Merge multiple kb count output tarballs into a single h5ad file with sample metadata.

Extracts h5ad files from counts_unfiltered folder and adds sample names from matrix.cells.

Args:
in_count_tars (list): List of input kb count tarballs (tar.zst format).
out_h5ad (str): Path to the output h5ad file.
tmp_dir (str, optional): Temporary directory for extraction.

metagenomics.py kb_merge_h5ads [-h] [--out-h5ad OUT_H5AD]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               in_count_tars [in_count_tars ...]

2.5.2.14.1. Positional Arguments

in_count_tars: Input kb count tarballs to merge (tar.zst format).

2.5.2.14.2. Named Arguments

--out-h5ad

Output merged h5ad file.

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.15. krona_build

Builds a Krona taxonomy database

metagenomics.py krona_build [-h] [--taxdump_tar_gz TAXDUMP_TAR_GZ]
                            [--get_accessions]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            db

2.5.2.15.1. Positional Arguments

db: Krona taxonomy database output directory.

2.5.2.15.2. Named Arguments

--taxdump_tar_gz

NCBI taxdump.tar.gz file

--get_accessions

Fetch NCBI accession to taxid mappings. This is not required for processing kraken1/2/uniq hits, only for BLAST hits, and adds a significant amount of time and database space (default false).

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.16. kraken2_build

Builds a kraken2 database from a library directory of fastas and a taxonomy db directory. The --subsetTaxonomy option allows shrinking the taxonomy to only include taxids associated with the library folders. For this to work, the library fastas must have the standard id names such as >NC1234.1 accessions, >gi|123456789|ref|XXXX||, or the custom kraken name >kraken:taxid|1234|.
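To make the three header styles concrete, the sketch below classifies FASTA header lines with regular expressions. The patterns are an approximation written for this example (for instance, the accession pattern simply expects letters or underscores, digits, a dot and a version number); they are not taken from kraken2 or from metagenomics.py, and `classify_header` is a hypothetical helper.

```python
import re

# Approximate patterns for the three header styles named above (assumptions).
KRAKEN_TAXID = re.compile(r"^>kraken:taxid\|(\d+)\|")
GI = re.compile(r"^>gi\|(\d+)\|")
ACCESSION = re.compile(r"^>([A-Za-z_]+\d+\.\d+)")


def classify_header(header):
    """Return (kind, identifier) for a FASTA header line, or ('unknown', None)."""
    for kind, pattern in (("taxid", KRAKEN_TAXID), ("gi", GI), ("accession", ACCESSION)):
        match = pattern.match(header)
        if match:
            return kind, match.group(1)
    return "unknown", None


for header in (
    ">NC1234.1 example sequence",
    ">gi|123456789|ref|XXXX||",
    ">kraken:taxid|1234|",
    ">my_custom_contig",
):
    print(header, "->", classify_header(header))
```

A header that matches none of the patterns (the last one here) is the kind of entry that could not be linked to a taxid when the taxonomy is subset.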

metagenomics.py kraken2_build [-h] [--tax_db TAX_DB]
                              [--taxdump_out TAXDUMP_OUT]
                              [--standard_libraries {archaea,bacteria,plasmid,viral,human,fungi,plant,protozoa,nr,nt,env_nr,env_nt,UniVec,UniVec_Core} [{archaea,bacteria,plasmid,viral,human,fungi,plant,protozoa,nr,nt,env_nr,env_nt,UniVec,UniVec_Core} ...]]
                              [--custom_libraries CUSTOM_LIBRARIES [CUSTOM_LIBRARIES ...]]
                              [--kmerLen KMERLEN]
                              [--minimizerLen MINIMIZERLEN]
                              [--minimizerSpaces MINIMIZERSPACES] [--protein]
                              [--maxDbSize MAXDBSIZE] [--threads THREADS]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              db

2.5.2.16.1. Positional Arguments

db: Kraken database output directory.

2.5.2.16.2. Named Arguments

--tax_db

Use pre-existing kraken2 taxonomy db structure

--taxdump_out

Save ncbi taxdump.tar.gz file

--standard_libraries

Possible choices: archaea, bacteria, plasmid, viral, human, fungi, plant, protozoa, nr, nt, env_nr, env_nt, UniVec, UniVec_Core

A list of “standard” kraken libraries to download on the fly and add.

--custom_libraries

Custom fasta files with properly formatted headers.

--kmerLen

k-mer length (kraken2 default: 35nt/15aa)

--minimizerLen

Minimizer length (kraken2 default: 31nt/12aa)

--minimizerSpaces

Number of characters in minimizer that are ignored in comparisons (kraken2 default: 7nt/0aa)

--protein

Build protein database (default false/nucleotide).

Default: False

--maxDbSize

Maximum db size in GB (default: none)

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.5.2.17. kb_build

Builds a kb index from a reference fasta file.

Args:
ref_fasta (str): Path to the reference sequence fasta file.
index (str): Path to the output kb index file.
workflow (str): Type of index to create. Options are ‘standard’, ‘nac’, ‘kite’, ‘custom’.
kmer_len (int): k-mer length (default: 31).
protein (bool): True if sequence contains amino acids (default: False).
threads (int): Number of threads to use (default: None).

metagenomics.py kb_build [-h] [--index INDEX]
                         [--workflow {standard,nac,kite,custom}]
                         [--kmer_len KMER_LEN] [--protein] [--threads THREADS]
                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                         [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                         ref_fasta

2.5.2.17.1. Positional Arguments

ref_fasta: Reference sequence fasta file.

2.5.2.17.2. Named Arguments

--index

kb output index file.

--workflow

Possible choices: standard, nac, kite, custom

Type of index to create (default: ‘standard’).

Default: 'standard'

--kmer_len

k-mer length (default: 31).

--protein

True if sequence contains amino acids (default: False).

Default: False

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False