2.6. kmer_utils.py - commands for working with sets of kmers
Commands for working with sets of kmers
usage: kmer_utils.py subcommand
2.6.2. Sub-commands
2.6.2.1. build_kmer_db
Build a database of kmers occurring in given sequences.
kmer_utils.py build_kmer_db [-h] [--kmerSize KMER_SIZE] [--minOccs MIN_OCCS]
[--maxOccs MAX_OCCS] [--counterCap COUNTER_CAP]
[--singleStrand] [--memLimitGb MEM_LIMIT_GB]
[--memLimitLaxness {0,1,2}] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
seq_files [seq_files ...] kmer_db
2.6.2.1.1. Positional Arguments
- seq_files
Files from which to extract kmers (fasta/fastq/bam, fasta/fastq may be .gz or .bz2)
- kmer_db
kmer database (with or without .kmc_pre/.kmc_suf suffix)
2.6.2.1.2. Named Arguments
- --kmerSize, -k
kmer size
Default: 25
- --minOccs, -ci
drop kmers with fewer than this many occurrences
Default: 1
- --maxOccs, -cx
drop kmers with more than this many occurrences
Default: 2147483647
- --counterCap, -cs
cap kmer counts at this value
Default: 255
- --singleStrand, -b
do not add kmers from reverse complements of input sequences
Default: False
- --memLimitGb
Max memory to use, in GB
Default: 8
- --memLimitLaxness
Possible choices: 0, 1, 2
How strict is --memLimitGb? 0=strict, 1=lax, 2=even more lax
Default: 0
- --threads
Number of threads; by default all cores are used
Default: 2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default: 'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
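To make the interplay of the counting options concrete, here is a minimal pure-Python sketch of what a kmer database conceptually holds. It is not how kmer_utils.py is implemented (the tool delegates to kmc). The function name `count_kmers` is hypothetical, and two points are assumptions rather than documented behavior: that a kmer and its reverse complement are merged under the lexicographically smaller of the two, and that the occurrence limits are applied to the raw counts before the cap.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seqs, kmer_size=25, min_occs=1, max_occs=2147483647,
                counter_cap=255, single_strand=False):
    """Hypothetical illustration of the build_kmer_db options on in-memory sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - kmer_size + 1):
            kmer = seq[i:i + kmer_size]
            if set(kmer) - set("ACGT"):
                continue  # skip kmers containing N or other ambiguity codes
            if not single_strand:
                # Assumption: merge a kmer with its reverse complement
                # under the lexicographically smaller representative.
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    # Assumption: filter on raw counts first, then cap the stored counter.
    return {k: min(c, counter_cap)
            for k, c in counts.items()
            if min_occs <= c <= max_occs}


print(count_kmers(["ACGT"], kmer_size=2))
print(count_kmers(["ACGT"], kmer_size=2, single_strand=True))
```

With the default double-stranded counting, `AC` and `GT` are reverse complements of each other and share one counter; with `single_strand=True` they are counted separately.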
2.6.2.2. dump_kmer_counts
Dump kmers and their counts from a kmer database to a text file.
kmer_utils.py dump_kmer_counts [-h] [--minOccs MIN_OCCS] [--maxOccs MAX_OCCS]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
kmer_db out_kmers
2.6.2.2.1. Positional Arguments
- kmer_db
kmer database (with or without .kmc_pre/.kmc_suf suffix)
- out_kmers
text file to which to write the kmers
2.6.2.2.2. Named Arguments
- --minOccs, -ci
drop kmers with fewer than this many occurrences
Default: 1
- --maxOccs, -cx
drop kmers with more than this many occurrences
Default: 2147483647
- --threads
Number of threads; by default all cores are used
Default: 2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default: 'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
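The following sketch illustrates the effect of --minOccs and --maxOccs on a dump, using an in-memory dict in place of a real kmer database. The layout of the output file is not specified in this documentation; the one-line-per-kmer, tab-separated `kmer<TAB>count` layout used here is an assumption, and the function name is hypothetical.

```python
import io


def dump_kmer_counts(counts, out, min_occs=1, max_occs=2147483647):
    """Write one 'kmer<TAB>count' line (assumed layout) per kmer within the limits.

    Returns the number of lines written.
    """
    n_written = 0
    for kmer in sorted(counts):
        count = counts[kmer]
        if min_occs <= count <= max_occs:
            out.write(f"{kmer}\t{count}\n")
            n_written += 1
    return n_written


buf = io.StringIO()
dump_kmer_counts({"AC": 2, "CG": 1, "GG": 7}, buf, min_occs=2)
print(buf.getvalue(), end="")
```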
2.6.2.3. filter_reads
Filter reads based on their kmer contents.
Can also be used to filter contigs or reference sequences, but we’ll refer to filtering of reads in the documentation.
Note that “occurrence of a kmer” means “occurrence of the kmer or its reverse complement” if kmer_db was built with single_strand==False.
- Inputs:
  - kmer_db: the kmc kmer database
  - in_reads: the reads to filter. Can be a .fasta, .fastq, or .bam; fasta or fastq can be compressed with gzip or bzip2. If a .bam, a read pair is kept if either mate passes the filter.
- Outputs:
  - out_reads: file to which filtered reads are written. The type is determined from the extension; the same types as above are supported.
- Params:
  - db_min_occs: only consider database kmers with at least this count
  - db_max_occs: only consider database kmers with at most this count
  - read_min_occs: only keep reads with at least this many occurrences of kmers from the database
  - read_max_occs: only keep reads with no more than this many occurrences of kmers from the database
  - read_min_occs_frac: only keep reads with at least this many occurrences of kmers from the database, interpreted as a fraction of read length in kmers
  - read_max_occs_frac: only keep reads with no more than this many occurrences of kmers from the database, interpreted as a fraction of read length in kmers
    (Note: minimal read kmer content can be given as absolute counts or fraction of read length, but not both).
  - hard_mask: if True, in the output reads, kmers not passing the filter are replaced by Ns
  - threads: use this many threads
kmer_utils.py filter_reads [-h] [--dbMinOccs DB_MIN_OCCS]
[--dbMaxOccs DB_MAX_OCCS]
[--readMinOccs READ_MIN_OCCS]
[--readMaxOccs READ_MAX_OCCS]
[--readMinOccsFrac READ_MIN_OCCS_FRAC]
[--readMaxOccsFrac READ_MAX_OCCS_FRAC] [--hardMask]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
kmer_db in_reads out_reads
2.6.2.3.1. Positional Arguments
- kmer_db
kmer database (with or without .kmc_pre/.kmc_suf suffix)
- in_reads
input reads, as fasta/fastq/bam
- out_reads
output reads
2.6.2.3.2. Named Arguments
- --dbMinOccs
ignore database kmers with count below this
Default: 1
- --dbMaxOccs
ignore database kmers with count above this
Default: 2147483647
- --readMinOccs
filter out reads with fewer than this many db kmers
Default: 0
- --readMaxOccs
filter out reads with more than this many db kmers
Default: 2147483647
- --readMinOccsFrac
filter out reads with fewer than this many db kmers, interpreted as fraction of read length
Default: 0.0
- --readMaxOccsFrac
filter out reads with more than this many db kmers, interpreted as fraction of read length
Default: 1.0
- --hardMask
In the output reads, mask the invalid kmers
Default: False
- --threads
Number of threads; by default all cores are used
Default: 2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default: 'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
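The filtering rule can be sketched in a few lines of Python. This is an illustration of the documented parameters, not the tool's implementation. The helper names are hypothetical, and several details are assumptions: the database is double-stranded with the lexicographically smaller strand as the stored key, "read length in kmers" is taken to be `len(read) - kmer_size + 1`, and all thresholds are inclusive. Hard masking and .bam pair handling are left out.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Assumed database key: the smaller of a kmer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_db_kmers(read, db_counts, kmer_size,
                   db_min_occs=1, db_max_occs=2147483647):
    """Return (hits, n_kmers): read kmers found in the db within the count limits."""
    n_kmers = max(len(read) - kmer_size + 1, 0)
    hits = 0
    for i in range(n_kmers):
        count = db_counts.get(canonical(read[i:i + kmer_size].upper()))
        if count is not None and db_min_occs <= count <= db_max_occs:
            hits += 1
    return hits, n_kmers


def read_passes(read, db_counts, kmer_size,
                db_min_occs=1, db_max_occs=2147483647,
                read_min_occs=0, read_max_occs=2147483647,
                read_min_occs_frac=0.0, read_max_occs_frac=1.0):
    """Hypothetical filter: keep the read only if its hit count is within all limits."""
    hits, n_kmers = count_db_kmers(read, db_counts, kmer_size,
                                   db_min_occs, db_max_occs)
    return (read_min_occs <= hits <= read_max_occs
            and read_min_occs_frac * n_kmers <= hits <= read_max_occs_frac * n_kmers)


db = {"AC": 3, "CG": 1}
print(count_db_kmers("ACGT", db, kmer_size=2))
print(read_passes("ACGT", db, kmer_size=2, read_min_occs_frac=0.5))
```

With the default values the absolute and fractional limits are both no-ops, so a caller would normally set only one kind of minimum, in line with the note that the two cannot be combined.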
2.6.2.4. kmers_binary_op
Perform a simple binary operation on kmer sets.
kmer_utils.py kmers_binary_op [-h] [--resultMinOccs RESULT_MIN_OCCS]
[--resultMaxOccs RESULT_MAX_OCCS]
[--resultCounterCap RESULT_COUNTER_CAP]
[--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
{intersect,union,kmers_subtract,counters_subtract}
kmer_db1 kmer_db2 kmer_db_out
2.6.2.4.1. Positional Arguments
- op
Possible choices: intersect, union, kmers_subtract, counters_subtract
binary operation to perform
- kmer_db1
first kmer set
- kmer_db2
second kmer set
- kmer_db_out
output kmer db
2.6.2.4.2. Named Arguments
- --resultMinOccs
from the result drop kmers with counts below this
Default: 1
- --resultMaxOccs
from the result drop kmers with counts above this
Default: 2147483647
- --resultCounterCap
cap output counters at this value
Default: 255
- --threads
Number of threads; by default all cores are used
Default: 2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default: 'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
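This documentation does not spell out how counts are combined by each operation. The sketch below models the four operations on plain dicts under assumed semantics: intersect keeps the smaller of the two counts, union sums them, kmers_subtract keeps kmers of the first set that are absent from the second, and counters_subtract subtracts counts and drops kmers whose result is not positive. Treat these rules, and the order in which the result limits and cap are applied, as assumptions to verify against the underlying kmc tooling; the function is hypothetical.

```python
def kmers_binary_op(op, db1, db2, result_min_occs=1,
                    result_max_occs=2147483647, result_counter_cap=255):
    """Hypothetical dict-based model of kmers_binary_op (assumed count rules)."""
    if op == "intersect":
        # Assumption: shared kmers keep the smaller count.
        result = {k: min(c, db2[k]) for k, c in db1.items() if k in db2}
    elif op == "union":
        # Assumption: counts of kmers present in both sets are summed.
        result = {k: db1.get(k, 0) + db2.get(k, 0)
                  for k in db1.keys() | db2.keys()}
    elif op == "kmers_subtract":
        # Kmers of db1 that do not occur in db2 at all.
        result = {k: c for k, c in db1.items() if k not in db2}
    elif op == "counters_subtract":
        # Assumption: subtract counts; kmers with a non-positive result vanish.
        diff = {k: c - db2.get(k, 0) for k, c in db1.items()}
        result = {k: c for k, c in diff.items() if c > 0}
    else:
        raise ValueError(f"unknown operation: {op}")
    # Assumption: apply the result limits first, then cap the counters.
    return {k: min(c, result_counter_cap)
            for k, c in result.items()
            if result_min_occs <= c <= result_max_occs}


db1 = {"AA": 5, "AC": 2}
db2 = {"AA": 3, "AG": 4}
for op in ("intersect", "union", "kmers_subtract", "counters_subtract"):
    print(op, sorted(kmers_binary_op(op, db1, db2).items()))
```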
2.6.2.5. kmers_set_counts
Copy the kmer database, setting all kmer counts in the output to the given value.
kmer_utils.py kmers_set_counts [-h] [--threads THREADS]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
kmer_db_in value kmer_db_out
2.6.2.5.1. Positional Arguments
- kmer_db_in
input kmer db
- value
all kmer counts in the output will be set to this value
- kmer_db_out
output kmer db
2.6.2.5.2. Named Arguments
- --threads
Number of threads; by default all cores are used
Default: 2
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output. [default: ‘INFO’]
Default: 'INFO'
- --version, -V
show program’s version number and exit
- --tmp_dir
Base directory for temp files. [default: ‘/tmp’]
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False