3.7. kmer_utils.py - utilities that manipulate kmer sets and use them to filter reads or reference sequences

Commands for working with sets of kmers

usage: kmer_utils.py subcommand
Sub-commands:
build_kmer_db

Build a database of kmers occurring in given sequences.

usage: kmer_utils.py build_kmer_db [-h] [--kmerSize KMER_SIZE]
                                   [--minOccs MIN_OCCS] [--maxOccs MAX_OCCS]
                                   [--counterCap COUNTER_CAP] [--singleStrand]
                                   [--memLimitGb MEM_LIMIT_GB]
                                   [--memLimitLaxness {0,1,2}]
                                   [--threads THREADS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]
                                   seq_files [seq_files ...] kmer_db
Positional arguments:
seq_files Files from which to extract kmers (fasta/fastq/bam, fasta/fastq may be .gz or .bz2)
kmer_db kmer database (with or without .kmc_pre/.kmc_suf suffix)
Options:
--kmerSize=25, -k=25
 kmer size
--minOccs=1, -ci=1
 drop kmers with fewer than this many occurrences
--maxOccs=2147483647, -cx=2147483647
 drop kmers with more than this many occurrences
--counterCap=255, -cs=255
 cap kmer counts at this value
--singleStrand=False, -b=False
 do not add kmers from reverse complements of input sequences
--memLimitGb=8 Max memory to use, in GB
--memLimitLaxness=0
 

How strict is --memLimitGb? 0=strict, 1=lax, 2=even more lax

Possible choices: 0, 1, 2

--threads Number of threads (default: all available cores)
--loglevel=INFO
 

Verbosity of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
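To make the effect of the build_kmer_db options concrete, here is a minimal pure-Python sketch of kmer counting. It is an illustration only, not how kmer_utils.py or the underlying KMC database does the work. The function names are hypothetical, and several details are assumptions: a kmer and its reverse complement are merged under the lexicographically smaller of the two, kmers containing non-ACGT characters are skipped, and the occurrence filters are applied to the uncapped count before the cap.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seqs, kmer_size=25, min_occs=1, max_occs=2147483647,
                counter_cap=255, single_strand=False):
    """Hypothetical stand-in for build_kmer_db on in-memory sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - kmer_size + 1):
            kmer = seq[i:i + kmer_size]
            if set(kmer) - set("ACGT"):
                continue  # assumption: kmers with ambiguous bases are skipped
            if not single_strand:
                # assumption: both strands share one canonical representative
                kmer = min(kmer, reverse_complement(kmer))
            counts[kmer] += 1
    # assumption: filter on the true count first, then cap the stored counter
    return {k: min(c, counter_cap)
            for k, c in counts.items() if min_occs <= c <= max_occs}


print(count_kmers(["ACGTAC"], kmer_size=3))
# → {'ACG': 2, 'GTA': 2}
```

With single_strand=True the same input yields four distinct kmers, each with count 1, because reverse complements are no longer merged.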
dump_kmer_counts

Dump kmers and their counts from kmer database to a text file

usage: kmer_utils.py dump_kmer_counts [-h] [--minOccs MIN_OCCS]
                                      [--maxOccs MAX_OCCS] [--threads THREADS]
                                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                      [--version] [--tmp_dir TMP_DIR]
                                      [--tmp_dirKeep]
                                      kmer_db out_kmers
Positional arguments:
kmer_db kmer database (with or without .kmc_pre/.kmc_suf suffix)
out_kmers text file to which to write the kmers
Options:
--minOccs=1, -ci=1
 drop kmers with fewer than this many occurrences
--maxOccs=2147483647, -cx=2147483647
 drop kmers with more than this many occurrences
--threads Number of threads (default: all available cores)
--loglevel=INFO
 

Verbosity of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
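The text file written by dump_kmer_counts can be post-processed with a few lines of Python. The sketch below assumes one kmer per line followed by whitespace and its integer count; this page does not state the exact layout, so check a real dump before relying on it. The function name read_kmer_counts is hypothetical.

```python
import io


def read_kmer_counts(handle, min_occs=1, max_occs=2147483647):
    """Parse 'kmer<whitespace>count' lines into a dict (assumed layout)."""
    counts = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        kmer, count = line.split()
        count = int(count)
        if min_occs <= count <= max_occs:
            counts[kmer] = count
    return counts


# An in-memory stand-in for a dumped file.
example = io.StringIO("ACG\t2\nGTA\t7\n")
print(read_kmer_counts(example, min_occs=3))
# → {'GTA': 7}
```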
filter_reads

Filter reads based on their kmer contents. This command can also be used to filter contigs or reference sequences, but the documentation refers to filtering of reads throughout. Note that “occurrence of a kmer” means “occurrence of the kmer or its reverse complement” if kmer_db was built with single_strand==False.

Inputs:
kmer_db the KMC kmer database
in_reads the reads to filter. Can be a .fasta, .fastq, or .bam file; fasta or fastq can be compressed with gzip or bzip2. For a .bam, a read pair is kept if either mate passes the filter.

Outputs:
out_reads file to which filtered reads are written. The type is determined from the extension; the same types as above are supported.

Params:
db_min_occs only consider database kmers with at least this count
db_max_occs only consider database kmers with at most this count
read_min_occs only keep reads with at least this many occurrences of kmers from the database
read_max_occs only keep reads with no more than this many occurrences of kmers from the database
read_min_occs_frac only keep reads with at least this many occurrences of kmers from the database, interpreted as a fraction of read length in kmers
read_max_occs_frac only keep reads with no more than this many occurrences of kmers from the database, interpreted as a fraction of read length in kmers
hard_mask if True, in the output reads, kmers not passing the filter are replaced by Ns
threads use this many threads

Note: minimal read kmer content can be given as absolute counts or as a fraction of read length, but not both.

usage: kmer_utils.py filter_reads [-h] [--dbMinOccs DB_MIN_OCCS]
                                  [--dbMaxOccs DB_MAX_OCCS]
                                  [--readMinOccs READ_MIN_OCCS]
                                  [--readMaxOccs READ_MAX_OCCS]
                                  [--readMinOccsFrac READ_MIN_OCCS_FRAC]
                                  [--readMaxOccsFrac READ_MAX_OCCS_FRAC]
                                  [--hardMask] [--threads THREADS]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version] [--tmp_dir TMP_DIR]
                                  [--tmp_dirKeep]
                                  kmer_db in_reads out_reads
Positional arguments:
kmer_db kmer database (with or without .kmc_pre/.kmc_suf suffix)
in_reads input reads, as fasta/fastq/bam
out_reads output reads
Options:
--dbMinOccs=1 ignore database kmers with count below this
--dbMaxOccs=2147483647
 ignore database kmers with count above this
--readMinOccs=0
 filter out reads with fewer than this many db kmers
--readMaxOccs=2147483647
 filter out reads with more than this many db kmers
--readMinOccsFrac=0.0
 filter out reads with fewer than this many db kmers, interpreted as fraction of read length
--readMaxOccsFrac=1.0
 filter out reads with more than this many db kmers, interpreted as fraction of read length
--hardMask=False
 In the output reads, mask the invalid kmers
--threads Number of threads (default: all available cores)
--loglevel=INFO
 

Verbosity of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
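The read-level filter can be sketched in plain Python as follows. This is a simplified model under stated assumptions, not the tool's implementation: the database is a dict keyed by canonical kmers (the lexicographically smaller of a kmer and its reverse complement), "read length in kmers" is taken to be len(read) - kmer_size + 1, fractional thresholds are compared without rounding, and the absolute and fractional bounds are simply both enforced. The function names are hypothetical, and hard masking and .bam pair handling are left out.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_db_hits(read, db, kmer_size, db_min_occs=1,
                  db_max_occs=2147483647, single_strand=False):
    """Return (kmers of the read found in db, total kmers in the read)."""
    read = read.upper()
    n_kmers = max(len(read) - kmer_size + 1, 0)
    hits = 0
    for i in range(n_kmers):
        kmer = read[i:i + kmer_size]
        if not single_strand:
            kmer = min(kmer, reverse_complement(kmer))
        # only database kmers whose count lies in [db_min_occs, db_max_occs]
        if kmer in db and db_min_occs <= db[kmer] <= db_max_occs:
            hits += 1
    return hits, n_kmers


def keep_read(read, db, kmer_size, read_min_occs=0, read_max_occs=2147483647,
              read_min_occs_frac=0.0, read_max_occs_frac=1.0, **db_opts):
    """Decide whether a read passes the filter (assumed semantics)."""
    hits, n_kmers = count_db_hits(read, db, kmer_size, **db_opts)
    return (read_min_occs <= hits <= read_max_occs
            and read_min_occs_frac * n_kmers <= hits
            <= read_max_occs_frac * n_kmers)


db = {"ACG": 2, "GTA": 2}
print(keep_read("ACGTAC", db, 3, read_min_occs=1))       # → True
print(keep_read("TTTTTT", db, 3, read_min_occs=1))       # → False
print(keep_read("ACGTTT", db, 3, read_min_occs_frac=0.5))  # → True
```

In the last call, two of the read's four kmers (ACG, and CGT via its reverse complement) are in the database, so the read just meets a minimum fraction of 0.5.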
kmers_binary_op

Perform a simple binary operation on kmer sets.

usage: kmer_utils.py kmers_binary_op [-h] [--resultMinOccs RESULT_MIN_OCCS]
                                     [--resultMaxOccs RESULT_MAX_OCCS]
                                     [--resultCounterCap RESULT_COUNTER_CAP]
                                     [--threads THREADS]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version] [--tmp_dir TMP_DIR]
                                     [--tmp_dirKeep]
                                     {intersect,union,kmers_subtract,counters_subtract}
                                     kmer_db1 kmer_db2 kmer_db_out
Positional arguments:
op

binary operation to perform

Possible choices: intersect, union, kmers_subtract, counters_subtract

kmer_db1 first kmer set
kmer_db2 second kmer set
kmer_db_out output kmer db
Options:
--resultMinOccs=1
 from the result drop kmers with counts below this
--resultMaxOccs=2147483647
 from the result drop kmers with counts above this
--resultCounterCap=255
 cap output counters at this value
--threads Number of threads (default: all available cores)
--loglevel=INFO
 

Verbosity of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
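The four operations can be modelled on plain dicts that map kmers to counts. This page does not say how counts are combined, so the rules below are assumptions chosen for illustration: intersect keeps the smaller count, union adds counts, kmers_subtract keeps kmers of the first set that are absent from the second, and counters_subtract subtracts counts (non-positive results then fall below the default resultMinOccs of 1 and are dropped). The function mirrors the subcommand name only for readability.

```python
def kmers_binary_op(op, db1, db2, result_min_occs=1,
                    result_max_occs=2147483647, result_counter_cap=255):
    """Combine two {kmer: count} dicts (assumed count semantics)."""
    if op == "intersect":
        merged = {k: min(db1[k], db2[k])
                  for k in sorted(db1.keys() & db2.keys())}
    elif op == "union":
        merged = {k: db1.get(k, 0) + db2.get(k, 0)
                  for k in sorted(db1.keys() | db2.keys())}
    elif op == "kmers_subtract":
        merged = {k: c for k, c in db1.items() if k not in db2}
    elif op == "counters_subtract":
        merged = {k: c - db2.get(k, 0) for k, c in db1.items()}
    else:
        raise ValueError("unknown op: " + op)
    # drop counts outside the result range, then cap the survivors
    return {k: min(c, result_counter_cap)
            for k, c in merged.items()
            if result_min_occs <= c <= result_max_occs}


a = {"AAA": 3, "ACG": 1}
b = {"AAA": 1, "GTA": 5}
print(kmers_binary_op("intersect", a, b))          # → {'AAA': 1}
print(kmers_binary_op("union", a, b))              # → {'AAA': 4, 'ACG': 1, 'GTA': 5}
print(kmers_binary_op("kmers_subtract", a, b))     # → {'ACG': 1}
print(kmers_binary_op("counters_subtract", a, b))  # → {'AAA': 2, 'ACG': 1}
```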
kmers_set_counts

Copy the kmer database, setting all kmer counts in the output to the given value.

usage: kmer_utils.py kmers_set_counts [-h] [--threads THREADS]
                                      [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                      [--version] [--tmp_dir TMP_DIR]
                                      [--tmp_dirKeep]
                                      kmer_db_in value kmer_db_out
Positional arguments:
kmer_db_in input kmer db
value all kmer counts in the output will be set to this value
kmer_db_out output kmer db
Options:
--threads Number of threads (default: all available cores)
--loglevel=INFO
 

Verbosity of output. [default: INFO]

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.