3.7. kmer_utils.py - utilities that manipulate kmer sets and use them to filter reads or reference sequences
Commands for working with sets of kmers
usage: kmer_utils.py subcommand
- Sub-commands:
- build_kmer_db
Build a database of kmers occurring in given sequences.
usage: kmer_utils.py build_kmer_db [-h] [--kmerSize KMER_SIZE] [--minOccs MIN_OCCS] [--maxOccs MAX_OCCS] [--counterCap COUNTER_CAP] [--singleStrand] [--memLimitGb MEM_LIMIT_GB] [--memLimitLaxness {0,1,2}] [--threads THREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] seq_files [seq_files ...] kmer_db
- Positional arguments:
seq_files Files from which to extract kmers (fasta/fastq/bam; fasta/fastq may be .gz or .bz2)
kmer_db kmer database (with or without .kmc_pre/.kmc_suf suffix)
- Options:
--kmerSize=25, -k=25 kmer size
--minOccs=1, -ci=1 drop kmers with fewer than this many occurrences
--maxOccs=2147483647, -cx=2147483647 drop kmers with more than this many occurrences
--counterCap=255, -cs=255 cap kmer counts at this value
--singleStrand=False, -b=False do not add kmers from reverse complements of input sequences
--memLimitGb=8 Max memory to use, in GB
--memLimitLaxness=0 How strict is --memLimitGb? 0=strict, 1=lax, 2=even more lax
Possible choices: 0, 1, 2
--threads Number of threads (default: all available cores)
--loglevel=INFO Verboseness of output.
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files.
--tmp_dirKeep=False Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
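The effect of the counting options above can be illustrated with a small in-memory sketch. This is not how kmer_utils.py or KMC is implemented; the function name `build_kmer_counts`, the choice of the lexicographically smaller strand as the representative kmer, the skipping of kmers with non-ACGT bases, and the decision to apply `min_occs`/`max_occs` before `counter_cap` are all assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_kmer_counts(seqs, kmer_size=25, min_occs=1, max_occs=2147483647,
                      counter_cap=255, single_strand=False):
    """Hypothetical in-memory analogue of build_kmer_db (illustration only)."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - kmer_size + 1):
            kmer = seq[i:i + kmer_size]
            if set(kmer) - set("ACGT"):
                continue  # assumption: kmers with ambiguous bases are skipped
            if not single_strand:
                # assumption: a kmer and its reverse complement share one entry,
                # keyed by the lexicographically smaller of the two
                kmer = min(kmer, revcomp(kmer))
            counts[kmer] += 1
    # assumption: occurrence bounds are checked on raw counts, then counts are capped
    return {k: min(c, counter_cap)
            for k, c in counts.items() if min_occs <= c <= max_occs}
```

For example, with `kmer_size=3` the sequence `ACGTA` yields two entries when both strands are counted, because `ACG` and `CGT` are reverse complements of each other, and three entries with `single_strand=True`.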
- dump_kmer_counts
Dump kmers and their counts from kmer database to a text file
usage: kmer_utils.py dump_kmer_counts [-h] [--minOccs MIN_OCCS] [--maxOccs MAX_OCCS] [--threads THREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] kmer_db out_kmers
- Positional arguments:
kmer_db kmer database (with or without .kmc_pre/.kmc_suf suffix)
out_kmers text file to which to write the kmers
- Options:
--minOccs=1, -ci=1 drop kmers with fewer than this many occurrences
--maxOccs=2147483647, -cx=2147483647 drop kmers with more than this many occurrences
--threads Number of threads (default: all available cores)
--loglevel=INFO Verboseness of output.
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files.
--tmp_dirKeep=False Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
- filter_reads
Filter reads based on their kmer contents. Can also be used to filter contigs or reference sequences, but we’ll refer to filtering of reads in the documentation. Note that “occurrence of a kmer” means “occurrence of the kmer or its reverse complement” if kmer_db was built with single_strand==False.
Inputs:
kmer_db: the kmc kmer database
in_reads: the reads to filter. Can be a .fasta, .fastq, or .bam; fasta or fastq can be compressed with gzip or bzip2. If a .bam, a read pair is kept if either mate passes the filter.
Outputs:
out_reads: file to which filtered reads are written. The type is determined from the extension; the same types as above are supported.
Params:
db_min_occs: only consider database kmers with at least this count
db_max_occs: only consider database kmers with at most this count
read_min_occs: only keep reads with at least this many occurrences of kmers from the database
read_max_occs: only keep reads with no more than this many occurrences of kmers from the database
read_min_occs_frac: only keep reads with at least this many occurrences of kmers from the database, interpreted as a fraction of read length in kmers
read_max_occs_frac: only keep reads with no more than this many occurrences of kmers from the database, interpreted as a fraction of read length in kmers
(Note: minimal read kmer content can be given as absolute counts or as a fraction of read length, but not both.)
hard_mask: if True, in the output reads, kmers not passing the filter are replaced by Ns
threads: use this many threads
usage: kmer_utils.py filter_reads [-h] [--dbMinOccs DB_MIN_OCCS] [--dbMaxOccs DB_MAX_OCCS] [--readMinOccs READ_MIN_OCCS] [--readMaxOccs READ_MAX_OCCS] [--readMinOccsFrac READ_MIN_OCCS_FRAC] [--readMaxOccsFrac READ_MAX_OCCS_FRAC] [--hardMask] [--threads THREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] kmer_db in_reads out_reads
- Positional arguments:
kmer_db kmer database (with or without .kmc_pre/.kmc_suf suffix)
in_reads input reads, as fasta/fastq/bam
out_reads output reads
- Options:
--dbMinOccs=1 ignore database kmers with count below this
--dbMaxOccs=2147483647 ignore database kmers with count above this
--readMinOccs=0 filter out reads with fewer than this many db kmers
--readMaxOccs=2147483647 filter out reads with more than this many db kmers
--readMinOccsFrac=0.0 filter out reads with fewer than this many db kmers, interpreted as fraction of read length
--readMaxOccsFrac=1.0 filter out reads with more than this many db kmers, interpreted as fraction of read length
--hardMask=False In the output reads, mask the invalid kmers
--threads Number of threads (default: all available cores)
--loglevel=INFO Verboseness of output.
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files.
--tmp_dirKeep=False Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
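The read-level filtering rule described for filter_reads can be sketched as follows. This is an illustration rather than the tool's code: the names `count_db_hits` and `read_passes` are hypothetical, the database is modelled as a plain dict from kmer to count, reverse complements are not looked up (as if the database had been built single-stranded), and "fraction of read length in kmers" is assumed to mean a fraction of the number of kmer positions in the read. Hard masking is left out.

```python
def count_db_hits(read, kmer_db, kmer_size, db_min_occs=1, db_max_occs=2147483647):
    """Count read positions whose kmer is in kmer_db with a count inside the db bounds."""
    hits = 0
    for i in range(len(read) - kmer_size + 1):
        count = kmer_db.get(read[i:i + kmer_size], 0)
        if count > 0 and db_min_occs <= count <= db_max_occs:
            hits += 1
    return hits


def read_passes(read, kmer_db, kmer_size, db_min_occs=1, db_max_occs=2147483647,
                read_min_occs=0, read_max_occs=2147483647,
                read_min_occs_frac=0.0, read_max_occs_frac=1.0):
    """Hypothetical keep/drop decision for one read (illustration only)."""
    n_kmers = max(len(read) - kmer_size + 1, 0)  # kmer positions in the read
    hits = count_db_hits(read, kmer_db, kmer_size, db_min_occs, db_max_occs)
    # The defaults of both bound pairs are no-ops, so checking both is harmless
    # when only one kind (absolute or fractional) is supplied.
    abs_ok = read_min_occs <= hits <= read_max_occs
    frac_ok = read_min_occs_frac * n_kmers <= hits <= read_max_occs_frac * n_kmers
    return abs_ok and frac_ok
```

With `kmer_db = {"ACG": 3, "CGT": 1}` and `kmer_size=3`, the read `ACGTT` has three kmer positions, two of which hit the database, so it passes `read_min_occs=2` and fails `read_min_occs=3`.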
- kmers_binary_op
Perform a simple binary operation on kmer sets.
usage: kmer_utils.py kmers_binary_op [-h] [--resultMinOccs RESULT_MIN_OCCS] [--resultMaxOccs RESULT_MAX_OCCS] [--resultCounterCap RESULT_COUNTER_CAP] [--threads THREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] {intersect,union,kmers_subtract,counters_subtract} kmer_db1 kmer_db2 kmer_db_out
- Positional arguments:
op binary operation to perform
Possible choices: intersect, union, kmers_subtract, counters_subtract
kmer_db1 first kmer set
kmer_db2 second kmer set
kmer_db_out output kmer db
- Options:
--resultMinOccs=1 from the result drop kmers with counts below this
--resultMaxOccs=2147483647 from the result drop kmers with counts above this
--resultCounterCap=255 cap output counters at this value
--threads Number of threads (default: all available cores)
--loglevel=INFO Verboseness of output.
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files.
--tmp_dirKeep=False Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
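The four operations can be pictured on plain dicts mapping kmer to count. This page does not say how counters are combined, so the sketch below assumes the defaults of KMC's kmc_tools simple operations (minimum for intersect, sum for union, difference for counters_subtract, with non-positive results dropped). The function name mirrors the sub-command but is hypothetical.

```python
def kmers_binary_op(op, db1, db2, result_min_occs=1, result_max_occs=2147483647,
                    result_counter_cap=255):
    """Hypothetical dict-based analogue of kmers_binary_op (illustration only)."""
    if op == "intersect":
        # kmers present in both sets; assumed counter: the smaller of the two
        result = {k: min(c, db2[k]) for k, c in db1.items() if k in db2}
    elif op == "union":
        # kmers present in either set; assumed counter: the sum
        result = {k: db1.get(k, 0) + db2.get(k, 0) for k in db1.keys() | db2.keys()}
    elif op == "kmers_subtract":
        # kmers of the first set that are absent from the second
        result = {k: c for k, c in db1.items() if k not in db2}
    elif op == "counters_subtract":
        # counts of the first set minus counts of the second
        result = {k: c - db2.get(k, 0) for k, c in db1.items()}
    else:
        raise ValueError("unknown op: %r" % (op,))
    # drop non-positive counts, apply the result bounds, then cap the counters
    return {k: min(c, result_counter_cap) for k, c in result.items()
            if c > 0 and result_min_occs <= c <= result_max_occs}
```

The difference between the two subtraction modes shows up when a kmer occurs in both inputs: `kmers_subtract` removes it outright, while `counters_subtract` keeps it as long as its count in the first set exceeds its count in the second.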
- kmers_set_counts
Copy the kmer database, setting all kmer counts in the output to the given value.
usage: kmer_utils.py kmers_set_counts [-h] [--threads THREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] kmer_db_in value kmer_db_out
- Positional arguments:
kmer_db_in input kmer db
value all kmer counts in the output will be set to this value
kmer_db_out output kmer db
- Options:
--threads Number of threads (default: all available cores)
--loglevel=INFO Verboseness of output.
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
--tmp_dir=/tmp Base directory for temp files.
--tmp_dirKeep=False Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
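One possible use of resetting counts, sketched on plain dicts: setting every count to 1 turns a database into a presence set, and summing such sets then gives the number of inputs each kmer appears in. The helper name, the sample data, and the summing union are assumptions for illustration, not behavior documented on this page.

```python
def kmers_set_counts(kmer_db_in, value):
    """Hypothetical dict-based analogue: copy the kmers, set every count to `value`."""
    return {kmer: value for kmer in kmer_db_in}


# Made-up example databases (kmer -> count).
sample_a = {"AAA": 7, "CCC": 2}
sample_b = {"AAA": 1, "GGG": 9}

# Presence sets: every kmer counted once, regardless of its original count.
presence_a = kmers_set_counts(sample_a, 1)
presence_b = kmers_set_counts(sample_b, 1)

# Summing presence sets (an assumed union-with-sum) counts samples per kmer.
n_samples = {k: presence_a.get(k, 0) + presence_b.get(k, 0)
             for k in presence_a.keys() | presence_b.keys()}
```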