2.11. file_utils.py - utilities to perform various file manipulations

Utilities for dealing with files.

usage: file_utils.py subcommand

2.11.1. subcommands



Possible choices: merge_tarballs, rename_fasta_sequences, tsv_derived_cols, tsv_join

2.11.2. Sub-commands

2.11.2.1. merge_tarballs

Merges separate tarballs into one tarball.

Data can be piped in and/or out.
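The merge itself can be pictured with Python's standard `tarfile` module. This is a minimal sketch and not the tool's actual implementation: the function name `merge_tarballs_sketch` is hypothetical, the output is always gzip, and lz4 and zst inputs are not handled because `tarfile` does not read them in most Python versions.

```python
import tarfile


def merge_tarballs_sketch(out_path, in_paths):
    """Copy every member of each input tarball into one gzip-compressed output.

    Hypothetical helper for illustration only; not part of file_utils.py.
    """
    with tarfile.open(out_path, "w:gz") as out_tar:
        for in_path in in_paths:
            # "r:*" lets tarfile detect gzip, bzip2, or xz compression itself.
            with tarfile.open(in_path, "r:*") as in_tar:
                for member in in_tar:
                    # Only regular files carry data; directories and links
                    # are copied as header-only entries.
                    fileobj = in_tar.extractfile(member) if member.isfile() else None
                    out_tar.addfile(member, fileobj)
```

Members are appended in input order, so the merged archive lists the contents of the first tarball first.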

file_utils.py merge_tarballs [-h] [--extractToDiskPath EXTRACT_TO_DISK_PATH]
                             [--pipeInHint PIPE_HINT_IN]
                             [--pipeOutHint PIPE_HINT_OUT] [--threads THREADS]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             out_tarball in_tarballs [in_tarballs ...]

2.11.2.1.1. Positional Arguments

out_tarball

output tarball (*.tar.gz|*.tar.lz4|*.tar.bz2|*.tar.zst|-); compression is inferred from the file extension.

Note: if “-” is used, output will be written to stdout, and --pipeOutHint must be provided to indicate the compression type when it is not gzip (gzip is used by default).

in_tarballs

input tarballs (*.tar.gz|*.tar.lz4|*.tar.bz2|*.tar.zst)
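The extension rule and the “-” special case described above might look roughly like the following. This is a sketch under assumptions: the function name `infer_compression` and the choice to raise `ValueError` on an unknown extension are hypothetical and are not taken from file_utils.py.

```python
def infer_compression(path, pipe_hint="gz"):
    """Return the compression suffix implied by a tarball path.

    Hypothetical helper for illustration; `pipe_hint` plays the role of
    --pipeInHint / --pipeOutHint when the path is "-" (stdin or stdout).
    """
    if path == "-":
        # Piped data has no file name, so the hint is the only source of truth.
        return pipe_hint
    for suffix in ("gz", "lz4", "bz2", "zst"):
        if path.endswith(".tar." + suffix):
            return suffix
    raise ValueError("unrecognized tarball extension: " + path)
```

For example, `infer_compression("out.tar.zst")` returns `"zst"`, while `infer_compression("-")` falls back to `"gz"`.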

2.11.2.1.2. Named Arguments

--extractToDiskPath

If specified, the tar contents will also be extracted to a local directory.

--pipeInHint

If specified, this compression type is used for piped input.

Default: 'gz'

--pipeOutHint

If specified, this compression type is used for piped output.

Default: 'gz'

--threads

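Number of threads. By default all available cores are used; the default listed below is presumably the core count of the machine on which this documentation was generated.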

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verbosity of output.

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files.

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.11.2.2. rename_fasta_sequences

Renames the sequences in a fasta file. Behavior modes:

  1. If the input file has exactly one sequence and suffix_always is False, then the output file’s sequence is named new_name.

  2. In all other cases, the output file’s sequences are named <new_name>-<i>, where <i> is an increasing number from 1 to the number of sequences.
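The two behavior modes above reduce to a small naming rule. The sketch below only computes the new sequence names and does not read or write fasta files; the function name `renamed_headers` is hypothetical and is not part of file_utils.py.

```python
def renamed_headers(num_sequences, new_name, suffix_always=False):
    """Return the list of new sequence names for a fasta file.

    Hypothetical helper that mirrors the two documented behavior modes.
    """
    if num_sequences == 1 and not suffix_always:
        # Mode 1: a single sequence keeps the bare name.
        return [new_name]
    # Mode 2: every sequence gets a 1-based numeric suffix.
    return ["{}-{}".format(new_name, i) for i in range(1, num_sequences + 1)]
```

With `new_name="sample"`, a single-sequence file yields `["sample"]`, the same file with `suffix_always=True` yields `["sample-1"]`, and a three-sequence file yields `["sample-1", "sample-2", "sample-3"]`.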

file_utils.py rename_fasta_sequences [-h] [--suffix_always]
                                     [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version]
                                     in_fasta out_fasta new_name

2.11.2.2.1. Positional Arguments

in_fasta

input fasta sequences

out_fasta

output (renamed) fasta sequences

new_name

new sequence base name

2.11.2.2.2. Named Arguments

--suffix_always

Append the numeric index ‘-1’ to <new_name> even if only one sequence exists in the input fasta.

Default: False

--tmp_dir

Base directory for temp files.

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verbosity of output.

Default: 'INFO'

--version, -V

show program’s version number and exit

2.11.2.3. tsv_derived_cols

Modifies a metadata table by computing derived columns on the fly, adding new columns or replacing existing ones.

file_utils.py tsv_derived_cols [-h] [--table_map [TABLE_MAP ...]]
                               [--lab_highlight_loc LAB_HIGHLIGHT_LOC]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version]
                               in_tsv out_tsv

2.11.2.3.1. Positional Arguments

in_tsv

input metadata

out_tsv

output metadata

2.11.2.3.2. Named Arguments

--table_map

Mapping tables. Each mapping table is a tsv with a header. The header of the first column is the output column name for this mapping (the column will be created or overwritten). The subsequent columns are matching criteria. For each input row that matches the criteria, the value in the first column is written to the output column. The exception is the case where all match columns are ‘*’: the value in the first column is then the header name of the column to copy over.
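One possible reading of the mapping-table rules is sketched below. It is not the tool's implementation. The function name `apply_table_map` is hypothetical, and two behaviors are assumptions the description does not settle: mapping rows are applied in file order so that a later match overwrites an earlier one, and rows that match nothing are left without the output column.

```python
def apply_table_map(rows, map_header, map_rows):
    """Apply one mapping table to a list of metadata rows (dicts).

    map_header: [output_column, match_column_1, match_column_2, ...]
    map_rows:   [[value, criterion_1, criterion_2, ...], ...]
    Hypothetical helper for illustration only.
    """
    out_col, match_cols = map_header[0], map_header[1:]
    result = []
    for row in rows:
        new_row = dict(row)
        for map_row in map_rows:
            value, criteria = map_row[0], map_row[1:]
            if all(crit == "*" for crit in criteria):
                # All-'*' case: `value` names the column to copy from.
                new_row[out_col] = row.get(value, "")
            elif all(row.get(col) == crit for col, crit in zip(match_cols, criteria)):
                # Ordinary case: write the literal value from the first column.
                new_row[out_col] = value
        result.append(new_row)
    return result
```

Under these assumptions, a mapping table whose first row is all ‘*’ acts as a default that more specific rows further down can override.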

--lab_highlight_loc

This option copies the ‘originating_lab’ and ‘submitting_lab’ columns to new ones including a prefix, but only if they match certain criteria. The value of this string must be of the form prefix;col_header=value:col_header=value. For example, ‘MA;country=USA:division=Massachusetts’ will copy the originating_lab and submitting_lab columns to MA_originating_lab and MA_submitting_lab, but only for those rows where country=USA and division=Massachusetts.
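The prefix;col_header=value:col_header=value format can be parsed and applied as in the sketch below. The function names are hypothetical and not part of file_utils.py, and leaving non-matching rows without the prefixed columns is an assumption; the real tool may instead emit empty values.

```python
def parse_lab_highlight_loc(spec):
    """Split 'prefix;col=value:col=value' into (prefix, {col: value})."""
    prefix, _, criteria_text = spec.partition(";")
    criteria = dict(item.split("=", 1) for item in criteria_text.split(":"))
    return prefix, criteria


def highlight_labs(row, spec):
    """Return a copy of `row` with prefixed lab columns added when it matches.

    Hypothetical helper for illustration only.
    """
    prefix, criteria = parse_lab_highlight_loc(spec)
    new_row = dict(row)
    if all(row.get(col) == val for col, val in criteria.items()):
        for col in ("originating_lab", "submitting_lab"):
            new_row[prefix + "_" + col] = row.get(col, "")
    return new_row
```

Parsing ‘MA;country=USA:division=Massachusetts’ this way gives the prefix `"MA"` and the criteria `{"country": "USA", "division": "Massachusetts"}`.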

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verbosity of output.

Default: 'INFO'

--version, -V

show program’s version number and exit

2.11.2.4. tsv_join

Performs a full outer join of the input tables on the column named by --join_id.

file_utils.py tsv_join [-h] --join_id JOIN_ID
                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                       [--version]
                       in_tsvs [in_tsvs ...] out_tsv

2.11.2.4.1. Positional Arguments

in_tsvs

input tsvs

out_tsv

output tsv

2.11.2.4.2. Named Arguments

--join_id

column name to join on
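A full outer join keeps every key that appears in any input table. The sketch below shows the idea with Python's standard `csv` module on in-memory tsv text. It is not the tool's implementation: the function name `full_outer_join` is hypothetical, and filling missing cells with empty strings and letting later tables overwrite earlier values for a shared column are both assumptions.

```python
import csv
import io


def full_outer_join(tsv_texts, join_id):
    """Join several tsv documents (as strings) on the column `join_id`.

    Returns (header, rows). Hypothetical helper for illustration only.
    """
    merged = {}          # join key -> combined row, in first-seen order
    header = [join_id]   # join column first, then columns as encountered
    for text in tsv_texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        for col in reader.fieldnames or []:
            if col not in header:
                header.append(col)
        for row in reader:
            merged.setdefault(row[join_id], {}).update(row)
    # Cells absent from every table for a given key become empty strings.
    rows = [[row.get(col, "") for col in header] for row in merged.values()]
    return header, rows
```

Joining a table with columns `id, x` and keys 1 and 2 against a table with columns `id, y` and keys 2 and 3 produces three rows: key 1 with an empty `y`, key 2 with both values, and key 3 with an empty `x`.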

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verbosity of output.

Default: 'INFO'

--version, -V

show program’s version number and exit