2.9. ncbi.py - NCBI Genbank and SRA utilities

This script contains a number of utilities for submitting our analyses to NCBI’s Genbank and SRA databases, as well as retrieving records from Genbank.

usage: ncbi.py subcommand

2.9.1. subcommands

Possible choices: tbl_transfer, tbl_transfer_multichr, tbl_transfer_prealigned, fetch_fastas, fetch_feature_tables, fetch_genbank_records, biosample_to_genbank, prep_genbank_files, prep_sra_table

2.9.2. Sub-commands

2.9.2.1. tbl_transfer

This function takes an NCBI TBL file describing features on a genome (genes, etc) and transfers them to a new genome.

ncbi.py tbl_transfer [-h] [--oob_clip] [--ignoreAmbigFeatureEdge]
                     [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                     [--version]
                     ref_fasta ref_tbl alt_fasta out_tbl

2.9.2.1.1. Positional Arguments

ref_fasta: Input sequence of reference genome
ref_tbl: Input reference annotations (NCBI TBL format)
alt_fasta: Input sequence of new genome
out_tbl: Output file with transferred annotations

2.9.2.1.2. Named Arguments

--oob_clip

Out of bounds feature behavior.

False: drop all features that are completely or partly out of bounds
True: drop all features completely out of bounds, but truncate any features that are partly out of bounds

Default: False

--ignoreAmbigFeatureEdge

Ambiguous feature behavior.

False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible
True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values

Default: False

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
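
The two --oob_clip behaviors described above can be illustrated with a minimal sketch. It assumes 1-based inclusive coordinates that have already been mapped onto the new genome; the helper name apply_oob_policy is hypothetical and is not part of ncbi.py.

```python
from typing import Optional, Tuple


def apply_oob_policy(start: int, end: int, seq_len: int,
                     oob_clip: bool) -> Optional[Tuple[int, int]]:
    """Return the (start, end) to keep, or None if the feature is dropped.

    Hypothetical helper: coordinates are assumed to be 1-based, inclusive,
    and ordered so that start <= end.
    """
    completely_out = end < 1 or start > seq_len
    partly_out = not completely_out and (start < 1 or end > seq_len)

    if completely_out:
        # Dropped regardless of the oob_clip setting.
        return None
    if partly_out:
        if not oob_clip:
            # oob_clip False: partly out-of-bounds features are dropped too.
            return None
        # oob_clip True: truncate to the part that lies within the sequence.
        return (max(start, 1), min(end, seq_len))
    return (start, end)
```

For a 100-base sequence, a feature spanning 90..120 would be dropped with oob_clip False and truncated to 90..100 with oob_clip True.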

2.9.2.2. tbl_transfer_multichr

This function takes an NCBI TBL file describing features on a genome (genes, etc) and transfers them to a new genome.

ncbi.py tbl_transfer_multichr [-h] [--ref_fastas REF_FASTAS [REF_FASTAS ...]]
                              [--ref_tbls REF_TBLS [REF_TBLS ...]]
                              [--oob_clip] [--ignoreAmbigFeatureEdge]
                              [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version]
                              alt_fasta out_dir

2.9.2.2.1. Positional Arguments

alt_fasta: Input sequence of new genome, all chr/segs in one fasta file, in the same order as ref_fastas
out_dir: Output files include one fasta and tbl per sequence in alt_fasta, named according to the fasta header ID of each entry in alt_fasta.

2.9.2.2.2. Named Arguments

--ref_fastas

Input sequences of reference genome, one chr/seg per fasta file

--ref_tbls

Input reference annotations (NCBI TBL format), one chr/seg per tbl file, in the same order as ref_fastas

--oob_clip

Out of bounds feature behavior.

False: drop all features that are completely or partly out of bounds
True: drop all features completely out of bounds, but truncate any features that are partly out of bounds

Default: False

--ignoreAmbigFeatureEdge

Ambiguous feature behavior.

False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible
True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values

Default: False

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
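
As an illustration of the ambiguous-edge notation that --ignoreAmbigFeatureEdge refers to, the sketch below splits a TBL coordinate token into its numeric position and a flag for a leading "<" or ">". It only shows the notation; the function name parse_tbl_coord is hypothetical, and this is not how ncbi.py itself is known to parse feature tables.

```python
import re
from typing import Tuple

# An optional "<" or ">" marker followed by a positive integer position.
_COORD = re.compile(r"^([<>]?)(\d+)$")


def parse_tbl_coord(token: str) -> Tuple[int, bool]:
    """Return (position, is_ambiguous) for a coordinate such as '<12' or '340'."""
    match = _COORD.match(token.strip())
    if match is None:
        raise ValueError("not a TBL coordinate: {!r}".format(token))
    marker, digits = match.groups()
    return int(digits), bool(marker)
```

Treating ambiguous edges as exact values then amounts to keeping the position and ignoring the flag.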

2.9.2.3. tbl_transfer_prealigned

This breaks out the ref and alt sequences into separate fasta files, and then creates unified files containing the reference sequence first and the alt second. Each of these unified files is then passed as a cmap to tbl_transfer_common.

This function expects to receive one fasta file containing a multialignment of a single segment/chromosome along with the respective reference sequence for that segment/chromosome. It also expects a reference fasta containing all reference segments/chromosomes, so that the reference sequence can be identified in the input file by name, and a list of reference tbl files, where each file is named according to the ID of its corresponding sequence in refFasta. For each non-reference sequence present in inputFasta, two files are written: a fasta containing that sequence, and its corresponding feature table as created by tbl_transfer_common.
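
The [refId].tbl naming convention means a reference ID can be recovered from each feature table's filename. A minimal sketch of that lookup, assuming the ID is simply the basename with the .tbl extension removed (index_tbls_by_ref_id is a hypothetical helper, not a function in ncbi.py):

```python
import os
from typing import Dict, Iterable


def index_tbls_by_ref_id(tbl_paths: Iterable[str]) -> Dict[str, str]:
    """Map each reference ID to the path of its feature table."""
    index = {}
    for path in tbl_paths:
        # "refs/KJ660346.2.tbl" -> ref_id "KJ660346.2", ext ".tbl"
        ref_id, ext = os.path.splitext(os.path.basename(path))
        if ext != ".tbl":
            raise ValueError("expected a .tbl file, got {!r}".format(path))
        index[ref_id] = path
    return index
```

Only the final extension is stripped, so a versioned accession such as KJ660346.2 keeps its ".2" suffix in the ID.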

ncbi.py tbl_transfer_prealigned [-h] [--oob_clip] [--ignoreAmbigFeatureEdge]
                                [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version]
                                inputFasta refFasta refAnnotTblFiles
                                [refAnnotTblFiles ...] outputDir

2.9.2.3.1. Positional Arguments

inputFasta

FASTA file containing input sequences, including pre-made alignments and reference sequence

refFasta

FASTA file containing the reference genome

refAnnotTblFiles

Names of the reference feature tables, each of which should have a filename of the form [refId].tbl so they can be matched against the reference sequences

outputDir

The output directory

2.9.2.3.2. Named Arguments

--oob_clip

Out of bounds feature behavior.

False: drop all features that are completely or partly out of bounds
True: drop all features completely out of bounds, but truncate any features that are partly out of bounds

Default: False

--ignoreAmbigFeatureEdge

Ambiguous feature behavior.

False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible
True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values

Default: False

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.9.2.4. fetch_fastas

This function downloads and saves the FASTA files from the Genbank CoreNucleotide database for a given list of accession IDs.

ncbi.py fetch_fastas [-h] [--api_key API_KEY] [--forceOverwrite]
                     [--combinedFilePrefix COMBINEDFILEPREFIX]
                     [--fileExt FILEEXT] [--removeSeparateFiles]
                     [--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
                     [--tmp_dirKeep]
                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                     [--version]
                     emailAddress destinationDir accession_IDs
                     [accession_IDs ...]

2.9.2.4.1. Positional Arguments

emailAddress

Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).

destinationDir

Output directory where .fasta and .tbl files will be saved

accession_IDs

List of Genbank nuccore accession IDs

2.9.2.4.2. Named Arguments

--api_key

Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

--forceOverwrite

Overwrite existing files, if present.

Default: False

--combinedFilePrefix

The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.

--fileExt

The extension to use for the downloaded files

--removeSeparateFiles

If specified, remove the individual files and leave only the combined file.

Default: False

--chunkSize

Causes files to be downloaded from GenBank in chunks of N accessions (default: 1). Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix. If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.

Default: 1

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
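
The --chunkSize rule described above (chunks of N accessions, falling back to 500 when more than 500 accessions are given and no size is specified) can be sketched as follows. This is an illustration of the documented behavior only; chunk_accessions and NCBI_MAX_CHUNK are hypothetical names, not identifiers from ncbi.py.

```python
from typing import List, Optional, Sequence

# Ceiling quoted in the --chunkSize description.
NCBI_MAX_CHUNK = 500


def chunk_accessions(accessions: Sequence[str],
                     chunk_size: Optional[int] = None) -> List[List[str]]:
    """Split accession IDs into download chunks, preserving input order."""
    if chunk_size is None:
        # Unspecified: one accession per request, unless the list is large.
        chunk_size = NCBI_MAX_CHUNK if len(accessions) > NCBI_MAX_CHUNK else 1
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [list(accessions[i:i + chunk_size])
            for i in range(0, len(accessions), chunk_size)]
```

With 1200 accessions and no explicit size, this yields three chunks of 500, 500 and 200 IDs.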

2.9.2.5. fetch_feature_tables

This function downloads and saves feature tables from the Genbank CoreNucleotide database for a given list of accession IDs.

ncbi.py fetch_feature_tables [-h] [--api_key API_KEY] [--forceOverwrite]
                             [--combinedFilePrefix COMBINEDFILEPREFIX]
                             [--fileExt FILEEXT] [--removeSeparateFiles]
                             [--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
                             [--tmp_dirKeep]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version]
                             emailAddress destinationDir accession_IDs
                             [accession_IDs ...]

2.9.2.5.1. Positional Arguments

emailAddress

Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).

destinationDir

Output directory where .fasta and .tbl files will be saved

accession_IDs

List of Genbank nuccore accession IDs

2.9.2.5.2. Named Arguments

--api_key

Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

--forceOverwrite

Overwrite existing files, if present.

Default: False

--combinedFilePrefix

The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.

--fileExt

The extension to use for the downloaded files

--removeSeparateFiles

If specified, remove the individual files and leave only the combined file.

Default: False

--chunkSize

Causes files to be downloaded from GenBank in chunks of N accessions (default: 1). Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix. If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.

Default: 1

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
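
The request limits quoted for --api_key (3 per second without a key, 10 per second with one) translate into a minimum spacing between requests. The sketch below shows one way a client could enforce that spacing; min_request_interval and Throttle are hypothetical names, and this is not a description of how ncbi.py paces its own requests.

```python
import time
from typing import Optional


def min_request_interval(api_key: Optional[str] = None) -> float:
    """Seconds to leave between requests under the quoted NCBI limits."""
    requests_per_second = 10 if api_key else 3
    return 1.0 / requests_per_second


class Throttle:
    """Sleep as needed so that calls to wait() are at least `interval` apart."""

    def __init__(self, interval: float) -> None:
        self.interval = interval
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = time.monotonic()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

A caller would construct Throttle(min_request_interval(api_key)) once and call wait() before each request.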

2.9.2.6. fetch_genbank_records

This function downloads and saves full flat text records from the Genbank CoreNucleotide database for a given list of accession IDs.

ncbi.py fetch_genbank_records [-h] [--api_key API_KEY] [--forceOverwrite]
                              [--combinedFilePrefix COMBINEDFILEPREFIX]
                              [--fileExt FILEEXT] [--removeSeparateFiles]
                              [--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
                              [--tmp_dirKeep]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version]
                              emailAddress destinationDir accession_IDs
                              [accession_IDs ...]

2.9.2.6.1. Positional Arguments

emailAddress

Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).

destinationDir

Output directory where .fasta and .tbl files will be saved

accession_IDs

List of Genbank nuccore accession IDs

2.9.2.6.2. Named Arguments

--api_key

Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

--forceOverwrite

Overwrite existing files, if present.

Default: False

--combinedFilePrefix

The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.

--fileExt

The extension to use for the downloaded files

--removeSeparateFiles

If specified, remove the individual files and leave only the combined file.

Default: False

--chunkSize

Causes files to be downloaded from GenBank in chunks of N accessions (default: 1). Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix. If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.

Default: 1

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.9.2.7. biosample_to_genbank

Prepare a Genbank Source Modifier Table based on a BioSample registration table (since all of the values are there)

ncbi.py biosample_to_genbank [-h] [--biosample_in_smt] [--iso_dates]
                             [--sgtf_override]
                             [--filter_to_samples FILTER_TO_SAMPLES]
                             [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version]
                             attributes num_segments taxid out_genbank_smt
                             out_biosample_map

2.9.2.7.1. Positional Arguments

attributes: Input BioSample metadata table – the attributes.tsv returned by BioSample after successful registration
num_segments: number of chromosomes/segments per genome for this species
taxid: NCBI Taxonomy numeric taxid to assign to all entries
out_genbank_smt: Output tab table in Genbank Source Modifier Table format, suitable for prep_genbank_files
out_biosample_map: Output two-column biosample accession to sample name map, suitable for prep_genbank_files

2.9.2.7.2. Named Arguments

--biosample_in_smt

Add BioSample and BioProject columns to source modifier table output

Default: False

--iso_dates

Write collection_date in ISO format (YYYY-MM-DD). The default (False) is to write in tbl2asn format (DD-Mmm-YYYY).

Default: False

--sgtf_override

replace “Screening for Variants of Concern (VoC)” with “screened by S dropout” in the note field

Default: False

--filter_to_samples

Filter output to specified sample IDs in this input file (one ID per line).

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
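
The two collection_date layouts selected by --iso_dates can be illustrated with a short sketch. It assumes a complete calendar date is available (partial dates are not handled), and format_collection_date is a hypothetical helper rather than a function from ncbi.py. Month abbreviations are spelled out explicitly so the result does not depend on the system locale.

```python
from datetime import date

_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_collection_date(d: date, iso_dates: bool = False) -> str:
    """Format a date as YYYY-MM-DD (iso_dates) or DD-Mmm-YYYY (default)."""
    if iso_dates:
        return d.isoformat()
    return "{:02d}-{}-{:04d}".format(d.day, _MONTHS[d.month - 1], d.year)
```

For 27 May 2014 this gives "27-May-2014" by default and "2014-05-27" with iso_dates set.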

2.9.2.8. prep_genbank_files

Prepare genbank submission files. Requires .fasta and .tbl files as input, as well as numerous other metadata files for the submission. Creates a directory full of files (.sqn in particular) that can be sent to GenBank.

ncbi.py prep_genbank_files [-h] [--comment COMMENT]
                           [--sequencing_tech SEQUENCING_TECH]
                           [--master_source_table MASTER_SOURCE_TABLE]
                           [--organism ORGANISM] [--mol_type MOL_TYPE]
                           [--biosample_map BIOSAMPLE_MAP]
                           [--coverage_table COVERAGE_TABLE]
                           [--assembly_method ASSEMBLY_METHOD]
                           [--assembly_method_version ASSEMBLY_METHOD_VERSION]
                           [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version]
                           templateFile fasta_files [fasta_files ...] annotDir

2.9.2.8.1. Positional Arguments

templateFile: Submission template file (.sbt) including author and contact info
fasta_files: Input fasta files
annotDir: Output directory with genbank submission files (.tbl files must already be there)

2.9.2.8.2. Named Arguments

--comment

comment field

--sequencing_tech

sequencing technology (e.g. Illumina HiSeq 2500)

--master_source_table

source modifier table

--organism

species name

--mol_type

molecule type

--biosample_map

A file with two columns and a header: sample and BioSample. This file may refer to samples that are not included in this submission.

--coverage_table

A genome coverage report file with a header row. The table must have at least two columns named sample and aln2self_cov_median. All other columns are ignored. Rows referring to samples not in this submission are ignored.

--assembly_method

short description of informatic assembly method

--assembly_method_version

version of assembly method used

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
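
The layouts expected by --biosample_map and --coverage_table can be made concrete with a small reader sketch. It assumes both files are tab-delimited, which the descriptions above do not state explicitly; the function names and the example rows are made up for illustration and do not come from ncbi.py.

```python
import csv
import io
from typing import Dict, Set, TextIO


def read_biosample_map(handle: TextIO) -> Dict[str, str]:
    """Read a two-column table with header columns 'sample' and 'BioSample'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: row["BioSample"] for row in reader}


def read_median_coverage(handle: TextIO, samples: Set[str]) -> Dict[str, float]:
    """Read 'sample' and 'aln2self_cov_median', ignoring other columns
    and any rows for samples that are not in `samples`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["sample"]: float(row["aln2self_cov_median"])
            for row in reader if row["sample"] in samples}


# Made-up example data (assumed tab-delimited).
biosample_text = "sample\tBioSample\nS1\tSAMN00000001\nS2\tSAMN00000002\n"
coverage_text = "sample\taln2self_cov_median\tother\nS1\t120.5\tx\nS3\t8\ty\n"

biosamples = read_biosample_map(io.StringIO(biosample_text))
coverage = read_median_coverage(io.StringIO(coverage_text), {"S1", "S2"})
```

Here the coverage row for S3 is skipped because S3 is not among the requested samples, and the extra column is ignored.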

2.9.2.9. prep_sra_table

This is a very lazy hack that creates a basic table that can be pasted into various columns of an SRA submission spreadsheet. It probably doesn’t work in all cases.

ncbi.py prep_sra_table [-h]
                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                       [--version]
                       lib_fname biosampleFile md5_fname outFile

2.9.2.9.1. Positional Arguments

lib_fname

A file that lists all of the library IDs that will be submitted in this batch

biosampleFile

A file with two columns and a header: sample and BioSample. This file may refer to samples that are not included in this submission.

md5_fname

A file with two columns and no header. The two columns are MD5 checksum and filename. It should contain an entry for every bam file being submitted in this batch. This is typical output from “md5sum *.cleaned.bam”.

outFile

Output table that contains most of the variable columns needed for SRA submission.

2.9.2.9.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
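
The md5_fname layout described above (checksum, whitespace, filename, no header) can be read with a few lines of Python. This is a sketch of the file format only; parse_md5sum_lines is a hypothetical helper and is not taken from ncbi.py.

```python
from typing import Dict, Iterable


def parse_md5sum_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each filename to its MD5 checksum from md5sum-style output."""
    checksums = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        checksum, filename = line.split(None, 1)
        # md5sum prefixes the filename with "*" when run in binary mode.
        checksums[filename.lstrip("*")] = checksum
    return checksums
```

The result can then be checked against the list of bam files in the batch to confirm that every file has an entry.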