2.9. ncbi.py - NCBI Genbank and SRA utilities

This script contains a number of utilities for submitting our analyses to NCBI’s Genbank and SRA databases, as well as retrieving records from Genbank.

usage: ncbi.py subcommand

2.9.1. subcommands



Possible choices: tbl_transfer, tbl_transfer_multichr, tbl_transfer_prealigned, fetch_fastas, fetch_feature_tables, fetch_genbank_records, biosample_to_genbank, prep_genbank_files, prep_sra_table

2.9.2. Sub-commands

2.9.2.1. tbl_transfer

This function takes an NCBI TBL file describing features on a genome (genes, etc.) and transfers them to a new genome.

ncbi.py tbl_transfer [-h] [--oob_clip] [--ignoreAmbigFeatureEdge]
                     [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                     [--version]
                     ref_fasta ref_tbl alt_fasta out_tbl

2.9.2.1.1. Positional Arguments

ref_fasta

Input sequence of reference genome

ref_tbl

Input reference annotations (NCBI TBL format)

alt_fasta

Input sequence of new genome

out_tbl

Output file with transferred annotations

2.9.2.1.2. Named Arguments

--oob_clip
Out of bounds feature behavior.

False: drop all features that are completely or partly out of bounds.

True: drop all features that are completely out of bounds, but truncate any features that are partly out of bounds.

Default: False

--ignoreAmbigFeatureEdge
Ambiguous feature behavior.

False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible.

True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values.

Default: False

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
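
The two --oob_clip settings described above can be illustrated with a minimal Python sketch. This is not the code used by ncbi.py; the function name apply_oob_policy and the 1-based inclusive coordinate convention are assumptions made for the example.

```python
from typing import Optional, Tuple


def apply_oob_policy(start: int, end: int, seq_len: int,
                     oob_clip: bool) -> Optional[Tuple[int, int]]:
    """Return the interval to keep for a feature, or None if it is dropped.

    Assumes 1-based inclusive coordinates with start <= end (hypothetical
    convention chosen for this sketch).
    """
    # Completely out of bounds: dropped under either setting.
    if end < 1 or start > seq_len:
        return None
    partly_out = start < 1 or end > seq_len
    if partly_out and not oob_clip:
        # Default (False): partly out-of-bounds features are dropped too.
        return None
    # With oob_clip, truncate to the valid range; features that are
    # fully in bounds pass through unchanged.
    return (max(start, 1), min(end, seq_len))
```

For a 100-base sequence, a feature spanning 90..120 is dropped by default and truncated to 90..100 when oob_clip is set.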

2.9.2.2. tbl_transfer_multichr

This function takes an NCBI TBL file describing features on a genome (genes, etc.) and transfers them to a new genome.

ncbi.py tbl_transfer_multichr [-h] [--ref_fastas REF_FASTAS [REF_FASTAS ...]]
                              [--ref_tbls REF_TBLS [REF_TBLS ...]]
                              [--oob_clip] [--ignoreAmbigFeatureEdge]
                              [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version]
                              alt_fasta out_dir

2.9.2.2.1. Positional Arguments

alt_fasta

Input sequence of new genome, all chr/segs in one fasta file, in the same order as ref_fastas

out_dir

Output files include one fasta and tbl per sequence in alt_fasta, named according to the fasta header ID of each entry in alt_fasta.

2.9.2.2.2. Named Arguments

--ref_fastas

Input sequences of reference genome, one chr/seg per fasta file

--ref_tbls

Input reference annotations (NCBI TBL format), one chr/seg per tbl file, in the same order as ref_fastas

--oob_clip
Out of bounds feature behavior.

False: drop all features that are completely or partly out of bounds.

True: drop all features that are completely out of bounds, but truncate any features that are partly out of bounds.

Default: False

--ignoreAmbigFeatureEdge
Ambiguous feature behavior.

False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible.

True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values.

Default: False

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
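
Because --ref_fastas and --ref_tbls must list the chromosomes/segments in the same order, it can help to assemble the command line programmatically. The sketch below is only an illustration: build_multichr_argv is a hypothetical helper, the length check is an assumption about what a careful caller might verify, and the command is built but not executed.

```python
from typing import List


def build_multichr_argv(alt_fasta: str, out_dir: str,
                        ref_fastas: List[str],
                        ref_tbls: List[str]) -> List[str]:
    """Build an argument list for 'ncbi.py tbl_transfer_multichr'."""
    # One tbl per reference fasta, in matching order (checked by this
    # sketch; whether the tool itself checks is not stated here).
    if len(ref_fastas) != len(ref_tbls):
        raise ValueError("need exactly one tbl file per reference fasta")
    # Positionals go first so the multi-valued options that follow
    # cannot absorb them.
    return (["ncbi.py", "tbl_transfer_multichr", alt_fasta, out_dir,
             "--ref_fastas"] + list(ref_fastas) +
            ["--ref_tbls"] + list(ref_tbls))
```

The resulting list could be passed to subprocess.run in an environment where ncbi.py is on the PATH.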

2.9.2.3. tbl_transfer_prealigned

This breaks out the ref and alt sequences into separate fasta files, and then creates unified files containing the reference sequence first and the alt second. Each of these unified files is then passed as a cmap to tbl_transfer_common.

This function expects to receive one fasta file containing a multialignment of a single segment/chromosome along with the respective reference sequence for that segment/chromosome. It also expects a reference fasta containing all reference segments/chromosomes, so that the reference sequence can be identified in the input file by name. Finally, it expects a list of reference tbl files, where each file is named according to the ID present for its corresponding sequence in the refFasta. For each non-reference sequence present in the inputFasta, two files are written: a fasta containing the segment/chromosome for that sequence, and its corresponding feature table as created by tbl_transfer_common.

ncbi.py tbl_transfer_prealigned [-h] [--oob_clip] [--ignoreAmbigFeatureEdge]
                                [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version]
                                inputFasta refFasta refAnnotTblFiles
                                [refAnnotTblFiles ...] outputDir

2.9.2.3.1. Positional Arguments

inputFasta
FASTA file containing input sequences, including pre-made alignments and reference sequence

refFasta

FASTA file containing the reference genome

refAnnotTblFiles
Names of the reference feature tables, each of which should have a filename of the form [refId].tbl so they can be matched against the reference sequences

outputDir

The output directory
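
The [refId].tbl naming convention for refAnnotTblFiles can be illustrated with a short Python sketch that pairs each reference sequence ID with its feature table by filename. The helper name map_tbls_to_ref_ids is hypothetical, and this is not the matching code used by ncbi.py.

```python
import os
from typing import Dict, List


def map_tbls_to_ref_ids(ref_ids: List[str],
                        tbl_paths: List[str]) -> Dict[str, str]:
    """Map each reference sequence ID to the path of its [refId].tbl file."""
    # Strip the directory and the final ".tbl" extension to recover the ID.
    by_id = {os.path.splitext(os.path.basename(p))[0]: p for p in tbl_paths}
    missing = [r for r in ref_ids if r not in by_id]
    if missing:
        raise ValueError("no tbl file found for: " + ", ".join(missing))
    return {r: by_id[r] for r in ref_ids}
```

Only the last extension is removed, so an ID that itself contains a dot (for example a versioned accession) still matches its table.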

2.9.2.3.2. Named Arguments

--oob_clip
Out of bounds feature behavior.

False: drop all features that are completely or partly out of bounds.

True: drop all features that are completely out of bounds, but truncate any features that are partly out of bounds.

Default: False

--ignoreAmbigFeatureEdge
Ambiguous feature behavior.

False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible.

True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values.

Default: False

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.9.2.4. fetch_fastas

This function downloads and saves the FASTA files from the Genbank CoreNucleotide database given a list of accession IDs.

ncbi.py fetch_fastas [-h] [--api_key API_KEY] [--forceOverwrite]
                     [--combinedFilePrefix COMBINEDFILEPREFIX]
                     [--fileExt FILEEXT] [--removeSeparateFiles]
                     [--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
                     [--tmp_dirKeep]
                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                     [--version]
                     emailAddress destinationDir accession_IDs
                     [accession_IDs ...]

2.9.2.4.1. Positional Arguments

emailAddress
Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).

destinationDir

Output directory where .fasta and .tbl files will be saved

accession_IDs

List of Genbank nuccore accession IDs

2.9.2.4.2. Named Arguments

--api_key
Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

--forceOverwrite

Overwrite existing files, if present.

Default: False

--combinedFilePrefix
The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.

--fileExt

The extension to use for the downloaded files

--removeSeparateFiles

If specified, remove the individual files and leave only the combined file.

Default: False

--chunkSize
Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix (default: 1). If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.

Default: 1

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
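
The request limits quoted for --api_key (3 per second without a key, up to 10 per second with one) imply a minimum spacing between requests. The following Python sketch shows one way a client could pace itself under those limits. It does not describe how ncbi.py throttles its own requests; the names max_requests_per_second and Throttle are hypothetical.

```python
import time
from typing import Callable, Optional


def max_requests_per_second(api_key: Optional[str]) -> int:
    """Rate limit quoted above: 10/second with an API key, 3/second without."""
    return 10 if api_key else 3


class Throttle:
    """Space out calls so that at most per_second of them occur each second."""

    def __init__(self, per_second: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed: Optional[float] = None

    def wait(self) -> None:
        """Block until the next request is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

A caller would create one Throttle(max_requests_per_second(api_key)) and call wait() before each request. The clock and sleep parameters are injectable only so the pacing can be tested without real delays.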

2.9.2.5. fetch_feature_tables

This function downloads and saves feature tables from the Genbank CoreNucleotide database given a list of accession IDs.

ncbi.py fetch_feature_tables [-h] [--api_key API_KEY] [--forceOverwrite]
                             [--combinedFilePrefix COMBINEDFILEPREFIX]
                             [--fileExt FILEEXT] [--removeSeparateFiles]
                             [--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
                             [--tmp_dirKeep]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version]
                             emailAddress destinationDir accession_IDs
                             [accession_IDs ...]

2.9.2.5.1. Positional Arguments

emailAddress
Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).

destinationDir

Output directory where .fasta and .tbl files will be saved

accession_IDs

List of Genbank nuccore accession IDs

2.9.2.5.2. Named Arguments

--api_key
Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

--forceOverwrite

Overwrite existing files, if present.

Default: False

--combinedFilePrefix
The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.

--fileExt

The extension to use for the downloaded files

--removeSeparateFiles

If specified, remove the individual files and leave only the combined file.

Default: False

--chunkSize
Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix (default: 1). If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.

Default: 1

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
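
The --chunkSize rule described above (default of 1, raised to 500 when the option is unspecified and more than 500 accessions are given) can be sketched in Python as follows. This is an illustration rather than the tool's code: the function names are hypothetical, "unspecified" is modeled as None, and the sketch does not cap an explicitly supplied value because the text above does not say whether the tool does.

```python
from typing import List, Optional


def effective_chunk_size(n_accessions: int,
                         chunk_size: Optional[int] = None) -> int:
    """Pick the chunk size following the rule described for --chunkSize."""
    if chunk_size is None:
        # Unspecified: default of 1, raised to 500 for large requests.
        return 500 if n_accessions > 500 else 1
    if chunk_size < 1:
        raise ValueError("chunk size must be at least 1")
    return chunk_size


def chunk_accessions(accessions: List[str],
                     chunk_size: Optional[int] = None) -> List[List[str]]:
    """Split the accession list into consecutive chunks, preserving order."""
    size = effective_chunk_size(len(accessions), chunk_size)
    return [accessions[i:i + size] for i in range(0, len(accessions), size)]
```

With 1200 accessions and no explicit chunk size, this yields three chunks of 500, 500, and 200 accessions.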

2.9.2.6. fetch_genbank_records

This function downloads and saves full flat text records from the Genbank CoreNucleotide database given a list of accession IDs.

ncbi.py fetch_genbank_records [-h] [--api_key API_KEY] [--forceOverwrite]
                              [--combinedFilePrefix COMBINEDFILEPREFIX]
                              [--fileExt FILEEXT] [--removeSeparateFiles]
                              [--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
                              [--tmp_dirKeep]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version]
                              emailAddress destinationDir accession_IDs
                              [accession_IDs ...]

2.9.2.6.1. Positional Arguments

emailAddress
Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).

destinationDir

Output directory where .fasta and .tbl files will be saved

accession_IDs

List of Genbank nuccore accession IDs

2.9.2.6.2. Named Arguments

--api_key
Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/

--forceOverwrite

Overwrite existing files, if present.

Default: False

--combinedFilePrefix
The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.

--fileExt

The extension to use for the downloaded files

--removeSeparateFiles

If specified, remove the individual files and leave only the combined file.

Default: False

--chunkSize
Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix (default: 1). If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.

Default: 1

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

2.9.2.7. biosample_to_genbank

Prepare a Genbank Source Modifier Table based on a BioSample registration table (since all of the values are there)

ncbi.py biosample_to_genbank [-h] [--biosample_in_smt] [--iso_dates]
                             [--sgtf_override]
                             [--filter_to_samples FILTER_TO_SAMPLES]
                             [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                             [--version]
                             attributes num_segments taxid out_genbank_smt
                             out_biosample_map

2.9.2.7.1. Positional Arguments

attributes

Input BioSample metadata table – the attributes.tsv returned by BioSample after successful registration

num_segments

number of chromosomes/segments per genome for this species

taxid

NCBI Taxonomy numeric taxid to assign to all entries

out_genbank_smt

Output tab table in Genbank Source Modifier Table format, suitable for prep_genbank_files

out_biosample_map

Output two-column biosample accession to sample name map, suitable for prep_genbank_files

2.9.2.7.2. Named Arguments

--biosample_in_smt

Add BioSample and BioProject columns to source modifier table output

Default: False

--iso_dates

Write collection_date in ISO format (YYYY-MM-DD). The default (False) is to write in tbl2asn format (DD-Mmm-YYYY).

Default: False

--sgtf_override

replace “Screening for Variants of Concern (VoC)” with “screened by S dropout” in the note field

Default: False

--filter_to_samples

Filter output to specified sample IDs in this input file (one ID per line).

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
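
The two collection_date formats selected by --iso_dates can be compared with a small Python sketch. The helper format_collection_date is hypothetical and handles only complete dates; how ncbi.py treats partial dates (for example, year only) is not covered here.

```python
import datetime

# English month abbreviations spelled out explicitly so the result does not
# depend on the process locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def format_collection_date(date: datetime.date,
                           iso_dates: bool = False) -> str:
    """Format a full date as YYYY-MM-DD (ISO) or DD-Mmm-YYYY (tbl2asn style)."""
    if iso_dates:
        return date.isoformat()
    return "{:02d}-{}-{:04d}".format(date.day, _MONTHS[date.month - 1],
                                     date.year)
```

For 3 May 2014, this gives 03-May-2014 by default and 2014-05-03 with iso_dates=True.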

2.9.2.8. prep_genbank_files

Prepare genbank submission files. Requires .fasta and .tbl files as input, as well as numerous other metadata files for the submission. Creates a directory full of files (.sqn in particular) that can be sent to GenBank.

ncbi.py prep_genbank_files [-h] [--comment COMMENT]
                           [--sequencing_tech SEQUENCING_TECH]
                           [--master_source_table MASTER_SOURCE_TABLE]
                           [--organism ORGANISM] [--mol_type MOL_TYPE]
                           [--biosample_map BIOSAMPLE_MAP]
                           [--coverage_table COVERAGE_TABLE]
                           [--assembly_method ASSEMBLY_METHOD]
                           [--assembly_method_version ASSEMBLY_METHOD_VERSION]
                           [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version]
                           templateFile fasta_files [fasta_files ...] annotDir

2.9.2.8.1. Positional Arguments

templateFile

Submission template file (.sbt) including author and contact info

fasta_files

Input fasta files

annotDir

Output directory with genbank submission files (.tbl files must already be there)

2.9.2.8.2. Named Arguments

--comment

comment field

--sequencing_tech

sequencing technology (e.g. Illumina HiSeq 2500)

--master_source_table

source modifier table

--organism

species name

--mol_type

molecule type

--biosample_map
A file with two columns and a header: sample and BioSample. This file may refer to samples that are not included in this submission.

--coverage_table
A genome coverage report file with a header row. The table must have at least two columns named sample and aln2self_cov_median. All other columns are ignored. Rows referring to samples not in this submission are ignored.

--assembly_method

short description of informatic assembly method

--assembly_method_version

version of assembly method used

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit
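
The --coverage_table rules described above (columns located by name, extra columns ignored, rows for samples outside the submission skipped) can be sketched in Python. The tab delimiter and the function name read_coverage are assumptions made for this example; it is not the parser used by prep_genbank_files.

```python
import csv
from typing import Dict, Iterable, Set


def read_coverage(lines: Iterable[str], submitted: Set[str]) -> Dict[str, float]:
    """Return median coverage per submitted sample from a coverage report.

    Assumes a tab-delimited table with a header row (assumption of this
    sketch). Only the 'sample' and 'aln2self_cov_median' columns are read.
    """
    coverage = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        sample = row["sample"]
        if sample not in submitted:
            # Rows for samples outside this submission are ignored.
            continue
        coverage[sample] = float(row["aln2self_cov_median"])
    return coverage
```

The lines argument can be an open file handle or any iterable of text lines, and submitted is the set of sample names in the current batch.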

2.9.2.9. prep_sra_table

This is a very lazy hack that creates a basic table that can be pasted into various columns of an SRA submission spreadsheet. It probably doesn’t work in all cases.

ncbi.py prep_sra_table [-h]
                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                       [--version]
                       lib_fname biosampleFile md5_fname outFile

2.9.2.9.1. Positional Arguments

lib_fname

A file that lists all of the library IDs that will be submitted in this batch

biosampleFile
A file with two columns and a header: sample and BioSample. This file may refer to samples that are not included in this submission.

md5_fname
A file with two columns and no header: MD5 checksum and filename. It should contain an entry for every bam file being submitted in this batch. This is typical output from “md5sum *.cleaned.bam”.

outFile

Output table that contains most of the variable columns needed for SRA submission.
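
The md5_fname layout described above (checksum, then filename, no header) matches what md5sum prints. A minimal Python sketch of reading such a file is shown below. The function name parse_md5sum is hypothetical and this is not the parser used by prep_sra_table.

```python
from typing import Dict, Iterable


def parse_md5sum(lines: Iterable[str]) -> Dict[str, str]:
    """Map filename to MD5 checksum from md5sum-style output lines."""
    checksums = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first run of whitespace: checksum, then filename.
        checksum, name = line.split(None, 1)
        # md5sum marks binary-mode entries with a leading '*' on the name.
        if name.startswith("*"):
            name = name[1:]
        checksums[name] = checksum
    return checksums
```

Splitting only once keeps filenames that contain spaces intact.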

2.9.2.9.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit