3.11. ncbi.py - utilities to interact with NCBI

This script contains a number of utilities for submitting our analyses to NCBI’s Genbank and SRA databases, as well as retrieving records from Genbank.

usage: ncbi.py subcommand
Sub-commands:
tbl_transfer

This function takes an NCBI TBL file describing features on a genome (genes, etc) and transfers them to a new genome.

usage: ncbi.py tbl_transfer [-h] [--oob_clip] [--ignoreAmbigFeatureEdge]
                            [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version]
                            ref_fasta ref_tbl alt_fasta out_tbl
Positional arguments:
ref_fasta Input sequence of reference genome
ref_tbl Input reference annotations (NCBI TBL format)
alt_fasta Input sequence of new genome
out_tbl Output file with transferred annotations
Options:
--oob_clip=False
 Out of bounds feature behavior. False: drop all features that are completely or partly out of bounds. True: drop all features that are completely out of bounds, but truncate any features that are partly out of bounds.
--ignoreAmbigFeatureEdge=False
 Ambiguous feature behavior. False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible. True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values.
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=INFO
 Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
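
The two --oob_clip modes can be illustrated with a small sketch. This is not the tool's implementation: the function name apply_oob_clip is hypothetical, and the 1-based inclusive coordinate convention is an assumption made for the example.

```python
from typing import Optional, Tuple


def apply_oob_clip(start: int, end: int, seq_len: int,
                   oob_clip: bool) -> Optional[Tuple[int, int]]:
    """Return the (start, end) to keep, or None if the feature is dropped.

    Assumes 1-based, inclusive coordinates with start <= end.
    """
    # A feature that lies completely out of bounds is dropped in both modes.
    if end < 1 or start > seq_len:
        return None
    partly_out = start < 1 or end > seq_len
    if partly_out and not oob_clip:
        # Default behavior: a partly out-of-bounds feature is dropped too.
        return None
    # With oob_clip, truncate to the valid range; in-bounds features pass through.
    return (max(start, 1), min(end, seq_len))
```

For a 100-base genome, a feature spanning 90..120 would be dropped by default and truncated to 90..100 when oob_clip is set.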
tbl_transfer_prealigned

This function splits the reference and alt sequences into separate FASTA files, then creates unified files that each contain the reference sequence first and an alt sequence second. Each unified file is then passed as a cmap to tbl_transfer_common.

It expects one FASTA file containing a multiple alignment of a single segment/chromosome, including the reference sequence for that segment/chromosome. It also expects a reference FASTA containing all reference segments/chromosomes, so that the reference sequence can be identified in the input file by name, and a list of reference tbl files, each named according to the ID of its corresponding sequence in refFasta.

For each non-reference sequence in inputFasta, two files are written: a FASTA containing that sequence’s segment/chromosome, and its corresponding feature table as created by tbl_transfer_common.

usage: ncbi.py tbl_transfer_prealigned [-h] [--oob_clip]
                                       [--ignoreAmbigFeatureEdge]
                                       [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                                       [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                       [--version]
                                       inputFasta refFasta refAnnotTblFiles
                                       [refAnnotTblFiles ...] outputDir
Positional arguments:
inputFasta FASTA file containing input sequences, including pre-made alignments and reference sequence
refFasta FASTA file containing the reference genome
refAnnotTblFiles
 Names of the reference feature tables, each of which should have a filename of the form [refId].tbl so that it can be matched against the reference sequences
outputDir The output directory
Options:
--oob_clip=False
 Out of bounds feature behavior. False: drop all features that are completely or partly out of bounds. True: drop all features that are completely out of bounds, but truncate any features that are partly out of bounds.
--ignoreAmbigFeatureEdge=False
 Ambiguous feature behavior. False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible. True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values.
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=INFO
 Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
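
The [refId].tbl naming convention for refAnnotTblFiles can be sketched as follows. The helper names fasta_ids and match_tbl_files are hypothetical, and treating the first whitespace-delimited word of each FASTA header as the sequence ID is an assumption; the tool itself may match names differently.

```python
import os
from typing import Dict, Iterable, List


def fasta_ids(fasta_text: str) -> List[str]:
    """Return the sequence IDs (first word after '>') found in FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def match_tbl_files(ref_ids: Iterable[str],
                    tbl_paths: Iterable[str]) -> Dict[str, str]:
    """Map each reference ID to the path whose basename is '<refId>.tbl'."""
    by_name = {os.path.basename(p): p for p in tbl_paths}
    matched = {}
    for ref_id in ref_ids:
        name = ref_id + ".tbl"
        if name not in by_name:
            raise KeyError("no feature table named " + name)
        matched[ref_id] = by_name[name]
    return matched
```

Checking the pairing this way before running the command makes a misnamed feature table visible early.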
fetch_fastas

This function downloads and saves FASTA files from the Genbank CoreNucleotide database for a given list of accession IDs.

usage: ncbi.py fetch_fastas [-h] [--api_key API_KEY] [--forceOverwrite]
                            [--combinedFilePrefix COMBINEDFILEPREFIX]
                            [--fileExt FILEEXT] [--removeSeparateFiles]
                            [--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
                            [--tmp_dirKeep]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version]
                            emailAddress destinationDir accession_IDs
                            [accession_IDs ...]
Positional arguments:
emailAddress Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).
destinationDir Output directory where .fasta and .tbl files will be saved
accession_IDs List of Genbank nuccore accession IDs
Options:
--api_key Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
--forceOverwrite=False
 Overwrite existing files, if present.
--combinedFilePrefix
 The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.
--fileExt The extension to use for the downloaded files
--removeSeparateFiles=False
 If specified, remove the individual files and leave only the combined file.
--chunkSize=1 Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix (default: 1). If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=INFO
 Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
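
The chunking rule described under --chunkSize can be sketched like this. It is a minimal illustration under stated assumptions rather than the tool's code: the function name chunk_accessions and the constant NCBI_MAX_CHUNK are hypothetical, and None stands in for "chunkSize unspecified".

```python
from typing import List, Optional

NCBI_MAX_CHUNK = 500  # cap mentioned in the --chunkSize description


def chunk_accessions(accessions: List[str],
                     chunk_size: Optional[int] = None) -> List[List[str]]:
    """Split accession IDs into download chunks, preserving their order."""
    if chunk_size is None:
        # Unspecified: use the documented default of 1, unless more than
        # 500 accessions were given, in which case use chunks of 500.
        chunk_size = NCBI_MAX_CHUNK if len(accessions) > NCBI_MAX_CHUNK else 1
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [accessions[i:i + chunk_size]
            for i in range(0, len(accessions), chunk_size)]
```

With 1001 accessions and no explicit chunk size, this yields chunks of 500, 500 and 1.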
fetch_feature_tables

This function downloads and saves feature tables from the Genbank CoreNucleotide database for a given list of accession IDs.

usage: ncbi.py fetch_feature_tables [-h] [--api_key API_KEY]
                                    [--forceOverwrite]
                                    [--combinedFilePrefix COMBINEDFILEPREFIX]
                                    [--fileExt FILEEXT]
                                    [--removeSeparateFiles]
                                    [--chunkSize CHUNKSIZE]
                                    [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                                    [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                    [--version]
                                    emailAddress destinationDir accession_IDs
                                    [accession_IDs ...]
Positional arguments:
emailAddress Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).
destinationDir Output directory where .fasta and .tbl files will be saved
accession_IDs List of Genbank nuccore accession IDs
Options:
--api_key Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
--forceOverwrite=False
 Overwrite existing files, if present.
--combinedFilePrefix
 The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.
--fileExt The extension to use for the downloaded files
--removeSeparateFiles=False
 If specified, remove the individual files and leave only the combined file.
--chunkSize=1 Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix (default: 1). If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=INFO
 Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
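
The request limits quoted under --api_key (3 per second without a key, 10 per second with one) translate into a minimum spacing between requests. The sketch below shows one way a caller might pace requests; min_request_interval and Throttle are hypothetical names, and this is not a description of how ncbi.py itself throttles.

```python
import time
from typing import Callable, Optional


def min_request_interval(api_key: Optional[str]) -> float:
    """Seconds to leave between requests: 1/10 s with a key, 1/3 s without."""
    return 1.0 / (10 if api_key else 3)


class Throttle:
    """Sleeps as needed so that successive calls are at least `interval` apart."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A caller would invoke wait() immediately before each request. The clock and sleep functions are injectable only so that the pacing can be checked without real delays.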
fetch_genbank_records

This function downloads and saves full flat text records from the Genbank CoreNucleotide database for a given list of accession IDs.

usage: ncbi.py fetch_genbank_records [-h] [--api_key API_KEY]
                                     [--forceOverwrite]
                                     [--combinedFilePrefix COMBINEDFILEPREFIX]
                                     [--fileExt FILEEXT]
                                     [--removeSeparateFiles]
                                     [--chunkSize CHUNKSIZE]
                                     [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                                     [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                     [--version]
                                     emailAddress destinationDir accession_IDs
                                     [accession_IDs ...]
Positional arguments:
emailAddress Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).
destinationDir Output directory where .fasta and .tbl files will be saved
accession_IDs List of Genbank nuccore accession IDs
Options:
--api_key Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
--forceOverwrite=False
 Overwrite existing files, if present.
--combinedFilePrefix
 The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.
--fileExt The extension to use for the downloaded files
--removeSeparateFiles=False
 If specified, remove the individual files and leave only the combined file.
--chunkSize=1 Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix (default: 1). If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=INFO
 Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
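
When the fetch commands are driven from another script, the argument order in the usage line matters: options first, then emailAddress, destinationDir and the accession IDs. The sketch below only assembles such an argument list using flags taken from the usage text above. The function name is hypothetical, the email address and accession IDs in any call are placeholders, and invoking the script as plain ncbi.py assumes it is on the PATH.

```python
from typing import List, Optional, Sequence


def fetch_genbank_records_argv(email: str, dest_dir: str,
                               accessions: Sequence[str],
                               api_key: Optional[str] = None,
                               force_overwrite: bool = False,
                               combined_prefix: Optional[str] = None,
                               chunk_size: Optional[int] = None) -> List[str]:
    """Build an argument list for 'ncbi.py fetch_genbank_records'."""
    if not accessions:
        raise ValueError("at least one accession ID is required")
    argv = ["ncbi.py", "fetch_genbank_records"]
    if api_key:
        argv += ["--api_key", api_key]
    if force_overwrite:
        argv.append("--forceOverwrite")
    if combined_prefix:
        argv += ["--combinedFilePrefix", combined_prefix]
    if chunk_size is not None:
        argv += ["--chunkSize", str(chunk_size)]
    # Positional arguments follow the options, in the order of the usage line.
    argv += [email, dest_dir]
    argv += list(accessions)
    return argv
```

The resulting list could then be handed to subprocess.run(argv, check=True).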
prep_genbank_files

Prepare genbank submission files. Requires .fasta and .tbl files as input, as well as numerous other metadata files for the submission. Creates a directory full of files (.sqn in particular) that can be sent to GenBank.

usage: ncbi.py prep_genbank_files [-h] [--comment COMMENT]
                                  [--sequencing_tech SEQUENCING_TECH]
                                  [--master_source_table MASTER_SOURCE_TABLE]
                                  [--organism ORGANISM] [--mol_type MOL_TYPE]
                                  [--biosample_map BIOSAMPLE_MAP]
                                  [--coverage_table COVERAGE_TABLE]
                                  [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                                  [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                  [--version]
                                  templateFile fasta_files [fasta_files ...]
                                  annotDir
Positional arguments:
templateFile Submission template file (.sbt) including author and contact info
fasta_files Input fasta files
annotDir Output directory with genbank submission files (.tbl files must already be there)
Options:
--comment comment field
--sequencing_tech
 sequencing technology (e.g. Illumina HiSeq 2500)
--master_source_table
 source modifier table
--organism species name
--mol_type molecule type
--biosample_map
 A file with two columns and a header: sample and BioSample. This file may refer to samples that are not included in this submission.
--coverage_table
 A genome coverage report file with a header row. The table must have at least two columns named sample and aln2self_cov_median. All other columns are ignored. Rows referring to samples not in this submission are ignored.
--tmp_dir=/tmp Base directory for temp files. [default: /tmp]
--tmp_dirKeep=False
 Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
--loglevel=INFO
 Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
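
The --biosample_map and --coverage_table descriptions above define the columns each file must carry. A minimal sketch of reading them follows. The text does not state the delimiter, so tab-separated input is an assumption here, as are the helper names and the conversion of aln2self_cov_median to a float.

```python
import csv
from typing import Dict, Iterable


def read_biosample_map(lines: Iterable[str]) -> Dict[str, str]:
    """Map sample name to BioSample ID from a two-column table with a header."""
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["sample"]: row["BioSample"] for row in reader}


def read_coverage(lines: Iterable[str],
                  samples: Iterable[str]) -> Dict[str, float]:
    """Map sample name to aln2self_cov_median for the samples of interest.

    Other columns are ignored, as are rows for samples not listed.
    """
    wanted = set(samples)
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["sample"]: float(row["aln2self_cov_median"])
            for row in reader if row["sample"] in wanted}
```

Both helpers accept any iterable of lines, such as an open file handle.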
prep_sra_table

This is a very lazy hack that creates a basic table that can be pasted into various columns of an SRA submission spreadsheet. It probably doesn’t work in all cases.

usage: ncbi.py prep_sra_table [-h]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version]
                              lib_fname biosampleFile md5_fname outFile
Positional arguments:
lib_fname A file that lists all of the library IDs that will be submitted in this batch
biosampleFile A file with two columns and a header: sample and BioSample. This file may refer to samples that are not included in this submission.
md5_fname A file with two columns and no header. The two columns are the MD5 checksum and the filename. It should contain an entry for every bam file being submitted in this batch. This is typical output from “md5sum *.cleaned.bam”.
outFile Output table that contains most of the variable columns needed for SRA submission.
Options:
--loglevel=INFO
 Verboseness of output. [default: INFO] Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
--version, -V show program’s version number and exit
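
The md5_fname input is described as headerless md5sum output, one checksum and filename per line. A minimal sketch of reading such a file is shown below. The function name parse_md5sum is hypothetical, and this is only an illustration of the file layout, not the parsing that prep_sra_table performs.

```python
from typing import Dict, Iterable


def parse_md5sum(lines: Iterable[str]) -> Dict[str, str]:
    """Map filename to MD5 checksum from md5sum-style lines."""
    checksums = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # The checksum comes first; everything after the gap is the filename.
        digest, name = line.split(None, 1)
        # md5sum prefixes the filename with '*' when it read in binary mode.
        if name.startswith("*"):
            name = name[1:]
        checksums[name] = digest
    return checksums
```

Comparing the keys of the result against the list of bam files in the batch is a quick way to spot a missing entry.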