2.9. ncbi.py - NCBI Genbank and SRA utilities
This script contains a number of utilities for submitting our analyses to NCBI’s Genbank and SRA databases, as well as retrieving records from Genbank.
usage: ncbi.py subcommand
2.9.2. Sub-commands
2.9.2.1. tbl_transfer
This function takes an NCBI TBL file describing features on a genome (genes, etc.) and transfers them to a new genome.
ncbi.py tbl_transfer [-h] [--oob_clip] [--ignoreAmbigFeatureEdge]
[--tmp_dir TMP_DIR] [--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
ref_fasta ref_tbl alt_fasta out_tbl
2.9.2.1.1. Positional Arguments
- ref_fasta
Input sequence of reference genome
- ref_tbl
Input reference annotations (NCBI TBL format)
- alt_fasta
Input sequence of new genome
- out_tbl
Output file with transferred annotations
2.9.2.1.2. Named Arguments
- --oob_clip
Out of bounds feature behavior.
False: drop all features that are completely or partly out of bounds.
True: drop all features completely out of bounds, but truncate any features that are partly out of bounds.
Default: False
- --ignoreAmbigFeatureEdge
Ambiguous feature behavior.
False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible.
True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values.
Default: False
- --tmp_dir
Base directory for temp files.
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
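The two --oob_clip settings can be illustrated with a minimal sketch. This is not the code ncbi.py runs; the function name apply_oob_policy is hypothetical, and it assumes 1-based inclusive feature coordinates that have already been mapped onto the new genome.

```python
from typing import Optional, Tuple


def apply_oob_policy(start: int, end: int, seq_len: int,
                     oob_clip: bool) -> Optional[Tuple[int, int]]:
    """Return the (start, end) to keep for a feature, or None to drop it.

    Hypothetical helper: coordinates are assumed to be 1-based and inclusive.
    """
    # A feature entirely outside the sequence is dropped under both settings.
    if end < 1 or start > seq_len:
        return None
    partly_out = start < 1 or end > seq_len
    if partly_out and not oob_clip:
        # Default (False): partly out-of-bounds features are dropped as well.
        return None
    # In-bounds features pass through; with oob_clip, overhangs are truncated.
    return (max(start, 1), min(end, seq_len))


print(apply_oob_policy(90, 120, 100, oob_clip=False))
print(apply_oob_policy(90, 120, 100, oob_clip=True))
```

The first call drops the feature and the second truncates it to the end of the sequence, which is the only difference between the two settings.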
2.9.2.2. tbl_transfer_multichr
This function takes an NCBI TBL file describing features on a genome (genes, etc.) and transfers them to a new genome.
ncbi.py tbl_transfer_multichr [-h] [--ref_fastas REF_FASTAS [REF_FASTAS ...]]
[--ref_tbls REF_TBLS [REF_TBLS ...]]
[--oob_clip] [--ignoreAmbigFeatureEdge]
[--tmp_dir TMP_DIR] [--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
alt_fasta out_dir
2.9.2.2.1. Positional Arguments
- alt_fasta
Input sequence of new genome, all chr/segs in one fasta file, in the same order as ref_fastas
- out_dir
Output files include one fasta and tbl per sequence in alt_fasta, named according to the fasta header ID of each entry in alt_fasta.
2.9.2.2.2. Named Arguments
- --ref_fastas
Input sequences of reference genome, one chr/seg per fasta file
- --ref_tbls
Input reference annotations (NCBI TBL format), one chr/seg per tbl file, in the same order as ref_fastas
- --oob_clip
Out of bounds feature behavior.
False: drop all features that are completely or partly out of bounds.
True: drop all features completely out of bounds, but truncate any features that are partly out of bounds.
Default: False
- --ignoreAmbigFeatureEdge
Ambiguous feature behavior.
False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible.
True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values.
Default: False
- --tmp_dir
Base directory for temp files.
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
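Because --ref_fastas and --ref_tbls must list segments in the same order, it can help to assemble the command line programmatically. The sketch below is only an illustration: build_multichr_cmd and the file names are hypothetical, and only flags shown in the usage above are emitted.

```python
from typing import List, Sequence


def build_multichr_cmd(alt_fasta: str, out_dir: str,
                       ref_fastas: Sequence[str],
                       ref_tbls: Sequence[str],
                       oob_clip: bool = False) -> List[str]:
    """Assemble an argument list for `ncbi.py tbl_transfer_multichr`."""
    # One tbl per reference fasta, paired by position.
    if len(ref_fastas) != len(ref_tbls):
        raise ValueError("ref_fastas and ref_tbls must have the same length")
    # Positionals go first so the multi-value options cannot swallow them.
    cmd = ["ncbi.py", "tbl_transfer_multichr", alt_fasta, out_dir]
    cmd += ["--ref_fastas", *ref_fastas]
    cmd += ["--ref_tbls", *ref_tbls]
    if oob_clip:
        cmd.append("--oob_clip")
    return cmd


# Hypothetical two-segment genome.
print(" ".join(build_multichr_cmd(
    "sample.fasta", "out",
    ["seg1.fasta", "seg2.fasta"],
    ["seg1.tbl", "seg2.tbl"])))
```

The returned list can be passed to subprocess.run without shell quoting concerns.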
2.9.2.3. tbl_transfer_prealigned
This breaks out the ref and alt sequences into separate fasta files, and then creates unified files containing the reference sequence first and the alt second. Each of these unified files is then passed as a cmap to tbl_transfer_common.
This function expects one fasta file containing a multiple alignment of a single segment/chromosome, including the reference sequence for that segment/chromosome. It also expects a reference fasta containing all reference segments/chromosomes, so that the reference sequence can be identified in the input file by name, and a list of reference tbl files, each named according to the ID of its corresponding sequence in refFasta. For each non-reference sequence present in inputFasta, two files are written: a fasta containing that sequence, and its corresponding feature table as created by tbl_transfer_common.
ncbi.py tbl_transfer_prealigned [-h] [--oob_clip] [--ignoreAmbigFeatureEdge]
[--tmp_dir TMP_DIR] [--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
inputFasta refFasta refAnnotTblFiles
[refAnnotTblFiles ...] outputDir
2.9.2.3.1. Positional Arguments
- inputFasta
FASTA file containing input sequences, including pre-made alignments and reference sequence
- refFasta
FASTA file containing the reference genome
- refAnnotTblFiles
Names of the reference feature tables, each of which should have a filename of the form [refId].tbl so they can be matched against the reference sequences
- outputDir
The output directory
2.9.2.3.2. Named Arguments
- --oob_clip
Out of bounds feature behavior.
False: drop all features that are completely or partly out of bounds.
True: drop all features completely out of bounds, but truncate any features that are partly out of bounds.
Default: False
- --ignoreAmbigFeatureEdge
Ambiguous feature behavior.
False: features specified as ambiguous (“<####” or “>####”) are mapped, where possible.
True: features specified as ambiguous (“<####” or “>####”) are interpreted as exact values.
Default: False
- --tmp_dir
Base directory for temp files.
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
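The [refId].tbl naming convention can be checked before running the command. The following is a rough sketch rather than the tool's own matching logic: fasta_ids and match_tbls are hypothetical helpers, the sequence IDs are made up, and the ID is assumed to be the first whitespace-delimited token of each fasta header.

```python
import os
from typing import Dict, Iterable, List


def fasta_ids(fasta_lines: Iterable[str]) -> List[str]:
    """Collect sequence IDs: the first token after '>' on each header line."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def match_tbls(ref_ids: Iterable[str], tbl_paths: Iterable[str]) -> Dict[str, str]:
    """Map each reference ID to the tbl file named <refId>.tbl."""
    # Strip the directory and the final '.tbl' extension to recover the ID.
    by_name = {os.path.splitext(os.path.basename(p))[0]: p for p in tbl_paths}
    ref_ids = list(ref_ids)
    missing = [r for r in ref_ids if r not in by_name]
    if missing:
        raise KeyError("no tbl file for reference IDs: " + ", ".join(missing))
    return {r: by_name[r] for r in ref_ids}


# Hypothetical reference with two segments.
ids = fasta_ids([">segA.1 segment A", "ACGT", ">segB.1 segment B", "GGCC"])
print(match_tbls(ids, ["refs/segB.1.tbl", "refs/segA.1.tbl"]))
```

Only the last extension is removed, so versioned IDs such as segA.1 survive intact.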
2.9.2.4. fetch_fastas
This function downloads and saves the FASTA files from the Genbank CoreNucleotide database given a list of accession IDs.
ncbi.py fetch_fastas [-h] [--api_key API_KEY] [--forceOverwrite]
[--combinedFilePrefix COMBINEDFILEPREFIX]
[--fileExt FILEEXT] [--removeSeparateFiles]
[--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
emailAddress destinationDir accession_IDs
[accession_IDs ...]
2.9.2.4.1. Positional Arguments
- emailAddress
Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).
- destinationDir
Output directory where .fasta and .tbl files will be saved
- accession_IDs
List of Genbank nuccore accession IDs
2.9.2.4.2. Named Arguments
- --api_key
Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
- --forceOverwrite
Overwrite existing files, if present.
Default: False
- --combinedFilePrefix
The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.
- --fileExt
The extension to use for the downloaded files
- --removeSeparateFiles
If specified, remove the individual files and leave only the combined file.
Default: False
- --chunkSize
Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix. If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.
Default: 1
- --tmp_dir
Base directory for temp files.
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
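The --chunkSize rule described above can be sketched as follows. This is one reading of the help text, not the tool's source: effective_chunk_size and chunk_accessions are hypothetical names, and an unspecified chunk size is modeled as None.

```python
from typing import List, Optional, Sequence

MAX_CHUNK = 500  # limit quoted in the --chunkSize help text


def effective_chunk_size(n_accessions: int,
                         chunk_size: Optional[int] = None) -> int:
    """Pick the chunk size: explicit value, else 500 for large requests, else 1."""
    if chunk_size is None:
        return MAX_CHUNK if n_accessions > MAX_CHUNK else 1
    return chunk_size


def chunk_accessions(accessions: Sequence[str],
                     chunk_size: Optional[int] = None) -> List[List[str]]:
    """Split the accession list into consecutive chunks, preserving order."""
    size = effective_chunk_size(len(accessions), chunk_size)
    return [list(accessions[i:i + size])
            for i in range(0, len(accessions), size)]


# Hypothetical accession IDs, chunked two at a time.
print(chunk_accessions(["ACC1", "ACC2", "ACC3"], chunk_size=2))
```

With these assumptions, 1001 accessions and no explicit chunk size would be fetched as three chunks of 500, 500, and 1.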
2.9.2.5. fetch_feature_tables
This function downloads and saves feature tables from the Genbank CoreNucleotide database given a list of accession IDs.
ncbi.py fetch_feature_tables [-h] [--api_key API_KEY] [--forceOverwrite]
[--combinedFilePrefix COMBINEDFILEPREFIX]
[--fileExt FILEEXT] [--removeSeparateFiles]
[--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
emailAddress destinationDir accession_IDs
[accession_IDs ...]
2.9.2.5.1. Positional Arguments
- emailAddress
Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).
- destinationDir
Output directory where .fasta and .tbl files will be saved
- accession_IDs
List of Genbank nuccore accession IDs
2.9.2.5.2. Named Arguments
- --api_key
Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
- --forceOverwrite
Overwrite existing files, if present.
Default: False
- --combinedFilePrefix
The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.
- --fileExt
The extension to use for the downloaded files
- --removeSeparateFiles
If specified, remove the individual files and leave only the combined file.
Default: False
- --chunkSize
Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix. If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.
Default: 1
- --tmp_dir
Base directory for temp files.
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
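The request limits quoted for --api_key (3/second without a key, 10/second with one) translate into a minimum spacing between requests. The sketch below shows one simple way to pace calls; it is an assumption about how a client might do this, not a description of how ncbi.py throttles, and min_request_interval and Throttle are hypothetical names.

```python
import time
from typing import Callable, Optional


def min_request_interval(api_key: Optional[str] = None) -> float:
    """Seconds to leave between requests under the quoted NCBI limits."""
    per_second = 10 if api_key else 3
    return 1.0 / per_second


class Throttle:
    """Sleep just long enough that calls are at least `interval` apart."""

    def __init__(self, interval: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = interval
        self.clock = clock
        self.sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self.clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
        # Record when this request is released.
        self._last = self.clock()


throttle = Throttle(min_request_interval(api_key=None))
for _ in range(3):
    throttle.wait()  # a request would be issued here
```

The clock and sleep functions are injectable so the pacing can be tested without real delays.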
2.9.2.6. fetch_genbank_records
This function downloads and saves full flat text records from the Genbank CoreNucleotide database given a list of accession IDs.
ncbi.py fetch_genbank_records [-h] [--api_key API_KEY] [--forceOverwrite]
[--combinedFilePrefix COMBINEDFILEPREFIX]
[--fileExt FILEEXT] [--removeSeparateFiles]
[--chunkSize CHUNKSIZE] [--tmp_dir TMP_DIR]
[--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
emailAddress destinationDir accession_IDs
[accession_IDs ...]
2.9.2.6.1. Positional Arguments
- emailAddress
Your email address. To access Genbank databases, NCBI requires you to specify your email address with each request. In case of excessive usage of the E-utilities, NCBI will attempt to contact a user at the email address provided before blocking access. This email address should be registered with NCBI. To register an email address, simply send an email to eutilities@ncbi.nlm.nih.gov including your email address and the tool name (tool=’https://github.com/broadinstitute/viral-ngs’).
- destinationDir
Output directory where .fasta and .tbl files will be saved
- accession_IDs
List of Genbank nuccore accession IDs
2.9.2.6.2. Named Arguments
- --api_key
Your NCBI API key. If an API key is not provided, NCBI requests are limited to 3/second. If an API key is provided, requests may be submitted at a rate up to 10/second. For more information, see: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
- --forceOverwrite
Overwrite existing files, if present.
Default: False
- --combinedFilePrefix
The prefix of the file containing the combined concatenated results returned by the list of accession IDs, in the order provided.
- --fileExt
The extension to use for the downloaded files
- --removeSeparateFiles
If specified, remove the individual files and leave only the combined file.
Default: False
- --chunkSize
Causes files to be downloaded from GenBank in chunks of N accessions. Each chunk will be its own combined file, separate from any combined file created via --combinedFilePrefix. If chunkSize is unspecified and >500 accessions are provided, chunkSize will be set to 500 to adhere to the NCBI guidelines on information retrieval.
Default: 1
- --tmp_dir
Base directory for temp files.
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
2.9.2.7. biosample_to_genbank
Prepare a Genbank Source Modifier Table based on a BioSample registration table (since all of the values are there)
ncbi.py biosample_to_genbank [-h] [--biosample_in_smt] [--iso_dates]
[--sgtf_override]
[--filter_to_samples FILTER_TO_SAMPLES]
[--tmp_dir TMP_DIR] [--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
attributes num_segments taxid out_genbank_smt
out_biosample_map
2.9.2.7.1. Positional Arguments
- attributes
Input BioSample metadata table – the attributes.tsv returned by BioSample after successful registration
- num_segments
number of chromosomes/segments per genome for this species
- taxid
NCBI Taxonomy numeric taxid to assign to all entries
- out_genbank_smt
Output tab table in Genbank Source Modifier Table format, suitable for prep_genbank_files
- out_biosample_map
Output two-column biosample accession to sample name map, suitable for prep_genbank_files
2.9.2.7.2. Named Arguments
- --biosample_in_smt
Add BioSample and BioProject columns to source modifier table output
Default: False
- --iso_dates
write collection_date in ISO format (YYYY-MM-DD). default (false) is to write in tbl2asn format (DD-Mmm-YYYY)
Default: False
- --sgtf_override
replace “Screening for Variants of Concern (VoC)” with “screened by S dropout” in the note field
Default: False
- --filter_to_samples
Filter output to specified sample IDs in this input file (one ID per line).
- --tmp_dir
Base directory for temp files.
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
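The two collection_date layouts selected by --iso_dates differ only in formatting. As a small illustration (not the tool's own code), the sketch below converts a full ISO date to the DD-Mmm-YYYY layout; the function names are hypothetical, and partial dates such as YYYY or YYYY-MM are deliberately not handled.

```python
from datetime import date

# Fixed English abbreviations, so the result does not depend on the locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def to_tbl2asn_date(iso_date: str) -> str:
    """Convert 'YYYY-MM-DD' to 'DD-Mmm-YYYY'. Assumes a complete date."""
    d = date.fromisoformat(iso_date)
    return f"{d.day:02d}-{_MONTHS[d.month - 1]}-{d.year:04d}"


def format_collection_date(iso_date: str, iso_dates: bool = False) -> str:
    """Mimic the --iso_dates switch: keep ISO, or rewrite in tbl2asn style."""
    return iso_date if iso_dates else to_tbl2asn_date(iso_date)


print(format_collection_date("2020-03-05"))                  # 05-Mar-2020
print(format_collection_date("2020-03-05", iso_dates=True))  # 2020-03-05
```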
2.9.2.8. prep_genbank_files
Prepare genbank submission files. Requires .fasta and .tbl files as input, as well as numerous other metadata files for the submission. Creates a directory full of files (.sqn in particular) that can be sent to GenBank.
ncbi.py prep_genbank_files [-h] [--comment COMMENT]
[--sequencing_tech SEQUENCING_TECH]
[--master_source_table MASTER_SOURCE_TABLE]
[--organism ORGANISM] [--mol_type MOL_TYPE]
[--biosample_map BIOSAMPLE_MAP]
[--coverage_table COVERAGE_TABLE]
[--assembly_method ASSEMBLY_METHOD]
[--assembly_method_version ASSEMBLY_METHOD_VERSION]
[--tmp_dir TMP_DIR] [--tmp_dirKeep]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
templateFile fasta_files [fasta_files ...] annotDir
2.9.2.8.1. Positional Arguments
- templateFile
Submission template file (.sbt) including author and contact info
- fasta_files
Input fasta files
- annotDir
Output directory with genbank submission files (.tbl files must already be there)
2.9.2.8.2. Named Arguments
- --comment
comment field
- --sequencing_tech
sequencing technology (e.g. Illumina HiSeq 2500)
- --master_source_table
source modifier table
- --organism
species name
- --mol_type
molecule type
- --biosample_map
A file with two columns and a header: sample and BioSample. This file may refer to samples that are not included in this submission.
- --coverage_table
A genome coverage report file with a header row. The table must have at least two columns named sample and aln2self_cov_median. All other columns are ignored. Rows referring to samples not in this submission are ignored.
- --assembly_method
short description of informatic assembly method
- --assembly_method_version
version of assembly method used
- --tmp_dir
Base directory for temp files.
Default: '/tmp'
- --tmp_dirKeep
Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
Default: False
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
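The --coverage_table requirements (a header row, the sample and aln2self_cov_median columns, extra columns and unrelated rows ignored) can be checked ahead of a submission with a short reader. This is a sketch under assumptions: read_coverage is a hypothetical helper, the table is assumed to be tab-delimited, and the sample data is invented.

```python
import csv
import io
from typing import Dict, Iterable, Set


def read_coverage(handle: Iterable[str], samples: Set[str]) -> Dict[str, str]:
    """Return {sample: aln2self_cov_median} for samples in this submission."""
    reader = csv.DictReader(handle, delimiter="\t")  # assumed tab-delimited
    required = {"sample", "aln2self_cov_median"}
    if not required.issubset(reader.fieldnames or []):
        raise ValueError("coverage table lacks required columns")
    # Other columns are ignored, as are rows for samples not being submitted.
    return {row["sample"]: row["aln2self_cov_median"]
            for row in reader if row["sample"] in samples}


# Invented example table with one extra column and one unrelated sample.
table = io.StringIO(
    "sample\tassembly_length\taln2self_cov_median\n"
    "S1\t18900\t250\n"
    "S2\t18950\t12\n"
    "S9\t100\t3\n"
)
print(read_coverage(table, {"S1", "S2"}))
```

Values are kept as strings here, since the source does not say how the tool parses them.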
2.9.2.9. prep_sra_table
This is a very lazy hack that creates a basic table that can be pasted into various columns of an SRA submission spreadsheet. It probably doesn’t work in all cases.
ncbi.py prep_sra_table [-h]
[--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
[--version]
lib_fname biosampleFile md5_fname outFile
2.9.2.9.1. Positional Arguments
- lib_fname
A file that lists all of the library IDs that will be submitted in this batch
- biosampleFile
A file with two columns and a header: sample and BioSample. This file may refer to samples that are not included in this submission.
- md5_fname
A file with two columns and no header: MD5 checksum and filename. Should contain an entry for every bam file being submitted in this batch. This is typical output from “md5sum *.cleaned.bam”.
- outFile
Output table that contains most of the variable columns needed for SRA submission.
2.9.2.9.2. Named Arguments
- --loglevel
Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
Verboseness of output.
Default: 'INFO'
- --version, -V
show program’s version number and exit
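The md5_fname layout (checksum, then filename, no header) is what md5sum prints. A minimal reader for that layout might look like the sketch below; parse_md5sum is a hypothetical helper, the checksums are placeholders, and how prep_sra_table itself consumes the file is not described here.

```python
from typing import Dict, Iterable


def parse_md5sum(lines: Iterable[str]) -> Dict[str, str]:
    """Map filename -> checksum from md5sum-style lines."""
    checksums = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first run of whitespace so filenames may contain spaces.
        checksum, filename = line.split(None, 1)
        # md5sum prefixes the name with '*' when files are read in binary mode.
        checksums[filename.lstrip("*")] = checksum
    return checksums


# Placeholder checksums, not real digests.
example = [
    "0123456789abcdef0123456789abcdef  lib1.cleaned.bam\n",
    "fedcba9876543210fedcba9876543210 *lib2.cleaned.bam\n",
]
print(parse_md5sum(example))
```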