3.9. illumina.py - for raw Illumina outputs
Utilities for demultiplexing Illumina data.
usage: illumina.py subcommand
- Sub-commands:
- illumina_demux
Read Illumina runs and produce BAM files, demultiplexing to one BAM per sample; for simplex runs, a single BAM is produced bearing the flowcell ID. Wraps Picard’s ExtractIlluminaBarcodes (for multiplexed samples) and IlluminaBasecallsToSam while handling the various required input formats. Accepts Illumina BCL directories as well as tar.gz archives of BCL directories.
usage: illumina.py illumina_demux [-h] [--outMetrics OUTMETRICS] [--commonBarcodes COMMONBARCODES] [--sampleSheet SAMPLESHEET] [--runInfo RUNINFO] [--flowcell FLOWCELL] [--read_structure READ_STRUCTURE] [--max_mismatches MAX_MISMATCHES] [--minimum_base_quality MINIMUM_BASE_QUALITY] [--min_mismatch_delta MIN_MISMATCH_DELTA] [--max_no_calls MAX_NO_CALLS] [--minimum_quality MINIMUM_QUALITY] [--compress_outputs COMPRESS_OUTPUTS] [--sequencing_center SEQUENCING_CENTER] [--adapters_to_check [ADAPTERS_TO_CHECK [ADAPTERS_TO_CHECK ...]]] [--platform PLATFORM] [--max_reads_in_ram_per_tile MAX_READS_IN_RAM_PER_TILE] [--max_records_in_ram MAX_RECORDS_IN_RAM] [--apply_eamss_filter APPLY_EAMSS_FILTER] [--force_gc FORCE_GC] [--first_tile FIRST_TILE] [--tile_limit TILE_LIMIT] [--include_non_pf_reads INCLUDE_NON_PF_READS] [--run_start_date RUN_START_DATE] [--read_group_id READ_GROUP_ID] [--compression_level COMPRESSION_LEVEL] [--JVMmemory JVMMEMORY] [--threads THREADS] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] inDir lane outDir
- Positional arguments:
  inDir  Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.
  lane  Lane number.
  outDir  Output directory for BAM files.
- Options:
  --outMetrics  Output ExtractIlluminaBarcodes metrics file. Default is to dump to a temp file.
  --commonBarcodes  Write a TSV report of all barcode counts, in descending order. Only applicable for read structures containing “B”.
  --sampleSheet  Override SampleSheet. Input tab or CSV file w/header and four named columns: barcode_name, library_name, barcode_sequence_1, barcode_sequence_2. Default is to look for a SampleSheet.csv in the inDir.
  --runInfo  Override RunInfo. Input XML file. Default is to look for a RunInfo.xml file in the inDir.
  --flowcell  Override flowcell ID (default: read from RunInfo.xml).
  --read_structure  Override read structure (default: read from RunInfo.xml).
  --max_mismatches=1  Picard ExtractIlluminaBarcodes MAX_MISMATCHES (default: 1)
  --minimum_base_quality=10  Picard ExtractIlluminaBarcodes MINIMUM_BASE_QUALITY (default: 10)
  --min_mismatch_delta  Picard ExtractIlluminaBarcodes MIN_MISMATCH_DELTA
  --max_no_calls  Picard ExtractIlluminaBarcodes MAX_NO_CALLS
  --minimum_quality  Picard ExtractIlluminaBarcodes MINIMUM_QUALITY
  --compress_outputs  Picard ExtractIlluminaBarcodes COMPRESS_OUTPUTS
  --sequencing_center  Picard IlluminaBasecallsToSam SEQUENCING_CENTER
  --adapters_to_check=('PAIRED_END', 'NEXTERA_V1', 'NEXTERA_V2')  Picard IlluminaBasecallsToSam ADAPTERS_TO_CHECK (default: PAIRED_END, NEXTERA_V1, NEXTERA_V2)
  --platform  Picard IlluminaBasecallsToSam PLATFORM
  --max_reads_in_ram_per_tile=200000  Picard IlluminaBasecallsToSam MAX_READS_IN_RAM_PER_TILE (default: 200000)
  --max_records_in_ram=1000000  Picard IlluminaBasecallsToSam MAX_RECORDS_IN_RAM (default: 1000000)
  --apply_eamss_filter  Picard IlluminaBasecallsToSam APPLY_EAMSS_FILTER
  --force_gc  Picard IlluminaBasecallsToSam FORCE_GC
  --first_tile  Picard IlluminaBasecallsToSam FIRST_TILE
  --tile_limit  Picard IlluminaBasecallsToSam TILE_LIMIT
  --include_non_pf_reads=False  Picard IlluminaBasecallsToSam INCLUDE_NON_PF_READS (default: False)
  --run_start_date  Picard IlluminaBasecallsToSam RUN_START_DATE
  --read_group_id  Picard IlluminaBasecallsToSam READ_GROUP_ID
  --compression_level=7  Picard IlluminaBasecallsToSam COMPRESSION_LEVEL (default: 7)
  --JVMmemory=7g  JVM virtual memory size (default: 7g)
  --threads=0  Number of threads (default: 0)
  --loglevel=INFO  Verboseness of output (default: INFO). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  --version, -V  Show program’s version number and exit.
  --tmp_dir=/tmp  Base directory for temp files (default: /tmp).
  --tmp_dirKeep=False  Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
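The --flowcell and --read_structure options both default to values read from RunInfo.xml. As a rough illustration of what that lookup could involve, the sketch below derives a flowcell ID and a Picard-style read structure from a trimmed, hypothetical RunInfo.xml. The XML layout, the sample values, and the function name flowcell_and_read_structure are assumptions made for this example and are not taken from illumina.py itself; indexed reads are mapped to “B” (barcode) and all other reads to “T” (template), following Picard's read-structure notation.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down RunInfo.xml content (assumed layout and values).
RUNINFO_XML = """<RunInfo>
  <Run Id="170101_M00001_0001_000000000-ABCDE" Number="1">
    <Flowcell>000000000-ABCDE</Flowcell>
    <Reads>
      <Read Number="1" NumCycles="101" IsIndexedRead="N"/>
      <Read Number="2" NumCycles="8" IsIndexedRead="Y"/>
      <Read Number="3" NumCycles="8" IsIndexedRead="Y"/>
      <Read Number="4" NumCycles="101" IsIndexedRead="N"/>
    </Reads>
  </Run>
</RunInfo>
"""


def flowcell_and_read_structure(xml_text):
    """Return (flowcell_id, read_structure) from RunInfo.xml text.

    Illustrative helper only: it assumes the <Run>/<Flowcell> and
    <Run>/<Reads>/<Read> layout shown in RUNINFO_XML above.
    """
    root = ET.fromstring(xml_text)
    flowcell = root.findtext("./Run/Flowcell")
    # Order the reads by their Number attribute before concatenating.
    reads = sorted(root.findall("./Run/Reads/Read"),
                   key=lambda read: int(read.get("Number")))
    structure = "".join(
        read.get("NumCycles") + ("B" if read.get("IsIndexedRead") == "Y" else "T")
        for read in reads
    )
    return flowcell, structure


flowcell, read_structure = flowcell_and_read_structure(RUNINFO_XML)
print(flowcell, read_structure)
# --commonBarcodes is only applicable when the structure contains "B".
print("has barcode reads:", "B" in read_structure)
```

Passing --flowcell or --read_structure explicitly would bypass a lookup of this kind.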
- lane_metrics
Write out lane metrics to a tsv file.
usage: illumina.py lane_metrics [-h] [--read_structure READ_STRUCTURE] [--JVMmemory JVMMEMORY] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] inDir outPrefix
- Positional arguments:
  inDir  Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.
  outPrefix  Prefix path to the *.illumina_lane_metrics and *.illumina_phasing_metrics files.
- Options:
  --read_structure  Override read structure (default: read from RunInfo.xml).
  --JVMmemory=8g  JVM virtual memory size (default: 8g)
  --loglevel=INFO  Verboseness of output (default: INFO). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  --version, -V  Show program’s version number and exit.
  --tmp_dir=/tmp  Base directory for temp files (default: /tmp).
  --tmp_dirKeep=False  Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
- common_barcodes
Extract Illumina barcodes for a run and write a TSV report of the barcode counts in descending order.
usage: illumina.py common_barcodes [-h] [--truncateToLength TRUNCATETOLENGTH] [--omitHeader] [--includeNoise] [--outMetrics OUTMETRICS] [--sampleSheet SAMPLESHEET] [--flowcell FLOWCELL] [--read_structure READ_STRUCTURE] [--max_mismatches MAX_MISMATCHES] [--minimum_base_quality MINIMUM_BASE_QUALITY] [--min_mismatch_delta MIN_MISMATCH_DELTA] [--max_no_calls MAX_NO_CALLS] [--minimum_quality MINIMUM_QUALITY] [--compress_outputs COMPRESS_OUTPUTS] [--JVMmemory JVMMEMORY] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] inDir lane outSummary
- Positional arguments:
  inDir  Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.
  lane  Lane number.
  outSummary  Path to the summary file (.tsv format). It includes several columns: (barcode1, likely_index_name1, barcode2, likely_index_name2, count), where the likely index names are either the exact-match index name for the barcode sequence or the names of indexes a Hamming distance of 1 away.
- Options:
  --truncateToLength  If specified, only this number of barcodes will be returned. Useful if you only want the top N barcodes.
  --omitHeader=False  If specified, a header will not be added to the outSummary tsv file.
  --includeNoise=False  If specified, barcodes with periods (“.”) will be included.
  --outMetrics  Output ExtractIlluminaBarcodes metrics file. Default is to dump to a temp file.
  --sampleSheet  Override SampleSheet. Input tab or CSV file w/header and four named columns: barcode_name, library_name, barcode_sequence_1, barcode_sequence_2. Default is to look for a SampleSheet.csv in the inDir.
  --flowcell  Override flowcell ID (default: read from RunInfo.xml).
  --read_structure  Override read structure (default: read from RunInfo.xml).
  --max_mismatches=1  Picard ExtractIlluminaBarcodes MAX_MISMATCHES (default: 1)
  --minimum_base_quality=10  Picard ExtractIlluminaBarcodes MINIMUM_BASE_QUALITY (default: 10)
  --min_mismatch_delta  Picard ExtractIlluminaBarcodes MIN_MISMATCH_DELTA
  --max_no_calls  Picard ExtractIlluminaBarcodes MAX_NO_CALLS
  --minimum_quality  Picard ExtractIlluminaBarcodes MINIMUM_QUALITY
  --compress_outputs  Picard ExtractIlluminaBarcodes COMPRESS_OUTPUTS
  --JVMmemory=8g  JVM virtual memory size (default: 8g)
  --loglevel=INFO  Verboseness of output (default: INFO). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  --version, -V  Show program’s version number and exit.
  --tmp_dir=/tmp  Base directory for temp files (default: /tmp).
  --tmp_dirKeep=False  Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
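To make the outSummary columns more concrete, here is a minimal sketch of how such a report might be assembled: barcode pairs are counted, pairs containing “.” are dropped unless noise is included, the list is optionally truncated, and each barcode is labeled with the exact-match index name or, failing that, the names one Hamming distance away. The function names, the index names idxA and idxB, and the barcode sequences are all hypothetical; this is not the code illumina.py runs.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def likely_index_names(barcode, known):
    """Exact-match names if any, otherwise names at Hamming distance 1.

    `known` maps a (hypothetical) index name to its sequence.
    """
    exact = [name for name, seq in known.items() if seq == barcode]
    if exact:
        return exact
    return [name for name, seq in known.items()
            if len(seq) == len(barcode) and hamming(seq, barcode) == 1]


def barcode_report(pairs, known, truncate_to=None, include_noise=False):
    """Rows of (barcode1, names1, barcode2, names2, count), most common first."""
    counts = Counter(pairs)
    if not include_noise:
        # Drop pairs in which either barcode contains a no-call (".").
        counts = Counter({p: c for p, c in counts.items()
                          if "." not in p[0] + p[1]})
    rows = []
    for (b1, b2), count in counts.most_common(truncate_to):
        rows.append((b1, ",".join(likely_index_names(b1, known)),
                     b2, ",".join(likely_index_names(b2, known)), count))
    return rows


# Hypothetical index set and observed barcode pairs.
known = {"idxA": "AAAACCCC", "idxB": "GGGGTTTT"}
pairs = ([("AAAACCCC", "GGGGTTTT")] * 3
         + [("AAAACCCA", "GGGGTTTT")] * 2
         + [("AAAA.CCC", "GGGGTTTT")])

for row in barcode_report(pairs, known):
    print("\t".join(str(field) for field in row))
```

In this sketch, truncate_to and include_noise play the roles described for --truncateToLength and --includeNoise.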
- guess_barcodes
Guess the barcode value for a sample name, based on the following:
  - a list is made of novel barcode pairs seen in the data, but not in the Picard metrics
  - for the sample in question, get the most abundant novel barcode pair where one of the barcodes seen in the data matches one of the barcodes in the Picard metrics (partial match)
  - if there are no partial matches, get the most abundant novel barcode pair
Limitations:
  - If multiple samples share a barcode with multiple novel barcodes, disentangling them is difficult or impossible
The names of samples to guess are selected:
  - explicitly by name, passed via argument, OR
  - explicitly by read count threshold, OR
  - automatically (if names or count threshold are omitted) based on basic outlier detection of deviation from an assumed-balanced pool with some number of negative controls
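The guessing steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: the function name, the data shapes, and the match-type labels “partial” and “fallback” are invented for the example, and a partial match is taken to mean that one barcode of the novel pair equals the corresponding expected barcode of the sample in question.

```python
def guess_barcode_pair(expected, observed_counts, assigned_pairs):
    """Pick a likely barcode pair for one sample.

    expected        -- (barcode_1, barcode_2) used for the sample at demux time
    observed_counts -- dict mapping observed (barcode_1, barcode_2) to read count
    assigned_pairs  -- set of pairs already listed in the Picard metrics
    Returns (pair, count, match_type), or None if no novel pairs exist.
    """
    # Step 1: novel pairs are seen in the data but absent from the metrics.
    novel = {pair: n for pair, n in observed_counts.items()
             if pair not in assigned_pairs}
    if not novel:
        return None

    # Step 2: prefer novel pairs sharing one barcode with the expected pair.
    partial = {pair: n for pair, n in novel.items()
               if pair[0] == expected[0] or pair[1] == expected[1]}

    # Step 3: otherwise fall back to the most abundant novel pair overall.
    pool, match_type = (partial, "partial") if partial else (novel, "fallback")
    best = max(pool, key=pool.get)
    return best, pool[best], match_type


# Hypothetical barcodes and counts.
assigned = {("AAAA", "CCCC"), ("GGGG", "TTTT")}
observed = {
    ("AAAA", "CCCC"): 5,      # expected pair, barely any reads
    ("AAAA", "CCCG"): 40,     # novel, shares barcode 1 with the sample
    ("TTTT", "GGGA"): 90,     # novel, unrelated to the sample
    ("GGGG", "TTTT"): 1000,   # another sample's pair
}
print(guess_barcode_pair(("AAAA", "CCCC"), observed, assigned))
print(guess_barcode_pair(("ACAC", "GTGT"), observed, assigned))
```

The second call has no partial match, so it falls through to the most abundant novel pair, which is also where the stated limitation applies: several samples could end up with the same guess.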
usage: illumina.py guess_barcodes [-h] [--readcount_threshold READCOUNT_THRESHOLD | --sample_names [SAMPLE_NAMES [SAMPLE_NAMES ...]]] [--outlier_threshold OUTLIER_THRESHOLD] [--expected_assigned_fraction EXPECTED_ASSIGNED_FRACTION] [--number_of_negative_controls NUMBER_OF_NEGATIVE_CONTROLS] [--rows_limit ROWS_LIMIT] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] in_barcodes in_picard_metrics out_summary_tsv
- Positional arguments:
  in_barcodes  The barcode counts file produced by common_barcodes.
  in_picard_metrics  The demultiplexing read metrics produced by Picard.
  out_summary_tsv  Path to the summary file (.tsv format). It includes several columns: (sample_name, expected_barcode_1, expected_barcode_2, expected_barcode_1_name, expected_barcode_2_name, expected_barcodes_read_count, guessed_barcode_1, guessed_barcode_2, guessed_barcode_1_name, guessed_barcode_2_name, guessed_barcodes_read_count, match_type), where the expected values are those used by Picard during demultiplexing and the guessed values are based on the barcodes seen among the data.
- Options:
  --readcount_threshold  If specified, guess barcodes for samples with fewer than this many reads.
  --sample_names  If specified, only guess barcodes for these sample names.
  --outlier_threshold=0.675  Threshold of how far from balanced a sample must be to be considered an outlier.
  --expected_assigned_fraction=0.7  The fraction of reads expected to be assigned. An exception is raised if fewer than this fraction are assigned.
  --number_of_negative_controls=1  The number of negative controls in the pool, for calculating expected number of reads in the rest of the pool.
  --rows_limit=1000  The number of rows to use from the in_barcodes.
  --loglevel=INFO  Verboseness of output (default: INFO). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  --version, -V  Show program’s version number and exit.
  --tmp_dir=/tmp  Base directory for temp files (default: /tmp).
  --tmp_dirKeep=False  Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
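The usage line shows --readcount_threshold and --sample_names as mutually exclusive, and --expected_assigned_fraction as a guard that raises an exception. The sketch below illustrates those two explicit selection modes and the guard; the function names and sample values are hypothetical, the exception type is a guess, and the automatic outlier-detection mode is deliberately left out because its exact method is not described here.

```python
def select_samples(read_counts, sample_names=None, readcount_threshold=None):
    """Choose which samples to guess barcodes for (explicit modes only).

    read_counts maps sample name to its assigned read count.
    """
    if sample_names is not None and readcount_threshold is not None:
        raise ValueError("give sample names or a read count threshold, not both")
    if sample_names is not None:
        return list(sample_names)
    if readcount_threshold is not None:
        # Samples with fewer than the threshold number of reads.
        return [name for name, n in read_counts.items() if n < readcount_threshold]
    # The automatic (outlier-detection) mode is not sketched here.
    raise NotImplementedError("automatic selection is outside this sketch")


def check_assigned_fraction(read_counts, total_reads, expected_assigned_fraction=0.7):
    """Raise if fewer reads were assigned to samples than expected."""
    fraction = sum(read_counts.values()) / total_reads
    if fraction < expected_assigned_fraction:
        raise ValueError(
            f"only {fraction:.2%} of reads assigned, "
            f"expected at least {expected_assigned_fraction:.0%}")
    return fraction


# Hypothetical per-sample read counts.
counts = {"sample1": 1000, "sample2": 12, "neg_control": 3}
print(select_samples(counts, readcount_threshold=100))
print(select_samples(counts, sample_names=["sample1"]))
print(round(check_assigned_fraction(counts, total_reads=1200), 3))
```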
- miseq_fastq_to_bam
Convert fastq read files to a single BAM file. Fastq file names must conform to the patterns emitted by MiSeq machines. Sample metadata must be provided in a SampleSheet.csv that corresponds to the fastq filename. Specifically, the _S##_ index in the fastq file name will be used to find the corresponding row in the SampleSheet.
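As a hedged illustration of the _S##_ lookup, the sketch below pulls the sample index out of a MiSeq-style fastq name and uses it to pick a sample sheet row. The file name, the row contents, and both function names are made up for the example, and the sketch assumes the index counts data rows starting from 1; the tool's own parsing may differ.

```python
import os
import re

# Assumed MiSeq naming pattern: <name>_S<index>_L<lane>_R<read>_001.fastq[.gz]
MISEQ_FASTQ = re.compile(r"_S(\d+)_L\d{3}_R[12]_\d{3}\.fastq(?:\.gz)?$")


def sample_index_from_fastq(path):
    """Return the integer from the _S##_ part of a MiSeq fastq file name."""
    match = MISEQ_FASTQ.search(os.path.basename(path))
    if match is None:
        raise ValueError(f"not a MiSeq-style fastq name: {path}")
    return int(match.group(1))


def sample_sheet_row(path, rows):
    """Pick the sample sheet row for a fastq file (1-based index assumed)."""
    index = sample_index_from_fastq(path)
    if not 1 <= index <= len(rows):
        raise IndexError(f"sample index {index} has no row in the sample sheet")
    return rows[index - 1]


# Hypothetical sample sheet rows, in file order.
rows = [{"Sample_ID": "alpha"}, {"Sample_ID": "beta"}, {"Sample_ID": "gamma"}]
fastq = "/data/run1/gamma_S3_L001_R1_001.fastq.gz"
print(sample_index_from_fastq(fastq), sample_sheet_row(fastq, rows))
```

Anchoring the pattern to the end of the name avoids picking up an _S#_ fragment that happens to occur inside the sample name itself.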
usage: illumina.py miseq_fastq_to_bam [-h] [--inFastq2 INFASTQ2] [--runInfo RUNINFO] [--sequencing_center SEQUENCING_CENTER] [--JVMmemory JVMMEMORY] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] outBam sampleSheet inFastq1
- Positional arguments:
  outBam  Output BAM file.
  sampleSheet  Input SampleSheet.csv file.
  inFastq1  Input fastq file; 1st end of paired-end reads if paired.
- Options:
  --inFastq2  Input fastq file; 2nd end of paired-end reads.
  --runInfo  Input RunInfo.xml file.
  --sequencing_center  Name of your sequencing center (default is the sequencing machine ID from the RunInfo.xml).
  --JVMmemory=2g  JVM virtual memory size (default: 2g)
  --loglevel=INFO  Verboseness of output (default: INFO). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  --version, -V  Show program’s version number and exit.
  --tmp_dir=/tmp  Base directory for temp files (default: /tmp).
  --tmp_dirKeep=False  Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
- extract_fc_metadata
Extract RunInfo.xml and SampleSheet.csv from the provided Illumina directory.
usage: illumina.py extract_fc_metadata [-h] [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}] [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep] flowcell outRunInfo outSampleSheet
- Positional arguments:
  flowcell  Illumina directory (possibly tarball)
  outRunInfo  Output RunInfo.xml file.
  outSampleSheet  Output SampleSheet.csv file.
- Options:
  --loglevel=INFO  Verboseness of output (default: INFO). Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION
  --version, -V  Show program’s version number and exit.
  --tmp_dir=/tmp  Base directory for temp files (default: /tmp).
  --tmp_dirKeep=False  Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.
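For the tarball case, the idea behind extract_fc_metadata can be approximated with the standard tarfile module, as in the sketch below. The function name extract_metadata_from_tarball is hypothetical, and the choice to take the first member whose base name matches is an assumption of this example rather than documented behavior of illumina.py.

```python
import os
import tarfile

WANTED = ("RunInfo.xml", "SampleSheet.csv")


def extract_metadata_from_tarball(tarball, out_runinfo, out_samplesheet):
    """Copy RunInfo.xml and SampleSheet.csv out of a (possibly gzipped) tarball.

    Illustrative only: the first regular file with a matching base name wins,
    regardless of how deeply it is nested in the archive.
    """
    targets = dict(zip(WANTED, (out_runinfo, out_samplesheet)))
    found = set()
    # "r:*" lets tarfile detect the compression (none, gz, bz2, xz).
    with tarfile.open(tarball, "r:*") as tar:
        for member in tar:
            name = os.path.basename(member.name)
            if member.isfile() and name in targets and name not in found:
                with tar.extractfile(member) as src, open(targets[name], "wb") as dst:
                    dst.write(src.read())
                found.add(name)
            if len(found) == len(WANTED):
                break  # Stop early; BCL archives can be very large.
    missing = sorted(set(WANTED) - found)
    if missing:
        raise FileNotFoundError("not found in archive: " + ", ".join(missing))
```

A call such as extract_metadata_from_tarball("run.tar.gz", "RunInfo.xml", "SampleSheet.csv") mirrors the flowcell, outRunInfo, and outSampleSheet positional arguments; the file names in that call are placeholders.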