2.1. illumina.py - for raw Illumina outputs

Utilities for demultiplexing Illumina data.

usage: illumina.py subcommand

2.1.1. subcommands

Possible choices: illumina_metadata, splitcode_demux_fastqs, illumina_demux, flowcell_metadata, lane_metrics, common_barcodes, guess_barcodes, miseq_fastq_to_bam, extract_fc_metadata, splitcode_demux, merge_demux_metrics

2.1.2. Sub-commands

2.1.2.1. illumina_metadata

Generate metadata JSON files from Illumina run files without processing reads.

This function extracts metadata from RunInfo.xml and SampleSheet, generating standardized JSON output files. It is designed to be run once per sequencing run to create metadata that’s shared across all parallel demux jobs.
Args:
    runinfo (str): Path to RunInfo.xml file.
    samplesheet (str): Path to Illumina SampleSheet.csv file.
    lane (int, optional): Lane number to process. If not specified:
        - All lanes from the samplesheet will be processed.
        - Output run_info.json will use lane="0".
        - Sample metadata will preserve lane values from the samplesheet if present, else use "0".
    sequencing_center (str, optional): Sequencing center name. If not provided, it will be derived from the instrument ID in RunInfo.xml.
    append_run_id (bool, optional): If True, append flowcell ID and lane to library IDs in the format {sample}.l{library_id}.{flowcell}.{lane}. This matches the behavior of illumina_demux and splitcode_demux_fastqs when using --append_run_id, ensuring metadata 'run' fields match BAM filenames. Default: False.
    out_runinfo (str, optional): Output path for run_info.json.
    out_meta_by_sample (str, optional): Output path for meta_by_sample.json.
    out_meta_by_filename (str, optional): Output path for meta_by_filename.json.

Returns:
    dict: Dictionary containing paths to generated files.

Raises:
    FileNotFoundError: If runinfo or samplesheet files don't exist.
    IOError: If there are issues reading input files or writing output files.

Example:
>>> illumina_metadata(
...     runinfo='RunInfo.xml',
...     samplesheet='SampleSheet.csv',
...     lane=1,
...     out_runinfo='run_info.json',
...     out_meta_by_sample='meta_by_sample.json',
...     out_meta_by_filename='meta_by_filename.json'
... )

illumina.py illumina_metadata [-h] --runinfo RUNINFO
                              [--samplesheet SAMPLESHEET] [--lane LANE]
                              [--sequencing_center SEQUENCING_CENTER]
                              [--append_run_id] [--out_runinfo OUT_RUNINFO]
                              [--out_meta_by_sample OUT_META_BY_SAMPLE]
                              [--out_meta_by_filename OUT_META_BY_FILENAME]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
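
When this subcommand is driven from a Python script, the argument list can be assembled before it is handed to subprocess. The following is a minimal sketch, assuming illumina.py is on the PATH; the helper name build_illumina_metadata_cmd is hypothetical and only flags shown in the usage above are emitted.

```python
import shlex


def build_illumina_metadata_cmd(runinfo, samplesheet=None, lane=None,
                                append_run_id=False, out_runinfo=None,
                                out_meta_by_sample=None,
                                out_meta_by_filename=None):
    """Assemble an argv list for `illumina.py illumina_metadata`.

    Hypothetical helper: it only mirrors the flags in the usage synopsis.
    """
    cmd = ["illumina.py", "illumina_metadata", "--runinfo", runinfo]
    optional = [
        ("--samplesheet", samplesheet),
        ("--lane", lane),
        ("--out_runinfo", out_runinfo),
        ("--out_meta_by_sample", out_meta_by_sample),
        ("--out_meta_by_filename", out_meta_by_filename),
    ]
    for flag, value in optional:
        if value is not None:  # omit flags the caller did not set
            cmd += [flag, str(value)]
    if append_run_id:  # boolean switch, takes no value
        cmd.append("--append_run_id")
    return cmd


cmd = build_illumina_metadata_cmd(
    "RunInfo.xml", samplesheet="SampleSheet.csv", lane=1,
    out_meta_by_filename="meta_by_filename.json", append_run_id=True)
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a shell string avoids quoting problems when paths contain spaces.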

2.1.2.1.1. Named Arguments

--runinfo

Path to RunInfo.xml file

--samplesheet

Path to Illumina SampleSheet.csv file (required for --out_meta_by_sample and --out_meta_by_filename)

--lane

Lane number to process (optional; if not specified, processes all lanes and uses default value of 0 in outputs)

--sequencing_center

Sequencing center name (default: derived from instrument ID in RunInfo.xml)

--append_run_id

Append flowcell ID and lane to library IDs (e.g., sample.l1.FLOWCELL.1) for SRA compatibility

Default: False
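
The naming pattern behind --append_run_id can be illustrated with a small formatter. This is only a sketch of the documented pattern {sample}.l{library_id}.{flowcell}.{lane}; the function name format_run_id is hypothetical, and the non-appended form {sample}.l{library_id} is assumed from the BAM naming described for splitcode_demux_fastqs.

```python
def format_run_id(sample, library_id, flowcell=None, lane=None,
                  append_run_id=False):
    """Return a library/run ID following the documented naming pattern.

    Hypothetical helper, not the tool's own implementation.
    """
    base = f"{sample}.l{library_id}"
    if append_run_id:
        # Flowcell and lane are appended so IDs stay unique across runs.
        return f"{base}.{flowcell}.{lane}"
    return base


print(format_run_id("sample", 1, "FLOWCELL", 1, append_run_id=True))
print(format_run_id("sample", 1))
```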

--out_runinfo

Output path for run_info.json

--out_meta_by_sample

Output path for meta_by_sample.json (sample metadata indexed by sample name)

--out_meta_by_filename

Output path for meta_by_filename.json (sample metadata indexed by library ID)

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.2. splitcode_demux_fastqs

Simplified splitcode demultiplexing from paired DRAGEN FASTQ files.

This function performs 3-barcode demultiplexing directly from paired FASTQ files. It’s designed to run in parallel across multiple FASTQ pairs.

For samples with empty barcode_3 (2-barcode samples), it skips splitcode and performs direct FASTQ → BAM conversion.

Picard FastqToSam conversions are parallelized using ProcessPoolExecutor with the number of workers controlled by the ‘threads’ parameter. For optimal performance with N samples, use threads=min(N, num_cpus).
Args:
    fastq_r1 (str): Path to R1 FASTQ file.
    fastq_r2 (str): Path to R2 FASTQ file.
    samplesheet (str): Path to custom 3-barcode samplesheet (TSV format).
    outdir (str): Output directory for BAM files and metrics.
    runinfo (str): Path to RunInfo.xml (optional, but recommended for richer BAM metadata including RUN_DATE, SEQUENCING_CENTER, and PLATFORM_UNIT).
    sequencing_center (str): Sequencing center name (default: None, uses runinfo.get_machine() if available).
    flowcell_id (str): Override flowcell ID (default: None, extracted from FASTQ filename or RunInfo.xml).
    run_date (str): Override run date in YYYY-MM-DD format (default: None, read from RunInfo.xml).
    append_run_id (bool): If True, output BAM filenames will include flowcell ID and lane in the format {sample}.l{library_id}.{flowcell}.{lane}.bam. If False (default): {sample}.l{library_id}.bam.
    max_hamming_dist (int): Maximum Hamming distance for barcode matching (default: 1).
    r1_trim_bp_right_of_barcode (int): Additional bp to trim from R1 after barcode.
    r2_trim_bp_left_of_barcode (int): Additional bp to trim from R2 before barcode.
    predemux_r1_trim_5prime_num_bp (int): Trim from R1 5' end before demux.
    predemux_r1_trim_3prime_num_bp (int): Trim from R1 3' end before demux.
    predemux_r2_trim_5prime_num_bp (int): Trim from R2 5' end before demux.
    predemux_r2_trim_3prime_num_bp (int): Trim from R2 3' end before demux.
    threads (int): Number of threads for splitcode and parallel Picard conversions (default: auto-detect).
    picard_jvm_memory (str): JVM memory allocation per Picard worker (default: '2g').
    out_meta_by_sample (str, optional): Output path for meta_by_sample.json. Contains sample metadata keyed by sample name.
    out_meta_by_filename (str, optional): Output path for meta_by_filename.json. Contains sample metadata keyed by 'run' field (matches BAM basenames when append_run_id=True). This is the critical output for fixing metadata/BAM filename mismatches in Terra table insertion.
    max_barcode_mismatches (int): Maximum allowed non-N mismatches per outer barcode when matching FASTQ barcodes to samplesheet (default: 1). DRAGEN tolerates index mismatches during demux, so FASTQ reads may carry barcodes that differ by 1-2 bases from the samplesheet.
    num_reads_for_barcode (int): Number of reads to examine from the FASTQ to form a consensus barcode sequence (default: 10). Using multiple reads avoids relying on a single read that may have a mismatched index.

Raises:
    FileNotFoundError: If FASTQ or samplesheet files don't exist.
    ValueError: If samplesheet format is invalid.

Example:
>>> splitcode_demux_fastqs(
...     fastq_r1='Pool1_R1.fastq.gz',
...     fastq_r2='Pool1_R2.fastq.gz',
...     samplesheet='samples_3bc.tsv',
...     outdir='demux_out'
... )
Outputs:

Per-sample unaligned BAMs (zero to many, depending on samplesheet matches)

demux_metrics.json: Read counts per sample (or reason for no output)

demux_metrics_picard-style.txt: Picard-style metrics

Note:

If the input FASTQ is empty or its barcodes don’t match the samplesheet, zero BAMs are produced but metrics files are still generated to document the reason (demux_type field).

This simplified function does NOT generate barcodes_common.txt or barcodes_outliers.txt. For comprehensive barcode reporting, use illumina_demux or splitcode_demux instead.
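
The threads=min(N, num_cpus) recommendation given in the description can be computed as follows. This is a minimal sketch; pick_threads is a hypothetical helper and is not part of illumina.py.

```python
import os


def pick_threads(num_samples, num_cpus=None):
    """Return min(num_samples, num_cpus), but never less than 1."""
    if num_cpus is None:
        # os.cpu_count() may return None on unusual platforms.
        num_cpus = os.cpu_count() or 1
    return max(1, min(num_samples, num_cpus))


print(pick_threads(3, num_cpus=8))
print(pick_threads(16, num_cpus=8))
```

Using more workers than samples yields no additional parallel Picard conversions, so the sample count acts as an upper bound.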

illumina.py splitcode_demux_fastqs [-h] --fastq_r1 FASTQ_R1 --fastq_r2
                                   FASTQ_R2 --samplesheet SAMPLESHEET --outdir
                                   OUTDIR [--runinfo RUNINFO]
                                   [--sequencing_center SEQUENCING_CENTER]
                                   [--flowcell_id FLOWCELL_ID]
                                   [--run_date RUN_DATE] [--append_run_id]
                                   [--picard_jvm_memory PICARD_JVM_MEMORY]
                                   [--out_meta_by_sample OUT_META_BY_SAMPLE]
                                   [--out_meta_by_filename OUT_META_BY_FILENAME]
                                   [--max_barcode_mismatches MAX_BARCODE_MISMATCHES]
                                   [--num_reads_for_barcode NUM_READS_FOR_BARCODE]
                                   [--threads THREADS]
                                   [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                   [--version] [--tmp_dir TMP_DIR]
                                   [--tmp_dirKeep]

2.1.2.2.1. Named Arguments

--fastq_r1

Path to R1 FASTQ file

--fastq_r2

Path to R2 FASTQ file

--samplesheet

Path to custom 3-barcode samplesheet (TSV format)

--outdir

Output directory for BAM files and metrics

--runinfo

Path to RunInfo.xml (optional, for richer BAM metadata)

--sequencing_center

Sequencing center name (default: None, uses runinfo.get_machine() if RunInfo.xml provided)

--flowcell_id

Override flowcell ID (default: extracted from FASTQ filename or RunInfo.xml)

--run_date

Override run date in YYYY-MM-DD format (default: read from RunInfo.xml)

--append_run_id

Append flowcell ID and lane to output BAM filenames for SRA compatibility

Default: False

--picard_jvm_memory

JVM memory allocation per Picard FastqToSam worker (default: ‘2g’)

Default: '2g'

--out_meta_by_sample

Output path for meta_by_sample.json (sample metadata indexed by sample name)

--out_meta_by_filename

Output path for meta_by_filename.json (sample metadata indexed by run/library ID, matches BAM basenames)

--max_barcode_mismatches

Maximum allowed non-N mismatches per outer barcode when matching FASTQ barcodes to samplesheet. DRAGEN tolerates index mismatches, so FASTQ reads may carry barcodes that differ from the samplesheet. (default: 1)

Default: 1
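
Counting "non-N mismatches" can be sketched as below. The exact rule used by the tool is not given here; this example assumes that a position is ignored when either sequence has an N there, and both function names are hypothetical.

```python
def count_non_n_mismatches(observed, expected):
    """Count differing positions, skipping any position where either base is N."""
    if len(observed) != len(expected):
        raise ValueError("barcodes must have equal length")
    return sum(
        1
        for o, e in zip(observed.upper(), expected.upper())
        if o != e and "N" not in (o, e)
    )


def barcode_matches(observed, expected, max_mismatches=1):
    """True when the observed barcode is within the allowed mismatch budget."""
    return count_non_n_mismatches(observed, expected) <= max_mismatches


print(count_non_n_mismatches("ACGT", "ACGA"))
print(barcode_matches("ACNT", "ACGT"))
```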

--num_reads_for_barcode

Number of reads to examine from the FASTQ to form a consensus barcode sequence. Using multiple reads avoids relying on a single read that may have a mismatched index. (default: 10)

Default: 10
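
One way to form a consensus from several reads is a majority vote over the index field of the first few FASTQ headers. The sketch below assumes Illumina-style headers whose last colon-separated field is the index sequence, and it votes on whole index strings; the tool's actual consensus method may differ, and consensus_barcode is a hypothetical name.

```python
import io
from collections import Counter
from itertools import islice


def consensus_barcode(fastq_handle, num_reads=10):
    """Return the most common index string among the first num_reads headers."""
    counts = Counter()
    # In a 4-line-per-record FASTQ, every 4th line starting at 0 is a header.
    headers = islice(fastq_handle, 0, None, 4)
    for header in islice(headers, num_reads):
        # e.g. "@inst:run:fc:lane:tile:x:y 1:N:0:INDEX1+INDEX2"
        counts[header.strip().rsplit(":", 1)[-1]] += 1
    if not counts:
        return None  # empty input
    return counts.most_common(1)[0][0]


example = io.StringIO(
    "@M1:1:FC:1:1:1:1 1:N:0:ACGT+TTAA\nAAAA\n+\nIIII\n"
    "@M1:1:FC:1:1:1:2 1:N:0:ACGT+TTAA\nCCCC\n+\nIIII\n"
    "@M1:1:FC:1:1:1:3 1:N:0:ACGA+TTAA\nGGGG\n+\nIIII\n"
)
print(consensus_barcode(example, num_reads=3))
```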

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.3. illumina_demux

Read Illumina runs and produce BAM files, demultiplexing to one BAM per sample; for simplex runs, a single BAM bearing the flowcell ID is produced. Wraps together Picard’s ExtractIlluminaBarcodes (for multiplexed samples) and IlluminaBasecallsToSam while handling the various required input formats. The input may be an Illumina BCL directory or a tar.gz archive of one.

illumina.py illumina_demux [-h] [--outMetrics OUTMETRICS]
                           [--commonBarcodes COMMONBARCODES]
                           [--max_barcodes MAX_BARCODES]
                           [--sampleSheet SAMPLESHEET] [--runInfo RUNINFO]
                           [--flowcell FLOWCELL]
                           [--read_structure READ_STRUCTURE] [--append_run_id]
                           [--collapse_duplicated_barcodes [COLLAPSE_DUPLICATED_BARCODES]]
                           [--rev_comp_barcodes_before_demux [REV_COMP_BARCODES_BEFORE_DEMUX ...]]
                           [--out_meta_by_sample OUT_META_BY_SAMPLE]
                           [--out_meta_by_filename OUT_META_BY_FILENAME]
                           [--out_runinfo OUT_RUNINFO]
                           [--max_mismatches MAX_MISMATCHES]
                           [--minimum_base_quality MINIMUM_BASE_QUALITY]
                           [--min_mismatch_delta MIN_MISMATCH_DELTA]
                           [--max_no_calls MAX_NO_CALLS]
                           [--minimum_quality MINIMUM_QUALITY]
                           [--compress_outputs COMPRESS_OUTPUTS]
                           [--sequencing_center SEQUENCING_CENTER]
                           [--adapters_to_check [ADAPTERS_TO_CHECK ...]]
                           [--platform PLATFORM]
                           [--max_records_in_ram MAX_RECORDS_IN_RAM]
                           [--apply_eamss_filter APPLY_EAMSS_FILTER]
                           [--force_gc FORCE_GC] [--first_tile FIRST_TILE]
                           [--tile_limit TILE_LIMIT]
                           [--include_non_pf_reads INCLUDE_NON_PF_READS]
                           [--run_start_date RUN_START_DATE]
                           [--read_group_id READ_GROUP_ID]
                           [--compression_level COMPRESSION_LEVEL]
                           [--sort SORT] [--JVMmemory JVMMEMORY]
                           [--threads THREADS]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           inDir lane outDir

2.1.2.3.1. Positional Arguments

inDir: Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.
lane: Lane number.
outDir: Output directory for BAM files.

2.1.2.3.2. Named Arguments

--outMetrics

Output ExtractIlluminaBarcodes metrics file. Default is to dump to a temp file.

--commonBarcodes

Write a TSV report of all barcode counts, in descending order. Only applicable for read structures containing “B”.

--max_barcodes

Cap the commonBarcodes report length to this size (default: 10000)

Default: 10000

--sampleSheet

Override SampleSheet. Input TSV or CSV file with a header and four named columns: barcode_name, library_name, barcode_sequence_1, barcode_sequence_2. Default is to look for a SampleSheet.csv in the inDir.

--runInfo

Override RunInfo. Input XML file. Default is to look for a RunInfo.xml file in the inDir.

--flowcell

Override flowcell ID (default: read from RunInfo.xml).

--read_structure

Override read structure (default: read from RunInfo.xml).

--append_run_id

If specified, output filenames will include the flowcell ID and lane number.

Default: False

--collapse_duplicated_barcodes

If specified, reads from samples with duplicated barcodes or barcode pairs will be collapsed into a single output for each distinct barcode (or distinct barcode pair). Intended for protocols allowing additional demultiplexing downstream by other means (e.g., breaking out samples based on a third, inner barcode, added via swift-seq). If not specified, an error will be raised if duplicated barcodes (or barcode pairs) are present in the sample sheet. If a value is specified, it will be used as the path at which to store the output sample sheet with barcodes collapsed.

Default: False

--rev_comp_barcodes_before_demux

Reverse complement barcodes before demultiplexing.

If specified without setting a value, the “barcode_2” column will be reverse-complemented.
If one or more values are specified, the columns with those names will be reverse-complemented.

(and if not specified, barcodes will not be reverse-complemented)
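
Reverse-complementing named samplesheet columns can be sketched as below. The row representation (a list of dicts keyed by column name) and both function names are assumptions made for illustration; only the default column name barcode_2 comes from the option description.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def rev_comp_columns(rows, columns=("barcode_2",)):
    """Return copies of the rows with the named columns reverse-complemented."""
    return [
        {
            key: reverse_complement(value) if key in columns and value else value
            for key, value in row.items()
        }
        for row in rows
    ]


rows = [{"sample": "s1", "barcode_1": "AACC", "barcode_2": "ACGGT"}]
print(rev_comp_columns(rows))
print(rev_comp_columns(rows, columns=("barcode_1", "barcode_2")))
```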

--out_meta_by_sample

Output json metadata by sample

--out_meta_by_filename

Output json metadata by bam file basename

--out_runinfo

Output json metadata about the run

--max_mismatches

Picard ExtractIlluminaBarcodes MAX_MISMATCHES (default: 0)

Default: 0

--minimum_base_quality

Picard ExtractIlluminaBarcodes MINIMUM_BASE_QUALITY (default: 20)

Default: 20

--min_mismatch_delta

Picard ExtractIlluminaBarcodes MIN_MISMATCH_DELTA (default: None)

--max_no_calls

Picard ExtractIlluminaBarcodes MAX_NO_CALLS (default: None)

--minimum_quality

Picard ExtractIlluminaBarcodes MINIMUM_QUALITY (default: None)

--compress_outputs

Picard ExtractIlluminaBarcodes COMPRESS_OUTPUTS (default: None)

--sequencing_center

Picard IlluminaBasecallsToSam SEQUENCING_CENTER (default: None)

--adapters_to_check

Picard IlluminaBasecallsToSam ADAPTERS_TO_CHECK (default: (‘PAIRED_END’, ‘NEXTERA_V1’, ‘NEXTERA_V2’))

Default: ('PAIRED_END', 'NEXTERA_V1', 'NEXTERA_V2')

--platform

Picard IlluminaBasecallsToSam PLATFORM (default: None)

--max_records_in_ram

Picard IlluminaBasecallsToSam MAX_RECORDS_IN_RAM (default: 2000000)

Default: 2000000

--apply_eamss_filter

Picard IlluminaBasecallsToSam APPLY_EAMSS_FILTER (default: None)

--force_gc

Picard IlluminaBasecallsToSam FORCE_GC (default: None)

--first_tile

Picard IlluminaBasecallsToSam FIRST_TILE (default: None)

--tile_limit

Picard IlluminaBasecallsToSam TILE_LIMIT (default: None)

--include_non_pf_reads

Picard IlluminaBasecallsToSam INCLUDE_NON_PF_READS (default: False)

Default: False

--run_start_date

Picard IlluminaBasecallsToSam RUN_START_DATE (default: None)

--read_group_id

Picard IlluminaBasecallsToSam READ_GROUP_ID (default: None)

--compression_level

Picard IlluminaBasecallsToSam COMPRESSION_LEVEL (default: 7)

Default: 7

--sort

Picard IlluminaBasecallsToSam SORT (default: True)

Default: True

--JVMmemory

JVM virtual memory size (default: ‘7g’)

Default: '7g'

--threads

Number of threads; by default all cores are used

Default: 1

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.4. flowcell_metadata

Writes run metadata to a file

illumina.py flowcell_metadata [-h]
                              (--inDir IN_DIR | --runInfo RUN_INFO | --flowcellID FLOWCELL_ID)
                              [--threads THREADS]
                              [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                              [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                              outMetadataFile

2.1.2.4.1. Positional Arguments

outMetadataFile: path of file to which metadata will be written.

2.1.2.4.2. Named Arguments

--inDir

Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.

--runInfo

RunInfo.xml file.

--flowcellID

flowcell ID (default: read from RunInfo.xml).

--threads

Number of threads; by default all cores are used

Default: 1

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.5. lane_metrics

Write out lane metrics to a tsv file.

illumina.py lane_metrics [-h] [--read_structure READ_STRUCTURE]
                         [--JVMmemory JVMMEMORY]
                         [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                         [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                         inDir outPrefix

2.1.2.5.1. Positional Arguments

inDir: Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.
outPrefix: Prefix path to the *.illumina_lane_metrics and *.illumina_phasing_metrics files.

2.1.2.5.2. Named Arguments

--read_structure

Override read structure (default: read from RunInfo.xml).

--JVMmemory

JVM virtual memory size (default: ‘8g’)

Default: '8g'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.6. common_barcodes

Extract Illumina barcodes for a run and write a TSV report of the barcode counts in descending order

illumina.py common_barcodes [-h] [--truncateToLength TRUNCATETOLENGTH]
                            [--omitHeader] [--includeNoise]
                            [--outMetrics OUTMETRICS]
                            [--sampleSheet SAMPLESHEET] [--flowcell FLOWCELL]
                            [--read_structure READ_STRUCTURE]
                            [--max_mismatches MAX_MISMATCHES]
                            [--minimum_base_quality MINIMUM_BASE_QUALITY]
                            [--min_mismatch_delta MIN_MISMATCH_DELTA]
                            [--max_no_calls MAX_NO_CALLS]
                            [--minimum_quality MINIMUM_QUALITY]
                            [--compress_outputs COMPRESS_OUTPUTS]
                            [--JVMmemory JVMMEMORY] [--threads THREADS]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            inDir lane outSummary

2.1.2.6.1. Positional Arguments

inDir

Illumina BCL directory (or tar.gz of BCL directory). This is the top-level run directory.

lane

Lane number.

outSummary

Path to the summary file (.tsv format). It includes the columns (barcode1, likely_index_name1, barcode2, likely_index_name2, count), where the likely index names are either the exact-match index name for the barcode sequence or the names of indexes at a Hamming distance of 1 from it.
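
The "likely index name" lookup can be sketched as an exact match with a fallback to indexes at Hamming distance 1. The index names and sequences below are made up for illustration, and likely_index_names is a hypothetical helper rather than the tool's own function.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def likely_index_names(barcode, known_indexes):
    """Exact-match names if any exist, otherwise names at Hamming distance 1."""
    exact = [name for name, seq in known_indexes.items() if seq == barcode]
    if exact:
        return exact
    return [
        name
        for name, seq in known_indexes.items()
        if len(seq) == len(barcode) and hamming_distance(seq, barcode) == 1
    ]


# Hypothetical index table: name -> sequence
known = {"idx_A": "ACGTACGT", "idx_B": "ACGTACGA", "idx_C": "TTTTTTTT"}
print(likely_index_names("ACGTACGT", known))
print(likely_index_names("ACGTACGC", known))
```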

2.1.2.6.2. Named Arguments

--truncateToLength

If specified, only this number of barcodes will be returned. Useful if you only want the top N barcodes.

--omitHeader

If specified, a header will not be added to the outSummary tsv file.

Default: False

--includeNoise

If specified, barcodes with periods (“.”) will be included.

Default: False

--outMetrics

Output ExtractIlluminaBarcodes metrics file. Default is to dump to a temp file.

--sampleSheet

Override SampleSheet. Input TSV or CSV file with a header and four named columns: barcode_name, library_name, barcode_sequence_1, barcode_sequence_2. Default is to look for a SampleSheet.csv in the inDir.

--flowcell

Override flowcell ID (default: read from RunInfo.xml).

--read_structure

Override read structure (default: read from RunInfo.xml).

--max_mismatches

Picard ExtractIlluminaBarcodes MAX_MISMATCHES (default: 0)

Default: 0

--minimum_base_quality

Picard ExtractIlluminaBarcodes MINIMUM_BASE_QUALITY (default: 20)

Default: 20

--min_mismatch_delta

Picard ExtractIlluminaBarcodes MIN_MISMATCH_DELTA (default: None)

--max_no_calls

Picard ExtractIlluminaBarcodes MAX_NO_CALLS (default: None)

--minimum_quality

Picard ExtractIlluminaBarcodes MINIMUM_QUALITY (default: None)

--compress_outputs

Picard ExtractIlluminaBarcodes COMPRESS_OUTPUTS (default: None)

--JVMmemory

JVM virtual memory size (default: ‘8g’)

Default: '8g'

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.7. guess_barcodes

Guess the barcode value for a sample name, based on the following:

    1. A list is made of novel barcode pairs seen in the data, but not in the Picard metrics.
    2. For the sample in question, get the most abundant novel barcode pair where one of the barcodes seen in the data matches one of the barcodes in the Picard metrics (partial match).
    3. If there are no partial matches, get the most abundant novel barcode pair.

Limitations:

    - If multiple samples share a barcode with multiple novel barcodes, disentangling them is difficult or impossible.

The names of samples to guess are selected:

    - explicitly by name, passed via argument, OR
    - explicitly by read count threshold, OR
    - automatically (if names or count threshold are omitted) based on basic outlier detection of deviation from an assumed-balanced pool with some number of negative controls
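
The selection rule described above (prefer the most abundant partially matching novel pair, otherwise the most abundant novel pair) can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: a partial match is taken to mean that barcode 1 or barcode 2 equals the expected barcode in the same position, and guess_barcode_pair is a hypothetical name.

```python
def guess_barcode_pair(expected_pair, novel_pair_counts):
    """Pick a guessed (barcode_1, barcode_2) pair for one sample.

    novel_pair_counts maps (bc1, bc2) -> read count for pairs seen in the
    data but absent from the Picard metrics.
    Returns (pair, is_partial_match); pair is None if there are no candidates.
    """
    exp1, exp2 = expected_pair
    partial = {
        pair: count
        for pair, count in novel_pair_counts.items()
        if pair[0] == exp1 or pair[1] == exp2
    }
    pool = partial if partial else novel_pair_counts
    if not pool:
        return None, False
    best = max(pool, key=pool.get)  # most abundant candidate
    return best, bool(partial)


novel = {("AAAA", "CCCC"): 50, ("AAAA", "GGGG"): 500, ("TTTT", "TTTT"): 9000}
print(guess_barcode_pair(("AAAA", "ACGT"), novel))
print(guess_barcode_pair(("CCCC", "AAAA"), novel))
```

In the first call the partial matches win even though an unrelated pair has more reads; the second call has no partial match, so it falls back to the most abundant novel pair overall.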

illumina.py guess_barcodes [-h]
                           [--readcount_threshold READCOUNT_THRESHOLD | --sample_names [SAMPLE_NAMES ...]]
                           [--outlier_threshold OUTLIER_THRESHOLD]
                           [--expected_assigned_fraction EXPECTED_ASSIGNED_FRACTION]
                           [--number_of_negative_controls NUMBER_OF_NEGATIVE_CONTROLS | --neg_control_prefixes NEG_CONTROL_PREFIXES [NEG_CONTROL_PREFIXES ...]]
                           [--rows_limit ROWS_LIMIT]
                           [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                           [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                           in_barcodes in_picard_metrics out_summary_tsv

2.1.2.7.1. Positional Arguments

in_barcodes

The barcode counts file produced by common_barcodes.

in_picard_metrics

The demultiplexing read metrics produced by Picard.

out_summary_tsv

Path to the summary file (.tsv format). It includes the columns (sample_name, expected_barcode_1, expected_barcode_2, expected_barcode_1_name, expected_barcode_2_name, expected_barcodes_read_count, guessed_barcode_1, guessed_barcode_2, guessed_barcode_1_name, guessed_barcode_2_name, guessed_barcodes_read_count, match_type), where the expected values are those used by Picard during demultiplexing and the guessed values are based on the barcodes seen among the data.

2.1.2.7.2. Named Arguments

--readcount_threshold

If specified, guess barcodes for samples with fewer than this many reads.

--sample_names

If specified, only guess barcodes for these sample names.

--outlier_threshold

threshold of how far from unbalanced a sample must be to be considered an outlier.

Default: 0.775

--expected_assigned_fraction

The fraction of reads expected to be assigned. An exception is raised if fewer than this fraction are assigned.

Default: 0.7

--number_of_negative_controls

If specified, the number of negative controls in the pool, for calculating expected number of reads in the rest of the pool.

--neg_control_prefixes

If specified, the sample name prefixes assumed for counting negative controls. Case-insensitive.

Default: ['neg', 'water', 'NTC', 'H2O']
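
Case-insensitive prefix matching for negative controls can be sketched as below, using the default prefixes listed above. The function names are hypothetical and the sample names are made up.

```python
DEFAULT_NEG_CONTROL_PREFIXES = ("neg", "water", "NTC", "H2O")


def is_negative_control(sample_name, prefixes=DEFAULT_NEG_CONTROL_PREFIXES):
    """True if the sample name starts with any prefix, ignoring case."""
    lowered = tuple(prefix.lower() for prefix in prefixes)
    # str.startswith accepts a tuple of candidate prefixes.
    return sample_name.lower().startswith(lowered)


def count_negative_controls(sample_names, prefixes=DEFAULT_NEG_CONTROL_PREFIXES):
    """Count how many sample names look like negative controls."""
    return sum(is_negative_control(name, prefixes) for name in sample_names)


names = ["NEG_ctrl1", "Water-2", "ntc3", "sample_A", "h2o_blank"]
print(count_negative_controls(names))
```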

--rows_limit

The number of rows to use from the in_barcodes.

Default: 1000

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.8. miseq_fastq_to_bam

Convert FASTQ read files to a single BAM file. FASTQ file names must conform to patterns emitted by MiSeq machines. Sample metadata must be provided in a SampleSheet.csv that corresponds to the FASTQ filename. Specifically, the _S##_ index in the FASTQ file name will be used to find the corresponding row in the SampleSheet.

illumina.py miseq_fastq_to_bam [-h] [--inFastq2 INFASTQ2] [--runInfo RUNINFO]
                               [--sequencing_center SEQUENCING_CENTER]
                               [--JVMmemory JVMMEMORY]
                               [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                               [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                               outBam sampleSheet inFastq1
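
Extracting the _S##_ index from a MiSeq-style FASTQ name can be sketched with a regular expression. The pattern below assumes the common Illumina naming scheme {name}_S{n}_L{lane}_R{read}_001.fastq.gz, and the row lookup assumes that S1 refers to the first data row of the sample sheet; both are assumptions, and the helper names are hypothetical.

```python
import os
import re

# Assumed Illumina naming: <name>_S<n>_L<lane>_R<1|2>_<chunk>.fastq[.gz]
_MISEQ_NAME = re.compile(r"_S(\d+)_L\d{3}_R[12]_\d{3}\.fastq(?:\.gz)?$")


def sample_index_from_fastq(path):
    """Return the integer from the _S##_ field of a MiSeq FASTQ filename."""
    match = _MISEQ_NAME.search(os.path.basename(path))
    if match is None:
        raise ValueError(f"not a MiSeq-style FASTQ name: {path}")
    return int(match.group(1))


def samplesheet_row_for(path, rows):
    """Look up the sample sheet row, assuming S1 is the first data row."""
    return rows[sample_index_from_fastq(path) - 1]


rows = [{"Sample_ID": "alpha"}, {"Sample_ID": "beta"}]
print(sample_index_from_fastq("run/Sample-A_S12_L001_R1_001.fastq.gz"))
print(samplesheet_row_for("beta_S2_L001_R1_001.fastq.gz", rows))
```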

2.1.2.8.1. Positional Arguments

outBam: Output BAM file.
sampleSheet: Input SampleSheet.csv file.
inFastq1: Input fastq file; 1st end of paired-end reads if paired.

2.1.2.8.2. Named Arguments

--inFastq2

Input fastq file; 2nd end of paired-end reads.

--runInfo

Input RunInfo.xml file.

--sequencing_center

Name of your sequencing center (default is the sequencing machine ID from the RunInfo.xml)

--JVMmemory

JVM virtual memory size (default: ‘4g’)

Default: '4g'

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.9. extract_fc_metadata

Extract RunInfo.xml and SampleSheet.csv from the provided Illumina directory

illumina.py extract_fc_metadata [-h]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                flowcell outRunInfo outSampleSheet

2.1.2.9.1. Positional Arguments

flowcell: Illumina directory (possibly tarball)
outRunInfo: Output RunInfo.xml file.
outSampleSheet: Output SampleSheet.csv file.

2.1.2.9.2. Named Arguments

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.10. splitcode_demux

Main function to call splitcode_demux from the command line.

illumina.py splitcode_demux [-h] [--sampleSheet SAMPLESHEET]
                            [--illuminaRunDirectory ILLUMINA_RUN_DIRECTORY]
                            [--unmatched_name UNMATCHED_NAME]
                            [--max_hamming_dist MAX_HAMMING_DIST]
                            [--runInfo RUNINFO]
                            [--platform_name PLATFORM_NAME]
                            [--flowcell FLOWCELL] [--run_date RUN_DATE]
                            [--sequencing_center SEQUENCING_CENTER]
                            [--rev_comp_barcodes_before_demux [REV_COMP_BARCODES_BEFORE_DEMUX ...]]
                            [--predemux_trim_r1_5prime [PREDEMUX_R1_TRIM_5PRIME_NUM_BP]]
                            [--predemux_trim_r1_3prime [PREDEMUX_R1_TRIM_3PRIME_NUM_BP]]
                            [--predemux_trim_r2_5prime [PREDEMUX_R2_TRIM_5PRIME_NUM_BP]]
                            [--predemux_trim_r2_3prime [PREDEMUX_R2_TRIM_3PRIME_NUM_BP]]
                            [--trim_r1_right_of_barcode [R1_TRIM_BP_RIGHT_OF_BARCODE]]
                            [--trim_r2_left_of_barcode [R2_TRIM_BP_LEFT_OF_BARCODE]]
                            [--out_meta_by_sample OUT_META_BY_SAMPLE]
                            [--out_meta_by_filename OUT_META_BY_FILENAME]
                            [--out_runinfo OUT_RUNINFO]
                            [--JVMmemory JVM_MEMORY] [--threads THREADS]
                            [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                            [--version] [--tmp_dir TMP_DIR] [--tmp_dirKeep]
                            inDir lane outDir

2.1.2.10.1. Positional Arguments

inDir: File path to folder containing gzipped FASTQ files.
lane: Lane number (used to populate BAM header). Default: read from RunInfo.xml.
outDir: Output directory for BAM files and other output files.
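As a rough illustration of the argument order in the synopsis above, the following sketch assembles a command line in Python. The helper name `build_splitcode_demux_cmd` and all paths and values are hypothetical; only the flag names and the positional order (inDir, lane, outDir) come from this page.

```python
import shlex


def build_splitcode_demux_cmd(in_dir, lane, out_dir, sample_sheet=None,
                              max_hamming_dist=None, threads=None):
    """Assemble an argv list for `illumina.py splitcode_demux` (hypothetical helper)."""
    cmd = ["illumina.py", "splitcode_demux"]
    # Optional named arguments are only added when a value was supplied.
    if sample_sheet is not None:
        cmd += ["--sampleSheet", sample_sheet]
    if max_hamming_dist is not None:
        cmd += ["--max_hamming_dist", str(max_hamming_dist)]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    # Positional arguments go last, in the documented order: inDir lane outDir.
    cmd += [in_dir, str(lane), out_dir]
    return cmd


# Illustrative values only; these paths are placeholders.
cmd = build_splitcode_demux_cmd(
    "fastqs/", 1, "demux_out/",
    sample_sheet="SampleSheet.tsv",
    max_hamming_dist=1,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where `illumina.py` is on the PATH.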

2.1.2.10.2. Named Arguments

--sampleSheet

Override SampleSheet. Input tab-delimited or CSV file with columns: sample, library_id_per_sample, I7_Index_ID, barcode_1, I5_Index_ID, barcode_2, Inline_Index_ID, barcode_3. Default is to look for a SampleSheet.csv in the inDir.
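To make the expected column layout concrete, here is a minimal sketch that reads such a sheet with the standard `csv` module. It is not the tool's own parser. Two assumptions are made for illustration: that a tab in the header line means the file is tab-delimited, and that all eight columns are required. The sample row values are invented.

```python
import csv
import io

# Column names as listed for --sampleSheet.
EXPECTED_COLUMNS = [
    "sample", "library_id_per_sample", "I7_Index_ID", "barcode_1",
    "I5_Index_ID", "barcode_2", "Inline_Index_ID", "barcode_3",
]


def read_sample_sheet(handle):
    """Read a tab- or comma-delimited sample sheet into a list of dicts."""
    text = handle.read()
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    # Assumption: a tab in the header line means the file is tab-delimited.
    delimiter = "\t" if "\t" in first_line else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    fieldnames = reader.fieldnames or []
    # Assumption: every listed column must be present.
    missing = [col for col in EXPECTED_COLUMNS if col not in fieldnames]
    if missing:
        raise ValueError(f"sample sheet is missing columns: {missing}")
    return list(reader)


# Invented example row, tab-delimited.
example = io.StringIO(
    "sample\tlibrary_id_per_sample\tI7_Index_ID\tbarcode_1\t"
    "I5_Index_ID\tbarcode_2\tInline_Index_ID\tbarcode_3\n"
    "sampleA\t1\tN701\tTAAGGCGA\tS502\tCTCTCTAT\tIL01\tACGTACGT\n"
)
rows = read_sample_sheet(example)
print(rows[0]["sample"], rows[0]["barcode_3"])
```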

--illuminaRunDirectory

Path to Illumina Run Directory. Default is to look for a RunInfo.xml file in the inDir.

--unmatched_name

ID for reads that don’t match an inline barcode. Default: ‘unmatched’.

Default: 'unmatched'

--max_hamming_dist

Max allowed Hamming distance for inline barcode matching. Default: 1.

Default: 1
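The effect of a Hamming-distance threshold can be sketched in a few lines of Python. This is an illustration of the concept only, not the matching logic used by splitcode or illumina.py. In particular, sending ambiguous reads (more than one barcode within the threshold) to the unmatched bin is an assumption, and the sample names and barcodes are invented.

```python
def hamming_distance(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def match_inline_barcode(observed, barcodes, max_hamming_dist=1,
                         unmatched_name="unmatched"):
    """Return the sample whose barcode is within the threshold, else unmatched_name.

    Assumption: if several barcodes are within the threshold the read is
    treated as unmatched rather than assigned arbitrarily.
    """
    hits = [
        name for name, barcode in barcodes.items()
        if hamming_distance(observed, barcode) <= max_hamming_dist
    ]
    return hits[0] if len(hits) == 1 else unmatched_name


# Invented barcodes for illustration.
barcodes = {"sampleA": "ACGTACGT", "sampleB": "TTGCAAGC"}
one_mismatch = match_inline_barcode("ACGTACGA", barcodes)
two_mismatches = match_inline_barcode("ACGTAGGA", barcodes)
print(one_mismatch, two_mismatches)
```

With the default threshold of 1, a read carrying a single mismatch is still assigned, while a read with two mismatches falls through to the unmatched bin.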

--runInfo

Override RunInfo. Input XML file. Default is to look for a RunInfo.xml file in the inDir.

--platform_name

Platform name (used to populate BAM header). Default: read from RunInfo.xml.

--flowcell

Flowcell ID (used to populate BAM header). Default: read from RunInfo.xml.

--run_date

Run date (used to populate BAM header). Default: read from RunInfo.xml.

--sequencing_center

Sequencing center (used to populate BAM header). Default: read from RunInfo.xml.

--rev_comp_barcodes_before_demux

Reverse complement barcodes before demultiplexing.

If specified without a value, the “barcode_2” column will be reverse-complemented.
If one or more values are specified, the columns with those names will be reverse-complemented.

If not specified, barcodes will not be reverse-complemented.
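The column-selection behavior described above can be sketched as follows. The function names are hypothetical and this is not the tool's implementation. It assumes sample sheet rows are held as dicts keyed by column name, and that the default column is barcode_2 as stated for the no-value case.

```python
# Translation table for DNA complements (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def rev_comp_columns(rows, columns=("barcode_2",)):
    """Return copies of the rows with the named barcode columns reverse-complemented."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        for col in columns:
            if new_row.get(col):
                new_row[col] = reverse_complement(new_row[col])
        out.append(new_row)
    return out


# Invented barcodes for illustration.
sheet = [{"sample": "sampleA", "barcode_1": "TAAGGCGA", "barcode_2": "CTCTCTAT"}]
default_rc = rev_comp_columns(sheet)
both_rc = rev_comp_columns(sheet, columns=("barcode_1", "barcode_2"))
print(default_rc[0]["barcode_2"], both_rc[0]["barcode_1"])
```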

--predemux_trim_r1_5prime

number of bases to trim from the 5’ end of read 1 (before demux)

--predemux_trim_r1_3prime

number of bases to trim from the 3’ end of read 1 (before demux)

--predemux_trim_r2_5prime

number of bases to trim from the 5’ end of read 2 (before demux)

--predemux_trim_r2_3prime

number of bases to trim from the 3’ end of read 2 (before demux)
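As an illustration of what trimming a fixed number of bases from each end of a read means, the sketch below slices a sequence and its quality string together. The function name `trim_read` is hypothetical, and the tool may perform this step differently (for example inside splitcode itself).

```python
def trim_read(seq, qual, trim_5prime=0, trim_3prime=0):
    """Remove bases from the 5' (left) and 3' (right) ends of one read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must be the same length")
    if trim_5prime < 0 or trim_3prime < 0:
        raise ValueError("trim lengths must be non-negative")
    # max() guards against trimming more bases than the read holds,
    # in which case an empty read is returned.
    end = max(len(seq) - trim_3prime, trim_5prime)
    return seq[trim_5prime:end], qual[trim_5prime:end]


# Invented read: drop 2 bases from the 5' end and 3 from the 3' end.
trimmed_seq, trimmed_qual = trim_read("ACGTACGTAC", "IIIIIHHHHH",
                                      trim_5prime=2, trim_3prime=3)
print(trimmed_seq, trimmed_qual)
```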

--trim_r1_right_of_barcode

number of bases to trim after the barcode on the right (3’) side of read 1 (after demux)

--trim_r2_left_of_barcode

number of bases to trim after the barcode on the left (5’) side of read 2 (after demux)

--out_meta_by_sample

Output json metadata by sample

--out_meta_by_filename

Output json metadata by bam file basename

--out_runinfo

Output json metadata about the run

--JVMmemory

JVM virtual memory size (default: ‘4g’)

Default: '4g'

--threads

Number of threads; by default all cores are used

Default: 2

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False

2.1.2.11. merge_demux_metrics

Merge multiple Picard-style demux metrics tab files into a single file.

This function takes multiple tab-delimited metrics files (e.g., from Illumina/Picard demux and splitcode demux) and combines them into a single output file. It preserves comment lines from the first file and combines data rows from all files.

The function expects all input files to have the same column structure (same columns in the same order), and will raise an error if headers don’t match (unless skip_header_merge_check=True).
Args:

input_metrics_files (list): List of paths to input metrics files (TSV format).
Each file should be in Picard-style format with:

- Optional comment lines starting with ‘##’
- A header line with tab-separated column names
- Data rows with tab-separated values

output_metrics_file (str): Path to the output merged metrics file.

skip_header_merge_check (bool): If True, skips validation that all files have
matching headers. Use with caution: mismatched columns will produce invalid output. Default: False

Returns:
str: Path to the output metrics file

Raises:

ValueError: If input_metrics_files is empty or if headers don’t match (when
skip_header_merge_check=False)

FileNotFoundError: If any input file doesn’t exist

Example:
>>> merge_demux_metrics(
...     input_metrics_files=[
...         'illumina_demux_metrics.txt',
...         'splitcode_demux_metrics_picard-style.txt'
...     ],
...     output_metrics_file='merged_demux_metrics.txt'
... )
'merged_demux_metrics.txt'

Output format:
The output file will have:

- Comment lines from the first file (lines starting with ‘##’)
- The header line from the first file
- All data rows from all input files (in order of input_metrics_files)

Notes:

Skips empty lines and whitespace-only lines

Preserves all comment lines from the first file only

The DEMUX_TYPE column (if present) should differentiate rows from different sources

Files are processed in the order provided in input_metrics_files
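The merge rules listed above can be sketched in plain Python. This is a simplified reconstruction from the description on this page, not the actual source of merge_demux_metrics. The function name `merge_metrics_sketch` is hypothetical, and the BARCODE and READS column names in the demo are invented (only DEMUX_TYPE is mentioned here).

```python
import os
import tempfile


def merge_metrics_sketch(input_paths, output_path, skip_header_check=False):
    """Merge Picard-style metrics files following the rules described above."""
    if not input_paths:
        raise ValueError("no input metrics files given")
    comments, header, rows = [], None, []
    for index, path in enumerate(input_paths):
        file_header = None
        # open() raises FileNotFoundError if the path does not exist.
        with open(path) as handle:
            for raw in handle:
                line = raw.rstrip("\n")
                if not line.strip():
                    continue  # skip empty and whitespace-only lines
                if line.startswith("##"):
                    if index == 0:
                        comments.append(line)  # keep comments from first file only
                elif file_header is None:
                    file_header = line  # first non-comment line is the header
                else:
                    rows.append(line)
        if index == 0:
            header = file_header
        elif not skip_header_check and file_header != header:
            raise ValueError(f"header mismatch in {path}")
    with open(output_path, "w") as out:
        for line in comments + ([header] if header else []) + rows:
            out.write(line + "\n")
    return output_path


# Demo with two small invented files.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "first.txt")
    second = os.path.join(tmp, "second.txt")
    with open(first, "w") as fh:
        fh.write("## comment from first file\n\nBARCODE\tREADS\tDEMUX_TYPE\nAAAA\t10\tillumina\n")
    with open(second, "w") as fh:
        fh.write("## comment from second file\nBARCODE\tREADS\tDEMUX_TYPE\nCCCC\t5\tsplitcode\n")
    merged = merge_metrics_sketch([first, second], os.path.join(tmp, "merged.txt"))
    with open(merged) as fh:
        merged_lines = fh.read().splitlines()
print(merged_lines)
```

The comment line from the second file and the blank line in the first are both dropped, and data rows keep the order of the input list.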

illumina.py merge_demux_metrics [-h] [--skip_header_merge_check]
                                [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                [--version] [--tmp_dir TMP_DIR]
                                [--tmp_dirKeep]
                                input_metrics_files [input_metrics_files ...]
                                output_metrics_file

2.1.2.11.1. Positional Arguments

input_metrics_files: Input Picard-style demux metrics files to merge (TSV format). Specify multiple files separated by spaces.
output_metrics_file: Output path for merged demux metrics file

2.1.2.11.2. Named Arguments

--skip_header_merge_check

Skip validation that all input files have matching column headers. Use with caution.

Default: False

--loglevel

Possible choices: DEBUG, INFO, WARNING, ERROR, CRITICAL, EXCEPTION

Verboseness of output. [default: ‘INFO’]

Default: 'INFO'

--version, -V

show program’s version number and exit

--tmp_dir

Base directory for temp files. [default: ‘/tmp’]

Default: '/tmp'

--tmp_dirKeep

Keep the tmp_dir if an exception occurs while running. Default is to delete all temp files at the end, even if there’s a failure.

Default: False