How is the read structure defined for demultiplexing?

Define read structures with the --read-structures option.

Although the index reads (I1 and I2) typically contain the barcode information, many NGS assays deviate from this convention. The Singular Genomics Demultiplex program lets the user define how the demultiplexer processes each segment of each read by specifying a read structure for each input FASTQ file. The read structure is a functional designation of how the program processes each read segment. For example, the 10X barcode and UMI sequences from scRNA runs need to appear in the sequence line of the output FASTQ files, so they should be marked as part of the template rather than as barcode or UMI.

A read structure is composed of a sequence of <integer><operator> pairs. The integer defines the number of bases that the following operator applies to, and the operator defines the purpose of those bases. There are 4 possible operators:

Operator: Description

Template (T): This segment of the read contains the sequence data of experimental interest, such as the sequence of the genomic DNA or RNA. Template data will typically be the primary sequence of the FASTQ files resulting from demultiplexing.

Barcode (B): This segment of the read contains the barcode sequence used to identify the sample. The barcode is used to assign the read to that sample's output FASTQ file. The barcode sequence is also added to the header information of the template sequence in the output FASTQ file.

Molecular Barcode (M): This segment contains the molecular barcode, such as the UMI. The molecular barcode sequence is added to the header information of the template sequence in the resulting FASTQ file.

Skip (S): This segment of the sequence should be skipped. This is for extra bases that should not be output or used for any purpose. For example, if the barcode is the last 6 bp of an 8 bp index read, the read structure should be 2S6B.

The last (or only) segment in a read structure can be designated with a + rather than an integer, meaning that the operator applies to the remainder of the read.
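The grammar described above can be sketched as a small parser. This is an illustrative Python sketch, not the demultiplexer's own implementation: the function name parse_read_structure and the (length, operator) tuple representation are assumptions made for this example, and a length of None stands for a trailing +.

```python
import re
from typing import List, Optional, Tuple

# One segment is (length, operator). A length of None represents "+",
# i.e. "the remainder of the read". This representation is hypothetical.
Segment = Tuple[Optional[int], str]

# An <integer><operator> pair, where the integer may be replaced by "+".
_SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(text: str) -> List[Segment]:
    """Split a read structure such as '9M8B' into (length, operator) pairs."""
    segments: List[Segment] = []
    pos = 0
    while pos < len(text):
        match = _SEGMENT.match(text, pos)
        if match is None:
            raise ValueError(f"invalid read structure at position {pos}: {text!r}")
        length_text, operator = match.groups()
        length = None if length_text == "+" else int(length_text)
        # "+" may only be used for the last (or only) segment.
        if length is None and match.end() != len(text):
            raise ValueError(f"'+' is only allowed on the last segment: {text!r}")
        segments.append((length, operator))
        pos = match.end()
    if not segments:
        raise ValueError("read structure is empty")
    return segments
```

For instance, under these assumptions parse_read_structure("2S6B") gives the pairs (2, "S") and (6, "B"), while a structure such as +T8B is rejected because the + is not on the last segment.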

Below is an example set of read structures and the associated command.

1. Index 1 read (I1) contains 9 bp UMI followed by 8 bp barcode. The read structure is 9M8B.

2. Index 2 read (I2) contains 2 bp extra (skip) followed by 8 bp barcode. The read structure is 2S8B.

3. Insert 1 read (R1) contains 150 bp template. The read structure is 150T or +T.

4. Insert 2 read (R2) contains 140 bp template followed by 10 bp extra (skip). The read structure is 140T10S or 140T+S.
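To make the four read structures above concrete, the following Python sketch shows how a read's bases could be divided among the operators. It only illustrates the bookkeeping and is not how the demultiplexer itself is implemented. The function name split_read and the returned dictionary layout are hypothetical, and ignoring bases beyond a fixed-length structure is an assumption of this sketch.

```python
import re

# Zero or more fixed-length segments, then a last segment that is either
# fixed-length or "+" (the remainder of the read).
_VALID = re.compile(r"(?:\d+[TBMS])*(?:\d+|\+)[TBMS]")


def split_read(bases: str, structure: str) -> dict:
    """Group the bases of one read by operator according to a read structure."""
    if _VALID.fullmatch(structure) is None:
        raise ValueError(f"invalid read structure: {structure!r}")
    segments = {"T": [], "B": [], "M": [], "S": []}
    pos = 0
    for length_text, operator in re.findall(r"(\d+|\+)([TBMS])", structure):
        # "+" consumes everything that is left in the read.
        end = len(bases) if length_text == "+" else pos + int(length_text)
        if end > len(bases):
            raise ValueError("read is shorter than its read structure")
        segments[operator].append(bases[pos:end])
        pos = end
    # Assumption: bases past the end of a fixed-length structure are ignored.
    return segments


# I1 from the example: a 9 bp UMI followed by an 8 bp barcode (made-up bases).
i1 = split_read("ACGTACGTAGGCCTTAA", "9M8B")
# i1["M"] == ["ACGTACGTA"] and i1["B"] == ["GGCCTTAA"]
```

With the R2 structure 140T+S, the same function would place the first 140 bases under T and whatever remains under S.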

The command for demultiplexing with these read structures is the following. Note that the order of the FASTQ files must correspond to the order of the read structures.

sgdemux --fastqs I1.fastq.gz I2.fastq.gz R1.fastq.gz R2.fastq.gz --read-structures 9M8B 2S8B 150T 140T10S --sample-metadata samplesheet.csv --output-dir demux_output/
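Because the FASTQ files and read structures are matched purely by position, one way to avoid a mismatch is to keep each file paired with its structure and generate the argument list from those pairs. The Python sketch below assumes only the options shown in the command above. The helper name build_sgdemux_args is hypothetical, and the sample sheet file name is a placeholder.

```python
from typing import Dict, List


def build_sgdemux_args(
    read_structures: Dict[str, str], sample_sheet: str, output_dir: str
) -> List[str]:
    """Build an sgdemux argument list with FASTQs and structures in the same order."""
    # Dicts preserve insertion order (Python 3.7+), so both lists line up.
    fastqs = list(read_structures.keys())
    structures = list(read_structures.values())
    return [
        "sgdemux",
        "--fastqs", *fastqs,
        "--read-structures", *structures,
        "--sample-metadata", sample_sheet,
        "--output-dir", output_dir,
    ]


args = build_sgdemux_args(
    {
        "I1.fastq.gz": "9M8B",
        "I2.fastq.gz": "2S8B",
        "R1.fastq.gz": "150T",
        "R2.fastq.gz": "140T10S",
    },
    sample_sheet="samplesheet.csv",  # placeholder file name
    output_dir="demux_output/",
)
print(" ".join(args))
```

The resulting list could be passed to subprocess.run(args, check=True) on a system where sgdemux is installed.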

Demultiplexing generates one set of forward (R1) and reverse (R2) FASTQ files for each sample. Reads whose barcodes do not match any of the expected barcode sequences in the sample sheet go into the Undetermined FASTQ file set. The barcode and UMI information is written to the header line of each FASTQ record in each output file.

For more information, see the Demultiplexing Guide for G4 Sequencing Platform, reach out to your Field Application Scientist, or contact Customer Care.

4/18/2024