This article describes the process for demultiplexing data off-instrument.
The output of the G4 Sequencing Platform is FASTQ data that has already been basecalled. The Singular Demultiplexer Software (sgdemux) can demultiplex this FASTQ data and perform other FASTQ file manipulations. Detailed documentation is available from the GitHub Repo. You can compile directly from the source code by following the directions in the README, or download precompiled versions from the Releases Tab on the right-hand side of the repository page. Below is a guide for a few common use cases.
Case 1: Specifying input files directly
The most powerful and flexible way to use sgdemux is by specifying the read structures and input files directly. The order of the read structures must match the order of the input FASTQ files. Suppose you have 4 files with the following read structures:
- Undetermined_S0_L002_I1_001.fastq.gz - 8B - this file contains 8-base-pair reads, each of which is the first barcode (8B).
- Undetermined_S0_L002_I2_001.fastq.gz - 9M8B - the first 9 base pairs of each read in this file correspond to the UMI (9M) and the last 8 base pairs are the second barcode (8B).
- Undetermined_S0_L002_R1_001.fastq.gz - +T - this file contains the template (T). All the base pairs are template (+).
- Undetermined_S0_L002_R2_001.fastq.gz - +T - this file contains the template (T). All the base pairs are template (+).
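To make the 9M8B notation concrete, the sketch below splits one made-up 17-base I2 read the way that read structure describes. It is only an illustration of the notation, not of how sgdemux is implemented, and the read sequence is invented for the example.

```shell
#!/bin/sh
# Illustration only: the sequence below is a made-up 17-base I2 read.
i2_read="ACGTACGTAGGCTTAAC"

# 9M: bases 1-9 are the UMI (molecular barcode).
umi=$(printf '%s' "$i2_read" | cut -c1-9)

# 8B: bases 10-17 are the second sample barcode.
barcode=$(printf '%s' "$i2_read" | cut -c10-17)

echo "UMI=${umi} barcode=${barcode}"
```

For the R1 and R2 files, +T means no split is needed: every base of the read is kept as template.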
This is the command for demultiplexing the FASTQ files and inserting the UMI sequence into the header of the resulting FASTQ file. Each flow cell from the G4 Sequencing Platform generates 4 lanes of data (4 separate FASTQ file data sets). You would need to run the sgdemux command for each lane of FASTQ data separately. Please note that you will need to manually create the output directory first using mkdir <output_dir> if it doesn't exist.
mkdir <output_dir>
sgdemux --read-structures 8B 9M8B +T +T --sample-metadata samplesheet_FC1_Lane2.csv --fastqs unfiltered_fastqs/Undetermined_S0_L002_I1_001.fastq.gz unfiltered_fastqs/Undetermined_S0_L002_I2_001.fastq.gz unfiltered_fastqs/Undetermined_S0_L002_R1_001.fastq.gz unfiltered_fastqs/Undetermined_S0_L002_R2_001.fastq.gz --output-dir <output_dir>/
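Because the command has to be repeated for each lane, a small loop can help. The sketch below only prints the four commands so they can be reviewed before running. It assumes one Sample Data Sheet per lane named samplesheet_FC1_Lane<N>.csv and one output directory per lane named output_lane<N>; both naming patterns are assumptions made for this example, so adjust them to your own files.

```shell
#!/bin/sh
# Print (do not run) the Case 1 command for one lane.
# Assumed, hypothetical names: samplesheet_FC1_Lane<N>.csv and output_lane<N>.
lane_command() {
    lane="$1"
    prefix="unfiltered_fastqs/Undetermined_S0_L00${lane}"
    printf 'sgdemux --read-structures 8B 9M8B +T +T'
    printf ' --sample-metadata samplesheet_FC1_Lane%s.csv' "$lane"
    printf ' --fastqs %s_I1_001.fastq.gz %s_I2_001.fastq.gz' "$prefix" "$prefix"
    printf ' %s_R1_001.fastq.gz %s_R2_001.fastq.gz' "$prefix" "$prefix"
    printf ' --output-dir output_lane%s/\n' "$lane"
}

# A G4 flow cell has 4 lanes; print the commands for each one.
for lane in 1 2 3 4; do
    echo "mkdir -p output_lane${lane}"
    lane_command "$lane"
done
```

Once the printed commands look right, they can be pasted into the terminal or the script's output can be piped to sh.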
Case 2: Demultiplex the entire directory automatically
sgdemux can also autodetect the FASTQ files in the "unfiltered_fastqs" directory of a G4 run output and demultiplex the run based on the file names. It will assume all the base pairs from the index reads (I1 or I2) are barcodes, and all the base pairs from the insert reads (R1 or R2) are the template. This inferred read structure cannot be overridden in the command line or using the Sample Data Sheet. Sample name and barcode combinations from the Sample Data Sheet must also have a one-to-one correspondence even if they are in different lanes (unless you use the --lane option described below), and the FASTQ output from the different lanes is merged into 1 set of FASTQ files for the entire flow cell. An example command is below.
mkdir <output_dir>
sgdemux --fastqs <run_directory>/unfiltered_fastqs/Undetermined --sample-metadata G4_samplesheet.csv --output-dir <output_dir>
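Since autodetection depends on the file names under the given prefix, it can be worth confirming that the prefix matches the files you expect before starting a run. The helper below is a hypothetical convenience (count_fastqs is not part of sgdemux) that counts the .fastq.gz files matching a prefix; the default prefix is only a placeholder.

```shell
#!/bin/sh
# Hypothetical helper, not part of sgdemux: count the .fastq.gz files
# whose path starts with the given prefix.
count_fastqs() {
    ls "$1"*.fastq.gz 2>/dev/null | wc -l | tr -d ' '
}

# Placeholder prefix; pass your own as the first argument to the script.
prefix="${1:-unfiltered_fastqs/Undetermined}"
echo "Found $(count_fastqs "$prefix") FASTQ files matching ${prefix}*.fastq.gz"
```

A count of 0 usually means the prefix or the run directory path is mistyped.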
Case 3: Demultiplex lanes separately but detect files and read structure automatically
Because each G4 flow cell contains 4 fluidically independent lanes, it is feasible to load different sets of samples sharing the same barcodes in different lanes. In this situation, you can specify which lanes to demultiplex using the "--lane" option. For example, suppose lane 3 contains different samples and needs to be demultiplexed independently from the other lanes, while lanes 1, 2, and 4 contain the same samples with the same barcodes, so the resulting demultiplexed reads from those lanes should be combined. The following commands can be run to achieve this goal:
mkdir output_lane3
sgdemux --fastqs <run_directory>/unfiltered_fastqs/Undetermined --sample-metadata G4_samplesheet.csv --output-dir output_lane3 --lane 3
mkdir output_lane124
sgdemux --fastqs <run_directory>/unfiltered_fastqs/Undetermined_S0_ --sample-metadata G4_samplesheet.csv --output-dir output_lane124 --lane 1 2 4
Including parameters in the Sample Data Sheet
You can pass parameters to the Singular Demultiplexer Software by including a [Demux] section in your Sample Data Sheet. Each command line option must be on a separate line. The first column is the option's long name without the leading --. The second column contains the option value, if one is required. When an option is specified in both the command line and the Sample Data Sheet, the Sample Data Sheet value takes precedence. Please check the README in the GitHub Repo for the latest parameters.
Suppose you want to demultiplex the entire G4 output directory, but want to specify the read structures and use only data from lanes 1, 2, and 4. You can include the following options in your Sample Data Sheet (samplesheet.csv):
[Demux]
read-structures,8B 9M8B +T +T
lane,1 2 4
And run the following command:
mkdir <output_dir>
sgdemux --fastqs <run_directory>/unfiltered_fastqs/Undetermined --sample-metadata samplesheet.csv --output-dir <output_dir>
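The mapping from a command line option to a [Demux] line is mechanical: drop the leading -- and put a comma between the option name and its value. The sketch below shows that conversion with a hypothetical helper (demux_line is not part of sgdemux) and only handles options that take a value.

```shell
#!/bin/sh
# Hypothetical helper, not part of sgdemux: turn "--option value ..."
# into the "option,value ..." form used in the [Demux] section.
demux_line() {
    name="${1#--}"   # strip the leading "--" from the long option name
    shift            # the remaining arguments form the option value
    printf '%s,%s\n' "$name" "$*"
}

echo "[Demux]"
demux_line --read-structures 8B 9M8B +T +T
demux_line --lane 1 2 4
```

The printed lines match the [Demux] section shown above and can be pasted into the Sample Data Sheet.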
Nextflow Workflow
Nextflow is a commonly used workflow engine. We created a Nextflow workflow with the containerized Singular Demultiplexer software for demultiplexing the folder automatically (Case 2 and Case 3). If the "lane_level" parameter is set to true, the 4 lanes will be demultiplexed independently (Case 3). Otherwise, the lanes will be demultiplexed together (Case 2). If the "multiqc" parameter is set to true, the FASTQ files generated by the demultiplexer are analyzed with FastQC and the results are aggregated with MultiQC to generate a quality metrics report.
Only a select number of Demultiplexer options can be changed through the Nextflow input parameter file. However, you can specify other options using the Sample Data Sheet as described above. Documentation is available in the GitHub Repo.
Link to the Nextflow workflow: https://github.com/Singular-Genomics/sgdemux_folder_nf2
Link to the Docker Container: https://hub.docker.com/r/omicpublic/sgdemux
This article is part of our Guide to Working With G4 Sequencing Data Technical Bulletin.