What data format does the G4 Sequencing Platform output?

This article summarizes the FASTQ file output of the G4.

The G4 Sequencing Platform writes your sequencing data directly in FASTQ format. The G4 uses a base caller that converts the fluorescence intensities into bases as your samples are sequenced. If you include the index information in your sample sheet for the run, the G4 can also demultiplex your samples automatically when sequencing is complete. Alternatively, you can start (or restart) the demultiplexing process with the Singular demultiplex software.

The FASTQ files generated by the G4 Sequencing Platform follow the standard convention of four line-separated fields per sequence. Data generated on the G4 is therefore compatible with most bioinformatics analysis pipelines and software that use FASTQ files as input. The FASTQ files are compressed with bgzip to save space.

The first line is the sequence identifier (colloquially called the sequence header).

The second line is the raw genetic sequence.

The third line is the ‘+’ character separator.

The fourth line holds the quality values for the sequence in the second line, one character per base. The quality values use Phred+33 encoding.
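The four-line layout and the Phred+33 encoding described above can be sketched in Python as follows. This is a minimal illustration, not Singular's tooling: the names `FastqRecord`, `read_fastq`, and `phred33_scores` are hypothetical, and it assumes the bgzip-compressed file can be read with Python's standard `gzip` module, which works because bgzip output is gzip-compatible.

```python
import gzip
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    """One FASTQ record (hypothetical helper type for this sketch)."""
    header: str    # line 1, without the leading '@'
    sequence: str  # line 2, the raw sequence
    quality: str   # line 4, Phred+33 encoded quality string


def read_fastq(path: str) -> Iterator[FastqRecord]:
    """Yield records from a gzip- or bgzip-compressed FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            separator = handle.readline().rstrip("\n")
            quality = handle.readline().rstrip("\n")
            # Basic sanity checks on the four-line structure.
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"Malformed FASTQ record near: {header!r}")
            if len(sequence) != len(quality):
                raise ValueError(f"Sequence/quality length mismatch: {header!r}")
            yield FastqRecord(header[1:], sequence, quality)


def phred33_scores(quality: str) -> List[int]:
    """Convert a Phred+33 quality string to integer quality scores."""
    return [ord(ch) - 33 for ch in quality]
```

For example, `phred33_scores("?19G")` returns `[30, 16, 24, 38]`, because each score is the character's ASCII code minus 33.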

Sample Sequence:

@G4-014:0080:OM1482O:1:1001:139542:22240 1:N:0:ATCCAGAG+TAACGTCG

GTTTTGTGGAGGTAATGTTAGTTTATTAATGAGTAAGTTATTGTTTTAAATTGGTTTTGTGGAG

+

?19GF9AI@J61(AJDEE2JBGHB0?:BCG(566IF%B(C0B@GH>DGJI<9%<GJJ:0<D?,A

The sequence identifier field further encodes the following information:

Block 1:
- G4-014 - Machine name - the name of the G4 machine that the data was sequenced on.
- 0080 - Run number - the number of the run on the specified machine (integer). Tracked by the instrument.
- OM1482O - Flow cell ID - the flow cell ID number read from the EPROM.
- 1 - Lane_number - the lane the data was sequenced on (1-4)
- 1001 - Tile_number - denotes which tile the data was sequenced on
- 139542 - X-coordinate - the x-position of associated cluster on the tile (int)
- 22240 - Y-coordinate - the y-position of associated cluster on the tile (int)

Block 2:
- 1 - Read_number - whether the read is Read 1 or Read 2 in paired-end sequencing. For single-end runs, this is 1.
- N - Filter_bool - whether the read failed read filtering (Y if it failed, N if it did not).
- 0 - Control_bool - whether the read is part of an internal control (0 if no, >0 if yes). PhiX is typically the control; identified PhiX reads are flagged as 1.
- ATCCAGAG+TAACGTCG - Sample_Seq - the sequence of the sample index used for multiplexing. In this example, it is the dual barcode associated with the read.
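As an illustration of the two blocks above, the following Python sketch splits a sequence identifier into its fields. The `G4ReadId` type and `parse_g4_header` function are hypothetical names chosen for this example. The sketch assumes the identifier always has exactly seven colon-separated fields in Block 1 and four in Block 2, separated by a single space, as in the sample sequence.

```python
from typing import NamedTuple


class G4ReadId(NamedTuple):
    """Fields of a G4 sequence identifier (hypothetical helper type)."""
    machine: str
    run_number: int
    flow_cell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read_number: int
    failed_filter: bool  # True when the filter flag is 'Y'
    control: int         # 0 means the read is not an internal control
    sample_index: str


def parse_g4_header(line: str) -> G4ReadId:
    """Parse the first line of a FASTQ record into its fields."""
    line = line.strip()
    if line.startswith("@"):
        line = line[1:]
    # Block 1 and Block 2 are separated by a single space.
    block1, block2 = line.split(" ", 1)
    machine, run, flow_cell, lane, tile, x, y = block1.split(":")
    read_number, filter_flag, control, sample_index = block2.split(":")
    return G4ReadId(
        machine=machine,
        run_number=int(run),
        flow_cell_id=flow_cell,
        lane=int(lane),
        tile=int(tile),
        x=int(x),
        y=int(y),
        read_number=int(read_number),
        failed_filter=(filter_flag == "Y"),
        control=int(control),
        sample_index=sample_index,
    )


read_id = parse_g4_header(
    "@G4-014:0080:OM1482O:1:1001:139542:22240 1:N:0:ATCCAGAG+TAACGTCG"
)
print(read_id.machine, read_id.run_number, read_id.sample_index)
```

Converting the numeric fields to integers drops the leading zeros of the run number (0080 becomes 80); keep the field as a string if the original formatting matters.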

The G4 FASTQ sequence identifier is compatible with most bioinformatics tools. The X and Y coordinates are used to identify optical duplicates with a tool such as Picard. We typically set the optical duplicate distance threshold to 500 in Picard MarkDuplicates.
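To show how the tile coordinates enter an optical duplicate check, here is a simplified Python sketch. It is not Picard's implementation: the function name `within_optical_distance` is hypothetical, and comparing the X and Y offsets separately against the threshold is an assumption about how the distance is measured. A real tool applies this kind of check only to reads it has already identified as duplicates by alignment. The threshold of 500 is the value mentioned above, which presumably corresponds to the `OPTICAL_DUPLICATE_PIXEL_DISTANCE` option of MarkDuplicates.

```python
from typing import Tuple

# (lane, tile, x, y), taken from Block 1 of the sequence identifier.
ClusterPosition = Tuple[int, int, int, int]


def within_optical_distance(
    a: ClusterPosition, b: ClusterPosition, max_distance: int = 500
) -> bool:
    """Return True if two clusters lie on the same lane and tile and are
    no more than max_distance apart along both the X and Y axes.

    Simplified illustration only; the per-axis comparison is an assumption.
    """
    lane_a, tile_a, x_a, y_a = a
    lane_b, tile_b, x_b, y_b = b
    if lane_a != lane_b or tile_a != tile_b:
        return False  # clusters on different tiles are never compared
    return abs(x_a - x_b) <= max_distance and abs(y_a - y_b) <= max_distance
```

With the sample read at `(1, 1001, 139542, 22240)`, a second cluster at `(1, 1001, 139900, 22300)` falls within the threshold, while any cluster on a different tile does not.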

This article is part of our Guide to Working With G4 Sequencing Data Technical Bulletin.