CHALLENGES IN NEXT-GENERATION SEQUENCING
Next-generation sequencing (NGS) is widely used throughout the biological sciences. However, the accurate analysis of sequenced libraries is challenging due to the size and complexity of genome sequences, and is further confounded by technical variables introduced during library preparation, sequencing and subsequent bioinformatics analysis (1-4).
As NGS is translated into clinical diagnosis, the standardized and accurate analysis of sequenced libraries is becoming increasingly important (5). The development of validated genome materials against which NGS analysis can be benchmarked has been cited as a key requirement for the translation of NGS into clinical diagnosis (6,7).
The NA12878 platinum genome sequence is one of the foremost reference materials developed to date by the Genome in a Bottle Consortium (8). This individual genome has been well-characterized with a variety of sequencing platforms and analytical processes to provide an arbitrated genotype (9). The widespread availability of this genome gives individual laboratories a validated reference genotype against which to benchmark their NGS analysis.
The NA12878 platinum genome has the advantage of providing the complexity of the human genome sequence. However, whilst reference genome materials constitute invaluable controls, adding them would contaminate any receiving sample, so they cannot be employed as internal sample-specific controls.
Spike-in controls are synthetic RNA or DNA molecules that are added directly to a user’s sample, and undergo concurrent library preparation and sequencing with the accompanying sample (10-12). This enables their use as internal sample-specific controls that are subject to the same bias and artifacts as the accompanying sample during the NGS workflow. However, spike-in controls typically have synthetic or alien sequences that enable derivative reads to be distinguished from the accompanying sample following sequencing.
CHIRAL GENOME DESIGN
The human genome encodes information in the 5’ to 3’ direction. This directionality is imposed by nucleotide elongation, and is followed during transcription, translation and replication. By reversing this directionality to the 3’ to 5’ direction, we can generate a ‘mirror’ genome sequence that is distinct from the original human genome, yet retains the same nucleotide composition, repetitiveness and alignment properties.
This reverse sequence has two features that make it an ideal substrate for the design of synthetic spike-in controls:
Firstly, the mirror sequence is non-superimposable on the natural human genome sequence. As a result, sequenced reads from the human genome will not align to the ‘mirror’ genome, and reciprocally, reads simulated from the ‘mirror’ genome will not align to the human genome. This inability to cross-align between the human and mirror genomes holds for over 99% of the genome, with palindromic sequences being the sole exception.
Secondly, despite being distinct, the mirror sequence shares many properties with its natural human counterpart, including nucleotide composition, repetitiveness and alignment properties. This is because read alignment occurs similarly, whether in the forward direction in the human genome, or in the reverse direction in the mirror genome. As a result, chiral sequences can act as controls that match the alignment of the original human genome sequence.
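The two properties above can be illustrated with a minimal Python sketch. It mirrors a short sequence by plain reversal (not reverse complementation), then checks that the nucleotide composition is unchanged while the sequence itself differs, except for a palindrome. The function names and example sequences are hypothetical and chosen for illustration only.

```python
from collections import Counter


def mirror(seq: str) -> str:
    """Return the mirror of a sequence: the same bases read in the opposite order."""
    return seq[::-1]


def is_palindrome(seq: str) -> bool:
    """True when a sequence is identical to its own mirror."""
    return seq == mirror(seq)


# Hypothetical example sequence, not taken from any real genome.
human = "GATTACAGGC"
chiral = mirror(human)

print(chiral)                             # CGGACATTAG
print(Counter(human) == Counter(chiral))  # True: composition is preserved
print(is_palindrome(human))               # False: the mirror is a distinct sequence
print(is_palindrome("ACGTTGCA"))          # True: palindromes mirror onto themselves
```

A real aligner also considers the reverse-complement strand, which this toy comparison ignores.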
Almost any feature of the human genome can be represented using synthetic mirror sequences that are physically synthesized into RNA or DNA molecules that we term sequins (sequencing spike-in controls) (13,14).
Individual sequins are combined across a range of precise concentrations to formulate mixtures. By modulating the concentration at which each sequin is present in the mixture, we can emulate quantitative features of genome biology, such as gene expression, allele frequencies or copy number variation.
By titrating sequins at increasing concentrations, we can establish an internal scale within a mixture. In this case, each sequin demarcates known input concentrations, and provides a reference against which to measure the abundance of RNA and DNA sequences in the accompanying sample.
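As a rough illustration of such an internal scale, the sketch below fits a straight line in log2 space between known sequin input concentrations and their observed read counts, then inverts the line to place a feature from the accompanying sample on that scale. The concentrations, counts and function names are hypothetical, and an ordinary least-squares fit is assumed here for simplicity; it is not necessarily the model used by any particular tool.

```python
import math


def fit_ladder(concentrations, counts):
    """Least-squares fit of log2(count) = slope * log2(concentration) + intercept."""
    xs = [math.log2(c) for c in concentrations]
    ys = [math.log2(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(count, slope, intercept):
    """Invert the fitted line to place a sample feature on the sequin scale."""
    return 2 ** ((math.log2(count) - intercept) / slope)


# Hypothetical ladder: input concentrations (arbitrary units) and observed read counts.
ladder_conc = [1, 2, 4, 8, 16]
ladder_counts = [10, 20, 40, 80, 160]

slope, intercept = fit_ladder(ladder_conc, ladder_counts)

# A sample feature with 60 reads falls between the 4 and 8 unit sequins.
estimate = estimate_concentration(60, slope, intercept)
print(round(slope, 3), round(estimate, 3))
```

With real data, the slope and the scatter around the line also indicate how quantitatively the library reflects the input amounts.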
Sequins also provide an ideal internal ladder for comparing differences and similarities between samples. In this case, mixtures typically contain two distinct subsets of sequins: a first subset whose concentration does not vary between mixtures, which is ideal for normalizing between multiple samples (15), and a second subset whose concentration differs between the two mixtures, which is useful for measuring fold-change differences between samples (such as in gene expression).
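The roles of the two subsets can be sketched in a few lines of Python. The invariant sequins yield a scaling factor between two samples, and a sequin with a known concentration difference then checks whether the expected fold change is recovered after scaling. All counts and function names are hypothetical, and the median-ratio factor is a deliberately simple stand-in, not the factor-analysis method of reference (15).

```python
import math
from statistics import median


def scaling_factor(invariant_a, invariant_b):
    """Median ratio of sample B to sample A counts across invariant sequins."""
    return median(b / a for a, b in zip(invariant_a, invariant_b))


def log2_fold_change(count_a, count_b, factor):
    """Log2 fold change of B over A after rescaling B by the invariant-sequin factor."""
    return math.log2((count_b / factor) / count_a)


# Hypothetical read counts for sequins at equal concentration in both mixtures.
invariant_a = [100, 200, 400]
invariant_b = [200, 400, 800]  # sample B was sequenced twice as deeply

factor = scaling_factor(invariant_a, invariant_b)

# Hypothetical sequin present at 4x the concentration in mixture B versus mixture A.
observed = log2_fold_change(50, 400, factor)
expected = math.log2(4)
print(factor, observed, expected)
```

Comparing the observed and expected values across many such sequins gives an empirical view of how accurately fold changes are measured.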
IN THE NGS WORKFLOW
An aliquot of sequins is typically added to a sample at a fractional concentration (typically 1-3%). As a result, a proportional fraction of reads in the final library will derive from sequins. To distinguish these reads, the library is aligned to a combined index containing both the reference genome sequence and the synthetic or mirror genome sequence. This partitions reads from sequins, which align to the synthetic genome, from reads from the accompanying sample, which align to the reference genome.
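The partitioning step can be sketched as follows, assuming the alignments have already been reduced to pairs of read name and contig name. The contig prefix `chrR` used to mark mirror-genome contigs is an assumption made for this example, as are the function names and the read data; in practice the pairs would be taken from the alignment file produced against the combined index.

```python
# Assumption: mirror-genome contigs in the combined index carry this prefix.
SEQUIN_PREFIX = "chrR"


def partition_reads(alignments):
    """Split (read_name, contig) pairs into sample reads and sequin reads."""
    sample, sequin = [], []
    for read_name, contig in alignments:
        if contig.startswith(SEQUIN_PREFIX):
            sequin.append(read_name)
        else:
            sample.append(read_name)
    return sample, sequin


def sequin_fraction(sample, sequin):
    """Fraction of aligned reads that derive from sequins."""
    total = len(sample) + len(sequin)
    return len(sequin) / total if total else 0.0


# Hypothetical alignments: 98 reads on a human contig, 2 on a mirror contig.
alignments = [(f"read{i}", "chr1") for i in range(98)]
alignments += [("read98", "chrR1"), ("read99", "chrR1")]

sample_reads, sequin_reads = partition_reads(alignments)
fraction = sequin_fraction(sample_reads, sequin_reads)
print(len(sample_reads), len(sequin_reads), fraction)
```

An observed fraction far from the intended spike-in proportion can be an early sign of a dilution or quantification problem.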
Once sequins have been added to a sample, they comprise a set of quantitative and qualitative internal controls that can be used to assess multiple steps in the NGS workflow. This includes the use of sequins for rapid quality control and troubleshooting purposes, and the detection of bias in library preparation and errors during sequencing. However, sequins are particularly useful in assessing bioinformatics steps, providing an internal reference against which to benchmark alternative bioinformatics processes, optimize modifiable parameters, and ultimately improve analysis. Sequins can also be used as a set of scaling factors to normalize between multiple samples or replicates, and provide an empirical assessment of significant fold-difference between samples (15).
To aid in the analysis of sequins, we have developed a software toolkit, termed Anaquin, which performs many of the most common analyses. For example, Anaquin can calculate a range of statistics to describe an NGS library, and rapidly assess analytical performance. Anaquin is designed to be incorporated into the NGS workflow, and accordingly works with most commonly used bioinformatics tools and supports standard data formats. In addition, an Anaquin package is available in the R programming environment that enables integration with existing bioinformatic and statistical packages, and can be used to visualize sequin data and analysis.
1. Wall, J.D. et al. Estimating genotype error rates from high-coverage next-generation sequence data. Genome Res 24, 1734-9 (2014).
2. Nakamura, K. et al. Sequence-specific error profile of Illumina sequencers. Nucleic Acids Res 39, e90 (2011).
3. Meacham, F. et al. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics 12, 451 (2011).
4. Tarazona, S., Garcia-Alcalde, F., Dopazo, J., Ferrer, A. & Conesa, A. Differential expression in RNA-seq: a matter of depth. Genome Res 21, 2213-23 (2011).
5. Gargis, A.S. et al. Assuring the quality of next-generation sequencing in clinical laboratory practice. Nat Biotechnol 30, 1033-6 (2012).
6. Altman, R.B. et al. A research roadmap for next-generation sequencing informatics. Sci Transl Med 8, 335ps10 (2016).
7. Zook, J.M. & Salit, M. Advancing Benchmarks for Genome Sequencing. Cell Syst 1, 176-7 (2015).
8. Zook, J.M. et al. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 32, 246-51 (2014).
9. Zook, J.M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data 3, 160025 (2016).
10. Baker, S.C. et al. The External RNA Controls Consortium: a progress report. Nat Methods 2, 731-4 (2005).
11. Jiang, L. et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res 21, 1543-51 (2011).
12. Zook, J.M., Samarov, D., McDaniel, J., Sen, S.K. & Salit, M. Synthetic spike-in standards improve run-specific systematic error analysis for DNA and RNA sequencing. PLoS One 7, e41356 (2012).
13. Hardwick, S.A. et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat Methods 13, 792-8 (2016).
14. Deveson, I.W. et al. Representing genetic variation with synthetic DNA standards. Nat Methods 13, 784-91 (2016).
15. Risso, D., Ngai, J., Speed, T.P. & Dudoit, S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat Biotechnol 32, 896-902 (2014).