The human genome is pervasively transcribed into a range of coding and noncoding RNAs that are collectively termed the transcriptome (1). RNA sequencing (RNAseq) can both reconstruct complex spliced isoforms and measure gene or isoform expression, and can thereby provide a global profile of the transcriptome (2-4).
Given these advantages, RNAseq has become a commonly applied tool throughout biological research, and is increasingly applied in clinical diagnosis (5). However, the sheer size and complexity of gene expression, a pervasive expression-dependent bias, and additional technical variables during library preparation, sequencing and analysis confound analysis of the transcriptome using RNAseq (6-8).
Spike-in controls are exogenous RNA molecules that are added directly to an RNA sample prior to library preparation and sequencing. The External RNA Controls Consortium (ERCC) previously developed a suite of RNA spike-in controls that are derived from non-human sequences and can be added to a user’s RNA sample to act as internal quantitative and qualitative controls (9, 10). However, despite conveying a wide range of advantages during gene expression analysis, ERCC controls were originally developed for microarray analysis and comprise only single-exon transcripts, and therefore do not reproduce the complexity of human gene splicing (10).
SEQUIN DESIGN & USE
We have developed a suite of synthetic RNA spike-in controls, termed ‘sequins’ (sequencing spike-ins), that represent complex spliced isoforms (11). The primary sequence of sequins is entirely artificial, and shares no homology to the genomes of known organisms. As a result, sequins do not align to a reference genome, but rather align to synthetic gene loci encoded in an artificial in silico chromosome sequence.
RNA sequins proportionally represent the diversity of human genes (12). Each RNA sequin represents a mature mRNA isoform (with intervening intron sequences excluded) and terminates with a ~60 nt poly-A tail. Sequins represent a total of 164 alternative isoforms, and range from small single-exon isoforms to large multi-exon transcripts up to ~7 kb in size. Together these alternative isoforms are encoded within 78 artificial gene loci on the in silico chromosome, and display the diversity of splicing events typically observed in the transcriptome, including intron retention, cassette exons, alternative transcription initiation and termination, and non-canonical splicing (4).
RNA sequins are combined at different concentrations to formulate a mixture that emulates quantitative gene expression and alternative splicing. For example, modulating the relative concentration of alternative sequin isoforms within a single gene locus emulates alternative splicing. We have also formulated multiple alternative mixtures for gene expression profiling experiments. Sequin gene abundance differs between these mixtures, and each mixture is added to a different sample. This emulates differential gene expression, and can be used both to normalize samples and to measure fold-change differences in gene expression between them.
RNA sequin mixtures are carefully prepared, with only a single freeze-thaw cycle, before being shipped frozen to recipient users. Upon their first use, RNA sequins should be divided into single-use aliquots to minimize future freeze-thaw cycles. RNA sequins are then typically added at a fractional concentration to a user’s RNA sample before library preparation. The exact amount of sequins to add depends on the amount of sample RNA and the library preparation method used; detailed instructions can be found in the laboratory protocol.
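As a minimal sketch of how the expected fold-change between two mixtures could be derived, the Python below reads a tab-separated mixture table and reports log2(Mix B / Mix A) for each sequin gene. The column names (`NAME`, `MIX_A`, `MIX_B`), the identifiers and the concentrations are hypothetical placeholders rather than values from the distributed mixture files, so they should be adjusted to the actual file header.

```python
import csv
import io
import math

# Hypothetical excerpt of a mixture table; names and values are illustrative only.
EXAMPLE_TSV = (
    "NAME\tMIX_A\tMIX_B\n"
    "GENE_1\t30.0\t7.5\n"
    "GENE_2\t1.0\t4.0\n"
    "GENE_3\t15.0\t15.0\n"
)


def expected_log2_fold_changes(handle, col_a="MIX_A", col_b="MIX_B"):
    """Return {name: log2(Mix B / Mix A)} from a tab-separated mixture table."""
    reader = csv.DictReader(handle, delimiter="\t")
    result = {}
    for row in reader:
        conc_a = float(row[col_a])
        conc_b = float(row[col_b])
        if conc_a <= 0 or conc_b <= 0:
            # A fold change is undefined when either concentration is zero.
            continue
        result[row["NAME"]] = math.log2(conc_b / conc_a)
    return result


lfc = expected_log2_fold_changes(io.StringIO(EXAMPLE_TSV))
for name, value in lfc.items():
    print(f"{name}\t{value:+.2f}")
```

These expected values are the ground truth against which observed fold-changes between samples can later be compared.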
The combined RNA sample and sequins then undergo library preparation and sequencing together. To distinguish reads derived from sequins from those of the accompanying RNA sample, the resulting library is aligned to a combined index comprising both the reference genome (such as hg38 or mm10) and the in silico chromosome. Reads derived from the sequins align to the in silico chromosome, whilst reads from the accompanying sample align to the reference genome. If appropriate, users may need to subsample alignments to the in silico chromosome to calibrate the fraction of reads derived from sequins between multiple libraries or replicates.
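The partitioning and calibration described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Anaquin implementation: it assumes the in silico chromosome is named `chrIS` in the combined index (inferred from the `chrIS.fa` file name), reads plain SAM text, and uses a hypothetical target sequin fraction of 1%.

```python
SEQUIN_CHROM = "chrIS"  # assumed contig name, inferred from the chrIS.fa file name


def count_alignments(sam_lines, sequin_chrom=SEQUIN_CHROM):
    """Count primary mapped alignments on the in silico chromosome and elsewhere."""
    sequin = sample = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & (0x4 | 0x100 | 0x800):
            continue  # skip unmapped, secondary and supplementary records
        if rname == sequin_chrom:
            sequin += 1
        else:
            sample += 1
    return sequin, sample


def keep_probability(sequin, sample, target_fraction):
    """Probability of keeping each sequin alignment so that sequins make up
    target_fraction of all retained alignments (capped at 1.0)."""
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be between 0 and 1")
    if sequin == 0:
        return 0.0
    wanted = target_fraction * sample / (1.0 - target_fraction)
    return min(1.0, wanted / sequin)


# Toy SAM records (hypothetical): one sequin read, one sample read,
# one unmapped read and one secondary alignment.
toy_sam = [
    "@SQ\tSN:chrIS\tLN:1000",
    "r1\t0\tchrIS\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t200\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "r4\t256\tchrIS\t300\t0\t50M\t*\t0\t0\t*\t*",
]
sequin_n, sample_n = count_alignments(toy_sam)
print(sequin_n, sample_n)

# A library with 300 sequin and 9,700 sample alignments, calibrated to 1% sequins.
p = keep_probability(300, 9700, 0.01)
print(f"keep each sequin alignment with probability {p:.4f}")
```

Applying the same target fraction to every replicate gives each library a comparable sequin dose before downstream comparisons.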
Alignment to the in silico chromosome separates sequin reads from sample reads, so that sequins can be analysed as internal quantitative and qualitative controls for the RNAseq workflow. This includes an assessment of library preparation, sequencing, and subsequent bioinformatic analysis, including split-read alignment, transcript assembly, gene expression profiling and alternative splicing. Sequins also constitute ideal scaling factors by which to normalize multiple samples, and detect significant differences in gene expression and splicing.
Here we describe some example analyses that are enabled by RNA sequins. Many of the following analyses can be easily performed using the Anaquin software toolkit (Figure 2), which is designed for easy incorporation within an RNAseq workflow, supports standard file formats (e.g. .BED, .GTF, .BAM and .SAM) and integrates with common RNAseq software (such as StringTie, TopHat, DESeq2, STAR and edgeR).
To demonstrate the use of sequins in RNA sequencing, we first spiked the staggered RNA sequins Mixture A into total RNA harvested from the human K562 cell line (three replicates; example libraries are available for download for training and testing purposes from here). The samples (with sequins) then underwent concurrent library preparation and sequencing, with reads aligned to a combined genome comprising both hg38 and the in silico chromosome. Alignments to the in silico chromosome were then subsampled to calibrate spike-in amounts between replicates as required (detailed methods are available here; 11).
Novel genes and isoform structures can be assembled from RNAseq reads. However, robust transcript assembly is challenging due to the complexity of splicing and insufficient sequence coverage of lowly expressed genes (13-15).
We can use RNA sequins to empirically determine the minimum expression level required for assembly of novel spliced isoforms (Figure 3). By plotting the fraction of isoforms assembled relative to their input concentration, we typically observe a sigmoidal dose-response curve that clearly indicates the minimum gene expression required to achieve sufficient assembly (9). This threshold can then be applied to the accompanying RNA sample within the library to determine the lower limit on novel isoform detection and assembly.
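The dose-response relationship above can be approximated without fitting a sigmoid. The sketch below is a simplified substitute for a fitted curve: it bins sequin isoforms by log2 input concentration, computes the fraction assembled in each bin, and reports the lowest concentration from which sensitivity stays at or above a chosen threshold. The records and the 0.75 threshold are hypothetical.

```python
import math
from collections import defaultdict


def sensitivity_by_bin(records):
    """records: iterable of (input_concentration, was_assembled).
    Returns a sorted list of (log2 bin, fraction assembled)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for conc, assembled in records:
        log_bin = math.floor(math.log2(conc))
        totals[log_bin] += 1
        hits[log_bin] += int(assembled)
    return [(b, hits[b] / totals[b]) for b in sorted(totals)]


def minimum_concentration(curve, threshold=0.5):
    """Lowest concentration (2**bin) from which every higher bin also meets
    the sensitivity threshold; None if even the top bin fails."""
    best = None
    for log_bin, fraction in reversed(curve):
        if fraction < threshold:
            break
        best = log_bin
    return None if best is None else 2.0 ** best


# Hypothetical sequin isoforms: (input concentration, assembled?).
records = [
    (0.5, False), (0.6, False),
    (1.0, False), (1.5, True),
    (2.0, True), (3.0, False),
    (4.0, True), (6.0, True),
    (8.0, True), (16.0, True),
]
curve = sensitivity_by_bin(records)
for log_bin, fraction in curve:
    print(f"log2 bin {log_bin:+d}: {fraction:.2f} assembled")
print("minimum concentration:", minimum_concentration(curve, threshold=0.75))
```

The resulting concentration plays the role of the assembly threshold that is then applied to the accompanying RNA sample.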
GENE EXPRESSION MEASUREMENTS
The measurement of gene expression is a major application of RNA sequencing. However, sequence coverage and biases introduced during library preparation and sequencing can skew gene expression measurements (6, 13). We can use sequins to assess the accuracy with which gene expression is measured within an RNAseq library, and empirically determine the limit of quantification. Plotting the measured expression of sequins (in FPKM) relative to their expected input concentration (in attomoles/μL) illustrates a linear trend that describes the quantitative accuracy of an RNA sequencing experiment (9). Notably, the correlation and coefficient of determination provide users with descriptive statistics on how well gene expression is measured within an RNA sequencing library. The slope of the linear trend indicates the proportionality between expected and observed abundance, and provides a rate by which to convert between sequin input concentration (in attomoles/μL) and measured abundance (in FPKM).
In RNAseq libraries with high sequencing depth, we often observe a lower inflection point that can be determined using piecewise linear regression analysis. This inflection point, termed the ‘limit of quantification’, indicates the minimal expression below which the measurement of gene expression becomes non-linear and highly variable. When this inflection point is applied to the accompanying RNA sample, it indicates the fraction of genes with sufficient sequence coverage to enable an accurate measurement of gene expression.
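The linear trend, its slope and the coefficient of determination can be computed with ordinary least squares. The sketch below assumes both axes are log2-transformed before fitting, which is a common choice for abundance data but is an assumption here, and the concentration and FPKM values are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared


# Hypothetical sequin measurements: expected input (attomoles/uL) and observed FPKM.
expected = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
observed = [1.1, 1.9, 4.2, 7.8, 16.5, 31.0]

log_expected = [math.log2(v) for v in expected]
log_observed = [math.log2(v) for v in observed]
slope, intercept, r_squared = fit_line(log_expected, log_observed)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

On the log scale, a slope close to 1 indicates that observed abundance scales proportionally with input, and the intercept gives the conversion factor between the two units.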
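One simple way to locate such an inflection point is a brute-force two-segment regression: try every candidate breakpoint, fit an independent line on each side, and keep the breakpoint with the smallest total squared error. The sketch below illustrates that idea only and is not necessarily the procedure used by Anaquin; the data points and the `min_points` parameter are hypothetical.

```python
def _sse_of_fit(xs, ys):
    """Sum of squared residuals of an ordinary least squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return sum((y - mean_y) ** 2 for y in ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def limit_of_quantification(xs, ys, min_points=3):
    """Return the x value that starts the upper segment of the best
    two-segment fit, or None if there are too few points."""
    pairs = sorted(zip(xs, ys))
    sorted_x = [p[0] for p in pairs]
    sorted_y = [p[1] for p in pairs]
    best = None
    for i in range(min_points, len(pairs) - min_points + 1):
        total = (_sse_of_fit(sorted_x[:i], sorted_y[:i])
                 + _sse_of_fit(sorted_x[i:], sorted_y[i:]))
        if best is None or total < best[0]:
            best = (total, sorted_x[i])
    return None if best is None else best[1]


# Hypothetical log2 input concentration vs log2 measured expression:
# a noisy plateau at low input, then a clean linear response.
log_conc = [-4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
log_expr = [0.6, 0.4, 0.6, 0.4, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
loq = limit_of_quantification(log_conc, log_expr)
print("limit of quantification (log2 concentration):", loq)
```

Sequins, and by extension sample genes, whose abundance falls below the returned value would be treated as lying outside the quantifiable range.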
DIFFERENTIAL GENE EXPRESSION PROFILING
Gene expression profiling is a common technique to identify the genes responsible for orchestrating developmental or differentiation programs. Typically, users compare multiple samples from different tissues or stages to identify differences in gene expression. However, technical bias, differences between RNAseq libraries and insufficient sequence coverage of genes can confound gene expression profiling. RNA sequins are formulated into alternative mixtures that emulate fold-changes in gene expression. These alternative mixtures are added to different samples, enabling users to empirically assess the detection of fold-change differences in gene expression between samples.
To illustrate the accuracy with which fold-change differences are measured, we can plot the observed relative to expected fold-change in sequin expression between samples. Whilst a linear regression provides a global description of fold-change measurements, these measurements are highly expression dependent. The relationship between gene-expression, fold-change and the confidence with which differences are detected can be illustrated with a limit-of-detection ratio (LODR) plot (16).
Sequins can also be used to empirically assess differential gene expression analysis (11). By comparing the significant differences reported by common gene expression analysis software to the expected fold-changes in sequin expression, we can empirically assess analytical performance. This performance can be illustrated with a ROC curve that indicates the diagnostic performance at multiple significance thresholds, from which users can select a threshold according to their requirements.
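A ROC curve of this kind can be built directly from the p-values reported for sequin genes, using the mixture design as ground truth. The sketch below assumes a sequin counts as a true positive when its expected fold-change between mixtures differs from 1; the p-values and labels are hypothetical.

```python
def roc_points(p_values, is_differential):
    """Return (false positive rate, true positive rate) pairs as the
    significance threshold is relaxed, starting from (0, 0)."""
    positives = sum(1 for label in is_differential if label)
    negatives = len(is_differential) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sequin")
    ranked = sorted(zip(p_values, is_differential))
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Sequins with tied p-values cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / negatives, tp / positives))
        i = j
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points ordered by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical results for six sequin genes.
p_values = [0.001, 0.01, 0.02, 0.2, 0.5, 0.8]
is_differential = [True, True, False, True, False, False]

points = roc_points(p_values, is_differential)
auc = area_under_curve(points)
print(f"AUC = {auc:.3f}")
```

Each point on the curve corresponds to one significance cut-off, so the curve can be read to choose a threshold that balances sensitivity against false positives.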
DATA & DOWNLOADS
This section provides access to useful information and data for using and analysing RNA sequins, including gene annotations, in silico chromosome sequences, mixture concentrations, laboratory protocols for adding sequins to your RNA, and example sequenced libraries.
| File | Description | Format |
|---|---|---|
| RNA_LabProtocol.pdf | Protocol for using RNA sequins in the laboratory (includes detail on dilution and addition to RNA samples). | .PDF |
| RNAsequins.v2.2.fa | RNA – sequin nucleotide sequences | .FA |
| RNAsequins_isoform_mix.v2.2.tsv | RNA – concentration of isoform sequins in Mix A and B (v2.2) | .TSV |
| RNAsequins_gene_mix.v2.2.tsv | RNA – concentration of gene sequins in Mix A and B (v2.2) | .TSV |
| RNAsequins.v2.2.gtf | RNA – synthetic gene loci annotations encoded within in silico chromosome | .GTF |
| chrIS.fa | RNA – in silico chromosome (chrIS) | .FA |
| RNAsequins_mixA_rep1 | RNA sequins mix A neat library (replicate 1) | .TXT, .FQ, .FQ |
| RNAsequins_mixA_rep2 | RNA sequins mix A neat library (replicate 2) | .TXT, .FQ, .FQ |
| RNAsequins_mixA_rep3 | RNA sequins mix A neat library (replicate 3) | .TXT, .FQ, .FQ |
| RNAsequins_mixB_rep1 | RNA sequins mix B neat library (replicate 1) | .TXT, .FQ, .FQ |
| RNAsequins_mixB_rep2 | RNA sequins mix B neat library (replicate 2) | .TXT, .FQ, .FQ |
| RNAsequins_mixB_rep3 | RNA sequins mix B neat library (replicate 3) | .TXT, .FQ, .FQ |
REFERENCES
1. Okazaki, Y. et al. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420, 563-73 (2002).
2. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5, 621-8 (2008).
3. Cloonan, N. et al. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nat Methods 5, 613-9 (2008).
4. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet 40, 1413-5 (2008).
5. Byron, S.A., Van Keuren-Jensen, K.R., Engelthaler, D.M., Carpten, J.D. & Craig, D.W. Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat Rev Genet 17, 257-71 (2016).
6. Tarazona, S., Garcia-Alcalde, F., Dopazo, J., Ferrer, A. & Conesa, A. Differential expression in RNA-seq: a matter of depth. Genome Res 21, 2213-23 (2011).
7. Ozsolak, F. & Milos, P.M. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 12, 87-98 (2011).
8. Vijay, N., Poelstra, J.W., Kunstner, A. & Wolf, J.B. Challenges and strategies in transcriptome assembly and differential gene expression quantification. A comprehensive in silico assessment of RNA-seq experiments. Mol Ecol 22, 620-34 (2013).
9. Jiang, L. et al. Synthetic spike-in standards for RNA-seq experiments. Genome Res 21, 1543-51 (2011).
10. External RNA Controls Consortium. Proposed methods for testing and selecting the ERCC external RNA controls. BMC Genomics 6, 150 (2005).
11. Hardwick, S.A. et al. Spliced synthetic genes as internal controls in RNA sequencing experiments. Nat Methods 13, 792-8 (2016).
12. Derrien, T. et al. The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res 22, 1775-89 (2012).
13. Garber, M., Grabherr, M.G., Guttman, M. & Trapnell, C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods 8, 469-77 (2011).
14. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 28, 511-5 (2010).
15. Martin, J.A. & Wang, Z. Next-generation transcriptome assembly. Nat Rev Genet 12, 671-82 (2011).
16. Munro, S.A. et al. Assessing technical performance in differential gene expression experiments with external spike-in RNA control ratio mixtures. Nat Commun 5, 5125 (2014).
Garvan Institute of Medical Research © 2016. All rights reserved.