Metagenome sequencing can profile the microbial community within an environmental sample, and diagnose the presence of pathogens directly from patient samples or isolates. However, the size and diversity of microbial communities often pose a challenge to metagenome analysis that can be addressed with reference standards.
Metagenomics directly applies next-generation sequencing to an environmental DNA sample that may contain many individual microbe genomes. This approach can resolve microbes that are novel or uncultivable, and provides a global profile of microbial community diversity and ecology. Given these advantages, metagenomics has been widely used to explore microbial ecology in diverse ecosystems, from patient isolates to deep sea hydrothermal vents.
However, the diversity, variation and novelty of microbial communities present a challenge to metagenome analysis, and there are few standardized tools and references for comparing between samples. Microbiome analysis can be further confounded by the presence of host DNA sequences, such as human DNA, and by technical variation during the next-generation sequencing workflow.
The diversity of population sizes typically exhibited by microbial communities often needs to be normalized for accurate comparison, and this challenge is further compounded by different genome sizes (which can be unknown). Sequencing depth may also not achieve saturated coverage of microbial genomes, and can limit accurate quantitative measures of microbial abundance.
To provide a standardized reference against which to measure the microbe communities within each sample, and assess the performance of metagenome analysis, we have designed a set of synthetic reference standards termed ‘sequins’ (sequencing spike-ins).
We have designed a set of 86 sequins for use with metagenome applications of next-generation sequencing. Each meta-sequin mirrors a microbe genome sequence, and they are combined together to emulate the GC content, sequence complexity and phylogenetic diversity encountered in a natural microbial community, despite having no homology with natural DNA sequences.
Figure 1. Schematic overview of the design, use and analysis of meta-sequins.
Sequins are added to environmental DNA samples prior to library preparation, and undergo concurrent sequencing and analysis with the accompanying sample. Accordingly, sequins can act as a reference set of internal controls throughout the NGS workflow, and allow sequencing bias, assembly performance and quantitative accuracy to be assessed. Because the sequins share no sequence homology with known natural sequences, sequins and derivative reads can be readily distinguished from the accompanying environmental sample, thereby comprising an independent, internal, sample-specific reference control.
To represent a synthetic microbe community, we retrieved a range of high-quality microbe genome sequences that represent a range of taxa, sizes, GC content and rRNA operon counts, and that were isolated from a diverse range of environments, aiming to represent the complex phylogenetic and genomic heterogeneity typically encountered in a microbial community.
Figure 2. A 4 kb region of the Synechocystis sp. PCC6803 genome is represented with a mirrored sequin (MQ_23). Following sequencing, the alignment of reads to the sequin (red) recapitulates the alignment of reads to the original natural genome (blue).
Each microbe sequence was then mirrored and assembled to form a representative sequin that recapitulates features of nucleotide composition, information content and repeat content of the original microbe genome. For example, the Synechocystis sp. PCC6803 genome is represented with a mirrored sequin that recapitulates the GC content and sequence architecture (Figure 2). This collection of sequins thereby mirrors the range, features and complexity of a natural microbial community.
To support quantitative metagenomics, sequins have been formulated at different concentrations to form a quantitative ladder. This ladder can be used to measure the abundance of accompanying microbe genomes, assess the impact of sequencing coverage, and enable normalisation and comparisons between multiple samples. Because sequins represent only a fraction of the counterpart microbial genome, their reduced length enables the full quantitative range of the sequin community to be assessed whilst using only a small fraction of the sequenced library.
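As one illustration of how the ladder could support comparisons between samples, the sketch below derives a per-sample scale factor from the sequins and divides genome coverage by it. This is a minimal sketch under stated assumptions, not a prescribed method: the median-ratio approach, the function names and the sequin and genome identifiers are all hypothetical and chosen only for illustration.

```python
from statistics import median


def ladder_scale_factor(sequin_coverage, sequin_concentration):
    """Median coverage obtained per unit of input concentration.

    Both arguments are dicts keyed by sequin name. Sequins with zero
    coverage (not detected) are ignored. The median ratio is an
    assumption made for this sketch.
    """
    ratios = [
        sequin_coverage[name] / sequin_concentration[name]
        for name in sequin_concentration
        if sequin_coverage.get(name, 0.0) > 0.0
    ]
    if not ratios:
        raise ValueError("no sequins were detected in this sample")
    return median(ratios)


def normalise(genome_coverage, scale):
    """Express genome coverage in ladder units so samples can be compared."""
    return {genome: cov / scale for genome, cov in genome_coverage.items()}


# Hypothetical ladder and two libraries sequenced to different depths.
concentration = {"sequin_a": 1.0, "sequin_b": 4.0, "sequin_c": 16.0}
sample_1 = {"sequin_a": 2.0, "sequin_b": 8.0, "sequin_c": 32.0}
sample_2 = {"sequin_a": 6.0, "sequin_b": 24.0, "sequin_c": 96.0}

scale_1 = ladder_scale_factor(sample_1, concentration)
scale_2 = ladder_scale_factor(sample_2, concentration)
print(normalise({"genome_x": 10.0}, scale_1))
print(normalise({"genome_x": 30.0}, scale_2))
```

In this toy example the second library is sequenced three times deeper than the first, and the same genome ends up with the same normalised value in both once the ladder-derived scale factor is applied.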
Metagenome sequins are added to a sample of interest following DNA extraction, with the combined DNA pool undergoing concurrent library preparation and sequencing. We typically recommend that you add metagenome sequins at 1-2% fractional concentration by mass, although this can be changed at your discretion. Once added, the combined sample should then be used as input according to the protocol of your desired library preparation kit. Detailed instructions are available in the laboratory protocol.
Here we describe an example workflow for library preparation and sequencing of meta-sequins that have been added to the Mock Bacteria Archaea Community sample (MBARC-26; Singer et al., 2016), which is composed of 23 bacterial and 3 archaeal strains with finished genomes. Firstly, 1.5 μL of a 1:10 dilution of metagenome sequins (Mixture A) was added to 100 ng of MBARC-26 genomic DNA (i.e. 1.5% by mass). The combined DNA sample was then used as the normalized gDNA input described in the Nextera XT DNA Library Prep Kit Reference Guide (Illumina®, 15031942 v02). Subsequent steps were performed as per the manufacturer’s protocol, and libraries were then sequenced on an Illumina HiSeq 2500 (with paired-end 125 bp reads). Following QC validation, sequenced reads are trimmed to remove adaptor sequences.
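The spike-in amount for a different sample mass or target fraction can be worked out with simple arithmetic. The sketch below is illustrative only: it takes "fraction by mass" to mean sequin mass relative to sample DNA mass (the convention that gives 1.5% for the example above), and the working concentration of the diluted sequin mixture is a hypothetical input that you would need to take from your own protocol.

```python
def sequin_mass_ng(sample_mass_ng, fraction):
    """Mass of sequins (ng) so that sequin mass / sample mass == fraction."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1 (e.g. 0.015 for 1.5%)")
    return sample_mass_ng * fraction


def sequin_volume_ul(mass_ng, working_conc_ng_per_ul):
    """Volume (uL) of diluted sequin mixture that contains mass_ng of DNA."""
    if working_conc_ng_per_ul <= 0.0:
        raise ValueError("working concentration must be positive")
    return mass_ng / working_conc_ng_per_ul


# 100 ng of sample DNA spiked at 1.5% by mass.
mass = sequin_mass_ng(100.0, 0.015)
print(f"sequin mass to add: {mass:.2f} ng")

# ASSUMPTION: a diluted mixture at 1 ng/uL, used here only as an example value.
print(f"volume to add: {sequin_volume_ul(mass, 1.0):.2f} uL")
```

At the recommended 1-2% range, the same calculation gives between 1 ng and 2 ng of sequins for every 100 ng of sample DNA.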
Analysis of meta-sequins
By plotting the relationship between the observed and expected abundance of meta-sequins, the quantitative accuracy of the library can be assessed. Quantification can be done by aligning reads to the reference sequences, or by counting k-mers present within the reads. For an alignment-based analysis, trimmed reads are aligned (using Bowtie2; Langmead et al., 2012) to a combined genome index comprising the sequin sequences and the MBARC-26 reference genome sequences. The total number of reads aligning to sequins should correspond to the fractional concentration at which they were added to the sample, and any substantial deviation can indicate an experimental error. To assess the quantitative accuracy of the library, we plot the mean fold-coverage of the sequin alignments relative to their expected input concentrations; the missed detection of low-abundance sequins indicates the lower sensitivity limit of the library (Figure 3A).
Figure 3. (A) Comparison of the measured meta-sequin abundance (mean fold-coverage) relative to known mixture concentration illustrates the quantitative accuracy of the library, and the lower detection limit. (B) Comparison of the fraction of assembled sequins (blue) and microbe genomes (red) indicates the minimum fold-coverage required for effective de novo assembly.
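The comparison of observed and expected abundance can be summarised numerically as well as plotted. The following is a minimal sketch, assuming that per-sequin mean fold-coverage has already been computed from the alignments and that the expected concentrations are known from the mixture. It fits a least-squares line in log2 space (a common choice for ladders spanning several orders of magnitude, but an assumption here) and reports the lowest concentration at which a sequin was detected. The function names and the example values are hypothetical.

```python
import math


def loglog_fit(expected, observed):
    """Fit log2(observed) = slope * log2(expected) + intercept.

    expected and observed are dicts keyed by sequin name. Sequins with
    zero observed coverage are excluded from the fit.
    Returns (slope, intercept, r_squared).
    """
    pairs = [
        (math.log2(expected[name]), math.log2(observed[name]))
        for name in expected
        if observed.get(name, 0.0) > 0.0
    ]
    if len(pairs) < 2:
        raise ValueError("at least two detected sequins are needed for a fit")
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0.0:
        raise ValueError("detected sequins must span more than one concentration")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0.0 else 1.0
    return slope, intercept, r_squared


def detection_limit(expected, observed):
    """Lowest expected concentration among detected sequins, or None."""
    detected = [expected[n] for n in expected if observed.get(n, 0.0) > 0.0]
    return min(detected) if detected else None


# Hypothetical ladder: the lowest-concentration sequin is not detected.
expected = {"sequin_a": 1.0, "sequin_b": 2.0, "sequin_c": 4.0, "sequin_d": 8.0}
observed = {"sequin_a": 0.0, "sequin_b": 6.0, "sequin_c": 12.0, "sequin_d": 24.0}

slope, intercept, r_squared = loglog_fit(expected, observed)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
print("lowest detected concentration:", detection_limit(expected, observed))
```

A slope close to 1 with a high R² suggests that coverage scales proportionally with input concentration, while the lowest detected concentration gives a rough indication of the sensitivity limit of the library.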
Users may also perform de novo assembly of any genomes present in their sample, particularly if they are seeking to characterize novel microbial taxa that do not have published genome sequences. We can use meta-sequins to assess the minimal fold-coverage sufficient for de novo assembly, or to interpret the difficult or erroneous assembly of repetitive sequences. We first performed de novo assembly (using Ray Meta; ref), and evaluated the quality of the resulting sequence assemblies relative to the reference sequences for sequins and MBARC-26 genomes. Plotting the fraction of each sequin that was assembled relative to its known mixture concentration indicated the minimum fold-coverage required to achieve sufficient de novo assembly (Figure 3B).
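The minimum fold-coverage can also be read off programmatically from the same data. The sketch below is one possible way to do this, assuming you already have, for each sequin, its fold-coverage and the fraction of its reference sequence recovered by the assembly. The 0.9 cut-off for "sufficient" assembly and the function name are hypothetical choices for illustration, not values taken from the analysis above.

```python
def min_coverage_for_assembly(records, min_fraction=0.9):
    """Lowest coverage at and above which every sequin is well assembled.

    records is an iterable of (fold_coverage, assembled_fraction) pairs.
    Returns None if even the highest-coverage sequin falls below
    min_fraction. The default min_fraction is an arbitrary assumption.
    """
    # Walk from highest to lowest coverage; for equal coverage, visit the
    # worst-assembled sequin first so a failure at that coverage is not masked.
    ordered = sorted(records, key=lambda r: (-r[0], r[1]))
    threshold = None
    for coverage, fraction in ordered:
        if fraction < min_fraction:
            break
        threshold = coverage
    return threshold


# Hypothetical (fold-coverage, assembled fraction) values for five sequins.
records = [(0.5, 0.10), (2.0, 0.40), (8.0, 0.95), (32.0, 1.00), (128.0, 1.00)]
print("minimum fold-coverage:", min_coverage_for_assembly(records))
```

Because the sequins are sequenced in the same library as the sample, a threshold estimated this way can serve as a guide to which microbe genomes in the sample have enough coverage to expect a reliable assembly.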