Pacific Biosciences published a paper earlier this year on an approach to sequence and assemble a bacterial genome leading to a near-finished, or finished genome. The approach, dubbed Hierarchical Genome Assembly Process (HGAP), is based on only PacBio reads without the need for short-reads. This is how it works:
- generate a high-coverage dataset of the longest reads possible, aim for 60-100x in raw reads
- pre-assembly: use the reads from the shorter part of the raw read length distribution, to error-correct the longest reads, set the cutoff in such a way so that the longest reads make up about 30x coverage
- use the long, error-corrected reads in a suitable assembler, e.g. Celera, to produce contigs
- map the raw PacBio reads back to the contigs to polish the final sequence (rather, recall the consensus using the raw reads as evidence) with the Quiver tool
The approach is very well explained on this website. As an aside, the same principle can now be used with the PacBioToCA pipeline.
The release of the assemblathon2 paper via arXiv, the blog post by Titus Brown and his posting of his review of the paper sparked a discussion in the comments section of Titus’ blog, as well as on the homolog_us blog. With this blog post, I’d like to chip in with a few remarks of my own.
First, I agree with all of what Titus Brown said (‘you know, what he said’). One of his main take-home messages from the Assemblathon2 is the uncertainty associated with any assembly, and how this needs to be communicated better. When I give presentations to introduce ‘this thing called assembly’, I often start out with a quote from the Miller et al. 2010 paper in Genomics (‘Assembly algorithms for next-generation sequencing data’):
An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target
A potential user (‘customer’) of our sequencing platform asked how to generate reference genomes for his 4 bacterial strains. His question inspired me to write this post. The suggestions below are not absolute, just my thoughts on how one these days could go about sequencing a bacterial genome using one or more of the sequencing platforms. I would appreciate any feedback/suggestions in the comments section!
Option 1: bits and pieces
- Libraries: paired end or single end sequencing
- Platform: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior
- Bioinformatics: assembly: Velvet, SOAPdenovo, Newbler, MIRA, Celera
- Outcome: up to hundreds of short contigs (with only single-end reads) or contigs + scaffolds (with paired end reads)
- Pros: fast and cheap, OK for presence/absence of e.g. genes
- Cons: doesn’t give much insight into the genome
- Remarks: due to per-run throughput, multiplexing is recommended; data can also be used for mapping against a reference genome instead
Today, a paper entitled ‘Hybrid error correction and de novo assembly of single-molecule sequencing reads’ came out in Nature Biotechnology by Sergey Koren, Michael Schatz and others. In it, the authors describe a method to error-correct PacBio reads and use them in de novo genome assembly. I was gracefully given an advance copy by Mike Schatz which I used to prepare the following post.
The PacBio RS instrument from Pacific Biosciences gives extremely long reads (several 1000 bases), but with high single-pass error rates (85% accuracy – 15% error). Alternatively, one can use the short-insert mode, where each fragment is sequenced mutliple times (Circular Consensus Sequencing – CCS), resulting in high quality, but much shorter (up to 1 kb – 1000 bases) reads.
Even though in principle, longer reads are ideal for de novo genome assmebly, using the high-error PacBio reads natively is hard: for alignment, the between-read error rate doubles to 30%. So, the long PacBio reads would be most advantageous if the error can be overcome. This is what the authors of the Koren et al. paper try to achieve. In the following, I’ll summarise their main findings.
First, the authors tested where along the reads the error occurs, and, as claimed by the company, there was no bias detected: the average error rate was tightly distributed around the mean along the entire read length. Also, coverage of PacBio reads over the genome they were derived from (in this case, yeast) was very even.
Nick Loman was kind enough to give me an advanced copy of his paper in Nature Biotechnology entitled “Performance comparison of benchtop high-throughput sequencing platforms” (Loman et al, 2012). I thought to present a quick summary of the paper here and add some comments of my own.
The paper sets out to “compare the performance of three sequencing platforms [Roche GS Junior, Ion Torrent PGM and Illumina MiSeq] by analysing data with commonly used assembly and analysis pipelines.” To do this, they chose a strain from the outbreak of food-borne illness caused by Shiga-toxin-producing E. coli O104:H4, which caused a lot of trouble in Germany about a year ago. The study is unique in that it is focuses on the use of these instruments for de novo sequencing, not resequencing.
First, they used the ‘big brother’ of the GS Junior, the GS FLX, to generate a reference genome (combining long reads obtained using the GS FLX+, and mate pairs using Titanium chemistry). Then, the same strains were sequenced on the benchtop instruments, and these reads were compared to the reference assembly. The reads were both compared directly, and after assembly with a few commonly used programs.
(The impatient reader might want to skip to the conclusion at the end of this post…)
Last wednesday, Ion Torrent released a tech note and associated run data with shotgun (single-end) and Mate Pair runs for Escherichia coli K12, substrain MG1655. Both a 3.5 kb and 8.9 kbp insert size, as well as a shotgun library, were sequenced on a 316 chip each. In the tech note, they describe assemblies using different combinations of the data, and show how adding the mate pairs yields assemblies with fewer scaffolds and gaps. The Ion mate-pair protocol is very similar to the one used by 454 Life Sciences for their (unfortunately called) Paired-end libraries: long fragments are circularized using a linker sequence, and sequencing is peformed across this linker, allowing for easy identification of the pair halves.
This is the first real ‘long-distance’ Mate Pair data from Ion Torrent, which is exciting and made me have a close look at it. I was especially interested in how the newbler program, developed by 454 Life Science for their 454 reads, would perform on these data.