Paper: error-correcting PacBio reads using high-quality short reads
July 1, 2012
Today, a paper entitled ‘Hybrid error correction and de novo assembly of single-molecule sequencing reads’ came out in Nature Biotechnology by Sergey Koren, Michael Schatz and others. In it, the authors describe a method to error-correct PacBio reads and use them in de novo genome assembly. I was graciously given an advance copy by Mike Schatz which I used to prepare the following post.
The PacBio RS instrument from Pacific Biosciences gives extremely long reads (several thousand bases), but with high single-pass error rates (85% accuracy, i.e. 15% error). Alternatively, one can use the short-insert mode, where each fragment is sequenced multiple times (Circular Consensus Sequencing, CCS), resulting in high-quality, but much shorter reads (up to 1 kb, i.e. 1000 bases).
Even though in principle longer reads are ideal for de novo genome assembly, using the high-error PacBio reads natively is hard: when two such reads are aligned to each other, the errors of both count, so the between-read error rate doubles to 30%. So, the long PacBio reads would be most advantageous if the error can be overcome. This is what the authors of the Koren et al. paper try to achieve. In the following, I’ll summarise their main findings.
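To put a number on that doubling, here is a minimal sketch of the arithmetic. It is my own illustration, not something from the paper, and it assumes that errors hit each base independently in the two reads. Under that assumption the expected disagreement between two 15%-error reads comes out slightly below the simple additive 30%, because some positions are wrong in both reads at once. The function name is made up for this example.

```python
def pairwise_identity(per_read_error: float) -> float:
    """Chance that a given position is error-free in BOTH reads.

    Assumes errors are independent between the two reads (an assumption
    made for this illustration only).
    """
    return (1.0 - per_read_error) ** 2


single_pass_error = 0.15  # single-pass PacBio error rate quoted in the post
identity = pairwise_identity(single_pass_error)

print(f"additive upper bound on between-read error: {2 * single_pass_error:.2f}")
print(f"between-read error with independent errors: {1 - identity:.4f}")
```

Either way, two raw long reads agree with each other at only about 70% of their positions, which is what makes overlapping them directly so difficult.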
First, the authors tested where along the reads the error occurs, and, as claimed by the company, there was no bias detected: the average error rate was tightly distributed around the mean along the entire read length. Also, coverage of PacBio reads over the genome they were derived from (in this case, yeast) was very even.
Unless one uses a de Bruijn Graph approach (most appropriate for short read datasets), assembly starts by finding overlaps between reads in the so-called Overlap Layout Consensus approach, or OLC assembly. So, the authors next looked at the theoretical detection of overlaps versus error rates. From the paper:
“For both 454 and Illumina, over 80% of the overlaps are detected by 3% error. By contrast, on PacBio, only 10% are detected at 15% error”
In other words, finding overlaps between short, high quality reads and the long PacBio reads is much easier than between PacBio reads themselves.
Long reads are expected to help most in OLC, and that is what they focussed on.
The authors have developed a method to use the high-quality short reads, be it Illumina, 454 or CCS PacBio reads, to correct the errors in the long, low-quality PacBio reads. They built a pipeline for achieving this on top of the Celera assembler. Celera was developed for Sanger-type reads, and is the only assembly program from that era that has been adjusted to be able to tackle the newest technologies as well (first 454, then Illumina, and now PacBio).
The pipeline, called PacBioToCA, is available with the latest Celera release, see this link.
How does this approach work?
- first, create long, single-pass PacBio reads and short, high quality Illumina or 454 reads, or PacBio circular consensus reads
- map the short reads to the long reads, starting by finding perfect 14-nucleotide overlaps
- for PacBio reads containing repeats, choose the mapping which maximizes the identity between the short read and the PacBio read
- build consensus (using AMOS) for the long read based on the aligned short reads, breaking the long reads when there is no coverage with short reads
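The steps above can be sketched in a few lines of Python. This is a toy illustration of the idea only, not the PacBioToCA code: it seeds placements with exact 14-mers, keeps the placement with the highest identity for each short read, takes a per-position majority vote, and breaks the long read where no short read covers it. As a loud simplification, it only handles substitution errors with ungapped placements, whereas the real pipeline has to deal with insertions and deletions and builds its consensus with AMOS. All function names here are hypothetical.

```python
from collections import Counter, defaultdict

K = 14  # seed length mentioned in the post


def index_kmers(long_read, k=K):
    """Map every k-mer in the long read to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(long_read) - k + 1):
        index[long_read[i:i + k]].append(i)
    return index


def best_placement(short, long_read, index, k=K):
    """Return (offset, identity) of the best ungapped placement, or None.

    Candidate offsets come from exact k-mer seed matches; among them the
    one with the highest identity wins (ties go to the leftmost offset).
    """
    candidates = set()
    for j in range(len(short) - k + 1):
        for i in index.get(short[j:j + k], ()):
            offset = i - j
            if 0 <= offset <= len(long_read) - len(short):
                candidates.add(offset)
    best = None
    for offset in sorted(candidates):
        window = long_read[offset:offset + len(short)]
        identity = sum(a == b for a, b in zip(short, window)) / len(short)
        if best is None or identity > best[1]:
            best = (offset, identity)
    return best


def correct(long_read, short_reads, k=K):
    """Return corrected pieces of the long read (split at uncovered gaps)."""
    index = index_kmers(long_read, k)
    votes = [Counter() for _ in long_read]
    for short in short_reads:
        hit = best_placement(short, long_read, index, k)
        if hit is None:
            continue  # no exact seed: this short read is not used
        offset, _ = hit
        for pos, base in enumerate(short, start=offset):
            votes[pos][base] += 1

    pieces, current = [], []
    for column in votes:
        if column:
            current.append(column.most_common(1)[0][0])  # majority base
        elif current:
            pieces.append("".join(current))  # coverage gap: break the read
            current = []
    if current:
        pieces.append("".join(current))
    return pieces


# Tiny made-up example: one substitution error in the long read.
truth = "ACGTTGCATGCCGATAGCTTAGGCTAACGTCGATCGGATC"
noisy_long_read = truth[:20] + "C" + truth[21:]
print(correct(noisy_long_read, [truth[:30], truth[10:]]))
```

Note that the consensus is built from the short reads alone, so any stretch of the long read without short-read coverage is dropped rather than kept uncorrected. That mirrors the breaking of long reads described in the last step.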
Special care is needed when dealing with repeats, and the pipeline tackles this by trying to divide the short reads among the different copies in such a way as to maximise the chance that the correct repeat copy is reconstructed in the end. In an email, Mike Schatz told me he expects repeats that are up to 99% identical to be corrected to the right sequence. This is, however, dependent on the evenness of both the read coverage and the errors in the PacBio reads.
Error correction yields slightly shorter reads, due to some regions not having (enough) short-read coverage. Also, some reads are too short to be included for correction. In total, about 50%-60% of the raw PacBio reads end up in the corrected set. But still, after correction, one has a long-read dataset, with many reads of several Kbp and with >99% accuracy!
The pipeline was tested on lambda, E. coli, yeast and parrot (from assemblathon2) datasets, with 50X of Illumina reads for the correction. 50X was determined to be the sweet spot for short-read coverage. From the paper:
“The accuracy of the long reads improved from ≈ 85% to over 99.9%, and chimeric and improperly trimmed reads measured < 2.5% and < 1% respectively”
Next, these corrected reads were used in hybrid de novo assembly using Celera. For this, the maximum input read size in Celera had to be increased to deal with long error-corrected reads. Except for parrot, reference genomes could be used to check the resulting assemblies.
The effect of adding corrected PacBio reads was very promising. Up to a tripling of contig sizes (as measured by contig N50s) was found, without introducing more errors (the N50 size is the size such that half of the total length is in contigs of at least this length). For bacteria, contig N50s of several hundred Kbp were obtained; one assembly even had a contig N50 of more than 0.5 Mbp.
These gains were mainly due to resolving long repeats. As repeats are the biggest problem for reconstructing genomes from reads, this is a significant finding. The authors' simulations showed that with increases in read length, single-contig (!) assembly of E. coli type bacterial genomes is possible using this approach.
For the large heterozygous parrot genome assembly, the best assembly was obtained using 454 + PacBio reads corrected using Illumina data, with a contig N50 of almost 100 Kbp. Such a large contig N50 is usually very hard to obtain! The parrot Illumina-only (using ALLPATHS_LG) and 454-only (using Celera) assemblies with optimal read datasets had contig N50s of 47 Kbp and 75 Kbp, respectively. Mapping known transcripts to the contigs and scaffolds showed good-quality gene reconstructions in the assemblies with error-corrected PacBio reads. Interestingly, ALLPATHS_LG managed to assemble and scaffold more exons than Celera. Notable was also that the >70% GC promoter region in the ERG1 genes was only assembled without a gap using PacBio reads.
So, assemblies done using lower coverage PacBio corrected reads can outperform higher coverage short-read-only assemblies, even without the use of mate pairs. Hybrid assemblies reached maximum N50s at around 10X coverage of corrected PacBio reads.
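Since the N50 comes up repeatedly in what follows, here is a minimal sketch of how it can be computed from a list of contig lengths. The function name and the example lengths are made up for illustration; this is the standard definition, not code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are walked from longest to shortest; the N50 is the length of
    the contig at which the running sum first reaches half the total.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("need at least one contig with non-zero length")


# Made-up contig lengths (in bp), total 300: the two longest contigs
# together hold 180 bp, which is more than half.
print(n50([100, 80, 60, 40, 20]))
```

Because the N50 only depends on how the total length is distributed over the contigs, merging a few contigs across long repeats can raise it considerably.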
Finally, the authors show how error-corrected PacBio reads also aid in transcriptome assembly, and can help in splice variant discovery.
The high error rate of the single-pass PacBio reads has led to a good deal of skepticism in the community about how useful this technology is. With this paper, a first step is made to show that in combination with other technologies (or CCS reads), long PacBio reads can be very useful for de novo genome assembly, also of large complex genomes. Significantly longer contigs can be obtained, and single-contig bacterial genome assemblies may become possible in the near future with modest increases in PacBio raw read lengths.
For de novo assembly, long PB reads can be an attractive alternative to mate pairs.
There are a few drawbacks of using this approach. The error-correction is computationally intensive, requiring a good dose of CPUs. The correction of the parrot data took a week. If one were ever to generate a suitable long-read PacBio dataset for the human genome, the authors estimated error-correcting these to take 10 days on 250 CPUs. So, adding PacBio reads and error-correcting them effectively doubles assembly times. Also, not all labs may have access to the required compute resources. Another problem is the low throughput (50-60% of the reads survive the error correction). Here, methods to overcome the low short-read coverage (which breaks up some long reads) are needed.
I am very excited by this paper. It shows how the main problem people have with PacBio, the high error rate, can be overcome. Using the PacBio, combined with cheap, short reads, can result in reads up to several Kbp of very high quality. This is a dream for everyone doing de novo genome assembly. I also have hopes that these long reads, with appropriate tools, can help in assembling heterozygous genomes, and aid in haplotype separation. As raw PacBio read lengths overlap the lengths of whole transcripts, de novo transcriptome assembly using this approach may yield many more full-length transcripts than is currently possible.
I think we have now seen the first of hopefully more approaches to correct PacBio reads. I would think other smart bioinformaticians are now inspired to search for faster, less computationally intensive methods for correction.
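To make the human-genome estimate more tangible, here is a small back-of-the-envelope sketch. The 10 days and 250 CPUs are the figures quoted above; the conversion to other cluster sizes assumes perfect linear scaling, which is my simplifying assumption and not a claim from the paper. The function names are made up.

```python
def cpu_hours(days, cpus):
    """Total CPU-hours consumed by a job running `days` on `cpus` CPUs."""
    return days * 24 * cpus


def wall_clock_days(total_cpu_hours, cpus):
    """Wall-clock days on `cpus` CPUs, ASSUMING perfect linear scaling."""
    return total_cpu_hours / (24 * cpus)


budget = cpu_hours(days=10, cpus=250)
print(f"estimated human-genome correction: {budget} CPU-hours")

# Hypothetical cluster sizes, purely for illustration.
for cpus in (50, 250, 1000):
    print(f"{cpus:>5} CPUs: about {wall_clock_days(budget, cpus):.1f} days")
```

Real jobs rarely scale perfectly, so the numbers for other cluster sizes should be read as optimistic lower bounds.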
Many people talk about how nanopore-based extremely long read technologies will make PacBio obsolete in the near future. I keep saying, seeing is believing (who outside the companies has real data in their hands?). And, if you want long reads today, PacBio is your only option.
Pacific Biosciences needs papers like this to show that their technology actually is going to deliver. Now it is up to the research community to use these long high-quality reads for de novo assembly of complex genomes. This is something we in our group have recently started with.
Datasets, recipes etc. can be found here.