A potential user (‘customer’) of our sequencing platform asked how to generate reference genomes for his 4 bacterial strains. His question inspired me to write this post. The suggestions below are not absolute, just my thoughts on how one these days could go about sequencing a bacterial genome using one or more of the sequencing platforms. I would appreciate any feedback/suggestions in the comments section!
Option 1: bits and pieces
- Libraries: paired end or single end sequencing
- Platform: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior
- Bioinformatics: assembly: Velvet, SOAPdenovo, Newbler, MIRA, Celera
- Outcome: up to hundreds of short contigs (with only single-end reads) or contigs + scaffolds (with paired end reads)
- Pros: fast and cheap, OK for presence/absence of e.g. genes
- Cons: doesn’t give much insight into the genome
- Remarks: due to per-run throughput, multiplexing is recommended; data can also be used for mapping against a reference genome instead
Image from Wikimedia Commons. (Buzz was a low-cost airline based at London Stansted operating services to Europe. It was sold to Ryanair.)
I am not attending the American Society of Human Genetics meeting in San Francisco, but can’t escape the buzz it creates on Twitter (hashtag #ashg2012). Strikingly, it is almost another AGBT when it comes to announcements from companies selling sequencing instruments. All of them had something new to bring to the floor. This post summarizes what I picked up from Twitter and a few websites, and I give a bit of my perspective on the respective announcements. I am focussing on technology improvements, especially with regard to read lengths, not so much on applications such as cancer resequencing panels.
After having given an overview of the PacBio error-correction (PacBioToCA) pipeline of Koren et al (see previous blog post), it was interesting to see another paper coming out that describes combining PacBio and Illumina for assembling bacterial genomes: Ribeiro et al, “Finished bacterial genomes from shotgun sequence data”, Genome Research accepted preprint. The authors are all from the Broad Institute. David Jaffe was so kind as to provide me with the supplementary material file, which is not yet available online. In this post, I will summarise the ALLPATHS_LG paper, and contrast the approach with the PacBioToCA pipeline.
The Broad Institute of MIT/Harvard is an impressive genome centre. Sorry, ‘Genomic Medicine Center’. One of their achievements in recent years is an optimised pipeline for assembly of (small and) large genomes based on short read (Illumina) data. The software program they developed, ALLPATHS_LG, combined with a special type of ‘recipe’, has proven to be extremely successful. The Broad keeps churning out very respectable assemblies of, among other groups, large complex eukaryotes. To be able to handle large numbers of genome projects, I get the impression that the Broad Institute aims for standardisation and optimisation of protocols and of what they call ‘recipes’. The idea is that, if you follow their recipes and use their programs, you are more or less guaranteed an optimal result. In the case of ALLPATHS_LG, it also means, however, that one is required to have at least one Illumina jumping (‘mate pair’) library and a short-insert (‘paired-end’) Illumina library for which the insert size is shorter than twice the read length. Paired end libraries with slightly larger insert sizes are standard, making the ALLPATHS_LG requirements somewhat uncommon. It also means that few projects are able to compare other programs with ALLPATHS_LG without generating an extra read dataset (I speak from experience here…).
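To make the "insert size shorter than twice the read length" requirement concrete, here is a minimal sketch of my own (not something taken from ALLPATHS_LG). It assumes "insert size" means the full fragment length and ignores adapters; the function names and the example numbers are hypothetical.

```python
def overlap_length(read_length: int, insert_size: int) -> int:
    """Number of bases shared by the two reads of a pair.

    Assumes both reads start at opposite ends of the same fragment and that
    insert_size is the full fragment length (an assumption of this sketch).
    """
    # A read cannot extend past the end of the fragment (adapters ignored).
    effective = min(read_length, insert_size)
    return max(0, 2 * effective - insert_size)


def pairs_overlap(read_length: int, insert_size: int) -> bool:
    """True when the insert is shorter than twice the read length."""
    return insert_size < 2 * read_length


# Hypothetical numbers: 100 bp reads from a 180 bp fragment overlap by 20 bp,
# while the same reads from a 300 bp fragment do not overlap at all.
print(overlap_length(100, 180), pairs_overlap(100, 180))
print(overlap_length(100, 300), pairs_overlap(100, 300))
```

With overlapping pairs, the two reads can in principle be merged into one longer sequence, which is presumably why the insert size has to stay below twice the read length.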
The Ribeiro et al paper focusses entirely on microbial-size genomes, a big difference with the Koren et al (PacBioToCA) paper. The short-read recipe for ALLPATHS_LG remains unchanged: 50x coverage short-insert (up to 220bp) paired reads and 50x coverage jumping (2-10kb) reads. This is then supplemented with 50x PacBio reads, with library insert sizes between 1 and 3 kb (although for the paper, they also created two larger insert libraries for two of the strains, at 6 and 10kb respectively). In total, there are three size ranges represented this way: short – overlapping paired reads, intermediate – PacBio reads, long – Illumina mate pairs. Finally, the short reads have high quality relative to the PacBio reads.
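As a rough back-of-the-envelope for what such a recipe means in terms of sequencing, coverage is simply the total number of sequenced bases divided by the genome size. The sketch below is my own; the 5 Mb genome and the read lengths are hypothetical values chosen for illustration, only the 50x targets come from the recipe.

```python
def reads_needed(genome_size: int, coverage: int, read_length: int) -> int:
    """Reads required so that reads * read_length / genome_size >= coverage."""
    total_bases = genome_size * coverage
    # Integer ceiling division, avoids floating point rounding.
    return -(-total_bases // read_length)


GENOME_SIZE = 5_000_000  # hypothetical 5 Mb bacterial genome

# (target coverage, assumed mean read length in bases) per library type;
# the read lengths are assumptions, not values from the paper.
recipe = {
    "short-insert paired reads": (50, 100),
    "jumping reads": (50, 100),
    "PacBio reads": (50, 2000),
}

for library, (coverage, read_length) in recipe.items():
    n = reads_needed(GENOME_SIZE, coverage, read_length)
    print(f"{library}: about {n:,} reads for {coverage}x")
```

For paired libraries, the number of pairs is half the number of reads.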
Today, a paper entitled ‘Hybrid error correction and de novo assembly of single-molecule sequencing reads’ came out in Nature Biotechnology by Sergey Koren, Michael Schatz and others. In it, the authors describe a method to error-correct PacBio reads and use them in de novo genome assembly. I was graciously given an advance copy by Mike Schatz, which I used to prepare the following post.
The PacBio RS instrument from Pacific Biosciences gives extremely long reads (several thousand bases), but with high single-pass error rates (85% accuracy – 15% error). Alternatively, one can use the short-insert mode, where each fragment is sequenced multiple times (Circular Consensus Sequencing – CCS), resulting in high quality, but much shorter (up to 1 kb – 1000 bases) reads.
Even though, in principle, longer reads are ideal for de novo genome assembly, using the high-error PacBio reads natively is hard: when two such reads are aligned to each other, the between-read error rate doubles to 30%. So, the long PacBio reads would be most advantageous if the error can be overcome. This is what the authors of the Koren et al. paper try to achieve. In the following, I’ll summarise their main findings.
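The 30% figure is the simple sum of the two per-read error rates. As a small sketch of my own (assuming errors are independent between reads and uniform along each read, which is a simplification), the chance that an aligned position is affected by an error in at least one of the two reads comes out slightly lower:

```python
def pairwise_error_upper_bound(error_rate: float) -> float:
    """Simple sum of the error rates of two reads (the 'doubling')."""
    return 2 * error_rate


def pairwise_error_independent(error_rate: float) -> float:
    """Probability that at least one of two reads is wrong at a position,
    assuming independent errors (an assumption of this sketch)."""
    return 1 - (1 - error_rate) ** 2


single_pass_error = 0.15  # 85% accuracy, as quoted for single-pass reads

print(f"sum of errors:      {pairwise_error_upper_bound(single_pass_error):.2%}")
print(f"independent errors: {pairwise_error_independent(single_pass_error):.2%}")
```

Either way, an overlapper would have to tolerate well over a quarter of the aligned positions disagreeing, far more than assemblers built for low-error reads expect.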
First, the authors tested where along the reads the error occurs, and, as claimed by the company, there was no bias detected: the average error rate was tightly distributed around the mean along the entire read length. Also, coverage of PacBio reads over the genome they were derived from (in this case, yeast) was very even.
(Picture found on rockstar-pickup.com, of all places)
I received an email today via the assemblathon mailing list. Erich Jarvis, from the Howard Hughes Medical Institute, wrote that he had asked somebody at Pacific Biosciences for some feedback on my previous post regarding the PacBio parrot reads.
There are a few things in the response worth repeating here.
A quick summary for the impatient: A set of more than 4 million PacBio reads in total, up to 17kb in length, is available as part of the assemblathon. Most of the reads are short (peaking at 600-700 or 950 bases), but a significant fraction is very long. The average read quality metrics show no reads above Q11.
The assemblathon is a ‘competition’ to enhance genome assembly. After the first round earlier this year, which was solely based on simulated reads, assemblathon 2 has now started, solely based on real reads. Researchers are asked to come up with their best assemblies, which will be compared against each other using several metrics.
Two of the assemblathon 2 genomes have only Illumina sequences, but for the third one, a parrot, there are reads available from Illumina, 454 and PacBio.
As we are about to get our PacBio RS instrument later this year, I decided to have a look at the data available for the parrot.
First some info from the readme file:
“Two libraries were created at 7.5kb and 13kb insert sizes. The 7.5kb library was sequenced at 45 and 90 minutes and the 13kb library was only sequenced at 90 minute movies. The raw reads are filtered on a ReadQuality metric split on the adapter region to generate ‘subreads’.”
I guess this is for those reads where the sequencing goes around the SMRTBell and reads the same molecule twice. It further states that reads are only included if they are at least 100 bp (the 7.5 kb insert runs) or 500 bp (the 13 kb insert run) long, and have an internally developed (it seems) read quality parameter of at least 0.75 ‘raw single pass accuracy.’
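To relate an accuracy such as 0.75 to the more familiar Q values, one can use the standard Phred definition, Q = -10 * log10(error probability). The sketch below is my own and not PacBio's filtering code: `keep_subread` is a hypothetical helper that reads the length values from the readme as minimum lengths (my interpretation), and the example subreads are made up.

```python
import math


def accuracy_to_phred(accuracy: float) -> float:
    """Phred quality for a given accuracy (accuracy must be below 1.0)."""
    return -10 * math.log10(1 - accuracy)


def phred_to_accuracy(q: float) -> float:
    """Accuracy implied by a Phred quality value."""
    return 1 - 10 ** (-q / 10)


def keep_subread(length: int, accuracy: float,
                 min_length: int, min_accuracy: float = 0.75) -> bool:
    """Hypothetical filter: keep subreads passing both thresholds."""
    return length >= min_length and accuracy >= min_accuracy


print(f"accuracy 0.75 -> Q{accuracy_to_phred(0.75):.1f}")
print(f"accuracy 0.85 -> Q{accuracy_to_phred(0.85):.1f}")
print(f"Q11 -> accuracy {phred_to_accuracy(11):.3f}")

# Made-up subreads as (length in bp, accuracy), filtered with the
# 500 bp minimum quoted for the 13 kb library.
subreads = [(450, 0.90), (1200, 0.70), (3000, 0.82)]
print([s for s in subreads if keep_subread(*s, min_length=500)])
```

On this scale, an accuracy of 0.75 corresponds to roughly Q6, and Q11 to about 92% accuracy.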