Combining short and long reads: choosing between PacBioToCA and the new ALLPATHS_LG

After having given an overview of the PacBio error-correction (PacBioToCA) pipeline of Koren et al (see previous blog post), it was interesting to see another paper coming out describing combining PacBio and Illumina for assembling bacterial genomes: Ribeiro et al, “Finished bacterial genomes from shotgun sequence data“, Genome research accepted preprint. The authors are all from the Broad Institute. David Jaffe was so kind as to provide me with the supplementary material file, which so far is not yet available online. In this post, I will summarise the ALPATHS_LG paper, and contrast the approach with the PacBioToCA pipeline.

Picture from zazzle.com (http://bit.ly/RMGgZR)

The Broad Institute of MIT/Harvard is an impressive genome centre. Sorry, ‘Genomic Medicine Center’. One of their achievements in recent years is an optimised pipeline for assembly of (small and) large genomes based on short read (Illumina) data. The software program they developed, ALLPATHS_LG, combined with a kind of special type of ‘recipe’, has proven to be extremely succesful. The Broad keeps churning out very respectable assemblies of, among other groups, large complex eukaryotes. In order to be able to handle large amounts of genome projcts, I get the impression that the Broad Institute aims for standardisation and optimisation of protocols and, what they call ‘recipes’. The idea being that, if you follow their recipes and use their programs, you are more or less guaranteed an optimal result. In the case of ALLPATHS_LG, it also means, however, that one is required to have at least one Illumina jumping (‘mate pair’) library and a short-insert (‘paired-end’) Illumina library for which the insert size is shorter than twice the read length. Paired end libraries with slightly larger insert sizes are standard, making the ALLPATHS_LG requirements kind of uncommon. It also means not many projects are only able to compare other programs with ALLPATHS without generating an extra read dataset (I speak from experience here…).

The Ribeiro et al paper is entirely focussing on microbial size genomes, a big difference with the Koren et al (PacBioToCA) paper. The short-read recipe for ALLPATHS_LG remains unchanged: 50x coverage short-insert (up to 220bp) paired reads and 50x coverage jumping (2-10kb) reads. This is then supplemented with 50x PacBio reads, with library insert sizes between 1 and 3 kb (although for the paper, they also created two larger insert libraries for two of the strains, at 6 and 10kb respectively). In total, there are three size ranges represented this way: short – overlapping paired reads, intermediate – PacBio reads, long – illumina mate pairs. Finally, the short reads have high quality relative to the PacBio reads.

Continue reading