Combining short and long reads: choosing between PacBioToCA and the new ALLPATHS_LG

After having given an overview of the PacBio error-correction (PacBioToCA) pipeline of Koren et al (see previous blog post), it was interesting to see another paper come out describing the combination of PacBio and Illumina data for assembling bacterial genomes: Ribeiro et al, “Finished bacterial genomes from shotgun sequence data”, Genome Research accepted preprint. The authors are all from the Broad Institute. David Jaffe was so kind as to provide me with the supplementary material file, which is not yet available online. In this post, I will summarise the ALLPATHS_LG paper and contrast the approach with the PacBioToCA pipeline.

Picture from zazzle.com (http://bit.ly/RMGgZR)

The Broad Institute of MIT/Harvard is an impressive genome centre. Sorry, ‘Genomic Medicine Center’. One of their achievements in recent years is an optimised pipeline for the assembly of (small and) large genomes based on short-read (Illumina) data. The software program they developed, ALLPATHS_LG, combined with a specific ‘recipe’ for the input libraries, has proven to be extremely successful. The Broad keeps churning out very respectable assemblies of, among other groups, large complex eukaryotes. In order to handle large numbers of genome projects, I get the impression that the Broad Institute aims for standardisation and optimisation of protocols and what they call ‘recipes’. The idea is that, if you follow their recipes and use their programs, you are more or less guaranteed an optimal result. In the case of ALLPATHS_LG, however, it also means that one is required to have at least one Illumina jumping (‘mate pair’) library and a short-insert (‘paired-end’) Illumina library for which the insert size is shorter than twice the read length. Paired-end libraries with slightly larger insert sizes are the standard, making the ALLPATHS_LG requirements somewhat uncommon. It also means that not many projects are able to compare other programs with ALLPATHS_LG without generating an extra read dataset (I speak from experience here…).
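
To make that overlap requirement concrete, here is a minimal sketch (my own, not from the paper; the read lengths and insert sizes are made up for illustration) of the check one would do on a paired-end library:

```python
# Minimal sketch: does a paired-end library meet the ALLPATHS_LG requirement
# that the two reads of a pair overlap, i.e. that the insert size is shorter
# than twice the read length? Example numbers are made up for illustration.

def reads_overlap(read_length_bp: int, insert_size_bp: int) -> bool:
    """True if forward and reverse reads of a pair are expected to overlap."""
    return insert_size_bp < 2 * read_length_bp

# A 'standard' 2x100 bp library with a ~300 bp insert does not qualify...
print(reads_overlap(read_length_bp=100, insert_size_bp=300))  # False
# ...whereas a 2x150 bp library with a ~180 bp insert does.
print(reads_overlap(read_length_bp=150, insert_size_bp=180))  # True
```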

The Ribeiro et al paper focuses entirely on microbial-sized genomes, a big difference from the Koren et al (PacBioToCA) paper. The short-read recipe for ALLPATHS_LG remains unchanged: 50x coverage of short-insert (up to 220 bp) paired reads and 50x coverage of jumping (2-10 kb) reads. This is then supplemented with 50x PacBio reads, with library insert sizes between 1 and 3 kb (although for the paper, the authors also created two larger insert libraries for two of the strains, at 6 and 10 kb respectively). In total, three size ranges are represented this way: short – overlapping paired reads, intermediate – PacBio reads, long – Illumina mate pairs. Finally, the short reads have high quality relative to the PacBio reads.
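
As a quick illustration of what such a recipe means in practice, here is some back-of-the-envelope arithmetic (my own; the genome size and read lengths are assumptions, not numbers from the paper) for how many reads 50x coverage corresponds to:

```python
# Rough arithmetic behind the 50x recipe: how many reads does 50x coverage of
# a bacterial genome correspond to? Genome size and read lengths below are
# assumptions for illustration only.

def reads_for_coverage(genome_size_bp: int, coverage: float, read_length_bp: float) -> int:
    """Number of reads needed so that reads * read_length ~= coverage * genome_size."""
    return int(round(coverage * genome_size_bp / read_length_bp))

genome_size = 4_000_000  # a hypothetical 4 Mbp bacterial genome
print(reads_for_coverage(genome_size, 50, 100))   # 50x in 100 bp Illumina reads: 2,000,000 reads
print(reads_for_coverage(genome_size, 50, 1100))  # 50x in ~1.1 kb PacBio reads: ~181,818 reads
```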

The ALLPATHS_LG assembly algorithm was adjusted to take the PacBio reads into consideration. Without going too much into the details (described in detail in the supplementary material, some of it a bit over my head), ALLPATHS_LG first builds a unipath graph using K=64. An interesting, and surprising, point the authors mention here is that

“… because [the reads from the Illumina jumping libraries] can be less biased than [the reads from the short-insert paired-end libraries] we next fill some gaps in the unipath graph using the jumping reads, ignoring their pairing.”

This is the first time I have heard of this apparently lower GC bias for jumping reads compared to paired-end reads. The authors have this to say about it:

“These differences in bias are not well understood but might be attributable to protocol differences between data types or sample-to-sample variability in protocol instantiation.”

Next, rather than correcting the long, error-rich PacBio reads, as Koren et al chose to do, these long reads are used directly to fill gaps in the unipath graph. Then the PacBio reads are overlaid on the unipath graph to help resolve repeats, and consensus sequences are built. Using these consensus sequences, a new unipath graph is built, this time using the unusually large K value of 640 (approximately half the average PacBio read length). Finally, the long-jump Illumina reads are used to build the assembly graph. The end result is this graph, representing the maximum amount of information that can be obtained from the data. The authors present these graphs in the paper. From them, fasta sequences are extracted and reported as the familiar contigs and scaffolds. A big advantage of the graph representation is that the relationships between the contigs/scaffolds are reported, which makes finishing efforts, and comparative analyses, much easier.
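
For readers unfamiliar with the term: a unipath is essentially a maximal non-branching path in a K-mer graph. The toy sketch below (my own, with made-up reads and K=5; it is emphatically not the ALLPATHS_LG implementation, which uses K=64 for the initial graph and K=640 after the PacBio step) illustrates the idea:

```python
# Toy illustration of the 'unipath' idea: build a de Bruijn-style graph from
# perfect reads at a given K and collapse non-branching paths into unipaths.
# The reads and K=5 are made up; this is not the ALLPATHS_LG implementation.
from collections import defaultdict

def unipaths(reads, k):
    # K-mer graph: edge from each k-mer to the next k-mer in every read
    succ, pred = defaultdict(set), defaultdict(set)
    kmers = set()
    for r in reads:
        for i in range(len(r) - k):
            a, b = r[i:i + k], r[i + 1:i + k + 1]
            succ[a].add(b)
            pred[b].add(a)
            kmers.update((a, b))
    # A unipath starts where the graph branches (or where there is no predecessor)
    starts = [v for v in kmers
              if len(pred[v]) != 1 or len(succ[next(iter(pred[v]))]) != 1]
    paths = []
    for v in starts:
        path = v
        while len(succ[v]) == 1:            # extend while the path does not branch
            w = next(iter(succ[v]))
            if len(pred[w]) != 1:
                break
            path += w[-1]
            v = w
        paths.append(path)
    return paths

reads = ["ACGTACGGA", "GTACGGATT", "CGGATTACA"]
print(unipaths(reads, k=5))  # ['ACGTACGGATTACA'] for these overlapping toy reads
```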

The paper describes 16 datasets and assemblies, all bacterial genomes ranging from 1.6 to 6 Mbp, with GC percentages ranging from 27 to 69%. For three species, reference genomes were available (although the authors went to great lengths to correct these based on their read datasets – in itself interesting reading). Samples were oversequenced, and the datasets then reduced to the 50x coverage levels of the recipe. For Illumina, libraries were pooled and sequenced on the HiSeq (using V3 TruSeq chemistry). For the PacBio datasets, C1 chemistry (the first version, as per April 2011) was used for the 3 kb libraries, but notably, for two strains, C2 chemistry (released February 2012, with a doubling of the read length and increased per-base read qualities) was used. For these strains, 6 kb and 10 kb insert libraries were prepared. Between 8 and 30 SMRTCells (‘chips’) were used (with the exception of one strain, for which only 4 SMRTCells were used – incidentally, this was one of the strains for which C2 chemistry was used).

The authors assembled the data using ALLPATHS_LG with default settings and describe, using up to 48 CPUs, runtimes of up to 2.5 hours and peak memory usage of up to 46 GB. The exception was one strain that took 6.5 hours and needed 83 GB of RAM. This means that a modest server – by today’s standards – with, for example, 96 GB of RAM can handle these assemblies (and one would get quite far with only 48 GB of RAM).

For the three strains for which a reference genome was available, the assemblies could be compared to these. All three assemblies showed the main, large chromosome assembled into one gapless (!) circular contig. A very large repeat shared between two plasmids caused a break in the contigs representing these. One genome assembled without errors; the other two had very few (2 and 4 errors respectively, substitutions plus one single-base deletion). For the other 13 strains, three had the main chromosome represented as a closed circle, and many others were largely resolved except for some very long repeats. This is quite an amazing result! In the words of the authors:

“The resulting assemblies appear at least in some cases to be better than the finished genome reference sequences produced in the past, and yet can be produced rapidly at a cost that is about an order of magnitude lower.”

In the supplementary material, the authors give an overview of the costs of an assembly using their recipe (direct costs, with amortisation of instruments). They arrive at $3,133 for sequencing and assembly (excluding finishing), based on some assumptions. I dare say this is quite affordable! However, it remains to be seen whether labs other than the Broad – which is focussed entirely on these kinds of projects – can get this done for the same price.

On the ALLPATHS_LG blog it is mentioned that:

“The Broad Institute is preparing to offer a service that takes as input a bacterial DNA sample and provides as output a genome assembly based on the new laboratory and computational methods.”

Before I round off, I have two comments on the libraries presented in the paper. First, the authors mention “that because of the wide distribution of the jumping pair fragment sizes (2-10 kb), it was necessary to devise a new statistical model to compute these distances.” The explanation for this wide distribution is that “jumping libraries were size selected using magnetic beads rather than agarose gel”. This, I think, has to do with the Broad’s strategy of simplifying and streamlining protocols – getting rid of the gel step is an important factor, but it consequently yields jumping reads with wider insert-size distributions.

Second, the authors mention that coverage by PacBio reads over 2 or 3 kb is much lower than the ‘raw’ coverage of reads coming off the instrument. This has to do with the fact that the distribution of PacBio read lengths has a peak much lower than the average, and a long tail towards long read lengths. Also, reads need to be split into shorter subreads if the polymerase sequenced the same insert multiple times. In the supplementary material, tables are presented with absolute coverage at 1, 2 and 3 kb. I used the data to calculate relative coverage (taking the coverage by all reads as 100%), and plotted the results:

Data from Ribeiro et al, 2012, Supplementary Table 3, normalised as percentage of coverage at 0 kb (all reads). The same strain identifiers (1-16) as in the article are used.
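
For completeness, this is roughly how the normalisation behind the plot was done; the coverage numbers below are made-up placeholders, the real values are in Supplementary Table 3:

```python
# Each strain's coverage by reads (or subreads) longer than a length threshold
# is expressed as a percentage of its coverage by all reads (the 0 kb column).
# The numbers here are placeholders, not the real Supplementary Table 3 values.

coverage_by_min_length = {   # strain -> {minimum read length (kb): coverage (x)}
    "strain_A": {0: 52.0, 1: 31.0, 2: 12.0, 3: 4.0},
    "strain_B": {0: 55.0, 1: 42.0, 2: 28.0, 3: 17.0},
}

for strain, cov in coverage_by_min_length.items():
    total = cov[0]  # coverage by all reads = 100%
    relative = {kb: 100.0 * c / total for kb, c in cov.items()}
    print(strain, {kb: round(pct, 1) for kb, pct in relative.items()})
```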

Note the two strains standing out with significantly more relative coverage at 1, 2 and 3 kb: numbers 4 and 8, something not really commented on by Ribeiro et al. Strains 4 and 8 are also the strains that were sequenced using the newer C2 chemistry and longer insert libraries (6 and 10 kb instead of 3 kb for the C1 libraries). As the inserts for these libraries are longer, the subreads will be longer, resulting in more coverage by long reads than for the 3 kb libraries sequenced with C1 chemistry. See below for why this is significant.

How does this method compare to the PacBioToCA pipeline presented by Koren et al?

  • Error-correcting the PacBio reads as described in the Nature Biotech paper did not result in single-contig bacterial genome assemblies. In that respect, the ALLPATHS_LG approach outperforms PacBioToCA. However, no jumping libraries were used for the bacterial genomes presented in the Koren et al paper.
  • The PacBioToCA pipeline is computationally very intensive (more or less doubling assembly times), requiring a compute grid or a big server, while the ALLPATHS_LG assemblies were done in a few hours on a modest server.
  • Around 10x coverage of uncorrected PacBio reads was sufficient for the error-correction + assembly approach, while ALLPATHS_LG needs 50x PacBio coverage. The higher coverage requirement for ALLPATHS_LG probably has to do with the need to obtain enough long reads to resolve repeats and fill gaps in the unipath graph. Error-corrected long PacBio reads by themselves carry enough information to yield good enough overlaps for assembly, thereby reducing the coverage needed. Note that the increased coverage obtained with longer insert libraries and C2 chemistry might bring the ALLPATHS_LG coverage requirement down.
  • Ribeiro et al report that, based on four months of experience with the C2 chemistry, 3 PacBio SMRTCells would be needed for 50x coverage. Compared to the 10x needed for PacBioToCA, this translates to 1 SMRTCell for that program (a back-of-the-envelope sketch of this arithmetic follows this list). Not a huge difference, but still.
  • It needs to be said that the Koren et al paper was entirely based on C1 chemistry, so improvements may be expected with C2 chemistry-based datasets.
  • For the ALLPATHS_LG approach to work, one has to tailor the Illumina libraries to it, requiring the short-insert library to yield partially overlapping reads, and the sequencing of a jumping library. The PacBioToCA error-correction approach only needs one short-insert read dataset (although assemblies do improve if jumping libraries are used).
  • The ALLPATHS_LG approach is currently only available for bacterially sized genomes. In principle, as the authors note, the program can be adapted to work for larger, and even complex eukaryotic, genomes. However, we are not there yet. As of today, the PacBioToCA error-correction pipeline is already available for large genomes, and long error-corrected reads have been shown to significantly aid assembly. The 50x coverage requirement of ALLPATHS_LG would make the approach even more prohibitively costly for large genomes than the 10x requirement of PacBioToCA.
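
As promised, here is the back-of-the-envelope arithmetic behind that SMRTCell comparison. The per-cell yield and genome size are my own assumptions, chosen so that the numbers land roughly where Ribeiro et al's estimate does:

```python
# Back-of-the-envelope arithmetic for the SMRT cell comparison above.
# The yield per SMRT cell and the genome size are assumptions for illustration;
# they are not numbers taken from the paper.
import math

def smrt_cells_needed(genome_size_bp: int, target_coverage: float, yield_per_cell_bp: float) -> int:
    """Smallest whole number of SMRT cells giving at least the target coverage."""
    return math.ceil(target_coverage * genome_size_bp / yield_per_cell_bp)

genome = 4_500_000     # hypothetical bacterial genome size
yield_c2 = 75_000_000  # assumed ~75 Mbp of usable data per C2 SMRT cell
print(smrt_cells_needed(genome, 50, yield_c2))  # ~3 cells for 50x (ALLPATHS_LG recipe)
print(smrt_cells_needed(genome, 10, yield_c2))  # ~1 cell for 10x (PacBioToCA)
```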

My conclusion: for bacterially sized genomes, ALLPATHS_LG is preferred – if you don’t mind using their recipe. For larger genomes, stick with PacBioToCA for now, but keep an eye on what the Broad Institute is doing!

Feel free to let me know in the comments if you (dis)agree!

P.S. Laudably, all datasets used for the Ribeiro et al paper have been made available for us to try to replicate – or perhaps even improve – the results.


3 thoughts on “Combining short and long reads: choosing between PacBioToCA and the new ALLPATHS_LG”

  1. As the SAM format is the current standard for representing read mappings to a reference, what is the most widely used/accepted format for representing unfinished (or finished) de-novo assemblies?
    Thank you very much!

  2. There is not yet a generally accepted format. Different programs all have their own version, while the newest programs often only output contig and scaffold sequences. There is a movement towards a representation of the assembly graph in the so-called FASTG format. See the ongoing discussion on the mailing list mentioned here: http://assemblathon.org/pages/mailing-list.
