De novo bacterial genome assembly: a solved problem?

Earlier this year, Pacific Biosciences published a paper describing an approach for sequencing and assembling a bacterial genome that yields a finished, or near-finished, genome. The approach, dubbed the Hierarchical Genome Assembly Process (HGAP), relies on PacBio reads alone, with no need for short reads. This is how it works:

  • generate a high-coverage dataset of the longest reads possible, aiming for 60-100x coverage in raw reads
  • pre-assembly: use the reads from the shorter part of the raw read length distribution to error-correct the longest reads; set the length cutoff so that the longest reads make up about 30x coverage
  • assemble the long, error-corrected reads with a suitable assembler, e.g. Celera Assembler, to produce contigs
  • map the raw PacBio reads back to the contigs and polish the final sequence with the Quiver tool (more precisely, re-call the consensus using the raw reads as evidence)
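
To make the cutoff in the pre-assembly step concrete, here is a minimal Python sketch of one way to pick it: sort the raw read lengths from longest to shortest and add them up until they reach the target coverage. This is an illustration only, not HGAP's own code; the function names (seed_length_cutoff, split_reads), the toy read lengths and the toy genome size are all made up for the example.

```python
def seed_length_cutoff(read_lengths, genome_size, target_coverage=30.0):
    """Return the length cutoff at which the longest reads reach the target coverage.

    Reads are added from longest to shortest until their summed length
    reaches target_coverage * genome_size. The length of the last read
    added is the cutoff. Returns None if the dataset is too small.
    (Hypothetical helper, not part of any PacBio tool.)
    """
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    target_bases = target_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    return None


def split_reads(read_lengths, cutoff):
    """Split read lengths into long 'seed' reads and the shorter remainder."""
    seeds = [n for n in read_lengths if n >= cutoff]
    shorter = [n for n in read_lengths if n < cutoff]
    return seeds, shorter


if __name__ == "__main__":
    # Toy numbers: a 100 bp "genome" and a 10x target, so 1000 bases are needed.
    lengths = [500, 400, 300, 200, 100, 100]
    cutoff = seed_length_cutoff(lengths, genome_size=100, target_coverage=10)
    seeds, shorter = split_reads(lengths, cutoff)
    print("cutoff:", cutoff)
    print("seed reads:", seeds)
    print("shorter reads:", shorter)
```

For a real bacterial dataset you would pass the lengths of all raw reads, an estimated genome size (a few Mbp, say) and the default target of 30. Reads exactly as long as the cutoff all count as seeds, so the seed coverage can end up slightly above the target.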

The approach is very well explained on this website. As an aside, the same principle can now be used with the PacBioToCA pipeline.
