Longing for the longest reads: PacBio and BluePippin

PacBio sequencing is all about looong reads, especially in relation to de novo sequencing. A few things are needed to get the longest reads possible:

  • the longer the enzyme is active on the template, the longer the raw read will be – PacBio calls these ‘Polymerase reads’
  • a library for pacBio sequencing consist of circular molecules, with the target insert between two hairpin adaptors, allowing the enzyme to ‘go around’ and sequence the opposite strand once it reaches the end of the insert. See my previous post on this here. It then follows that the longer the template used for library preparation, the smaller the chance the polymerase goes around the hairpin, leading to longer uniquely sequenced template – PacBio calles these ‘reads of insert’ – and these represent the most useful reads for de novo sequencing applications
  • finally, the distribution of sizes of the library has an influence: any high-throughput sequencing technology, as well as PCR, has problems with ‘preferential treatment’ of smaller fragments. With PacBio sequencing, shorter molecules tend to load preferentially into the wells of the SMRTCell (‘chip’). It then makes sense to try to reduce the shoulder of shorter fragments for the final library preparation.

Recently, PacBio and Sage Science announced a co-marketing partnership for the BluePippin. This instrument allows for tight size selection of DNA samples, effectively making the peak of the size distribution much more narrow. With regard to PacBio sequencing, a narrow peak lessens the problem of preferential loading of short fragments, leading to much longer ‘reads of insert’. A demonstration can be seen on this poster (I think I know which fish they used for that one plot…).

Continue reading

On assembly uncertainty (inspired by the Assemblathon2 debate)

The release of the assemblathon2 paper via arXiv, the blog post by Titus Brown and his posting of his review of the paper sparked a discussion in the comments section of Titus’ blog, as well as on the  homolog_us blog. With this blog post, I’d like to chip in with a few remarks of my own.

First, I agree with all of what Titus Brown said (‘you know, what he said’). One of his main take-home messages from the Assemblathon2 is the uncertainty associated with any assembly, and how this needs to be communicated better. When I give presentations to introduce ‘this thing called assembly’, I often start out with a quote from the Miller et al. 2010 paper in Genomics (‘Assembly algorithms for next-generation sequencing data’):

An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target

I stress this aspect of the assembly being a ‘putative reconstruction’ first. I then move on to explain the hierarchy from reads to contigs to scaffolds (and perhaps superscaffolds/pseudochromosomes). In the future, I intend stress the fact that an assembly is a hypothesis of the genome more than I already do. Like Titus Brown said:

[...] each assembly is a computational hypothesis, developed from noisy data using approximate computation, and that this assembly must be treated with skepticism

This brings me to my main point: I think we are more then ever positioned to quantify the uncertainty of an assembly, and report both globally and locally how trustworthy it is. The Assemblathon2 mostly used data not involved in building the assembly, such as fosmid sequenced and optical maps. Titus Brown rightly points out in his review that RNA-seq (transcriptomics) data could also be used. I would like to add another aspect: checking that the assembly you obtain is in fact in correspondence with the data you used to built it. Do the reads agree with the assembly? As Titus says, did all reads actually end up being used?

The following tools are already available for this:

  • REAPR, used for the Assemblathon2 validations, is ’a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison’. Link
  • FRCBAM extends the FRC approach, detecting and quantifying ‘features’ possibly indicating assembly problems, based on mapping reads back to the assembly. For their paper on FRCBAM, the authors actually tested their approach on the Assemblathon2 results. Link, preprint on arXiv
  • ALE (‘Assembly Likelihood Evaluation’) uses aligned reads and k-mer distributions to determine a likelihood score. ALE is meant as a comparative measure, i.e. to compare different assemblies by their ALE score. Link, paper
  • CGAL is another likelihood based method.  Link, paper

Note that all these tools start by (having you) mapping back one or more of the input read datasets to the assembled sequences, and then using different methods/algorithms to score fidelity of the sequences, and/or detect possible misassemblies. Locally aligned read depth, insert sizes between paired reads, orientations of the pairs are all indicators used by these approaches. Most importantly, none of these tools need a reference sequence for the task!

Notably, many of these tools are able to generate information on assembly uncertainty along the assembled sequences, information which can be transformed into genomic tracks (e.g., through gff files) to be used in a genome browser. This would allow for supplying these tracks with a finished assembly.

As some of the readers of this blog may be aware, we are working on a new version of the Atlantic cod genome. We intend to try out these tools (we haven’t gotten very far with this yet), and see if we can generate tracks in an appropriate format. Hopefully, we will be able to develop a way to integrate the information of all the tracks into a final one with a summary ‘assembly (un)certainty’ score along the assembled sequences. When this score drops below a certain threshold, we intend to break the assembly (REAPR has implemented this already) to improve the final assembly.

On important caveat of this approach is that the tools describe depend on mapping reads back to the assembly. This poses its own problems, as mapping, like assembly, is far from a solved problem. Nor can, for that matter, one mapping exercise serve all tools. Not only have we had two assemblathons, and other such projects (GAGE, dnGASP), now two or perhaps more alignment competitions are starting (Alignathon and Variathon). So I expect some fiddling with the alignments to get all the tools to work properly (not looking forward to that :-( ).

In conclusion, I think we are more than able to report on uncertainties – and obviously certainties – associated with the hypothesis we call the assembly. I think it is time this reporting, for example in the form of annotation tracks visualisable in a genome browser, becomes standard for new assemblies. It must be said, though, that whether people actually are going to use this information remains to be seen. How many programs take per-base quality values of the reference into account when mapping reads, or of gene sequences when generating phylogenies (not my field, so I may be very off here)?

By the way, for certain reference genomes (human, mouse and zebrafish), tracks with clone-derived assembly problems are already available, see this blog post from The Genome Reference Consortium.

Let’s hope that by using methods to explicitly report assembly uncertainties, we can make the end-users of our ‘reference’ assemblies more aware of the strengths and weaknesses of the material they work with.

(Note that I left out QUAST and Amosvalidate+FRC, the first because it is not very useful without a reference sequence, the others because these require a file provided by the assembler describing the read layout, and there are very few assemblers able to generate this file)

Applications for PacBio circular consensus sequencing

I am a fan of the Pacific Biosciences PacBio RS sequencing instrument, as some of the readers of this blog maybe already are aware of. We use data from this instrument in our work, and the Norwegian Sequencing Centre – with which I am affiliated – offers the technology to its users.

PacBio recently increased their read lengths. With this so-called C2XL chemistry, average raw length is now around 4.3-4.5 kbp and maximum read lengths can go over 20kbp. These extra long reads are great for de novo genome sequencing applications, something we are trying ourselves. However, a bit buried in the news about these longer reads is the consequences for PacBio’s so-called Circular Consensus Sequencing, or CCS.

What is CCS?
In order to understand CCS, one needs to know a bit about PacBio sequencing in general. Readers familiar with this may skip to the next section.

The template for PacBio sequencing is a so-called SMRTBell. These are created by ligating hairpin adaptors to both ends of double-stranded DNA molecules. See the figure
The hairpin adaptor has a priming site for sequencing, and the polymerase used for sequencing has strand-displacement capacity (it basically doesn’t care if it has to work through a double-stranded region, it simply ‘kicks-off’ the opposite strand). The effect of this is that the sequence template (SMRTBell) acts like a single-stranded closed circle. The enzyme starts sequencing at the primer location and will sequence the template until it either falls of, or is ‘killed’ as a side-effect of the fluorescent excitation. Given a long enough life for the enzyme, and a short enough insert, the enzyme will in fact go around the hairpin on the other end of the SMRTBell. For the shortest inserts, it can potentially circle around multiple times.

smrtbell

Schematic representation of the SMRTBell template for PacBio sequencing. From http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2926623/

This multi-pass sequencing allows for calling a consensus of the sequence of the insert, overcoming the high single-pass error-rate of the technology (quoted by PacBio as median error rate of ~11%). For successful sequencing in the CCS mode, the insert needs to be short and polymerise life long. A minimum of three passes is needed for the software to even consider calling the consensus, and five or more are preferred to reach 99% accuracy (Q20) or better. One important aspect is the raw read length distribution for PacBio sequencing (see the figure below). The requirement of three or more passes limits the insert size. For example, a 3 kbp raw read is required to get a consensus of a 1 kbp insert. There will therefore always be a – significant! – fraction of raw reads that are not going to yield a consensus read, as they are simply to short. The number of useful reads per SMRTCell (‘chip’) is therefore lower than the number of raw reads useful for long-read (single pass) sequencing.

Typical PacBio C2Xl raw read length distribution. From http://pacificbiosciences.com/brochure (February 2013)

CCS and the the new extra-long raw reads
From the above, it follows that an increase in raw read lengths allows for longer insert libraries for CCS reads, or a higher throughput for shorter insert CCS libraries. PacBio used to recommend the 500bp to 1kbp range for CCS libraries. With the C2XL read length increase, this doubles to 1-2 kbp. Throughput is around 40 000 reads per SMRTCell. The remainder of this post describes three possible areas where these extra-long consensus reads may be useful.

Disclaimer
First, I may be a fan of PacBio, and the company is helping us out with our genome projects, but I am not receiving any financial or other personal gains from them. What I write here is my own opinion. Second, the suggestions below are not tested by me or my colleagues – although others may be working along these lines. How the suggested CCS uses I describe work in practice needs therefore to be determined.

1) CCS for sequencing full-length 16S rRNA
Before the days of Next Generation Sequencing, people used to amplify 16S regions of samples and make bacterial clone libraries, sequencing a set of clones using Sanger sequencing. NGS allowed for many more reads per sample, but restricted the length of the sequenced part to a few hundred bases at best. NGS therefore yielded much deeper sequenced datasets, but at a lower discriminatory power (less phylogenetic/taxonomic signal).
The long CCS reads now possible with PacBio imply that one could consider going back to full-length 16S sequencing. The throughput (40 000 reads per SMRTCell) will be much higher than doable using Sanger sequencing, and the quality potentially even better than Sanger. However, the price per read will be significantly higher than using short-read technology. I think that for certain diversity studies, using PacBio CCS with full-length 16S amplicons will be very beneficial.

2) CCS for shotgun metagenomics
Similarly, people interested in whole-sample shotgun metagenomics (as opposed to PCR-based diversity studies) could consider using the 1-2 kbp CCS reads. The long reads could yield much more useful information for gene mining, for example. Some may suggest that using the Roche/454 GS FLX+, which now seems to be working – at least in our lab it does – will yield much more reads (1 to 1.2 million) around that length (we have seen 1kb mode read lengths with GS FLX+!), making this technology more cost-effective. Some back-of-the-envelope calculations using prices for the Norwegian Sequencing Centre show that the comparison actually is in favour of PacBio. Given one library preparation, and 1 million required reads, the 454 is currently out-priced by PacBio, while the latter has the potential to give better quality (no homopolymer errors) and longer reads. However, first, generating 1 million CCS reads (25 SMRTCells) takes more time (several days, not counting library preparation) than a full 454 run (about a day) [Note that lab teams will like not having to do emulsion PCR for PacBio CCS!]. Secondly, the pricing situation may be different for other centres (pricing structures I think really differ from centre to centre).

3) CCS to replace Sanger capillary plate sequencing
Sanger sequencing is far from dead. Its main attraction is that it allows very small sample sizes (down to a single sample can be submitted to a facility), and long reads with high quality. A key difference between Sanger and NGS is the fact that each read can be traced back to a well on a plate. I think that, if a suitable and cost-effective barcoding scheme could be designed for multiplexed PacBio CCS sequencing, PacBio CCS could potentially replace Sanger plate sequencing. To keep the costs down, one would need to massively multiplex, perhaps dozens of 96 well plates with fragments that need to be tracked back to their original plate and well. But a laboratory with good automation experience might pull it off. At the same time, the scheme requires a steady flow of Sanger samples. It wouldn’t work for a facility that sequences less then, say, a dozen plates per week. Commercial providers may actually already consider doing this switch. The benefits could be longer reads than one can get with Sanger, with higher per-base qualities.

[Technical note: the per-SMRTCell throughput of 40 000 reads may allow for adding the same template multiple times, increasing the final consensus accuracy. The fraction of raw reads too short to give a consensus call (see above) may actually contribute to quality as well as they are barcoded]
In summary: PacBio CCS may be an alternative for short read sequencing, or even Sanger Capillary sequencing. However, there will be a trade-off in information content, versus price per read.

For a technical, but very readable paper describing CCS (from the PacBio researchers themselves), see this paper in Nucleic Acids Research.

Developments in next generation sequencing – a visualisation

With this post I present a figure I’ve been working on for a while now. With it, I try to summarise the developments in (next generation) sequencing, or at least a few aspects of it. I’ve been digging around the internet to find the throughput metrics for the different platforms since their first instrument version came out. I’ve summarised my findings in the table at the end of this post. Then, I visualised the results by plotting throughput in raw bases versus read length in the graph below.

Developments in next generation sequencing. http://dx.doi.org/10.6084/m9.figshare.100940

Developments in next generation sequencing. http://dx.doi.org/10.6084/m9.figshare.100940

Continue reading

How to sequence a bacterial genome at the end of 2012

A potential user (‘customer’) of our sequencing platform asked how to generate reference genomes for his 4 bacterial strains. His question inspired me to write this post. The suggestions below are not absolute, just my thoughts on how one these days could go about sequencing a bacterial genome using one or more of the sequencing platforms. I would appreciate any feedback/suggestions in the comments section!

Option 1: bits and pieces

  • Libraries: paired end or single end sequencing
  • Platform: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior
  • Bioinformatics: assembly: Velvet, SOAPdenovo, Newbler, MIRA, Celera
  • Outcome: up to hundreds of short contigs (with only single-end reads) or contigs + scaffolds (with paired end reads)
  • Pros: fast and cheap, OK for presence/absence of e.g. genes
  • Cons: doesn’t give much insight into the genome
  • Remarks: due to per-run throughput, multiplexing is recommended; data can also be used for mapping against a reference genome instead

Option 2: a few gaps remaining

  • Libraries: paired end or single end sequencing combined with at least one mate pair library of 8kb distance, optionally a 3-5 kb mate pair library
  • Platform: paired end: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior; mate pairs: preferably on the 454, otherwise Illumina, demonstrated protocol for Ion PGM
  • Bioinformatics: assembly: Illumina data: Velvet, SOAPdenovo; 454 data: Newbler MIRA, Celera; hybrid: Newbler, Celera
  • Outcome: down to one or a few scaffolds per chromosome/plasmid with a few to numerous gaps remaining
  • Pros: gives a lot of long-range information
  • Cons: mate pair libraries more cumbersome to make; more expensive, remaining gaps often due to repeats, which may be exactly the regions of interest (e.g. transposons, rDNA operons)
  • Remarks: hybrid assembly is largely uncharted territory; gap closing programs may help (IMAGE2, Soap gap closer, GapFiller, …)

Option 3: closing the gaps

  • Libraries: as for option 2, but in addition a long-insert (10kb) PacBio library
  • Platform: one of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior; plus PacBio RS
  • Bioinformatics: assembly: same as option 2; gap closing/finishing: PBJelly; AHA from PacBio can also be used
  • Outcome: down to one or a few scaffolds per chromosome/plasmid potentially without gaps remaining
  • Pros: no mate pairs needed, potentially very complete genome
  • Cons: requires multiple libraries, expensive
  • Remarks: Quiver can be used to improve the per-base quality given enough PacBio coverage

Option 4: The ALLPATHS_LG way

  • Libraries: Illumina paired end + Illumina Mate Pair + a long-insert (10kb) PacBio library
  • Platform: Illumina HiSeq or MiSeq; PacBio RS
  • Bioinformatics: assembly: ALLPATHS_LG
  • Outcome: down to one or a few contigs per chromosome/plasmid
  • Pros: can yield finished genome
  • Cons: complex mixture of libraries, can become expensive
  • Remarks: special recipe: the paired end library should have a short insert, such that the forward and reverse read overlap

Option 5: PacBio hybrid

  • Libraries: paired end or single end sequencing + a long-insert (10kb) PacBio library
  • Platform: as for option 2, but PacBio RS in addition
  • Bioinformatics: error correction of the PacBio reads: PacBioToCA, LSC, P_ErrorCorrection from PacBio; assembly: Celera, MIRA
  • Outcome: down to one or a few scaffolds per chromosome/plasmid potentially without gaps remaining
  • Pros: can yield finished genome; only needs two libraries that are simple to produce
  • Cons: error-correction can be computationally demanding
  • Remarks: Quiver can be used to improve the per-base quality given enough PacBio coverage

Option 6: PacBio only

  • Libraries: a single, long-insert (10kb) PacBio library
  • Platform: PacBio RS
  • Bioinformatics: error-correction; HGAp; assembly: Celera or Allora
  • Outcome: down to one or a very few contigs per chromosome/plasmid
  • Pros: uses a single library, cheap, finished genome of very high quality
  • Cons: somewhat experimental, limited examples available, therefore currently a bit more risky
  • Remarks: use Quiver for polishing

EDIT: additional tools

  • Reader BenK suggested adding optical mapping as a way to improve bacterial genome assemblies. Not a sequencing technology per se, but, as he writes, “Many genomes have undiscovered repetitive regions that mapping unveils.”
  • Reader Roy suggested mentioning stand-alone scaffolders: SSPACE, SOPRA, and the reference-based scaffolder ABACUS . See also this SeqWiki howto.
  • Sébastien Boisvert, author of Ray, suggested I add his program. Done hereby. The reason that it is not listed above has to do with my inexperience with it with regards to de novo bacterial genome assemblies
  • In the same spirit, reads may want to look at the GAGE results for more programs.

Relevant previous blog posts:

Combining short and long reads: choosing between PacBioToCA and the new ALLPATHS_LG

Ion Torrent Mate Pairs and a single scaffold for E coli K12 substr. MG1655

Links to programs mentioned:

ALLPATHS_LG
Celera
GapFiller
HGAp
IMAGE2
LSC
MIRA
Newbler: request here
P_ErrorCorrection, Allora and AHA see here, these need smrtpipe
PacBioToCA (also part of Celera 7.0 and smrtpipe)
PBJelly
Quiver see here
SOAPdenovo 
Velvet

Reader Roy suggested adding Spades.

(If you feel I missed a program here, let me know in the comments below!)

My take on the sequencing buzz at #ASHG2012

Image from Wikimedia Commons. (Buzz was a low-cost airline based at London Stansted operating services to Europe. It was sold to Ryanair.)

I am not attending the American Society of Human Genetics meeting in San Fransisco, but can’t escape the buzz it creates on twitter (hashtag #ashg2012). Strikingly, it is almost another AGBT when it comes to announcements from companies selling sequencing instrument. All of them had something new to bring to the floor. This post summarizes what I picked up from twitter and a few websites, and I give a bit of my perspectives on the respective announcements. I am focussing on technology improvements, especially with regard to read lengths, not so much on applications such as cancer resequencing panels.

Continue reading

Combining short and long reads: choosing between PacBioToCA and the new ALLPATHS_LG

After having given an overview of the PacBio error-correction (PacBioToCA) pipeline of Koren et al (see previous blog post), it was interesting to see another paper coming out describing combining PacBio and Illumina for assembling bacterial genomes: Ribeiro et al, “Finished bacterial genomes from shotgun sequence data“, Genome research accepted preprint. The authors are all from the Broad Institute. David Jaffe was so kind as to provide me with the supplementary material file, which so far is not yet available online. In this post, I will summarise the ALPATHS_LG paper, and contrast the approach with the PacBioToCA pipeline.

Picture from zazzle.com (http://bit.ly/RMGgZR)

The Broad Institute of MIT/Harvard is an impressive genome centre. Sorry, ‘Genomic Medicine Center’. One of their achievements in recent years is an optimised pipeline for assembly of (small and) large genomes based on short read (Illumina) data. The software program they developed, ALLPATHS_LG, combined with a kind of special type of ‘recipe’, has proven to be extremely succesful. The Broad keeps churning out very respectable assemblies of, among other groups, large complex eukaryotes. In order to be able to handle large amounts of genome projcts, I get the impression that the Broad Institute aims for standardisation and optimisation of protocols and, what they call ‘recipes’. The idea being that, if you follow their recipes and use their programs, you are more or less guaranteed an optimal result. In the case of ALLPATHS_LG, it also means, however, that one is required to have at least one Illumina jumping (‘mate pair’) library and a short-insert (‘paired-end’) Illumina library for which the insert size is shorter than twice the read length. Paired end libraries with slightly larger insert sizes are standard, making the ALLPATHS_LG requirements kind of uncommon. It also means not many projects are only able to compare other programs with ALLPATHS without generating an extra read dataset (I speak from experience here…).

The Ribeiro et al paper is entirely focussing on microbial size genomes, a big difference with the Koren et al (PacBioToCA) paper. The short-read recipe for ALLPATHS_LG remains unchanged: 50x coverage short-insert (up to 220bp) paired reads and 50x coverage jumping (2-10kb) reads. This is then supplemented with 50x PacBio reads, with library insert sizes between 1 and 3 kb (although for the paper, they also created two larger insert libraries for two of the strains, at 6 and 10kb respectively). In total, there are three size ranges represented this way: short – overlapping paired reads, intermediate – PacBio reads, long – illumina mate pairs. Finally, the short reads have high quality relative to the PacBio reads.

Continue reading