The one and only Oxford Nanopore talk at AGBT 2014 – with real data
Twitter started buzzing this morning at AGBT because researchers started getting confirmation emails from Oxford Nanopore regarding their application for the MinION Access Program (MAP). This timed well with the first talk discussing some serious data from the platform, by David Jaffe from the Broad Institute, entitled “Assembly of Bacterial Genomes Using Long Nanopore Reads”. Probably no coincidence…
David Jaffe’s highly anticipated talk showed data generated by Oxford Nanopore on their MinION from two bacterial genomes. One was a methylation-negative E. coli (the fact that it was methylation-negative may have been significant, but he didn’t say). The second species was a Scardovia. Five micrograms of DNA were sent to the company. Library prep consisted of fragmentation and adaptor ligation, basically the classical workflow. Nothing was said about the type of adaptors.
Jaffe showed how the platform worked, but stressed that he was simply relaying what the company had told him. Double-stranded DNA is ratcheted through the nanopore, becoming single-stranded in the process. Multiple bases are present in the pore at one time. At AGBT in 2012, the company said they were reading bases in groups of three. Jaffe showed a squiggly electrical nanopore signal with peaks representing overlapping 6-mers (!). For example, the 6-mers read were TCGACC followed by CGACCC followed by GACCCT. So, the last five bases of each 6-mer overlap with the first five bases of the next 6-mer. This means the software has to be able to distinguish 4096 different signals! Not much was said about this base-calling process.
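To make the 6-mer arithmetic concrete, here is a minimal Python sketch. It is only an illustration of the numbers above, not the company's base-caller (which was not described). The function name `overlapping_kmers` and the 8-base example sequence are my own assumptions, chosen so that the three 6-mers from the talk come out.

```python
def overlapping_kmers(sequence: str, k: int = 6) -> list[str]:
    """Return every k-mer seen as the strand moves one base at a time."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Hypothetical 8-base stretch that yields the three 6-mers from the talk.
kmers = overlapping_kmers("TCGACCCT", k=6)
print(kmers)

# Consecutive k-mers share k - 1 bases: the suffix of one is the prefix of the next.
for left, right in zip(kmers, kmers[1:]):
    assert left[1:] == right[:-1]

# Four possible bases at each of six positions gives the number of
# distinct signals the software would have to tell apart.
print(4 ** 6)
```

Moving from 3-mers (4^3 = 64 signals) to 6-mers (4^6 = 4096 signals) is what makes the base-calling problem so much harder.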
Read lengths followed a distribution with a mean of 5.4 kbp for E. coli and 4.9 kbp for Scardovia. The longest reads were around 9 kbp. Jaffe commented that, according to the company, the read lengths were limited by the length of the input DNA (he showed a Bioanalyzer trace of the fragmented DNA matching the length distribution of the reads). There is clearly room for improvement here: longer input will likely generate longer reads.
Jaffe discussed and showed an example of the types of errors they were seeing, given that they have a reference genome for both strains. The reads showed long stretches of perfect matches, interrupted by clusters of errors (‘error-blocks’), which were mostly deletions. He mentioned that 84% of reads have at least one perfect 50-mer, and that 100% have at least one perfect 25-mer. He also showed what looked very much like a systematic error, where for a certain stretch of reference, five out of six reads were missing a T. Jaffe mentioned solutions the company is suggesting here, in particular a better base-caller, as well as using different pore types in the same MinION, in the hope that each will have a different error profile, which would allow a consensus to be built.
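The "perfect k-mer" statistic can be sketched as follows. This is my own guess at how such a number could be computed, not the method used in the talk. I assume each read comes with a gapped pairwise alignment to the reference (two equal-length strings with `-` for gaps), and the function names are hypothetical.

```python
def longest_perfect_run(aligned_read: str, aligned_ref: str) -> int:
    """Length of the longest stretch of consecutive matching, ungapped columns."""
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    best = run = 0
    for read_base, ref_base in zip(aligned_read, aligned_ref):
        if read_base == ref_base and read_base != "-":
            run += 1
            best = max(best, run)
        else:
            # A mismatch, insertion or deletion ends the perfect stretch.
            run = 0
    return best


def fraction_with_perfect_kmer(alignments: list[tuple[str, str]], k: int) -> float:
    """Fraction of reads containing at least one error-free stretch of length k."""
    if not alignments:
        return 0.0
    hits = sum(1 for read, ref in alignments if longest_perfect_run(read, ref) >= k)
    return hits / len(alignments)


# Toy alignments (made up): each read has a single deletion relative to the reference.
toy = [("ACGT-ACGTA", "ACGTTACGTA"), ("AC-TAC", "ACGTAC")]
print(fraction_with_perfect_kmer(toy, k=5))
```

With real data, `k=50` and `k=25` would correspond to the two percentages quoted above.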
Jaffe’s title was “Assembly of Bacterial Genomes Using Long Nanopore Reads”. It turns out he needed to combine the MinION reads with short-read data to be able to do that. So they generated PCR-free libraries for the same strains for the Illumina platform (probably the MiSeq) and used their program DISCOVAR for the assembly. The MinION reads were then used to resolve the assembly graph from DISCOVAR, basically overcoming the repeats longer than the Illumina reads. The same approach, by the way, was used with their ALLPATHS program and PacBio reads a couple of years ago.
Concluding, Jaffe said that Oxford Nanopore data is not yet useful for de novo assembly on its own, so he is envisioning applications where getting every base right is not an issue.
Here are a few of my thoughts about this presentation. On its own, this is a fantastic breakthrough for nanopore-based DNA sequencing. Never before has a nanopore platform been able to show this kind of data. So, congratulations are in order for Oxford Nanopore.
The lengths of reads presented may seem disappointing given the presentation at AGBT two years ago, where 100 kbp reads were mentioned. However, the limit may have been the size distributions after fragmentation. As users of the PacBio platform know, fragmentation is an art for obtaining long reads, so there is a clear potential for improvement here.
The error model, which seems to be biased and systematic, is worrying from an assembly and variant-calling point of view. It will probably improve over time. On the other hand, there are, as Jaffe alluded to, lots of applications where having (near-)perfect reads is not crucial. Think scaffolding, structural variation detection, isoforms for RNA (cDNA) sequencing, etc.
There remain a few open questions, though. Nothing was mentioned on coverage of the genomes sequenced (a clear miss by Jaffe). Hopefully, there will be no GC bias, for example. A careful look at the errors is needed, for example, are there chimeric reads, adaptors present, long-range errors in the reads?
What about the comparison with other long-read platforms? The data presented is somewhat similar to Moleculo data in terms of length, with a lower quality and a yet-to-be-determined GC bias. PacBio reads are significantly longer, and of better quality after self error-correction. The big difference is of course price. A PacBio instrument, or a HiSeq, is far more expensive than the MinION. A small USB stick also takes up much less space, and can be carried into the field, especially if library prep can be made small enough.
We are all comparing this presentation to what Oxford Nanopore described two years ago. The length of the reads, as mentioned, leaves much room for improvement. Direct sequencing of, for example, blood samples does not yet seem to be possible. It is quite easy to accuse the company of overpromising, but they merely joined the club…
Nothing was said about data release, and I really, really hope the company will give the community access to the data.
So here is my summary: MinION nanopore reads are error-prone, likely with a systematic (not random) error. They already show potential for assembly improvement, but are not yet ready for de novo assembly on their own. Other applications will certainly appear. If the data quality doesn’t turn out to be a major headache, there is a clear niche for the platform, where it will compete with the other available long-read technologies. Whether it is a game-changer (and a PacBio killer) remains to be seen.
P.S. see NextGenSeek’s post on the presentation here
P.S.2 Our application to the MinION Access Program was not granted, so it will be a while until I get my hands on real data myself…