Each sequencing company has a workhorse genome they sequence a lot. PacBio sequences the lambda virus, Illumina uses PhiX. Both Ion Torrent and 454 use E. coli DNA, but while Ion Torrent takes E. coli K12 substr. DH10B, 454 chose E. coli K12 substr. MG1655.
I am interested in finding out to what extent Ion reads can replace the much more expensive 454 reads for de novo genome assembly (my field of speciality). Currently, the Ion read length is too short for the technology to be competitive, but this might change later this year, when (if …) the promised 400 bp reads become a reality.
In my comparisons of Ion data with 454 reads, I have always been hampered by the fact that the strains the platforms use as test samples were not completely identical. Luckily for me, today Ion Torrent released (behind the Ion Community login) a dataset on E. coli MG1655, a run with ID BEL-335. Imagine my joy! (Saves me from having our centre generate one on our yet-to-be-unpacked PGM.) The data is from a 314 chip, has 468 thousand reads and 54 Mbp raw data. That represents around 11x coverage of the MG1655 genome. This is a bit too low for what I ideally would like to have (around 30x), but alright. I set out to try this data in a de novo assembly using newbler, together with an equivalent data set of 454 reads.
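For those who want to check the coverage figure, here is a minimal sketch of the arithmetic in Python. The genome size of roughly 4.64 Mbp for MG1655 is my assumption (it is not stated above), and the function names are just for illustration.

```python
# Rough coverage arithmetic for the BEL-335 run described above.
# ASSUMPTION: the E. coli K-12 MG1655 genome is about 4.64 Mbp.
GENOME_SIZE_BP = 4_640_000


def coverage(total_bases, genome_size=GENOME_SIZE_BP):
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def mean_read_length(total_bases, n_reads):
    """Average read length implied by the total yield and the read count."""
    return total_bases / n_reads


def bases_needed(target_coverage, genome_size=GENOME_SIZE_BP):
    """Raw bases required to reach a target depth of coverage."""
    return target_coverage * genome_size


raw_bases = 54_000_000  # 54 Mbp raw data
n_reads = 468_000       # 468 thousand reads

print(f"coverage: {coverage(raw_bases):.1f}x")
print(f"mean read length: {mean_read_length(raw_bases, n_reads):.0f} bases")
print(f"bases needed for 30x: {bases_needed(30) / 1e6:.0f} Mbp")
```

Under that assumption, reaching the 30x I would ideally like takes between two and three runs of this size.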
Supposedly, if you need long reads of high quality and throughput, you should be using GS FLX+ from Roche/454, yielding a peak read length of 750-800 bases, right? Illumina reads are getting longer, but 250 bases is for now the limit (notwithstanding the 300 bp run done by the Broad). IonTorrent promises 400 bases, but we will have to see what the quality is going to be. And PacBio, well, we all know the reads are long, but with relatively low throughput and quality peaking at 85-86% accuracy (useful nonetheless).
Commercially, GS FLX+ has been around for more than half a year. So far, the community is reporting some success, see this thread at SeqAnswers. But when you talk to people, there are problems all around. Our own centre (the Norwegian Sequencing Centre) got the upgrade in August. I’ll spare you the details on numerous control/test fragment runs, short read runs etc, but the bottom line is that we still haven’t been able to successfully sequence a FLX+ library on our GS FLX+. Right now, we are having a service visit, and I intend to chain the guy to the instrument until we see some real good data from one of our own libraries…
There is another thing that surprises me too: I am attending PAG XX, the Plant and Animal Genome conference in San Diego, where Roche is one of the main sponsors. However, when you look at the little text they have as an exhibitor, well, let me just quote a part of it for you (source):
A quick summary for the impatient:
An analysis of the homopolymer distribution of the recently released ‘longer’ Ion Torrent reads indicates a possible significant over-calling of homopolymer lengths towards the ends of the reads. Trimming the ends off, however, only marginally improved de novo assembly of the reads using newbler.
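To give an idea of what such an analysis can look like, here is a minimal Python sketch that counts homopolymer run lengths separately for the body and the tail of each read, plus a simple end-trimming helper. This is an illustration only, not necessarily the procedure used for the results in this post; the 50-base tail and all function names are assumptions.

```python
from collections import Counter
from itertools import groupby


def homopolymer_runs(seq):
    """Yield (start_position, run_length) for each run of identical bases."""
    pos = 0
    for _, group in groupby(seq):
        length = sum(1 for _ in group)
        yield pos, length
        pos += length


def run_length_counts_by_region(reads, tail=50):
    """Count run lengths in the read body versus the last `tail` bases.

    A run is assigned to the tail if it starts within the last `tail` bases.
    The default of 50 is an arbitrary, assumed value.
    """
    body, end = Counter(), Counter()
    for seq in reads:
        cutoff = max(len(seq) - tail, 0)
        for start, length in homopolymer_runs(seq.upper()):
            if start >= cutoff:
                end[length] += 1
            else:
                body[length] += 1
    return body, end


def trim_end(seq, n):
    """Remove the last n bases from a read (returns '' if n >= len(seq))."""
    return seq[:max(len(seq) - n, 0)]


# Tiny made-up example, not real Ion Torrent data.
reads = ["ACGTAAAA", "CCGGTTTTT"]
body, end = run_length_counts_by_region(reads, tail=4)
print("body:", dict(body))
print("end: ", dict(end))
```

If long runs are relatively more common in the tail counts than in the body counts, that hints at over-calling towards the read ends, although only a comparison against the reference can confirm it.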
Life recently released ‘long’ IonTorrent reads (B14_387, resequencing of E. coli strain DH10B, available through the Ion Community here). There is an accompanying application note that brags about the reads’ accuracy, especially compared with reads from the MiSeq platform. These accuracy measurements are logically based on alignment to a reference genome.
But what about de novo assembly? Thing is, the dataset presented, with a peak length of 241 bases (see below) and 350 000 reads, is quite similar to what a full plate of GS FLX gave you in 2007 (peak length 250 bases, 400 000 reads). And 454 reads were very useful at the time for de novo assembly (in fact, the only reads available for this purpose, obviously besides Sanger reads).
For three platforms, reads longer than those commercially available, and/or from not-yet-released instruments, have become accessible online. By online, I mean that we can all download these data to have a look at:
1) MiSeq 2x 150 bases runs
As part of the German E. coli (EHEC) ‘Crowdsourcing Project’, Illumina sequenced five strains for the UK Health Protection Agency. The fastq files can be downloaded from http://www.hpa-bioinformatics.org.uk/lgp/genomes. These are the first data in the public domain from a MiSeq!
See also this post on GenomeWeb.
2) IonTorrent 316 chip
Keith Robison shares a bit of info on data from an Ion 316 chip on his ‘Omics! Omics!’ blog: “1.69M reads, with 1.53M of those >=50 bp long and 1.07M 100bp or longer”:
I downloaded the run files, and quickly looked at the read length distribution of the trimmed reads in the sff file (which listed 260 flows, 40 more than the file I analyzed in my previous post), showing a peak exactly one base longer at 109 bases. So, many more reads but not much gain in length (yet). Note the strange shape of the peak:
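For anyone wanting to repeat this kind of quick look, here is a minimal Python sketch of a read length distribution. It assumes you already have the trimmed read lengths as a list of integers (getting them out of the sff file is not shown), and the function names are purely illustrative.

```python
from collections import Counter


def length_histogram(lengths):
    """Map each read length to the number of reads with that length."""
    return Counter(lengths)


def peak_length(hist):
    """Most frequent read length; ties go to the shorter length."""
    return min(hist, key=lambda length: (-hist[length], length))


def fraction_at_least(hist, min_len):
    """Fraction of reads that are min_len bases or longer."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    kept = sum(count for length, count in hist.items() if length >= min_len)
    return kept / total


# Made-up lengths for illustration, not the actual 316 chip data.
lengths = [108, 109, 109, 110, 50, 49]
hist = length_histogram(lengths)
print("peak length:", peak_length(hist))
print("fraction >= 50 bp:", round(fraction_at_least(hist, 50), 2))
print("fraction >= 100 bp:", round(fraction_at_least(hist, 100), 2))
```

The same two fractions, computed on the real lengths, correspond to the ">=50 bp" and "100bp or longer" counts quoted above.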
3) 454 GS FLX+
As part of the Assemblathon 2 (a de novo assembly competition), the first GS FLX+ reads (from a parrot) have been released, with a peak read length of around 736 bases: http://bioshare.bioinformatics.ucdavis.edu/Data/hcbxz0i7kg/Parrot/. Those are at Sanger read length now!
Now I need to find the time to have a look at these data!