Each sequencing company has a workhorse genome they sequence a lot. PacBio sequences the lambda virus, Illumina uses PhiX. Both Ion Torrent and 454 use E. coli DNA, but while Ion Torrent takes E. coli K12 substr. DH10B, 454 chose E. coli K12 substr. MG1655.
I am interested in finding out to what extent Ion reads can replace the much more expensive 454 reads for de novo genome assembly (my field of speciality). Currently, the Ion read length is too short for the technology to be competitive, but this might change later this year, when (if …) the promised 400 bp reads become a reality.
In my comparisons of Ion data with 454 reads, I have always been hampered by the fact that the strains the platforms use as test samples were not completely identical. Luckily for me, today Ion Torrent released (behind the Ion Community login) a dataset on E. coli MG1655, a run with ID BEL-335. Imagine my joy! (Saves me from having our centre generate one on our yet-to-be-unpacked PGM.) The data is from a 314 chip, has 468 thousand reads and 54 Mbp raw data. That represents around 11x coverage of the MG1655 genome. This is a bit too low for what I ideally would like to have (around 30x), but alright. I set out to try this data in a de novo assembly using newbler, together with an equivalent data set of 454 reads.
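The coverage figure above is simply the raw yield divided by the genome size. A minimal sketch of that arithmetic, assuming the MG1655 reference length of 4,641,652 bp (the constant and the function name `fold_coverage` are introduced here for illustration only):

```python
# Fold coverage = total sequenced bases / genome size.
# Assumption: MG1655 reference length of 4,641,652 bp.
MG1655_GENOME_BP = 4_641_652


def fold_coverage(total_bases: float, genome_size: int = MG1655_GENOME_BP) -> float:
    """Return the average number of times each genome position is sequenced."""
    return total_bases / genome_size


raw_bases = 54e6   # 54 Mbp of raw data from the 314 chip run
n_reads = 468_000  # 468 thousand reads

print(f"coverage: {fold_coverage(raw_bases):.1f}x")
print(f"mean read length: {raw_bases / n_reads:.0f} bases")
print(f"bases needed for 30x: {30 * MG1655_GENOME_BP / 1e6:.0f} Mbp")
```

By the same calculation, the 30x I would ideally like corresponds to roughly 139 Mbp of raw sequence.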
Life Technologies released reads from a new benchmark run for the Grand Challenge. For those with an account at the Ion Community, here are the links to the announcement, and the data (sff, fastq and reports). A 316 chip was used.
Let me start by saying the data set is pretty impressive:
– 2.76 million reads
– peak length at 251 bases, longest 428
– 664 Mbp raw sequence
Compare this to where we were when we obtained the GS FLX in October 2007: around 400 000 reads of 250 bases, totaling approximately 100 Mbp… Then again, how good are these data?
With the release follows a report which shows mapping statistics, and they are pretty impressive. I won’t go into the details, as I expect an application note (with the obligatory comparison with the MiSeq data) to appear soon, and/or other bloggers throwing themselves onto this data. I thought to use the same kind of analysis as I used for my previous post on this data set.
A quick summary for the impatient:
An analysis of the homopolymer distribution of the recently released ‘longer’ Ion Torrent reads indicates a possible significant over-calling of homopolymer lengths towards the ends of the reads. Trimming the ends off, however, only marginally improved de novo assembly of the reads using newbler.
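One way to look for this kind of effect without a reference genome is to tabulate homopolymer run lengths separately for the start and the end of each read. The following is only an illustrative sketch, not the exact analysis used in the post: the 50% split point and the function name `homopolymer_counts` are assumptions made for this example.

```python
import re
from collections import Counter

# Matches maximal runs of a single base (length 1 and up).
HOMOPOLYMER = re.compile(r"A+|C+|G+|T+")


def homopolymer_counts(reads, split=0.5):
    """Count homopolymer run lengths in the first and last part of each read.

    A run is assigned to the 'head' if it starts before split * read length,
    otherwise to the 'tail'. Returns two Counters keyed by run length.
    """
    head, tail = Counter(), Counter()
    for seq in reads:
        cutoff = len(seq) * split
        for match in HOMOPOLYMER.finditer(seq.upper()):
            target = head if match.start() < cutoff else tail
            target[len(match.group())] += 1
    return head, tail


head, tail = homopolymer_counts(["ACGTAAAA", "TTGCACCC"])
print("head:", dict(head))
print("tail:", dict(tail))
```

If longer runs are clearly over-represented in the tail relative to the head across many reads, that would be consistent with over-calling towards the read ends.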
Life recently released ‘long’ Ion Torrent reads (B14_387, resequencing of E. coli strain DH10B, available through the Ion Community here). There is an accompanying application note that brags about the reads’ accuracy, especially over reads from the MiSeq platform. These accuracy measurements are logically based on alignment to a reference genome.
But what about de novo assembly? Thing is, the dataset presented, with a peak length of 241 (see below) and 350 000 reads, is quite similar to what a full plate of GS FLX gave you in 2007 (peak length 250 bases, 400 000 reads). And 454 reads were very useful at the time for de novo assembly (in fact, the only reads available for this purpose, obviously besides Sanger reads).
Looking at the discussions on the Ion Community website, I came across an entry that mentions something interesting about the flow order. For both 454 and Ion Torrent, sequencing happens by flowing one dNTP (base) at a time over the template. For each read, one or more of these bases gets incorporated, or none at all (see also an entry on this at my other blog).
454 has been using the same flow order since the beginning: TACG. This can be seen from the ‘header’ part of the sff file, which lists the flow order under ‘Flow Chars’ (see here and here for examples).
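To make the flow principle concrete, here is a minimal, idealized sketch of how a sequence translates into per-flow incorporation counts under the TACG flow order. It assumes a noise-free read and, for simplicity, treats the input as the strand being synthesized (so no complementing); the function name `flowgram` is made up for this example.

```python
def flowgram(read: str, flow_order: str = "TACG", max_flows: int = 1000):
    """Return a list of (flowed_base, bases_incorporated), one entry per flow.

    Each flow offers one base; every consecutive matching base in the read is
    incorporated at once, which is why homopolymers show up as a single flow
    with a count above one. Flows with no matching base record a zero.
    """
    flows = []
    pos = 0
    flow_index = 0
    while pos < len(read) and flow_index < max_flows:
        base = flow_order[flow_index % len(flow_order)]
        incorporated = 0
        while pos < len(read) and read[pos] == base:
            incorporated += 1
            pos += 1
        flows.append((base, incorporated))
        flow_index += 1
    return flows


for base, n in flowgram("TTACGGGA"):
    print(base, n)
```

The same function accepts any other flow order as its second argument, so the number of flows needed for a given read under different orders can be compared with `len(flowgram(read, order))`.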
The first Ion Torrent runs on the 314 chip used the exact same flow order as 454. The Ion Community entry I mentioned explains how for the 316 chip, for which the first data were released not too long ago on the Ion Community website, an entirely different flow order was used. Instead of a four-base repeated cycle, the following 32 base (!) sequence was used repeatedly:
Why would this ‘weird’ flow order be used? The different flow order apparently helps to remove incomplete extension, yielding longer read lengths. Incomplete extension happens when a subset of the template molecules on a single bead fails to incorporate a base, putting them out of sync with the rest of the molecules and causing noise during later flows. The new flow order allows these molecules to ‘catch up’, so the different template molecules are better synchronized. A drawback of the 32-base flow order is the (on average) lower number of incorporations per flow, meaning more flows are needed for the same read length.
It looks like Ion is experimenting with other flow orders to get even better results. Now there is something 454 might give a try (although Life probably has or will take a patent out on this)…
For those of you who have access, the post is located here, and another one here.
For three platforms, reads longer than those commercially available, and/or from not-yet-released instruments, have become accessible online. By online, I mean that we all can download these data to have a look at:
1) MiSeq 2x 150 bases runs
As part of the German E. coli (EHEC) ‘Crowdsourcing Project’, Illumina sequenced five strains for the UK Health Protection Agency. The fastq files can be downloaded from http://www.hpa-bioinformatics.org.uk/lgp/genomes. These are the first data in the public domain from a MiSeq!
See also this post on GenomeWeb.
2) IonTorrent 316 chip
Keith Robison shares a bit of info on data from an Ion 316 chip on his ‘Omics! Omics!’ blog: “1.69M reads, with 1.53M of those >=50 bp long and 1.07M 100bp or longer”:
I downloaded the run files, and quickly looked at the read length distribution of the trimmed reads in the sff file (which listed 260 flows, 40 more than the file I analyzed in my previous post), showing a peak exactly one base longer at 109 bases. So, many more reads but not much gain in length (yet). Note the strange shape of the peak:
3) 454 GS FLX+
As part of the assemblathon2 (a de novo assembly competition), the first GS FLX+ reads (from a parrot) have been released, with a peak read length around 736 bases: http://bioshare.bioinformatics.ucdavis.edu/Data/hcbxz0i7kg/Parrot/. Those are at Sanger read length, now!
Now I need to find the time to have a look at these data!
A quick summary for the impatient:
The sff files from the E. coli Ion Torrent runs released by EdgeBio show much longer raw reads than the trimmed reads in the corresponding fastq/fasta files. The quality of those extra bases, however, is very low. This shows the potential for longer reads from the Ion Torrent platform.
The sff file released by Ion Torrent through their Dev Community site has these extra bases masked, which makes one wonder whether they are trying to hide something…
Part 1: EdgeBio’s data
When EdgeBio released six runs with E. coli DH10B Ion Torrent data (see http://www.edgebio.com/blog/?p=191), I decided to have a look inside the sff files they provided. I downloaded the data from data.edgebio.com, and used Roche’s sffinfo command to ‘peek inside’. The sffinfo command, which comes with the 454 Life Sciences software suite, will list the content of the binary sff file in text format (see the post on my other blog). Other open source/access tools, such as the ones I mention on my blog, might do this as well.