IonTorrent: many long reads, still longer homopolymers

Life Technologies released reads from a new benchmark run for the Grand Challenge. For those with an account at the Ion community, here are the links to the announcement, and the data (sff, fastq and reports). The run used a 316 chip.

Let me start by saying the data set is pretty impressive:
– 2.76 million reads
– peak length at 251 bases, longest 428
– 664 Mbp raw sequence

Compare this to where we were when we obtained the GS FLX in October 2007: around 400 000 reads of 250 bases, totaling approximately 100 Mbp… Then again, how good are these data?

With the release comes a report showing mapping statistics, and they are pretty impressive. I won’t go into the details, as I expect an application note (with the obligatory comparison with the MiSeq data) to appear soon, and/or other bloggers to throw themselves onto this data. I thought I would use the same kind of analysis as I used for my previous post on this data set.

First: the read length distribution. To keep the scales similar, I extracted the same number of reads from the sff file as I used in my previous post:

The distribution of the B15_410 data shows a remarkably sharp peak around 250, with hardly any shorter or longer reads. Also, the weird peak around 400 seen in the B14 data is gone.
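For anyone who wants to make a similar plot, here is a minimal sketch of how a read length distribution could be tallied. It is not the script behind the figure: it assumes the reads are in the released fastq file rather than the sff, that every record spans exactly four lines, and that taking the first `max_reads` records is an acceptable way to get the same number of reads. The names `read_lengths` and `length_histogram` are made up for this example.

```python
from collections import Counter


def read_lengths(fastq_handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the second line of every record holds the bases
            yield len(line.strip())


def length_histogram(fastq_handle, max_reads=None):
    """Count how many reads have each length, optionally using only the first max_reads."""
    counts = Counter()
    for n, length in enumerate(read_lengths(fastq_handle)):
        if max_reads is not None and n >= max_reads:
            break
        counts[length] += 1
    return counts
```

Calling `length_histogram(open("reads.fastq"), max_reads=100000)` (file name hypothetical) returns a `Counter` mapping read length to read count, which can be plotted directly.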

Then, I again looked at the homopolymers. First, the frequencies of homopolymers of all lengths, comparing again to 454 data, and one of the other Ion runs:

At first sight, the B15_410 run seems to run more parallel to the E. coli and 454 data, but the deviation at the longer homopolymers, as seen with other Ion runs, is still there.
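As an illustration of what counting homopolymers of all lengths can look like, here is a small sketch. It treats every maximal run of identical bases as one homopolymer (so a single base is a 1-mer) and works on read sequences already loaded as strings; both choices are assumptions for the example, and `homopolymer_length_counts` is a hypothetical name.

```python
from collections import Counter
from itertools import groupby


def homopolymer_length_counts(reads):
    """Count maximal runs of identical bases by run length, summed over all reads."""
    counts = Counter()
    for seq in reads:
        # groupby splits the sequence into consecutive runs of the same base
        for _base, run in groupby(seq.upper()):
            counts[sum(1 for _ in run)] += 1
    return counts
```

Dividing each count by the total number of runs gives frequencies that can be compared between data sets of different sizes.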

I then again counted the starting positions for all homopolymers of each read. For comparison, I repeat the graphs for 454 and the B14 long reads from my last post:

454 data:

B14 long data:

New, B15_410 data:

It looks to me like the problem has gotten worse: from around position 230, the frequency of 1-mers first goes up slightly, then drops dramatically until it falls below the 2-mer frequency, which itself rises more steeply than in the B14 data set.
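Counting start positions per homopolymer length can be sketched along the same lines. The snippet below is an assumption-laden illustration rather than the code used for the graphs: positions are 1-based, every maximal run of identical bases counts as one homopolymer, and `homopolymer_start_positions` is a name invented here.

```python
from collections import Counter, defaultdict
from itertools import groupby


def homopolymer_start_positions(reads):
    """Map run length -> Counter of 1-based start positions of runs of that length."""
    starts = defaultdict(Counter)
    for seq in reads:
        pos = 1
        for _base, run in groupby(seq.upper()):
            length = sum(1 for _ in run)
            starts[length][pos] += 1
            pos += length  # the next run starts right after this one
    return starts
```

Plotting `starts[1]` and `starts[2]` against position would show the kind of crossover between 1-mers and 2-mers described above.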

So, is this data set then the same as the previous long read data set? Well, there is one point of significant improvement. For assemblies, the reads perform much better than the B14_387 data. I again selected the same number of reads for all assemblies, except for the B15_410 full length assembly, where I picked the same number of bases (same coverage):

The B15_410 assemblies show half the number of contigs compared to the B14_387 reads, with much better contig N50s. Trimming back to 230 bases did not improve the assembly, though, so the homopolymer extension does not explain the poorer performance compared to the 454 assemblies. Hey, did you notice the slightly longer ‘longest’ contig for the B15_410 full length assembly compared to the 454 one?
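For readers less familiar with the N50 metric: it is the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembly size. A minimal sketch, assuming the contig lengths are available as a list of integers (the function name `n50` is just for this example):

```python
def n50(contig_lengths):
    """Return the contig N50: the length at which the cumulative sum of contigs,
    sorted longest first, reaches at least half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # no contigs at all
```

For contigs of 10, 8, 5, 3 and 2 bases the total is 28, and the running sum reaches 14 at the second contig, so the N50 is 8.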

Note that the assemblies based on the GS FLX reads still significantly outperform the Ion Torrent-based assemblies when it comes to contig numbers and N50. How much of this is due to the fact that the DH10B genome (Ion Torrent reads) is much harder to assemble than the MG1655 strain (454 reads) remains to be seen. Only a fairer comparison, by generating MG1655 Ion Torrent reads of the same length, can answer that question. Also, a more thorough comparison, with mapping contigs back to the reference genomes, will give much better information regarding the assembly qualities.

