Ion Torrent data on E. coli K12 MG1655 – a fairer comparison with 454

Each sequencing company has a workhorse genome they sequence a lot. PacBio sequences the lambda virus, Illumina uses PhiX. Both Ion Torrent and 454 use E. coli DNA, but while Ion Torrent takes E. coli K12 substr. DH10B, 454 chose E. coli K12 substr. MG1655.

I am interested in finding out to what extent Ion reads can replace the much more expensive 454 reads for de novo genome assembly (my field of speciality). Currently, the Ion read length is too short for the technology to be competitive, but this might change later this year, when (if …) the promised 400 bp reads become a reality.

In my comparisons of Ion data with 454 reads, I have always been hampered by the fact that the strains the platforms use as test samples were not identical. Luckily for me, today Ion Torrent released (behind the Ion Community login) a dataset on E. coli MG1655, a run with ID BEL-335. Imagine my joy! (It saves me from having our centre generate one on our PGM, which is yet to be unpacked.) The data is from a 314 chip and has 468 thousand reads and 54 Mbp of raw data. That represents around 11x coverage of the MG1655 genome. This is a bit lower than what I would ideally like to have (around 30x), but alright. I set out to try this data in a de novo assembly using newbler, together with an equivalent data set of 454 reads.
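As a quick sanity check on the coverage figure, here is a minimal sketch of the arithmetic. It assumes the exact base count used later in this post (54,292,729) and the MG1655 genome size quoted further down (4,639,675 bases); the function name coverage is just something I made up for illustration.

```shell
# Fold coverage = total sequenced bases / genome size.
# Both numbers come from this post; 'coverage' is a hypothetical helper name.
coverage() {
    awk -v bases="$1" -v genome="$2" 'BEGIN { printf "%.1f\n", bases / genome }'
}

coverage 54292729 4639675   # prints 11.7
```

So the rough figure quoted above is, if anything, slightly on the low side.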

First, I checked the read length distribution (line marked ‘BEL-335’).

Note the small peak at 193 bases; for some reason, such peaks are usually present in Ion Torrent data.

I next created a comparable 454 dataset. I used the same GS FLX sff file as for previous posts, resetting the number of cycles to 48 (which shortens the GS FLX reads) to get the same length distribution; it is shown in the plot above as well. Note how the peak of the Ion reads is slightly narrower (a tighter distribution around the mode length). Finally, I randomly picked the same number of bases as in the Ion dataset, to get the same coverage.

Assemblies were done with the latest release of newbler (2.6) and default settings. As no paired-end (mate pair) reads were available, only contigs were generated. This is the first time I was able to do a direct comparison of these data from the exact same genomic source.

The table below summarizes the metrics (‘large’ contigs are those of at least 500 bp):

As the table shows, newbler performed much better with the 454 data than with the Ion data, even though the read datasets were as comparable (in length and coverage) as possible. The Ion data resulted in more partially assembled reads, more singletons, and more, shorter contigs. Interestingly, the total length (‘all contigs’) is very similar between the two assemblies. Both were 94 kbp short of the ‘real’ genome size of 4,639,675 bases, probably because of collapsed repeats.

This exercise is of course not completely fair: I am using an assembly program optimized for 454 reads with reads from another technology. However, the results show that when it comes to de novo genome assembly, one cannot simply replace 454 reads with Ion Torrent reads and expect the same results. Assembly software needs to be tuned to the platform-specific error model, even when the sequencing technologies are very similar (454 and Ion both measure homopolymer length, and both suffer from homopolymer length errors).

The next step for this comparison would be assemblies using other programs, especially those that say they are tailored towards Ion data, or both Ion and 454 (and other types). Candidates are Mira (free), the (commercial) ‘Floton assembler’ from Softgenetics, CLCBio (also commercial), and I probably missed a few more.

Code used

The sff file is called BEL-335 and was downloaded via http://lifetech-it.hosted.jivesoftware.com/docs/DOC-2492 (Ion Community login required).

Read length distribution (using sffinfo from the Roche/454 software suite):

sffinfo -s BEL-335.sff|fasta_length | awk '{x[$1]++}END{for (i in x){print i"\t"x[i]}}'| sort -n >read_length_hist.tsv

fasta_length is a simple perl script which spits out the length of each sequence in a fasta file, one length per line (available upon request).
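For those who do not want to wait for the perl script, a rough awk stand-in could look like the sketch below. This is not the actual fasta_length script, just an illustration of the same idea: it assumes plain fasta on standard input, with sequences possibly wrapped over several lines. The name fasta_length_awk is hypothetical.

```shell
# Hypothetical stand-in for fasta_length: print one sequence length per record.
fasta_length_awk() {
    awk '
        /^>/ { if (seen) print len; len = 0; seen = 1; next }  # header line starts a new record
             { len += length($0) }                              # sequence may span several lines
        END  { if (seen) print len }                            # flush the last record
    '
}

# Tiny example: two reads of 6 and 3 bases.
printf '>r1\nACGT\nAC\n>r2\nGGG\n' | fasta_length_awk
```

The output has the same shape the histogram one-liner above expects: one number per read, which the awk step then tallies per length.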

454 data: I used the files EBO6PME01.sff and EBO6PME02.sff, which were part of a demo GS FLX data set that came with the software DVD for newbler 1.1.03. To reset the trimpoints to 48 cycles (yielding a read length distribution comparable to the Ion reads), I used

sfffile -c 48 -o EBO6PME01_48_cycles.sff EBO6PME01.sff
sfffile -c 48 -o EBO6PME02_48_cycles.sff EBO6PME02.sff
sfffile -o EBO6PME_48_cycles.sff EBO6PME01_48_cycles.sff EBO6PME02_48_cycles.sff

The last command combines the reads into one file. The read length distribution was obtained as for the Ion reads (see above).

To get the final sff file for assembly, with the same number of bases as the Ion dataset:

sfffile -pick 54292729 -o EBO6PME_48_cycles_54Mbp.sff EBO6PME_48_cycles.sff
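One way to double-check that the picked file really holds about the intended number of bases is to sum the per-read lengths. The helper below is only a sketch (sum_lengths is a made-up name); it expects one length per line on standard input, which is what fasta_length produces.

```shell
# Sum a column of per-read lengths read from standard input.
# 'sum_lengths' is a hypothetical helper name, not part of any software suite.
sum_lengths() {
    awk '{ total += $1 } END { printf "%d\n", total }'
}

# Tiny example with three made-up read lengths.
printf '100\n250\n75\n' | sum_lengths   # prints 425
```

In practice one would feed it from the same pipeline as for the histogram, i.e. sffinfo -s on the picked sff file piped through fasta_length, and expect a total close to 54292729.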

Assemblies were run with:

runAssembly -o outputdir reads.sff

Metrics were obtained with my newblermetrics script.
