IonTorrent: longer reads, longer homopolymers?

A quick summary for the impatient:

An analysis of the homopolymer distribution of the recently released ‘longer’ Ion Torrent reads indicates possibly significant over-calling of homopolymer lengths towards the ends of the reads. Trimming off the read ends, however, only marginally improved de novo assembly of the reads with newbler.

Life recently released ‘long’ IonTorrent reads (B14_387, a resequencing of E. coli strain DH10B, available through the Ion Community here). There is an accompanying application note that brags about the reads’ accuracy, especially compared to reads from the MiSeq platform. These accuracy measurements are, logically, based on alignment to a reference genome.

But what about de novo assembly? Thing is, the dataset presented, with a peak length of 241 bases (see below) and 350,000 reads, is quite similar to what a full plate of GS FLX gave you in 2007 (peak length 250 bases, 400,000 reads). And 454 reads were very useful at the time for de novo assembly (in fact, besides Sanger reads, the only reads available for this purpose).

Nick Loman had a stab at using these reads for assembly with newbler, the program developed by 454 Life Sciences, on his blog. Compared to the shorter Ion reads, he reported much worse results for the longer ones. On Twitter, he wrote that the assembly programs MIRA and CLC Bio also performed really badly with these reads.

So I wondered what the problem could be with these reads, which presumably map with good accuracy but perform poorly for assembly. For comparison, I generated a similar 454 GS FLX dataset: 350,000 reads with a peak length of 250 bases. I used sff files that came with the software installation discs of newbler version 1.1; these represent a resequencing run of E. coli K12 strain MG1655. I randomly selected exactly the same number of reads as are in the Ion Torrent B14_387 dataset.

First, I looked into the sff file and noticed that the run had 520 flows, twice what was used earlier. Using the ‘2.5 bases per four flows’ calculation I described before, this translates into a potential 325 bases. The flow order was reported as TACG all the way. In a run I discussed earlier, I noticed a 32-base repeated flow order which presumably yields better results, but this flow order was apparently not used for this run either (thanks to Nils Homer for pointing out my oversight here, I’ve ordered new contact lenses…).
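The arithmetic behind that estimate is simple; here is a minimal Perl sketch of the ‘2.5 bases per four flows’ rule of thumb (the flow count is the only input, and the rule assumes random sequence):

#!/usr/bin/perl
# Back-of-the-envelope potential read length from the flow count,
# using the '2.5 bases per four flows' rule of thumb.
use strict;
use warnings;

my $flows     = 520;                 # number of flows reported in the sff header
my $potential = $flows / 4 * 2.5;    # ~2.5 bases per TACG cycle of four flows
print "$flows flows -> potentially $potential bases\n";    # prints 325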

I then took a closer look at the read length distributions of both datasets:

The Ion reads peak slightly lower than the 454 reads (241 and 253 bases, respectively). But they also show a strange peak around 400 bases. On the Ion Community forums, this peak was discussed, but I couldn’t find a satisfactory explanation for it.

For problems with data generated by both 454 and Ion Torrent, the usual suspect is homopolymers, consecutive runs of the same base, due to the way the bases are read (basically, not as single bases, but as the lengths of homopolymer stretches). I calculated the frequencies of homopolymers of all lengths present in the Ion long reads dataset, the B13_328 dataset (a 100-bases run), the 454 dataset, and the E. coli K12 MG1655 genome (both forward and reverse strands); the results for DH10B were identical to MG1655 and are therefore not shown. See the graph.

In the figure, the counts of homopolymers of lengths 1-11 are plotted, with the Y-axis on a log scale. A homopolymer length of 1 represents all single bases (monomers), length 2 represents all occurrences of AA, CC, GG and TT, length 3 all AAA, CCC, GGG and TTTs, etc.

The line representing the E. coli genome (the lowest one, as it represents only 2x coverage, while the run data have much higher coverage) shows a steady (exponential) decline in the counts, accelerating beyond 8-mers. The 454 reads nicely follow this line and show the same pattern. The Ion reads, on the other hand, deviate significantly: they show a slowing down of the decline beyond 8-mers. So, it looks like the Ion reads have significantly more long homopolymers than expected based on the composition of the genome. This could mean that there is profound over-calling of homopolymers in these reads.

(As an aside, the Ion B14_387 long reads had no homopolymers beyond 11-mers, the B13_328 short run had a few above that, up to 23-mers, and the 454 data showed very few above 10-mers but maxed out at a whopping 31-mer (an artifact, perhaps?). The E. coli genome has no homopolymers longer than 10-mers.)
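As a quick way to reproduce such maxima, here is a minimal Perl sketch (separate from the counting script given further below) that reports the longest homopolymer in a FASTA file; run it as, for example, perl longest_hompol.pl genome.fasta (hypothetical file names):

#!/usr/bin/perl
# Report the longest homopolymer run in a (multi-)FASTA file,
# e.g. the E. coli genome or a FASTA dump of the reads.
use strict;
use warnings;

local $/ = ">";    # read the FASTA file entry by entry
<>;                # skip the empty first 'record'
my $max = 0;
while (<>) {
    chomp;
    my ($header, @lines) = split /\n/, $_;
    my $seq = uc join "", @lines;
    # match each maximal run of a single character
    while ($seq =~ /((.)\2*)/g) {
        next if $2 eq 'N';
        $max = length($1) if length($1) > $max;
    }
}
print "longest homopolymer: $max\n";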

I next tested whether these longer homopolymers were located uniformly along the reads, or were perhaps clustered towards the ends. To this end, I counted the starting positions of all homopolymers in each read, for the 454 and B14 long read datasets.

The 454 reads, above, show a uniform distribution of homopolymers of the different lengths along the read, up to about 320 bases. The pattern for the last bases is a result of the very few reads that reach this length, so that N is very low for these positions.

The long Ion reads, on the other hand, show a significant reduction in the fraction of 1-mers beyond base 230, and an increase in the 2-mers and longer homopolymers. The very long reads (the peak around 400 nt) show the opposite: a reduction in the longer homopolymers in favor of monomers.

In an analysis of the test fragments spiked into the IonTorrent runs, lek2k showed on his biolektures blog (posts here and here) both undercalls and overcalls for two-mers in the test fragments, deep into the reads [thanks to lek2k for correcting me on this, see his interesting comment below]. My results indicate that overcalls are the bigger of the two problems.

These analyses indicate (but in no way prove) that homopolymer overcalls could contribute significantly to lower accuracy towards the ends of the longer IonTorrent reads. Does this problem, then, explain the poor assembly performance of these reads?

To test this, I created new sff files for both the B14_387 and the 454 dataset, with the reads trimmed at 230 bases. I then performed assemblies of the original and the retrimmed files, using newbler 2.6 with default settings.

(‘Large’ contigs are those of at least 500 bp.) The table shows not much difference between the full-length 454 reads and those trimmed at 230 bases: only a few more contigs and a 20% lower N50 for the latter assembly. The full-length Ion reads assembly is similar to the one Nick Loman produced: three times more contigs and an N50 that is a fifth of that of the full-length 454 assembly. Note that many Ion reads are only partially assembled (meaning that a part of them could be used for alignment against the other reads, while the rest was too different), whereas close to 99% of the 454 reads are fully assembled. Finally, the assembly using the trimmed Ion reads is only marginally better.

In conclusion, these results indicate that I have found a problem, but one that most likely cannot fully explain the poor performance of the long Ion reads for assembly. Also, GS FLX reads (peak length 250 bases), which have been around for more than four years now, perform significantly better for de novo assembly than Ion reads of comparable length.

Finally, I’d be interested to see an analysis of the homopolymer errors based on mapping the reads to the reference genome for both these datasets. Since this is something that would take me way too much time, I hope somebody else will do this analysis (I can help if needed, and you’re welcome to write a guest post on my blog if you want)!

Code used
Randomly picking 350109 reads from the 454 sff files, to get the same number of reads as in the IonTorrent B14_387 run:
sfffile -pickr 350109 -o EBO6PME_350k_reads.sff EBO6PME0*.sff

Read length distributions
(see also this post):
sffinfo -s reads.sff | fasta_length | awk '{x[$1]++} END {for (i in x) {print i"\t"x[i]}}' | sort -n > read_length_hist.tsv

Homopolymer counts
I used the following script (thanks to ‘Deep’ for pointing out a bug in this script; see the Comments section. The results changed slightly after fixing the bug, but this did not influence the conclusions. Below is the corrected version):

#!/usr/bin/perl

use strict;
use warnings;

my $seen;          # previous base, for tracking homopolymers
my $i;             # running length of the current homopolymer
my $max = 0;       # longest homopolymer found
my @hompol;        # homopolymer counts: $hompol[$bases{"A"}][3] is the count of AAA runs
my %bases = (A => 1, C => 2, G => 3, T => 4);

# set the record separator to the '>' symbol
$/ = ">";
<>;                # remove the empty first 'sequence'

# loop over the sequences
while (<>) {
    $seen = "";
    $i = 1;
    # remove the trailing '>' symbol
    chomp;
    # split the entry into individual lines on the newline character
    my @lines = split /\n/, $_;
    # the header is the first line (now without the '>' symbol)
    my $header = shift @lines;
    # the sequence is the rest
    my $seq = join "", @lines;

    # loop over the bases
    for (split //, uc $seq) {
        # skip N's
        next if /N/;
        # when the current base is the same as the previous one
        if ($seen eq $_) {
            $i++;
        }
        # otherwise, the homopolymer run has ended
        else {
            $hompol[$bases{$seen}][$i]++ if $seen ne "";
            $max = $i if $i > $max;
            $i = 1;
        }
        $seen = $_;
    }
    # do not forget the last homopolymer
    $hompol[$bases{$seen}][$i]++ if $seen ne "";
    $max = $i if $i > $max;
}
$/ = "\n";         # reset the record separator

# output: homopolymer lengths in rows, bases in columns
print "\tA\tC\tG\tT\n";
# homopolymer length loop
for $i (1 .. $max) {
    print $i;
    # bases loop
    for my $j (1 .. 4) {
        print "\t", $hompol[$j][$i] || 0;
    }
    print "\n";
}

This produces a table with homopolymer lengths in rows, bases in columns, and the counts as values in the cells. I then summed each row over the four bases to get the totals per homopolymer length, and plotted these in Excel.
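For those who prefer the command line over Excel, the same row totals can be computed with a Perl one-liner; a minimal sketch, assuming the script’s output was saved as hompol_counts.tsv (a hypothetical file name):

perl -lane 'print "$F[0]\t", $F[1] + $F[2] + $F[3] + $F[4] if $. > 1' hompol_counts.tsv > hompol_totals.tsv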

Homopolymer starting positions
For this, I used a modified version of the above script (thanks to ‘terry’ for pointing out a bug in this script; see the Comments section. Again, the results changed very slightly after fixing the bug, and again this did not influence the conclusions. Below is the corrected version):

#!/usr/bin/perl

use strict;
use warnings;

my $seen;            # previous base, for tracking homopolymers
my $i;               # running length of the current homopolymer
my $pos;             # running counter of the base position
my $max_hpol = 0;    # maximum homopolymer length encountered
my $max_len  = 0;    # maximum sequence length encountered
my @poscounts;       # starting position counts: $poscounts[basepos][hompollength] = count

$/ = ">";            # set the record separator to the '>' symbol
<>;                  # remove the empty first 'sequence'

# loop over the sequences
while (<>) {
    $seen = "";
    $i = 1;
    $pos = 0;
    chomp;                        # remove the trailing '>' symbol
    my @lines = split /\n/, $_;   # split the entry into individual lines on the newline character
    my $header = shift @lines;    # the header is the first line (now without the '>' symbol)
    my $seq = join "", @lines;    # the sequence is the rest
    $max_len = length($seq) if length($seq) > $max_len;

    # loop over the bases
    for (split //, uc $seq) {
        $pos++;
        # skip N's
        next if /N/;
        # when the current base is the same as the previous one
        if ($seen eq $_) {
            $i++;
        }
        # otherwise, the homopolymer run has ended
        else {
            # $pos is the position of the current base, one beyond the last homopolymer
            # $i is the length of the last homopolymer
            # $pos - $i is therefore the starting base of the last homopolymer
            $poscounts[$pos - $i][$i]++ if $seen ne "";
            $max_hpol = $i if $i > $max_hpol;
            $i = 1;
        }
        $seen = $_;
    }
    # do not forget the last homopolymer
    # ($pos is moved one beyond the end so the same $pos - $i formula applies)
    $pos++;
    $poscounts[$pos - $i][$i]++ if $seen ne "";
    $max_hpol = $i if $i > $max_hpol;
}
$/ = "\n";           # reset the record separator

# output: header row with the homopolymer lengths
print join "\t", ("", 1 .. $max_hpol);
print "\n";
# base position loop
for $i (1 .. $max_len) {
    print $i;
    # homopolymer length loop
    for my $j (1 .. $max_hpol) {
        print "\t", $poscounts[$i][$j] || 0;
    }
    print "\n";
}

The output of this script is a table with base positions in rows, homopolymer lengths in columns, and the counts as values in the cells.
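To get from these absolute counts to the per-position fractions shown in the plots above, each row can be normalized by its row total; again a minimal Perl sketch, with hypothetical file names:

perl -lane '
    if ($. == 1) { print; next }    # pass the header line through
    my ($pos, @counts) = @F;
    my $total = 0;
    $total += $_ for @counts;
    print join "\t", $pos,
        map { $total ? sprintf("%.4f", $_ / $total) : 0 } @counts;
' poscounts.tsv > posfractions.tsv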

Resetting the trimpoints to 230 bases
First, I created a trimpoint file:

sffinfo R_2011_07_19_20_05_38_user_B14-387-r121336-314_pool30-ms_B14-387_cafie_0.05.sff|awk '{
if ($0 ~ />/){id=$0}
if ($0 ~/Clip Qual Right/){l=$4; if(l>230){l=230}}
if ($0 ~/Clip Adap Left/){print id" 5 "l}
}' |sed 's/>//' >B14_387_cafie_0.05_new_trimpoints_max_230.txt
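The resulting file has one read per line: the accession, the left trimpoint (hardcoded to 5 in the awk above) and the new right trimpoint, along these lines (hypothetical read name):

EXAMPLE0001 5 230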

Then, I used the sfffile command to generate a new sff file, with new right trimpoints for those reads that are over 230 bases:

sfffile -o  B14_387_cafie_0.05_new_trimpoints_max_230.sff -tr B14_387_cafie_0.05_new_trimpoints_max_230.txt R_2011_07_19_20_05_38_user_B14-387-r121336-314_pool30-ms_B14-387_cafie_0.05.sff

Assemblies
All assemblies were done with newbler 2.6 with default parameters, e.g.

runAssembly -o trim_at_230 B14_387_cafie_0.05_new_trimpoints_max_230.sff

For the summary of the assembly statistics, I used my newblermetrics script.


20 thoughts on “IonTorrent: longer reads, longer homopolymers?”

  1. Pingback: Ion Torrent: What is the impact of the new longer reads on assembly?

  2. Awesome work. I can appreciate the hard work that went into each figure. Thanks for the mentions :o)

    [..]lek2k showed on his biolektures blog (posts here and here) both undercalls and overcalls for two-mers in the test fragments for longer reads.[..]

    I think it would be better worded as “deep into a read” instead of “longer reads”. Readers may get confused that this was based on the longer read data set, which is the main topic of this post.

    It’s a very interesting point that the homopolymers occur towards the end of the read. I think it is due to over-amplification of the signal to compensate for a weaker signal; this would unfortunately amplify the noise as well. It is also due to a breakdown in the background modelling used in the signal processing. They used v1.55 code to perform the analysis, which, based on the version number, would not have changed much since v1.52 (released with Torrent Suite v1.4). That code was optimized for 65 cycles (260 flows) and not for 130 cycles. In that run they only changed the CR parameter (to 0.05), which produces a more stringent filter, resulting in an increase in poor-signal reads (45%).

    I am confident that in the coming weeks a new version of the Analysis software will address this.

    Great post and I can’t wait to read your future ones :o) I might take you up on the offer “I’d be interested to see an analysis of the homopolymer errors based on mapping the reads to the reference genome for both these datasets.”, since I’m very interested in analyzing a 454 data set but have no access :o(

    • Thanks for your comments, I’ve changed the wording accordingly. I’d be more than happy to share the 454 data with you. If a near-future software release addresses these problems, it looks like Ion perhaps released these data too early?
      Keep on blogging, yourself!

  3. Congrats on the Nature publication!
    I really like the look and feel of your graphs. What program did you use to graph?

    • Thanks!
      For graphs, we use different programs, both R and SigmaPlot. Editing was mostly done in Adobe Illustrator. There is one Excel graph in the supplementary material, as well as some PowerPoint slides (sometimes the easy way out works just as well 🙂 )

  4. Hi,
    This is a good analysis. I wanted to test something along the same lines. I took your code on homopolymers and may have found a bug.
    As a simple test, below is the input data:
    >R1 length=42
    ATTACCGCGGCTGCTGGCACGTAGTTAGCCGTCCTTCTGTAG
    >R2 length=46
    ATTACCGCGGCTGCTGGCACGTAGTTAGCCGTCCTTCTGTAGTACG
    >R3 length=46
    ATTACCGCGGCTGCTGGCACGTAGTTAGCCGTCCCTTTCTCGGTAA
    The above is amplicon data, so primers are present and therefore the first few bases are identical.
    If I run the Perl script, the output is:
    A C G T
    1 18 20 21 19
    2 0 8 7 8
    3 0 1 0 1

    However, as a sanity check:
    grep -v ">" delme | awk 'BEGIN{a=0} {a+=gsub(/A/,"")} END{print "A="a}'
    A=20
    The table should instead be:
    A C G T
    1 18 20 23 19
    2 1 8 7 8
    3 0 1 0 1

    • Thanks for pointing out this bug in my script (this is why I love blogging, it’s like asking the whole world to be your reviewer).

      If you look in the second homopolymer script, the one that counts the starting positions, near the end it says ‘# do not forget the last homopolymer’. This was missing from the first script, hence the bug. For the analysis of a whole genome this makes a negligible difference, but for the reads, not counting the very last homopolymer could in fact have an effect. So, I reran the counts for the read datasets. For the IonTorrent long reads, there were a few percent more of the longest homopolymers (consistent with the fact that these are more dominant at the ends of the reads); for the 454 reads there was less than 1% difference, regardless of homopolymer length.

      I have updated the script, but left the figure and text as they were, as the basic results are still the same. But once again, thanks for letting me know, really appreciated!

      • Hi,
        Yes, I should have also fixed the problem rather than just pointing it out. Thanks for updating the script.
        Another thing that makes me curious: with 454 Titanium, assuming the 2.5 bases per four flows rule, the expectation is a read length of ~500 bases. But you will see longer reads than that. In a good run the average is normally ~450. And looking at the reads beyond 500 bases, the base quality is not bad.
        Have you looked into this? It is worth investigating, especially in the case of metagenomic data sets. Why are we seeing reads of more than 500 bases: are they just because of homopolymers, either repetitions of one base or the occurrence of TACG (the flow order of the sequencer) in the bacterial genomes?

        With amplicon data, and after quality filtering, the read lengths are around ~350-400 bases, which is again very curious, as with 454 you would expect more good-quality long reads.

      • On your first question: I have not looked into this specifically. One reason for the read length distributions you see is purely the effect of sequencing with the TACG flow order. If you chop up a known genomic reference sequence and simulate sequencing it with 200 cycles of TACG flows, you will get a peak around 500 nt. So, a longer read would be due to, on average, longer homopolymers, or to more bases ‘following’ the flow order, e.g. in the form [T]n[A]n[C]n[G]n rather than, for example, CACACA (do you see why?). What would be interesting is to compare the real reads with ‘ideal’ reads generated from the reference, and see where the real reads stop, i.e. are they shorter than they should be, and why?
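        For the curious, this flow-order effect is easy to simulate; below is a rough Perl sketch under the assumption of ideal, noise-free flows over a random fragment (no attempt to model real signal):

        #!/usr/bin/perl
        # Rough simulation of the flow-order effect: how many bases of a template
        # would ideal (noise-free) flow sequencing cover in a given number of flows?
        use strict;
        use warnings;

        sub bases_covered {
            my ($seq, $nflows) = @_;
            my @order   = qw(T A C G);
            my $covered = 0;
            for my $f (0 .. $nflows - 1) {
                my $base = $order[$f % 4];
                # an ideal flow reads an entire homopolymer in one go
                $covered++ while $covered < length($seq)
                              && substr($seq, $covered, 1) eq $base;
                last if $covered >= length($seq);
            }
            return $covered;
        }

        # a random 1000-base fragment and 800 flows (200 TACG cycles)
        my $frag = join "", map { (qw(A C G T))[int rand 4] } 1 .. 1000;
        print bases_covered($frag, 800), " bases covered in 800 flows\n";    # ~500 on average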

        On amplicon sequencing: this is, I think, harder to do, as the reads are so similar that the image processing has to be tuned towards that. This is also the reason one has to load fewer beads on a plate for amplicon projects (and thus gets fewer reads), and why they cannot be mixed with shotgun projects on the same plate…

    • Thanks. I agree; I deliberately set myself the goal of adding code and commands as an exercise in ‘open science’, and so far I like how it works out. One bug has already been fixed!

      I also would very much like to somehow include all code and ‘lab journal’ stuff in my next publication(s). I like the approach of Galaxy/Taverna etc., where workflows can be recorded, shared and repeated. Hurray for open science and reproducibility!

  5. Pingback: Ion Torrent – Rapid Accuracy Improvements | BioLektures

  6. Please let me know: what is the difference (if any) between 454 and Ion Torrent sff files? Do they contain exactly the same information?
    Thanks,
    A

    • They basically contain the same information, but I haven’t done a thorough enough comparison… At least you should be able to do the same ‘things’ with sff files from both platforms (e.g. use them for downstream analysis with newbler).

  7. Your results and explanations are very good; I am very interested.
    But in the script for homopolymer starting positions, I think you should add “$pos++;” between line 45 and line 46, i.e. just before counting the last homopolymer.

    • Oh dear, another bug 🙂 Thanks, fixed accordingly! I checked, and the absolute numbers changed slightly after the fix, but the plots look nearly identical. Don’t you just love ‘post-publication peer review’?
