Nick Loman was kind enough to give me an advance copy of his paper in Nature Biotechnology entitled “Performance comparison of benchtop high-throughput sequencing platforms” (Loman et al., 2012). I thought I would present a quick summary of the paper here and add some comments of my own.
The paper sets out to “compare the performance of three sequencing platforms [Roche GS Junior, Ion Torrent PGM and Illumina MiSeq] by analysing data with commonly used assembly and analysis pipelines.” To do this, they chose a strain from the outbreak of food-borne illness caused by Shiga-toxin-producing E. coli O104:H4, which caused a lot of trouble in Germany about a year ago. The study is unique in that it focuses on the use of these instruments for de novo sequencing, not resequencing.
First, they used the ‘big brother’ of the GS Junior, the GS FLX, to generate a reference genome (combining long reads obtained using the GS FLX+, and mate pairs using Titanium chemistry). Then, the same strain was sequenced on the benchtop instruments, and these reads were compared to the reference assembly. The reads were compared both directly and after assembly with a few commonly used programs.
This is the first study for which the same strain was sequenced on three different instruments, which makes for an interesting comparison. So, what did they find? Ignoring run costs, throughput and all that for the moment, I’ll focus on the quality of the reads and assemblies first. The main findings regarding reads were:
- the reads from all platforms covered the genome very evenly
- read quality scores, recalculated based on alignment to the reference, showed the MiSeq to be superior
- Ion PGM raw read quality scores were underestimated, whereas those of the GS Junior and MiSeq were slightly overestimated
- Ion Torrent showed about four times more indel errors (when calculated per 100 bases) than the 454; these errors were practically absent from MiSeq data
- Ion showed significantly more homopolymer errors than 454
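To make "recalculated quality scores" and "indel errors per 100 bases" a bit more concrete, here is a minimal sketch of the arithmetic involved. It is not the authors' analysis code: the function names are my own, the numbers in the usage lines are made up for illustration, and it only assumes the standard Phred definition Q = -10 * log10(error rate).

```python
import math


def empirical_phred(errors: int, aligned_bases: int) -> float:
    """Phred-scaled quality from errors observed in alignments to a reference.

    Hypothetical helper: Q = -10 * log10(errors / aligned_bases).
    """
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    if errors <= 0:
        raise ValueError("need at least one observed error to estimate a rate")
    return -10.0 * math.log10(errors / aligned_bases)


def indels_per_100_bases(indels: int, aligned_bases: int) -> float:
    """Indel error rate expressed per 100 aligned bases."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return 100.0 * indels / aligned_bases


# Illustrative numbers only, not taken from the paper.
print(round(empirical_phred(errors=1, aligned_bases=1000), 1))
print(indels_per_100_bases(indels=15, aligned_bases=1000))
```

If the score computed this way is higher than the score the instrument reported for the same bases, the instrument underestimated its quality; if it is lower, the instrument overestimated it.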
Many people (including me) skip the supplementary sections of papers, but in this case I suggest paying particular attention to supplementary figure 3 for a comparison of indel errors relative to homopolymer length. Here, Ion Torrent doesn’t look so good for the longer homopolymers.
Having reads with errors is one thing, but in a way it is more interesting whether software exists that can overcome, or compensate for, those errors. The article describes thorough comparisons of assemblies, using Velvet, MIRA, CLC Assembly Cell, and newbler, with all the data that were generated. Here the main findings were:
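For readers less familiar with the term: a homopolymer is simply a run of the same base, such as AAAA. A small sketch of how one could locate such runs in a sequence is shown below. The function name and the example sequence are my own and have nothing to do with the scripts used in the paper.

```python
from itertools import groupby


def homopolymer_runs(seq: str, min_length: int = 2):
    """Return (start, base, length) for every run of identical bases
    that is at least min_length long. Positions are 0-based."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_length:
            runs.append((pos, base, length))
        pos += length
    return runs


# Made-up example sequence: reports the GGG run and the AAAA run.
print(homopolymer_runs("ACGGGTTAAAAC", min_length=3))
```

Binning indel errors by the length of the run they fall in gives the kind of comparison the supplementary figure shows.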
- the least fragmented assemblies were generated by combining two GS Junior runs (assembly with MIRA and newbler), or by using the paired end information in the MiSeq reads (assembly with CLC or Velvet)
- assemblies with Ion Torrent data lagged behind and were much more fragmented
- other combinations (single 454 runs, not using paired end information) resulted in fragmented assemblies as well
- GS Junior based assemblies covered more of the reference genome than the others
- the newbler program (GS Junior and Ion PGM data) showed the fewest misassemblies
- although MIRA contigs covered more of the reference, they showed more misassemblies
- Ion PGM based assemblies showed many more gaps due to homopolymer errors, even for short ones (2-3 mers), than GS Junior based assemblies
- newbler showed fewer homopolymer-associated gaps for 454 data than MIRA
- with Ion PGM data, newbler handles correcting read errors in homopolymer tracts >3 bases better than MIRA; for short homopolymer tracts, the situation is reversed
- homopolymer-associated gaps were basically absent in the MiSeq based assemblies
- gaps not associated with homopolymers were present in about equal number regardless of dataset or assembler
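A note on "fragmented": this post does not say which statistic the paper used to measure fragmentation, but a common one for assemblies is the N50, the contig length at which contigs of that size or longer contain at least half of the assembled bases. A minimal sketch, with my own function name and invented contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of a set of contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs given")


# Invented lengths: same total size, very different fragmentation.
print(n50([100, 80, 50, 30, 20, 10]))
print(n50([10] * 29))
```

A more fragmented assembly of the same genome gives a lower N50, which is why the number is often quoted next to the contig count.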
One potential problem with these analyses is that the reference genome was generated using 454 technology and newbler, which could give the GS Junior, and especially its newbler assemblies, an advantage. When I asked Nick Loman about this, he pointed to supplementary tables 1-3, which show results of the mapping of reads against two other reference genomes, one generated with PacBio+Illumina, and the other with Sanger reads. The indel error rates for these mappings are very similar, indicating that the choice of reference should not matter here. I tend to agree, but it would still be interesting to check the homopolymer-associated gaps of the assemblies against these other references. Finally, the runtimes of the different assemblies were unfortunately not included in the article. This would have been useful information for those of us interested in the fastest results.
Perhaps more important than the correctness of an assembly is the biological value it has. To that end, the assemblies generated were tested for the presence of 31 pathogenetically important protein coding genes, and 7 genes used for MLST typing:
- the MiSeq-based assemblies were more complete (at least 92% of the genes found full-length) than those based on the other technologies (less than 87% found)
- there were clear differences between the different assembly programs in how well they reconstructed gene space; for example, Velvet was much worse than CLC and MIRA at assembling a certain type of gene from Illumina data
So, which instrument is best? As usual, there cannot be a single winner, as what is defined as ‘best’ depends on the requirements one has. The authors also don’t pick a winner, which I fully understand. For my own sake, and that of the readers of this blog, I thought I would summarise the specs of the instruments, and the findings of the paper, in a table. I placed an ‘X’ for each instrument that performed ‘best’ for the particular feature:
- if you want fast and cheap reads from a cheap instrument, buy an Ion PGM (but you’ll get ‘dirty’ data)
- if you want lots of high quality, short read data without too much hassle in the lab, choose a MiSeq (but be prepared to wait for the data)
- if you want long reads for high quality assemblies, use a GS Junior (but be prepared to pay a lot)
I applaud Nick and coauthors for making all raw data, assemblies, analysis scripts and results available online: through the Sequence Read Archive, and a github repository.