As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument (different from last time) seems to be missing, hang on, I’m coming back to that…
Notable changes from the June 2015 edition
I added the Illumina MiniSeq
I added the Oxford Nanopore MinION. The read length for this instrument was based on the specifications for maximal output and number of reads from the company’s website. The two data points represent ‘regular’ and ‘fast’ modes.
I added the IonTorrent S5 and S5XL. You may notice that the line for this instrument has a downward slope. This is because the 400 bp reads are only available on the 520 and 530 chips, not on the higher-throughput 540 chip, which makes the maximum throughput for this read length lower than for the 200 bp reads.
As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument seems to be missing, hang on, I’m coming back to that…
I attended, for the first time, the Advances in Genome Biology and Technology (AGBT) meeting in Florida. With this post, I intend to summarise my experiences of the meeting. I will not cover everything that happened at the meeting, but focus on the areas of my own interest.
With this post I present a figure I’ve been working on for a while now. With it, I try to summarise the developments in (next generation) sequencing, or at least a few aspects of it. I’ve been digging around the internet to find the throughput metrics for the different platforms since their first instrument version came out. I’ve summarised my findings in the table at the end of this post. Then, I visualised the results by plotting throughput in raw bases versus read length in the graph below.
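As a rough illustration of what goes into such a plot, here is a minimal Python sketch. It assumes that raw throughput can be approximated as the number of reads times the read length, which is a simplification of mine; the function names are hypothetical and no real instrument figures are included.

```python
import math


def throughput_gb(n_reads, read_length):
    """Approximate run throughput in gigabases (billion bases).

    Assumption: every read has the same length, so the raw output
    is simply n_reads * read_length.
    """
    return n_reads * read_length / 1e9


def log_coordinates(read_length, throughput):
    """Return the (x, y) position of a data point on log10-log10 axes."""
    return math.log10(read_length), math.log10(throughput)
```

On log axes, a tenfold increase in either read length or throughput moves a point by the same distance, which is what makes instruments that differ by orders of magnitude fit in one figure.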
A potential user (‘customer’) of our sequencing platform asked how to generate reference genomes for his 4 bacterial strains. His question inspired me to write this post. The suggestions below are not absolute, just my thoughts on how one these days could go about sequencing a bacterial genome using one or more of the sequencing platforms. I would appreciate any feedback/suggestions in the comments section!
Option 1: bits and pieces
Libraries: paired end or single end sequencing
Platform: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior
Image from Wikimedia Commons. (Buzz was a low-cost airline based at London Stansted operating services to Europe. It was sold to Ryanair.)
I am not attending the American Society of Human Genetics meeting in San Francisco, but can’t escape the buzz it creates on Twitter (hashtag #ashg2012). Strikingly, it is almost another AGBT when it comes to announcements from companies selling sequencing instruments. All of them had something new to bring to the floor. This post summarizes what I picked up from Twitter and a few websites, and I give a bit of my perspective on the respective announcements. I am focussing on technology improvements, especially with regard to read lengths, not so much on applications such as cancer resequencing panels.
After having given an overview of the PacBio error-correction (PacBioToCA) pipeline of Koren et al (see previous blog post), it was interesting to see another paper coming out that describes combining PacBio and Illumina for assembling bacterial genomes: Ribeiro et al, “Finished bacterial genomes from shotgun sequence data”, Genome Research accepted preprint. The authors are all from the Broad Institute. David Jaffe was so kind as to provide me with the supplementary material file, which is not yet available online. In this post, I will summarise the ALLPATHS_LG paper, and contrast the approach with the PacBioToCA pipeline.
The Broad Institute of MIT/Harvard is an impressive genome centre. Sorry, ‘Genomic Medicine Center’. One of their achievements in recent years is an optimised pipeline for assembly of (small and) large genomes based on short read (Illumina) data. The software program they developed, ALLPATHS_LG, combined with a special kind of ‘recipe’, has proven to be extremely successful. The Broad keeps churning out very respectable assemblies of, among other groups, large complex eukaryotes. In order to be able to handle large numbers of genome projects, I get the impression that the Broad Institute aims for standardisation and optimisation of protocols and what they call ‘recipes’. The idea is that, if you follow their recipes and use their programs, you are more or less guaranteed an optimal result. In the case of ALLPATHS_LG, it also means, however, that one is required to have at least one Illumina jumping (‘mate pair’) library and a short-insert (‘paired-end’) Illumina library for which the insert size is shorter than twice the read length. Paired end libraries with slightly larger insert sizes are standard, making the ALLPATHS_LG requirements kind of uncommon. It also means that not many projects are able to compare other programs with ALLPATHS_LG without generating an extra read dataset (I speak from experience here…).
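To make the short-insert requirement concrete, here is a minimal Python sketch. It assumes that ‘insert size’ means the full length of the sequenced fragment and that both reads of a pair have the same length; the function names are my own and are not part of ALLPATHS_LG.

```python
def pairs_overlap(read_length, insert_size):
    """True if the two reads of a pair overlap each other.

    Assumption: insert_size is the full fragment length, so the reads
    overlap exactly when the fragment is shorter than twice the read length.
    """
    return insert_size < 2 * read_length


def overlap_length(read_length, insert_size):
    """Number of fragment bases covered by both reads (0 if no overlap)."""
    # If the fragment is shorter than a single read, both reads cover
    # the whole fragment, so the overlap cannot exceed insert_size.
    return max(0, min(insert_size, 2 * read_length - insert_size))
```

For example, 100 bp reads from a 180 bp fragment overlap by 20 bases, whereas the same reads from a more typical 300 bp fragment do not overlap at all.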
The Ribeiro et al paper is entirely focussing on microbial size genomes, a big difference with the Koren et al (PacBioToCA) paper. The short-read recipe for ALLPATHS_LG remains unchanged: 50x coverage short-insert (up to 220bp) paired reads and 50x coverage jumping (2-10kb) reads. This is then supplemented with 50x PacBio reads, with library insert sizes between 1 and 3 kb (although for the paper, they also created two larger insert libraries for two of the strains, at 6 and 10kb respectively). In total, there are three size ranges represented this way: short – overlapping paired reads, intermediate – PacBio reads, long – illumina mate pairs. Finally, the short reads have high quality relative to the PacBio reads.
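To get a feeling for what 50x coverage means in number of reads, here is a small Python sketch. It uses the usual approximation that coverage equals the number of reads times the read length divided by the genome size; the function name is hypothetical, and the 5 Mb genome in the example below is just an illustrative bacterial-scale value, not a figure from the paper.

```python
def reads_for_coverage(genome_size, coverage, read_length):
    """Smallest number of reads that reaches the requested coverage.

    Assumption: coverage = n_reads * read_length / genome_size,
    with all reads having the same length.
    """
    total_bases = genome_size * coverage
    # Integer ceiling division, so the result is rounded up to whole reads.
    return (total_bases + read_length - 1) // read_length
```

With these assumptions, `reads_for_coverage(5_000_000, 50, 100)` gives 2,500,000 reads, that is, 1,250,000 read pairs for a paired library.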
Nick Loman was kind enough to give me an advance copy of his paper in Nature Biotechnology entitled “Performance comparison of benchtop high-throughput sequencing platforms” (Loman et al, 2012). I thought to present a quick summary of the paper here and add some comments of my own.
The paper sets out to “compare the performance of three sequencing platforms [Roche GS Junior, Ion Torrent PGM and Illumina MiSeq] by analysing data with commonly used assembly and analysis pipelines.” To do this, they chose a strain from the outbreak of food-borne illness caused by Shiga-toxin-producing E. coli O104:H4, which caused a lot of trouble in Germany about a year ago. The study is unique in that it focuses on the use of these instruments for de novo sequencing, not resequencing.
First, they used the ‘big brother’ of the GS Junior, the GS FLX, to generate a reference genome (combining long reads obtained using the GS FLX+, and mate pairs using Titanium chemistry). Then, the same strains were sequenced on the benchtop instruments, and these reads were compared to the reference assembly. The reads were both compared directly, and after assembly with a few commonly used programs.
For three platforms, reads longer than those commercially available, and/or from not-yet-released instruments, have become accessible online. By online, I mean that we can all download these data to have a look at them:
1) MiSeq 2x 150 bases runs
As part of the German E. coli (EHEC) ‘Crowdsourcing Project’, Illumina sequenced five strains for the UK Health Protection Agency. The fastq files can be downloaded from http://www.hpa-bioinformatics.org.uk/lgp/genomes. These are the first data in the public domain from a MiSeq!
See also this post on GenomeWeb.
2) IonTorrent 316 chip
Keith Robison shares a bit of info on data from an Ion 316 chip on his ‘Omics! Omics!’ blog: “1.69M reads, with 1.53M of those >=50 bp long and 1.07M 100bp or longer”:
I downloaded the run files, and quickly looked at the read length distribution of the trimmed reads in the sff file (which listed 260 flows, 40 more than the file I analyzed in my previous post), showing a peak exactly one base longer at 109 bases. So, many more reads but not much gain in length (yet). Note the strange shape of the peak:
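A quick look at a read length distribution like this can be sketched in a few lines of Python. The sketch assumes the trimmed read lengths have already been extracted from the sff file into a list of integers (the extraction itself is not shown), and the function names are hypothetical.

```python
from collections import Counter


def modal_length(lengths):
    """Read length at the peak of the distribution.

    Ties are resolved in favour of the shortest length.
    """
    counts = Counter(lengths)
    return min(counts, key=lambda length: (-counts[length], length))


def fraction_at_least(lengths, threshold):
    """Fraction of reads that are at least `threshold` bases long."""
    if not lengths:
        return 0.0
    return sum(1 for length in lengths if length >= threshold) / len(lengths)
```

The same `fraction_at_least` helper could be used to check summary figures such as the share of reads of 50 bp or longer.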