As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument (different from last time) seems to be missing. Hang on, I’m coming back to that…
Notable changes from the June 2015 edition
I added the Illumina MiniSeq
I added the Oxford Nanopore MinION. The read length for this instrument was based on the specifications for maximal output and number of reads from the company’s website. The two data points represent ‘regular’ and ‘fast’ modes.
I added the IonTorrent S5 and S5XL. You may notice that the line for this instrument has a downward slope. This is because the 400 bp reads are only available on the 520 and 530 chips, not on the higher-throughput 540 chip, so the maximum throughput for this read length is lower than for the 200 bp reads.
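The MinION read length mentioned above comes from dividing the maximal output by the number of reads. A minimal sketch of that arithmetic follows; the function name and the example numbers are placeholders of mine, not the company's actual specifications.

```python
def mean_read_length(total_output_bp: float, n_reads: int) -> float:
    """Average read length implied by a run's total output and read count."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return total_output_bp / n_reads


# Hypothetical example values, NOT vendor specifications:
# 1 gigabase of output spread over 200,000 reads.
example_output_bp = 1_000_000_000
example_n_reads = 200_000

print(mean_read_length(example_output_bp, example_n_reads))
```

Note that this gives a mean, so it says nothing about how the read lengths are distributed.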
I get asked about this a lot, so I thought I would put together a quick blog post on it.
Disclaimer: this is the advice I usually give people and is given without warranty. As they say, Your Mileage May Vary.
Main advice: bite the bullet and get the budget to get 100x coverage in long PacBio reads. 50-60x is really the minimum. Detailed advice:
Sequencing and assembly
get 100x PacBio with the latest chemistry, aiming for the longest reads (make sure the provider has a Sage Science BluePippin or something similar)
get 100x HiSeq paired-end, regular insert
run PBcR on the PacBio reads; this is part of the Celera Assembler. It corrects the longest raw reads and assembles them using Celera (long run time). Make sure to install the latest Celera release, which uses the much faster MHAP approach for the correction.
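To turn the coverage targets above into a sequencing budget, multiply the genome size by the desired coverage. Here is a minimal sketch of that calculation; the function names and the 5 Mbp genome size are illustrative assumptions of mine, so substitute your own estimate.

```python
def bases_needed(genome_size_bp: int, coverage: float) -> int:
    """Total bases to sequence for a given fold coverage."""
    return int(round(genome_size_bp * coverage))


def coverage_from_yield(total_bases: float, genome_size_bp: int) -> float:
    """Fold coverage obtained from a given sequencing yield."""
    return total_bases / genome_size_bp


# Hypothetical genome size (5 Mbp); replace with your own estimate.
genome_size = 5_000_000

for cov in (50, 60, 100):
    gigabases = bases_needed(genome_size, cov) / 1e9
    print(f"{cov}x coverage needs {gigabases:.2f} Gbp")
```

This is raw yield only. Since PBcR corrects and assembles the longest reads, the usable coverage after correction will be lower than the raw figure.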
As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument seems to be missing. Hang on, I’m coming back to that…
Two days ago, a paper appeared in Nature Scientific Data by Kristi Kim et al., titled “Long-read, whole-genome shotgun sequence data for five model organisms”. This paper describes the release of whole-genome PacBio data by Pacific Biosciences and others for five model organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster), using quite recent chemistries.
Beyond the datasets described in the paper, Pacific Biosciences also released whole-genome data for the human genome, and very recently, for Caenorhabditis elegans using the latest P6/C4 chemistry. Check out PacBio devnet, also for data for other applications.
I think it is fantastic that Pacific Biosciences releases these datasets as a service to the community – and obviously to showcase their technology. Company-generated data often represents the best possible data, as it is generated by people with a great deal of experience with the technology. It remains to be seen if ‘regular’ owners of PacBio RS II instruments can reach the same level of data quality. Nonetheless, these datasets are very helpful for teaching (see my previous blog post), comparisons with other technologies (I wish I could make time to thoroughly compare PacBio data to Moleculo data available from the same species), as well as development of new software applications.
As it is out in the open that I was one of the reviewers of the ‘HGAP’ paper, I thought I might as well make my review publicly available.
I have posted the review report (from February 2013) online at Publons. The review was actually done together with a PhD student in the group, Ole Kristian Tørresen (I like to do reviews together with others: it leads to better reviews and is a great learning experience for students!).
I attended, for the first time, the Advances in Genome Biology and Technology (AGBT) meeting in Florida. With this post, I intend to summarise my experiences of the meeting. I will not cover everything that happened at the meeting, but will focus on the areas of my own interest.