Developments in high throughput sequencing – July 2016 edition

This is the fifth edition of this visualisation; previous editions were in June 2015, June 2014, October 2013 and December 2012.

As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument (different from last time) seems to be missing, hang on, I’m coming back to that…

[Figure: developments in high throughput sequencing]

Notable changes from the June 2015 edition

  • I added the Illumina MiniSeq
  • I added the Oxford Nanopore MinION. The read length for this instrument was based on the specifications for maximal output and number of reads from the company’s website. The two data points represent ‘regular’ and ‘fast’ modes.
  • I added the IonTorrent S5 and S5XL. You may notice that the line for this instrument has a downward slope. This is because the 400 bp reads are only available on the 520 and 530 chips, but not on the higher-throughput 540 chip, making the maximum throughput for this read length lower than for the 200 bp reads.
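
As a rough illustration of how a read length can be backed out of headline specifications (maximal output and number of reads), here is a minimal Python sketch. The function name and the example figures are placeholders made up for illustration; they are not the company's actual specifications.

```python
def mean_read_length(total_bases: float, num_reads: float) -> float:
    """Average read length implied by a run's total output and read count."""
    if num_reads <= 0:
        raise ValueError("num_reads must be positive")
    return total_bases / num_reads


# Placeholder figures for illustration only, not real instrument specifications.
example_output_bases = 2e9  # assumed: 2 gigabases per run
example_num_reads = 4e5     # assumed: 400,000 reads per run

print(mean_read_length(example_output_bases, example_num_reads))
```

Note that this gives a mean read length, which says nothing about how the read lengths are distributed.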

Continue reading

How to sequence and assemble a large eukaryote genome with long reads in 2015

I get asked about this a lot, so I thought I would put together a quick blog post on it.

Disclaimer: this is the advice I usually give people and is given without warranty. As they say, Your Mileage May Vary.

Main advice: bite the bullet and get the budget to get 100x coverage in long PacBio reads. 50-60x is really the minimum. Detailed advice:

Sequencing and assembly

  • get 100x PacBio latest chemistry aiming for longest reads (make sure the provider has a Sage Science BluePippin or something similar)
  • get 100x HiSeq paired end regular insert
  • run PBcR on the PacBio reads; this is part of the Celera Assembler. It corrects the longest raw reads and assembles them using Celera (long run time). Make sure to install the latest Celera release, which uses the much faster MHAP approach for the correction.
  • alternative is FALCON https://github.com/PacificBiosciences/FALCON
  • run quiver for polishing the assembly using ALL raw PacBio reads, see tips here
  • you could repeat the polishing if that changes a lot of bases and does not negatively impact validation
  • polish using the HiSeq reads with Pilon
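
To get a feel for what the coverage targets above mean in raw yield, here is a minimal Python sketch of the arithmetic. The genome size and the per-unit yield are assumed values chosen for illustration (check the actual yield with your provider), and the function names are hypothetical.

```python
import math


def bases_needed(genome_size_bp: int, coverage: float) -> float:
    """Total sequenced bases required to reach a target fold-coverage."""
    return genome_size_bp * coverage


def units_needed(genome_size_bp: int, coverage: float, yield_per_unit_bp: int) -> int:
    """Number of sequencing units (e.g. cells or lanes), rounded up."""
    return math.ceil(bases_needed(genome_size_bp, coverage) / yield_per_unit_bp)


# Assumed example values, not measurements.
genome_size = 1_000_000_000     # hypothetical 1 Gbp genome
yield_per_unit = 1_000_000_000  # assumed 1 Gbp of reads per sequencing unit

for target in (60, 100):
    print(target, bases_needed(genome_size, target),
          units_needed(genome_size, target, yield_per_unit))
```

The same calculation applies to the HiSeq data; only the yield per unit differs.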

Optional:

  • increase contiguity using BioNanoGenomics data
  • create pseudochromosomes using a linkage map (software?)

Continue reading

Developments in high throughput sequencing – June 2015 edition

This is the fourth edition of this visualisation, previous editions were in June 2014, October 2013 and December 2012.

As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument seems to be missing, hang on, I’m coming back to that…

Continue reading

My review of “Long-read, whole-genome shotgun sequence data for five model organisms”

Two days ago, a paper appeared in Nature Scientific Data by Kristi Kim et al, titled “Long-read, whole-genome shotgun sequence data for five model organisms”. This paper describes the release of whole-genome PacBio data by Pacific Biosciences and others, for five model organisms, Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster, using quite recent chemistries.

Beyond the datasets described in the paper, Pacific Biosciences also released whole-genome data for the human genome and, very recently, for Caenorhabditis elegans using the latest P6/C4 chemistry. Check out PacBio devnet, which also has data for other applications.

I think it is fantastic that Pacific Biosciences releases these datasets as a service to the community – and obviously to showcase their technology. Company-generated data often represents the best possible data, as it is produced by people with a great deal of experience with the technology. It remains to be seen if ‘regular’ owners of PacBio RS II instruments can reach the same level of data quality. Nonetheless, these datasets are very helpful for teaching (see my previous blog post), comparisons with other technologies (I wish I could make time to thoroughly compare PacBio data to Moleculo data available from the same species), as well as development of new software applications.

Continue reading

Our review of “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data”, aka the HGAP paper

As it is out in the open that I was one of the reviewers of the ‘HGAP’ paper, I thought I might as well make my review publicly available.

I have posted the review report (from February 2013) online at Publons. The review was actually done together with a PhD student in the group, Ole Kristian Tørresen (I like to do reviews together with others, it leads to better reviews and is a great learning experience for students!).

Here are the first few paragraphs. Enjoy!

Continue reading

Developments in next generation sequencing – June 2014 edition

This is the third edition of this visualisation, previous editions were in October 2013 and December 2012.

As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale:

[Figure: Developments in next generation sequencing, June 2014]

Continue reading