Developments in high throughput sequencing – July 2016 edition

This is the fifth edition of this visualisation; previous editions were in June 2015, June 2014, October 2013 and December 2012.

As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument (different from last time) seems to be missing; hang on, I’m coming back to that…

[Figure: Developments in high throughput sequencing]

Notable changes from the June 2015 edition

  • I added the Illumina MiniSeq
  • I added the Oxford Nanopore MinION. The read length for this instrument was based on the specifications for maximal output and number of reads on the company’s website (the implied read length is simply output divided by read count; see the sketch after this list). The two data points represent ‘regular’ and ‘fast’ modes.
  • I added the IonTorrent S5 and S5XL. You may notice that the line for this instrument has a downward slope; this is because the 400 bp reads are only available on the 520 and 530 chips, but not the higher-throughput 540 chip, making the maximum throughput for this read length lower than for the 200 bp reads.
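For the curious, this is the arithmetic behind that MinION data point: the implied read length is the specified maximal output divided by the specified number of reads. A minimal sketch with placeholder numbers (these are assumptions for illustration, not the actual specification values):

```python
# Implied MinION read length from the published specifications.
# The numbers below are placeholders, not the actual spec values.
max_output_bases = 1.5e9   # assumed maximal run output (1.5 Gb)
n_reads = 100_000          # assumed number of reads per run

implied_read_length = max_output_bases / n_reads
print(f"Implied read length: {implied_read_length:,.0f} bp")  # 15,000 bp
```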


How to sequence and assemble a large eukaryote genome with long reads in 2015

I get asked about this a lot, so I thought I’d put together a quick blog post on it.

Disclaimer: this is the advice I usually give people and is given without warranty. As they say, Your Mileage May Vary.

Main advice: bite the bullet and secure the budget for 100x coverage in long PacBio reads; 50-60x is really the minimum (a back-of-the-envelope data-volume sketch follows the list below). Detailed advice:

Sequencing and assembly

  • get 100x PacBio with the latest chemistry, aiming for the longest reads (make sure the provider has a Sage Science BluePippin or something similar for size selection)
  • get 100x HiSeq paired end regular insert
  • run PBcR on the PacBio reads; this is part of the Celera Assembler. It corrects the longest raw reads and assembles them using Celera (long run time). Make sure to install the latest Celera release, which uses the much faster MHAP approach for the correction.
  • alternative is FALCON https://github.com/PacificBiosciences/FALCON
  • run Quiver to polish the assembly using ALL raw PacBio reads, see tips here
  • you could repeat the polishing if that changes a lot of bases and does not negatively impact validation
  • polish using the HiSeq reads with Pilon
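As promised, a back-of-the-envelope check on the coverage advice above: the required data volume is simply genome size times target coverage. A minimal Python sketch, where the genome size and per-SMRT-cell yield are assumed example values, not figures from this post:

```python
# Rough data-volume planning for a long-read assembly project.
# All values below are illustrative assumptions.
genome_size_bp = 1.0e9        # assumed genome size: 1 Gbp
target_coverage = 100         # 100x PacBio, as advised above
yield_per_cell_bp = 0.5e9     # assumed yield per SMRT cell: 0.5 Gb

bases_needed = genome_size_bp * target_coverage
cells_needed = bases_needed / yield_per_cell_bp
print(f"Need {bases_needed / 1e9:.0f} Gb, i.e. ~{cells_needed:.0f} SMRT cells")
```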

Optional:

  • increase contiguity using BioNano Genomics data
  • create pseudo-chromosomes using a linkage map (software?)

 


Developments in high throughput sequencing – June 2015 edition

This is the fourth edition of this visualisation; previous editions were in June 2014, October 2013 and December 2012.

As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument seems to be missing; hang on, I’m coming back to that…


My review of “Long-read, whole-genome shotgun sequence data for five model organisms”

Two days ago, a paper appeared in Nature Scientific Data by Kristi Kim et al., titled “Long-read, whole-genome shotgun sequence data for five model organisms”. This paper describes the release of whole-genome PacBio data by Pacific Biosciences and others, for five model organisms: Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster, using quite recent chemistries.

Beyond the datasets described in the paper, Pacific Biosciences also released whole-genome data for the human genome and, very recently, for Caenorhabditis elegans using the latest P6/C4 chemistry. Check out PacBio DevNet, which also has data for other applications.

I think it is fantastic that Pacific Biosciences releases these datasets as a service to the community – and obviously to showcase their technology. Company-generated data often represents the best possible data, as it is produced by people with a great deal of experience with the technology. It remains to be seen whether ‘regular’ owners of PacBio RS II instruments can reach the same level of data quality. Nonetheless, these datasets are very helpful for teaching (see my previous blog post), for comparisons with other technologies (I wish I could make time to thoroughly compare PacBio data to the Moleculo data available for the same species), and for the development of new software applications.


Our review of “Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data”, aka the HGAP paper

As it is out in the open that I was one of the reviewers of the ‘HGAP’ paper, I thought I might as well make my review publicly available.

I have posted the review report (from February 2013) online at Publons. The review was actually done together with a PhD student in the group, Ole Kristian Tørresen (I like to do reviews together with others; it leads to better reviews and is a great learning experience for students!).

Here are the first few paragraphs. Enjoy!


Developments in next generation sequencing – June 2014 edition

This is the third edition of this visualisation; previous editions were in October 2013 and December 2012.

As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale:

[Figure: Developments in next generation sequencing, June 2014]

De novo bacterial genome assembly: a solved problem?

Pacific Biosciences published a paper earlier this year on an approach to sequence and assemble a bacterial genome leading to a near-finished, or finished, genome. The approach, dubbed Hierarchical Genome Assembly Process (HGAP), is based solely on PacBio reads, without the need for short reads. This is how it works:

  • generate a high-coverage dataset of the longest reads possible; aim for 60-100x in raw reads
  • pre-assembly: use the reads from the shorter part of the raw read-length distribution to error-correct the longest reads; set the length cutoff such that the reads above it make up about 30x coverage (see the sketch after this list)
  • use the long, error-corrected reads in a suitable assembler, e.g. Celera, to produce contigs
  • map the raw PacBio reads back to the contigs to polish the final sequence (rather, recall the consensus using the raw reads as evidence) with the Quiver tool
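A minimal sketch of that pre-assembly cutoff selection, assuming you have a list of raw read lengths and know the approximate genome size (the function name and numbers are mine for illustration, not from the HGAP paper):

```python
def seed_read_cutoff(read_lengths, genome_size, target_cov=30):
    """Pick the read-length cutoff so that the reads at or above it
    (the 'seed' reads to be error-corrected) sum to ~target_cov x genome size."""
    target_bases = target_cov * genome_size
    accumulated = 0
    # Walk from the longest read downwards, accumulating bases
    for length in sorted(read_lengths, reverse=True):
        accumulated += length
        if accumulated >= target_bases:
            return length  # reads >= this length give ~30x coverage
    return min(read_lengths)  # dataset too small to reach the target

# Hypothetical usage for a 5 Mbp bacterial genome:
# cutoff = seed_read_cutoff(raw_read_lengths, genome_size=5_000_000)
```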

The approach is very well explained on this website. As an aside, the same principle can now be used with the PacBioToCA pipeline.


Applications for PacBio circular consensus sequencing

I am a fan of the Pacific Biosciences PacBio RS sequencing instrument, as some readers of this blog may already be aware. We use data from this instrument in our work, and the Norwegian Sequencing Centre – with which I am affiliated – offers the technology to its users.

PacBio recently increased their read lengths. With this so-called C2XL chemistry, average raw read length is now around 4.3-4.5 kbp and maximum read lengths can go over 20 kbp. These extra-long reads are great for de novo genome sequencing applications, something we are trying ourselves. However, a bit buried in the news about these longer reads are the consequences for PacBio’s so-called Circular Consensus Sequencing, or CCS.
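The arithmetic behind that: in CCS the polymerase reads around a circularised insert multiple times, so the expected number of passes is roughly the raw read length divided by the insert size (ignoring adapters). A minimal sketch with assumed example numbers, not official PacBio figures:

```python
# Expected number of CCS passes over a circularised insert.
# Values below are illustrative assumptions.
raw_read_length_bp = 4_400   # ~average C2XL raw read length (see above)
insert_size_bp = 500         # assumed library insert size

passes = raw_read_length_bp / insert_size_bp  # adapters ignored
print(f"~{passes:.0f} passes over a {insert_size_bp} bp insert")
```

More passes over the same molecule mean a more accurate consensus, which is why longer raw reads quietly changed what CCS can do.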


Developments in next generation sequencing – a visualisation

With this post I present a figure I’ve been working on for a while now. With it, I try to summarise the developments in (next generation) sequencing, or at least a few aspects of it. I’ve been digging around the internet to find the throughput metrics for the different platforms since their first instrument version came out. I’ve summarised my findings in the table at the end of this post. Then, I visualised the results by plotting throughput in raw bases versus read length in the graph below.

Developments in next generation sequencing. http://dx.doi.org/10.6084/m9.figshare.100940
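For readers who want to reproduce this kind of figure, here is a minimal matplotlib sketch; the data points are made-up placeholders, and the real numbers are in the table at the end of this post and on figshare:

```python
import matplotlib.pyplot as plt

# Made-up placeholder points: platform -> (read length in bp, throughput in Gb)
platforms = {
    "Illumina HiSeq": (100, 600),
    "Ion Torrent PGM": (400, 2),
    "PacBio RS": (4_500, 1),
}

fig, ax = plt.subplots()
for name, (read_length, throughput) in platforms.items():
    ax.scatter(read_length, throughput)
    ax.annotate(name, (read_length, throughput))

ax.set_xscale("log")   # both axes on a log scale, as in the figure
ax.set_yscale("log")
ax.set_xlabel("Single-end read length (bp)")
ax.set_ylabel("Throughput per run (Gb)")
plt.show()
```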
