I get asked about this a lot, so I thought I'd put together a quick blog post on it.
Disclaimer: this is the advice I usually give people and is given without warranty. As they say, Your Mileage May Vary.
Main advice: bite the bullet and secure the budget for 100x coverage in long PacBio reads; 50-60x is really the minimum. Detailed advice:
Sequencing and assembly
- get 100x PacBio with the latest chemistry, aiming for the longest reads possible (make sure the provider has a Sage Science BluePippin or something similar)
- get 100x HiSeq paired-end, regular insert
- run PBcR on the PacBio reads; it is part of the Celera Assembler. It corrects the longest raw reads and assembles them using Celera (long run time). Make sure to install the latest Celera Assembler release, which uses the much faster MHAP approach for the correction.
- alternative is FALCON https://github.com/PacificBiosciences/FALCON
- run quiver for polishing the assembly using ALL raw PacBio reads, see tips here
- you could repeat the polishing if the first round changes a lot of bases and repeating does not negatively impact validation
- polish using the HiSeq reads with Pilon
- increase contiguity using BioNanoGenomics data
- create pseudo chromosomes using a linkage map (software?)
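As a sanity check when budgeting, the coverage targets above translate directly into raw yield. A minimal sketch (the 1 Gb genome size is just a made-up example, not a number from this post):

```python
def required_yield_gb(genome_size_bp, coverage):
    """Raw bases needed, in gigabases, to reach a given coverage depth."""
    return genome_size_bp * coverage / 1e9

# Hypothetical 1 Gb genome at the recommended coverages
print(required_yield_gb(1_000_000_000, 100))  # 100.0 Gb of PacBio
print(required_yield_gb(1_000_000_000, 100))  # 100.0 Gb of HiSeq paired-end
```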
In 1986, in a letter to the journal Nature, James Bruce Walsh and Jon Marks lamented that the upcoming human genome sequencing project “violates one of the most fundamental principles of modern biology: that species consist of variable populations of organisms”. They further wrote: “As molecular biologists generally ignore any variability within a population, the individual whose haploid [sic] genome will be chosen will provide the genetic benchmark against which deviants are determined”. They conclude that “‘the’ genome of ‘the’ human will be sequenced gel by acrylamide gel”.
We have come a long way when it comes to taking population variation into account in molecular/genetic/genomic studies. But these sentiments, expressed as early as 1986, are echoed in one of the current trends in the human genetics field: the move away from a single, linear representation of ‘the’ human genome. In this post I will provide some background, explain the reasons for moving towards graph-based representations, and indicate some challenges associated with this development.
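To make the graph idea concrete before diving into the background: in a genome graph, alternative alleles are alternative paths through shared nodes, rather than deviations from one linear reference. A toy Python sketch (purely illustrative, not any real tool's format):

```python
# A toy variation graph: a SNP site where one allele is 'A' and the
# other is 'G'. Both haplotypes are paths through shared flanking nodes.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTGC"}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

def path_seq(path):
    """Spell out the sequence of a walk through the graph."""
    assert all((a, b) in edges for a, b in zip(path, path[1:]))
    return "".join(nodes[n] for n in path)

print(path_seq([1, 2, 4]))  # ACGTATTGC  (one allele)
print(path_seq([1, 3, 4]))  # ACGTGTTGC  (the other allele)
```

Neither path is privileged as 'the' reference; both alleles are first-class citizens of the data structure.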
The Genome10K meeting is ongoing (I am not attending but following through twitter). Today, there will be a talk by Ian Korf about the feasibility of an Assemblathon 3 contest (see this tweet and the schedule). Earlier the @Assemblathon twitter account asked for a wishlist for an Assemblathon 3 through the hashtag #A3wishlist. With this post I want to share my opinion on what a possible Assemblathon 3 could and/or should be about.
Open source, open data, open course
We recently held the third instalment of the course in High Throughput Sequencing technologies and bioinformatics analysis. This course aims to provide students, as well as users of the organising service platforms, with the basic skills to analyse their own sequencing data using existing tools. We teach both unix command line-based tools and the Galaxy web-based framework.
I coordinate the course and also teach a two-day module on de novo genome assembly. I keep developing the material for this course, and am increasingly relying on material openly licensed by others. To me, it is fantastic that others are willing to share material they developed openly for others to (re)use. It was hugely inspiring to discover material such as the assembly exercise, and the IPython notebook to build small De Bruijn Graphs (see below). To me, this confirms that ‘opening up’ in science increases the value of material by many orders of magnitude. I am not saying that the course would have been impossible without having this material available, but I do feel the course has become much better because of it.
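In the same spirit as that notebook (this is my own minimal illustration, not the notebook's code), a small De Bruijn graph can be built by linking each k-mer's (k-1)-length prefix to its (k-1)-length suffix:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a De Bruijn graph: an edge from each k-mer's prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

g = de_bruijn(["ACGTAC"], 3)
# k-mers: ACG, CGT, GTA, TAC -> edges AC->CG, CG->GT, GT->TA, TA->AC
print(dict(g))
```

Walking a path through this graph, overlapping each node by k-2 characters, reconstructs the original sequence; that is the core intuition students need for assembly.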
‘Open’ made this course possible
This course used:
- openly available sequencing data released by the sequencing companies (although some of the Illumina reads are behind a – free – login account)
- sequencing data made openly available by individual researchers
- code developed for teaching made available by individual researchers under a permissive license
- open source software programs
(for a full list of resources, see this document).
I am extremely grateful to the authors/providers of these resources, as they greatly benefitted this course!
‘Opening up’ is the least I can do to pay back
In exchange, the very least I can do is to make my final course module openly available as well.
The rest of this post describes the material and its sources in more detail.
Last month, a new paper appeared in BMC Bioinformatics, entitled “Automated ensemble assembly and validation of microbial genomes”. In it, the authors describe iMetAMOS, a module of the metAMOS package, for bacterial genome assembly. I was one of the reviewers (I signed my review), and post part of my review here. The full review can be found on publons.
iMetAMOS workflow. From the paper, doi:10.1186/1471-2105-15-126
I signed my review because I believe in non-anonymous peer review (see Mick Watson’s “reviewer’s oath”).
I made my review available on publons, a platform for posting pre- and post-publication peer-review reports once the article has been published, because I believe in open peer-review. EDIT Adam Phillippy, the senior author on the paper, posted the authors’ response to the review reports they received as a comment to the review on publons!
I post the first part of my review here because it nicely summarises the paper and my (favourable) opinion of it. I’ll admit that I wrote these paragraphs of the review report with the idea of posting them to my blog 🙂
I attended, for the first time, the Advances in Genome Biology and Technology (AGBT) meeting in Florida. With this post, I intend to summarise my experiences of the meeting. I will not cover everything that happened at the meeting, but focus on the areas of my own interest.
The one and only Oxford Nanopore talk at AGBT 2014 – with real data
Twitter started buzzing this morning at AGBT because researchers started getting confirmation emails from Oxford Nanopore regarding their application for the MinION Access Program (MAP). This was timed well with the first talk discussing some serious data from the platform, by David Jaffe from the Broad Institute, entitled “Assembly of Bacterial Genomes Using Long Nanopore Reads”. Probably no coincidence…
David Jaffe’s highly anticipated talk showed data generated by Oxford Nanopore on their MinION from two bacterial genomes. One was a methylation-negative E. coli (the fact that it was methylation negative may have been significant, but he didn’t say). The second species was a Scardovia. Five micrograms of DNA were sent to the company. Library prep consisted of fragmentation and adaptor ligation, basically the classical workflow. Nothing was said about the type of adaptors.
Pacific Biosciences published a paper earlier this year on an approach to sequence and assemble a bacterial genome leading to a near-finished, or finished genome. The approach, dubbed Hierarchical Genome Assembly Process (HGAP), is based on only PacBio reads without the need for short-reads. This is how it works:
- generate a high-coverage dataset of the longest reads possible, aim for 60-100x in raw reads
- pre-assembly: use the reads from the shorter part of the raw read-length distribution to error-correct the longest reads; set the length cutoff so that the longest reads make up about 30x coverage
- use the long, error-corrected reads in a suitable assembler, e.g. Celera, to produce contigs
- map the raw PacBio reads back to the contigs to polish the final sequence (rather, recall the consensus using the raw reads as evidence) with the Quiver tool
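The length-cutoff logic in the pre-assembly step can be sketched as follows (my own toy illustration of the idea, not HGAP's actual code):

```python
def length_cutoff(read_lengths, genome_size, target_cov=30):
    """Pick the minimum read length such that reads at or above it
    sum to roughly target_cov * genome_size bases (the 'seed' reads
    that get error-corrected; everything shorter does the correcting)."""
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_cov * genome_size:
            return length
    return min(read_lengths)  # not enough data to reach the target

# Toy example: 100 bp 'genome', aiming for 3x of the longest reads
print(length_cutoff([50, 100, 80, 120, 60, 90], genome_size=100, target_cov=3))  # 90
```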
The approach is very well explained on this website. As an aside, the same principle can now be used with the PacBioToCA pipeline.
The release of the Assemblathon 2 paper via arXiv, Titus Brown's blog post, and his posted review of the paper sparked a discussion in the comments section of Titus' blog, as well as on the homolog_us blog. With this blog post, I’d like to chip in with a few remarks of my own.
First, I agree with all of what Titus Brown said (‘you know, what he said’). One of his main take-home messages from Assemblathon 2 is the uncertainty associated with any assembly, and how this needs to be communicated better. When I give presentations to introduce ‘this thing called assembly’, I often start out with a quote from the Miller et al. 2010 paper in Genomics (‘Assembly algorithms for next-generation sequencing data’):
An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target
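That definition is worth unpacking: an assembly is not just a FASTA file of consensus sequences, but a hierarchy in which every base is backed by read evidence. A toy rendering of such a hierarchy (purely illustrative, not any assembler's real format):

```python
# Scaffolds are ordered contigs separated by gap estimates; each contig
# records which reads support it and at what offset (the 'layout').
assembly = {
    "scaffold_1": [
        {"contig": "contig_1",
         "sequence": "ACGTACGT",
         "layout": [("read_1", 0), ("read_2", 4)]},  # read -> offset in contig
        {"gap": 100},  # estimated gap size between the two contigs
        {"contig": "contig_2",
         "sequence": "TTGACC",
         "layout": [("read_3", 0)]},
    ]
}

# Any consensus base can be traced back to the reads that support it.
contig = assembly["scaffold_1"][0]
print(contig["sequence"], [read for read, offset in contig["layout"]])
```

Seen this way, the consensus is only the top layer of a stack of hypotheses, which is exactly why the uncertainty underneath deserves to be communicated.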
A potential user (‘customer’) of our sequencing platform asked how to generate reference genomes for his 4 bacterial strains. His question inspired me to write this post. The suggestions below are not absolute, just my thoughts on how one could, these days, go about sequencing a bacterial genome using one or more of the sequencing platforms. I would appreciate any feedback/suggestions in the comments section!
Option 1: bits and pieces
- Libraries: paired-end or single-end sequencing
- Platform: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior
- Bioinformatics: assembly: Velvet, SOAPdenovo, Newbler, MIRA, Celera
- Outcome: up to hundreds of short contigs (with only single-end reads) or contigs + scaffolds (with paired-end reads)
- Pros: fast and cheap, OK for presence/absence of e.g. genes
- Cons: doesn’t give much insight into the genome
- Remarks: due to per-run throughput, multiplexing is recommended; data can also be used for mapping against a reference genome instead
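One common way to summarise the 'hundreds of short contigs' outcome is the N50 statistic; a minimal sketch:

```python
def n50(contig_lengths):
    """N50: the contig length at which half the total assembly size is
    reached when contigs are added from longest to shortest."""
    half = sum(contig_lengths) / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length

print(n50([100, 200, 300, 400]))  # 300: half is 500, and 400 + 300 >= 500
```

A higher N50 generally means a more contiguous assembly, though, as the Assemblathon discussions above show, contiguity alone says nothing about correctness.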