I get asked about this a lot, so I thought to put together a quick blog post on it.
Disclaimer: this is the advice I usually give people and is given without warranty. As they say, Your Mileage May Vary.
Main advice: bite the bullet and get the budget to get 100x coverage in long PacBio reads. 50-60x is really the minimum. Detailed advice:
Sequencing and assembly
- get 100x PacBio latest chemistry aiming for longest reads (make sure provider has SAGE Blupippin or something similar)
- get 100x HiSeq paired end regular insert
- run PBcR on the PacBio reads, this is part of Celera. It corrects the longest raw reads, assembles them using Celera (long run time). Make sure to install the latest Celera release which uses the much faster MHAP approach for the correction.
- alternative is FALCON https://github.com/PacificBiosciences/FALCON
- run quiver for polishing the assembly using ALL raw PacBio reads, see tips here
- you could repeat the polishing if that changes a lot of bases and does not negatively impact validation
- polish using the HiSeq reads with Pilon
- increase contiguity using BioNanoGenomics data
- create pseudo chromosomes using a linkage map (software?)
- CEGMA (formally discontinued but still useful)
- BUSCO (we have issues with fish, seems not to be tailored to that group of organisms, developers tell us they are fixing it)
- linkage map? or other map (RAD-tag based). (software?)
- BioNanoGenomics can be used for QC also
- Use a genome browser to get a feeling for your results, e.g. IGV; add assembly, BAM files, annotation, transcripts mapped and browse
- We use MAKER2 for automated annotation
- better with RNA-seq (assembled) data
- best with PacBio IsoSeq data
- filter filter filter
- getting a repeat database is a science in itself, we are still learning
What about Oxford Nanopore data?
Absolutely for bacterial genomes, see the Loman/Quick/Simpson paper in Nature Merhods, or use Spades. For large eukaryote genomes, the potential is there. However, I would wait a bit for higher throughput and tailored software.
Edit: on Twitter, @ZaminIqbal asked
@lexnederbragt What changes to that would you make for smaller genomes Lex?
To which I replied:
- HGAP from smrtanalysis as it does it al in one go
- Circularisation with – need to ask what we use. I now think we follow this wiki from PacBio
I couple of comments:
canu is available: https://github.com/marbl/canu (and http://canu.readthedocs.org/en/latest/index.html). It’s derived from Celera Assembler, but everything concerning short and paired reads have been taken out (canu can be read as “CA new” I guess). Only tested it a little bit, and I have trouble with the consensus part.
Question: If you have 100x coverage of PacBio and 100x of Illumina reads, which set has the least bias in errors? I’m unsure that Pilon would do much good when you have 50+x coverage with PacBio.
Thanks for mentioning Canu. We are also still eagerly awaiting Dazzler. About your question: Illumina will have more systematic biases and regions with low coverage. But I still expect Pilon to have an effect. We are testing this out on a fish genome, so we’ll know soon…
Well, one important thing is to have some idea of the genome size in order to get 100x of any of the two seq techs. Also a karyotype will help to have an idea of the chromosome number.
Having said that, I know that is not trivial to do this at the lab, however I always encourage people to at least try. I database that sometime I use is this
Not sure if they keep it updated but sometimes is helpful.
Great post BTW!
Excellent points. Thanks!
That website, http://www.genomesize.com seems to be down, and has been down for a while.
If the genome size is totally unknown, maybe it could be beneficial to do some Illumina HiSeq sequencing, a lane or something of 100 bp reads, and do some kmer analysis of that to get an estimate of genome size (SGA PreQC for instance). When the genome size is better known (and you’d get some info about repeat structure and heterozygosity also), then the PacBio sequencing strategy could be better planned.
Any idea how to get a karyotype done for a non-model organism that doesn’t have any cell lines? Just finding dividing cells can be difficult.
Sorry, this is not something I know anything about…
Very interesting post.
– Do you have an opinion on using a filtered set of subreads (filtered for length and quality) in the assembly without correcting them?
This could be done by using the h5tools included in the SMRTPortal or Gene Myers Dextractor Module (https://dazzlerblog.wordpress.com/2014/03/22/the-dextractor-module-save-disk-space-for-your-pacbio-projects ). By saving the correction step, the assembly should be much faster.
– Same for quiver: Why use all the raw subreads and not only filter the high qualtiy ones?
1) Assembly of reads with a 15% error rate is possible according to both Gene Myers and the Celera developers, so, yes. But Dazzler is not yet ready, and I don’t know how well it works in Celera.
2) I don’t see a reason why Quiver wouldn’t work just as well after a bit of filtering, but one would have to try. The goal would be to save time, right? Yes, that may work.
Thanks for this post, it’s really helpful. Do you also do the wet lab prep or know what (if any) special measures are taken to avoid cross species contamination? What about any bioinformatic tools for QC of contamination status? I guess one could blast some subset of the fastq and the use of RNAseq on top validates the expressed coding regions.
I don’t do wet lab stuff. I guess using tools like blast to determine the species origin of the contigs from the assembly (or other tools for metagenomic analysis) would be good to determine species of origin.