I get asked about this a lot, so I thought I would put together a quick blog post on it.
Disclaimer: this is the advice I usually give people and is given without warranty. As they say, Your Mileage May Vary.
Main advice: bite the bullet and get the budget for 100x coverage in long PacBio reads; 50-60x is really the minimum. Detailed advice:
Sequencing and assembly
- get 100x PacBio, latest chemistry, aiming for the longest reads (make sure the provider has a Sage Science BluePippin or something similar for size selection)
- get 100x HiSeq paired end regular insert
- run PBcR on the PacBio reads; this is part of the Celera Assembler. It corrects the longest raw reads and assembles them using Celera (long run time). Make sure to install the latest Celera release, which uses the much faster MHAP approach for the correction.
- alternative is FALCON https://github.com/PacificBiosciences/FALCON
- run quiver for polishing the assembly using ALL raw PacBio reads, see tips here
- you could repeat the polishing if that changes a lot of bases and does not negatively impact validation
- polish using the HiSeq reads with Pilon
- increase contiguity using BioNanoGenomics data
- create pseudo chromosomes using a linkage map (software?)
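To make the coverage targets above concrete, here is a minimal sketch of the arithmetic: coverage is simply total sequenced bases divided by genome size. The 800 Mbp genome size is a made-up example value, and the function names are mine, not part of any tool mentioned here.

```python
def bases_needed(genome_size_bp, coverage):
    """Total bases to sequence to reach a given fold coverage."""
    return genome_size_bp * coverage


def coverage_from_yield(total_bases, genome_size_bp):
    """Fold coverage obtained from a given sequencing yield."""
    return total_bases / genome_size_bp


if __name__ == "__main__":
    genome_size = 800_000_000  # hypothetical 800 Mbp genome (assumption)

    target = bases_needed(genome_size, 100)   # the recommended 100x
    minimum = bases_needed(genome_size, 50)   # the 50x lower bound

    print(f"100x target: {target / 1e9:.0f} Gbp")
    print(f"50x minimum: {minimum / 1e9:.0f} Gbp")

    # Check what a quoted yield would actually give you.
    quoted_yield = 40_000_000_000  # hypothetical provider quote (assumption)
    print(f"Quoted yield gives {coverage_from_yield(quoted_yield, genome_size):.0f}x")
```

Plug in your own estimated genome size and the yield your provider quotes to see whether you land nearer 100x or the 50-60x minimum.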
- CEGMA (formally discontinued but still useful)
- BUSCO (we have issues with fish, seems not to be tailored to that group of organisms, developers tell us they are fixing it)
- linkage map? or other map (RAD-tag based). (software?)
- BioNanoGenomics can be used for QC also
- Use a genome browser to get a feeling for your results, e.g. IGV; add assembly, BAM files, annotation, transcripts mapped and browse
- We use MAKER2 for automated annotation
- better with RNA-seq (assembled) data
- best with PacBio IsoSeq data
- filter filter filter
- getting a repeat database is a science in itself, we are still learning
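As an illustration of the "filter filter filter" point, here is a small sketch of one possible filter. This is an assumption on my part and not necessarily what we run: it keeps only mRNA features whose MAKER annotation edit distance (the `_AED` attribute in MAKER's GFF3 output) is at or below a cutoff. The 0.5 cutoff and the function name are hypothetical; pick a threshold that suits your own data.

```python
import re

# Matches "_AED=<number>" as its own attribute, not "_eAED=<number>".
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def mrna_passes(gff_line, max_aed=0.5):
    """Return True if the line is an mRNA feature with _AED <= max_aed.

    max_aed=0.5 is an illustrative cutoff, not a recommendation.
    """
    fields = gff_line.rstrip("\n").split("\t")
    # GFF3 feature lines have exactly nine tab-separated columns.
    if len(fields) != 9 or fields[2] != "mRNA":
        return False
    match = AED_PATTERN.search(fields[8])
    return match is not None and float(match.group(1)) <= max_aed


if __name__ == "__main__":
    # Two made-up example records, one well supported and one poorly supported.
    good = "\t".join(["chr1", "maker", "mRNA", "100", "900", ".", "+", ".",
                      "ID=g1-mRNA-1;Parent=g1;_AED=0.12;_eAED=0.12"])
    poor = "\t".join(["chr1", "maker", "mRNA", "2000", "2900", ".", "-", ".",
                      "ID=g2-mRNA-1;Parent=g2;_AED=0.93;_eAED=0.93"])
    for line in (good, poor):
        print(line.split("\t")[8].split(";")[0], mrna_passes(line))
```

A real filter would also have to carry along the parent gene and child exon/CDS features of each kept mRNA; this sketch only shows the selection criterion.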
What about Oxford Nanopore data?
Absolutely for bacterial genomes, see the Loman/Quick/Simpson paper in Nature Methods, or use SPAdes. For large eukaryote genomes, the potential is there. However, I would wait a bit for higher throughput and tailored software.
Edit: on Twitter, @ZaminIqbal asked
@lexnederbragt What changes to that would you make for smaller genomes Lex?
To which I replied:
- HGAP from smrtanalysis, as it does it all in one go
- Circularisation with – need to ask what we use. I now think we follow this wiki from PacBio