I get asked about this a lot, so I decided to put together a quick blog post on it.
Disclaimer: this is the advice I usually give people, and it comes without warranty. As they say, Your Mileage May Vary.
Main advice: bite the bullet and find the budget for 100x coverage in long PacBio reads; 50-60x is really the minimum. Detailed advice:
Sequencing and assembly
- get 100x PacBio coverage with the latest chemistry, aiming for the longest reads (make sure the provider has a Sage Science BluePippin or something similar for size selection)
- get 100x HiSeq paired-end reads with a regular insert size
- run PBcR on the PacBio reads; it is part of the Celera Assembler. It corrects the longest raw reads and then assembles them with Celera (long run time). Make sure to install the latest Celera release, which uses the much faster MHAP approach for the correction.
- an alternative is FALCON: https://github.com/PacificBiosciences/FALCON
- run Quiver to polish the assembly using ALL raw PacBio reads, see tips here
- you could repeat the polishing if a round still changes a lot of bases and does not negatively impact validation
- polish using the HiSeq reads with Pilon
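
To turn the coverage targets above into something you can put in a quote request, here is a minimal Python sketch of the arithmetic: coverage is simply total sequenced bases divided by genome size. The function names, the 500 Mb example genome, and the 1 Gb yield per sequencing unit are all made up for illustration; ask your provider for the yield they actually get.

```python
import math


def bases_needed(genome_size_bp, coverage):
    """Total bases to sequence to reach a target fold coverage."""
    return genome_size_bp * coverage


def units_needed(genome_size_bp, coverage, yield_per_unit_bp):
    """Number of sequencing units (cells, lanes, ...) to order, rounded up.

    yield_per_unit_bp is an assumption you must get from your provider.
    """
    return math.ceil(bases_needed(genome_size_bp, coverage) / yield_per_unit_bp)


def coverage_achieved(total_bases, genome_size_bp):
    """Fold coverage actually obtained from a given amount of sequence."""
    return total_bases / genome_size_bp


if __name__ == "__main__":
    genome = 500_000_000          # hypothetical 500 Mb genome
    unit_yield = 1_000_000_000    # hypothetical 1 Gb per sequencing unit

    for target in (50, 100):
        total = bases_needed(genome, target)
        units = units_needed(genome, target, unit_yield)
        print(f"{target}x: {total / 1e9:.0f} Gb, {units} units")
```

The same calculation applies to both the PacBio and the HiSeq data; only the yield per unit differs. It is worth running `coverage_achieved` on the data you actually receive, since the delivered yield often differs from the quoted one.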
Optional:
- increase contiguity using BioNano Genomics data
- create pseudo chromosomes using a linkage map (software?)