A potential user (‘customer’) of our sequencing platform asked how to generate reference genomes for his 4 bacterial strains. His question inspired me to write this post. The suggestions below are not absolute, just my thoughts on how one could go about sequencing a bacterial genome these days using one or more of the sequencing platforms. I would appreciate any feedback/suggestions in the comments section!
Option 1: bits and pieces
- Libraries: paired end or single end sequencing
- Platform: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior
- Bioinformatics: assembly: Velvet, SOAPdenovo, Newbler, MIRA, Celera
- Outcome: up to hundreds of short contigs (with only single-end reads) or contigs + scaffolds (with paired end reads)
- Pros: fast and cheap, OK for presence/absence of e.g. genes
- Cons: doesn’t give much insight into the genome
- Remarks: due to per-run throughput, multiplexing is recommended; data can also be used for mapping against a reference genome instead
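To make the multiplexing remark concrete, here is a minimal sketch of the arithmetic involved. It assumes equal pooling of samples, and the genome size, target coverage and run yield below are illustrative numbers only, not specifications of any of the platforms listed. The function names are hypothetical.

```python
def samples_per_run(run_yield_bp, genome_size_bp, target_coverage):
    """Return how many equally pooled genomes fit in one run at the target coverage."""
    if min(run_yield_bp, genome_size_bp, target_coverage) <= 0:
        raise ValueError("all inputs must be positive")
    bases_per_sample = genome_size_bp * target_coverage
    return int(run_yield_bp // bases_per_sample)


def expected_coverage(run_yield_bp, genome_size_bp, n_samples):
    """Return the mean coverage per genome when n_samples share one run equally."""
    if min(run_yield_bp, genome_size_bp, n_samples) <= 0:
        raise ValueError("all inputs must be positive")
    return run_yield_bp / (genome_size_bp * n_samples)


# Assumed, illustrative values: 5 Mbp genome, 100x coverage, 2 Gbp usable run yield.
print(samples_per_run(2_000_000_000, 5_000_000, 100))  # 4
print(expected_coverage(2_000_000_000, 5_000_000, 4))  # 100.0
```

In practice pooling is never perfectly even, so it is probably wise to aim for somewhat fewer samples per run than this calculation suggests.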
Option 2: a few gaps remaining
- Libraries: paired end or single end sequencing combined with at least one mate pair library of 8kb distance, optionally a 3-5 kb mate pair library
- Platform: paired end: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior; mate pairs: preferably on the 454, otherwise Illumina, demonstrated protocol for Ion PGM
- Bioinformatics: assembly: Illumina data: Velvet, SOAPdenovo; 454 data: Newbler, MIRA, Celera; hybrid: Newbler, Celera
- Outcome: down to one or a few scaffolds per chromosome/plasmid with a few to numerous gaps remaining
- Pros: gives a lot of long-range information
- Cons: mate pair libraries are more cumbersome to make; more expensive; remaining gaps are often due to repeats, which may be exactly the regions of interest (e.g. transposons, rDNA operons)
- Remarks: hybrid assembly is largely uncharted territory; gap closing programs may help (IMAGE2, Soap gap closer, GapFiller, …)
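To see how many gaps actually remain after scaffolding (or after running a gap closer), one can count the runs of N in each scaffold. The sketch below assumes the scaffolds are in FASTA format with gaps written as N characters, which is common but not guaranteed for every assembler. The function names are hypothetical.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_summary(sequence, min_gap=1):
    """Return (number_of_gaps, total_gap_bases) for runs of N in a scaffold."""
    gaps = [m.end() - m.start() for m in re.finditer(r"[Nn]+", sequence)]
    gaps = [g for g in gaps if g >= min_gap]
    return len(gaps), sum(gaps)


# Toy input; for a real file use: open("scaffolds.fasta") (hypothetical file name).
example = """>scaffold1
ACGTNNNNNACGT
NNACGT
>scaffold2
ACGTACGT
""".splitlines()

for name, seq in read_fasta(example):
    print(name, gap_summary(seq))
# scaffold1 (2, 7)
# scaffold2 (0, 0)
```

Comparing these numbers before and after gap closing gives a quick impression of how much the gap closer achieved.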
Option 3: closing the gaps
- Libraries: as for option 2, but in addition a long-insert (10kb) PacBio library
- Platform: one of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior; plus PacBio RS
- Bioinformatics: assembly: same as option 2; gap closing/finishing: PBJelly; AHA from PacBio can also be used
- Outcome: down to one or a few scaffolds per chromosome/plasmid potentially without gaps remaining
- Pros: no mate pairs needed, potentially very complete genome
- Cons: requires multiple libraries, expensive
- Remarks: Quiver can be used to improve the per-base quality given enough PacBio coverage
Option 4: The ALLPATHS_LG way
- Libraries: Illumina paired end + Illumina Mate Pair + a long-insert (10kb) PacBio library
- Platform: Illumina HiSeq or MiSeq; PacBio RS
- Bioinformatics: assembly: ALLPATHS_LG
- Outcome: down to one or a few contigs per chromosome/plasmid
- Pros: can yield finished genome
- Cons: complex mixture of libraries, can become expensive
- Remarks: special recipe: the paired end library should have a short insert, such that the forward and reverse read overlap
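Whether the forward and reverse read of a pair overlap follows directly from the fragment length and the read length. A minimal sketch of that check is given below; the function name is hypothetical and the example lengths are illustrative assumptions, not values taken from this post.

```python
def read_pair_overlap(fragment_len, read_len):
    """Return the number of bases covered by both reads of a pair (0 if none)."""
    if fragment_len <= 0 or read_len <= 0:
        raise ValueError("lengths must be positive")
    # A read cannot extend past the end of the fragment it was sequenced from.
    effective_read_len = min(read_len, fragment_len)
    return max(0, 2 * effective_read_len - fragment_len)


# Assumed example values: 100 bp reads from fragments of different lengths.
print(read_pair_overlap(180, 100))  # 20
print(read_pair_overlap(300, 100))  # 0
print(read_pair_overlap(80, 100))   # 80
```

Fragment lengths vary within a library, so the mean fragment length should leave some margin if most pairs are expected to overlap.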
Option 5: PacBio hybrid
- Libraries: paired end or single end sequencing + a long-insert (10kb) PacBio library
- Platform: as for option 2, but PacBio RS in addition
- Bioinformatics: error correction of the PacBio reads: PacBioToCA, LSC, P_ErrorCorrection from PacBio; assembly: Celera, MIRA
- Outcome: down to one or a few scaffolds per chromosome/plasmid potentially without gaps remaining
- Pros: can yield finished genome; only needs two libraries that are simple to produce
- Cons: error-correction can be computationally demanding
- Remarks: Quiver can be used to improve the per-base quality given enough PacBio coverage
Option 6: PacBio only
- Libraries: a single, long-insert (10kb) PacBio library
- Platform: PacBio RS
- Bioinformatics: error-correction; HGAp; assembly: Celera or Allora
- Outcome: down to one or a very few contigs per chromosome/plasmid
- Pros: uses a single library, cheap, finished genome of very high quality
- Cons: somewhat experimental, limited examples available, therefore currently a bit more risky
- Remarks: use Quiver for polishing
EDIT: additional tools
- Reader BenK suggested adding optical mapping as a way to improve bacterial genome assemblies. Not a sequencing technology per se, but, as he writes, “Many genomes have undiscovered repetitive regions that mapping unveils.”
- Reader Roy suggested mentioning stand-alone scaffolders: SSPACE, SOPRA, and the reference-based scaffolder ABACAS. See also this SeqWiki howto.
- Sébastien Boisvert, author of Ray, suggested I add his program. Done hereby. The reason it is not listed above is my inexperience with it for de novo bacterial genome assemblies.
- In the same spirit, readers may want to look at the GAGE results for more programs.
Relevant previous blog posts:
Combining short and long reads: choosing between PacBioToCA and the new ALLPATHS_LG
Ion Torrent Mate Pairs and a single scaffold for E coli K12 substr. MG1655
Links to programs mentioned:
ALLPATHS_LG
Celera
GapFiller
HGAp
IMAGE2
LSC
MIRA
Newbler: request here
P_ErrorCorrection, Allora and AHA see here, these need smrtpipe
PacBioToCA (also part of Celera 7.0 and smrtpipe)
PBJelly
Quiver see here
SOAPdenovo
Velvet
Reader Roy suggested adding SPAdes.
(If you feel I missed a program here, let me know in the comments below!)
You overlook optical mapping, which really does have a significant place in the few/none genomes with no good reference. Many genomes have undiscovered repetitive regions that mapping unveils.
Good suggestion. Added! Thanks
I’ve played with OpGen data, assuming that’s what you’re referring to. How do you recognize “undiscovered repetitive regions”? Are you talking about repeats that the assembler collapsed, or that caused misjoins?
SPAdes often outperforms Velvet on Illumina data. And you don’t mention standalone scaffolding solutions such as SSPACE, or reference-based scaffolding such as ABACAS.
Thanks. I am not yet convinced that Spades outperforms other DBG assemblers – I would like to see some independent benchmarking for that. I’ve added the program at the end of the list, though, as it is worth mentioning.
I’ve also added SSPACE and SOPRA, but I can’t seem to find a reference/website for ABACAS…
ABACAS is here:
http://abacas.sourceforge.net/
It’s part of PAGIT:
http://www.sanger.ac.uk/resources/software/pagit/
This is a great summary, thanks! I’ve tried most of the things listed with varying success, but it’s a great summary.
We’ve also been frustrated with the 454 long-reads. The assemblies we get with 454 are beautiful, but we’ve moved away from it because of cost.
I was hopeful that the longer reads from MiSeq would make this a nearly moot point, but I’ve been playing with assemblies using Illumina 2x250bp data and am seeing quality issues towards the end of the second read that are not entirely captured by the quality scores. I wouldn’t mind a decline in quality so much, except that it appears that the quality scores given by the sequencer are inflated at the top end.
In my experience Illumina has been very good about read quality, so I’m not sure exactly what’s going on. I hope it’s something specific to our instrument/datasets/early releases and not a developing trend.
@Arjun, this isn’t specific to just your instrument. From what I understand this can be due to the quality of the DNA being used. For instance, I know of users submitting data for various microbiome projects (switching to MiSeq for 16S surveying). They ran into this problem when submitting samples from kidney stones, which were apparently of lower quality than others, likely due to the extraction method.
> (If you feel I missed a program here, let me know in the comments below!)
Hi !
I think Ray [ http://denovoassembler.sourceforge.net/ ] is worth a click too !
Thanks a ton !
Please see my suggestions:
- read “Genome Project Standards in a New Era of Sequencing” Science, 2009:Vol. 326 no. 5950 pp. 236-237 DOI: 10.1126/science.1180614
- SPAdes is in use by several sequencing centers in the USA in single-cell production pipelines. For isolates it is also outperforming the majority of assemblers.
- QUAST is a good tool to assess assembly quality. It makes sense to add it to your list (http://bioinf.spbau.ru/quast)
Thanks. I mentioned QUAST in another, more relevant blog post: http://flxlexblog.wordpress.com/2013/02/26/on-assembly-uncertainty-inspired-by-the-assemblathon2-debate/