How to sequence a bacterial genome at the end of 2012

A potential user (‘customer’) of our sequencing platform asked how to generate reference genomes for his four bacterial strains. His question inspired me to write this post. The suggestions below are not absolute; they are just my thoughts on how one could go about sequencing a bacterial genome these days using one or more of the available sequencing platforms. I would appreciate any feedback or suggestions in the comments section!

Option 1: bits and pieces

  • Libraries: paired end or single end sequencing
  • Platform: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior
  • Bioinformatics: assembly: Velvet, SOAPdenovo, Newbler, MIRA, Celera
  • Outcome: up to hundreds of short contigs (with only single-end reads) or contigs + scaffolds (with paired end reads)
  • Pros: fast and cheap, OK for presence/absence of e.g. genes
  • Cons: doesn’t give much insight into the genome
  • Remarks: due to per-run throughput, multiplexing is recommended; data can also be used for mapping against a reference genome instead
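To make the multiplexing remark concrete, here is a minimal sketch of the arithmetic involved: how many genomes fit in one run at a chosen depth. The function name and all numbers (run yield, genome size, target coverage) are assumptions for illustration, not specifications of any particular platform.

```python
def samples_per_run(run_yield_bp, genome_size_bp, target_coverage):
    """Return how many genomes of the given size fit in one run at the target coverage."""
    if genome_size_bp <= 0 or target_coverage <= 0:
        raise ValueError("genome size and coverage must be positive")
    # Bases needed to cover one genome at the requested depth
    bases_per_sample = genome_size_bp * target_coverage
    return int(run_yield_bp // bases_per_sample)


# Hypothetical numbers, for illustration only
run_yield = 1_500_000_000  # assumed usable yield of one run, in bases
genome_size = 5_000_000    # assumed 5 Mb bacterial genome
coverage = 50              # assumed target sequencing depth

print(samples_per_run(run_yield, genome_size, coverage))  # prints 6 with these assumed numbers
```

In practice you would plug in the yield your own facility actually gets per run and leave some margin for uneven barcode representation.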

Option 2: a few gaps remaining

  • Libraries: paired end or single end sequencing combined with at least one mate pair library of 8kb distance, optionally a 3-5 kb mate pair library
  • Platform: paired end: one or more of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior; mate pairs: preferably on the 454, otherwise Illumina; a protocol has also been demonstrated for the Ion PGM
  • Bioinformatics: assembly: Illumina data: Velvet, SOAPdenovo; 454 data: Newbler, MIRA, Celera; hybrid: Newbler, Celera
  • Outcome: down to one or a few scaffolds per chromosome/plasmid with a few to numerous gaps remaining
  • Pros: gives a lot of long-range information
  • Cons: mate pair libraries are more cumbersome to make; more expensive; remaining gaps are often due to repeats, which may be exactly the regions of interest (e.g. transposons, rDNA operons)
  • Remarks: hybrid assembly is largely uncharted territory; gap closing programs may help (IMAGE2, Soap gap closer, GapFiller, …)
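To see how many gaps remain in a scaffolded assembly, one simple approach is to count runs of N in the scaffold FASTA file. The sketch below assumes that gaps are written as runs of N, which is a common convention among scaffolders but should be checked for your assembler; the function names and the example sequences are made up for illustration.

```python
import re


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def count_gaps(sequence, min_run=1):
    """Return the lengths of all runs of N (either case) of at least min_run bases."""
    runs = (len(m.group()) for m in re.finditer(r"[Nn]+", sequence))
    return [length for length in runs if length >= min_run]


# Made-up scaffolds, for illustration only
example = """>scaffold_1
ACGTACGTNNNNNNNNNNACGT
ACGTNNNNNACGTACGT
>scaffold_2
ACGTACGTACGT
"""

for name, seq in parse_fasta(example):
    gaps = count_gaps(seq)
    print(name, "length:", len(seq), "gaps:", len(gaps), "gap bases:", sum(gaps))
```

Running this before and after a gap closing program gives a quick impression of how much it actually closed.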

Option 3: closing the gaps

  • Libraries: as for option 2, but in addition a long-insert (10kb) PacBio library
  • Platform: one of Illumina MiSeq or HiSeq, Ion Torrent PGM, 454 GS FLX or GS Junior; plus PacBio RS
  • Bioinformatics: assembly: same as option 2; gap closing/finishing: PBJelly; AHA from PacBio can also be used
  • Outcome: down to one or a few scaffolds per chromosome/plasmid potentially without gaps remaining
  • Pros: no mate pairs needed, potentially very complete genome
  • Cons: requires multiple libraries, expensive
  • Remarks: Quiver can be used to improve the per-base quality given enough PacBio coverage

Option 4: The ALLPATHS_LG way

  • Libraries: Illumina paired end + Illumina Mate Pair + a long-insert (10kb) PacBio library
  • Platform: Illumina HiSeq or MiSeq; PacBio RS
  • Bioinformatics: assembly: ALLPATHS_LG
  • Outcome: down to one or a few contigs per chromosome/plasmid
  • Pros: can yield finished genome
  • Cons: complex mixture of libraries, can become expensive
  • Remarks: special recipe: the paired end library should have a short insert, such that the forward and reverse read overlap
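Whether the forward and reverse reads overlap follows from simple arithmetic on read length and fragment size (the ‘insert’ in the remark above). The sketch below is only that arithmetic; the function names and the example numbers are assumptions for illustration, so check the ALLPATHS_LG documentation for the library specifications it actually expects.

```python
def expected_overlap(read_length, fragment_size):
    """Bases shared by the forward and reverse read of a pair.

    A negative value means the reads do not meet and is the size of the
    unsequenced stretch between them. The overlap can never exceed the
    fragment itself, hence the min().
    """
    return min(2 * read_length - fragment_size, fragment_size)


def reads_overlap(read_length, fragment_size, min_overlap=1):
    """True if both reads of a pair share at least min_overlap bases."""
    return expected_overlap(read_length, fragment_size) >= min_overlap


# Hypothetical library designs, for illustration only
for read_length, fragment_size in [(100, 180), (100, 300), (250, 400)]:
    print(read_length, fragment_size,
          expected_overlap(read_length, fragment_size),
          reads_overlap(read_length, fragment_size))
```

Real libraries have a distribution of fragment sizes, so the mean should sit comfortably below twice the read length rather than right at the limit.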

Option 5: PacBio hybrid

  • Libraries: paired end or single end sequencing + a long-insert (10kb) PacBio library
  • Platform: as for option 2, but PacBio RS in addition
  • Bioinformatics: error correction of the PacBio reads: PacBioToCA, LSC, P_ErrorCorrection from PacBio; assembly: Celera, MIRA
  • Outcome: down to one or a few scaffolds per chromosome/plasmid potentially without gaps remaining
  • Pros: can yield finished genome; only needs two libraries that are simple to produce
  • Cons: error-correction can be computationally demanding
  • Remarks: Quiver can be used to improve the per-base quality given enough PacBio coverage

Option 6: PacBio only

  • Libraries: a single, long-insert (10kb) PacBio library
  • Platform: PacBio RS
  • Bioinformatics: error-correction; HGAP; assembly: Celera or Allora
  • Outcome: down to one or a very few contigs per chromosome/plasmid
  • Pros: uses a single library, cheap, finished genome of very high quality
  • Cons: somewhat experimental, limited examples available, therefore currently a bit more risky
  • Remarks: use Quiver for polishing

EDIT: additional tools

  • Reader BenK suggested adding optical mapping as a way to improve bacterial genome assemblies. Not a sequencing technology per se, but, as he writes, “Many genomes have undiscovered repetitive regions that mapping unveils.”
  • Reader Roy suggested mentioning stand-alone scaffolders: SSPACE, SOPRA, and the reference-based scaffolder ABACAS. See also this SeqWiki howto.
  • Sébastien Boisvert, author of Ray, suggested I add his program, which I hereby do. It is not listed above because I have no experience with it for de novo bacterial genome assemblies.
  • In the same spirit, readers may want to look at the GAGE results for more programs.

Relevant previous blog posts:

Combining short and long reads: choosing between PacBioToCA and the new ALLPATHS_LG

Ion Torrent Mate Pairs and a single scaffold for E coli K12 substr. MG1655

Links to programs mentioned:

ALLPATHS_LG
Celera
GapFiller
HGAP
IMAGE2
LSC
MIRA
Newbler: request here
P_ErrorCorrection, Allora and AHA see here, these need smrtpipe
PacBioToCA (also part of Celera 7.0 and smrtpipe)
PBJelly
Quiver see here
SOAPdenovo 
Velvet

Reader Roy suggested adding SPAdes.

(If you feel I missed a program here, let me know in the comments below!)

13 thoughts on “How to sequence a bacterial genome at the end of 2012”

  1. You overlook optical mapping, which really does have a significant place in the few/none genomes with no good reference. Many genomes have undiscovered repetitive regions that mapping unveils.

  2. SPAdes often outperforms Velvet on Illumina data. And you don’t mention standalone scaffolding solutions such as SSPACE, or reference-based scaffolding such as ABACAS.

  3. This is a great summary, thanks! I’ve tried most of the things listed with varying success, but it’s a great summary.

    We’ve also been frustrated with the 454 long-reads. The assemblies we get with 454 are beautiful, but we’ve moved away from it because of cost.

    I was hopeful that the longer reads from MiSeq would make this a nearly moot point, but, I’ve been playing with assemblies using Illumina 2x250bp data, and am seeing quality issues towards the end of the second read that are not entirely captured by the quality scores. I wouldn’t mind a decline in quality so much, except that it appears that the quality scores given by the sequencer are inflated at the top end.

    In my experience Illumina has been very good about read quality, so I’m not sure exactly what’s going on. I hope it’s something specific to our instrument/datasets/early releases and not a developing trend.

    • @Arjun, this isn’t specific to just your instrument. From what I understand this can be due to the quality of the DNA being used. For instance, I know of users submitting data for various microbiome projects (switching to MiSeq for 16S surveying). They ran into this problem when submitting samples from kidney stones, which were apparently of lower quality than others, likely due to the extraction method.

  4. Please see my suggestions:
    - read “Genome Project Standards in a New Era of Sequencing” Science, 2009:Vol. 326 no. 5950 pp. 236-237 DOI: 10.1126/science.1180614
    - SPAdes is in use by several sequencing centers in the USA in single-cell production pipelines. For isolates it also outperforms the majority of assemblers.
    - QUAST is a good tool to assess assembly quality. It makes sense to add it to your list (http://bioinf.spbau.ru/quast)

    • You need sufficient coverage to generate good-enough contigs. It may be possible using only mate pairs, but you will need to sequence a lot of them – I am uncertain whether anyone has ever tried this.
