I get asked about this a lot, so I thought to put together a quick blog post on it.
Disclaimer: this is the advice I usually give people and is given without warranty. As they say, Your Mileage May Vary.
Main advice: bite the bullet and get the budget to get 100x coverage in long PacBio reads. 50-60x is really the minimum. Detailed advice:
Sequencing and assembly
get 100x PacBio latest chemistry aiming for longest reads (make sure provider has SAGE Blupippin or something similar)
get 100x HiSeq paired end regular insert
run PBcR on the PacBio reads, this is part of Celera. It corrects the longest raw reads, assembles them using Celera (long run time). Make sure to install the latest Celera release which uses the much faster MHAP approach for the correction.
BUSCO (we have issues with fish, seems not to be tailored to that group of organisms, developers tell us they are fixing it)
linkage map? or other map (RAD-tag based). (software?)
BioNanoGenomics can be used for QC also
Use a genome browser to get a feeling for your results, e.g. IGV; add assembly, BAM files, annotation, transcripts mapped and browse
We use MAKER2 for automated annotation
better with RNA-seq (assembled) data
best with PacBio IsoSeq data
filter filter filter
getting a repeat database is a science in itself, we are still learning
What about Oxford Nanopore data?
Absolutely for bacterial genomes, see the Loman/Quick/Simpson paper in Nature Merhods, or use Spades. For large eukaryote genomes, the potential is there. However, I would wait a bit for higher throughput and tailored software.
Edit: on Twitter, @ZaminIqbal asked
@lexnederbragt What changes to that would you make for smaller genomes Lex?
To which I replied:
HGAP from smrtanalysis as it does it al in one go
Circularisation with – need to ask what we use. I now think we follow this wiki from PacBio
At the workshop, I presented a poster based on my recent blog post on “Active learning strategies for bioinformatics teaching” (the first time I turned a blog post of mine into a poster…). The poster can be viewed on FigShare. I managed to make the poster a bit interactive itself, by having a small quiz on it. The results speak for themselves:
A quiz to make a poster on active learning techniques interactive
The more I read about how active learning techniques improve student learning, the more I am inclined to try out such techniques in my own teaching and training.
I attended the third week of Titus Brown’s “NGS Analysis Workshop”. This third week entailed, as one of the participants put it, ‘the bleeding edge of bioinformatics analysis taught by Software Carpentry instructors’ and was a unique opportunity to both learn different analysis techniques, try out new instruction material, as well as experience different instructors and their way of teaching. On top of that the group was just fantastic to hang out with, and we played a lot of volleyball.
I demonstrated some of my teaching and was asked by one of the students for references for the different active learning approaches I used. Rather then just emailing her, I decided to put these in this blog post.
The motivation of turning to active learning techniques is nicely summarised in a post on the ‘communications of the ACM’ blog entitled “Be It Resolved: Teaching Statements Must Embrace Active Learning and Eschew Lecture”. I highly recommended reading it and checking out the references mentioned. I am by no means an expert in the area, and simply am learning by doing. I have no ways to measure whether the techniques I use are beneficial, but student responses strongly encourage me to keep applying them. My teaching is also very much influenced by my being a Software Carpentry instructor.
The following describes what I do in the de novo genome assembly module of the ‘High Throughput Sequencing technologies and bioinformatics analysis’ course I organise (link to materials). I used part of that module for the NGS Analysis Workshop (link).
As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument seems to be missing, hang on, I’m coming back to that…
In general, when one needs high-performance compute (HPC) infrastructure, a (group of) researcher(s) can purchase these and locate them in or around the office, or use a cloud solution. Many, if not most, universities offer a computer cluster for their researchers’ analysis needs. We chose a hybrid model between the universitys HPC infrastructure and setting up one ourselves. In other words, our infrastructure is a mix of self-owned, and shared resources that we either apply for, or rent.
In 1986, in a letter to the journal Nature, James Bruce Walsh and Jon Marks lamented that the upcoming human genome sequencing project “violates one of the most fundamental principles of modern biology: that species consist of variable populations of organisms”. They further wrote: “As molecular biologists generally ignore any variability within a population, the individual whose haploid [sic] genome will be chosen will provide the genetic benchmark against which deviants are determined”. They conclude that ” ‘the’ genome of ‘the’ human will be sequenced gel by acrylamide gel”.
We have come a long way when it comes to taking population variation into account in molecular/genetic/genomic studies. But these sentiments, expressed already in 1986, echo some of the trends in the human genetics field: the move away from a single, linear representation of ‘the’ human genome. In this post I will provide some background, explain the reasons for moving towards graph-based representations, and indicate some challenges associated with this development.
The Genome10K meeting is ongoing (I am not attending but following through twitter). Today, there will be a talk by Ian Korf about the feasibility of an Assemblathon 3 contest (see this tweet and the schedule). Earlier the @Assemblathon twitter account asked for a wishlist for an Assemblathon 3 through the hashtag #A3wishlist. With this post I want to share my opinion on what a possible Assemblathon 3 could and/or should be about.