On the Oxford Nanopore MinION Access Program
Nothing like a fancy new washing machine (source: Wikimedia Commons)
Let’s say you hear about this company that claims to have a revolutionary clothes washer, and you get to try it out! For a refundable $1000, they will ship you a washer, and they have plenty of their special washing powder available, which they will happily sell to you. The thing is, though, the company hasn’t shown anyone how well it actually works, how fast it works, or how clean your clothes will become. Not just that: once you have the washer, you’ll first need to use it on dirty clothes they will send you, and you’ll have to send them back the results. Also:
Only after achieving “consistent and satisfactory performance” with the test samples will participants be allowed to run their own [clothes]
Would you buy that washing machine? I guess not.
About a year ago, I attended a Software Carpentry bootcamp. Software Carpentry aims ‘to make scientists more productive, and their work more reliable, by teaching them basic computing skills’. As I described in a previous blog post, attending the bootcamp changed many aspects of the way I work. I also decided to become a Software Carpentry instructor, and together with Karin Lagesen, I recently taught a bootcamp in Oslo.

Our group at the Centre for Ecological and Evolutionary Synthesis (CEES) works on different aspects of fish genomics. This more or less started with our leading role in the project to sequence and assemble the genome of Atlantic cod. As part of that project, many group members had to learn basic Unix shell commands, building small pipelines, running programs, and sometimes a bit of scripting. Now we see that many more people at CEES are moving into bioinformatics, often because they are starting to get high-throughput sequencing data. In other words, we have many self-taught bioinformaticians at the centre: the ideal target audience for Software Carpentry.
To help the bioinformaticians in our group apply the principles of Software Carpentry, we are going to run an ‘extended’ workshop, spread over several weeks with one half-day session each week. In this post, I will present the intended subjects.
In December last year, I posted a visualisation of the developments in high-throughput sequencing on this blog. In this field the technologies change rapidly, so it is about time for an update. Here is the October 2013 installment. Full run throughput in gigabases (billion bases) is plotted against single-end read length, both on a log scale:
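For readers who want to make a similar plot for their own numbers, here is a minimal matplotlib sketch. The platform names and values in it are made-up placeholders, not the actual October 2013 figures from the visualisation.

```python
# Minimal sketch of a log-log plot of throughput vs. read length.
# The platforms and numbers below are placeholders for illustration,
# not the actual specifications of any sequencing instrument.
import matplotlib.pyplot as plt

platforms = {
    "Platform A": (100, 5),      # (read length in bases, throughput in Gb)
    "Platform B": (250, 15),
    "Platform C": (5000, 0.5),
}

for name, (read_len, gb) in platforms.items():
    plt.scatter(read_len, gb)
    plt.annotate(name, (read_len, gb))

plt.xscale("log")
plt.yscale("log")
plt.xlabel("Single-end read length (bases)")
plt.ylabel("Full run throughput (Gb)")
plt.title("High-throughput sequencing platforms")
plt.savefig("throughput_vs_readlength.png")
```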
This post originally appeared on the Software Carpentry blog.
I am a biologist with no formal training in computational science. A couple of years ago, the increasing size of my data forced me to stop using Excel and switch to the Unix command line and scripting for my analyses. In the process, I learned what I needed mostly by just doing it, but also from books, websites, and of course Google.
Almost exactly one year ago, I attended a Software Carpentry bootcamp. I had heard about Software Carpentry and its bootcamps through twitter, started following their blog and became convinced that this was something I wanted to attend. At some point, I fired off an email to Software Carpentry asking what it would take to have a bootcamp at our university, the University of Oslo in Norway. The answer came down to ‘get us a room and advertise the event, and we’ll provide teachers’. This, in fact, was what happened, and as teachers, we got Mr Software Carpentry himself, Greg Wilson, who taught together with a local teacher (Hans Petter Langtangen).
Pacific Biosciences published a paper earlier this year on an approach to sequence and assemble a bacterial genome, leading to a near-finished, or finished, genome. The approach, dubbed the Hierarchical Genome Assembly Process (HGAP), is based on PacBio reads only, without the need for short reads. This is how it works:
- generate a high-coverage dataset of the longest reads possible, aim for 60-100x in raw reads
- pre-assembly: use the reads from the shorter part of the raw read length distribution to error-correct the longest reads; set the cutoff so that the longest reads make up about 30x coverage (see the sketch below)
- use the long, error-corrected reads in a suitable assembler, e.g. Celera, to produce contigs
- map the raw PacBio reads back to the contigs to polish the final sequence (rather, recall the consensus using the raw reads as evidence) with the Quiver tool
The approach is very well explained on this website. As an aside, the same principle can now be used with the PacBioToCA pipeline.
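To make the pre-assembly cutoff step a bit more concrete, here is a minimal sketch of how one could pick a read-length cutoff so that the longest reads together amount to roughly 30x genome coverage. The function name and the toy numbers are mine, purely for illustration; this is not the actual HGAP/SMRT Analysis implementation.

```python
# Sketch: pick a read-length cutoff such that the longest reads,
# taken together, represent ~30x coverage of the genome.
# Illustrates the idea behind the HGAP pre-assembly step only.

def seed_read_cutoff(read_lengths, genome_size, target_coverage=30):
    """Walk down the read-length distribution from the longest read
    and return the length at which the accumulated bases first reach
    target_coverage * genome_size."""
    target_bases = target_coverage * genome_size
    total = 0
    for length in sorted(read_lengths, reverse=True):
        total += length
        if total >= target_bases:
            return length
    raise ValueError("Not enough sequence for the requested coverage")

# Toy example: 5 Mb genome, made-up read-length list
reads = [12000, 9000, 8000, 7500, 3000, 2500] * 20000
print(seed_read_cutoff(reads, genome_size=5_000_000))
```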
PacBio sequencing is all about looong reads, especially in relation to de novo sequencing. A few things are needed to get the longest reads possible:
- the longer the enzyme is active on the template, the longer the raw read will be – PacBio calls these ‘Polymerase reads’
- a library for PacBio sequencing consists of circular molecules, with the target insert between two hairpin adaptors, allowing the enzyme to ‘go around’ and sequence the opposite strand once it reaches the end of the insert. See my previous post on this here. It then follows that the longer the template used for library preparation, the smaller the chance the polymerase goes around the hairpin, leading to a longer uniquely sequenced template – PacBio calls these ‘reads of insert’ – and these represent the most useful reads for de novo sequencing applications (see the rough sketch after this list)
- finally, the size distribution of the library has an influence: any high-throughput sequencing technology, as well as PCR, has problems with ‘preferential treatment’ of smaller fragments. With PacBio sequencing, shorter molecules tend to load preferentially into the wells of the SMRTCell (‘chip’). It then makes sense to try to reduce the shoulder of shorter fragments in the final library.
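As a back-of-the-envelope illustration of the trade-off between insert length and number of passes, the sketch below estimates how often the polymerase can go around a circular template of a given insert size within one polymerase read. The function and numbers are mine and deliberately simplified: adaptor lengths and other chemistry details are ignored.

```python
# Rough sketch: for a fixed polymerase read length, a longer insert
# means fewer passes around the circular template, but a longer
# uniquely sequenced stretch ('read of insert').
# Simplified illustration only; adaptors and chemistry are ignored.

def passes_and_read_of_insert(polymerase_read_length, insert_length):
    full_passes = polymerase_read_length // insert_length
    read_of_insert = min(polymerase_read_length, insert_length)
    return full_passes, read_of_insert

for insert in (500, 2000, 10000, 20000):
    passes, roi = passes_and_read_of_insert(15000, insert)
    print(f"insert {insert:>6} bp: ~{passes} passes, read of insert {roi} bp")
```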
Recently, PacBio and Sage Science announced a co-marketing partnership for the BluePippin. This instrument allows for tight size selection of DNA samples, effectively making the peak of the size distribution much narrower. With regard to PacBio sequencing, a narrow peak lessens the problem of preferential loading of short fragments, leading to much longer ‘reads of insert’. A demonstration can be seen on this poster (I think I know which fish they used for that one plot…).
The release of the Assemblathon 2 paper via arXiv, the blog post by Titus Brown, and his posting of his review of the paper sparked a discussion in the comments section of Titus’ blog, as well as on the homolog_us blog. With this blog post, I’d like to chip in with a few remarks of my own.
First, I agree with all of what Titus Brown said (‘you know, what he said’). One of his main take-home messages from Assemblathon 2 is the uncertainty associated with any assembly, and how this needs to be communicated better. When I give presentations to introduce ‘this thing called assembly’, I often start out with a quote from the Miller et al. 2010 paper in Genomics (‘Assembly algorithms for next-generation sequencing data’):
An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target
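To make that ‘hierarchical data structure’ a little more tangible, here is a toy sketch of the reads-into-contigs-into-scaffolds hierarchy the quote refers to. The class names and toy data are mine, purely for illustration; real assemblers of course store far more information than this.

```python
# Toy illustration of the assembly hierarchy: reads are grouped into
# contigs, and contigs (with estimated gaps) into scaffolds.
from dataclasses import dataclass
from typing import List

@dataclass
class Read:
    name: str
    sequence: str

@dataclass
class Contig:
    reads: List[Read]      # the reads placed in this contig
    consensus: str         # the reconstructed (putative) sequence

@dataclass
class Scaffold:
    contigs: List[Contig]
    gap_sizes: List[int]   # estimated gaps between consecutive contigs

contig = Contig(reads=[Read("r1", "ACGTACGT"), Read("r2", "GTACGTAA")],
                consensus="ACGTACGTAA")
scaffold = Scaffold(contigs=[contig], gap_sizes=[])
```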