As before, full run throughput in gigabases (billion bases) is plotted against single-end read length for the different sequencing platforms, both on a log scale. Yes, I know a certain new instrument seems to be missing, hang on, I’m coming back to that…
I work for the Norwegian High-Throughput Sequencing Centre (NSC), which is located at the Centre for Ecological and Evolutionary Synthesis (CEES). At CEES, numerous researchers run bioinformatic, or other computation-heavy, analyses for their projects. With this post, I want to describe the infrastructure we use for computation and storage, and the reasons we chose to set it up the way we did.
In general, when researchers need high-performance computing (HPC) infrastructure, they can purchase their own machines and house them in or near the office, or use a cloud solution. Many, if not most, universities also offer a compute cluster for their researchers’ analysis needs. We chose a hybrid model between the university’s HPC infrastructure and setting up our own. In other words, our infrastructure is a mix of self-owned and shared resources that we either apply for or rent.
In 1986, in a letter to the journal Nature, James Bruce Walsh and Jon Marks lamented that the upcoming human genome sequencing project “violates one of the most fundamental principles of modern biology: that species consist of variable populations of organisms”. They further wrote: “As molecular biologists generally ignore any variability within a population, the individual whose haploid [sic] genome will be chosen will provide the genetic benchmark against which deviants are determined”. They concluded that “‘the’ genome of ‘the’ human will be sequenced gel by acrylamide gel”.
We have come a long way when it comes to taking population variation into account in molecular/genetic/genomic studies. But these sentiments, expressed as early as 1986, foreshadow one of the major trends in the human genetics field: the move away from a single, linear representation of ‘the’ human genome. In this post I will provide some background, explain the reasons for moving towards graph-based representations, and point out some of the challenges associated with this development.
The Genome10K meeting is ongoing (I am not attending, but I am following it on Twitter). Today, Ian Korf will give a talk on the feasibility of an Assemblathon 3 contest (see this tweet and the schedule). Earlier, the @Assemblathon Twitter account asked for an Assemblathon 3 wishlist through the hashtag #A3wishlist. With this post I want to share my opinion on what a possible Assemblathon 3 could and/or should be about.
Earlier this week, the first paper was published describing the use of Oxford Nanopore MinION data to solve a biological question. The paper, entitled “MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island” came out in Nature Biotechnology (ReadCube link).
I was a reviewer for this manuscript, and I have posted my two (signed) review reports on Publons. As data and code were made available by the authors (as it should be), I made a (mostly successful) effort to reproduce the computational part of the paper. After I was done with the review report on the second version, I could not resist taking a further look at some of the results. This led me to send some plots to the authors, and one of these plots ended up becoming figure 1. It was a lot of fun to see that in the final version.
Below are some excerpts of the review reports.
Two days ago, a paper by Kristi Kim et al., titled “Long-read, whole-genome shotgun sequence data for five model organisms”, appeared in Nature Scientific Data. This paper describes the release of whole-genome PacBio data, generated by Pacific Biosciences and others using quite recent chemistries, for five model organisms: Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster.
Beyond the datasets described in the paper, Pacific Biosciences also released whole-genome data for the human genome and, very recently, for Caenorhabditis elegans using the latest P6/C4 chemistry. Check out the PacBio DevNet, which also hosts data for other applications.
I think it is fantastic that Pacific Biosciences releases these datasets as a service to the community – and, obviously, to showcase their technology. Company-generated data often represents the best possible data, as it is produced by people with extensive experience with the technology. It remains to be seen whether ‘regular’ owners of PacBio RS II instruments can reach the same level of data quality. Nonetheless, these datasets are very helpful for teaching (see my previous blog post), for comparisons with other technologies (I wish I could make time to thoroughly compare PacBio data to the Moleculo data available for the same species), and for the development of new software applications.
Open source, open data, open course
We recently held the third instalment of the course on High-Throughput Sequencing technologies and bioinformatics analysis. This course aims to provide students, as well as users of the organising service platforms, with basic skills to analyse their own sequencing data using existing tools. We teach both Unix command line-based tools and the Galaxy web-based framework.
I coordinate the course, and also teach a two-day module on de novo genome assembly. I keep developing the material for this module, and am increasingly relying on material openly licensed by others. To me, it is fantastic that others are willing to openly share material they developed for others to (re)use. It was hugely inspiring to discover material such as the assembly exercise, and the IPython notebook for building small De Bruijn graphs (see below). To me, this confirms that ‘opening up’ in science increases the value of material by many orders of magnitude. I am not saying that the course would have been impossible without this material, but I do feel the course has become much better because of it.
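To give a flavour of what such a notebook covers: the core of a De Bruijn graph can be built in a few lines of Python. This is not the notebook I used in the course, just a minimal sketch of the idea, where nodes are (k-1)-mers and each k-mer observed in the reads contributes an edge from its prefix to its suffix:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a simple De Bruijn graph from a list of reads.

    Nodes are (k-1)-mers; every k-mer in a read adds an edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Tiny example: three overlapping reads from the sequence "ATGGCGT"
reads = ["ATGGC", "TGGCG", "GGCGT"]
graph = de_bruijn(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

For these reads the graph forms a single path (AT → TG → GG → GC → CG → GT), so the original sequence can be read back by walking the edges; with real data, repeats create branches, which is exactly what makes assembly hard.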
‘Open’ made this course possible
This course used:
- openly available sequencing data released by the sequencing companies (although some of the Illumina reads are behind a – free – login account)
- sequencing data made openly available by individual researchers
- code developed for teaching made available by individual researchers under a permissive license
- open source software programs
(for a full list of resources, see this document).
I am extremely grateful to the authors/providers of these resources, as they greatly benefitted this course!
‘Opening up’ is the least I can do to pay back
In exchange, the very least I can do is make my final course module openly available as well.
The rest of this post describes the material and its sources in more detail.