Make Newbler open source: the Roche response and the future of Newbler

Earlier this year, I started a petition to ask Roche/454 Life Sciences to make the Newbler software (gsAssembly, gsMapper and Amplicon Variant Analyzer) open source. See this post for the background to the petition.

Source: Wikimedia Commons, by Marcus Quigmire

When I closed the petition, 162 people had signed it (see the PDF on figshare). During the Advances in Genome Biology and Technology (AGBT) meeting in Florida, I handed over the results of the petition to two Roche representatives: Dan Zabrowski, Head of Roche Sequencing Unit, and Paul Schaffer, Vice President of Roche 454 Sequencing Business. See my blog post on the conversation I had with them.

Dan Zabrowski and Paul Schaffer promised me an official Roche response, and here it is (exclusively released through this blog): Continue reading

Looking back at AGBT 2014

Dinner at AGBT.jpg

I attended, for the first time, the Advances in Genome Biology and Technology (AGBT) meeting in Florida. With this post, I intend to summarise my experiences of the meeting. I will not cover everything that happened at the meeting, but focus on the areas of my own interest.

Continue reading

Make Newbler open source: petition results and the meeting with Roche

A couple of weeks ago, I started a petition to ask Roche/454 Life Sciences to make the Newbler software (gsAssembly, gsMapper and Amplicon Variant Analyzer) open source. See this post for the background to the petition.

The results are in: see the PDF on figshare. 162 people have signed the petition. Many thanks to all of you!

This week, I attended the Advances in Genome Biology and Technology (AGBT) meeting in Florida, and on Thursday I handed over the results of the petition to two Roche representatives: Dan Zabrowski, Head of Roche Sequencing Unit, and Paul Schaffer, Vice President of Roche 454 Sequencing Business.

I had a half-hour, very open and interesting discussion with Roche. Roche expressed their appreciation of the fact that we as a community voiced our concerns and wishes around Newbler. Roche occasionally picks up signals from researchers, but a petition like this was very useful for them as a much stronger signal of what we think about one of their products.

Dan Zabrowski told me Roche is committed to fully supporting access to the Newbler software even after the 454 Life Sciences shutdown. They will take the request for open source access to the code seriously, and promised to come back with an official response sometime in the coming weeks. They did not hint at what that response would be, which is understandable.

I want to thank Dan Zabrowski and Paul Schaffer for giving me time to explain the background and hand them the results. I also want to thank, once again, all of you who signed the petition. We may collectively have made a difference. Keep an eye on my Twitter feed and this blog for the official Roche response!

The one and only Oxford Nanopore talk at AGBT 2014 – with real data

Twitter started buzzing this morning at AGBT because researchers started getting confirmation emails from Oxford Nanopore regarding their application for the MinION Access Program (MAP). This timed well with the first talk discussing some serious data from the platform, by David Jaffe from the Broad Institute, entitled “Assembly of Bacterial Genomes Using Long Nanopore Reads”. Probably no coincidence…

The MinION from Oxford Nanopore. Source https://www.nanoporetech.com. Unsure about copyright…

David Jaffe’s highly anticipated talk showed data generated by Oxford Nanopore on their MinION from two bacterial genomes. One was a methylation-negative E. coli (the fact that it was methylation-negative may have been significant, but he didn’t say). The second species was a Scardovia. 5 micrograms of DNA were sent to the company. Library prep consisted of fragmentation and adaptor ligation, basically the classical workflow. Nothing was said about the type of adaptors.

Continue reading

Make Newbler Open Source: an update

My petition to ask Roche/454 Life Sciences to make the Newbler software open source, as announced on this blog, has so far yielded 124 signatures. Thanks to all who signed!

It is not too late to add your signature!

Please do this before Thursday February 13th. On that day, at the Advances in Genome Biology and Technology (AGBT) meeting, I have been given the opportunity to hand over the petition results to two Roche representatives: Dan Zabrowski, Head of Roche Sequencing Unit, and Paul Schaffer, Vice President of Roche 454 Sequencing Business. The fact that Roche is giving me the opportunity to meet these people makes me very happy!

Finally, here are some of the comments added by people who signed the petition. I anonymised them, but if you see your comment and would rather I did not repeat it here, let me know (lex dot nederbragt at ibv dot uio dot no). Thanks for these wonderful comments on the Newbler assembler:

Roche has an excellent opportunity of publicit their commitment with science making their fantastic product Newbler open-source

I would like to runProject forever!

Newbler source code is very valuable to the sequencing community. Roche will be doing a service to the scientific community by making the Newbler source code available after the 454 shutdown.
Goodwill from making this OS is best value Roche can get for this code
This is way overdue. James Knight has done wonderful work but Newbler may fall through the cracks if it is not opened up to the community.

The newbler is great assembler for paired end data.
Please make it opensource, and I would be happy to contribute to it’s development.

Open source is a friend to science aiding reproducibility and allowing future developers to learn from existing work. Please do not limit science by preventing our access.

The Newbler suite has been remarkably useful and should continue to be so. I hope Jim Knight sees this and pushes for his creation to be made open source

I strongly support this petition and note there is a precedent for a commercially-developed assembler to be made publicly available under an open source license – the Celera assembler – much to the benefit of the scientific community.

Newbler was the basis for contig assembly of MiSeq data for me. An excellent piece of software.

Newbler was always my favorite thing about 454. It deserves to live on.

Make Newbler open source

(Cross-posted at contig.wordpress.com)

Typical Newbler assembly output

The Newbler assembler and mapper (gsAssembler, gsMapper) was developed especially for working with the reads from the Roche/454 Life Sciences sequencing technology. It is one of the best programs to deal with this type of data, scoring well in the Assemblathon 2 competition. Newbler has been used for many large and small genome assemblies (numerous bacteria, Atlantic cod, bonobo, tomato, to name a few). Recently, Newbler has added support for using multiple sequencing technologies, making it one of the few hybrid assembly programs available. At the Advances in Genome Biology and Technology (AGBT) meeting in 2013, Roche announced having used the Newbler program with a hybrid 454 and Illumina dataset to improve upon the human genome.

However, the Newbler program is not open source. Luckily, researchers only need to fill out an online form to get a free copy of the software. Still, this has hampered the widespread adoption of this program. Newbler, for example, was not included in assembly evaluations like GAGE and GAGE-B. That Roche/454 does not want to make the source code for Newbler available is partly understandable from a commercial standpoint: at least one competitor technology (Life Tech/Ion Torrent) with a similar sequencing error model could benefit from access to the code. In fact, in a blog post, I showed Newbler to be superior to an open-source program when assembling Ion Torrent mate-pair data.

More worrying is that the hundreds of projects that used Newbler as part of the analysis are fundamentally irreproducible without the source code for each of the different versions. This is especially the case for projects, such as the Atlantic cod genome project, that have been given access to development versions of the code, incorporating elements not available to the general community.

Last October, Roche announced it will shut down its 454 sequencing business in mid-2016. Whatever one may feel about this decision, it further strengthens the argument for Roche/454 to make Newbler open source. After the 454 shutdown, Newbler is otherwise likely to disappear too, meaning that large swathes of the literature cannot be recapitulated from the raw data. Also, long after the 454 shutdown, many researchers will have to process their 454 sequencing data, and many may still want to rely on Newbler for that purpose.

There are several other reasons why I feel the research community should be given access to the source code of Newbler. Newbler represents a very valuable contribution to the field of genome assembly and mapping. Software developers can learn from the algorithms and implementations in the Newbler code, which opens up the possibility of reusing these in other programs. Also, there is the hope that developers will improve upon the program, for example by adding support for other sequencing technologies, or assembling with reads longer than the current maximum of 2 kbp.

So I hereby ask the readers of this blog for help: I have set up an online petition asking Roche/454 to make the Newbler source code available at the latest at the time of the 454 shutdown. Please sign the petition here. Additionally, spread the word (e.g., on Twitter or your own blog). Thanks in advance!

I intend to hand over the results of the petition to a Roche representative at the Advances in Genome Biology and Technology (AGBT) meeting (February 12-15, 2014).

Finally, feel free to use the comments to tell me about your Newbler experiences!

(Thanks to Nick Loman for his constructive comments on an earlier version of this post)

A Software Carpentry-inspired workshop to improve the way we do bioinformatics in our group

About a year ago, I attended a Software Carpentry Bootcamp. Software Carpentry aims ‘to make scientists more productive, and their work more reliable, by teaching them basic computing skills’. As I described in a previous blog post, attending the bootcamp changed many aspects of the way I work. I also decided to become a Software Carpentry instructor. Together with Karin Lagesen, I recently taught a bootcamp in Oslo. Our group at the Centre for Ecological and Evolutionary Synthesis (CEES) is working on different aspects related to fish genomics. This more or less started with our leading role in the project for the sequencing and assembly of the genome of Atlantic cod. As part of that project, many group members had to learn basic Unix shell commands, how to make small pipelines and run programs, and sometimes a bit of scripting. Now, we see that many more people at the CEES are moving into bioinformatics, often as a result of them starting to get high-throughput sequencing data. Thus, we have many self-taught bioinformaticians at the centre, in other words, the ideal target audience for Software Carpentry.

To help those doing bioinformatics work in our group apply the principles of Software Carpentry, we are going to have an ‘extended’ workshop, spread over several weeks with one half-day session each week. In this post, I will present the intended subjects. Continue reading

De novo bacterial genome assembly: a solved problem?

Pacific Biosciences published a paper earlier this year on an approach to sequence and assemble a bacterial genome leading to a near-finished, or finished genome. The approach, dubbed Hierarchical Genome Assembly Process (HGAP), is based on only PacBio reads without the need for short reads. This is how it works:

  • generate a high-coverage dataset of the longest reads possible, aim for 60-100x in raw reads
  • pre-assembly: use the reads from the shorter part of the raw read length distribution to error-correct the longest reads; set the length cutoff so that the longest reads make up about 30x coverage
  • use the long, error-corrected reads in a suitable assembler, e.g. Celera, to produce contigs
  • map the raw PacBio reads back to the contigs to polish the final sequence (rather, recall the consensus using the raw reads as evidence) with the Quiver tool
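
To make the cutoff step in the pre-assembly a bit more concrete, here is a minimal Python sketch of how one could pick the read length cutoff so that the longest reads add up to about 30x coverage. This is my own illustration, not the code used by HGAP: the function name `seed_length_cutoff` and its parameters are hypothetical, and it assumes you already have a list of raw read lengths and an estimate of the genome size.

```python
def seed_length_cutoff(read_lengths, genome_size, target_coverage=30.0):
    """Pick a length cutoff for the 'seed' (longest) reads.

    Hypothetical helper, not part of HGAP itself. Returns the read
    length L such that the reads of length >= L together hold at least
    target_coverage * genome_size bases, or None if the dataset is too
    small to reach that coverage.
    """
    bases_needed = target_coverage * genome_size
    bases_so_far = 0
    # Walk through the reads from longest to shortest, adding up bases
    # until the requested coverage is reached.
    for length in sorted(read_lengths, reverse=True):
        bases_so_far += length
        if bases_so_far >= bases_needed:
            return length
    return None


def split_reads(read_lengths, cutoff):
    """Split read lengths into seed reads (>= cutoff) and shorter reads."""
    seeds = [n for n in read_lengths if n >= cutoff]
    shorter = [n for n in read_lengths if n < cutoff]
    return seeds, shorter
```

In this sketch, the reads at or above the cutoff would be the ones that get error-corrected, and the reads below it would be the ones used to correct them. Ties at the cutoff length mean the seed set can end up slightly above the target coverage.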

The approach is very well explained on this website. As an aside, the same principle can now be used with the PacBioToCA pipeline.

Continue reading

Longing for the longest reads: PacBio and BluePippin

PacBio sequencing is all about looong reads, especially in relation to de novo sequencing. A few things are needed to get the longest reads possible:

  • the longer the enzyme is active on the template, the longer the raw read will be – PacBio calls these ‘Polymerase reads’
  • a library for PacBio sequencing consists of circular molecules, with the target insert between two hairpin adaptors, allowing the enzyme to ‘go around’ and sequence the opposite strand once it reaches the end of the insert. See my previous post on this here. It then follows that the longer the template used for library preparation, the smaller the chance the polymerase goes around the hairpin, leading to longer uniquely sequenced template – PacBio calls these ‘reads of insert’ – and these represent the most useful reads for de novo sequencing applications
  • finally, the distribution of sizes of the library has an influence: any high-throughput sequencing technology, as well as PCR, has problems with ‘preferential treatment’ of smaller fragments. With PacBio sequencing, shorter molecules tend to load preferentially into the wells of the SMRTCell (‘chip’). It then makes sense to try to reduce the shoulder of shorter fragments for the final library preparation.
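
The relationship between polymerase read length and insert size can be sketched with a deliberately simplified model. The Python snippet below is my own illustration under stated assumptions: it ignores the length of the hairpin adaptors and assumes the polymerase starts at the beginning of the insert, and the function name `read_of_insert` is hypothetical.

```python
def read_of_insert(polymerase_read_len, insert_len):
    """Simplified model of one molecule in a PacBio library.

    Assumptions (not from PacBio documentation): adaptor length is
    ignored and sequencing starts at one end of the insert.

    Returns a tuple (unique_len, passes):
      unique_len: length of uniquely sequenced template
      passes:     how many times the insert was traversed
    """
    # The polymerase cannot produce more unique sequence than the
    # insert contains, nor more than it managed to read in total.
    unique_len = min(polymerase_read_len, insert_len)
    passes = polymerase_read_len / insert_len
    return unique_len, passes
```

Under this model, a 6 kbp polymerase read on a 2 kbp insert goes around three times but yields only 2 kbp of unique sequence, whereas the same polymerase read on a 10 kbp insert never reaches the hairpin and yields 6 kbp.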

Recently, PacBio and Sage Science announced a co-marketing partnership for the BluePippin. This instrument allows for tight size selection of DNA samples, effectively making the peak of the size distribution much narrower. With regard to PacBio sequencing, a narrow peak lessens the problem of preferential loading of short fragments, leading to much longer ‘reads of insert’. A demonstration can be seen on this poster (I think I know which fish they used for that one plot…).

Continue reading

On assembly uncertainty (inspired by the Assemblathon2 debate)

The release of the Assemblathon2 paper via arXiv, the blog post by Titus Brown and his posting of his review of the paper sparked a discussion in the comments section of Titus’ blog, as well as on the homolog_us blog. With this blog post, I’d like to chip in with a few remarks of my own.

First, I agree with all of what Titus Brown said (‘you know, what he said’). One of his main take-home messages from the Assemblathon2 is the uncertainty associated with any assembly, and how this needs to be communicated better. When I give presentations to introduce ‘this thing called assembly’, I often start out with a quote from the Miller et al. 2010 paper in Genomics (‘Assembly algorithms for next-generation sequencing data’):

An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target

Continue reading