Thoughts on a possible Assemblathon3

The Genome10K meeting is ongoing (I am not attending but following through twitter). Today, there will be a talk by Ian Korf about the feasibility of an Assemblathon 3 contest (see this tweet and the schedule). Earlier the @Assemblathon twitter account asked for a wishlist for an Assemblathon 3 through the hashtag #A3wishlist. With this post I want to share my opinion on what a possible Assemblathon 3 could and/or should be about.

First a bit of background. The Assemblathon 1 and 2 (see http://assemblathon.org/) were open competitions where participants were given reads and asked to submit one or more de novo assemblies. The assemblies were judged based on a comparison to a ground truth (Assemblathon 1, the reference genome from which the simulated reads were drawn) or by using the original reads supplemented with orthogonal data from the same genomes (Assemblathon 2, where no reference genome was available and real sequencing data was supplied). Similar assembly ‘competitions’ such as GAGE, GAGE-B and GABenchToB were closed: the authors ran the assemblies and compared them to a reference genome.

Allowing for a bit of oversimplification, all these competitions asked the question “which assembler works best?”, and they all reached the same conclusion:

It depends

There is no apparent way to predict, based on the data types or organism, which assembler to use. Also, the ‘perfect’ assembly does not exist; each assembly comes with its flaws. For more on this matter, see the flurry of comments and posts that appeared in the aftermath of the Assemblathon 2 paper.

Some advances have been made in recent years. For example, complete and nearly perfect assemblies for bacterial genomes are now possible. But, in my opinion, the above conclusion still holds for larger genome assembly projects, especially if short read data is used.

Of course, a lot has been learned from running these competitions, and I am very grateful for the effort that went into them. However, I do not think we need another such competition. If there is going to be an Assemblathon 3, there are of course obvious datasets to try (PacBio, Oxford Nanopore, synthetic long reads) in different combinations using updated and new assembly software tools. But I fear that the conclusion(s) from such an effort would be more or less the same as before: it depends.

When I consider the matter of choosing between assemblies, I usually take the position of the researchers running the assemblies. They have paid a lot of money to obtain a good sequence dataset, and now want to turn that into an assembly that is of high enough quality to enable them to answer the biological question that motivated the project. I would suggest that each of those researchers run their own mini-assemblathon: use different programs with different combinations of data and settings to generate multiple assemblies. I wish this wasn’t needed, but we just haven’t come that far in this field yet.

However, these researchers do not have a reference genome to compare their assemblies to and thereby find out which one is best. What they do have is all the data that went into the assembly, and perhaps some orthogonal data (transcriptome reads, optical mapping data, linkage map data, etc). The question then becomes how such a researcher is able to choose which assembly (or assemblies) to continue with.

I have previously written about tools that help answer this question by comparing assemblies on several metrics. Some new tools could be added to the list, but the rest of that post is still relevant. Two recent additions that should be mentioned are workflow programs that perform pre-assembly steps, assembly, and post-assembly validation/improvement all in one go, taking a lot of the burden away from us researchers. These are iMetAMOS and RAMPART.

I believe it is here the Assemblathon 3 could make a contribution. By switching the focus from the assembly developers to the assembly users, Assemblathon 3 could help to answer the question:

How to choose the ‘right’ assembly from a set of generated assemblies

The definition of ‘right’ depends on the biological question (is presence/absence of genes enough, or is long-range contiguity needed, e.g. for comparative genomics). Assemblathon 3 could make a competition around the methods to choose between assemblies: supply a set of assemblies generated using the most popular assemblers, supply the raw data and a set of orthogonal datasets, and let the participants predict which assembly is ‘best’ on different combinations of metrics. Assembly improvement could perhaps be added to this as well (e.g. breaking assemblies at areas conflicting with paired end mappings, ‘polishing’ bases).
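To make ‘best on different combinations of metrics’ a bit more concrete, here is a minimal Python sketch of one simple possibility: standardize each metric across the candidate assemblies (z-scores) and sum them with weights that reflect the biological question. This is only an illustration, not a method used by any of the Assemblathons as far as this post is concerned. The assembly names, metric names, values, and weights are all hypothetical.

```python
from statistics import mean, pstdev


def rank_assemblies(metrics, weights):
    """Rank assemblies by a weighted sum of per-metric z-scores.

    metrics: {assembly_name: {metric_name: value}}
    weights: {metric_name: weight}; use a negative weight for
             metrics where lower is better (e.g. misjoins).
    Returns assembly names, best first.
    """
    names = list(metrics)
    scores = {name: 0.0 for name in names}
    for metric, weight in weights.items():
        values = [metrics[name][metric] for name in names]
        mu = mean(values)
        sigma = pstdev(values)
        for name in names:
            # If all assemblies tie on a metric, it contributes nothing.
            z = 0.0 if sigma == 0 else (metrics[name][metric] - mu) / sigma
            scores[name] += weight * z
    return sorted(names, key=lambda name: scores[name], reverse=True)


# Hypothetical numbers for three candidate assemblies of the same genome.
metrics = {
    "assembly_A": {"scaffold_n50": 5_000_000, "gene_completeness": 0.90, "misjoins": 40},
    "assembly_B": {"scaffold_n50": 1_000_000, "gene_completeness": 0.97, "misjoins": 10},
    "assembly_C": {"scaffold_n50": 3_000_000, "gene_completeness": 0.93, "misjoins": 25},
}

# Hypothetical weightings for two different biological questions.
gene_focus = {"gene_completeness": 1.0, "misjoins": -0.5}
contiguity_focus = {"scaffold_n50": 1.0, "misjoins": -0.2}

print(rank_assemblies(metrics, gene_focus))
print(rank_assemblies(metrics, contiguity_focus))
```

With these made-up numbers, the two weightings put different assemblies on top, which is exactly the ‘it depends’ situation a researcher faces. A real evaluation would have to compute such metrics from the reads and orthogonal data, since no reference genome is available.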

With this, I hope the outcome from an eventual new Assemblathon could be a concrete set of recommendations that will help researchers get the best out of their assembly projects.

2 thoughts on “Thoughts on a possible Assemblathon3”

  1. Hi Lex,

    great overview. I think including methods to choose the “best-suited” assembly is a great idea. Some hints to what “kind” of raw data (long-read, short-read) you need and how “complete” an assembly has to be for different research questions would be very useful. As Titus put it (http://ivory.idyll.org/blog/thoughts-on-assemblathon-2.html), give an ideal metric/outcome (as far as this is suitable) for a specific research question and choose the assembly that fits best to this.

    But I still think the “whole” workflow that should go into an assembly is important: from QC, contamination control, trimming/adapter removal, error correction, (downsampling), assembly(ies), and assembly metrics/correction, to combining different assemblies. Communicating this to biologists is important.

    Not to mention that the field is moving fast and the assembly competitions are outdated even faster (algorithms and new raw data types). That’s why I like http://nucleotid.es/ so much. Continuous integration with evolving raw test data sets is the way to go. Assembly software developers can then provide their software for evaluation.

    From a biologist’s perspective it is very daunting to see the overwhelming number of options for doing the assembly workflow or even the assembly itself (parameter sweep? what’s that?). The biologists I know will stick to one assembler with standard parameters and (hopefully) one data cleaning software that somebody showed them how to use, and that’s it (although iMetAMOS and RAMPART were a big step forward). Thus, asking researchers to do their own mini-assemblathon is not a reality (maybe I’m wrong). After all, most importantly as Karl Popper said, your analysis has to be reproducible.

    • Some good thoughts. The many factors influencing assembly make it even harder to justify a competition. The community probably would benefit more from some well-established best practices…
