The release of the assemblathon2 paper via arXiv, the blog post by Titus Brown and his posting of his review of the paper sparked a discussion in the comments section of Titus’ blog, as well as on the homolog_us blog. With this blog post, I’d like to chip in with a few remarks of my own.
First, I agree with all of what Titus Brown said (‘you know, what he said’). One of his main take-home messages from the Assemblathon2 is the uncertainty associated with any assembly, and how this needs to be communicated better. When I give presentations to introduce ‘this thing called assembly’, I often start out with a quote from the Miller et al. 2010 paper in Genomics (‘Assembly algorithms for next-generation sequencing data’):
An assembly is a hierarchical data structure that maps the sequence data to a putative reconstruction of the target
I stress this aspect of the assembly being a ‘putative reconstruction’ first. I then move on to explain the hierarchy from reads to contigs to scaffolds (and perhaps superscaffolds/pseudochromosomes). In the future, I intend stress the fact that an assembly is a hypothesis of the genome more than I already do. Like Titus Brown said:
[…] each assembly is a computational hypothesis, developed from noisy data using approximate computation, and that this assembly must be treated with skepticism
This brings me to my main point: I think we are more then ever positioned to quantify the uncertainty of an assembly, and report both globally and locally how trustworthy it is. The Assemblathon2 mostly used data not involved in building the assembly, such as fosmid sequenced and optical maps. Titus Brown rightly points out in his review that RNA-seq (transcriptomics) data could also be used. I would like to add another aspect: checking that the assembly you obtain is in fact in correspondence with the data you used to built it. Do the reads agree with the assembly? As Titus says, did all reads actually end up being used?
The following tools are already available for this:
- REAPR, used for the Assemblathon2 validations, is ‘a tool that evaluates the accuracy of a genome assembly using mapped paired end reads, without the use of a reference genome for comparison’. Link
- FRCBAM extends the FRC approach, detecting and quantifying ‘features’ possibly indicating assembly problems, based on mapping reads back to the assembly. For their paper on FRCBAM, the authors actually tested their approach on the Assemblathon2 results. Link, preprint on arXiv
- ALE (‘Assembly Likelihood Evaluation’) uses aligned reads and k-mer distributions to determine a likelihood score. ALE is meant as a comparative measure, i.e. to compare different assemblies by their ALE score. Link, paper
- CGAL is another likelihood based method. Link, paper
Note that all these tools start by (having you) mapping back one or more of the input read datasets to the assembled sequences, and then using different methods/algorithms to score fidelity of the sequences, and/or detect possible misassemblies. Locally aligned read depth, insert sizes between paired reads, orientations of the pairs are all indicators used by these approaches. Most importantly, none of these tools need a reference sequence for the task!
Notably, many of these tools are able to generate information on assembly uncertainty along the assembled sequences, information which can be transformed into genomic tracks (e.g., through gff files) to be used in a genome browser. This would allow for supplying these tracks with a finished assembly.
As some of the readers of this blog may be aware, we are working on a new version of the Atlantic cod genome. We intend to try out these tools (we haven’t gotten very far with this yet), and see if we can generate tracks in an appropriate format. Hopefully, we will be able to develop a way to integrate the information of all the tracks into a final one with a summary ‘assembly (un)certainty’ score along the assembled sequences. When this score drops below a certain threshold, we intend to break the assembly (REAPR has implemented this already) to improve the final assembly.
On important caveat of this approach is that the tools describe depend on mapping reads back to the assembly. This poses its own problems, as mapping, like assembly, is far from a solved problem. Nor can, for that matter, one mapping exercise serve all tools. Not only have we had two assemblathons, and other such projects (GAGE, dnGASP), now two or perhaps more alignment competitions are starting (Alignathon and Variathon). So I expect some fiddling with the alignments to get all the tools to work properly (not looking forward to that 😦 ).
In conclusion, I think we are more than able to report on uncertainties – and obviously certainties – associated with the hypothesis we call the assembly. I think it is time this reporting, for example in the form of annotation tracks visualisable in a genome browser, becomes standard for new assemblies. It must be said, though, that whether people actually are going to use this information remains to be seen. How many programs take per-base quality values of the reference into account when mapping reads, or of gene sequences when generating phylogenies (not my field, so I may be very off here)?
By the way, for certain reference genomes (human, mouse and zebrafish), tracks with clone-derived assembly problems are already available, see this blog post from The Genome Reference Consortium.
Let’s hope that by using methods to explicitly report assembly uncertainties, we can make the end-users of our ‘reference’ assemblies more aware of the strengths and weaknesses of the material they work with.
(Note that I left out QUAST and Amosvalidate+FRC, the first because it is not very useful without a reference sequence, the others because these require a file provided by the assembler describing the read layout, and there are very few assemblers able to generate this file)
Very nice post & list of programs. I am also thinking hard about how to use a diginorm-based k-mer approach and/or reference-free error correction to estimate proportion of “information” present in the assembly, which would quantify sensitivity. I look forward to seeing what you turn up with your approaches!
Perhaps somewhat on the lines of what Titus is suggestion, after hearing the nice talk at AGBT on reference-free, assembly-free variant detection (by comparing kmer tables between two or more datasets), I was wondering if a useful tool would look for cases of kmers found in the reads but not in the assembly. Frequency of the kmer matters of course, and that would be an important component.
I like the idea!
We actually have it working for genome assembly; it’s pretty trivial there, although heterozygosity and repeats combined throw a spanner into it. The real trick is to get it working for mRNAseq and metagenomics, which is where digital normalization comes in handy. If I get the free time in the next few days, I’ll post some instructions and graphs, tho.
Good post Lex ! Can we classify it as thoughts on thoughts on thoughts that Titus wanted to see? 🙂
> I was wondering if a useful tool would look for cases of kmers found in the reads but not in the assembly.
I think counting k-mers in the reads is the computationally challenging problem and we covered few approaches here –
Once the k-mers are counted in reads, comparing with assembly is straightforward.
You might find useful KmerCounter, http://sourceforge.net/p/kmercounter/code/ci/ba4a13b3280fada5f96b22b37bee01f404e0e563/tree/, a 16mer counter using a big hash. Pros: only about 16Gb of memory required to keep the hash in memory and extremely fast as it just needs to parse linearly input data and store it into memory. Cons: only 16mers can be used due to the hash construction.
I find it extremely useful to get a quick overview of my data, at least when the reference genome is smaller than 4Gb=4^16, otherwise I get bias on kmers distribution due to lack of kmers space.