Overcoming the second generation sequencing mindset
July 13, 2011
I received an email today via the assemblathon mailing list. Erich Jarvis, from the Howard Hughes Medical Institute, wrote that he had asked somebody at Pacific Biosciences for some feedback on my previous post regarding the PacBio parrot reads.
There are a few things in the response worth repeating here.
A quick summary for the impatient: more than 4 million PacBio reads in total, up to 17 kb in length, are available as part of the assemblathon. Most of the reads are short (peaking at 600-700 or 950 bases), but a significant fraction is very long. Judged by average read quality, no read scores above Q11.
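To put Q11 in perspective, here is a minimal sketch of the standard Phred conversion, Q = -10 * log10(p), where p is the per-base error probability. The function names are my own for illustration, and I am assuming the quality values in the files follow the usual Phred definition.

```python
import math

def phred_to_error_prob(q):
    """Per-base error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def accuracy_to_phred(accuracy):
    """Phred score corresponding to a per-base accuracy between 0 and 1."""
    return -10 * math.log10(1 - accuracy)

# Q11 means roughly 8 errors per 100 bases.
print(round(phred_to_error_prob(11), 3))  # 0.079
```

Under the same assumption, an accuracy of 0.75 passed to `accuracy_to_phred` comes out at about Q6.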
The assemblathon is a ‘competition’ to enhance genome assembly. The first round, earlier this year, was based solely on simulated reads; assemblathon 2 has now started and is based solely on real reads. Researchers are asked to come up with their best assemblies, which will be compared against each other using several metrics.
Two of the assemblathon 2 genomes have only Illumina sequences, but for the third one, a parrot, there are reads available from Illumina, 454 and PacBio.
As we are about to get our PacBio RS instrument later this year, I decided to have a look at the data available for the parrot.
First some info from the readme file:
“Two libraries were created at 7.5kb and 13kb insert sizes. The 7.5kb library was sequenced at 45 and 90 minutes and the 13kb library was only sequenced at 90 minute movies. The raw reads are filtered on a ReadQuality metric split on the adapter region to generate ‘subreads’.”
I guess this is for those reads where the sequencing goes around the SMRTBell and reads the same molecule twice. The readme further states that reads are only included if they are at least 100 bp (the 7.5 kb insert runs) or 500 bp (the 13 kb insert run) long, and have a read quality parameter (internally developed, it seems) of at least 0.75 ‘raw single pass accuracy.’
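As I read it, the inclusion rule boils down to something like the sketch below. This is only my interpretation of the readme, not PacBio's actual filtering code: the function and constant names are made up, and I am assuming the quoted lengths are minimums.

```python
# Thresholds as quoted in the readme; the dictionary keys are my own labels.
MIN_LENGTH_BP = {"7.5kb": 100, "13kb": 500}
MIN_READ_QUALITY = 0.75  # 'raw single pass accuracy'

def keep_subread(length_bp, read_quality, library):
    """Return True if a subread would pass the filter described in the readme."""
    long_enough = length_bp >= MIN_LENGTH_BP[library]
    good_enough = read_quality >= MIN_READ_QUALITY
    return long_enough and good_enough

# A 150 bp subread passes for the 7.5 kb library but not for the 13 kb one.
print(keep_subread(150, 0.80, "7.5kb"))  # True
print(keep_subread(150, 0.80, "13kb"))   # False
```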

