I received an email today via the Assemblathon mailing list. Erich Jarvis, from the Howard Hughes Medical Institute, wrote that he had asked somebody at Pacific Biosciences for some feedback on my previous post regarding the PacBio parrot reads.
There are a few things in the response worth repeating here.
First, it was pointed out that, with a per-read accuracy of 85%, phred quality values of 8-11 are to be expected. I had already realised this when I considered that a phred score of 10 means a 1 in 10 chance of a base being wrong, i.e. 90% accuracy. For a phred score of 8, the error probability is 10 to the power of -0.8, which is about 1 in 6.3, or 16%, giving 84% accuracy.
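For anyone who wants to check these numbers, here is a minimal sketch of the standard phred conversion (Q = -10 × log10 of the error probability). The function names are my own and not taken from any PacBio tool.

```python
import math


def phred_to_error(q):
    """Error probability for a phred quality value q."""
    return 10 ** (-q / 10)


def accuracy_to_phred(accuracy):
    """Phred quality value that corresponds to a given per-base accuracy."""
    return -10 * math.log10(1 - accuracy)


# Print error probability and accuracy for the quality values discussed above
for q in (8, 10, 11):
    p = phred_to_error(q)
    print(f"Q{q}: error probability {p:.3f}, accuracy {1 - p:.1%}")

# The reverse direction: which phred value matches 85% accuracy?
print(f"85% accuracy is roughly Q{accuracy_to_phred(0.85):.1f}")
```

An accuracy of 85% lands just above Q8, which fits the reported range of 8-11.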
Second, the errors are spread randomly over a read, meaning that a little extra coverage (5x or more) should get rid of them.
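To get a feeling for why random errors wash out with coverage, here is a rough back-of-the-envelope sketch. It assumes that errors are independent between reads, that the per-base error rate is 15%, that every error is a simple mismatch, and that a consensus position is only wrong when at least half of the reads covering it are wrong. These are my own simplifying assumptions, not PacBio's consensus method, and real reads will behave differently.

```python
from math import comb


def prob_majority_wrong(coverage, per_base_error=0.15):
    """Probability that at least half of the reads are wrong at one position.

    Treats the number of wrong reads as binomially distributed, which is
    only valid if errors are independent between reads (an assumption).
    """
    p = per_base_error
    k_min = (coverage + 1) // 2  # smallest count that is at least half
    return sum(
        comb(coverage, k) * p**k * (1 - p) ** (coverage - k)
        for k in range(k_min, coverage + 1)
    )


# Odd coverages only, to avoid ties in the majority vote
for cov in (1, 3, 5, 9, 15):
    print(f"{cov:>2}x coverage: {prob_majority_wrong(cov):.2e}")
```

Under this crude model the chance of a wrong consensus base drops quickly as coverage goes up.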
Third, the PacBio person wrote:
It is a bit unfortunate that existing 2nd generation tools do not work well with this error profile and the short-read/high-accuracy mindset takes a while to adjust to the characteristic of our reads.
I could not agree more with that second part. In a way, I deliberately took a naive look at the PacBio reads, with the mindset of second-generation reads (I love the ‘short-read/high accuracy’ description…). The PacBio data is in a different league: a totally different error model, but potentially very long reads. There is also a very nice option to obtain high-quality reads using circular consensus sequencing, as shown for the E. coli data just released by PacBio.
So, new tools are needed to work with these data, and I am eagerly looking forward to getting the PacBio software installed (which will have to wait till after my holidays).
Finally, putting my ‘sequencing core facility hat’ on, we are already getting questions from our users along the lines of “when this machine finally arrives, can I then use it for this and this project?”. Often, the expectations are unrealistically high. It will be my and my colleagues’ task to guide these people in a direction that will get them the best results for their money.
As with second-generation sequencing, getting the data will be relatively easy (and cheap); making sense of it will be something different altogether…