Overcoming the second generation sequencing mindset

(Picture found on rockstar-pickup.com, of all places)

I received an email today via the Assemblathon mailing list. Erich Jarvis, from the Howard Hughes Medical Institute, wrote that he had asked somebody at Pacific Biosciences for feedback on my previous post about the PacBio parrot reads.

There are a few things in the response worth repeating here.

First, it was pointed out that, with a per-read accuracy of 85%, phred quality values of 8-11 are to be expected. I had already realised this when I considered that a phred score of 10 means a 1 in 10 chance of a base being wrong, i.e. 90% accuracy. For a phred score of 8, the error probability is 1 in 10^0.8, which works out to 1 in 6.3, or 16%, i.e. 84% accuracy.
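The conversion is just the standard phred relation, P_error = 10^(-Q/10). A quick Python sketch reproduces the numbers above:

```python
# Convert a phred quality score Q to error probability and accuracy,
# via the standard relation P_error = 10 ** (-Q / 10).

def phred_to_accuracy(q):
    """Return (error probability, accuracy) for phred score q."""
    p_error = 10 ** (-q / 10)
    return p_error, 1 - p_error

for q in (8, 10, 11):
    p_err, acc = phred_to_accuracy(q)
    print(f"Q{q}: error 1 in {1 / p_err:.1f} ({p_err:.1%}), accuracy {acc:.1%}")
# Q8: error 1 in 6.3 (15.8%), accuracy 84.2%
# Q10: error 1 in 10.0 (10.0%), accuracy 90.0%
# Q11: error 1 in 12.6 (7.9%), accuracy 92.1%
```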

Second, the errors are spread randomly over a read, meaning that a little extra coverage (5x or more) should get rid of them.
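As a back-of-the-envelope check (my own simplification, not PacBio's model): if errors are independent and we take a simple majority vote at each position, the binomial tail bounds the chance that the consensus is wrong. The bound is pessimistic, since real errors are mostly indels and scatter across bases rather than agreeing on one wrong call:

```python
from math import comb

def consensus_error(p=0.15, coverage=5):
    """Probability that a strict majority of `coverage` independent reads
    is wrong at a position, given per-base error rate `p`. Pessimistic:
    it treats all errors as agreeing on the same wrong call."""
    k_min = coverage // 2 + 1  # smallest strict majority
    return sum(comb(coverage, k) * p ** k * (1 - p) ** (coverage - k)
               for k in range(k_min, coverage + 1))

for cov in (1, 5, 10, 20):
    print(f"{cov:2d}x: consensus error <= {consensus_error(coverage=cov):.1e}")
```

Even this pessimistic bound drops to roughly 3% at 5x coverage and below 0.2% at 10x; with real errors split among insertions, deletions, and mismatches, the consensus converges much faster.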

Third, the PacBio person wrote:

It is a bit unfortunate that existing 2nd generation tools do not work well with this error profile and the short-read/high-accuracy mindset takes a while to adjust to the characteristic of our reads.

I could not agree more with that second part. In a way, I deliberately took a naive look at the PacBio reads, with a second-generation mindset (I love the ‘short-read/high accuracy’ description…). The PacBio data is in a different league: a totally different error model, but potentially very long reads. There is also a very nice option to obtain high-quality reads using circular consensus sequencing, as shown for the E. coli data just released by PacBio.

So, new tools are needed to work with these data, and I am eagerly looking forward to getting the PacBio software installed (which will have to wait until after my holidays).

Finally, putting my ‘sequencing core facility hat’ on, we are already getting questions from our users along the lines of “when this machine finally arrives, can I use it for this or that project?”. Often, the expectations are unrealistically high. It will be my and my colleagues’ task to guide these people in a direction that will get them the best results for their money.

As with second-generation sequencing, getting the data will be relatively easy (and cheap); making sense of it will be something different altogether…

4 thoughts on “Overcoming the second generation sequencing mindset”

  1. I’m not entirely sure the rather provocative language used in these blog posts, and by Lawrence, would get the full approval of PacBio’s marketing department! (I realise Lex is nothing to do with PacBio).

    Given that “short read/high accuracy” second-generation sequencer owners are going to form the majority of PacBio’s market, I’d recommend concentrating less on telling them they have it all wrong and more on changing that 85% accuracy statistic. Just a tip.

    I’m not sure “short read” is bad, and I’m pretty damn sure there’s nothing wrong with “high accuracy”.

    I feel a little like there is a Jedi mind trick going on here:

    User: I’m worried about that 15% error rate
    Jedi/PacBio: You don’t need to worry about the 15% error rate
    User: I don’t need to worry about the 15% error rate

    Sure, I get that I can just go round the circle a few extra times to iron out those pesky old indels, but that has to mean my costs are going to go up.

    And one final thing. I’m all for “technology aware” software, as suggested here and by PacBio in their paper in 2010. But a fastq file is a really rubbish representation of what just went on in the machine. If PacBio want technology-aware software, I suggest giving users open access to the raw signal coming out of the machine, and open-source all of the code they use in there to convert the signal to bases and qualities. Release all their data, good and bad. Give the very keen and very enthusiastic community the opportunity to write software that works on the raw signal. Will that ever happen? What would we find buried in that code if it was released?

    • Good points. I am not trying to say ‘they have it wrong’, just that ‘hey, it’s different, let’s see what we can do with it’.

      I agree that PacBio should give us more ‘raw’ data, which they have partially done for the E. coli reads (bas.h5 files). Perhaps this is what we need for the Assemblathon parrot data as well?

      A lot of the source code is available online (through their devnet); I haven’t checked whether signal processing is among it, but it wouldn’t surprise me.
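      In the meantime, since bas.h5 is just standard HDF5, a generic reader like h5py already lets us peek inside; a minimal sketch (the file name is hypothetical, and I am not assuming anything about the actual group layout):

      ```python
      # A sketch, not PacBio's API: bas.h5 is ordinary HDF5, so a generic
      # reader can list what is in it. The file name here is hypothetical.
      import h5py

      with h5py.File("reads.bas.h5", "r") as f:
          # Walk the HDF5 tree, printing each dataset's path, shape and dtype.
          def show(name, obj):
              if isinstance(obj, h5py.Dataset):
                  print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
          f.visititems(show)
      ```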

  2. Thanks, Lex, for responding to my feedback and providing a forum for discussion. I have included my full email below to put my quote in context.

    We agree with your suggestion that it would benefit both Assemblathon participants and the sequencing community to release the more ‘raw’ bas.h5 files for these runs and I am working on getting them to the organizers.

    As you mentioned, PacBio’s software – both binaries and source – is available for download from http://www.pacbiodevnet.com along with additional datasets. We have Python, Java, and R APIs for accessing the bas.h5 file programmatically and we are working on providing detailed tutorials in the near future. The bas.h5 files include both kinetic data for base incorporations and separate QV values for insertions, deletions, and mismatches.

    The reason behind providing FASTQ for the Assemblathon was to give users a lower barrier of entry for incorporating our data with current assembly tools. We look forward to engaging and enabling researchers in these types of projects and hope this additional information will prove useful in the Assemblathon!

    Regards,
    Lawrence Lee
    Senior Bioinformatics Scientist, Pacific Biosciences

    Original Email:
    =========================
    From: Lawrence Lee
    Sent: Wednesday, July 06, 2011 5:22 PM
    To: ‘Erich Jarvis’
    Subject: RE: [assemblathon] A basic analysis of Assemblathon 2 PacBio parrot reads

    Hi Erich,

    In the last table I provided on the README I broke down the accuracy/error mapping rates for a 2kb lambda library. It shows that we have about 85% single pass accuracy with the majority of the errors being insertions. QVs of 8-11 are consistent with this single-pass 85-88% accuracy. These errors are random in location and wash out quickly with coverage, and long reads with even lower single pass accuracy map extremely well to the reference.

    It is a bit unfortunate that existing 2nd generation tools do not work well with this error profile and the short-read/high-accuracy mindset takes a while to adjust to the characteristic of our reads. If there is anything you think would help your assemblathon participants make the most of our data, please let us know and we can try to put some more info out there.

    Thanks,
    Lawrence
    ==========================
