The longest reads yet: A look at PacBio’s contribution to the assemblathon

A quick summary for the impatient: A set of in total more than 4 million PacBio reads, up to 17kb in length, is available as part of the assemblathon. Most of the reads are short (peaking at 600-700 or 950 bases), but a significant fraction is very long. The average read quality metrics show no reads above Q11.

The assemblathon is a ‘competition’ to enhance genome assembly. After the first round earlier this year, which was solely based on simulated reads, now assemblathon 2 has started, solely based on real reads. Researchers are asked to come up with their best assemblies, which will be compared against eachother using several metrics.

Two of the assemblathon 2 genomes have only Illumina sequences, but for the third one, a parrot, there are reads available from Illumina, 454 and PacBio.

As we are about to get our PacBio RS instrument later this year, I decided to have a look at the data available for the parrot.

First some info from the readme file:

“Two libraries were created at 7.5kb and 13kb insert sizes. The 7.5kb library was sequenced at 45 and 90 minutes and the 13kb library was only sequenced at 90 minute movies. The raw reads are filtered on a ReadQuality metric split on the adapter region to generate ‘subreads’.”

I guess this is for those reads where the sequencing goes around the SMRTBell and reads the same molecule twice. It further states that reads are only included if they are 100 bp (the 7.5 kb insert runs) or 500 bp (the 13 kb insert run) long, and had an internally developed (it seems) read quality parameter of at least 0.75 ‘raw single pass accuracy.’

There were three files, two for the 7.5 kb inserts, with 45 and 90 minutes movie times, and one for the 13 kb insert size, with 90 minutes movie times. Read numebrs were 1.1 million, almost 0.5 million and 2.6 million respectively.

The readme states that the 13 kb insert library was their first attempt at this size, and that they saw ‘a much higher fraction of smaller inserts from the 13kb library than the 7.5kb library’. Also with PacBio sequencing, shorter fragments outcompete longer ones, so unfortunately they had a somewhat high amount of short reads.

“The mean raw readlengths (reads not split on adapters) for the 7.5kb and 13kb libraries are 2349bp and 3092bp respectively.”

I plotted the length distribution of the three runs:

The plot shows a peak at 600-700 bases for the 7.5 kb libraries, and around 950 for the 13 kb insert library. Beyond that, the read length slowly declines. The maximum read lengths are 7618, 16947 and 15548 bases, respectively. Yes, those are reads of almost 17 kb! (for 90 minute movies, that is).

The files are in fastq format, so I decided to run the FastQC program on these. This very useful program generates a report (actually, an html page) with a bunch of metrics. Here there was a surprise. The readme file states

“Filtered subreads are contained in a FASTQ files with Phred33 encoding.”

The per-sequence (mean) quality score distributions showed peaks at phred score 7 or 8, for example for the 13 kb insert library:

The quality is very even across the read, again showing the graph for the 13 kb insert run:

What is going on here? Are these quality scores really that low for these reads? An earlier report on the Haiti cholera PacBio dataset (with much shorter reads) showed how the shortest reads had much lower quality than the long ones. The same effect does not seem to be the case here. My own quick check of the average quality scores showed these  never get over Q11, as the graphs above also show. Alo, I could not find an obvious relation between read length and average quality. Either these reads do indeed have such low qualities, or the PacBio quality score calculations are not really comparable to those of other platforms.

I made the full fastqc reports available here:
7.5 kb insert library, 45 minute movie
7.5 kb insert library, 90 minute movie
13 kb insert library, 90 minute movie
(If someone can help me make sense of the kmer content plot of the 7.5 kb insert, 45 minute run, please let me know!)

So, in conclusion, this is the largest PacBio dataset availble online that I know of, and the read lengths look promising, although with a very large proportion of short reads. Well, short: these are peaking at Sanger read length, which is quite a feat already. I need some more background information on the quality scoring calculations to say anything meaningful on the low average scores for these reads, however.

Lack of a reference genome does not allow for precise mapping quality measurements, but once assemblies of the parrot data using the other platforms are available, one can start comparing these PacBio reads to them.

Code used

fastqc

fastqc --noextract PacBio_7.5kb_90min_100bp_0.75RQ.fastq

Read length distributions:

cat PacBio_13kb_90min_500bp_0.75RQ.fastq | awk 'NR%4==2 {cnt[length($0)]++}END{for (x in cnt){print x"\t"cnt[x]}}'|sort -n -k1,1 > PacBio_13kb_90min_500bp_length.tsv

R code:

dat1 = read.table("PacBio_7.5kb_45min_100bp_length.tsv", header=F, sep="\t")
dat2 = read.table("PacBio_7.5kb_90min_100bp_length.tsv", header=F, sep="\t")
dat3 = read.table("PacBio_13kb_90min_500bp_length.tsv", header=F, sep="\t")

main = "Read length distribution"
xlab = "Read length"
ylab = "Count"

peak1=dat1$V1[which(dat1$V2==max(dat1$V2))]
peak2=dat2$V1[which(dat2$V2==max(dat2$V2))]
peak3=dat3$V1[which(dat3$V2==max(dat3$V2))]

pdf(file="readlen.pdf")
plot(dat3$V1,dat3$V2,type="l", main=main, xlab=xlab, ylab=ylab,xlim=c(0,6000))
lines(dat2$V1,dat2$V2,col="blue")
lines(dat1$V1,dat1$V2,col="red")
text(2000,1700,pos=4,labels=paste("PacBio_13kb_90min_500bp, peak length: ",round(peak3)))
text(2000,1600,pos=4,labels=paste("PacBio_7.5kb_45min_100bp, peak length: ",round(peak1)), col="red")
text(2000,1500,pos=4,labels=paste("PacBio_7.5kb_90min_100bp, peak length: ",round(peak2)), col="blue")
dev.off()

5 thoughts on “The longest reads yet: A look at PacBio’s contribution to the assemblathon

  1. The FastQ format quality scores do a really crude job representing PacBio error, their hd5-encoded format can give probabilities of insertion and deletion error at each site in addition to substitution error. I hope that information becomes available for the Parrot dataset.

    • How am I to understand this, the probability that there should be an insertion at that position in addition to the reported base? Before the reported base or after?

Leave a comment