Two days ago, a paper appeared in Nature Scientific Data by Kristi Kim et al., titled “Long-read, whole-genome shotgun sequence data for five model organisms”. This paper describes the release of whole-genome PacBio data by Pacific Biosciences and others for five model organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster), generated using quite recent chemistries.
Beyond the datasets described in the paper, Pacific Biosciences also released whole-genome data for the human genome and, very recently, for Caenorhabditis elegans using the latest P6/C4 chemistry. Check out PacBio devnet, which also has data for other applications.
I think it is fantastic that Pacific Biosciences releases these datasets as a service to the community – and obviously to showcase their technology. Company-generated data often represents the best possible data, as it is produced by people with a great deal of experience with the technology. It remains to be seen whether ‘regular’ owners of PacBio RS II instruments can reach the same level of data quality. Nonetheless, these datasets are very helpful for teaching (see my previous blog post), for comparisons with other technologies (I wish I could make time to thoroughly compare PacBio data to the Moleculo data available for the same species), and for the development of new software applications.
I was a reviewer for the Nature Scientific Data paper. My full review report can be found on publons. Here I reprint the first paragraph.
I am happy to say that the authors addressed all the points that I raised in my review report in the final published version.
This Data Descriptor manuscript describes eight PacBio sequence datasets from five model organisms sequenced using the latest chemistries. The data is already available for the research community; the manuscript provides the necessary background for understanding how the data was generated and analysed. These sequencing datasets represent a highly valuable contribution to the community. Tools for working with the data from the PacBio RS are being developed at an increasing rate, and testing these tools requires high-quality data in combination with a (very close) reference genome sequence. The data described in this manuscript provide just that. I applaud Pacific Biosciences, and the authors and research groups involved, for releasing these valuable datasets without restrictions. Such releases greatly speed up research, greatly enable the development and testing of new software and applications, and are a fantastic tool for teaching purposes. I hope other companies in a similar position follow suit.
Continue reading at publons.