At the risk of sounding like another "poll", I have a pragmatic question
for the methods development community:
Hypothetically, assume there were a website where you could download
the original diffraction images corresponding to any given PDB file,
including "early" datasets that were from the same project, but because
of smeary spots or whatever, couldn't be solved. There might even be
datasets with "unknown" PDB IDs because that particular project never
did work out, or because the relevant protein sequence has been lost.
Remember, few of these datasets will be less than 5 years old if we try
to allow enough time for the original data collector to either solve them
or graduate (and then cease to care). Even for the "final" dataset,
there will be a delay, since the half-life between data collection and
coordinate deposition in the PDB is still ~20 months. Plenty of time to
forget. So, although the images were archived (probably named "test"
and in a directory called "john"), it may be that the only way to figure
out which PDB ID is the "right answer" is by processing them and
comparing to all deposited Fs. Assume this was done. But there will
always be some datasets that don't match any PDB. Are those
interesting? What about ones that can't be processed? What about ones
that can't even be indexed? There may be a lot of those!
(hypothetically, of course).
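(For concreteness, by "comparing to all deposited Fs" I mean something like the
little sketch below, looped over every PDB entry. The file names, the plain
"h k l F" text-dump format, and the 0.9 cutoff are all made up for
illustration; in real life you would also have to worry about reindexing
ambiguities, space groups, resolution cutoffs and scaling, but this is the
basic idea.)

    # Toy matcher: correlate amplitudes from a freshly processed dataset
    # against one set of deposited structure factors.  Both inputs are
    # assumed to be plain text files with "h k l F" on each line.
    import numpy as np

    def read_hkl(path):
        """Return a dict mapping (h, k, l) -> F."""
        data = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 4:
                    continue
                try:
                    h, k, l = (int(x) for x in fields[:3])
                    data[(h, k, l)] = float(fields[3])
                except ValueError:
                    continue  # skip header or junk lines
        return data

    def f_correlation(file_a, file_b):
        """Pearson correlation of F over the common Miller indices."""
        a = read_hkl(file_a)
        b = read_hkl(file_b)
        common = sorted(set(a) & set(b))
        if len(common) < 100:   # too little overlap to mean anything
            return 0.0
        fa = np.array([a[hkl] for hkl in common])
        fb = np.array([b[hkl] for hkl in common])
        return float(np.corrcoef(fa, fb)[0, 1])

    # Hypothetical usage: flag a match if the correlation is high enough.
    cc = f_correlation("john/test_processed.hkl", "pdb/1abc_deposited.hkl")
    print("CC =", cc, "-> match" if cc > 0.9 else "-> no match")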
Anyway, assume that someone did go through all the trouble to make these
datasets "available" for download, just in case they are interesting,
and annotated them as much as possible. There will be about 20 datasets
for any given PDB ID.
Now assume that for each of these datasets this hypothetical website has
two links, one for the "raw data", which will average ~2 GB per wedge
(after gzip compression, taking at least ~45 min to download), and a
second link for a "lossy compressed" version, which is only ~100
MB/wedge (2 min download). When decompressed, the images will visually
look pretty much like the originals, and generally give you very similar
Rmerge, Rcryst, Rfree, I/sigma, anomalous differences, and all other
statistics when processed with contemporary software. Perhaps a bit
worse. Essentially, lossy compression is equivalent to adding noise to
the images.
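(If "equivalent to adding noise" sounds hand-wavy, here is a toy illustration
of the idea: store each pixel on a square-root scale, so the precision you
throw away is precision that was already below the photon-counting noise.
This is emphatically not the actual compression algorithm, just a sketch of
why the processing statistics barely move.)

    # Toy "lossy" round trip: square-root encode pixel values to 8 bits,
    # then square on the way back.  Purely illustrative -- not the real
    # compression scheme.
    import numpy as np

    rng = np.random.default_rng(0)
    # fake detector-sized frame of Poisson-distributed counts
    true_counts = rng.poisson(lam=200.0, size=(2527, 2463))

    encoded = np.round(np.sqrt(true_counts)).astype(np.uint8)  # 1 byte/pixel before gzip
    decoded = encoded.astype(np.float64) ** 2                  # "decompressed" image

    rms_loss  = np.sqrt(np.mean((decoded - true_counts) ** 2))
    rms_noise = np.sqrt(np.mean(true_counts))                  # ~Poisson sigma
    print("rms error from compression:", rms_loss)
    print("rms photon noise          :", rms_noise)

On this fake frame the rounding error comes out below the Poisson sigma that
was already in the counts, which is why Rmerge and friends hardly notice.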
Which one would you try first? Does lossy compression make it easier to
hunt for "interesting" datasets? Or is it just too repugnant to have
"modified" the data in any way shape or form ... after the detector
manufacturer's software has "corrected" it? Would it suffice to simply
supply a couple of "example" images for download instead?
-James Holton
MAD Scientist