This is a very good question. I would suggest that both versions
of the old data are useful. If what is being done is simple validation
and regeneration of what was done before, then the lossy compression
should be fine in most instances. However, when what is being
done hinges on the really fine details -- looking for faint
spots just peeking out from the background, or looking at detailed
peak profiles -- then the losslessly compressed version is the
better choice. The annotation for both sets should be the same.
The difference is in storage and network bandwidth.

Hopefully the fraud issue will never again rear its ugly head,
but if it should, then having saved the losslessly compressed
images might prove to have been a good idea.

To facilitate experimentation with the idea, if there is agreement
on the particular lossy compression to be used, I would be happy
to add it as an option in CBFlib. Right now all the compressions
we have are lossless.
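In case it helps that discussion: in CBFlib the compression is simply a
flag supplied when the pixel array is stored, so a lossy scheme would
slot in as one more flag value. A rough sketch from memory (please check
cbf.h and the CBFlib manual for the exact signatures; the file, data
block, and array names here are only illustrative):

    /* Sketch: write one image array to a CBF file with a chosen
       compression flag.  CBF_BYTE_OFFSET, CBF_PACKED and CBF_CANONICAL
       are among the existing lossless options; a lossy compression
       would presumably appear as another constant here. */
    #include <stdio.h>
    #include "cbf.h"

    int write_image(const char *path, int *pixels, size_t npixels)
    {
        cbf_handle cbf;
        FILE *out = fopen(path, "wb");
        if (!out) return 1;

        cbf_make_handle(&cbf);
        cbf_new_datablock(cbf, "image_1");
        cbf_new_category(cbf, "array_data");
        cbf_new_column(cbf, "data");

        /* the compression choice is made here */
        cbf_set_integerarray(cbf, CBF_BYTE_OFFSET, 1 /* binary id */,
                             pixels, sizeof(int), 1 /* signed */, npixels);

        /* CBFlib takes over the stream; cbf_free_handle cleans up */
        cbf_write_file(cbf, out, 1, CBF, MSG_DIGEST | MIME_HEADERS, ENC_NONE);
        return cbf_free_handle(cbf);
    }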
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
y...@dowling.edu
=====================================================
On Mon, 7 Nov 2011, James Holton wrote:
At the risk of sounding like another "poll", I have a pragmatic question for
the methods development community:

Hypothetically, assume that there was a website where you could download the
original diffraction images corresponding to any given PDB file, including
"early" datasets from the same project that, because of smeary spots
or whatever, couldn't be solved. There might even be datasets with "unknown"
PDB IDs because that particular project never did work out, or because the
relevant protein sequence has been lost. Remember, few of these datasets
will be less than 5 years old if we try to allow enough time for the original
data collector to either solve it or graduate (and then cease to care). Even
for the "final" dataset, there will be a delay, since the half-life between
data collection and coordinate deposition in the PDB is still ~20 months.
Plenty of time to forget. So, although the images were archived (probably
named "test" and in a directory called "john"), it may be that the only way to
figure out which PDB ID is the "right answer" is by processing them and
comparing to all deposited Fs. Assume this was done. But there will always
be some datasets that don't match any PDB. Are those interesting? What
about ones that can't be processed? What about ones that can't even be
indexed? There may be a lot of those! (hypothetically, of course).

Anyway, assume that someone did go through all the trouble to make these
datasets "available" for download, just in case they are interesting, and
annotated them as much as possible. There will be about 20 datasets for any
given PDB ID.

Now assume that for each of these datasets this hypothetical website has two
links, one for the "raw data", which will average ~2 GB per wedge (after gzip
compression, taking at least ~45 min to download), and a second link for a
"lossy compressed" version, which is only ~100 MB/wedge (2 min download).
When decompressed, the images will visually look pretty much like the
originals, and generally give you very similar Rmerge, Rcryst, Rfree,
I/sigma, anomalous differences, and all other statistics when processed with
contemporary software. Perhaps a bit worse. Essentially, lossy compression
is equivalent to adding noise to the images.
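To put the "adding noise" claim on a more concrete footing, here is a toy
sketch -- deliberately not the scheme the hypothetical website would use,
just an illustration of the error budget: storing each pixel as
round(2*sqrt(counts)) in a single byte gives a round-trip error of roughly
0.3*sqrt(I) rms, comfortably below the Poisson counting error of sqrt(I),
so merging statistics are barely affected.

    /* Toy lossy compression: keep only a coarsely quantized square root
       of each pixel count.  A one-byte code covers counts up to ~16000;
       real schemes are more careful, this only shows the size of the
       error relative to photon-counting noise. */
    #include <math.h>
    #include <stdio.h>

    static unsigned char squash(int counts)      /* 32-bit count -> 1 byte */
    {
        return (unsigned char) (2.0 * sqrt((double) counts) + 0.5);
    }

    static int unsquash(unsigned char code)      /* approximate inverse */
    {
        double s = code / 2.0;
        return (int) (s * s + 0.5);
    }

    int main(void)
    {
        int I;
        for (I = 500; I <= 16000; I *= 2) {
            int r = unsquash(squash(I));
            printf("I=%5d  round trip=%5d  error=%4d  Poisson sigma=%6.1f\n",
                   I, r, r - I, sqrt((double) I));
        }
        return 0;
    }

One byte per pixel instead of four only buys ~4x by itself; getting to the
~20x figure above would take a scheme actually designed for the job, but
the error accounting works the same way.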
Which one would you try first? Does lossy compression make it easier to hunt
for "interesting" datasets? Or is it just too repugnant to have "modified"
the data in any way, shape, or form ... after the detector manufacturer's
software has "corrected" it? Would it suffice to simply supply a couple of
"example" images for download instead?

-James Holton
MAD Scientist