This is a very good question. I would suggest that both versions
of the old data are useful. If what is being done is simple validation
and regeneration of what was done before, then the lossy compression
should be fine in most instances. However, when what is being
done hinges on the really fine details -- looking for faint
spots just peeking out from the background, or looking at detailed
peak profiles -- then the losslessly compressed version is the
better choice. The annotation for both sets should be the same.
The difference is in storage and network bandwidth.

Hopefully the fraud issue will never again rear its ugly head,
but if it should, then having saved the losslessly compressed
images might prove to have been a good idea.

To facilitate experimentation with the idea, if there is agreement
on the particular lossy compression to be used, I would be happy
to add it as an option in CBFlib. Right now all the compressions
we have are lossless.
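In case it helps that discussion: in CBFlib the compression is simply a
flag supplied when the pixel array is stored, so a lossy scheme would
slot in as one more flag value. A rough sketch from memory (please check
cbf.h and the CBFlib manual for the exact signatures; the file, data
block, and array names here are only illustrative):

    /* Sketch: write one image array to a CBF file with a chosen
       compression flag.  CBF_BYTE_OFFSET, CBF_PACKED and CBF_CANONICAL
       are among the existing lossless options; a lossy compression
       would presumably appear as another constant here. */
    #include <stdio.h>
    #include "cbf.h"

    int write_image(const char *path, int *pixels, size_t npixels)
    {
        cbf_handle cbf;
        FILE *out = fopen(path, "wb");
        if (!out) return 1;

        cbf_make_handle(&cbf);
        cbf_new_datablock(cbf, "image_1");
        cbf_new_category(cbf, "array_data");
        cbf_new_column(cbf, "data");

        /* the compression choice is made here */
        cbf_set_integerarray(cbf, CBF_BYTE_OFFSET, 1 /* binary id */,
                             pixels, sizeof(int), 1 /* signed */, npixels);

        /* CBFlib takes over the stream; cbf_free_handle cleans up */
        cbf_write_file(cbf, out, 1, CBF, MSG_DIGEST | MIME_HEADERS, ENC_NONE);
        return cbf_free_handle(cbf);
    }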
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
y...@dowling.edu
=====================================================
On Mon, 7 Nov 2011, James Holton wrote:
At the risk of sounding like another "poll", I have a pragmatic question for
the methods development community:

Hypothetically, assume that there was a website where you could download the
original diffraction images corresponding to any given PDB file, including
"early" datasets from the same project that, because of smeary spots
or whatever, couldn't be solved. There might even be datasets with "unknown"
PDB IDs because that particular project never did work out, or because the
relevant protein sequence has been lost. Remember, few of these datasets
will be less than 5 years old if we try to allow enough time for the original
data collector to either solve it or graduate (and then cease to care). Even
for the "final" dataset, there will be a delay, since the half-life between
data collection and coordinate deposition in the PDB is still ~20 months.
Plenty of time to forget. So, although the images were archived (probably
named "test" and in a directory called "john"), it may be that the only way to
figure out which PDB ID is the "right answer" is by processing them and
comparing to all deposited Fs. Assume this was done. But there will always
be some datasets that don't match any PDB. Are those interesting? What
about ones that can't be processed? What about ones that can't even be
indexed? There may be a lot of those! (hypothetically, of course).

Anyway, assume that someone did go through all the trouble to make these
datasets "available" for download, just in case they are interesting, and
annotated them as much as possible. There will be about 20 datasets for any
given PDB ID.

Now assume that for each of these datasets this hypothetical website has two
links, one for the "raw data", which will average ~2 GB per wedge (after gzip
compression, taking at least ~45 min to download), and a second link for a
"lossy compressed" version, which is only ~100 MB/wedge (2 min download).
When decompressed, the images will visually look pretty much like the
originals, and generally give you very similar Rmerge, Rcryst, Rfree,
I/sigma, anomalous differences, and all other statistics when processed with
contemporary software. Perhaps a bit worse. Essentially, lossy compression
is equivalent to adding noise to the images.
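To put the "adding noise" claim on a more concrete footing, here is a toy
sketch -- deliberately not the scheme the hypothetical website would use,
just an illustration of the error budget: storing each pixel as
round(2*sqrt(counts)) in a single byte gives a round-trip error of roughly
0.3*sqrt(I) rms, comfortably below the Poisson counting error of sqrt(I),
so merging statistics are barely affected.

    /* Toy lossy compression: keep only a coarsely quantized square root
       of each pixel count.  A one-byte code covers counts up to ~16000;
       real schemes are more careful, this only shows the size of the
       error relative to photon-counting noise. */
    #include <math.h>
    #include <stdio.h>

    static unsigned char squash(int counts)      /* 32-bit count -> 1 byte */
    {
        return (unsigned char) (2.0 * sqrt((double) counts) + 0.5);
    }

    static int unsquash(unsigned char code)      /* approximate inverse */
    {
        double s = code / 2.0;
        return (int) (s * s + 0.5);
    }

    int main(void)
    {
        int I;
        for (I = 500; I <= 16000; I *= 2) {
            int r = unsquash(squash(I));
            printf("I=%5d  round trip=%5d  error=%4d  Poisson sigma=%6.1f\n",
                   I, r, r - I, sqrt((double) I));
        }
        return 0;
    }

One byte per pixel instead of four only buys ~4x by itself; getting to the
~20x figure above would take a scheme actually designed for the job, but
the error accounting works the same way.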
Which one would you try first? Does lossy compression make it easier to hunt
for "interesting" datasets? Or is it just too repugnant to have "modified"
the data in any way, shape, or form ... after the detector manufacturer's
software has "corrected" it? Would it suffice to simply supply a couple of
"example" images for download instead?

-James Holton
MAD Scientist