At the risk of sounding like another "poll", I have a pragmatic question
for the methods development community:
Hypothetically, assume there were a website where you could download
the original diffraction images corresponding to any given PDB file,
including "early" datasets that were from the same project, but because
of smeary spots or whatever, couldn't be solved. There might even be
datasets with "unknown" PDB IDs because that particular project never
did work out, or because the relevant protein sequence has been lost.
Remember, few of these datasets will be less than 5 years old if we try
to allow enough time for the original data collector to either solve them
or graduate (and then cease to care). Even for the "final" dataset,
there will be a delay, since the half-life between data collection and
coordinate deposition in the PDB is still ~20 months. Plenty of time to
forget. So, although the images were archived (probably named "test"
and in a directory called "john"), it may be that the only way to figure
out which PDB ID is the "right answer" is by processing them and
comparing to all deposited Fs. Assume this was done. But there will
always be some datasets that don't match any PDB. Are those
interesting? What about ones that can't be processed? What about ones
that can't even be indexed? There may be a lot of those!
(hypothetically, of course).
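(For concreteness, by "comparing to all deposited Fs" I mean something like the
little sketch below, looped over every PDB entry. The file names, the plain
"h k l F" text-dump format, and the 0.9 cutoff are all made up for
illustration; in real life you would also have to worry about reindexing
ambiguities, space groups, resolution cutoffs and scaling, but this is the
basic idea.)

    # Toy matcher: correlate amplitudes from a freshly processed dataset
    # against one set of deposited structure factors.  Both inputs are
    # assumed to be plain text files with "h k l F" on each line.
    import numpy as np

    def read_hkl(path):
        """Return a dict mapping (h, k, l) -> F."""
        data = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) < 4:
                    continue
                try:
                    h, k, l = (int(x) for x in fields[:3])
                    data[(h, k, l)] = float(fields[3])
                except ValueError:
                    continue  # skip header or junk lines
        return data

    def f_correlation(file_a, file_b):
        """Pearson correlation of F over the common Miller indices."""
        a = read_hkl(file_a)
        b = read_hkl(file_b)
        common = sorted(set(a) & set(b))
        if len(common) < 100:   # too little overlap to mean anything
            return 0.0
        fa = np.array([a[hkl] for hkl in common])
        fb = np.array([b[hkl] for hkl in common])
        return float(np.corrcoef(fa, fb)[0, 1])

    # Hypothetical usage: flag a match if the correlation is high enough.
    cc = f_correlation("john/test_processed.hkl", "pdb/1abc_deposited.hkl")
    print("CC =", cc, "-> match" if cc > 0.9 else "-> no match")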
Anyway, assume that someone did go through all the trouble to make these
datasets "available" for download, just in case they are interesting,
and annotated them as much as possible. There will be about 20 datasets
for any given PDB ID.
Now assume that for each of these datasets this hypothetical website has
two links, one for the "raw data", which will average ~2 GB per wedge
(after gzip compression, taking at least ~45 min to download), and a
second link for a "lossy compressed" version, which is only ~100
MB/wedge (2 min download). When decompressed, the images will visually
look pretty much like the originals, and generally give you very similar
Rmerge, Rcryst, Rfree, I/sigma, anomalous differences, and all other
statistics when processed with contemporary software. Perhaps a bit
worse. Essentially, lossy compression is equivalent to adding noise to
the images.
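(If "equivalent to adding noise" sounds hand-wavy, here is a toy illustration
of the idea: store each pixel on a square-root scale, so the precision you
throw away is precision that was already below the photon-counting noise.
This is emphatically not the actual compression algorithm, just a sketch of
why the processing statistics barely move.)

    # Toy "lossy" round trip: square-root encode pixel values to 8 bits,
    # then square on the way back.  Purely illustrative -- not the real
    # compression scheme.
    import numpy as np

    rng = np.random.default_rng(0)
    # fake detector-sized frame of Poisson-distributed counts
    true_counts = rng.poisson(lam=200.0, size=(2527, 2463))

    encoded = np.round(np.sqrt(true_counts)).astype(np.uint8)  # 1 byte/pixel before gzip
    decoded = encoded.astype(np.float64) ** 2                  # "decompressed" image

    rms_loss  = np.sqrt(np.mean((decoded - true_counts) ** 2))
    rms_noise = np.sqrt(np.mean(true_counts))                  # ~Poisson sigma
    print("rms error from compression:", rms_loss)
    print("rms photon noise          :", rms_noise)

On this fake frame the rounding error comes out below the Poisson sigma that
was already in the counts, which is why Rmerge and friends hardly notice.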
Which one would you try first? Does lossy compression make it easier to
hunt for "interesting" datasets? Or is it just too repugnant to have
"modified" the data in any way shape or form ... after the detector
manufacturer's software has "corrected" it? Would it suffice to simply
supply a couple of "example" images for download instead?
-James Holton
MAD Scientist