Hi James,

I see no real need for lossily compressed datasets. They may be useful for demonstration purposes, or for following synchrotron data collection remotely, but for processing I need the real data. In my experience, structure solution, at least in the difficult cases, depends on squeezing every bit of scattering information out of the data, as far as the given software allows. Using a lossily compressed dataset in that situation would leave me with the feeling "if structure solution does not work out, I'll have to re-do everything with the original data" - and that would be double work. Better not to start down that route.

CBF byte-offset compression packs even a 20-bit detector pixel into a single byte, on average. In the case of Pilatus fine-slicing frames, these can be compressed further with bzip2, almost down to the entropy of the data (since there are so many zero pixels) - and all of that is lossless.
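
Just to make that arithmetic concrete, here is a minimal Python sketch of the byte-offset idea (illustrative only, not the actual CBFlib encoder; the toy frame and numbers are made up): pixel-to-pixel differences that fit in one signed byte take one byte, only the rare large jumps escape to wider fields, so a mostly-zero fine-sliced frame comes out near one byte per pixel, and a further bzip2 pass stays lossless.

import bz2
import struct

def byte_offset_compress(pixels):
    """Sketch of CBF-style byte-offset compression (illustrative, not CBFlib).

    Each pixel is stored as its difference to the previous pixel. Small
    deltas fit in one signed byte; larger ones escape to 16- or 32-bit
    fields. The encoding is exactly reversible, i.e. lossless.
    """
    out = bytearray()
    last = 0
    for p in pixels:
        delta, last = p - last, p
        if -127 <= delta <= 127:
            out += struct.pack("<b", delta)                  # 1-byte delta
        elif -32767 <= delta <= 32767:
            out += struct.pack("<bh", -128, delta)           # escape to 16 bit
        else:
            out += struct.pack("<bhi", -128, -32768, delta)  # escape to 32 bit
    return bytes(out)

# Toy "frame": 100,000 pixels of zero background plus a few strong spots.
frame = [0] * 100_000
frame[500], frame[40_000], frame[70_000] = 150_000, 80_000, 300
packed = byte_offset_compress(frame)
print(len(packed), "bytes for", len(frame), "pixels")            # ~1 byte/pixel
print(len(bz2.compress(packed)), "bytes after a further lossless bzip2 pass")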

Storing lossily compressed datasets alongside the originals would of course not double the disk space needed, but it would significantly raise the administrative burden.

Just to make my standpoint in this whole discussion about storage of raw data clear: I have been storing our synchrotron datasets on disk since 1999. The amount of money we spend per year for this purpose is constant (less than 1000 €), because the price per gigabyte of disk space drops faster than the amount of data per synchrotron trip rises. So when the current storage fills up (about every three years), we set up a bigger RAID (plus a backup RAID); after copying over, the old data always occupies only a fraction of the space on the new one.

So I think storage cost is not the real issue - the real issue has a strong psychological component. People a) may not realize that the software they use is constantly being improved, and that this improvement needs data covering all the corner cases; and b) often do not wish to give away something they feel might help their competitors, or expose their own faults.

best,

Kay (XDS co-developer)




-------- Original Message --------
Date: Mon, 7 Nov 2011 09:30:11 -0800
From: James Holton <jmhol...@lbl.gov>
Subject: image compression

At the risk of sounding like another "poll", I have a pragmatic question
for the methods development community:

Hypothetically, assume that there was a website where you could download
the original diffraction images corresponding to any given PDB file,
including "early" datasets that were from the same project, but because
of smeary spots or whatever, couldn't be solved.  There might even be
datasets with "unknown" PDB IDs because that particular project never
did work out, or because the relevant protein sequence has been lost.
Remember, few of these datasets will be less than 5 years old if we try
to allow enough time for the original data collector to either solve it
or graduate (and then cease to care).  Even for the "final" dataset,
there will be a delay, since the half-life between data collection and
coordinate deposition in the PDB is still ~20 months.  Plenty of time to
forget.  So, although the images were archived (probably named "test"
and in a directory called "john"), it may be that the only way to
figure out which PDB ID is the "right answer" is by processing them and
comparing to all deposited Fs.  Assume this was done.  But there will
always be some datasets that don't match any PDB.  Are those
interesting?  What about ones that can't be processed?  What about ones
that can't even be indexed?  There may be a lot of those!
(hypothetically, of course).

Anyway, assume that someone did go through all the trouble to make these
datasets "available" for download, just in case they are interesting,
and annotated them as much as possible.  There will be about 20 datasets
for any given PDB ID.

Now assume that for each of these datasets this hypothetical website has
two links, one for the "raw data", which will average ~2 GB per wedge
(after gzip compression, taking at least ~45 min to download), and a
second link for a "lossy compressed" version, which is only ~100
MB/wedge (2 min download).  When decompressed, the images will visually
look pretty much like the originals, and generally give you very similar
Rmerge, Rcryst, Rfree, I/sigma, anomalous differences, and all other
statistics when processed with contemporary software.  Perhaps a bit
worse.  Essentially, lossy compression is equivalent to adding noise to
the images.
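
As a toy illustration of that equivalence (using a hypothetical square-root quantization scheme, chosen only because its step size tracks the Poisson error; it is not the actual compressor being proposed here):

import numpy as np

# Hypothetical lossy step: quantize each pixel on a square-root scale, so the
# quantization step stays well below the Poisson error sqrt(I), then invert.
# The residuals behave like a small amount of extra noise on top of counting
# statistics.
rng = np.random.default_rng(0)
true_counts = rng.poisson(lam=50.0, size=100_000)    # simulated pixel counts

quantized = np.round(2.0 * np.sqrt(true_counts)).astype(np.uint8)   # ~1 byte/pixel
reconstructed = (quantized / 2.0) ** 2

lossy_rms = (reconstructed - true_counts).std()
poisson_rms = np.sqrt(true_counts.mean())
print(f"rms error introduced by the lossy step: {lossy_rms:.2f} counts")
print(f"Poisson error at this signal level:     {poisson_rms:.2f} counts")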

Which one would you try first?  Does lossy compression make it easier to
hunt for "interesting" datasets?  Or is it just too repugnant to have
"modified" the data in any way shape or form ... after the detector
manufacturer's software has "corrected" it?  Would it suffice to simply
supply a couple of "example" images for download instead?

-James Holton
MAD Scientist
