Hi James,

Regarding the suggestion of lossy compression, it is really hard to comment without having a good idea of the real cost of doing this. So, I have a suggestion:

- grab a bag of JCSG data sets, which we know should all be essentially OK.
- you squash then unsquash them with your macguffin, perhaps randomizing them as to whether A or B is squashed.
- process them with Elves / xia2 / autoPROC (something which is reproducible)
- pop the results into pdb_redo

Then compare what comes out. Ultimately adding "noise" may (or may not) make a measurable difference to the final refinement - this may be a way of telling if it does or doesn't.

Why however would I have any reason to worry? Because the noise being added is not really random - it will be compression artifacts. This could have a subtle effect on how the errors are estimated and so on. However you can hum and haw about this for a decade without reaching a conclusion.

Here, it's something which in all honesty we can actually evaluate, so is it worth giving it a go? If the results were / are persuasive (i.e. a "report on the use of lossy compression in transmission and storage of X-ray diffraction data" was actually read and endorsed by the community) this would make it much more worthwhile for consideration for inclusion in e.g. cbflib.

I would however always encourage (if possible) that the original raw data is kept somewhere on disk in an unmodified form - I am not a fan of one-way computational processes with unique data.

Thoughts anyone?

Cheerio,

Graeme

On 7 November 2011 17:30, James Holton <jmhol...@lbl.gov> wrote:
> At the risk of sounding like another "poll", I have a pragmatic question for
> the methods development community:
>
> Hypothetically, assume that there was a website where you could download the
> original diffraction images corresponding to any given PDB file, including
> "early" datasets that were from the same project, but because of smeary
> spots or whatever, couldn't be solved. There might even be datasets with
> "unknown" PDB IDs because that particular project never did work out, or
> because the relevant protein sequence has been lost. Remember, few of these
> datasets will be less than 5 years old if we try to allow enough time for
> the original data collector to either solve it or graduate (and then cease
> to care). Even for the "final" dataset, there will be a delay, since the
> half-life between data collection and coordinate deposition in the PDB is
> still ~20 months. Plenty of time to forget. So, although the images were
> archived (probably named "test" and in a directory called "john") it may be
> that the only way to figure out which PDB ID is the "right answer" is by
> processing them and comparing to all deposited Fs. Assume this was done.
> But there will always be some datasets that don't match any PDB. Are those
> interesting? What about ones that can't be processed? What about ones that
> can't even be indexed? There may be a lot of those! (hypothetically, of
> course).
>
> Anyway, assume that someone did go through all the trouble to make these
> datasets "available" for download, just in case they are interesting, and
> annotated them as much as possible. There will be about 20 datasets for any
> given PDB ID.
>
> Now assume that for each of these datasets this hypothetical website has two
> links, one for the "raw data", which will average ~2 GB per wedge (after
> gzip compression, taking at least ~45 min to download), and a second link
> for a "lossy compressed" version, which is only ~100 MB/wedge (2 min
> download). When decompressed, the images will visually look pretty much
> like the originals, and generally give you very similar Rmerge, Rcryst,
> Rfree, I/sigma, anomalous differences, and all other statistics when
> processed with contemporary software. Perhaps a bit worse. Essentially,
> lossy compression is equivalent to adding noise to the images.
>
> Which one would you try first? Does lossy compression make it easier to
> hunt for "interesting" datasets? Or is it just too repugnant to have
> "modified" the data in any way shape or form ... after the detector
> manufacturer's software has "corrected" it? Would it suffice to simply
> supply a couple of "example" images for download instead?
>
> -James Holton
> MAD Scientist
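
For the comparison suggested at the top of this message (squash/unsquash, randomize whether A or B is squashed, process, compare), the bookkeeping might look something like the rough Python sketch below. It covers only the randomization and the paired comparison. It does not call Elves, xia2, autoPROC, pdb_redo or any compressor, and every function name in it is hypothetical. It also assumes, purely for illustration, that each processed copy has been reduced to a single number (say a final Rfree), and it uses a sign-flip permutation test as one possible way to ask whether the squashed copies differ systematically from the originals.

```python
import random
from statistics import mean


def assign_labels(dataset_ids, seed=0):
    """For each data set, randomly pick which copy ('A' or 'B') is the squashed one.

    The returned key should be kept away from whoever does the processing.
    """
    rng = random.Random(seed)
    return {ds: rng.choice("AB") for ds in dataset_ids}


def paired_differences(results, key):
    """Return (squashed - original) for each data set.

    results: {dataset_id: {"A": value, "B": value}}, e.g. a final Rfree per copy
    key:     {dataset_id: "A" or "B"}, naming the squashed copy
    """
    diffs = []
    for ds, squashed in key.items():
        original = "B" if squashed == "A" else "A"
        diffs.append(results[ds][squashed] - results[ds][original])
    return diffs


def sign_flip_p_value(diffs, n_perm=10000, seed=0):
    """Two-sided paired permutation test on the mean difference.

    Under the null hypothesis (squashing changes nothing) the sign of each
    difference is arbitrary, so we flip signs at random and count how often
    the absolute mean is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return hits / n_perm
```

The idea would be to call `assign_labels` once before any processing, run both copies of every data set through the same reproducible pipeline, and only then unblind with `paired_differences`. A small p-value from `sign_flip_p_value` would suggest the compression artifacts are doing something systematic rather than behaving like harmless noise.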