Reluctantly, I am going to add my 2 cents to the discussion, covering various aspects in one e-mail.
- It is easy to overlook that our "business" is to answer biological/biochemical questions. That is what you (generally) get grants to do, and showing that these questions are of critical importance is what determines your ability to do science at all. Crystallography is one tool that we use to acquire evidence to answer questions. The time when you could get a Nobel prize, or even a PhD, just for doing a structure is gone. Even publishing a paper with just a structure is not as common as it used to be. So the "biochemistry" drives the crystallography. It is not reasonable to say that if you have collected data and have not published it after 5 years, you are no longer interested; what that generally means is that "the rest of the science" is not cooperating. In short: I would be against a strict rule for mandatory deposition of raw data, even after a long time. An example: I have low-resolution data sets here (~10 Å), presumably of proteins whose structures are known for prokaryotes but not for eukaryotes, and it would be exciting if we could prove (or disprove) that they look the same. The problem, apart from resolution, is that the spots are so few and fuzzy that I cannot index the images. The main reason I save the images is that if/when someone comes to me saying they think they have made better crystals, we have something to compare against. (Thanks to Gerard B. for encouragement to write this item :-)

- For those who think that we have come to the end of development in crystallography, James Holton (thank you) has described nicely why we should not think this. We are all happy if our model gives an R-factor of 20%. Even small-molecule crystallographers would dismiss that in an instant as inadequate. However, "everybody" has come to accept that this is fine for protein crystallography. It would be better if our models were more consistent with the experimental data, and how could we build such models without access to lots of data? As a student I was always taught (when asking why 20% is actually "good") that we don't, for example, model solvent. Why not? It is not easy. If we did, would the 20% go down to 3%? I am guessing not; there are other errors that come into play.

- Gerard K. has eloquently spoken about cost and effort. Since I maintain a small (local) archive of images, I can affirm his words: a large-capacity disk is inexpensive ($100). A box for the disk to sit in is inexpensive ($1000). A second box that sits in a different building (away for security reasons) and holds the backup is inexpensive ($1400, with 4 disks). The infrastructure to run these boxes (power, fiber optics, boxes in between) is slightly more expensive. What is *really* expensive is the people maintaining everything. It was a huge surprise to me (and my boss) how much time and effort it takes to annotate all data sets, rename them appropriately and file them away in a logical place so that anyone (who understands the scheme) can find them again. Therefore (!) the reason this should be centralized is that the cost per data set stored goes down: it is more efficient, and one person can process several (many, if largely automated) data sets per day. It is also of interest that we locally (2-5 people on a project) may not agree on what exactly should be stored. There is therefore no hope that we can find consensus in the world, but we CAN reach a reasonable compromise.
But it is tough: I have heard the argument that data for published structures should be kept in case someone wants to look at them and/or go back to them, and I have also heard the argument that once published the work is signed, sealed and delivered and the data can go, whereas UNpublished data should be preserved because it will hopefully get to publication eventually. Each argument is reasonably sensible, but the conclusions are opposite. (I maintain both classes of data sets.)

- Granting agencies in the US generally require that you archive scientific data. What is not yet clear is whether they would be willing to pay for a centralized facility that would do that. After all, it is more exciting for the NIH to give money for the study of a disease than for the storage of data. But if the argument were made that each grant(ee) would be more efficient and could apply more money towards the actual problem, this might convince them. For that we would need a reasonable consensus on what we want and why. More power to John H. and "The Committee".

Thanks to the complete "silence" on the BB today I am finally caught up reading!

Mark van der Woerd

-----Original Message-----
From: James Holton <jmhol...@lbl.gov>
To: CCP4BB <CCP4BB@JISCMAIL.AC.UK>
Sent: Tue, Nov 1, 2011 11:07 am
Subject: Re: [ccp4bb] Archiving Images for PDB Depositions

On general scientific principles, the reasons for archiving "raw data" all boil down to one thing: there was a systematic error, and you hope to one day account for it. After all, a "systematic error" is just something you haven't modeled yet. Is it worth modelling? That depends... There are two main kinds of systematic error in MX:

1) Fobs vs Fcalc

Given that the reproducibility of Fobs is typically < 3%, but typical R/Rfree values are in the 20%s, it is safe to say that this is a rather whopping systematic error. What causes it? Dunno. Would structural biologists benefit from being able to model it? Oh yes! Imagine being able to reliably see a ligand that has an occupancy of only 0.05, or to unambiguously distinguish between two proposed reaction mechanisms and back up your claims with hard-core statistics (derived from SIGF). Perhaps even teasing apart all the different minor conformers occupied by the molecule in its functional cycle? I think this is the main reason why we all decided to archive Fobs: 20% error is a lot.

2) scale factors

We throw a lot of things into "scale factors", including sample absorption, shutter timing errors, radiation damage, flicker in the incident beam, vibrating crystals, phosphor thickness, point-spread variations, and many other phenomena. Do we understand the physics behind them? Yes (mostly). Is there "new biology" to be had by modelling them more accurately? No. Unless, of course, you count all the structures we have not solved yet. Wouldn't it be nice if phasing from sulfur, phosphorus, chloride and other "native" elements actually worked? You wouldn't have to grow SeMet protein anymore, and you could go after systems that don't express well in E. coli. Perhaps even going to the native source! I think there is plenty of "new biology" to be had there. Wouldn't it be nice if you could do S-SAD even though your spots were all smeary and overlapped and mosaic and radiation-damaged? Why don't we do this now? Simple!: it doesn't work. Why doesn't it work? Because we don't know all the "scale factors" accurately enough.
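For reference, the standard (textbook) definitions behind the percentages quoted here, together with the pure counting-statistics (Poisson) limit, are:

R = \frac{\sum_{hkl} \bigl| |F_{obs}| - |F_{calc}| \bigr|}{\sum_{hkl} |F_{obs}|},
\qquad
R_{merge} = \frac{\sum_{hkl} \sum_i \bigl| I_i(hkl) - \langle I(hkl) \rangle \bigr|}{\sum_{hkl} \sum_i I_i(hkl)},
\qquad
\left.\frac{\sigma(I)}{I}\right|_{Poisson} \approx \frac{1}{\sqrt{N_{photons}}}

where the I_i(hkl) are the individual measurements of a reflection, <I(hkl)> is their mean, and N_photons is the number of photons recorded; with ~10^6 photons per merged reflection the counting-limited error is already down at the 0.1% level.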
In most cases, the "% error" from all the scale factors adds up to ~3% (aka Rmerge, Rpim, etc.), but the change in spot intensities due to native-element anomalous scattering is usually less than 1%. Currently, the world record for the smallest Bijvoet ratio is ~0.5% (Wang et al. 2006), but if photon counting were the only source of error, we should be able to get Rmerge of ~0.1% or less, particularly in the low-angle resolution bins. If we can do that, then there will be little need for SeMet anymore. But we need the "raw" images if we are to have any hope of figuring out how to get the errors down to the 0.1% level. There is no one magic dataset that will tell us how to do this; we need to "average over" lots of them. Yes, this is further "upstream" of the "new biology" than deposited Fs, and yes the cost of archiving images is higher, but I think the potential benefits to the structural biology community, if we can crack the 0.1% S-SAD barrier, are nothing short of revolutionary.

-James Holton
MAD Scientist

On 11/1/2011 8:32 AM, Anastassis Perrakis wrote:
> Dear Gerard
>
> Isolating your main points:
>
>> but there would have been no PDB-REDO because the
>> data for running it would simply not have been available! ;-) . Or do you
>> think the parallel does not apply?
> ...
>> have thought, some value. From the perspective of your message, then, why
>> are the benefits of PDB-REDO so unique that PDB-REPROCESS would have no
>> chance of measuring up to them?
>
> I was thinking of the inconsistency while sending my previous email ... ;-)
>
> Basically, the parallel does apply. PDB-REPROCESS in a few years would
> be really fantastic - speaking as a crystallographer and methods developer.
>
> Speaking as a structural biologist though, I did think long and hard about
> the usefulness of PDB_REDO. I obviously decided it is useful, since I am now
> heavily involved in it, for a few reasons: uniformity of final model treatment,
> improving refinement software, better statistics on structure quality metrics,
> and of course seeing if the new models will change our understanding of
> the biology of the system.
>
> An experiment that I would like to do as a structural biologist is the following:
> what about adding an "increasing noise" model to the Fobs's of a few
> datasets and re-refining? How much would that noise change the final model
> quality metrics, and in absolute terms?
>
> (for the changes that PDB_RE(BUILD) does, have a preview at
> http://www.ncbi.nlm.nih.gov/pubmed/22034521
> .... I tried to avoid the shamelessly self-promoting plug, but could not
> resist at the end!)
>
> That experiment - or a better-designed variant of it - would maybe
> tell us if we should be advocating the archiving of all images, and,
> being scientifically convinced of the importance of that beyond
> methods development, we would all argue a strong case
> to the funding and hosting agencies.
>
> Tassos
>
> PS Of course, that does not negate the all-important argument that,
> when struggling with marginal data, better processing software is essential.
> There is a clear need for better software to process images, especially
> for low-resolution and low signal/noise cases.
> Since that is dependent on having test data, I am all for supporting an
> initiative to collect such data, and I would gladly spend a day digging
> through our archives to contribute.
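Tassos's noise experiment above is straightforward to prototype. Below is a minimal numpy sketch of the idea, under illustrative assumptions: the arrays fobs/sigf are taken to be already extracted from an MTZ file (the file I/O, e.g. via the usual CCP4 tools, is omitted), the noise model is a simple Gaussian perturbation scaled by SIGF, and negative amplitudes are crudely clipped to zero.

import numpy as np

def add_noise_to_fobs(fobs, sigf, noise_scale, seed=0):
    """Return a copy of Fobs perturbed by Gaussian noise proportional to SIGF.

    fobs, sigf  : 1-D arrays of observed amplitudes and their sigmas
    noise_scale : multiplier on SIGF controlling how much noise is injected
    """
    rng = np.random.default_rng(seed)
    noisy = fobs + rng.normal(0.0, noise_scale * sigf)
    # Amplitudes cannot be negative; clipping is a crude but serviceable choice here.
    return np.clip(noisy, 0.0, None)

# Toy example: perturb a few amplitudes at 0.5x, 1x and 2x the experimental sigma.
# Each perturbed set would then be written back out and re-refined, and the
# resulting R/Rfree and model-quality metrics compared with the originals.
fobs = np.array([1234.5, 567.8, 89.1])
sigf = np.array([12.3, 8.9, 4.5])
for k in (0.5, 1.0, 2.0):
    print(k, add_noise_to_fobs(fobs, sigf, k))

Keeping the injected noise proportional to SIGF preserves the relative weighting of weak and strong reflections, which is what makes comparing quality metrics across noise levels meaningful.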