Reluctantly, I am going to add my 2 cents to the discussion, covering various 
aspects in one e-mail.

- It is easy to overlook that our "business" is to answer 
biological/biochemical questions. This is what you (generally) get grants to 
do (which shows that these questions are of critical importance to your ability 
to do science). Crystallography is one tool that we use to acquire evidence to 
answer questions. The time when you could get a Nobel prize, or a PhD, just for 
doing a structure is gone. Even a publication with just a structure is not as 
common as it used to be. So the "biochemistry" drives crystallography. It is 
not reasonable to say that if you have collected data and not published it 
within 5 years, you are no longer interested. What that generally means is that 
"the rest of science" is not cooperating. In short: I would be against a strict 
rule for mandatory deposition of raw data, even after a long time. An example: I 
have low-resolution (~10 A) data sets here, presumably of proteins whose 
structures are known for prokaryotes but not for eukaryotes, and it would be 
exciting if we could prove (or disprove) that they look the same. The problem, 
apart from resolution, is that the spots are so few and fuzzy that I cannot 
index the images. The main reason I save the images is that if/when someone 
comes to me saying they think they have made better crystals, we have 
something to compare against. (Thanks to Gerard B. for the encouragement to 
write this item :-)

- For those who think that we have come to the end of development in 
crystallography, James Holton (thank you) has described nicely why we should 
not think so. We are all happy if our model gives an R-factor of 20%. 
Even small-molecule crystallographers would wave that away in an instant as 
inadequate. However, "everybody" has come to accept that this is fine for 
protein crystallography. It would be better if our models were more consistent 
with the experimental data. How could we make such models without access to 
lots of data? As a student I was always taught (when I asked why 20% is 
actually "good") that we don't (for example) model the solvent. Why not? It is 
not easy. If we did, would the 20% go down to 3%? I am guessing not; there are 
other errors that come into play. 
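To make concrete what that 20% measures, here is a minimal sketch (with made-up 
numbers, not from any real data set) of how the conventional R-factor compares 
observed and model-calculated structure-factor amplitudes:

# minimal R-factor sketch; the amplitudes below are invented for illustration
import numpy as np

f_obs  = np.array([812.0, 455.3, 120.7,  98.2, 64.5])   # |Fobs| (made-up values)
f_calc = np.array([650.0, 520.0,  95.0, 120.0, 48.0])   # |Fcalc| from a hypothetical model
# (real refinement also applies an overall scale to Fcalc; omitted here)

r = np.sum(np.abs(f_obs - f_calc)) / np.sum(f_obs)
print(f"R = {r:.1%}")   # prints R = 18.7% for these numbers; ~20% is typical for proteins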

- Gerard K. has eloquently spoken about cost and effort. Since I maintain a 
small (local) archive of images, I can affirm his words: a large-capacity disk 
is inexpensive ($100). A box for the disk to sit in is inexpensive ($1000). A 
second box that holds the backup and sits in a different building (away for 
security reasons) is inexpensive ($1400, with 4 disks). The infrastructure to 
run these boxes (power, fiber optics, boxes in between) is slightly more 
expensive. What is *really* expensive is the people maintaining everything. It 
was a huge surprise to me (and my boss) how much time and effort it takes to 
annotate all data sets, rename them appropriately and file them away in a 
logical place so that anyone (who understands the scheme) can find them again. 
Therefore (!) the reason this should be centralized is that the cost per 
data set stored goes down: it is more efficient. One person can process 
several (many, if largely automated) data sets per day. It is also of interest 
that we locally (2-5 people on a project) may not agree on what exactly should 
be stored. There is therefore no hope of finding consensus across the world, 
but we CAN reach a reasonable compromise. It is tough, though: I have heard the 
argument that data for published structures should be kept in case someone 
wants to look at them or go back, and I have also heard the argument that once 
published the data are signed, sealed and delivered and can go, whereas 
UNpublished data should be preserved because they will hopefully lead to 
publication eventually. Each argument is reasonably sensible, but the 
conclusions are opposite. (I maintain both classes of data sets.)

- Granting agencies in the US generally require that you archive scientific 
data. What is not yet clear is whether they would be willing to pay for a 
centralized facility that would do that. After all, it is more exciting for the 
NIH to fund the study of a disease than to fund data storage. But if the 
argument were made that each grant(ee) would be more efficient and could apply 
more money towards the actual problem, this might convince them. For that we 
would need a reasonable consensus on what we want and why. More power to John H. 
and "The Committee".

Thanks to complete "silence" on the BB today I am finally caught up reading!

Mark van der Woerd

-----Original Message-----
From: James Holton <jmhol...@lbl.gov>
To: CCP4BB <CCP4BB@JISCMAIL.AC.UK>
Sent: Tue, Nov 1, 2011 11:07 am
Subject: Re: [ccp4bb] Archiving Images for PDB Depositions


On general scientific principles the reasons for archiving "raw data" 
all boil down to one thing: there was a systematic error, and you hope 
to one day account for it.  After all, a "systematic error" is just 
something you haven't modeled yet.  Is it worth modelling?  That depends...

There are two main kinds of systematic error in MX:
1) Fobs vs Fcalc
     Given that the reproducibility of Fobs is typically < 3%, but 
typical R/Rfree values are in the 20%s, it is safe to say that this is a 
rather whopping systematic error.  What causes it?  Dunno.  Would 
structural biologists benefit from being able to model it?  Oh yes!  
Imagine being able to reliably see a ligand that has an occupancy of 
only 0.05, or to be able to unambiguously distinguish between two 
proposed reaction mechanisms and back up your claims with hard-core 
statistics (derived from SIGF).  Perhaps even teasing apart all the 
different minor conformers occupied by the molecule in its functional 
cycle?  I think this is the main reason why we all decided to archive 
Fobs: 20% error is a lot.

2) scale factors
     We throw a lot of things into "scale factors", including sample 
absorption, shutter timing errors, radiation damage, flicker in the 
incident beam, vibrating crystals, phosphor thickness, point-spread 
variations, and many other phenomena.  Do we understand the physics 
behind them?  Yes (mostly).  Is there "new biology" to be had by 
modelling them more accurately?  No.  Unless, of course, you count all 
the structures we have not solved yet.
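As a toy illustration (far simpler than what a real scaling program such as 
Aimless or XSCALE does), here is what fitting one multiplicative scale per 
image looks like; whatever that single number cannot absorb is left behind 
as "systematic error":

# toy per-image scaling sketch with simulated data (not a real pipeline)
import numpy as np

rng = np.random.default_rng(0)
true_I = rng.gamma(2.0, 500.0, size=200)        # made-up "true" reflection intensities
hidden = np.array([1.00, 0.93, 0.85, 0.78])     # hidden per-image scales (e.g. beam decay)

# each image records the same reflections, scaled and with Poisson counting noise
obs = np.array([rng.poisson(s * true_I) for s in hidden], dtype=float)

ref = obs[0]                                    # take image 0 as the reference
for j, I_j in enumerate(obs):
    k = np.sum(I_j * ref) / np.sum(I_j * I_j)   # least-squares k minimizing ||k*I_j - ref||^2
    print(f"image {j}: fitted scale {k:.3f}  (expected ~ {hidden[0] / hidden[j]:.3f})")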

Wouldn't it be nice if phasing from sulfur, phosphorus, chloride and 
other "native" elements actually worked?  You wouldn't have to grow 
SeMet protein anymore, and you could go after systems that don't express 
well in E. coli.  Perhaps even going to the native source!  I think 
there is plenty of "new biology" to be had there.  Wouldn't it be nice 
if you could do S-SAD even though your spots were all smeary and 
overlapped and mosaic and radiation damaged?

   Why don't we do this now?  Simple: it doesn't work.  Why doesn't it 
work?  Because we don't know all the "scale factors" accurately enough.  
In most cases, the "% error" from all the scale factors adds up 
to ~3% (aka Rmerge, Rpim etc.), but the change in spot intensities due 
to native-element anomalous scattering is usually less than 1%.  
Currently, the world record for smallest Bijvoet ratio is ~0.5% (Wang et 
al. 2006), but if photon-counting were the only source of error, we 
should be able to get Rmerge of ~0.1% or less, particularly in the 
low-angle resolution bins.  If we can do that, then there will be little 
need for SeMet anymore.
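A back-of-envelope toy calculation (simulated counts, assuming Poisson noise 
is the only error source) shows what that photon-counting limit looks like:

# Rmerge from counting statistics alone: roughly sqrt(2/pi)/sqrt(N) for N
# photons per observation, so ~0.1% needs on the order of 6e5 photons/spot
import numpy as np

rng = np.random.default_rng(1)
for n_photons in (1e3, 1e4, 6.4e5, 1e7):
    # 1000 reflections, each observed 20 times, pure Poisson counting noise
    obs = rng.poisson(n_photons, size=(1000, 20)).astype(float)
    mean = obs.mean(axis=1, keepdims=True)
    rmerge = np.abs(obs - mean).sum() / obs.sum()
    print(f"{n_photons:9.0f} photons/spot: Rmerge ~ {rmerge:.3%} "
          f"(analytic ~ {np.sqrt(2 / np.pi) / np.sqrt(n_photons):.3%})")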

But, we need the "raw" images if we are to have any hope of figuring out 
how to get the errors down to the 0.1% level.  There is no one magic 
dataset that will tell us how to do this; we need to "average over" lots 
of them.  Yes, this is further "upstream" of the "new biology" than 
deposited Fs, and yes the cost of archiving images is higher, but I 
think the potential benefits to the structural biology community if we 
can crack the 0.1% S-SAD barrier are nothing short of revolutionary.

-James Holton
MAD Scientist

On 11/1/2011 8:32 AM, Anastassis Perrakis wrote:
> Dear Gerard
>
> Isolating your main points:
>
>> but there would have been no PDB-REDO because the
>> data for running it would simply not have been available! ;-) . Or do 
>> you
>> think the parallel does not apply?
> ...
>> have thought, some value. From the perspective of your message, then, 
>> why
>> are the benefits of PDB-REDO so unique that PDB-REPROCESS would have no
>> chance of measuring up to them?
>
> I was thinking of the inconsistency while sending my previous email 
> ... ;-)
>
> Basically, the parallel does apply. PDB-REPROCESS in a few years would
> be really fantastic - speaking as a crystallographer and methods 
> developer.
>
> Speaking as a structural biologist though, I did think long and hard 
> about
> the usefulness of PDB_REDO. I obviously decided it's useful since I am now
> heavily involved in it for a few reasons, like uniformity of final 
> model treatment,
> improving refinement software, better statistics on structure quality 
> metrics,
> and of course seeing if the new models will change our understanding of
> the biology of the system.
>
> An experiment that I would like to do as a structural biologist is 
> the following:
> What about adding an "increasing noise" model to the Fobs's of a few 
> datasets and re-refining?
> How much would that noise change the final model quality metrics and 
> in absolute terms?
>
> (for the changes that PDB_RE(BUILD) does have a preview at 
> http://www.ncbi.nlm.nih.gov/pubmed/22034521
> ....I tried to avoid the shamelessly self-promoting plug, but could 
> not resist in the end!)
>
> That experiment - or a better-designed variant of it - would maybe 
> tell us if we should be advocating the archiving of all images,
> and, being scientifically convinced of its importance beyond 
> methods development, we would all argue a strong case
> to the funding and hosting agencies.
>
> Tassos
>
> PS Of course, that does not negate the all-important argument, that 
> when struggling with marginal
> data better processing software is essential. There is a clear need 
> for better software
> to process images, especially for low resolution and low signal/noise 
> cases.
> Since that depends on having test data, I am all for supporting an 
> initiative to collect such data, and I would gladly spend a day digging 
> through our archives to contribute.

 
