James, thank you for taking the time to write this nice essay. I only hope
that your evaluation of the possibility of cutting data at I/sigma(I) = 3 is
for EXHIBITION PURPOSES ONLY and not for refinement or for the calculation of
electron density maps. After all, what we see in the electron density maps is
what matters, not any single-number qualifier :-)

"Do not believe in a single number qualifier, or, for that matter, in
anything else!" - Attributed to Vladimir Prelog by Jack Dunitz

Dr Felix Frolow
Professor of Structural Biology and Biotechnology
Department of Molecular Microbiology and Biotechnology
Tel Aviv University 69978, Israel
Acta Crystallographica F, co-editor
e-mail: mbfro...@post.tau.ac.il
Tel: ++972-3640-8723
Fax: ++972-3640-9407
Cellular: 0547 459 608

On Dec 13, 2012, at 08:52, James Holton <jmhol...@lbl.gov> wrote:

> I think CC* (derived from CC1/2) is an important step forward in how to
> decide where to cut off the data you give to your refinement program, but I
> don't think it is a good idea to re-define what we call the "resolution of a
> structure". These do NOT have to be the same thing!
>
> Remember, what we crystallographers call "resolution" is actually about 3x
> the "resolution" a normal person would use. That is, for most types of
> imaging, whether 2D (pictures of Mars) or 3D (such as electron density),
> the "resolution" is the minimum feature size you can reliably detect in the
> image. This definition of "resolution" makes intuitive sense, especially to
> non-crystallographers. It is also considerably less pessimistic than our
> current definition, since the minimum observable feature size in an electron
> density map is about 1/3 of the d-spacing of the highest-angle spots. This
> is basically because the d-spacing is the period of a sine wave in space,
> but the minimum feature size is related to the full width at half maximum
> of this same wave. So, all you have to do is change your definition of
> "resolution", and a 3.0 A structure becomes a 1.0 A structure!
>
> However, I think proposing this new way to define "resolution" in
> crystallography will be met with some resistance. Why? Because changing the
> meaning of "resolution" so drastically after ~100 years would be devastating
> to its usefulness in structure evaluation. I, for one, do not want to have
> to check the deposition date to see whether the structure was solved before
> or after the end of the world (Dec 2012) before I can figure out whether I
> need to divide or multiply by 3 to get the "real" resolution of the
> structure. I don't think I'm alone in this.
>
> Now, calling what used to be a 1.6 A structure a 1.42 A structure (one way
> to interpret Karplus & Diederichs 2012) is not quite as drastic a change as
> the one I flippantly propose above, but it is still a change, and there is a
> real danger of "definition creep" here. Most people these days seem to
> define the resolution limit of their data as the point where the merged
> I/sigma(I) drops below 2. However, using CC* = 0.5 would place the new
> "resolution" at the point where merged I/sigma(I) drops below 0.5. That's
> definitely going beyond what anyone would have called the "resolution of the
> structure" last year. So, which one is it? Is it a 1.6 A structure (refined
> using data out to 1.42 A), or is it actually a 1.42 A structure?
>
> Unfortunately, if you talk to a number of experienced crystallographers,
> they will each have a slightly different set of rules for defining the
> "resolution limit" that they learned from their thesis advisor, who, in
> turn, learned it from theirs, etc. Nearly all of these "rule sets" include
> some reference to Rmerge, but the "acceptable" Rmerge seems to vary from 30%
> to as much as 150%, depending on whom you talk to. However, despite this
> prevalence of Rmerge in our perception of resolution, there does not seem to
> be a single publication anywhere in the literature that recommends the use
> of Rmerge to define the resolution limit. Several papers have been cited to
> that effect, but if you go and read them, they actually make no such claim.
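For reference, the CC* statistic discussed above is defined by Karplus &
Diederichs (2012) in terms of CC1/2 as CC* = sqrt(2*CC1/2 / (1 + CC1/2)),
an estimate of the correlation of the merged data with the underlying true
signal. A minimal Python sketch (the function name is mine, not from any
particular package):

    import numpy as np

    def cc_star(cc_half):
        # Karplus & Diederichs (2012): estimated correlation of the merged
        # data with the (unmeasurable) true intensities, from CC1/2.
        cc_half = np.asarray(cc_half, dtype=float)
        return np.sqrt(2.0 * cc_half / (1.0 + cc_half))

    # A shell with CC1/2 as low as 0.143 already gives CC* of about 0.5:
    print(cc_star([0.9, 0.5, 0.143]))  # approx. [0.973 0.816 0.500]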
>
> Mathematically, it is fairly easy to show that Rmerge is wildly unstable as
> the average intensity approaches zero, so how did we get stuck on it as a
> criterion for evaluating the outer resolution bin? I'm not really sure, but
> I think it must have happened around 1995. Before that, there are NO
> entries for Rmerge in the high-resolution bin in the PDB. Not one. Looking
> at papers from the pre-1995 era, you don't see it reported in "Table 1"
> either. What is more, ever since 1995, the average reported Rmerge in the
> high-resolution shell has been slowly rising, by about 1.6 percentage
> points each year. It started around 20%, and now it is up to 50%.
> Seriously. Here is the graph:
> http://bl831.als.lbl.gov/~jamesh/pickup/outershell_Rmerge.png
>
> I think this could be yet another example of "definition creep". For any
> given year, I imagine a high-resolution Rmerge that is only "a few percent
> worse" than the average over "the PDB" at that time is probably considered
> "okay", and so the average just keeps increasing over time.
>
> Nevertheless, Rmerge is a useful statistic for evaluating the quality of a
> diffractometer, provided it is used in the way it was originally defined by
> Uli Arndt: over the entire dataset, for spots with I/sd > 3. At large
> multiplicity, the Rmerge calculated this way asymptotically approaches the
> average "% error" for measuring a single spot. If it is more than 5% or so,
> then there might be something wrong with the camera (or the space group
> choice, etc.). This is only true for Rmerge over ALL the data, not when it
> is relegated to a given resolution bin.
>
> Perhaps it is time we did have a discussion about what we mean by "the
> resolution of a structure", so that some kind of historically relevant and
> "future proof" definition for it can be devised? Otherwise, we will
> probably one day see 1.0 A used to describe what today we would call a
> 3.0 A structure. The whole point here is to be able to compare results
> obtained by different people at different periods in history, so I think
> it's important to try and keep our definition of "resolution" stable, even
> if we do "use" spots that are beyond it.
>
> So, what I would advise is to refine your model with data out to the
> resolution limit defined by CC*, but declare the "resolution of the
> structure" to be where the merged I/sigma(I) falls to 2. You might even
> want to calculate your Rmerge, Rcryst, Rfree and all the other R values to
> this resolution as well, since including a lot of zeroes does nothing but
> artificially drive up estimates of relative error. Perhaps we should even
> take a lesson from our "small molecule" friends and start reporting "R1",
> where the R factor is computed only for hkls where I/sigma(I) is above 3?
>
> -James Holton
> MAD Scientist
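A quick numerical illustration of the instability described above: simulate
repeated measurements of each reflection with fixed Gaussian noise and
compute Rmerge = sum|I - <I>| / sum(I) as the true intensity shrinks. This
is only a sketch with made-up numbers, not anyone's actual protocol:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma, multiplicity, n_refl = 1.0, 8, 10000

    for true_I in [10.0, 3.0, 1.0, 0.3]:
        # n_refl unique reflections, each measured 'multiplicity' times
        obs = true_I + sigma * rng.standard_normal((n_refl, multiplicity))
        mean = obs.mean(axis=1, keepdims=True)
        rmerge = np.abs(obs - mean).sum() / obs.sum()
        print(f"<I>/sigma = {true_I / sigma:4.1f}   Rmerge = {rmerge:5.2f}")

With these numbers, Rmerge comes out near the fractional measurement error
for strong data (roughly 0.07 at <I>/sigma = 10, which is Arndt's regime),
but it climbs to roughly 0.75 at <I>/sigma = 1 and diverges below that as
the denominator heads toward zero, which is exactly why it says so little
about the weakest shell.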
> On 12/8/2012 4:04 AM, Miller, Mitchell D. wrote:
>> I too like the idea of reporting the Table 1 stats vs. resolution
>> rather than just the overall values and the highest resolution shell.
>>
>> I also wanted to point out an earlier thread from April about the
>> limitations of the PDB defining the resolution as that of the
>> highest-resolution reflection (even if the data are incomplete or weak):
>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1204&L=ccp4bb&D=0&1=ccp4bb&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4&P=376289
>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1204&L=ccp4bb&D=0&1=ccp4bb&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4&P=377673
>>
>> What we have done in the past for cases of low completeness in the outer
>> shell is to define the nominal resolution a la Bart Hazes' method (the
>> resolution at which a complete data set would have the same number of
>> reflections), use this in the PDB title, and describe it in the REMARK 3
>> "other refinement remarks".
>> There is also the possibility of adding a comment to PDB REMARK 2,
>> which we have not used:
>> http://www.wwpdb.org/documentation/format33/remarks1.html#REMARK%202
>> This should help convince reviewers that you are not trying
>> to misrepresent the resolution of the structure.
>>
>> Regards,
>> Mitch
>>
>> -----Original Message-----
>> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of
>> Edward A. Berry
>> Sent: Friday, December 07, 2012 8:43 AM
>> To: CCP4BB@JISCMAIL.AC.UK
>> Subject: Re: [ccp4bb] refining against weak data and Table I stats
>>
>> Yes, well, actually I'm only a middle author on that paper for a good
>> reason, but I did encourage Rebecca and Stephan to use all the data.
>> But on a later, much more modest submission, where the outer shell
>> was not only weak but very incomplete (edges of the detector),
>> the reviewers found it difficult to evaluate the quality
>> of the data (we had also excluded a zone with bad ice-ring
>> problems). So we provided a second table, cutting off above
>> the ice ring in the good strong data, which convinced them
>> that it is at least a decent 2 A structure. In the PDB it is
>> a 1.6 A structure, but there was a lot of good data between
>> the ice ring and 1.6 A.
>>
>> Bart Hazes (I think) suggested a statistic called "effective
>> resolution", which is the resolution at which a complete dataset
>> would have the same number of reflections as your dataset. We
>> reported this, and it came out to something like 1.75.
>>
>> I do like the idea of reporting in multiple shells, not just overall
>> and highest shell, and the PDB accommodates this; it even has a GUI
>> to enter it in the ADIT 2.0 software. It could also be used to
>> report two different overall ranges for, say, completeness: 25 to 1.6 A,
>> which would be shocking in my case, and 25 to 2.0 A, which would
>> be more reassuring.
>>
>> eab
>>
>> Douglas Theobald wrote:
>>> Hi Ed,
>>>
>>> Thanks for the comments. So what do you recommend? Refine against weak
>>> data, and report all stats in a single Table I?
>>>
>>> Looking at your latest V-ATPase structure paper, it appears you favor
>>> something like that, since you report a high-res shell with
>>> I/sigI = 1.34 and Rsym = 1.65.
>>>
>>> On Dec 6, 2012, at 7:24 PM, Edward A. Berry <ber...@upstate.edu> wrote:
>>>
>>>> Another consideration here is your PDB deposition. If the reason for
>>>> using weak data is to get a better structure, presumably you are going
>>>> to deposit the structure using all the data. Then the statistics in the
>>>> PDB file must reflect the high-resolution refinement.
>>>>
>>>> There are, I think, three places in the PDB file where the resolution
>>>> is stated, but I believe they are all required to be the same and to be
>>>> equal to the highest resolution of the data used (even if there were
>>>> only two reflections in that shell).
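The "effective resolution" attributed to Bart Hazes above follows from the
fact that the number of unique reflections inside the resolution sphere
grows roughly as 1/d^3. A back-of-the-envelope sketch (the function name
and the counts are illustrative only, and this is a rough estimate rather
than the exact published definition):

    def effective_resolution(d_min, n_obs, n_complete):
        # Resolution at which a complete data set would contain n_obs
        # reflections, given that a complete set to d_min would contain
        # n_complete. Relies on N scaling as 1/d**3 (the volume of the
        # resolution sphere).
        return d_min * (n_complete / n_obs) ** (1.0 / 3.0)

    # A nominally 1.6 A data set with ~78% of a complete set's reflections
    # comes out near the 1.75 A figure quoted above:
    print(round(effective_resolution(1.6, 78000, 100000), 2))  # -> 1.74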
>>>> Rmerge or Rsym must be reported, and until recently I think they were
>>>> not allowed to exceed 1.00 (100% error?).
>>>>
>>>> What are your reviewers going to think if the title of your paper is
>>>> "structure of protein A at 2.1 A resolution" but they check the PDB
>>>> file and the resolution was really 1.9 A? And Rsym in the PDB is 0.99,
>>>> but your Table I* says 1.3?
>>>>
>>>> Douglas Theobald wrote:
>>>>> Hello all,
>>>>>
>>>>> I've followed with interest the discussions here about how we should
>>>>> be refining against weak data, e.g. data with I/sigI << 2 (perhaps
>>>>> using all bins that have a "significant" CC1/2 per Karplus and
>>>>> Diederichs 2012). This all makes statistical sense to me, but now I
>>>>> am wondering how I should report data and model stats in Table I.
>>>>>
>>>>> Here's what I've come up with: report two Table I's. For comparability
>>>>> with legacy structure stats, report a "classic" Table I, where I define
>>>>> the resolution by the bin where I/sigI = 2. Use that as my "high-res"
>>>>> bin, with high-res bin stats reported in parentheses after the global
>>>>> stats. Then have another table (maybe Table I* in supplementary
>>>>> material?) where I report stats for the whole dataset, including the
>>>>> weak data I used in refinement. In both tables, report CC1/2 and Rmeas.
>>>>>
>>>>> This way, I don't redefine the (mostly) conventional usage of
>>>>> "resolution", my Table I can be compared to precedent, I report stats
>>>>> for all the data and for the model against all data, and I take
>>>>> advantage of the information in the weak data during refinement.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Douglas
>>>>>
>>>>> ^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`
>>>>> Douglas L. Theobald
>>>>> Assistant Professor
>>>>> Department of Biochemistry
>>>>> Brandeis University
>>>>> Waltham, MA 02454-9110
>>>>>
>>>>> dtheob...@brandeis.edu
>>>>> http://theobald.brandeis.edu/
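Since both of the proposed tables would carry CC1/2 and Rmeas, here is a
minimal sketch of how those two statistics can be computed for one
resolution shell from unmerged observations. The half-set split for CC1/2
is done by random assignment here; data-processing programs split more
carefully, and all names and numbers below are illustrative:

    import numpy as np

    def rmeas(obs):
        # Multiplicity-weighted merging R (Diederichs & Karplus, 1997):
        # sum over hkl of sqrt(n/(n-1)) * sum|I_i - <I>|, over sum of all I.
        num = den = 0.0
        for I in obs:
            n = len(I)
            if n < 2:
                continue
            num += np.sqrt(n / (n - 1.0)) * np.abs(I - I.mean()).sum()
            den += I.sum()
        return num / den

    def cc_half(obs, seed=0):
        # CC1/2 (Karplus & Diederichs, 2012): Pearson correlation between
        # the merged intensities of two random half-sets of observations.
        rng = np.random.default_rng(seed)
        a, b = [], []
        for I in obs:
            if len(I) < 2:
                continue
            half = rng.permutation(I)
            m = len(half) // 2
            a.append(half[:m].mean())
            b.append(half[m:].mean())
        return np.corrcoef(a, b)[0, 1]

    # obs: one array of repeated measurements per unique hkl in the shell.
    rng = np.random.default_rng(1)
    true_I = rng.exponential(50.0, size=200)
    obs = [t + rng.normal(0.0, 5.0, size=6) for t in true_I]
    # Expect Rmeas below ~0.1 and CC1/2 near 1 for this strong synthetic shell.
    print(f"Rmeas = {rmeas(obs):.3f}   CC1/2 = {cc_half(obs):.3f}")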