Re: [ccp4bb] refining against weak data and Table I stats

Frank von Delft Wed, 12 Dec 2012 23:27:45 -0800

I like the R1 idea...  report CC* and R1.
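
For reference, a minimal sketch of what "R1" could look like, following the definition in James's message quoted below (the R factor computed only over hkls with I/sigma(I) above 3). The function name and the tuple layout are made up for illustration and are not taken from any refinement program:

```python
def r1(reflections, cutoff=3.0):
    """R factor restricted to reflections whose I/sigma(I) exceeds `cutoff`.

    `reflections` is an iterable of (f_obs, f_calc, i_over_sigma) tuples;
    this layout is an assumption made for the example.
    """
    numerator = 0.0
    denominator = 0.0
    for f_obs, f_calc, i_over_sigma in reflections:
        if i_over_sigma > cutoff:
            # Standard R-factor terms: | |Fo| - |Fc| | over |Fo|
            numerator += abs(abs(f_obs) - abs(f_calc))
            denominator += abs(f_obs)
    if denominator == 0.0:
        raise ValueError("no reflections above the I/sigma cutoff")
    return numerator / denominator
```

Weak reflections are simply skipped, so they contribute to neither the numerator nor the denominator.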

Of course, anisotropy screws up everything (what do our small molecule friends know about that - ha!). So earlier in the thread, Ed Berry brought up the "effective resolution":


   Bart Hazes (I think) suggested a statistic called "effective
   resolution" which is the resolution to which a complete dataset
   would have the number of reflections in your dataset.

We just have to settle on how to determine the "number of reflections" - maybe those with I/sigma > 3?
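
A rough sketch of how that could be computed, assuming the number of reflections in a complete dataset scales as 1/d^3 (i.e. ignoring the low-resolution cutoff). The function names and the I/sigma > 3 default are illustrative choices, not an established convention:

```python
def count_significant(i_over_sigma_values, threshold=3.0):
    """Count reflections whose I/sigma exceeds `threshold`."""
    return sum(1 for value in i_over_sigma_values if value > threshold)


def effective_resolution(d_min, n_counted, n_possible):
    """Resolution at which a complete dataset would hold `n_counted` reflections.

    d_min      : nominal high-resolution limit (in A)
    n_counted  : reflections you decide to count (e.g. those with I/sigma > 3)
    n_possible : unique reflections in a complete dataset out to d_min

    Assumes N(d) is proportional to 1/d**3, so
    d_eff = d_min * (n_possible / n_counted) ** (1/3).
    """
    if not 0 < n_counted <= n_possible:
        raise ValueError("need 0 < n_counted <= n_possible")
    return d_min * (n_possible / n_counted) ** (1.0 / 3.0)
```

For example, `effective_resolution(1.6, 764, 1000)` comes out close to 1.75 A; the counts here are hypothetical and were picked only to show the size of the effect.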


phx




On 13/12/2012 06:52, James Holton wrote:

I think CC* (derived from CC1/2) is an important step forward in how to decide where to cut off the data you give to your refinement program, but I don't think it is a good idea to re-define what we call the "resolution of a structure". These do NOT have to be the same thing!
Remember, what we crystallographers call "resolution" is actually about 3x the "resolution" a normal person would use. That is, for most types of imaging, whether it be 2D (pictures of Mars) or 3D (such as electron density), the "resolution" is the minimum feature size you can reliably detect in the image. This definition of "resolution" makes intuitive sense, especially to non-crystallographers. It is also considerably less pessimistic than our current definition since the minimum observable feature size in an electron density map is about 1/3 of the d-spacing of the highest-angle spots. This is basically because the d-spacing is the period of a sine wave in space, but the minimum feature size is related to the full-width at half max of this same wave. So, all you have to do is change your definition of "resolution" and a 3.0 A structure becomes a 1.0 A structure!
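
The factor of about 3 can be checked with a quick calculation. This is only a sketch: it assumes the feature is a single cosine wave of period d with the half-maximum measured relative to zero, and `cosine_fwhm` is a made-up name.

```python
import math


def cosine_fwhm(d_spacing):
    """Full width at half maximum of the positive lobe of cos(2*pi*x/d).

    Assumption: half-maximum is taken relative to zero, so we solve
    cos(2*pi*x/d) = 0.5, which gives x = +/- d/6 and a full width of d/3.
    """
    half_width = d_spacing * math.acos(0.5) / (2.0 * math.pi)
    return 2.0 * half_width
```

Under these assumptions a 3.0 A d-spacing gives a width of 1.0 A, which is the "divide by 3" relationship described above.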
However, I think proposing this new way to define "resolution" in crystallography will be met with some resistance. Why? Because changing the meaning of "resolution" so drastically after ~100 years would be devastating to its usefulness in structure evaluation. I, for one, do not want to have to check the deposition date and see if the structure was solved before or after the end of the world (Dec 2012) before I can figure out whether or not I need to divide or multiply by 3 to get the "real" resolution of the structure. I don't think I'm alone in this.
Now, calling what used to be a 1.6 A structure a 1.42 A structure (one way to interpret Karplus & Diederichs 2012) is not quite as drastic a change as the one I flippantly propose above, but it is still a change, and there is a real danger of "definition creep" here. Most people these days seem to define the resolution limit of their data at the point where the merged I/sigma(I) drops below 2. However, using CC* = 0.5 would place the new "resolution" at the point where merged I/sigma(I) drops below 0.5. That's definitely going beyond what anyone would have called the "resolution of the structure" last year. So, which one is it? Is it a 1.6 A structure (refined using data out to 1.42 A), or is it actually a 1.42 A structure?
Unfortunately, if you talk to a number of experienced crystallographers, they will each have a slightly different set of rules for defining the "resolution limit" that they learned from their thesis advisor, who, in turn, learned it from theirs, etc. Nearly all of these "rule sets" include some reference to Rmerge, but the "acceptable" Rmerge seems to vary from 30% to as much as 150%, depending on whom you talk to. However, despite this prevalence of Rmerge in our perception of resolution there does not seem to be a single publication anywhere in the literature that recommends the use of Rmerge to define the resolution limit. Several papers have been cited to that effect, but then if you go and read them they actually made no such claim.
Mathematically, it is fairly easy to show that Rmerge is wildly unstable as the average intensity approaches zero, so how did we get stuck on it as a criterion for evaluating the outer resolution bin? I'm not really sure, but I think it must have happened around 1995. Before that, there are NO entries for Rmerge in the high-resolution bin in the PDB. Not one. Looking at papers from the pre-1995 era, you don't see it reported in "table 1" either. What is more, ever since 1995, the average reported Rmerge in the high-resolution shell has been slowly rising by about 1.6 percentage points each year. Started around 20%, and now it is up to 50%. Seriously. Here is the graph:
http://bl831.als.lbl.gov/~jamesh/pickup/outershell_Rmerge.png
I think this could be yet another example of "definition creep". For any given year, I imagine a high-resolution Rmerge that is only "a few percent worse" than the average over "the PDB" at that time is probably considered "okay", and the average just keeps increasing over time.
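
As a rough illustration of why Rmerge becomes unstable when the average intensity approaches zero, here is a small simulation. It is only a sketch: it assumes Gaussian noise with a fixed sigma on every measurement, and the helper names and the numbers (sigma = 1, multiplicity 4, 2000 reflections) are invented for the example.

```python
import random


def r_merge(observations):
    """Rmerge = sum_hkl sum_i |I_i - <I>| / sum_hkl sum_i I_i.

    `observations` maps each unique hkl to its list of repeated
    intensity measurements.
    """
    numerator = 0.0
    denominator = 0.0
    for intensities in observations.values():
        mean_i = sum(intensities) / len(intensities)
        numerator += sum(abs(i - mean_i) for i in intensities)
        denominator += sum(intensities)
    return numerator / denominator


def simulate_shell(true_intensity, sigma=1.0, n_refl=2000, multiplicity=4, seed=0):
    """Hypothetical resolution shell: same true intensity, fixed Gaussian noise."""
    rng = random.Random(seed)
    return {
        hkl: [rng.gauss(true_intensity, sigma) for _ in range(multiplicity)]
        for hkl in range(n_refl)
    }


if __name__ == "__main__":
    for true_i in (100.0, 10.0, 1.0):
        print(true_i, round(r_merge(simulate_shell(true_i)), 3))
```

The numerator is set by the noise and stays roughly constant, while the denominator shrinks with the signal, so the ratio grows roughly as sigma/<I> and has no upper bound as <I> goes to zero.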
Nevertheless, Rmerge is a useful statistic for evaluating the quality of a diffractometer, provided it is used in the way it was originally defined by Uli Arndt: over the entire dataset for spots with I/sd > 3. At large multiplicity, the Rmerge calculated this way asymptotically approaches the average "% error" for measuring a single spot. If it is more than 5% or so, then there might be something wrong with the camera (or the space group choice, etc). This is only true for Rmerge of ALL the data, not when it is relegated to a given resolution bin.
Perhaps it is time we did have a discussion about what we mean by "the resolution of a structure" so that some kind of historically relevant and "future proof" definition for it can be devised? Otherwise, we will probably one day see 1.0 A used to describe what today we would call a 3.0 A structure. The whole point here is to be able to compare results done by different people at different periods in history to each other, so I think it's important to try and keep our definition of "resolution" stable, even if we do "use" spots that are beyond it.
So, what I would advise is to refine your model with data out to the resolution limit defined by CC*, but declare the "resolution of the structure" to be where the merged I/sigma(I) falls to 2. You might even want to calculate your Rmerge, Rcryst, Rfree and all the other R values to this resolution as well, since including a lot of zeroes does nothing but artificially drive up estimates of relative error. Perhaps we should even take a lesson from our "small molecule" friends and start reporting "R1", where the R factor is computed only for hkls where I/sigma(I) is above 3?
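
The two-limit recipe above could be sketched as follows. CC* is computed from CC1/2 with the Karplus & Diederichs (2012) relation CC* = sqrt(2 CC1/2 / (1 + CC1/2)); everything else is an assumption for illustration: the per-shell numbers are invented to echo the 1.6 A / 1.42 A example, and the helper names are hypothetical.

```python
import math


def cc_star(cc_half):
    """CC* = sqrt(2*CC1/2 / (1 + CC1/2)), per Karplus & Diederichs (2012)."""
    if cc_half <= 0.0:
        return 0.0  # assumption of this sketch: non-positive CC1/2 means no signal
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


def last_shell_passing(shells, passes):
    """Walk shells from low to high resolution; return the d_min of the
    last shell before the criterion first fails (None if the first fails)."""
    limit = None
    for shell in shells:
        if not passes(shell):
            break
        limit = shell[0]
    return limit


# (d_min in A, merged <I/sigma(I)>, CC1/2) per shell -- made-up values
SHELLS = [
    (2.00, 12.0, 0.99),
    (1.80, 5.1, 0.95),
    (1.60, 2.1, 0.80),
    (1.50, 1.0, 0.45),
    (1.42, 0.5, 0.15),
    (1.35, 0.2, 0.05),
]

reported_resolution = last_shell_passing(SHELLS, lambda s: s[1] >= 2.0)
refinement_limit = last_shell_passing(SHELLS, lambda s: cc_star(s[2]) >= 0.5)
```

With these invented shells the reported resolution stops at the 1.60 A shell while the refinement limit extends to the 1.42 A shell, so the two numbers are kept separate rather than one replacing the other.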
-James Holton
MAD Scientist

On 12/8/2012 4:04 AM, Miller, Mitchell D. wrote:
I too like the idea of reporting the table 1 stats vs resolution
rather than just the overall values and highest resolution shell.

I also wanted to point out an earlier thread from April about the
limitations of the PDB's defining the resolution as being that of
the highest resolution reflection (even if data is incomplete or weak).
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1204&L=ccp4bb&D=0&1=ccp4bb&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4&P=376289
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1204&L=ccp4bb&D=0&1=ccp4bb&9=A&I=-3&J=on&d=No+Match%3BMatch%3BMatches&z=4&P=377673
What we have done in the past for cases of low completeness
in the outer shell is to define the nominal resolution à la Bart
Hazes' method (the resolution at which a complete data set would have
the same number of reflections), use this in the PDB title, and
describe it in the REMARK 3 "other refinement remarks".
   There is also the possibility of adding a comment to the PDB
REMARK 2, which we have not used.
http://www.wwpdb.org/documentation/format33/remarks1.html#REMARK%202
This should help convince reviewers that you are not trying
to mis-represent the resolution of the structure.


Regards,
Mitch

-----Original Message-----
From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Edward A. Berry
Sent: Friday, December 07, 2012 8:43 AM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] refining against weak data and Table I stats

Yes, well, actually I'm only a middle author on that paper for a good
reason, but I did encourage Rebecca and Stephan to use all the data.
But on a later, much more modest submission, where the outer shell
was not only weak but very incomplete (edges of the detector),
the reviewers found it difficult to evaluate the quality
of the data (we had also excluded a zone with bad ice-ring
problems). So we provided a second table, cutting off above
the ice ring in the good strong data, which convinced them
that at least it is a decent 2 A structure. In the PDB it is
a 1.6 A structure, but there was a lot of good data between
the ice ring and 1.6 A.

Bart Hazes (I think) suggested a statistic called "effective
resolution" which is the resolution to which a complete dataset
would have the number of reflections in your dataset, and we
reported this, which came out to something like 1.75.

I do like the idea of reporting in multiple shells, not just overall
and highest shell, and the PDB accommodates this, even has a GUI
to enter it in the ADIT 2.0 software. It could also be used to
report two different overall ranges, such as completeness, 25 to 1.6 A,
which would be shocking in my case, and 25 to 2.0 which would
be more reassuring.

eab

Douglas Theobald wrote:
Hi Ed,
Thanks for the comments. So what do you recommend? Refine against weak data, and report all stats in a single Table I?
Looking at your latest V-ATPase structure paper, it appears you favor something like that, since you report a high res shell with I/sigI=1.34 and Rsym=1.65.
On Dec 6, 2012, at 7:24 PM, Edward A. Berry <ber...@upstate.edu> wrote:
Another consideration here is your PDB deposition. If the reason for using weak data is to get a better structure, presumably you are going to deposit the structure using all the data. Then the statistics in the PDB file must reflect the high resolution refinement.
There are, I think, three places in the PDB file where the resolution is stated, but I believe they are all required to be the same and to be equal to the highest resolution data used (even if there were only two reflections in that shell). Rmerge or Rsymm must be reported, and until recently I think they were not allowed to exceed 1.00 (100% error?).

What are your reviewers going to think if the title of your paper is
"structure of protein A at 2.1 A resolution" but they check the PDB file
and the resolution was really 1.9 A?  And Rsymm in the PDB is 0.99 but
your Table 1* says 1.3?

Douglas Theobald wrote:
Hello all,
I've followed with interest the discussions here about how we should be refining against weak data, e.g. data with I/sigI << 2 (perhaps using all bins that have a "significant" CC1/2 per Karplus and Diederichs 2012). This all makes statistical sense to me, but now I am wondering how I should report data and model stats in Table I.
Here's what I've come up with: report two Table I's. For comparability to legacy structure stats, report a "classic" Table I, where I call the resolution whatever bin has I/sigI=2. Use that as my "high res" bin, with high res bin stats reported in parentheses after global stats. Then have another Table (maybe Table I* in supplementary material?) where I report stats for the whole dataset, including the weak data I used in refinement. In both tables report CC1/2 and Rmeas.
This way, I don't redefine the (mostly) conventional usage of "resolution", my Table I can be compared to precedent, I report stats for all the data and for the model against all data, and I take advantage of the information in the weak data during refinement.
Thoughts?

Douglas


^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`^`
Douglas L. Theobald
Assistant Professor
Department of Biochemistry
Brandeis University
Waltham, MA  02454-9110

dtheob...@brandeis.edu
http://theobald.brandeis.edu/

              ^\
    /`  /^.  / /\
   / / /`/  / . /`
/ /  '   '
'
