I very much agree with Holton. I also find the following to be a simple
and very helpful argument in the discussion (it was given to me by Morten
Kjeldgaard):

"Your model should reproduce your weak data by weak Fc's - therefore you
need your  weak reflections in the refinement"

I personally like to judge a proper data cutoff from the Wilson plot - as
long as it looks right, the data are OK to use. If, on the other hand, the
plot changes slope, levels out or even rises at some point, then I cut the
data there.
The plot will often look reasonable down to an I/sigI level of about 1 -
1.5, other times 2 or 3 - it really depends on the crystal (and on the
assumption of valid Wilson statistics, of course).
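
(For the record, a minimal sketch of that kind of Wilson-plot check, in
Python/numpy - the intensity and d-spacing arrays are hypothetical input,
not tied to any particular program:)

import numpy as np

def wilson_plot(intensities, d_spacings, n_bins=20):
    # Crude Wilson plot: ln<I> per bin of (sin(theta)/lambda)^2 = 1/(4 d^2).
    # intensities, d_spacings: hypothetical per-reflection arrays.
    s2 = 1.0 / (4.0 * d_spacings ** 2)
    edges = np.linspace(s2.min(), s2.max(), n_bins + 1)
    centres, ln_mean = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (s2 >= lo) & (s2 < hi)
        mean_i = intensities[sel].mean() if sel.any() else 0.0
        if sel.sum() < 10 or mean_i <= 0:   # skip sparse or non-positive bins
            continue
        centres.append(0.5 * (lo + hi))
        ln_mean.append(np.log(mean_i))
    return np.array(centres), np.array(ln_mean)

# A straight, negatively sloped high-resolution tail means the data still
# behave as Wilson statistics predict; a flattening or upturn marks where
# I would cut.  E.g.: s2, lnI = wilson_plot(I_obs, d_obs)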

Poul

> I generally cut off integration at the shell where I/sigI < 0.5 and
> then cut off merged data where Mn(I)/sd(I) ~ 1.5.  It is always easier
> to cut off data later than to re-integrate it.  I never look at the
> Rmerge, Rsym, Rpim or Rwhatever in the highest resolution shell.  This
> is because R-statistics are inappropriate for weak data.
>
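(As a concrete illustration of those two thresholds, a minimal sketch in
Python/numpy - I, sigI and d are hypothetical per-reflection arrays, and
the shell scheme is simply equal volumes in 1/d^3:)

import numpy as np

def mean_i_over_sigma_by_shell(I, sigI, d, n_shells=20):
    # Mean I/sigma(I) per resolution shell; I, sigI, d are hypothetical
    # per-reflection arrays (intensity, its sigma, d-spacing in Angstrom).
    s3 = 1.0 / d ** 3                                   # equal-volume binning
    edges = np.linspace(s3.min(), s3.max(), n_shells + 1)
    idx = np.clip(np.digitize(s3, edges) - 1, 0, n_shells - 1)
    shells = []
    for i in range(n_shells):
        sel = idx == i
        if sel.any():
            d_mid = (2.0 / (edges[i] + edges[i + 1])) ** (1.0 / 3.0)
            shells.append((d_mid, float((I[sel] / sigI[sel]).mean())))
    return shells

# Integrate out to the shell where the unmerged I/sigI drops below ~0.5,
# then cut the merged data near the shell where Mn(I)/sd(I) falls to ~1.5.
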
> Don't believe me?  If anybody out there doesn't think that spots with no
> intensity are important, then why are you looking so carefully at your
> systematic absences? ;)  The "Rabsent" statistic (if it existed) would
> always be dividing by zero, and giving wildly varying numbers > 100%
> (unless your "absences" really do have intensity, but then they are not
> absent, are they?).
>
>   There is information in the intensity of an "absent" spot (a
> systematic absence, or any spot beyond your "true resolution limit").
> Unfortunately, measuring zero is "hard" because the "signal to noise
> ratio" will always be ... zero.  Statistics as we know it seems to fear
> this noise>signal domain.  For example, the error propagation Ulrich
> pointed out (F/sigF = 2 I/sigI) breaks down as I approaches zero.  If
> you take F=0, and add random noise to it and then square it, you will
> get an average value for <I>=<F^2> that always equals the square of the
> noise you added.  It will never be zero, no matter how much averaging
> you do.  Going the other way is problematic because if <I> really is
> zero, then half of your measurements of it will be negative (and sqrt(I)
> will be "imaginary" (ha ha)).  This is the problem TRUNCATE tries to
> solve.
>
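(That noise>signal behaviour is easy to reproduce numerically; a toy
demonstration in Python/numpy with an arbitrary noise level:)

import numpy as np

rng = np.random.default_rng(0)
sigma, n = 5.0, 1_000_000      # arbitrary noise level and sample count

# F = 0 plus noise, then squared: <I> converges to sigma^2, never to zero.
I_from_F = (0.0 + rng.normal(0.0, sigma, n)) ** 2
print(I_from_F.mean())         # ~25.0, i.e. sigma**2

# Measuring I = 0 directly with noise: about half the observations come
# out negative, so sqrt(I) is indeed "imaginary" for those.
I_direct = 0.0 + rng.normal(0.0, sigma, n)
print((I_direct < 0).mean())   # ~0.5
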
> Despite these difficulties, IMHO, cutting out weak data from a ML
> refinement is a really bad idea.  This is because there is a big
> difference between "1 +/- 10" and "I don't know, could be anything" when
> you are fitting a model to data.  ESPECIALLY when your data/parameters
> ratio is already ~1.0 or less.  This is because the DIFFERENCE between
> Fobs and Fcalc relative to the uncertainty of Fobs is what determines
> whether or not your model is correct "to within experimental error".  If
> weak, high-res data are left out, then they can become a dumping ground
> for model bias.  Indeed, there are some entries in the PDB (particularly
> those pre-dating when we knew how to restrain B factors properly) that
> show an up-turn in "intensity" beyond the quoted resolution cutoff (if
> you look at the Wilson plot of Fcalc).  This is because the refinement
> program was allowed to make Fcalc beyond the resolution cutoff anything
> it wanted (and it did).
>
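(A toy illustration of the "1 +/- 10" point, with made-up numbers:)

# Fobs = 1 +/- 10 still constrains the model: it tells the refinement that
# Fcalc = 50 is about 5 sigma wrong, whereas an omitted reflection says
# nothing at all, and Fcalc there is free to absorb model bias.
Fobs, sigFobs = 1.0, 10.0
for Fcalc in (2.0, 50.0):
    print(Fcalc, (Fobs - Fcalc) / sigFobs)   # -0.1 sigma vs -4.9 sigma
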
> The only time I think it is appropriate to cut out data because it is weak
> is for map calculations.  Leaving out an HKL from the map is the same as
> assigning it to zero (unless it is a sigma-a map that "fills in" with
> Fcalcs).  In maps, weak data (I/sd < 1) will (by definition) add more
> noise than signal.  In fact, calculating an anomalous difference
> Patterson with DANO/SIGDANO as the coefficients instead of DANO can
> often lead to "better" maps.
>
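(The "leaving out = setting to zero" point can be seen in a toy 1-D
Fourier synthesis - purely illustrative, with made-up coefficients:)

import numpy as np

x = np.linspace(0.0, 1.0, 200, endpoint=False)
coeffs = {1: 3.0, 2: 1.5, 3: 0.2}      # made-up F values, indexed by h

def synthesis(c):
    # Centrosymmetric toy case: all phases zero, so the "map" is a cosine sum.
    rho = np.zeros_like(x)
    for h, F in c.items():
        rho += F * np.cos(2.0 * np.pi * h * x)
    return rho

dropped = {h: F for h, F in coeffs.items() if h != 3}   # leave reflection 3 out
zeroed = {**coeffs, 3: 0.0}                             # or set it to zero
print(np.allclose(synthesis(dropped), synthesis(zeroed)))   # True: same map
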
> Yes, your Rmerge, Rcryst and Rfree will all go up if you include weak
> data in your scaling and refinement, but the accuracy of your model will
> improve.  If you (or your reviewer) are worried about this, I suggest
> using the old, traditional 3-sigma cutoff for data used to calculate R.
> Keep the anachronisms together.  Yes, the PDB allows this.  In fact,
> (last time I checked) you are asked to enter what sigma cutoff you used
> for your R factors.
>
> In the last 100 days (3750 PDB depositions), the "REMARK   3   DATA
> CUTOFF" stats are thus:
>
> sigma-cutoff  popularity
> NULL          13.84%
> NONE          13.65%
> -2.5 to -1.5   0.37%
> -0.5 to 0.5   62.48%
> 0.5 to 1.5     2.03%
> 1.5 to 2.5     6.51%
> 2.5 to 3.5     0.61%
> 3.5 to 4.5     0.24%
>  >4.5           0.27%
>
> So it would appear mine is not a popular attitude.
>
> -James Holton
> MAD Scientist
>
>
> Shane Atwell wrote:
>
>> Could someone point me to some standards for data quality, especially
>> for publishing structures? I'm wondering in particular about highest
>> shell completeness, multiplicity, sigma and Rmerge.
>>
>> A co-worker pointed me to a '97 article by Kleywegt and Jones:
>>
>> http://xray.bmc.uu.se/gerard/gmrp/gmrp.html
>>
>> "To decide at which shell to cut off the resolution, we nowadays tend
>> to use the following criteria for the highest shell: completeness > 80
>> %, multiplicity > 2, more than 60 % of the reflections with I > 3
>> sigma(I), and Rmerge < 40 %. In our opinion, it is better to have a
>> good 1.8 Å structure, than a poor 1.637 Å structure."
>>
>> Are these recommendations still valid with maximum likelihood methods?
>> We tend to use more data, especially in terms of the Rmerge and sigma
>> cutoffs.
>>
>> Thanks in advance,
>>
>> *Shane Atwell*
>>
>
>


-- 
Poul Nissen, Professor, Ph.d.
Centre for Structural Biology
Aarhus University, Dept. Molecular Biology
Gustav Wieds Vej 10c, DK-8000 Aarhus C, Denmark
[EMAIL PROTECTED], http://www.bioxray.dk/pn
Tel +45 8942 5025, Fax +45 8612 3178
