I very much agree with Holton. I also find the following to be a simple and very helpful argument in the discussion (it was given to me by Morten Kjeldgaard):
"Your model should reproduce your weak data by weak Fc's - therefore you need your weak reflections in the refinement" I personally like to judge a proper data cutoff from the Wilson plot - as long as it looks right data are OK to use. If on the other hand the plot changes slope, levels out or even rises at some point, then I cut the data there. The plot will often show nice appearance to a I/sigI level of about 1 - 1.5, other times 2 or 3 - it really depends on the crystal (and the assumption of valid wilson statistics, of course). Poul > I generally cut off integration at the shell wheree I/sigI < 0.5 and > then cut off merged data where MnI/sd(I) ~ 1.5. It is always easier to > to cut off data later than to re-integrate it. I never look at the > Rmerge, Rsym, Rpim or Rwhatever in the highest resolution shell. This > is because R-statistics are inappropriate for weak data. > > Don't believe me? If anybody out there doesn't think that spots with no > intensity are important, then why are you looking so carefully at your > systematic absences? ;) The "Rabsent" statistic (if it existed) would > always be dividing by zero, and giving wildly varying numbers > 100% > (unless your "absences" really do have intensity, but then they are not > absent, are they?). > > There is information in the intensity of an "absent" spot (a > systematic absence, or any spot beyond your "true resolution limit"). > Unfortunately, measuring zero is "hard" because the "signal to noise > ratio" will always be ... zero. Statistics as we know it seems to fear > this noise>signal domain. For example, the error propagation Ulrich > pointed out (F/sigF = 2 I/sigI) breaks down as I approaches zero. If > you take F=0, and add random noise to it and then square it, you will > get an average value for <I>=<F^2> that always equals the square of the > noise you added. It will never be zero, no matter how much averaging > you do. Going the other way is problematic because if <I> really is > zero, then half of your measurments of it will be negative (and sqrt(I) > will be "imaginary" (ha ha)). This is the problem TRUNCATE tries to > solve. > > Despite these difficulties, IMHO, cutting out weak data from a ML > refinement is a really bad idea. This is because there is a big > difference between "1 +/- 10" and "I don't know, could be anything" when > you are fitting a model to data. ESPECIALLY when your data/parameters > ratio is already ~1.0 or less. This is because the DIFFERENCE between > Fobs and Fcalc relative to the uncertainty of Fobs is what determines > wether or not your model is correct "to within experimental error". If > weak, high-res data are left out, then they can become a dumping ground > for model bias. Indeed, there are some entries in the PDB (particularly > those pre-dating when we knew how to restrain B factors properly) that > show an up-turn in "intensity" beyond the quoted resolution cutoff (if > you look at the Wilson plot of Fcalc). This is because the refinement > program was allowed to make Fcalc beyond the resolution cutoff anything > it wanted (and it did). > > The only time I think cutting out data because it is weak is appropriate > is for map calculations. Leaving out an HKL from the map is the same as > assigning it to zero (unless it is a sigma-a map that "fills in" with > Fcalcs). In maps, weak data (I/sd < 1) will (by definition) add more > noise than signal. 
> In fact, calculating an anomalous difference Patterson with
> DANO/SIGDANO as the coefficients instead of DANO can often lead to
> "better" maps.
>
> Yes, your Rmerge, Rcryst and Rfree will all go up if you include weak
> data in your scaling and refinement, but the accuracy of your model
> will improve. If you (or your reviewer) are worried about this, I
> suggest using the old, traditional 3-sigma cutoff for the data used
> to calculate R. Keep the anachronisms together. Yes, the PDB allows
> this. In fact, (last time I checked) you are asked to enter what
> sigma cutoff you used for your R factors.
>
> In the last 100 days (3750 PDB depositions), the "REMARK 3 DATA
> CUTOFF" stats are thus:
>
>   sigma-cutoff    popularity
>   NULL              13.84%
>   NONE              13.65%
>   -2.5 to -1.5       0.37%
>   -0.5 to  0.5      62.48%
>    0.5 to  1.5       2.03%
>    1.5 to  2.5       6.51%
>    2.5 to  3.5       0.61%
>    3.5 to  4.5       0.24%
>    >4.5              0.27%
>
> So it would appear mine is not a popular attitude.
>
> -James Holton
> MAD Scientist
>
>
> Shane Atwell wrote:
>
>> Could someone point me to some standards for data quality, especially
>> for publishing structures? I'm wondering in particular about highest
>> shell completeness, multiplicity, sigma and Rmerge.
>>
>> A co-worker pointed me to a '97 article by Kleywegt and Jones:
>>
>> http://xray.bmc.uu.se/gerard/gmrp/gmrp.html
>>
>> "To decide at which shell to cut off the resolution, we nowadays tend
>> to use the following criteria for the highest shell: completeness > 80
>> %, multiplicity > 2, more than 60 % of the reflections with I > 3
>> sigma(I), and Rmerge < 40 %. In our opinion, it is better to have a
>> good 1.8 Å structure, than a poor 1.637 Å structure."
>>
>> Are these recommendations still valid with maximum likelihood
>> methods? We tend to use more data, especially in terms of the Rmerge
>> and sigma cutoff.
>>
>> Thanks in advance,
>>
>> *Shane Atwell*
>>

--
Poul Nissen, Professor, Ph.D.
Centre for Structural Biology
Aarhus University, Dept. Molecular Biology
Gustav Wieds Vej 10c, DK-8000 Aarhus C, Denmark
[EMAIL PROTECTED], http://www.bioxray.dk/pn
Tel +45 8942 5025, Fax +45 8612 3178
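For anyone curious to reproduce a tally like James' "REMARK 3 DATA CUTOFF" table, a minimal Python sketch along the following lines should do it. The exact REMARK layout (a "DATA CUTOFF ... (SIGMA(F)) : value" line) and the local directory of PDB files are assumptions on my part, so adjust the matching to taste.

    # Sketch: tally the sigma cutoff reported on "REMARK   3 ... DATA CUTOFF"
    # lines across a local set of PDB files, binned like the table above.
    # The precise REMARK formatting is assumed; only the standard library is used.
    import glob
    from collections import Counter

    BINS = [(-2.5, -1.5, "-2.5 to -1.5"), (-0.5, 0.5, "-0.5 to 0.5"),
            (0.5, 1.5, "0.5 to 1.5"), (1.5, 2.5, "1.5 to 2.5"),
            (2.5, 3.5, "2.5 to 3.5"), (3.5, 4.5, "3.5 to 4.5")]

    def cutoff_bin(raw):
        """Map the raw cutoff field to one of the bins used above."""
        raw = raw.strip()
        if raw in ("NULL", "NONE", ""):
            return raw or "NULL"
        try:
            x = float(raw)
        except ValueError:
            return "unparseable"
        for lo, hi, label in BINS:
            if lo <= x < hi:
                return label
        return ">4.5" if x >= 4.5 else "other"

    counts = Counter()
    for path in glob.glob("pdb_files/*.pdb"):      # hypothetical local mirror
        with open(path) as handle:
            for line in handle:
                if (line.startswith("REMARK   3") and "DATA CUTOFF" in line
                        and "HIGH" not in line and "LOW" not in line):
                    counts[cutoff_bin(line.rsplit(":", 1)[-1])] += 1
                    break                          # one cutoff per entry

    total = sum(counts.values()) or 1
    for label, n in counts.most_common():
        print(f"{label:>14}  {100.0 * n / total:6.2f}%")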