You also always have to consider why you are doing this calculation - usually to satisfy a sceptical and possibly ill-informed referee. A major reason for doing it is to justify including an outer resolution shell of data (see this BB passim), and for this I have come to prefer the random half-dataset correlation coefficient in shells. A CC has a more straightforward distribution than an R-factor (though it is not entirely without problems), it is independent of the SD estimates, and it is easy to understand.

Phil
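For readers who want to see what the half-dataset correlation coefficient looks like in practice, here is a minimal sketch under toy assumptions (a made-up list of unmerged intensities per reflection for a single resolution shell; no particular program's data model): split each reflection's observations randomly in two, merge each half, and correlate the two sets of half-set means.

```python
# Minimal sketch (not from the original posts) of a random half-dataset
# correlation coefficient for one resolution shell.  The data layout is a
# toy assumption: one array of unmerged intensities per unique reflection.
import numpy as np

rng = np.random.default_rng(0)

def half_dataset_cc(observations, rng):
    """observations: list of 1-D arrays, one array of unmerged I's per hkl."""
    half1, half2 = [], []
    for obs in observations:
        if len(obs) < 2:
            continue  # need at least one observation in each half
        idx = rng.permutation(len(obs))
        split = len(obs) // 2
        half1.append(obs[idx[:split]].mean())
        half2.append(obs[idx[split:]].mean())
    return np.corrcoef(half1, half2)[0, 1]

# Toy shell: 500 weakish reflections, true I from an exponential distribution,
# each measured 4 times with Gaussian noise of sigma = 1.
true_I = rng.exponential(scale=0.5, size=500)
shell = [I + rng.normal(0.0, 1.0, size=4) for I in true_I]
print("half-dataset CC for this shell: %.3f" % half_dataset_cc(shell, rng))
```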
James Holton wrote:

The fundamental mathematical problem of using an R statistic on data with I/sigma(I) < 3 is that the assumption that the fractional deviates (I-<I>)/<I> obey a Gaussian distribution breaks down. And when that happens, the R calculation itself becomes unstable, giving essentially random R values. Therefore, including weak data in R calculations is equivalent to calculating R with a 3-sigma cutoff and then adding a random number to the R value. Now, "random" data is one thing, but if the statistic used to evaluate the data quality is itself random, then it is not what I would call "useful".

Since I am not very good at math, I always find myself approaching statistics by generating long lists of random numbers, manipulating them in some way, and then graphing the results. For graphing Rmerge vs I/sigma(I), one does find that "Bernhard's rule" of Rmerge = 0.8/(I/sigma(I)) generally applies, but only for I/sigma(I) >= 3. It gets better with high multiplicity, but even with m = 100, the Rmerge values for the I/sigma(I) < 1 points are all over the place. This is true even if you average the value of Rmerge over a million random-number "seeds". In fact, one must do so much averaging that I start to worry about the low-order bits of common random number generators. I have attached images of these "Rmerge vs I/sigma" graphs. The error bars reflect the rms deviation from the average of a large number of "Rmerge" values (different random number seeds). The "missing" values are actually points where the average Rmerge in 600,000 trials (m = 3) was still negative.

The reason for this "noisy R factor" problem becomes clear if you consider the limiting case where the "true" intensity is zero and make a histogram of (I - <I>)/<I>. It is not a Gaussian. Rather, it is the Gaussian's evil stepsister: the Lorentzian (or "Cauchy distribution"). This distribution may look a lot like a Gaussian, but it has longer tails, and these tails give it the weird statistical property of having an undefined mean value. This is counterintuitive, because you can clearly just look at the histogram and see that it has a central peak (at zero); but if you generate a million Lorentzian-distributed random numbers and take the average value, you will not get anything close to zero. Try it! You can generate a Lorentzian deviate from a uniform deviate like this: tan(pi*(rand()-0.5)), where "rand()" makes a random number from 0 to 1.

Now, it is not too hard to understand how R could "blow up" when the "true" spot intensities are all zero. After all, as <I> approaches zero, the ratio (I - <I>)/<I> approaches a divide-by-zero problem. But what about when I/sigma(I) = 1? Or 2? If you look at these histograms, you find that they are a "cross" between a Gaussian and a Lorentzian (the so-called "Voigt function"), and the histogram does not become truly "Gaussian-looking" until I/sigma(I) = 3. At this point, the R factor obeys "Bernhard's rule" quite well, even with multiplicities as low as 2 or 3. This was the moment when I realized that the "early crystallographers" who first decided to use this 3-sigma cutoff were smarter than I am.

Now, you can make a Voigt function (or even a Lorentzian) look more like a Gaussian by doing something called "outlier rejection", but it is hard to rationalize why the "outliers" are being rejected. Especially in a simulation!
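This is not the script behind the attached graphs, just a small numpy sketch of the limiting case described above: with the true intensity at zero, the fractional deviates (I - <I>)/<I> inherit the Lorentzian's long tails, so their sample mean refuses to settle down while their median does not, and the tan(pi*(rand()-0.5)) trick generates pure Cauchy deviates for comparison. The Gaussian error model, sigma = 1, and multiplicity m = 3 are assumptions of the sketch.

```python
# Sketch of the zero-intensity limiting case (assumptions: Gaussian measurement
# errors with sigma = 1, multiplicity m = 3).  Not the original script.
import numpy as np

rng = np.random.default_rng(1)

# "Measure" many spots of true intensity zero, m times each.
m, nspots = 3, 1_000_000
I_obs = rng.normal(0.0, 1.0, size=(nspots, m))
I_mean = I_obs.mean(axis=1, keepdims=True)
frac_dev = ((I_obs - I_mean) / I_mean).ravel()        # (I - <I>)/<I>

# Pure Cauchy (Lorentzian) deviates from uniform numbers, as in the post.
u = rng.uniform(size=frac_dev.size)
cauchy = np.tan(np.pi * (u - 0.5))

for name, x in (("(I-<I>)/<I>, true I = 0", frac_dev),
                ("tan(pi*(u-0.5))", cauchy)):
    # The mean wanders from seed to seed (undefined mean); the median is
    # reproducibly close to zero.
    print("%-24s  mean = %12.3f   median = %8.3f"
          % (name, x.mean(), np.median(x)))
```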
Then again, the silly part of all this is that all we really want is the "middle" of the histogram of (I - <I>)/<I>. In fact, if you just pick the "most common" Rmerge, you get a much better estimate of the "true" Rmerge in a given resolution bin than you would by averaging a hundred times more data. Such procedures are called "robust estimators" in statistics, and the robust-estimator equivalents of the average and the "rms deviation from the average" are the median and the "median absolute deviation from the median". If you make a list of Lorentzian random numbers as above and compute the median, you will get a value very close to zero, even with modest multiplicity! And the "median absolute deviation from the median" rapidly converges to 1, which matches the "full width at half maximum" of the histogram quite nicely.

So, what are the practical implications of this? Perhaps instead of the average Rmerge in each bin we should be looking at the median Rmerge? This will be the same as the average for the cases where I/sigma(I) > 3, but still be "well behaved" for the bins that contain only weak data. The last two graphs attached compare the average Rmerge (red) with the median (blue). These were computed for the same data, and clearly the median Rmerge is a lot more stable. The blue error bars are the median absolute deviation from the median. True, the median Rmerge curves down a bit as I/sigma approaches zero at m = 3, but at least it doesn't go negative!

Just a suggestion.

-James Holton
MAD Scientist

On 3/9/2011 8:33 AM, Graeme Winter wrote:

Hi James,

May I just offer a short counter-argument to your case for not including weak reflections in the merging residuals?

Unlike many people, I rather like Rmerge - not because it tells you how good the data are, but because it gives you a clue as to how well the unmerged measurements agree with one another. It has already been mentioned on this thread that Rmerge is ~0.8 / <I/sigma>, which means that the inverse is also true: an Rmerge of 0.8 indicates that the average measurement in the shell has an I/sigma of ~1 (presuming there are sufficient multiple measurements - if the multiplicity is < 3 or so this can be nonsense).

This does not depend on the error model or the multiplicity. It just talks about the average. Now, if we exclude all measurements with an I/sigma of less than three, we have no idea how strong the reflections in the shell are on average. We're just top-slicing the good reflections and asking how well they agree. Well, with an I/sigma > 3 I would hope they agree rather well if your error model is reasonable. It would suddenly become rare to see an Rmerge > 0.3 in the outer shell.

I like Rpim. It tells you how good the average measurement should be, provided you have not too much radiation damage. However, without Rmerge I can't get a real handle on how well the measurements agree.

Personally, what I would like to see is the full contents of the Scala log file available as graphs, along with Rd from xdsstat and some other choice statistics, so you can get a relatively complete picture; however, I appreciate that this is unrealistic :o)

Just my 2c.
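A quick way to see the difference James suggests, using the same toy Gaussian-error model as in the sketch above (again, not the scripts behind the attached graphs): generate many small bins of weak data, compute a conventional Rmerge for each, and compare the average over trials with the median and the median absolute deviation from the median.

```python
# Average vs median Rmerge for a bin of weak data (toy model: Gaussian errors,
# sigma = 1, true I/sigma = 0.2, multiplicity 3, 20 reflections per bin).
import numpy as np

rng = np.random.default_rng(2)

def rmerge_one_trial(i_over_sigma, m, nspots, rng):
    I_obs = i_over_sigma + rng.normal(0.0, 1.0, size=(nspots, m))  # sigma = 1
    I_mean = I_obs.mean(axis=1, keepdims=True)
    return np.abs(I_obs - I_mean).sum() / I_obs.sum()   # conventional Rmerge

trials = np.array([rmerge_one_trial(0.2, 3, 20, rng) for _ in range(2000)])

# The bin's total intensity occasionally lands near zero, so a handful of
# trials produce enormous (or negative) Rmerge values that drag the average
# around; the median and the MAD barely notice them.
mad = np.median(np.abs(trials - np.median(trials)))
print("average Rmerge over trials: %12.3f  (rms spread %.3f)"
      % (trials.mean(), trials.std()))
print("median  Rmerge over trials: %12.3f  (MAD %.3f)"
      % (np.median(trials), mad))
```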
Cheerio,

Graeme

On 8 March 2011 20:07, James Holton <jmhol...@lbl.gov> wrote:

Although George does not mention anything about data reduction programs, I take from his description that common small-molecule data processing packages (SAINT, others?) have also been modernized to record all data (no "I/sigmaI > 2 or 3" cutoff). I agree with him that this is a good thing! And it is also a good thing that small-molecule refinement programs use all data. I just don't think it is a good idea to use all data in R-factor calculations.

Like Ron, I will probably be dating myself when I say that when I first got into the macromolecular crystallography business, it was still commonplace to use a 2-3 sigma spot intensity cutoff. In fact, this is the reason why the PDB wants to know your "completeness" in the outermost resolution shell (in those days, the outer resolution was defined by where completeness drops to ~80% after the 3 sigma spot cutoff). My experience with this, however, was brief, as the maximum-likelihood revolution was just starting to take hold, and the denzo manual specifically stated that only bad people use sigma cutoffs > -3.0. Nevertheless, like many crystallographers from this era, I have fond memories of the REALLY low R factors you can get by using this arcane and now reviled practice. Rsym values of 1-2% were common.

It was only recently that I learned enough about statistics to understand the wisdom of my ancestors and that a 3-sigma cutoff is actually the "right thing to do" if you want to measure a fractional error (like an R factor). That is all I'm saying.

-James Holton
MAD Scientist

On 3/6/2011 2:50 PM, Ronald E Stenkamp wrote:

My small-molecule experience is old enough (maybe 20 years) that I doubt it is even close to representing current practices (best or otherwise). Given George's comments, I suspect (and hope) that less-than cutoffs are historical artifacts at this point, kept around in software for making comparisons with older structure determinations. But a bit of scanning of Acta papers and others might be necessary to confirm that.

Ron

On Sun, 6 Mar 2011, James Holton wrote:

Yes, I would classify anything with I/sigmaI < 3 as "weak". And yes, of course it is possible to get "weak" spots from small-molecule crystals. After all, there is no spot so "strong" that it cannot be defeated by a sufficient amount of background! I just meant that, relatively speaking, the intensities diffracted from a small-molecule crystal are orders of magnitude brighter than those from a macromolecular crystal of the same size, and even the same quality (the 1/Vcell^2 term in Darwin's formula).

I find it interesting that you point out the use of a 2 sigma(I) intensity cutoff for small-molecule data sets! Is this still common practice? I am not a card-carrying "small-molecule crystallographer", so I'm not sure. However, if that is the case, then by definition there are no "weak" intensities in the data set. And this is exactly the kind of data you want for least-squares refinement targets and for computing "% error" quality metrics like R factors. For likelihood targets, however, the "weak" data are actually a powerful restraint.
-James Holton
MAD Scientist

On 3/6/2011 11:22 AM, Ronald E Stenkamp wrote:

Could you please expand on your statement that "small-molecule data has essentially no weak spots"? The small-molecule data sets I've worked with have had large numbers of "unobserved" reflections where I used 2 sigma(I) cutoffs (maybe 15-30% of the reflections). Would you consider those "weak" spots or not?

Ron

On Sun, 6 Mar 2011, James Holton wrote:

I should probably admit that I might be indirectly responsible for the resurgence of this I/sigma > 3 idea, but I never intended it in the way described by the original poster's reviewer!

What I have been trying to encourage people to do is calculate R factors using only hkls for which the signal-to-noise ratio is > 3. Not refinement! Refinement should be done against all data. I merely propose that weak data be excluded from R-factor calculations after the refinement/scaling/merging/etc. is done.

This is because R factors are a metric of the FRACTIONAL error in something (aka a "% difference"), but a "% error" is only meaningful when the thing being measured is not zero. However, in macromolecular crystallography, we tend to measure a lot of "zeroes". There is nothing wrong with measuring zero! An excellent example of this is confirming that a systematic absence is in fact "absent". The "sigma" on the intensity assigned to an absent spot is still a useful quantity, because it reflects how confident you are in the measurement: a sigma of "10" rather than "100" means you are more sure that the intensity is zero. However, there is no "R factor" for systematic absences. How could there be? This is because the definition of "% error" starts to break down as the "true" spot intensity gets weaker, and it becomes completely meaningless when the "true" intensity reaches zero.

Historically, I believe the widespread use of R factors came about because small-molecule data has essentially no weak spots. With the exception of absences (which are not used in refinement), spots from "salt crystals" are strong all the way out to the edge of the detector (even out to the "limiting sphere", which is defined by the X-ray wavelength). So, when all the data are strong, a "% error" is an easy-to-calculate quantity that actually describes the "sigma"s of the data very well. That is, sigma(I) of strong spots tends to be dominated by things like beam flicker, spindle stability, shutter accuracy, etc. These usually add up to ~5% error, and indeed even the Braggs could typically get +/-5% for the intensity of the diffracted rays they were measuring. Things like Rsym were therefore created to check that nothing "funny" happened in the measurement.

For similar reasons, the quality of a model refined against all-strong data is described very well by a "% error", and this is why the refinement R factors rapidly became popular. Most people intuitively know what you mean if you say that your model fits the data to "within 5%". In fact, a widely used criterion for the correctness of a "small molecule" structure is that the refinement R factor must be LOWER than Rsym. This is equivalent to saying that your curve (model) fits your data "to within experimental error". Unfortunately, this has never been the case for macromolecular structures!
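As a concrete illustration of the bookkeeping proposed a few paragraphs up (refine against everything, but restrict the R-factor sums to strong reflections), here is a minimal sketch. Iobs, Icalc and sigI are hypothetical merged-intensity arrays, not the output of any particular program, and the 10% model error and the mix of strong and near-zero reflections are assumptions chosen for the example.

```python
# Minimal sketch of an R factor computed over all data vs over strong data
# only.  Hypothetical arrays; not tied to any particular program's output.
import numpy as np

def r_factor(Iobs, Icalc, sigI=None, strong_cutoff=None):
    """R = sum|Iobs - Icalc| / sum(Iobs), optionally restricted to
    reflections with Iobs/sigI > strong_cutoff."""
    Iobs, Icalc = np.asarray(Iobs, float), np.asarray(Icalc, float)
    if strong_cutoff is not None:
        keep = Iobs / np.asarray(sigI, float) > strong_cutoff
        Iobs, Icalc = Iobs[keep], Icalc[keep]
    return np.abs(Iobs - Icalc).sum() / Iobs.sum()

# Toy data set: 900 near-zero reflections, 100 strong ones, a model that is
# good to ~10%, and measurement errors of sigma = 1 throughout.
rng = np.random.default_rng(3)
I_true = np.concatenate([np.full(900, 0.2), np.full(100, 20.0)])
sigI   = np.ones_like(I_true)
Iobs   = I_true + rng.normal(0.0, sigI)
Icalc  = I_true * rng.normal(1.0, 0.10, size=I_true.size)

print("R (all data):         %.3f" % r_factor(Iobs, Icalc))
print("R (I/sigma > 3 only): %.3f" % r_factor(Iobs, Icalc, sigI, 3.0))
# Only the second number reflects the ~10% fractional error of the "model";
# the first is dominated by the noise on the near-zero measurements.
```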
The problem with protein crystals, of course, is that we have lots of "weak" data. And by "weak", I don't mean "bad"! Yes, it is always nicer to have more intense spots, but there is nothing shameful about knowing that certain intensities are actually very close to zero. In fact, from the point of view of the refinement program, isn't describing some high-angle spot as "zero, plus or minus 10" better than "I have no idea"? Indeed, several works mentioned already, as well as the "free lunch algorithm", have demonstrated that these "zero" data can actually be useful, even if they are well beyond the "resolution limit".

So, what do we do? I see no reason to abandon R factors, since they have such a long history and give us continuity of criteria going back almost a century. However, I also see no reason to punish ourselves by including lots of zeroes in the denominator. In fact, using weak data in an R-factor calculation defeats their best feature: R factors are a very good estimate of the fractional component of the total error, provided they are calculated with strong data only.

Of course, with strong and weak data, the best thing to do is compare the model-data disagreement with the magnitude of the error. That is, compare |Fobs-Fcalc| to sigma(Fobs), not to Fobs itself. Modern refinement programs do this! And I say the more data the merrier.

-James Holton
MAD Scientist

On 3/4/2011 5:15 AM, Marjolein Thunnissen wrote:

Hi,

Recently, on a paper I submitted, it was the editor of the journal who wanted exactly the same thing. I never argued with the editor about this (should have, maybe), but it could be one cause of the epidemic that Bart Hazes saw...

Best regards,

Marjolein

On Mar 3, 2011, at 12:29 PM, Roberto Battistutta wrote:

Dear all,
I got a reviewer comment that indicates the "need to refine the structures at an appropriate resolution (I/sigmaI of > 3.0), and re-submit the revised coordinate files to the PDB for validation." In the manuscript I present some crystal structures determined by molecular replacement using the same protein in a different space group as the search model. Does anyone know the origin or the theoretical basis of this "I/sigmaI > 3.0" rule for an appropriate resolution?
Thanks,
Bye,
Roberto.

Roberto Battistutta
Associate Professor
Department of Chemistry
University of Padua
via Marzolo 1, 35131 Padova - ITALY
tel. +39.049.8275265/67
fax +39.049.8275239
roberto.battistu...@unipd.it
www.chimica.unipd.it/roberto.battistutta/
VIMM (Venetian Institute of Molecular Medicine)
via Orus 2, 35129 Padova - ITALY
tel. +39.049.7923236
fax +39.049.7923250
www.vimm.it
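Finally, a toy contrast illustrating James's point above about comparing |Fobs-Fcalc| to sigma(Fobs) rather than to Fobs itself: for reflections whose model values are exactly right, the sigma-scaled residual averages the same number whether the reflection is strong or nearly absent, while the fractional residual explodes as Fobs approaches zero. The Gaussian errors and the specific amplitudes are assumptions of the sketch, not anything from the thread.

```python
# Toy contrast: fractional vs sigma-scaled residuals for a strong and a
# near-zero reflection, each with a "perfect" model value and sigma = 2.
import numpy as np

rng = np.random.default_rng(4)
n, sig = 100_000, 2.0

for label, F_true in (("strong", 100.0), ("near-zero", 0.5)):
    Fobs  = F_true + rng.normal(0.0, sig, size=n)   # noisy observations
    Fcalc = F_true                                  # model exactly right
    frac   = np.abs(Fobs - Fcalc) / np.abs(Fobs)    # "% error" view
    scaled = np.abs(Fobs - Fcalc) / sig             # error-bar view
    print("%9s:  mean |Fo-Fc|/Fo = %12.3f   mean |Fo-Fc|/sigma = %5.2f"
          % (label, frac.mean(), scaled.mean()))
# The sigma-scaled residual averages ~0.8 in both cases (E|N(0,1)| = sqrt(2/pi));
# the fractional residual is a few percent for the strong reflection but wildly
# large and unstable for the near-zero one, even though the model is "correct".
```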