[ccp4bb] I/sigma continued

Phil Evans Mon, 30 Mar 2009 07:43:05 -0700

“I/sigma” statistics seem to be contentious & confusing (see recent discussions on CCP4BB), particularly in what the various measures should be called (and how they should be labelled in a table, where there is only room for a very short name). I thought it worth commenting on this issue at a little more length.


There are several interacting issues:

1) Statistics can be calculated either for individual observations Ihl or for intensities averaged over multiple (symmetry-related or replicate) measurements Ih(avg): both are useful, but they need to be distinguished.

2) The statistic can be (a) the ratio of means <I>/<sigma> or (b) the mean of ratios <I/sigma>. These are not the same.

3) The “sigma” used in 2(a) can be either (a) the estimated corrected SD or (b) the RMS scatter of observations, i.e. the RMS deviation (which is itself generally used to estimate a “correction” to the SD). The RMS scatter cannot be used for 2(b) of course, since that needs individual sigmas for each reflection.


4) Values will depend on how many outliers have been rejected.
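The difference in point 2 is easy to see numerically. The following is a minimal sketch with made-up intensities and sigmas (the numbers and function names are hypothetical, chosen only for illustration): the ratio of means is dominated by the strong reflections, while the mean of ratios weights every reflection equally.

```python
from statistics import mean

# Hypothetical intensities and sigmas for four reflections (made-up numbers).
intensities = [100.0, 50.0, 10.0, 2.0]
sigmas = [5.0, 5.0, 4.0, 4.0]


def ratio_of_means(i_values, sigma_values):
    """<I>/<sigma>: average each quantity first, then divide."""
    return mean(i_values) / mean(sigma_values)


def mean_of_ratios(i_values, sigma_values):
    """<I/sigma>: divide reflection by reflection, then average."""
    return mean(i / s for i, s in zip(i_values, sigma_values))


print("ratio of means:", ratio_of_means(intensities, sigmas))
print("mean of ratios:", mean_of_ratios(intensities, sigmas))
```

For these numbers the two statistics differ (9.0 against 8.25), even though they are computed from the same data.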

For what it’s worth, Scala outputs two such statistics:-

(i) “I/sigma”: this is calculated for individual observations Ihl and is the (mean intensity <Ihl>)/(RMS scatter of Ihl). RMS scatter = RMS [Ihl - Ih(avg)]. This is some measure of the average significance of individual observations, but does not take into account multiplicity. In my new program under development (a Scala replacement) I have relabelled this column “I/RMS” but I don’t really know what best to call it. This value is a ratio of means (see 2(a) above).

(ii) “Mn(I/sd)”: this is the mean value of (Ih(avg)/sd(Ih(avg))), where Ih(avg) is the (weighted) average over all observations for reflection h, and sd(Ih(avg)) is the estimated SD of this average, after any “corrections” have been applied. This is, I think, the best estimate of “signal-to-noise ratio”, but does depend on realistic estimates of sd(Ih(avg)), which is not entirely straightforward (and certainly doesn’t allow for systematic errors!). This value is a mean of ratios (see 2(b) above).
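The two statistics can be sketched as follows. This is not Scala's code: the data are made up, the function names are hypothetical, and the inverse-variance weighting used for Ih(avg) is an assumption (the weighting scheme is not specified above). It is only meant to show which quantity is averaged where.

```python
import math

# Hypothetical observations: reflection index -> list of (Ihl, sd(Ihl)).
observations = {
    (1, 0, 0): [(100.0, 10.0), (110.0, 10.0), (90.0, 10.0), (100.0, 10.0)],
    (0, 1, 0): [(20.0, 5.0), (30.0, 5.0)],
}


def weighted_mean_and_sd(obs):
    """Ih(avg) and sd(Ih(avg)), ASSUMING inverse-variance weights."""
    weights = [1.0 / sd ** 2 for _, sd in obs]
    total = sum(weights)
    i_avg = sum(w * i for w, (i, _) in zip(weights, obs)) / total
    return i_avg, 1.0 / math.sqrt(total)


def i_over_rms(observations):
    """Statistic (i): <Ihl> / RMS[Ihl - Ih(avg)], a ratio of means."""
    values, sq_devs = [], []
    for obs in observations.values():
        i_avg, _ = weighted_mean_and_sd(obs)
        for i, _ in obs:
            values.append(i)
            sq_devs.append((i - i_avg) ** 2)
    rms = math.sqrt(sum(sq_devs) / len(sq_devs))
    return (sum(values) / len(values)) / rms


def mean_i_over_sd(observations):
    """Statistic (ii): mean over reflections of Ih(avg)/sd(Ih(avg))."""
    ratios = []
    for obs in observations.values():
        i_avg, sd_avg = weighted_mean_and_sd(obs)
        ratios.append(i_avg / sd_avg)
    return sum(ratios) / len(ratios)


print("I/RMS   :", i_over_rms(observations))
print("Mn(I/sd):", mean_i_over_sd(observations))
```

Note that only the second function benefits from multiplicity: sd(Ih(avg)) shrinks as more observations contribute to a reflection, whereas the RMS scatter of individual observations does not.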




The “corrected” sd(Ihl) is calculated in Scala for each observation as

    sd(Ihl)corrected = SdFac * sqrt{sd(I)**2 + SdB*Ihl*LP + (SdAdd*Ihl)**2}

with the parameters SdFac, SdB & SdAdd determined by trying to make the RMS of the normalised deviation Delta(hl) = (Ihl - Ih(avg))/sd(Ihl)corrected equal to 1.0 for all intensity ranges (different parameters for each run). If the sd estimates are correct, then the distribution of Delta(hl) should have SD = 1.0, and this “correction” tries to enforce this. This is more or less equivalent to making the RMS scatter == average SD. However, the uncertainties in how best to estimate the real error do then influence the reliability of the Mn(I/sd) statistic (see (ii) above).
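As a rough illustration of the formula and of the quantity being driven to 1.0, here is a minimal sketch. The function names and the parameter values in the example call are hypothetical, and the fitting of SdFac, SdB and SdAdd themselves is not shown.

```python
import math


def corrected_sd(sd_i, i_hl, lp, sd_fac, sd_b, sd_add):
    """sd(Ihl)corrected = SdFac * sqrt(sd(I)**2 + SdB*Ihl*LP + (SdAdd*Ihl)**2)."""
    variance = sd_i ** 2 + sd_b * i_hl * lp + (sd_add * i_hl) ** 2
    return sd_fac * math.sqrt(variance)


def rms_normalised_deviation(triples):
    """RMS of Delta(hl) = (Ihl - Ih(avg)) / sd(Ihl)corrected.

    `triples` is a list of (Ihl, Ih(avg), corrected sd) tuples.
    """
    deltas = [(i_hl - i_avg) / sd for i_hl, i_avg, sd in triples]
    return math.sqrt(sum(d * d for d in deltas) / len(deltas))


# Made-up example values, not recommended parameters.
sd = corrected_sd(sd_i=3.0, i_hl=100.0, lp=1.0, sd_fac=1.0, sd_b=0.0, sd_add=0.04)
print("corrected sd:", sd)
print("RMS Delta   :", rms_normalised_deviation([(110.0, 100.0, 10.0), (90.0, 100.0, 10.0)]))
```

An RMS Delta above 1.0 in some intensity range would suggest the sds are underestimated there, and below 1.0 that they are overestimated.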

So what statistics do we want to look at? Probably the main reason for looking at signal/noise statistics is to choose a “real resolution” cutoff, from some sort of signal/noise ratio. It isn’t clear (to me) what is the best way of doing this, and it is particularly difficult if the data are significantly anisotropic. The multiplicity needs to be taken into account, so the individual “I/sigma” (see (i) above) isn’t the best guide. Personally, I generally cut data at around the point where Mn(I/sd) =~ 2, but I would cut off at <2 for anisotropic data. I also find a useful guide from the correlation coefficient between Ih(avg) (Imean) pairs in half-datasets (plotted by Scala): the CC should be >0.5 at least, I think.
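The half-dataset correlation can be sketched as below. How Scala assigns observations to the two halves is not described here, so the random split, the unweighted half-means and all names in this example are assumptions made purely for illustration; in practice the CC would be computed per resolution shell.

```python
import math
import random


def pearson_cc(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def half_dataset_cc(observations, seed=0):
    """CC between per-reflection mean intensities of two random halves.

    ASSUMPTION: observations of each reflection are split at random and
    averaged without weights; reflections measured once are skipped.
    """
    rng = random.Random(seed)
    half1, half2 = [], []
    for values in observations.values():
        if len(values) < 2:
            continue
        shuffled = values[:]
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half1.append(sum(shuffled[:mid]) / mid)
        half2.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return pearson_cc(half1, half2)


# Hypothetical intensities for three reflections (made-up numbers).
example = {
    (1, 0, 0): [100.0, 105.0, 95.0, 102.0],
    (0, 1, 0): [50.0, 48.0, 53.0, 51.0],
    (0, 0, 1): [5.0, 8.0, 3.0, 6.0],
}
print("half-dataset CC:", half_dataset_cc(example))
```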

Note that the overall value of any of these statistics over all resolution ranges is not very useful and can be confusing, depending on the distribution of intensities, since it mixes up strong low resolution data (high signal/noise) with weak high resolution data (low signal/noise).

That leaves the question of how to label these statistics in a consistent, clear and concise way: suggestions?


Phil Evans
