“I/sigma” statistics seem to be contentious & confusing (see recent discussions on CCP4BB), particularly in what the various measures should be called (and how they should be labelled in a table, where there is only room for a very short name). I thought it worth commenting on this issue at a little more length.

There are several interacting issues:

1) Statistics can be calculated either for individual observations Ihl or for intensities averaged over multiple (symmetry-related or replicate) measurements Ih(avg): both are useful, but they need to be distinguished.

2) The statistic can be (a) the ratio of means <I>/<sigma> or (b) the mean of ratios <I/sigma>. These are not the same.

3) The “sigma” used in 2(a) can be either (a) the estimated corrected SD or (b) the RMS scatter of observations, i.e. the RMS deviation (which is itself generally used to estimate a “correction” to the SD). The RMS scatter cannot be used for 2(b) of course, since that needs individual sigmas for each reflection.

4) Values will depend on how many outliers have been rejected.
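To make point 2 concrete, here is a minimal sketch with made-up numbers (one strong reflection and three weak ones; the values and function names are purely illustrative and are not taken from any program). It shows how far apart the ratio of means and the mean of ratios can be for the same data.

```python
from statistics import mean

# Hypothetical intensities and sigmas: one strong and three weak reflections.
intensities = [1000.0, 10.0, 10.0, 10.0]
sigmas = [20.0, 5.0, 5.0, 5.0]


def ratio_of_means(i_values, sig_values):
    """2(a): <I>/<sigma>, dominated by the strongest reflections."""
    return mean(i_values) / mean(sig_values)


def mean_of_ratios(i_values, sig_values):
    """2(b): <I/sigma>, every reflection contributes its own ratio."""
    return mean(i / s for i, s in zip(i_values, sig_values))


rom = ratio_of_means(intensities, sigmas)
mor = mean_of_ratios(intensities, sigmas)
print(f"<I>/<sigma> = {rom:.2f}   <I/sigma> = {mor:.2f}")
```

With these invented numbers the ratio of means is roughly twice the mean of ratios, which is why the two need to be labelled differently.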

For what it’s worth, Scala outputs two such statistics:-

(i) “I/sigma”: this is calculated for individual observations Ihl and is the (mean intensity <Ihl>)/(RMS scatter of Ihl). RMS scatter = RMS [Ihl – Ih(avg)]. This is some measure of the average significance of individual observations, but does not take into account multiplicity. In my new program under development (a Scala replacement) I have relabelled this column “I/RMS” but I don’t really know what best to call it. This value is a ratio of means (see 2(a) above).

(ii) “Mn(I/sd)”: this is the mean value of (Ih(avg)/sd(Ih(avg))), where Ih(avg) is the (weighted) average over all observations for reflection h, and sd(Ih(avg)) is the estimated SD of this average, after any “corrections” have been applied. This is, I think, the best estimate of “signal-to-noise ratio”, but does depend on realistic estimates of sd(Ih(avg)), which is not entirely straightforward (and certainly doesn’t allow for systematic errors!). This value is a mean of ratios (see 2(b) above).
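The two statistics can be sketched roughly as follows. This is not Scala's code: the data layout, the function names, the inverse-variance weighting for Ih(avg) and the use of a plain RMS (no multiplicity correction) are all assumptions made for illustration.

```python
import math

# Hypothetical data: reflection index h -> list of (Ihl, sd(Ihl)) observations.
observations = {
    (1, 0, 0): [(100.0, 10.0), (110.0, 10.0), (90.0, 10.0)],
    (2, 1, 0): [(20.0, 5.0), (30.0, 5.0)],
}


def weighted_mean_and_sd(obs):
    """Inverse-variance weighted Ih(avg) and its SD (assumed weighting scheme)."""
    weights = [1.0 / sd**2 for _, sd in obs]
    wsum = sum(weights)
    i_avg = sum(w * i for w, (i, _) in zip(weights, obs)) / wsum
    return i_avg, math.sqrt(1.0 / wsum)


def i_over_rms(data):
    """(i) Ratio of means: <Ihl> / RMS[Ihl - Ih(avg)] over all observations."""
    all_i, sq_dev = [], []
    for obs in data.values():
        i_avg, _ = weighted_mean_and_sd(obs)
        for i, _ in obs:
            all_i.append(i)
            sq_dev.append((i - i_avg) ** 2)
    rms_scatter = math.sqrt(sum(sq_dev) / len(sq_dev))
    return (sum(all_i) / len(all_i)) / rms_scatter


def mn_i_over_sd(data):
    """(ii) Mean of ratios: mean over reflections of Ih(avg)/sd(Ih(avg))."""
    ratios = []
    for obs in data.values():
        i_avg, sd_avg = weighted_mean_and_sd(obs)
        ratios.append(i_avg / sd_avg)
    return sum(ratios) / len(ratios)


print(f"I/RMS = {i_over_rms(observations):.2f}")
print(f"Mn(I/sd) = {mn_i_over_sd(observations):.2f}")
```

Note that only the second function benefits from multiplicity: sd(Ih(avg)) shrinks as more observations are averaged, whereas the RMS scatter of individual observations does not.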



The “corrected” sd(Ihl) is calculated in Scala for each observation as

  sd(Ihl)corrected = SdFac * sqrt{sd(I)**2 + SdB*Ihl*LP + (SdAdd*Ihl)**2}

with the parameters SdFac, SdB & SdAdd determined by trying to make the RMS of the normalised deviation Delta(hl) = (Ihl - Ih(avg))/sd(Ihl)corrected equal to 1.0 in all intensity ranges (different parameters for each run). If the sd estimates are correct, then the distribution of Delta(hl) should have SD = 1.0, and this “correction” tries to enforce this. This is more or less equivalent to making the RMS scatter == average SD. However, the uncertainties in how best to estimate the real error do then influence the reliability of the Mn(I/sd) statistic (see (ii) above).
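The correction formula and the quantity it is tuned against can be written out as a short sketch. The function names and argument order are hypothetical, the error raised for a negative variance is my own choice, and the fitting of SdFac, SdB and SdAdd is not shown.

```python
import math


def corrected_sd(sd_i, i_hl, lp, sd_fac, sd_b, sd_add):
    """SdFac * sqrt(sd(I)**2 + SdB*Ihl*LP + (SdAdd*Ihl)**2), as in the text."""
    variance = sd_i**2 + sd_b * i_hl * lp + (sd_add * i_hl) ** 2
    if variance < 0.0:
        # Possible for negative Ihl with positive SdB; handling is an assumption.
        raise ValueError("corrected variance is negative")
    return sd_fac * math.sqrt(variance)


def rms_normalised_deviation(obs, i_avg, lp, sd_fac, sd_b, sd_add):
    """RMS of Delta(hl) = (Ihl - Ih(avg)) / sd(Ihl)corrected for one reflection.

    obs is a list of (Ihl, sd(I)) pairs; a single LP value is assumed for all.
    """
    deltas = [
        (i - i_avg) / corrected_sd(sd, i, lp, sd_fac, sd_b, sd_add)
        for i, sd in obs
    ]
    return math.sqrt(sum(d * d for d in deltas) / len(deltas))


# Hypothetical observations whose raw sds (3.0) underestimate the scatter (10.0).
obs = [(110.0, 3.0), (90.0, 3.0)]
print(rms_normalised_deviation(obs, 100.0, 1.0, 1.0, 0.0, 0.0))
print(rms_normalised_deviation(obs, 100.0, 1.0, 10.0 / 3.0, 0.0, 0.0))
```

In this toy case the uncorrected sds give an RMS normalised deviation well above 1, and scaling them by SdFac alone brings it back to 1. Real data would need the fit to be done separately in intensity ranges so that SdB and SdAdd are determined too.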

So what statistics do we want to look at? Probably the main reason for looking at signal/noise statistics is to choose a “real resolution” cutoff, from some sort of signal/noise ratio. It isn’t clear (to me) what is the best way of doing this, and it is particularly difficult if the data are significantly anisotropic. The multiplicity needs to be taken into account, so the individual “I/sigma” (see (i) above) isn’t the best guide. Personally, I generally cut data at around the point where Mn(I/sd) =~ 2, but I would cut off at <2 for anisotropic data. I also find a useful guide from the correlation coefficient between Ih(avg) (Imean) pairs in half-datasets (plotted by Scala): the CC should be >0.5 at least, I think.
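As a rough sketch of how these rules of thumb might be applied per resolution shell: the shell table, the function names and the decision to require both criteria at once are my assumptions, not a recipe from Scala. The thresholds (Mn(I/sd) of about 2, CC above 0.5) are the ones mentioned above.

```python
import math


def pearson_cc(xs, ys):
    """Plain Pearson correlation, e.g. between Imean values in two half-datasets."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def suggest_cutoff(shells, min_i_over_sd=2.0, min_cc=0.5):
    """Return d_min of the last shell passing both tests, or None.

    shells: (d_min, Mn(I/sd), half-dataset CC), ordered low to high resolution.
    """
    cutoff = None
    for d_min, mn_i_sd, cc_half in shells:
        if mn_i_sd < min_i_over_sd or cc_half <= min_cc:
            break
        cutoff = d_min
    return cutoff


# Hypothetical per-shell statistics.
shells = [(4.0, 15.0, 0.99), (3.0, 6.0, 0.95), (2.5, 2.4, 0.70), (2.2, 1.1, 0.35)]
print(suggest_cutoff(shells))
```

A single isotropic cutoff like this is exactly what becomes questionable for anisotropic data, so the result is only a starting point.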

Note that the overall value of any of these statistics over all resolution ranges is not very useful and can be confusing, depending on the distribution of intensities, since it mixes up strong low resolution data (high signal/noise) with weak high resolution data (low signal/noise).
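A small invented example of why the overall value misleads: the reflection list, the shell edges and the function name below are all hypothetical.

```python
from statistics import mean

# Hypothetical (d-spacing in Angstrom, I/sd) pairs for six reflections.
reflections = [(5.0, 40.0), (4.2, 30.0), (3.5, 20.0),
               (2.4, 3.0), (2.2, 2.0), (2.1, 1.0)]


def shell_means(refl, edges):
    """Mean I/sd per shell; edges are descending d-spacing boundaries."""
    shells = []
    for d_max, d_min in zip(edges, edges[1:]):
        values = [ratio for d, ratio in refl if d_min <= d < d_max]
        shells.append((d_max, d_min, mean(values) if values else None))
    return shells


overall = mean(ratio for _, ratio in reflections)
per_shell = shell_means(reflections, [10.0, 3.0, 2.0])
print("overall:", overall)
for d_max, d_min, value in per_shell:
    print(f"{d_max:.1f}-{d_min:.1f} A: {value}")
```

Here the overall mean looks comfortable, yet the outer shell on its own sits right at the cutoff level, which is the information one actually needs.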

That leaves the question of how to label these statistics in a consistent, clear and concise way: suggestions?

Phil Evans
