I agree absolutely with James - be as succinct as you like in a table
but include the verbose definition for each entry in the log file - or
at the very least in the manual. It should be easy to search for with
the table tag.
People will not go and read a reference.
Eleanor
James Holton wrote:
I think the best way to deal with issues like this can be found in
Strunk & White "The Elements of Style" (1918). Among other things,
these authors put forward a rather simple yet often overlooked rule to
writing in general, which I think applies equally well to computer
programs:
"Be clear."
That sentence is itself an example of how brevity need not sacrifice
clarity. Yes, the labels in the table itself need to be short, but
there is space immediately below (and above) every table that (IMHO)
ought to contain the definitions of each and every
variable/abbreviation used in the table, spelled out no matter how
obvious it may seem to the author. I can tell you many long and
painful stories about me trying to figure out what some variable in
some equation in some paper actually meant! Context is everything.
If you are tight for space, cite a reference (such as the manual).
Also, scientists talking about such quantities in email, papers,
etc. (such as myself) should heed Strunk and White as well, and not
just assume that everyone knows exactly what "structure factor" means
as opposed to a "structure amplitude", let alone I/sigma. Indeed, the
word "intensity" is incredibly ill-defined all by itself, to the point
of being useless. It can have units of photons, photons/s,
photons/area/s, photons/area, energy/volume, and many many more.
Often even in the same equation!
I would strongly advise against changing the "variable names" printed
out in log files by SCALA and other programs, especially when a given
name has persisted for a decade or more. Adding an "inline
definition" is fine, but changing names not only breaks programs that
were written to read these logs (and sometimes even humans reading the
log), but it also confines the meaning of "I/SIGMA from SCALA" to a
particular period in history.
So, what statistic do we want to look at? That depends on what you
are trying to do with the data. There is no way for Phil to know
this, so it is good that he prints out lots of different statistics.
That said, when talking about the data quality requirements for
structure solution by MAD/SAD, I suggest looking at I/sigma(I) where:
I - merged intensity (proportional to photons) assigned to a
reciprocal lattice point (hkl index)
sigma(I) - the error assigned to I
Exactly what I/sigma(I) is required to solve a structure, or to make
some conclusion about a solved structure, is a topic for another day.
-James Holton
MAD Scientist
Phil Evans wrote:
“I/sigma” statistics seem to be contentious & confusing (see recent
discussions on CCP4BB), particularly in what the various measures
should be called (and how they should be labelled in a table, where
there is only room for a very short name). I thought it worth
commenting on this issue at a little more length.
There are several interacting issues:
1) Statistics can be calculated either for individual observations
Ihl or for intensities averaged over multiple (symmetry-related or
replicate) measurements Ih(avg): both are useful, but they need to be
distinguished.
2) The statistic can be (a) the ratio of means <I>/<sigma> or (b) the
mean of ratios <I/sigma>. These are not the same; a small numeric
example follows this list.
3) The “sigma” used in 2(a) can be either (a) the estimated corrected
SD or (b) the RMS scatter of observations ie the RMS deviation (which
is itself generally used to estimate a “correction” to the SD). The
RMS scatter cannot be used for 2(b) of course, since that needs
individual sigmas for each reflection.
4) Values will depend on how many outliers have been rejected.
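To make point 2 concrete, here is a tiny sketch in Python with
invented numbers (not taken from any real data set), showing that the
two averages can disagree substantially for the same measurements:

   # Invented numbers, purely to illustrate point 2 above
   I     = [100.0, 4.0]        # two intensities
   sigma = [ 10.0, 4.0]        # their sigmas

   ratio_of_means = sum(I) / sum(sigma)                            # <I>/<sigma> = 104/14 ~ 7.4
   mean_of_ratios = sum(i / s for i, s in zip(I, sigma)) / len(I)  # <I/sigma> = (10+1)/2 = 5.5
   print(ratio_of_means, mean_of_ratios)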
For what it’s worth, Scala outputs two such statistics:-
(i) “I/sigma”: this is calculated for individual observations Ihl and
is the (mean intensity <Ihl>)/(RMS scatter of Ihl). RMS scatter = RMS
[Ihl - Ih(avg)]. This is some measure of the average significance of
individual observations, but does not take into account multiplicity.
In my new program under development (a Scala replacement) I have
relabelled this column “I/RMS”, but I don’t really know what is best
to call it. This value is a ratio of means (see 2(a) above).
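Purely as an illustration (this is not Scala's code, and the grouping,
weighting and outlier handling are all simplified), a ratio-of-means
statistic of this kind could be computed along these lines in Python:

   import math
   from collections import defaultdict

   def i_over_rms(observations):
       # observations: iterable of (hkl, Ihl) tuples, i.e. individual
       # unmerged intensities.  Returns <Ihl> / RMS[Ihl - Ih(avg)].
       # Sketch only: unweighted averages, and only reflections measured
       # more than once contribute.
       by_hkl = defaultdict(list)
       for hkl, i in observations:
           by_hkl[hkl].append(i)
       intensities, deviations = [], []
       for vals in by_hkl.values():
           if len(vals) < 2:
               continue
           avg = sum(vals) / len(vals)          # Ih(avg)
           for v in vals:
               intensities.append(v)
               deviations.append(v - avg)
       mean_i = sum(intensities) / len(intensities)
       rms = math.sqrt(sum(d * d for d in deviations) / len(deviations))
       return mean_i / rms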
(ii) “Mn(I/sd)”: this is the mean value of (Ih(avg)/sd(Ih(avg))),
where Ih(avg) is the (weighted) average over all observations for
reflection h, and sd(Ih(avg)) is the estimated SD of this average,
after any “corrections” have been applied. This is, I think, the best
estimate of “signal-to-noise ratio”, but does depend on realistic
estimates of sd(Ih(avg)), which is not entirely straightforward (and
certainly doesn’t allow for systematic errors!). This value is a mean
of ratios (see 2(b) above).
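Again as a sketch only (not the actual Scala implementation; it
assumes simple inverse-variance merging of already-corrected sigmas),
statistic (ii) might be computed like this:

   import math
   from collections import defaultdict

   def mn_i_over_sd(observations):
       # observations: iterable of (hkl, Ihl, sd_corrected) tuples.
       # Returns the mean over unique reflections of Ih(avg)/sd(Ih(avg)),
       # a mean of ratios.
       by_hkl = defaultdict(list)
       for hkl, i, sd in observations:
           by_hkl[hkl].append((i, sd))
       ratios = []
       for obs in by_hkl.values():
           weights = [1.0 / (sd * sd) for _, sd in obs]
           sum_w = sum(weights)
           i_avg = sum(w * i for w, (i, _) in zip(weights, obs)) / sum_w
           sd_avg = math.sqrt(1.0 / sum_w)      # SD of the weighted mean
           ratios.append(i_avg / sd_avg)
       return sum(ratios) / len(ratios)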
The “corrected” sd(Ihl) is calculated in Scala for each observation as

   sd(Ihl)corrected = SdFac * sqrt{sd(I)**2 + SdB*Ihl*LP + (SdAdd*Ihl)**2}
with the parameters SdFac, SdB & SdAdd determined by trying to make
the RMS of the normalised deviation

   Delta(hl) = (Ihl - Ih(avg))/sd(Ihl)corrected

equal to 1.0 in every intensity range (different parameters for each
run). If the sd estimates are correct, then the distribution of
Delta(hl) should have SD = 1.0, and this “correction” tries to enforce
that. This is more or less equivalent to making the RMS scatter equal
to the average SD. However, the uncertainties in how best to estimate
the real error do then influence the reliability of the Mn(I/sd)
statistic (see (ii) above).
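In code form, applying that correction to a single observation would
look roughly like the sketch below (the function name and arguments
are invented for illustration, and the refinement of SdFac, SdB and
SdAdd themselves is not shown):

   import math

   def corrected_sd(sd, i_hl, lp, sdfac, sdb, sdadd):
       # sd    : initial sigma estimate for this observation
       # i_hl  : the observed intensity Ihl
       # lp    : the LP factor for this observation
       # sdfac, sdb, sdadd : the refined correction parameters
       return sdfac * math.sqrt(sd**2 + sdb * i_hl * lp + (sdadd * i_hl)**2)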
So what statistics do we want to look at? Probably the main reason
for looking at signal/noise statistics is to choose a “real
resolution” cutoff, from some sort of signal/noise ratio. It isn’t
clear (to me) what the best way of doing this is, and it is
particularly difficult if the data are significantly anisotropic. The
multiplicity needs to be taken into account, so the individual
“I/sigma” (see (i) above) isn’t the best guide. Personally, I
generally cut data at around the point where Mn(I/sd) =~ 2, but I
would cut off below 2 for anisotropic data. I also find the
correlation coefficient between Ih(avg) (Imean) pairs in
half-datasets (plotted by Scala) a useful guide: the CC should be
>0.5 at least, I think.
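A correlation of this kind is just the Pearson correlation between
the two half-dataset means of each reflection. A minimal sketch
(assuming the half-dataset means are already available, and ignoring
the per-resolution-shell binning that would be done in practice):

   import math

   def half_dataset_cc(pairs):
       # pairs: list of (Ih_half1, Ih_half2) mean intensities for the
       # same reflections, each averaged within one random half of the
       # observations.  Returns the Pearson correlation coefficient.
       n = len(pairs)
       x = [a for a, _ in pairs]
       y = [b for _, b in pairs]
       mx, my = sum(x) / n, sum(y) / n
       sxy = sum((a - mx) * (b - my) for a, b in pairs)
       sxx = sum((a - mx) ** 2 for a in x)
       syy = sum((b - my) ** 2 for b in y)
       return sxy / math.sqrt(sxx * syy)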
Note that the overall value of any of these statistics over all
resolution ranges is not very useful and can be confusing, depending
on the distribution of intensities, since it mixes up strong low
resolution data (high signal/noise) with weak high resolution data
(low signal/noise).
That leaves the question of how to label these statistics in a
consistent, clear and concise way: suggestions?
Phil Evans