Hi Douglas, So will you and/or other participants in this fascinating and informative thread pick up the glove and implement the suggestions made here? At least we'll know if it makes a change to our data. In any case I doubt that it can harm.
Cheers, Boaz Boaz Shaanan, Ph.D. Dept. of Life Sciences Ben-Gurion University of the Negev Beer-Sheva 84105 Israel E-mail: bshaa...@bgu.ac.il Phone: 972-8-647-2220 Skype: boaz.shaanan Fax: 972-8-647-2992 or 972-8-646-1710 ________________________________________ From: CCP4 bulletin board [CCP4BB@JISCMAIL.AC.UK] on behalf of Douglas Theobald [dtheob...@brandeis.edu] Sent: Sunday, June 23, 2013 1:52 AM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] ctruncate bug? On Jun 22, 2013, at 6:18 PM, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> wrote: > A fascinating discussion (I've learnt a lot!); a quick sanity check, though: > > In what scenarios would these improved estimates make a significant > difference? Who knows? I always think that improved estimates are always a good thing, ignoring computational complexity (by "improved" I mean making more accurate physical assumptions). This may all be academic --- estimating Itrue with unphysical negative values, and then later correcting w/French-Wilson, may give approximately the same answers and make no tangible difference in the models. But that all seems a bit convoluted, ad hoc, and unnecessary, esp. now with the available computational power. It might make a difference. > Or rather: are there any existing programs (as opposed to vapourware) that > would benefit significantly? > > Cheers > phx > > > > On 22/06/2013 18:04, Douglas Theobald wrote: >> Ian, I really do think we are almost saying the same thing. Let me try to >> clarify. >> >> You say that the Gaussian model is not the "correct" data model, and that >> the Poisson is correct. I more-or-less agree. If I were being pedantic >> (me?) I would say that the Poisson is *more* physically realistic than the >> Gaussian, and more realistic in a very important and relevant way --- but in >> truth the Poisson model does not account for other physical sources of error >> that arise from real crystals and real detectors, such as dark noise and >> read noise (that's why I would prefer a gamma distribution). I also agree >> that for x>10 the Gaussian is a good approximation to the Poisson. I >> basically agree with every point you make about the Poisson vs the Gaussian, >> except for the following. >> >> The Iobs=Ispot-Iback equation cannot be derived from a Poisson assumption, >> except as an approximation when Ispot > Iback. It *can* be derived from >> the Gaussian assumption (and in fact I think that is probably the *only* >> justification it has). It is true that the difference between two Poissons >> can be negative. It is also true that for moderate # of counts, the >> Gaussian is a good approximation to the Poisson. But we are trying to >> estimate Itrue, and both of those points are irrelevant to estimating Itrue >> when Ispot < Iback. Contrary to your assertion, we are not concerned with >> differences of Poissonians, only sums. Here is why: >> >> In the Poisson model you outline, Ispot is the sum of two Poisson variables, >> Iback and Iobs. That means Ispot is also Poisson and can never be negative. >> Again --- the observed data (Ispot) is a *sum*, so that is what we must >> deal with. The likelihood function for this model is: >> >> L(a) = (a+b)^k exp(-a-b) >> >> where 'k' is the # of counts in Ispot, 'a' is the mean of the Iobs Poisson >> (i.e., a = Itrue), and 'b' is the mean of the Iback Poisson. Of >> course k>=0, and both parameters a>0 and b>0. Our job is to estimate 'a', >> Itrue. Given the likelihood function above, there is no valid estimate of >> 'a' that will give a negative value. For example, the ML estimate of 'a' is >> always non-negative. Specifically, if we assume 'b' is known from >> background extrapolation, the ML estimate of 'a' is: >> >> a = k-b if k>b >> >> a = 0 if k<=b >> >> You can verify this visually by plotting the likelihood function (vs 'a' as >> variable) for any combination of k and b you want. The SD is a bit more >> difficult, but it is approximately (a+b)/sqrt(k), where 'a' is now the ML >> estimate of 'a'. >> >> Note that the ML estimate of 'a', when k>b (Ispot>Iback), is equivalent to >> Ispot-Iback. >> >> Now, to restate: as an estimate of Itrue, Ispot-Iback cannot be derived >> from the Poisson model. In contrast, Ispot-Iback *can* be derived from a >> Gaussian model (as the ML and LS estimate of Itrue). In fact, I'll wager >> the Gaussian is the only reasonable model that gives Ispot-Iback as an >> estimate of Itrue. This is why I claim that using Ispot-Iback as an >> estimate of Itrue, even when Ispot<Iback, implicitly means you are using a >> (non-physical) Gaussian model. Feel free to prove me wrong --- can you >> derive Ispot-Iback, as an estimate of Itrue, from anything besides a >> Gaussian? >> >> Cheers, >> >> Douglas >> >> >> >> >> On Sat, Jun 22, 2013 at 12:06 PM, Ian Tickle <ianj...@gmail.com> wrote: >> On 21 June 2013 19:45, Douglas Theobald <dtheob...@brandeis.edu> wrote: >> >> The current way of doing things is summarized by Ed's equation: >> Ispot-Iback=Iobs. Here Ispot is the # of counts in the spot (the area >> encompassing the predicted reflection), and Iback is # of counts in the >> background (usu. some area around the spot). Our job is to estimate the >> true intensity Itrue. Ed and others argue that Iobs is a reasonable >> estimate of Itrue, but I say it isn't because Itrue can never be negative, >> whereas Iobs can. >> >> Now where does the Ispot-Iback=Iobs equation come from? It implicitly >> assumes that both Iobs and Iback come from a Gaussian distribution, in which >> Iobs and Iback can have negative values. Here's the implicit data model: >> >> Ispot = Iobs + Iback >> >> There is an Itrue, to which we add some Gaussian noise and randomly generate >> an Iobs. To that is added some background noise, Iback, which is also >> randomly generated from a Gaussian with a "true" mean of Ibtrue. This gives >> us the Ispot, the measured intensity in our spot. Given this data model, >> Ispot will also have a Gaussian distribution, with mean equal to the sum of >> Itrue + Ibtrue. From the properties of Gaussians, then, the ML estimate of >> Itrue will be Ispot-Iback, or Iobs. >> >> Douglas, sorry I still disagree with your model. Please note that I do >> actually support your position, that Ispot-Iback is not the best estimate of >> Itrue. I stress that I am not arguing against this conclusion, merely (!) >> with your data model, i.e. you are arriving at the correct conclusion >> despite using the wrong model! So I think it's worth clearing that up. >> >> First off, I can assure you that there is no assumption, either implicit or >> explicit, that Ispot and Iback come from a Gaussian distribution. They are >> both essentially measured photon counts (perhaps indirectly), so it is >> logically impossible that they could ever be negative, even with any >> experimental error you can imagine. The concept of a photon counter >> counting a negative number of photons is simply a logical impossibility (it >> would be like counting the coins in your pocket and coming up with a >> negative number, even allowing for mistakes in counting!). This immediately >> rules out the idea that they are Gaussian. Photon counting where the >> photons appear completely randomly in time (essentially as a consequence of >> the Heisenberg Uncertainly Principle) obeys a Poisson distribution. In fact >> we routinely estimate the standard uncertainties of Ispot & Iback on the >> basis that they are Poissonian, i.e. using var(count) = count. That is >> hardly a Gaussian assumption for the uncertainty! >> >> Here is the correct data model: there is a true Ispot which is (or is >> proportional to) the diffracted energy from the _sum_ of the Bragg >> diffraction spot and the background under the spot (this is not the same as >> Iback). This energy ends up as individual photons being counted at the >> detector (I know there's a complication that some detectors are not actually >> photon counters, but the result is the same: you end up with a photon count, >> or something proportional to it). However photons are indistinguishable >> (they do not carry labels telling us where they came from), so quantum >> mechanics doesn't even allow us to talk about photons coming from different >> places: all we see are indistinguishable photons arriving at the detector >> and literally being counted. Therefore the estimated Ispot being the total >> number of photons counted from Bragg + background has a Poisson >> distribution. There will be some experimental error associated with the >> random-in-time appearance of photons and also instrumental errors (e.g we >> might simply fail to count some of the photons, or we might count extra >> photons coming from somewhere else), but whatever the source of the error >> there is no way that the measured count of photons can ever be negative. >> >> Now obviously we want to estimate the background under the spot but we can't >> do that by looking at the spot itself (because the photons are >> indistinguishable). So completely independently of the Ispot measurement we >> look at a nearby representative (hopefully!) area where there are no Bragg >> spots and count that also: there is a true Iback associated with this and >> our estimate of it from counting photons. Again, being a photon count it is >> also Poissonian and will have some experimental error associated with it, >> but regardless of what the error is Iback, like Ispot, can never be negative. >> >> Now we have two Poissonian variables Ispot & Iback and traditionally we >> perform the calculation Iobs = Ispot - Iback (whatever meaning you want to >> attach to Iobs). Provided Ispot and Iback are 'sufficiently' large numbers >> a Poisson distribution can be approximated by a Gaussian with the same mean >> and standard deviation, but with the proviso that the variate of this >> approximate Gaussian can never be negative. In fact you only need about 10 >> counts or more in _both_ Ispot and Iback for the approximation to be pretty >> good. (As an aside, 10 counts used to be a small number, nowadays detectors >> are becoming much more sensitive and the backgrounds are now so low that >> maybe the assumption that typical counts are > 10 is no longer tenable.). >> This of course means that the difference of 2 approximate Gaussians is also >> an approximate Gaussian, with mean equal to the difference of the means and >> variance equal to the sum of the variances. Importantly, as a consequence >> of the experimental errors (including the fact that Iback is probably not an >> accurate estimate of the background in Ispot), this Gaussian _can_ have >> negative values of the variate. F-W indeed makes the explicit assumption >> that Ispot - Iback is Gaussian and therefore can be negative. >> >> Your observation that the sum of 2 (or indeed any number of) Poissons is >> also Poissonian is of course completely correct (we can arbitrarily separate >> the photons into any number of groups each of which is Poissonian, and then >> adding the groups together at the end must give exactly the same result as >> having kept the photons in a single group). However this point is >> irrelevant to the present discussion: we are not concerned with sums of >> Poissonians, only differences. >> >> Your previous statement that "the case when Iback>Ispot, where the Gaussian >> approximation to the Poisson no longer holds" is not correct. The Gaussian >> approximation to the Poisson holds regardless of whether or not Iback > >> Ispot: the only assumption is that _both_ Ispot and Iback are "sufficiently >> large". >> >> My point about integrated intensities being required for estimating the >> Wilson distribution parameter in order to correct the intensities using F-W >> was that it's easy to iterate inside a single program. It's much harder to >> iterate when it has to be done over several programs (in this case the >> integration program, the sorting/scaling/outlier rejection/merging program >> and the I->F conversion program), since not all the information required may >> be available at the same time (this is essentially Phil's point). Also >> dealing with non-Gaussian values that would be generated by your algorithm >> in the outlier rejection/merging program will be tricky, and probably would >> require a radical overhaul of that program (a point I made previously). >> >> Sorry this got so long, but I felt it was important that you start out with >> the correct data model! >> >> Cheers >> >> -- Ian >> >