Re: [ccp4bb] ctruncate bug?

Boaz Shaanan Sun, 23 Jun 2013 04:16:07 -0700

Hi Douglas,

So will you and/or other participants in this fascinating and informative 
thread pick up the glove and implement the suggestions made here? At least 
we'll know if it makes a change to our data. In any case I doubt that it can 
harm.


 Cheers,

          Boaz


Boaz Shaanan, Ph.D.
Dept. of Life Sciences
Ben-Gurion University of the Negev
Beer-Sheva 84105
Israel

E-mail: bshaa...@bgu.ac.il
Phone: 972-8-647-2220  Skype: boaz.shaanan
Fax:   972-8-647-2992 or 972-8-646-1710





________________________________________
From: CCP4 bulletin board [CCP4BB@JISCMAIL.AC.UK] on behalf of Douglas Theobald 
[dtheob...@brandeis.edu]
Sent: Sunday, June 23, 2013 1:52 AM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] ctruncate bug?

On Jun 22, 2013, at 6:18 PM, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> 
wrote:

> A fascinating discussion (I've learnt a lot!);  a quick sanity check, though:
>
> In what scenarios would these improved estimates make a significant 
> difference?

Who knows?  I always think that improved estimates are always a good thing, 
ignoring computational complexity (by "improved" I mean making more accurate 
physical assumptions).  This may all be academic --- estimating Itrue with 
unphysical negative values, and then later correcting w/French-Wilson, may give 
approximately the same answers and make no tangible difference in the models.  
But that all seems a bit convoluted, ad hoc, and unnecessary, esp. now with the 
available computational power.  It might make a difference.

> Or rather:  are there any existing programs (as opposed to vapourware) that 
> would benefit significantly?
>
> Cheers
> phx
>
>
>
> On 22/06/2013 18:04, Douglas Theobald wrote:
>> Ian, I really do think we are almost saying the same thing.  Let me try to 
>> clarify.
>>
>> You say that the Gaussian model is not the "correct" data model, and that 
>> the Poisson is correct.  I more-or-less agree.  If I were being pedantic 
>> (me?) I would say that the Poisson is *more* physically realistic than the 
>> Gaussian, and more realistic in a very important and relevant way --- but in 
>> truth the Poisson model does not account for other physical sources of error 
>> that arise from real crystals and real detectors, such as dark noise and 
>> read noise (that's why I would prefer a gamma distribution).  I also agree 
>> that for x>10 the Gaussian is a good approximation to the Poisson.  I 
>> basically agree with every point you make about the Poisson vs the Gaussian, 
>> except for the following.
>>
>> The Iobs=Ispot-Iback equation cannot be derived from a Poisson assumption, 
>> except as an approximation when  Ispot > Iback.  It *can* be derived from 
>> the Gaussian assumption (and in fact I think that is probably the *only* 
>> justification it has).   It is true that the difference between two Poissons 
>> can be negative.  It is also true that for moderate # of counts, the 
>> Gaussian is a good approximation to the Poisson.  But we are trying to 
>> estimate Itrue, and both of those points are irrelevant to estimating Itrue 
>> when Ispot < Iback.  Contrary to your assertion, we are not concerned with 
>> differences of Poissonians, only sums.  Here is why:
>>
>> In the Poisson model you outline, Ispot is the sum of two Poisson variables, 
>> Iback and Iobs.  That means Ispot is also Poisson and can never be negative. 
>>  Again --- the observed data (Ispot) is a *sum*, so that is what we must 
>> deal with.  The likelihood function for this model is:
>>
>> L(a) = (a+b)^k exp(-a-b)
>>
>> where 'k' is the # of counts in Ispot, 'a' is the mean of the Iobs Poisson 
>> (i.e., a = Itrue), and 'b' is the           mean of the Iback Poisson.  Of 
>> course k>=0, and both parameters a>0 and b>0.  Our job is to estimate 'a', 
>> Itrue.  Given the likelihood function above, there is no valid estimate of 
>> 'a' that will give a negative value.  For example, the ML estimate of 'a' is 
>> always non-negative.  Specifically, if we assume 'b' is known from 
>> background extrapolation, the ML estimate of 'a' is:
>>
>> a = k-b   if k>b
>>
>> a = 0   if k<=b
>>
>> You can verify this visually by plotting the likelihood function (vs 'a' as 
>> variable) for any combination of k and b you want.  The SD is a bit more 
>> difficult, but it is approximately (a+b)/sqrt(k), where 'a' is now the ML 
>> estimate of 'a'.
>>
>> Note that the ML estimate of 'a', when k>b (Ispot>Iback), is equivalent to 
>> Ispot-Iback.
>>
>> Now, to restate:  as an estimate of Itrue, Ispot-Iback cannot be derived 
>> from the Poisson model.  In contrast, Ispot-Iback *can* be derived from a 
>> Gaussian model (as the ML and LS estimate of Itrue).  In fact, I'll wager 
>> the Gaussian is the only reasonable model that gives Ispot-Iback as an 
>> estimate of Itrue.  This is why I claim that using Ispot-Iback as an 
>> estimate of Itrue, even when Ispot<Iback, implicitly means you are using a 
>> (non-physical) Gaussian model.  Feel free to prove me wrong --- can you 
>> derive Ispot-Iback, as an estimate of Itrue, from anything besides a 
>> Gaussian?
>>
>> Cheers,
>>
>> Douglas
>>
>>
>>
>>
>> On Sat, Jun 22, 2013 at 12:06 PM, Ian Tickle <ianj...@gmail.com> wrote:
>> On 21 June 2013 19:45, Douglas Theobald <dtheob...@brandeis.edu> wrote:
>>
>> The current way of doing things is summarized by Ed's equation: 
>> Ispot-Iback=Iobs.  Here Ispot is the # of counts in the spot (the area 
>> encompassing the predicted reflection), and Iback is # of counts in the 
>> background (usu. some area around the spot).  Our job is to estimate the 
>> true intensity Itrue.  Ed and others argue that Iobs is a reasonable 
>> estimate of Itrue, but I say it isn't because Itrue can never be negative, 
>> whereas Iobs can.
>>
>> Now where does the Ispot-Iback=Iobs equation come from?  It implicitly 
>> assumes that both Iobs and Iback come from a Gaussian distribution, in which 
>> Iobs and Iback can have negative values.  Here's the implicit data model:
>>
>> Ispot = Iobs + Iback
>>
>> There is an Itrue, to which we add some Gaussian noise and randomly generate 
>> an Iobs.  To that is added some background noise, Iback, which is also 
>> randomly generated from a Gaussian with a "true" mean of Ibtrue.  This gives 
>> us the Ispot, the measured intensity in our spot.  Given this data model, 
>> Ispot will also have a Gaussian distribution, with mean equal to the sum of 
>> Itrue + Ibtrue.  From the properties of Gaussians, then, the ML estimate of 
>> Itrue will be Ispot-Iback, or Iobs.
>>
>> Douglas, sorry I still disagree with your model.  Please note that I do 
>> actually support your position, that Ispot-Iback is not the best estimate of 
>> Itrue.  I stress that I am not arguing against this conclusion, merely (!) 
>> with your data model, i.e. you are arriving at the correct conclusion 
>> despite using the wrong model!  So I think it's worth clearing that up.
>>
>> First off, I can assure you that there is no assumption, either implicit or 
>> explicit, that Ispot and Iback come from a Gaussian distribution.  They are 
>> both essentially measured photon counts (perhaps indirectly), so it is 
>> logically impossible that they could ever be negative, even with any 
>> experimental error you can imagine.  The concept of a photon counter 
>> counting a negative number of photons is simply a logical impossibility (it 
>> would be like counting the coins in your pocket and coming up with a 
>> negative number, even allowing for mistakes in counting!).  This immediately 
>> rules out the idea that they are Gaussian.  Photon counting where the 
>> photons appear completely randomly in time (essentially as a consequence of 
>> the Heisenberg Uncertainly Principle) obeys a Poisson distribution.  In fact 
>> we routinely estimate the standard uncertainties of Ispot & Iback on the 
>> basis that they are Poissonian, i.e. using var(count) = count.  That is 
>> hardly a Gaussian assumption for the uncertainty!
>>
>> Here is the correct data model: there is a true Ispot which is (or is 
>> proportional to) the diffracted energy from the _sum_ of the Bragg 
>> diffraction spot and the background under the spot (this is not the same as 
>> Iback).  This energy ends up as individual photons being counted at the 
>> detector (I know there's a complication that some detectors are not actually 
>> photon counters, but the result is the same: you end up with a photon count, 
>> or something proportional to it).  However photons are indistinguishable 
>> (they do not carry labels telling us where they came from), so quantum 
>> mechanics doesn't even allow us to talk about photons coming from different 
>> places: all we see are indistinguishable photons arriving at the detector 
>> and literally being counted.  Therefore the estimated Ispot being the total 
>> number of photons counted from Bragg + background has a Poisson 
>> distribution.  There will be some experimental error associated with the 
>> random-in-time appearance of photons and also instrumental errors (e.g we 
>> might simply fail to count some of the photons, or we might count extra 
>> photons coming from somewhere else), but whatever the source of the error 
>> there is no way that the measured count of photons can ever be negative.
>>
>> Now obviously we want to estimate the background under the spot but we can't 
>> do that by looking at the spot itself (because the photons are 
>> indistinguishable).  So completely independently of the Ispot measurement we 
>> look at a nearby representative (hopefully!) area where there are no Bragg 
>> spots and count that also: there is a true Iback associated with this and 
>> our estimate of it from counting photons.  Again, being a photon count it is 
>> also Poissonian and will have some experimental error associated with it, 
>> but regardless of what the error is Iback, like Ispot, can never be negative.
>>
>> Now we have two Poissonian variables Ispot & Iback and traditionally we 
>> perform the calculation Iobs = Ispot - Iback (whatever meaning you want to 
>> attach to Iobs).  Provided Ispot and Iback are 'sufficiently' large numbers 
>> a Poisson distribution can be approximated by a Gaussian with the same mean 
>> and standard deviation, but with the proviso that the variate of this 
>> approximate Gaussian can never be negative.  In fact you only need about 10 
>> counts or more in _both_ Ispot and Iback for the approximation to be pretty 
>> good.  (As an aside, 10 counts used to be a small number, nowadays detectors 
>> are becoming much more sensitive and the backgrounds are now so low that 
>> maybe the assumption that typical counts are > 10 is no longer tenable.).  
>> This of course means that the difference of 2 approximate Gaussians is also 
>> an approximate Gaussian, with mean equal to the difference of the means and 
>> variance equal to the sum of the variances.  Importantly, as a consequence 
>> of the experimental errors (including the fact that Iback is probably not an 
>> accurate estimate of the background in Ispot), this Gaussian _can_ have 
>> negative values of the variate.  F-W indeed makes the explicit assumption 
>> that Ispot - Iback is Gaussian and therefore can be negative.
>>
>> Your observation that the sum of 2 (or indeed any number of) Poissons is 
>> also Poissonian is of course completely correct (we can arbitrarily separate 
>> the photons into any number of groups each of which is Poissonian, and then 
>> adding the groups together at the end must give exactly the same result as 
>> having kept the photons in a single group).  However this point is 
>> irrelevant to the present discussion: we are not concerned with sums of 
>> Poissonians, only differences.
>>
>> Your previous statement that "the case when Iback>Ispot, where the Gaussian 
>> approximation to the Poisson no longer holds" is not correct.  The Gaussian 
>> approximation to the Poisson holds regardless of whether or not Iback > 
>> Ispot: the only assumption is that _both_ Ispot and Iback are "sufficiently 
>> large".
>>
>> My point about integrated intensities being required for estimating the 
>> Wilson distribution parameter in order to correct the intensities using F-W 
>> was that it's easy to iterate inside a single program.  It's much harder to 
>> iterate when it has to be done over several programs (in this case the 
>> integration program, the sorting/scaling/outlier rejection/merging program 
>> and the I->F conversion program), since not all the information required may 
>> be available at the same time (this is essentially Phil's point).  Also 
>> dealing with non-Gaussian values that would be generated by your algorithm 
>> in the outlier rejection/merging program will be tricky, and probably would 
>> require a radical overhaul of that program (a point I made previously).
>>
>> Sorry this got so long, but I felt it was important that you start out with 
>> the correct data model!
>>
>> Cheers
>>
>> -- Ian
>>
>

Re: [ccp4bb] ctruncate bug?

Reply via email to