Thank you for this.

Hmmm.

Interesting, and good to know the expected distribution of extreme values.

However, what I'm more worried about is how to evaluate the other 999 points.  Let's say I'm trying to compare two 1000-member sets (A and B) that both have an extreme value of 3 sigma, but the other 999 deviates are all 2 sigma in A and 1 sigma in B.  Clearly, B is better than A, but how do I quantify that?
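
One conventional way to score the whole set rather than just its extreme is the sum of squared normalized deviates, which for n independent unit Gaussians has mean n and variance 2n. This is only a sketch of that idea, not something proposed in this thread; the function name `chi2_summary` is made up for illustration, and the normal approximation to the chi-square distribution is an assumption that is reasonable for n = 1000.

```python
import math

def chi2_summary(z_scores):
    """Return (chi2, score): chi2 is the sum of squared deviates and
    score is (chi2 - n) / sqrt(2n), its normal-approximation z-score."""
    n = len(z_scores)
    chi2 = sum(z * z for z in z_scores)
    return chi2, (chi2 - n) / math.sqrt(2 * n)

# The two hypothetical sets from the message: same 3-sigma extreme,
# but the other 999 deviates are 2 sigma (A) or 1 sigma (B).
set_a = [2.0] * 999 + [3.0]
set_b = [1.0] * 999 + [3.0]

chi2_a, score_a = chi2_summary(set_a)
chi2_b, score_b = chi2_summary(set_b)

print(f"A: chi2 = {chi2_a:.0f}, score = {score_a:.2f}")
print(f"B: chi2 = {chi2_b:.0f}, score = {score_b:.2f}")
```

Under these assumptions A lands dozens of standard deviations above expectation while B is within one, even though both share the same extreme value.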


On 11/8/2022 3:34 PM, Petrus Zwart wrote:
Hi James,

This is what you need.

https://en.wikipedia.org/wiki/Generalized_extreme_value_distribution

The distribution of a maximum of 1k random variates looks like this, and the (fitted by eye) analytical distribution associated with it seems to have a decent fit - as expected.

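[image.png: distribution of the maximum of 1k random variates, with the fitted analytical distribution overlaid]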
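
For readers without the attachment, the distribution of the maximum can be reproduced roughly as follows. This is a sketch, assuming independent standard-normal variates and a one-sided maximum; it compares a small simulation against the exact CDF of the maximum, Phi(x)^n, rather than against a fitted generalized extreme value curve. The helper name `max_cdf` is made up for illustration.

```python
import random
from statistics import NormalDist

def max_cdf(x, n=1000):
    """Exact CDF of the maximum of n independent standard normals."""
    return NormalDist().cdf(x) ** n

random.seed(0)
n_trials = 2000
# Each trial: draw 1000 standard-normal variates and keep the largest.
maxima = [max(random.gauss(0.0, 1.0) for _ in range(1000))
          for _ in range(n_trials)]

# Fraction of trials whose maximum stays at or below 3 sigma.
empirical = sum(m <= 3.0 for m in maxima) / n_trials

print(f"simulated P(max <= 3) = {empirical:.3f}")
print(f"exact     P(max <= 3) = {max_cdf(3.0):.3f}")
```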

The idea of a p-value to judge the quality of a structure is interesting. xtriage uses this mechanism to flag suspicious normalized intensities, the idea being that a large E value is less likely to appear in a small dataset than in a large one. The issue, of course, is that a normalized intensity is bounded by the number of atoms, whereas the underlying assumption is that it can potentially be infinitely large. I still think it is a decent metric.

P


On Tue, Nov 8, 2022 at 3:25 PM James Holton <jmhol...@lbl.gov> wrote:

    Thank you Ian for your quick response!

    I suppose what I'm really trying to do is put a p-value on the
    "geometry" of a given PDB file.  As in: what are the odds the
    deviations from ideality of this model are due to chance?

    I am leaning toward the need to take all the deviations in the
    structure together as a set, but, as Joao just noted, it just
    "feels wrong" to tolerate a 3-sigma deviate. Even more wrong to
    tolerate 4 sigma, 5 sigma. And 6-sigma deviates are really
    difficult to swallow unless you have trillions of data points.

    To put it down in equations, is the p-value of a structure with
    1000 bonds in it with one 3-sigma deviate given by:

    a)  p = 1-erf(3/sqrt(2))
    or
    b)  p = 1-erf(3/sqrt(2))**1000
    or
    c) something else?
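
For reference, expressions (a) and (b) can be evaluated directly. This is a sketch that only computes the two numbers; it assumes Gaussian, independent deviates and does not settle which one is the appropriate p-value. The variable names are made up. In Python, `**` binds tighter than the subtraction, so (b) is 1 minus the 1000th power of the erf term.

```python
from math import erf, sqrt

# (a) chance that one given deviate lies beyond 3 sigma (either sign)
p_single = 1 - erf(3 / sqrt(2))

# (b) chance that at least one of 1000 independent deviates does
p_any = 1 - erf(3 / sqrt(2)) ** 1000

print(f"a) p = {p_single:.4f}")
print(f"b) p = {p_any:.3f}")
```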



    On 11/8/2022 2:56 PM, Ian Tickle wrote:
    Hi James

    I don't think it's meaningful to ask whether the deviation of a
    single bond length (or anything else that's single) from its
    expected value is significant, since as you say there's always
    some finite probability that it occurred purely by chance. 
    Statistics can only meaningfully be applied to samples of a
    'reasonable' size.  I know there are statistics designed for
    small samples but not for samples of size 1 !  It's more
    meaningful to talk about distributions.  For example if 1% of the
    sample contained deviations > 3 sigma when you expected there to
    be only 0.3 %, that is probably significant (but it still has a
    finite probability of occurring by chance), as would be finding
    no deviations > 3 sigma (for a reasonably large sample to avoid
    sampling errors).
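
The two cases in this paragraph can be put into numbers with a binomial model. This is a sketch, assuming a sample of 1000 independent Gaussian deviates (the sample size is an assumption borrowed from elsewhere in the thread) and counting deviations of either sign; `binom_tail` is a made-up helper name.

```python
from math import comb, erfc, sqrt

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

p3 = erfc(3 / sqrt(2))  # two-sided chance of one deviate beyond 3 sigma
n = 1000                # assumed sample size

# Seeing 1% (10 or more) beyond 3 sigma when about 0.3% is expected.
p_ten_or_more = binom_tail(10, n, p3)

# Seeing no deviate beyond 3 sigma at all.
p_none = (1 - p3) ** n

print(f"P(10 or more beyond 3 sigma) = {p_ten_or_more:.2e}")
print(f"P(none beyond 3 sigma)       = {p_none:.3f}")
```

Under these assumptions the 1% case has a probability well below 0.1%, while finding no 3-sigma deviates at all has a probability of roughly 7%.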

    Cheers

    -- Ian


    On Tue, Nov 8, 2022, 22:22 James Holton <jmhol...@lbl.gov> wrote:

        OK, so let's suppose there is this bond in your structure that is
        stretched a bit.  Is that for real? Or just a random fluke? Let's
        say, for example, it's a CA-CB bond that is supposed to be 1.529 A
        long, but in your model it's 1.579 A.  This is 0.05 A too long.
        Doesn't seem like much, right? But the "sigma" given to such a bond
        in our geometry libraries is 0.016 A.  These sigmas are typically
        derived from a database of observed bonds of similar type found in
        highly accurate structures, like small molecules. So, that makes
        this a 3-sigma outlier. Assuming the distribution of deviations is
        Gaussian, that's a pretty unlikely thing to happen. You expect
        3-sigma deviates to appear less than 0.3% of the time.  So, is that
        significant?
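
The arithmetic for this one bond, as a sketch: the three numbers are the ones quoted above, the Gaussian assumption is the one stated in the message, and the variable names are made up.

```python
from math import erfc, sqrt

ideal = 1.529     # library CA-CB bond length, in Angstrom
observed = 1.579  # bond length in the model, in Angstrom
sigma = 0.016     # library sigma for this bond type, in Angstrom

# Normalized deviation of the bond from its ideal value.
z = (observed - ideal) / sigma

# Two-sided Gaussian tail probability for a deviation at least this large.
p_two_sided = erfc(abs(z) / sqrt(2))

print(f"z = {z:.3f} sigma")
print(f"two-sided p = {p_two_sided:.4f}")
```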

        But, then again, there are lots of other bonds in the structure.
        Let's say there are 1000. With that many samplings from a Gaussian
        distribution you generally expect to see a 3-sigma deviate at least
        once.  That is, do an "experiment" where you pick 1000
        Gaussian-random numbers from a distribution with a standard
        deviation of 1.0. Then, look for the maximum over all 1000 trials.
        Is that one > 3 sigma? It probably is. If you do this "experiment"
        millions of times it turns out seeing at least one 3-sigma deviate
        in 1000 tries is very common. Specifically, about 93% of the time.
        It is rare indeed to have every member of a 1000-deviate set all
        lie within 3 sigmas.  So, we have gone from one 3-sigma deviate
        being highly unlikely to being a virtual certainty if you look at
        enough samples.
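
A scaled-down version of this "experiment" can be sketched as below, with 2000 repeats instead of millions; the seed and variable names are arbitrary choices. It tallies two readings of "3-sigma deviate": the largest absolute deviation, and the largest positive deviation only. The figure of about 93% corresponds to the first reading, where deviations of either sign count.

```python
import random

random.seed(1)
n_experiments, n_bonds = 2000, 1000

either_sign = 0    # experiments with at least one |x| > 3
positive_only = 0  # experiments whose signed maximum is > 3

for _ in range(n_experiments):
    sample = [random.gauss(0.0, 1.0) for _ in range(n_bonds)]
    if max(abs(x) for x in sample) > 3.0:
        either_sign += 1
    if max(sample) > 3.0:
        positive_only += 1

frac_either = either_sign / n_experiments
frac_positive = positive_only / n_experiments

print(f"at least one |x| > 3: {frac_either:.3f}")
print(f"maximum x > 3:        {frac_positive:.3f}")
```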

        So, my question is: is a 3-sigma deviate significant?  Is it
        significant only if you have one bond in the structure?  What about
        angles? What if you have 500 bonds and 500 angles?  Do they count
        as 1000 deviates together? Or separately?

        I'm sure the more mathematically inclined out there will have some
        intelligent answers for the rest of us. However, if you are not a
        mathematician, how about a vote?  Is a 3-sigma bond length
        deviation significant? Or not?

        Looking forward to both kinds of responses,

        -James Holton
        MAD Scientist

        ########################################################################

        To unsubscribe from the CCP4BB list, click the following link:
        https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1
        <https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1>

        This message was issued to members of
        www.jiscmail.ac.uk/CCP4BB <http://www.jiscmail.ac.uk/CCP4BB>,
        a mailing list hosted by www.jiscmail.ac.uk
        <http://www.jiscmail.ac.uk>, terms & conditions are available
        at https://www.jiscmail.ac.uk/policyandsecurity/






--
------------------------------------------------------------------------------------------
P.H. Zwart
Staff Scientist, Molecular Biophysics and Integrated Bioimaging
Biosciences Lead, Center for Advanced Mathematics for Energy Research Applications
Lawrence Berkeley National Laboratories
1 Cyclotron Road, Berkeley, CA-94703, USA
Cell: 510 289 9246
PHENIX: http://www.phenix-online.org
CAMERA: http://camera.lbl.gov/
------------------------------------------------------------------------------------------
