Re: [ccp4bb] Rfree in similar data set

Ian Tickle Fri, 25 Sep 2009 02:40:44 -0700

Hi Tom

Attainment of the global optimum is not a necessary condition for the
argument to hold; it was merely a hypothetical example to illustrate the
point I was making, though I agree with you that maybe it wasn't such a
good example from a practical point of view!  The point was that the same
optimum, with the same value of Rfree, can be reached by many different
paths, some of which might involve switching the test set midway (i.e. the
ones claimed to be biased), and some where the same test set is used
throughout (i.e. the ones we're all agreed are unbiased); obviously in
each case the final refinement must use the same test set for any
comparison of the Rfrees to be valid.  However it's a logical
impossibility (in essence it comes down to a reductio ad absurdum to the
equation '0=1') for the same Rfree at the same optimum to be both biased
and unbiased (bias of course being the difference between the expectation
and the true value).  The *only* necessary (and sufficient) condition is
that the refinement with the new data has converged, so that the Rfree for
the parameters at that optimum is meaningful and any previous bias is
removed; whether it converges to a global or a local optimum makes no
essential difference.
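
As a toy illustration of the path argument above (emphatically not a
crystallographic refinement), here is a minimal Python sketch.  It assumes
a two-parameter straight-line model fitted by gradient descent to synthetic
data; the names make_data, refine and r_factor are hypothetical.  Because
this least-squares problem is convex its single optimum is necessarily
global, which simplifies the real situation.  Path A refines against the
working set from an arbitrary start; path B first refines against all the
data, test set included, and only then re-refines against the working set.

```python
import random


def make_data(n=100, slope=2.0, intercept=1.0, sigma=0.3, seed=0):
    """Synthetic 'observations': a straight line (signal) plus Gaussian noise."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]
    return [(x, slope * x + intercept + rng.gauss(0.0, sigma)) for x in xs]


def refine(start, data, steps=3000, rate=0.5):
    """Plain gradient descent on the mean squared residual of y = a*x + b."""
    a, b = start
    n = len(data)
    for _ in range(steps):
        grad_a = sum(2.0 * (a * x + b - y) * x for x, y in data) / n
        grad_b = sum(2.0 * (a * x + b - y) for x, y in data) / n
        a, b = a - rate * grad_a, b - rate * grad_b
    return a, b


def r_factor(params, data):
    """R-like statistic: sum|obs - calc| / sum|obs| (a toy stand-in for Rfree)."""
    a, b = params
    return sum(abs(y - (a * x + b)) for x, y in data) / sum(abs(y) for _, y in data)


data = make_data()
test_set = data[::10]                                       # every 10th point held out
work_set = [p for i, p in enumerate(data) if i % 10 != 0]   # the remaining points

# Path A: the same test set is excluded throughout.
path_a = refine((0.0, 0.0), work_set)

# Path B: first refine against ALL the data (so the test set has been "seen"),
# then re-refine to convergence against the working set only.
contaminated = refine((5.0, -3.0), data)
path_b = refine(contaminated, work_set)

rfree_a = r_factor(path_a, test_set)
rfree_b = r_factor(path_b, test_set)
print("Rfree before re-refinement (path B):", r_factor(contaminated, test_set))
print("Rfree at convergence, path A:", rfree_a)
print("Rfree at convergence, path B:", rfree_b)
```

In this toy case both paths end at the same parameters, so the two
converged Rfree values agree to within rounding error; what matters is
convergence against the final working set, not the route taken.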


Note that bias in Rfree arises because the model parameters are
unavoidably overfitted to the 'noise' in the data (i.e. random
experimental errors in Iobs or Fobs), whereas what we want is to fit the
parameters to only the 'signal' in the data (i.e. differences between
Fobs and Fcalc which relate only to real differences in the model).
Unfortunately optimization algorithms are unable to make any distinction
between fitting signal and noise, so of course we end up fitting both.
When we fit the model to a new set of data, the parameters are re-fitted
to the signal and noise in the new data, and any 'memory' of fitting to
the old data, along with any bias in Rfree due to fitting the noise in
the old data, is completely replaced at convergence by the 'memory' of
fitting to the new data.
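
To make the noise-fitting point concrete, here is a toy Monte Carlo sketch
in Python.  It assumes the simplest possible model (a single parameter, the
mean, fitted to pure Gaussian noise with zero signal); the function names
and the choice of five observations per data set are illustrative
assumptions only, and nothing here is specific to any refinement program.
It compares the mean squared residual of (a) the data the parameter was
fitted to, (b) data never used in any fit, and (c) the old data measured
against a parameter re-fitted to a new, independent data set.

```python
import random


def mean(values):
    return sum(values) / len(values)


def mean_sq_residual(estimate, values):
    """Average squared difference between the observations and the model value."""
    return mean([(v - estimate) ** 2 for v in values])


def simulate(trials=20000, n=5, sigma=1.0, seed=1):
    """Average the three residuals over many independent noise realisations."""
    rng = random.Random(seed)
    in_sample = held_out = after_refit = 0.0
    for _ in range(trials):
        old = [rng.gauss(0.0, sigma) for _ in range(n)]     # data the model was fitted to
        fresh = [rng.gauss(0.0, sigma) for _ in range(n)]   # data never used in any fit
        new = [rng.gauss(0.0, sigma) for _ in range(n)]     # second, independent data set
        fit_old = mean(old)   # least-squares fit to the old data (signal + noise)
        fit_new = mean(new)   # converged re-fit: a function of the new data only
        in_sample += mean_sq_residual(fit_old, old)
        held_out += mean_sq_residual(fit_old, fresh)
        after_refit += mean_sq_residual(fit_new, old)
    return in_sample / trials, held_out / trials, after_refit / trials


in_sample, held_out, after_refit = simulate()
print("residual on the fitted data:            ", in_sample)
print("residual on never-used data:            ", held_out)
print("residual on old data after the re-fit:  ", after_refit)
```

For this model the expectations are sigma^2 (n-1)/n for the fitted data and
sigma^2 (n+1)/n for the other two cases, so the first figure is biased
downwards while the third shows no such bias: once the parameter has been
re-fitted to the new data it retains no memory of the old noise.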

Cheers

-- Ian

> -----Original Message-----
> From: owner-ccp...@jiscmail.ac.uk [mailto:owner-ccp...@jiscmail.ac.uk] On
> Behalf Of Tom Terwilliger
> Sent: 24 September 2009 16:58
> To: Ian Tickle
> Cc: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] Rfree in similar data set
> 
> Hi Ian,
> 
> Surely you are correct that  "...once all issues of local optima are
> resolved, by whatever means it takes, you will end up at the same unique
> global optimum no matter where you started from."   However the key here
> is "by whatever means it takes".  I think that in practice there are a
> vast number of local minima in this problem.  You can rebuild a model from
> the PDB that is highly refined and find many other models that have
> R-factors that are the same or better, and all can be refined to a stable
> "minimum".  All of course are very similar and differ principally in
> side-chain conformations and small main chain differences.   I think that
> means it is very difficult to find the global minimum.
> 
> In practice, relative to the Rfree set discussion that started this, I
> think this also means that once an Rfree set is chosen and a model has
> been refined using that Rfree set, the Rfree set should be kept.
> 
> All the best,
> Tom T
> 
> On Sep 24, 2009, at 9:41 AM, Ian Tickle wrote:
> 
> 
>               -----Original Message-----
>               From: owner-ccp...@jiscmail.ac.uk
>               [mailto:owner-ccp...@jiscmail.ac.uk] On Behalf Of Eric Bennett
>               Sent: 24 September 2009 13:31
>               To: CCP4BB@JISCMAIL.AC.UK
>               Subject: Re: [ccp4bb] Rfree in similar data set
> 
>               Ian Tickle wrote:
> 
>                       For that to be true it would have to be possible to
>                       arrive at a different unbiased Rfree from another
>                       starting point.  But provided your starting point
>                       wasn't a local maximum LL and you haven't gotten
>                       into a local maximum along the way, convergence will
>                       be to a unique global maximum of the LL, so the
>                       Rfree must be the same whatever starting point is
>                       used (within the radius of convergence of course).
> 
>               But if you're using a different set of data the minima and
>               maxima of the function aren't necessarily going to be in the
>               same place.  Rfree is supposed to inform about overfitting.
>               In an overfitting situation there are multiple possible
>               models which describe the data well and which overfit
>               solution you end up with could be sensitive to the data set
>               used.  The provisions that you haven't gotten stuck in a
>               local maximum and are within radius of convergence don't
>               seem safe considering historical situations that led to the
>               introduction of Rfree.  What algorithm is going to converge
>               main chain tracing errors to the correct maximum?  Thinking
>               about that situation, isn't part of the goal of Rfree to
>               give you a hint in situations where you have, in fact,
>               gotten stuck in a local maximum due to a significant error
>               in the model that places it outside the radius of
>               convergence of the refinement algorithm?
> 
>       Hi Eric,
> 
>       Yes clearly the function optima won't necessarily be in the same
>       place for different datasets; the question is whether the distance
>       between the optima is less than the convergence radius.  This will
>       depend largely on whether the datasets have similar dmin; if they do
>       then the differences will be largely random measurement errors (I'm
>       assuming that there's nothing fundamentally wrong with the data).
>       Then there should be no problem re-refining against the 2nd dataset,
>       and the Rfree will be unbiased at the global optimum.  The more
>       common situation perhaps is that the 2nd dataset is at much higher
>       resolution; in that case it's quite likely that there are undetected
>       local optima in the model from the 1st dataset that only become
>       apparent in the maps when the 2nd dataset is used.  In that case
>       refinement is almost certainly not the answer (or at least not the
>       whole answer), you're going to have to go back to the maps and model
>       building.
> 
>       On the question of overfitting, again any problems of local optima
>       (possibly indicated by a higher than expected Rfree as you say) have
>       to be resolved first for each of your candidate parameterizations of
>       the model, as best as the data will allow.  Then if you find that
>       Rfree at convergence is higher (or LLfree lower) for one
>       parameterization than another, you choose the parameterization with
>       the lower Rfree (higher LLfree) to go forward.  You cannot safely
>       reject a model as being overfitted if the refinement generating the
>       Rfree didn't converge, so that the Rfree is unbiased.  I don't see
>       the problem there (except of course in choosing which
>       parameterizations to try).
> 
>       I think you misunderstood my provisos, I was only doing that to
>       simplify the argument; if there are local optima then they have to
>       be resolved, most likely by means other than refinement, but their
>       presence does not affect the argument about Rfree bias.  My
>       contention is that once all issues of local optima are resolved, by
>       whatever means it takes, you will end up at the same unique global
>       optimum no matter where you started from (unless of course you're
>       very unlucky and there are multiple global optima with identical
>       likelihoods but I think we can discount that as unlikely!), and
>       therefore Rfree must be unbiased at that point.  At intermediate
>       points in this process (i.e. on the paths connecting optima), Rfree
>       has no meaning or indeed usefulness and therefore the question
>       whether it's biased or not is also meaningless.
> 
>       Cheers
> 
>       -- Ian
> 
> 
>       Disclaimer
>       This communication is confidential and may contain privileged
>       information intended solely for the named addressee(s). It may not
>       be used or disclosed except for the purpose for which it has been
>       sent. If you are not the intended recipient you must not review,
>       use, disclose, copy, distribute or take any action in reliance upon
>       it. If you have received this communication in error, please notify
>       Astex Therapeutics Ltd by emailing i.tic...@astex-therapeutics.com
>       and destroy all copies of the message and any attached documents.
>       Astex Therapeutics Ltd monitors, controls and protects all its
>       messaging traffic in compliance with its corporate email policy. The
>       Company accepts no liability or responsibility for any onward
>       transmission or use of emails and attachments having left the Astex
>       Therapeutics domain.  Unless expressly stated, opinions in this
>       message are those of the individual sender and not of Astex
>       Therapeutics Ltd. The recipient should check this email and any
>       attachments for the presence of computer viruses. Astex Therapeutics
>       Ltd accepts no liability for damage caused by any virus transmitted
>       by this email. E-mail is susceptible to data corruption,
>       interception, unauthorized amendment, and tampering, Astex
>       Therapeutics Ltd only send and receive e-mails on the basis that the
>       Company is not liable for any such alteration or any consequences
>       thereof.
>       Astex Therapeutics Ltd., Registered in England at 436 Cambridge
>       Science Park, Cambridge CB4 0QA under number 3751674
> 
> 
> 
> Thomas C. Terwilliger
> Mail Stop M888
> Los Alamos National Laboratory
> Los Alamos, NM 87545
> 
> Tel:  505-667-0072                 email: terwilli...@lanl.gov
> Fax: 505-665-3024                 SOLVE web site: http://solve.lanl.gov
> PHENIX web site: http://www.phenix-online.org
> ISFI Integrated Center for Structure and Function Innovation web site:
> http://techcenter.mbi.ucla.edu
> TB Structural Genomics Consortium web site: http://www.doe-mbi.ucla.edu/TB
> CBSS Center for Bio-Security Science web site: http://www.lanl.gov/cbss
> 
> 
> 
> 



