Yes, that's particularly a problem early in refinement, as the likelihood 
target used to fit the sigmaA curve from the cross-validation data is very 
insensitive when the true value of sigmaA is low; it becomes more sensitive 
as the model improves and the true value of sigmaA increases.  Since the 
sigmaA curve is used to calibrate the likelihood functions used for refinement, 
this is a significant issue.

My initial experience with refining sigmaA using cross-validation data indeed 
suggested that at least 500-1000 reflections were needed, particularly early in 
the process.  For cases with very small cells and limited resolution, I've 
sometimes wondered if we could put some of the cross-validation data back into 
the working set, after the model has become sufficiently good, but I've never 
really tested this idea.

Regards,

Randy Read

On 26 Mar 2013, at 11:54, Ed. Pozharski <epozh...@umaryland.edu> wrote:

> As I recall, number of reflections set aside for cross-validation also 
> affects stability of sigmaA estimates.  With 500 reflections and 20 
> resolution shells you are down to 25 reflections per shell, which may be a 
> bit too low.
> 
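The per-shell arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `reflections_per_shell` is hypothetical, and it assumes equally populated resolution shells, which is not necessarily how any particular refinement program bins its data.

```python
def reflections_per_shell(n_test: int, n_shells: int) -> float:
    """Average number of test reflections per resolution shell.

    Assumes (for illustration only) that the shells are equally populated.
    """
    if n_shells <= 0:
        raise ValueError("n_shells must be positive")
    return n_test / n_shells


if __name__ == "__main__":
    # Compare a few test-set sizes at 20 shells, as in the message above.
    for n_test in (500, 1000, 2000):
        print(n_test, reflections_per_shell(n_test, 20))
```

With 500 test reflections and 20 shells this gives the 25 per shell mentioned above; doubling the test set doubles the count per shell.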
> 
> -------- Original message --------
> From: Robbie Joosten <robbie_joos...@hotmail.com> 
> Date: 
> To: CCP4BB@JISCMAIL.AC.UK 
> Subject: Re: [ccp4bb] Rfree reflections 
> 
> 
> Hi Tim,
> 
> The derivation of sigma(Rw-free) is in this paper: Acta Cryst. (2000). D56,
> 442-450. Tickle et al.
> Note the difference between the sigma of weighted/generalized/Hamilton
> R-free and that of the 'regular' R-free (there is a 2 there somewhere). From
> my own tests (10 fold cross-validation on 38 small datasets) I also find
> sigma(R-free) = R-free/sqrt(Ntest).
> 
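As a rough sketch of the relation quoted above, sigma(R-free) = R-free/sqrt(Ntest), the following Python evaluates it for a given test-set size. The function name `sigma_rfree` is hypothetical, and the sketch covers only the 'regular' R-free; the weighted/generalized form differs, as noted above, and is not modelled here.

```python
import math


def sigma_rfree(r_free: float, n_test: int) -> float:
    """Estimated standard deviation of the 'regular' R-free.

    Uses the relation quoted in the message: R-free / sqrt(Ntest).
    """
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return r_free / math.sqrt(n_test)


if __name__ == "__main__":
    # Illustrative values only; R-free = 0.25 is an arbitrary example.
    for n_test in (500, 1000, 5000):
        print(n_test, round(sigma_rfree(0.25, n_test), 4))
```

For an R-free of 0.25 this gives roughly 0.011 with 500 test reflections and roughly 0.008 with 1000, which shows how slowly the uncertainty shrinks as the test set grows.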
> For large datasets you really do not need to do k-fold cross validation,
> because sigma(R-free) can be predicted quite well. We just need to realize
> that it exists.
> 
> Cheers,
> Robbie
> 
> > -----Original Message-----
> > From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of
> > Tim Gruene
> > Sent: Tuesday, March 26, 2013 11:05
> > To: CCP4BB@JISCMAIL.AC.UK
> > Subject: Re: [ccp4bb] Rfree reflections
> > 
> > Hi Robbie,
> > 
> > thank you for the explanation. Heinz Gut and Michael Hadders pointed me at
> > Axel Brunger's publication Methods Enzymol. 1997;277:366-96.,
> > http://www.ncbi.nlm.nih.gov/pubmed/18488318, which is where I got the
> > notion of 500-1000 from. In this article a decrease of the error margin
> > of Rfree with n^(1/2) is mentioned (p.384), but only as an observation.
> > Is your statement "inverse proportional with the number of reflections"
> > based on some statistical treatment, or also just on observation?
> > 
> > It is a pity that k-fold cross-validation is not standard routine because
> > it seems so easy and so quick to do with today's computers and a simple
> > script. But that's probably like reminding people not to use R_int
> > anymore in favour of R_meas...
> > 
> > Cheers,
> > Tim
> > 
> > On Tue, Mar 26, 2013 at 10:24:51AM +0100, Robbie Joosten wrote:
> > > Hi Tim,
> > >
> > > I don't think the 5-10% or 500-1000 reflections are real rules, but
> > > rather practical choices. The error margin in R-free is inverse
> > > proportional with the number of reflections in your test set and also
> > > proportional with R-free itself. So for R-free to be 'significant' you
> > > need some absolute number of reflections to reach your cut-off of
> > > significance. This is where the 1000 comes from (500 is really pushing
> > > the limit).
> > > You want to make sure the error margin in R and R-free are not too far
> > > apart and you probably also want to keep the test set representative
> > > of the whole data set (this is particularly important because we use
> > > hold-out validation, you only get one shot at validating). This is where
> > > the 5%-10% comes from.
> > > Another consideration for going for the 5%-10% thing is that this
> > > makes it feasible to do 'full' (i.e. k-fold) cross-validation: you
> > > only have to do 20-10 refinements.  If you would go for 1000 reflections
> > > you would have to do 48 refinements for the average dataset.
> > >
> > > Personally, I take 5% and increase this percentage to maximum 10% if
> > > using 5% gives me a test set smaller than 1000 reflections.
> > >
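One possible reading of the personal rule above (take 5%, raise it to at most 10% if 5% gives fewer than 1000 test reflections) can be sketched as below. The function names, and the choice to raise the fraction only as far as needed to reach 1000 reflections, are assumptions made for illustration; the message does not say how the increase is chosen. The second function reflects the point that k-fold cross-validation needs about 1/fraction refinements.

```python
def choose_test_fraction(n_unique: int, target: int = 1000,
                         low: float = 0.05, high: float = 0.10) -> float:
    """Pick a test-set fraction between `low` and `high`.

    Assumption: start at 5% and raise the fraction just enough to reach
    `target` test reflections, never exceeding 10%.
    """
    if n_unique <= 0:
        raise ValueError("n_unique must be positive")
    needed = target / n_unique  # fraction that would give exactly `target`
    return min(max(needed, low), high)


def n_fold_refinements(fraction: float) -> int:
    """Number of refinements for 'full' k-fold cross-validation."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return round(1.0 / fraction)


if __name__ == "__main__":
    # Hypothetical dataset sizes, chosen only to exercise each branch.
    for n_unique in (5000, 15000, 50000):
        f = choose_test_fraction(n_unique)
        print(n_unique, round(f, 3), round(f * n_unique), n_fold_refinements(f))
```

Under these assumptions a 5% test set corresponds to 20 refinements and a 10% test set to 10, matching the '20-10 refinements' mentioned above; a dataset too small to reach 1000 reflections even at 10% simply stays at the 10% cap.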
> > > HTH,
> > > Robbie
> > >
> > > > -----Original Message-----
> > > > From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf
> > > > Of Tim Gruene
> > > > Sent: Tuesday, March 26, 2013 09:33
> > > > To: CCP4BB@JISCMAIL.AC.UK
> > > > Subject: [ccp4bb] Rfree reflections
> > > >
> > > > Dear all,
> > > >
> > > > I recall that the set of Rfree reflections should be 500-1000,
> > > > rather than 5-10%, but I cannot find the reference for it (maybe Ian
> > > > Tickle?).
> > > >
> > > > I would therefore like to be confirmed or corrected:
> > > >
> > > > Is there an absolute number required for Rfree to be significant, i.e.
> > > > 500-1000 irrespective of the total number of unique reflections in the
> > > > data set, or is it 5-10% (as a compromise)?
> > > >
> > > > Thanks and regards,
> > > > Tim
> > > >
> > > > --
> > > > --
> > > > Dr Tim Gruene
> > > > Institut fuer anorganische Chemie
> > > > Tammannstr. 4
> > > > D-37077 Goettingen
> > > >
> > > > GPG Key ID = A46BEE1A
> > >
> > 
> > --
> > --
> > Dr Tim Gruene
> > Institut fuer anorganische Chemie
> > Tammannstr. 4
> > D-37077 Goettingen
> > 
> > GPG Key ID = A46BEE1A

------
Randy J. Read
Department of Haematology, University of Cambridge
Cambridge Institute for Medical Research      Tel: + 44 1223 336500
Wellcome Trust/MRC Building                   Fax: + 44 1223 336827
Hills Road                                    E-mail: rj...@cam.ac.uk
Cambridge CB2 0XY, U.K.                       www-structmed.cimr.cam.ac.uk
