That's exactly what I am doing... citing David: "I expect the A and B data sets to be quite similar, but I would like to evaluate which protocol was 'better', and I want to do this quickly, ideally looking at a single number." and "I do want to find a way to assess the various tweaks I can try in data processing for a single case."

Why not do all those things with Rwork? I thought comparing R-free rather than R-work was going to be easier, because last week the structure was dehydrated, so the refinement program added "strong waters": with a thousand or so extra reflections I could have a dozen or so extra waters, and the difference in R-work between protocols due to those extra waters was going to be a bit harder to compare. Now that I have the final structure, I could very well compare R-work by doing another round of refinement, maybe randomizing ADPs at the start or something.

Thanks a lot.

On Mon, 1 Nov 2021 at 03:22, David Waterman (dgwater...@gmail.com) wrote:

> Hi James,
>
> What you wrote makes a lot of sense. I had not heard about Rsleep, so that looks like interesting reading, thanks.
>
> I have often used Rfree as a simple tool to compare two protocols. If I am not actually optimising against Rfree but just using it for a one-off comparison, then that is okay, right?
>
> Let's say I have two data processing protocols, A and B. Between these I might be exploring some difference in options within one data processing program, perhaps different geometry refinement parameters or scaling options. I expect the A and B data sets to be quite similar, but I would like to evaluate which protocol was "better", and I want to do this quickly, ideally looking at a single number. I don't like I/sigI because I don't trust the sigmas, CC1/2 is often noisy, and I'm totally sworn off merging R statistics for these purposes. I tend to use Rfree as an easily available metric, independent of the data processing program and the merging stats. It also allows a comparison of A and B in terms of the "product" of crystallography, namely the refined structure.
> In this I am lucky because I'm not trying to solve a structure. I may be looking at lysozyme or proteinase K: something where I can download a pretty good approximation to the truth from the PDB.
>
> So, what I do is process the data by A and process by B, ensure the data sets have the same free set, then refine to convergence (or at least for a lot of cycles) starting from a PDB structure. I then evaluate A vs B in terms of Rfree, though without an error bar on Rfree I don't read too much into small differences.
>
> Does this procedure seem sound? Perhaps it could be improved by randomly jiggling the atoms in the starting structure, in case the PDB deposition had already followed an A- or B-like protocol. Perhaps the whole approach is suspect. Certainly I wouldn't want to generalise by saying that A or B is better in all cases, but I do want to find a way to assess the various tweaks I can try in data processing for a single case.
>
> Any thoughts? I appreciate the wisdom of the BB here.
>
> Cheers
>
> -- David
>
> On Fri, 29 Oct 2021 at 15:50, James Holton <jmhol...@lbl.gov> wrote:
>
>> Well, of all the possible metrics you could use to assess data quality, Rfree is probably the worst one. This is because it is a cross-validation metric, and cross-validations don't work if you use them as an optimization target. You can try, and might even make a little headway, but then your free set is burnt. If you have a third set of observations, as suggested for Rsleep (doi:10.1107/S0907444907033458), then you have a chance at another round of cross-validation. Crystallographers don't usually do this, but it has become standard practice in machine learning (training=Rwork, validation=Rfree, and testing=Rsleep).
>>
>> So, unless you have an Rsleep set, any time you contemplate doing a bunch of random things and picking the best Rfree ... don't. Just don't. There madness lies.
>> What happens after doing this is that you will initially be happy about your lower Rfree, but everything you do after that will make it go up more than it would have had you not performed your Rfree optimization. This is because the change in the data that made Rfree randomly better was actually noise, and as the structure becomes more correct it will move away from that noise. It's always better to optimize on something else and then check your Rfree as infrequently as possible. Remember, it is the control for your experiment. Never mix your positive control with your sample.
>>
>> As for the best metric to assess data quality? Well, what are you doing with the data? There are always compromises in data processing and reduction that favor one application over another. If this is an "I just want the structure" project, then score on the resolution where CC1/2 hits your favorite value. For some that is 0.5, for others 0.3. I tend to use 0.0 so I can cut it later without re-processing. Whatever you do, just be consistent about it.
>>
>> If it's for anomalous, score on CCanom, or if that's too noisy, the Imean/sigma in the lowest-angle resolution or highest-intensity bin. This is because for anomalous you want to minimize relative error. The end-all-be-all of anomalous signal strength is the phased anomalous difference Fourier. You need phases to do one, but if you have a structure, just omit an anomalous scatterer of interest, refine to convergence, and then measure the peak height at the position of the omitted anomalous atom. Instructions for doing anomalous refinement in refmac5 are here:
>>
>> https://www2.mrc-lmb.cam.ac.uk/groups/murshudov/content/refmac/refmac_keywords.html
>>
>> If you're looking for a ligand you probably want isomorphism, and in that case refining with a reference structure and looking for low Rwork is not a bad strategy.
>> This will tend to select for crystals containing a molecule that looks like the one you are refining. But be careful! If it is an apo structure, your ligand-bound crystals will have higher Rwork due to the very difference density you are looking for.
>>
>> But if it's the same data just being processed in different ways, first make a choice about what you are interested in, and then optimize on that. Just don't optimize on Rfree!
>>
>> -James Holton
>> MAD Scientist
>>
>> On 10/27/2021 8:44 AM, Murpholino Peligro wrote:
>>
>> Let's say I ran autoPROC with different combinations of options for a specific dataset, producing dozens of different (but not so different) mtz files... Then I ran phenix.refine with the same options for the same structure but with all my mtz zoo. What would be the best metric to say "hey, this combo works the best!"? R-free?
>>
>> Thanks
>>
>> M. Peligro

########################################################################

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list hosted by www.jiscmail.ac.uk, terms & conditions are available at https://www.jiscmail.ac.uk/policyandsecurity/
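[A postscript on the bookkeeping: the A-vs-B comparison David describes boils down to refining the same starting model against each data set, with the same free set, and then reading Rwork/Rfree out of each refined model. A minimal Python sketch of that last step, parsing the REMARK 3 records of a PDB header; the exact label wording varies between refinement programs, and the header lines and values below are made up for illustration:]

```python
# Pull (Rwork, Rfree) out of PDB REMARK 3 records so that two refinement
# runs (protocol A vs protocol B) can be compared with a single number each.
# NOTE: REMARK 3 wording differs between phenix.refine, REFMAC, etc. --
# treat these patterns as a starting point and check them against your files.
import re

def r_values(lines):
    """Return (Rwork, Rfree) parsed from an iterable of PDB header lines."""
    rwork = rfree = None
    for line in lines:
        if not line.startswith("REMARK   3"):
            continue
        m = re.search(r"R VALUE\s+\(WORKING SET\)\s*:\s*([0-9.]+)", line)
        if m:
            rwork = float(m.group(1))
        m = re.search(r"FREE R VALUE\s*:\s*([0-9.]+)", line)
        # Guard against "ERROR OF FREE R VALUE" and "FREE R VALUE TEST SET" lines.
        if m and "ERROR" not in line and "TEST" not in line:
            rfree = float(m.group(1))
    return rwork, rfree

# Demo on a synthetic header (values invented for illustration):
header = [
    "REMARK   3   R VALUE            (WORKING SET) : 0.172",
    "REMARK   3   FREE R VALUE                     : 0.198",
    "REMARK   3   FREE R VALUE TEST SET COUNT      : 1205",
]
print(r_values(header))  # -> (0.172, 0.198)
```

In practice one would call `r_values(open("refined_A.pdb"))` and the same for a hypothetical `refined_B.pdb`, remembering James's caveat: this is a one-off comparison, not an optimization target, and small Rfree differences without an error bar mean little.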