Ahh, waters.  Where would structure-related debate be without them?

I see. So if your default refinement procedure is to add an unspecified number of waters, then yes, Rwork might not be all that useful, as it will depend on how the building goes.

Again, it all depends on what you want your data to do.  If you are looking for subtle difference features, such as a bound ligand, a mutation, etc., then clear identification of weak density should be your "score".  So, I say:
1) pick some weak density
2) omit the model under it
3) refine to convergence
4) measure the Fo-Fc difference peak
  I tend to use the MAPMAN "peek" function for this, but I'm sure there are other ways.  I say use the SAME starting model each time: one where you have built in the obvious waters, ions, etc., but left out the borderline or otherwise inconsistent ones.  Then pick a low-lying feature as your test density.
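If you would rather script step 4 than do it by hand, here is a minimal sketch using the gemmi Python library instead of MAPMAN. It assumes a refmac5 output MTZ with the usual DELFWT/PHDELWT difference-map coefficients; the file names and the little PDB of omitted atoms are placeholders.

import gemmi
import numpy as np

# Map coefficients from a refmac5 run: DELFWT/PHDELWT are the usual
# Fo-Fc (difference map) labels.  File names are placeholders.
mtz = gemmi.read_mtz_file("refined.mtz")
grid = mtz.transform_f_phi_to_map("DELFWT", "PHDELWT", sample_rate=3.0)

# rms of the whole map, so peak heights can be quoted in sigma units
sigma = np.array(grid, copy=False).std()

# a small PDB containing only the atoms that were omitted before refinement
omitted = gemmi.read_structure("omitted_atoms.pdb")
for model in omitted:
    for chain in model:
        for residue in chain:
            for atom in residue:
                peak = grid.interpolate_value(atom.pos) / sigma
                print(chain.name, residue.seqid.num, residue.name,
                      atom.name, round(peak, 2))

The number you then track between protocols is the peak height (in sigma) over your chosen bit of weak density.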

Do not use molecular replacement. Use pointless with your starting model as "XYZIN" to re-index the data so that it matches the model. It is super fast and easy to do: no origin issues, and it doesn't modify the PDB.
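For example, to put a set of differently-processed MTZs on the same indexing as one reference model, something along these lines should do (file names are placeholders; shown as a Python wrapper, but the bare pointless command line works just as well):

import subprocess

# Re-index each candidate MTZ against the same reference model so they all
# end up on a consistent indexing convention.  File names are placeholders.
for tag in ("protocolA", "protocolB"):
    subprocess.run(
        ["pointless",
         "XYZIN", "reference.pdb",           # the common starting model
         "HKLIN", f"{tag}.mtz",              # merged data from this protocol
         "HKLOUT", f"{tag}_reindexed.mtz"],  # re-indexed output
        check=True,
    )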

Aside:  I actually have a program for removing unneeded waters I call "watershed".  It is not fast, but it is thorough, and you only need to do it for your reference structure. You will need these programs:
https://github.com/fraser-lab/holton_scripts/tree/master/watershed
https://github.com/fraser-lab/holton_scripts/blob/master/converge_refmac.com
You will also need a PDB, an MTZ, and a file called refmac_opts.txt containing any refmac5 keywords you want to use.  You will also want a lot of CPUs; the script works with the PBS and SGE clusters I have access to (Slurm support is in the works).

What watershed does is delete waters one at a time and re-refine each "minus one water" structure to convergence, each on its own CPU.  As a control, you also refine the unmodified starting structure for the same number of cycles.  Once everything settles, you look at the final Rwork values. If deleting a water ends up making Rwork better, then you probably shouldn't have built it in the first place. That water is evil and must go. After throwing out the worst water you have a new starting point, and you repeat. In some published structures more than 100 waters can be eliminated this way. It almost always brings Rwork and Rfree closer together, even though Rfree never enters into any of the automated decisions.
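In case the logic is easier to see as code than as prose, here is a toy Python sketch of a single round of that loop. The run_refinement() helper is hypothetical (in the real thing it would wrap converge_refmac.com, or its submission to your queue, and return the final Rwork), and the sketch assumes oxygen-only waters, i.e. one coordinate line per water.

from pathlib import Path

def run_refinement(pdb_path: Path) -> float:
    """Hypothetical: refine pdb_path to convergence (e.g. by running
    converge_refmac.com on a cluster node) and return the final Rwork."""
    raise NotImplementedError

start = Path("start.pdb")                    # placeholder file name
lines = start.read_text().splitlines(keepends=True)

# coordinate lines belonging to waters (residue name HOH in columns 18-20)
waters = [i for i, l in enumerate(lines)
          if l.startswith(("ATOM", "HETATM")) and l[17:20].strip() == "HOH"]

# control: the unmodified model, refined the same way
r_control = run_refinement(start)

# one "minus one water" model per water (one job per CPU in practice)
rwork = {}
for i in waters:
    trial = Path(f"minus_water_{i}.pdb")
    trial.write_text("".join(l for j, l in enumerate(lines) if j != i))
    rwork[i] = run_refinement(trial)

# the worst water is the one whose removal lowers Rwork the most
worst, r_best = min(rwork.items(), key=lambda kv: kv[1])
if r_best < r_control:
    print(f"deleting the water on line {worst + 1} drops Rwork "
          f"from {r_control:.4f} to {r_best:.4f}")

The real script then takes the best minus-one-water model as the new starting point and repeats until no single deletion improves Rwork.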

 Using simulated data (where I know the ground truth) I find the watershed procedure tends to undo all the horrible things that happen after you get over-aggressive and stuff waters into every little peak you see. Eventually, as you add more noise waters, Rfree starts to go up and the map starts to look less like the ground truth, but Rwork keeps going down the more waters you add.  What watershed does pretty reliably is bring you back to the point just before Rfree started to take a turn for the worse, and you can do this without ever looking at Rfree!

Of course, it is always better not to put in bad waters in the first place, but sometimes it's hard to tell.

Anyway, I suggest using a watershed-ed model as your reference.

Hope that is helpful in some way?

-James Holton
MAD Scientist


On 11/2/2021 5:01 PM, Murpholino Peligro wrote:
That's exactly what I am doing...
citing David...

"I expect the A and B data sets to be quite similar, but I would like to evaluate which protocol was "better", and I want to do this quickly, ideally looking at a single number."

and

"I do want to find a way to assess the various tweaks I can try in data processing for a single case"

Why not do all those things with Rwork?
I thought that comparing R-free rather than R-work was going to be easier, because last week the structure was dehydrated, so the refinement program added "strong waters". With a thousand or so extra reflections I could get a dozen or so extra waters, which would make the difference in R-work between protocols a little bit more difficult to compare. Now that I have the final structure I could very well compare R-work by doing another round of refinement, maybe randomizing the ADPs at the beginning or something.

Thanks a lot.

On Mon, Nov 1, 2021 at 03:22, David Waterman (dgwater...@gmail.com) wrote:

    Hi James,

    What you wrote makes lots of sense. I had not heard about Rsleep,
    so that looks like interesting reading, thanks.

    I have often used Rfree as a simple tool to compare two protocols.
    If I am not actually optimising against Rfree but just using it
    for a one-off comparison then that is okay, right?

    Let's say I have two data processing protocols, A and B. Between
    these I might be exploring some difference in options within one
    data processing program, perhaps different geometry refinement
    parameters, or scaling options. I expect the A and B data sets to
    be quite similar, but I would like to evaluate which protocol was
    "better", and I want to do this quickly, ideally looking at a
    single number. I don't like I/sigI because I don't trust the
    sigmas, CC1/2 is often noisy, and I'm totally sworn off merging R
    statistics for these purposes. I tend to use Rfree as an
    easily-available metric, independent from the data processing
    program and the merging stats. It also allows a comparison of A
    and B in terms of the "product" of crystallography, namely the
    refined structure. In this I am lucky because I'm not trying to
    solve a structure. I may be looking at lysozyme or proteinase K:
    something where I can download a pretty good approximation to the
    truth from the PDB.

    So, what I do is process the data by A and process by B, ensure
    the data sets have the same free set, then refine to convergence
    (or at least, a lot of cycles) starting from a PDB structure. I
    then evaluate A vs B in terms of Rfree, though without an error
    bar on Rfree I don't read too much into small differences.

    Does this procedure seem sound? Perhaps it could be improved by
    randomly jiggling the atoms in the starting structure, in case the
    PDB deposition had already followed an A- or B-like protocol.
    Perhaps the whole approach is suspect. Certainly I wouldn't want
    to generalise by saying that A or B is better in all cases, but I
    do want to find a way to assess the various tweaks I can try in
    data processing for a single case.

    Any thoughts? I appreciate the wisdom of the BB here.

    Cheers

    -- David


    On Fri, 29 Oct 2021 at 15:50, James Holton <jmhol...@lbl.gov> wrote:


        Well, of all the possible metrics you could use to assess data
        quality, Rfree is probably the worst one.  This is because it
        is a cross-validation metric, and cross-validations don't work
        if you use them as an optimization target. You can try, and
        might even make a little headway, but then your free set is
        burnt. If you have a third set of observations, as suggested
        for Rsleep (doi:10.1107/S0907444907033458), then you have a
        chance at another round of cross-validation. Crystallographers
        don't usually do this, but it has become standard practice in
        machine learning (training=Rwork, validation=Rfree and
        testing=Rsleep).

        So, unless you have an Rsleep set, any time you contemplate
        doing a bunch of random things and picking the best Rfree ...
        don't.  Just don't.  There madness lies.

        What happens after doing this is you will be initially happy
        about your lower Rfree, but everything you do after that will
        make it go up more than it would have had you not performed
        your Rfree optimization. This is because the changes in the
        data that made Rfree randomly better were actually noise, and
        as the structure becomes more correct it will move away from
        that noise. It's always better to optimize on something else,
        and then check your Rfree as infrequently as possible.
        Remember it is the control for your experiment. Never mix your
        positive control with your sample.

        As for the best metric to assess data quality?  Well, what are
        you doing with the data? There are always compromises in data
        processing and reduction that favor one application over
        another.  If this is an "I just want the structure" project,
        then score on the resolution where CC1/2 hits your favorite
        value. For some that is 0.5, others 0.3. I tend to use 0.0 so
        I can cut it later without re-processing.  Whatever you do
        just make it consistent.

        If it's for anomalous, score on CCanom, or if that's too noisy,
        on Imean/sigma in the lowest-angle resolution or
        highest-intensity bin. This is because for anomalous you want
        to minimize relative error. The end-all-be-all of anomalous
        signal strength is the phased anomalous difference Fourier.
        You need phases to do one, but if you have a structure just
        omit an anomalous scatterer of interest, refine to
        convergence, and then measure the peak height at the position
        of the omitted anomalous atom.  Instructions for doing
        anomalous refinement in refmac5 are here:
        
https://www2.mrc-lmb.cam.ac.uk/groups/murshudov/content/refmac/refmac_keywords.html

        If you're looking for a ligand you probably want isomorphism,
        and in that case refining with a reference structure looking
        for low Rwork is not a bad strategy. This will tend to select
        for crystals containing a molecule that looks like the one you
        are refining.  But be careful! If it is an apo structure your
        ligand-bound crystals will have higher Rwork due to the very
        difference density you are looking for.

        But if it's the same data just being processed in different
        ways, first make a choice about what you are interested in,
        and then optimize on that.  Just don't optimize on Rfree!

        -James Holton
        MAD Scientist


        On 10/27/2021 8:44 AM, Murpholino Peligro wrote:
        Let's say I ran autoproc with different combinations of
        options for a specific dataset, producing dozens of different
        (but not so different) mtz files...
        Then I ran phenix.refine with the same options for the same
        structure but with all my mtz zoo
        What would be the best metric to say "hey this combo works
        the best!"?
        R-free?
        Thanks

        M. Peligro
