Ahh, waters. Where would structure-related debate be without them?
I see. So if your default refinement procedure is to add an unspecified
number of waters, then yes Rwork might not be all that useful, as it
will depend on how the building goes.
Again, it all depends on what you want your data to do. If you are
looking for subtle difference features, such as a bound ligand,
mutation, etc., then clear identification of weak density should be your
"score". So, I say:
1) pick some weak density
2) omit the model under it
3) refine to convergence
4) measure the Fo-Fc difference peak
I tend to use the MAPMAN "peek" function for this (a rough sketch is
below), but I'm sure there are other ways. I say use the SAME starting
model each time: one where you have built in the obvious waters, ions,
etc., but left out the borderline or otherwise inconsistent ones. Then
pick a low-lying feature as your test density.
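Here is a minimal sketch of steps 3 and 4, assuming refmac5 for the
refinement and the USF MAPMAN "peek" command for the peak readout. File
names are placeholders, the number of cycles is arbitrary, and the exact
MAPMAN syntax is worth checking against its manual:

# refine the omit model to (near) convergence
refmac5 hklin data.mtz xyzin omit_model.pdb \
        hklout refined.mtz xyzout refined.pdb << EOF
ncyc 40
EOF

# turn refmac's mFo-DFc coefficients into a difference map
fft hklin refined.mtz mapout fofc.map << EOF
labin F1=DELFWT PHI=PHDELWT
EOF

# read the map value at the omitted-atom positions;
# omitted_atoms.pdb contains only the atoms that were left out
mapman << EOF
read m1 fofc.map ccp4
peek value m1 omitted_atoms.pdb peeked.pdb
quit
EOF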
Do not use molecular replacement. Use pointless with your starting model
on "xyzin" to re-index the data so that it matches the model. Super fast
and easy to do. No origin issues, and it doesn't modify the pdb.
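For example (file names mine):

pointless xyzin starting_model.pdb hklin processed.mtz hklout reindexed.mtz

Pointless picks whichever indexing alternative agrees best with the
reference coordinates and writes the re-indexed reflections to hklout,
leaving the PDB file untouched.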
Aside: I actually have a program for removing unneeded waters I call
"watershed". It is not fast, but it is thorough, and you only need to
do it for your reference structure. You will need these programs:
https://github.com/fraser-lab/holton_scripts/tree/master/watershed
https://github.com/fraser-lab/holton_scripts/blob/master/converge_refmac.com
plus a PDB file, an MTZ file, and a file called refmac_opts.txt that
contains all the refmac5 keywords you want to use (if any). You will
also want a lot of CPUs; the script works with the PBS and SGE clusters
I have access to (and I'm working on Slurm). What watershed does is
delete waters one at a time and re-refine to convergence. As a control,
you also refine the starting structure for the same number of cycles.
Each "minus one water" structure gets its own CPU. Once everything
settles, you look at the final Rwork values. If deleting a water ends up
making Rwork better? ... then you probably shouldn't have built it in
the first place. That water is evil and must go. After throwing out the
worst water, you now have a new starting point. In some published
structures more than 100 waters can be eliminated this way. This almost
always brings Rwork and Rfree closer together, even though Rfree does
not enter into any automated decisions.
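The real script is at the GitHub link above; stripped of all the cluster
bookkeeping, the core idea is roughly the loop below. This is only a
sketch: it assumes converge_refmac.com takes a PDB and an MTZ on its
command line, and it parses the PDB by whitespace, which a real script
should not do.

#!/bin/bash
# watershed, the cartoon version: re-refine with each water deleted in turn
mkdir -p trials
# list waters as chain_resnum (fields: record serial name resname chain resnum ...)
awk '$4 == "HOH" {print $5 "_" $6}' start.pdb | sort -u > waters.txt
while read id; do
    # the model with this one water removed
    awk -v id="$id" '!($4 == "HOH" && $5 "_" $6 == id)' start.pdb > trials/minus_$id.pdb
    # in the real thing, each of these refinements gets its own CPU/queue slot
    ./converge_refmac.com trials/minus_$id.pdb data.mtz > trials/minus_$id.log
done < waters.txt
# control: refine the unmodified start.pdb for the same number of cycles, then
# compare final Rwork values; the water whose removal improves Rwork the most
# is the one to delete, and the whole procedure repeats from the new model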
Using simulated data (where I know the ground truth) I find the
watershed procedure tends to undo all the horrible things that happen
after you get over-aggressive and stuff waters into every little peak
you see. Eventually, as you add more noise waters, Rfree starts to go
up, and the map starts to look less like the ground truth, but Rwork
keeps going down the more waters you add. What watershed does pretty
reliably is bring you back to the point just before where Rfree started
to take a turn for the worse, and you can do this without ever looking
at Rfree!
Of course, it is always better to not put in bad waters in the first
place, but sometimes it's hard to tell.
Anyway, I suggest using a watershed-ed model as your reference.
Hope that is helpful in some way?
-James Holton
MAD Scientist
On 11/2/2021 5:01 PM, Murpholino Peligro wrote:
That's exactly what I am doing...
citing David...
"I expect the A and B data sets to be quite similar, but I would like
to evaluate which protocol was "better", and I want to do this
quickly, ideally looking at a single number."
and
"I do want to find a way to assess the various tweaks I can try in
data processing for a single case"
Why not do all those things with Rwork?
I thought that comparing Rfree rather than Rwork was going to be
easier, because last week the structure was still dehydrated, so the
refinement program added "strong waters", and due to a thousand or so
extra reflections I could end up with a dozen or so extra waters. The
difference in Rwork between protocols arising from those extra waters
was going to be a little more difficult to compare. I now have the
final structure, so I could very well compare Rwork after another
round of refinement, maybe randomizing the ADPs at the beginning or
something.
Thanks a lot.
On Mon, 1 Nov 2021 at 03:22, David Waterman
(dgwater...@gmail.com) wrote:
Hi James,
What you wrote makes lots of sense. I had not heard about Rsleep,
so that looks like interesting reading, thanks.
I have often used Rfree as a simple tool to compare two protocols.
If I am not actually optimising against Rfree but just using it
for a one-off comparison then that is okay, right?
Let's say I have two data processing protocols, A and B. Between
these I might be exploring some difference in options within one
data processing program, perhaps different geometry refinement
parameters, or scaling options. I expect the A and B data sets to
be quite similar, but I would like to evaluate which protocol was
"better", and I want to do this quickly, ideally looking at a
single number. I don't like I/sigI because I don't trust the
sigmas, CC1/2 is often noisy, and I'm totally sworn off merging R
statistics for these purposes. I tend to use Rfree as an
easily-available metric, independent from the data processing
program and the merging stats. It also allows a comparison of A
and B in terms of the "product" of crystallography, namely the
refined structure. In this I am lucky because I'm not trying to
solve a structure. I may be looking at lysozyme or proteinase K:
something where I can download a pretty good approximation to the
truth from the PDB.
So, what I do is process the data by A and process by B, ensure
the data sets have the same free set, then refine to convergence
(or at least, a lot of cycles) starting from a PDB structure. I
then evaluate A vs B in terms of Rfree, though without an error
bar on Rfree I don't read too much into small differences.
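(One way to make sure both data sets carry the same free set is to copy
the FreeR_flag column from one MTZ into the other with CAD; a sketch,
assuming both files are merged on the same indexing and that the column
labels below are replaced with the actual ones in your files:)

cad hklin1 protocolB.mtz hklin2 protocolA.mtz hklout protocolB_sameflags.mtz << EOF
labin file 1 E1=IMEAN E2=SIGIMEAN
labin file 2 E1=FreeR_flag
EOF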
Does this procedure seem sound? Perhaps it could be improved by
randomly jiggling the atoms in the starting structure, in case the
PDB deposition had already followed an A- or B-like protocol.
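(For the jiggling itself, pdbset's NOISE keyword is one quick option, if
memory serves; a sketch, with the 0.2 Å shift magnitude just a guess at
something sensible:)

pdbset xyzin start.pdb xyzout shaken.pdb << EOF
noise 0.2
EOF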
Perhaps the whole approach is suspect. Certainly I wouldn't want
to generalise by saying that A or B is better in all cases, but I
do want to find a way to assess the various tweaks I can try in
data processing for a single case.
Any thoughts? I appreciate the wisdom of the BB here.
Cheers
-- David
On Fri, 29 Oct 2021 at 15:50, James Holton <jmhol...@lbl.gov> wrote:
Well, of all the possible metrics you could use to assess data
quality, Rfree is probably the worst one. This is because it
is a cross-validation metric, and cross-validations don't work
if you use them as an optimization target. You can try, and
might even make a little headway, but then your free set is
burnt. If you have a third set of observations, as suggested
for Rsleep (doi:10.1107/S0907444907033458), then you have a
chance at another round of cross-validation. Crystallographers
don't usually do this, but it has become standard practice in
machine learning (training=Rwork, validation=Rfree and
testing=Rsleep).
So, unless you have an Rsleep set, any time you contemplate
doing a bunch of random things and picking the best Rfree ...
don't. Just don't. There madness lies.
What happens after doing this is you will be initially happy
about your lower Rfree, but everything you do after that will
make it go up more than it would have had you not performed
your Rfree optimization. This is because the changes in the
data that made Rfree randomly better were actually noise, and
as the structure becomes more correct it will move away from
that noise. It's always better to optimize on something else,
and then check your Rfree as infrequently as possible.
Remember it is the control for your experiment. Never mix your
positive control with your sample.
As for the best metric to assess data quality? Well, what are
you doing with the data? There are always compromises in data
processing and reduction that favor one application over
another. If this is an "I just want the structure" project,
then score on the resolution where CC1/2 hits your favorite
value. For some that is 0.5, for others 0.3. I tend to use 0.0 so
I can cut the resolution later without re-processing. Whatever you
do, just be consistent.
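As a sketch of that scoring: if you pull a two-column table of shell
d-spacing and CC1/2 out of your merging log (file name and layout here
are made up), finding the crossing point is a one-liner:

# report the last (highest-resolution) shell where CC1/2 is still above
# the cutoff, assuming shells are listed from low to high resolution
awk -v cut=0.3 '$2 >= cut {res = $1} END {print "CC1/2 stays above", cut, "down to", res, "Angstrom"}' cc_half_by_shell.txt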
If it's for anomalous, score on CCanom, or if that's too noisy,
on the Imean/sigma in the lowest-angle resolution or
highest-intensity bin. This is because for anomalous you want
to minimize relative error. The end-all-be-all of anomalous
signal strength is the phased anomalous difference Fourier.
You need phases to do one, but if you have a structure just
omit an anomalous scatterer of interest, refine to
convergence, and then measure the peak height at the position
of the omitted anomalous atom. Instructions for doing
anomalous refinement in refmac5 are here:
https://www2.mrc-lmb.cam.ac.uk/groups/murshudov/content/refmac/refmac_keywords.html
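A sketch of that omit-scatterer test, following the same
refine-then-peek pattern. The DELFAN/PHDELAN labels are what I believe
refmac5 writes for the anomalous difference map when refining against
F+/F-; check your own output MTZ, and see the link above for the
anomalous refinement keywords themselves:

# refine with the anomalous scatterer of interest omitted, using anomalous data
refmac5 hklin data.mtz xyzin omit_anom.pdb \
        hklout refined.mtz xyzout refined.pdb < refmac_anom_opts.txt

# phased anomalous difference Fourier from the refmac map coefficients
fft hklin refined.mtz mapout anom.map << EOF
labin F1=DELFAN PHI=PHDELAN
EOF

# peak height at the omitted atom's position, e.g. with MAPMAN "peek"
mapman << EOF
read m1 anom.map ccp4
peek value m1 omitted_anom_atom.pdb peeked.pdb
quit
EOF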
If you're looking for a ligand you probably want isomorphism,
and in that case refining with a reference structure looking
for low Rwork is not a bad strategy. This will tend to select
for crystals containing a molecule that looks like the one you
are refining. But be careful! If the reference is an apo structure, your
ligand-bound crystals will have higher Rwork due to the very
difference density you are looking for.
But if it's the same data just being processed in different
ways, first make a choice about what you are interested in,
and then optimize on that. Just don't optimize on Rfree!
-James Holton
MAD Scientist
On 10/27/2021 8:44 AM, Murpholino Peligro wrote:
Let's say I ran autoproc with different combinations of
options for a specific dataset, producing dozens of different
(but not so different) mtz files...
Then I ran phenix.refine with the same options for the same
structure but with all my mtz zoo.
What would be the best metric to say "hey this combo works
the best!"?
R-free?
Thanks
M. Peligro