Okay, that /is/ a strong answer: Rmeas has too many infinities for
comfort. Thanks, very instructive yet again!
phx
On 07/07/2017 18:57, James Holton wrote:
I happen to be one of those people who think Rmerge is a very useful
statistic. Not as a method of evaluating the resolution limit, which
is mathematically ridiculous, but for a host of other important
things, like evaluating the performance of data collection equipment,
and evaluating the isomorphism of different crystals, to name a few.
I like Rmerge because it is a simple statistic that has a simple
formula and has not undergone any "corrections". Corrections increase
complexity, and complexity opens the door to manipulation by the
desperate and/or misguided. For example, overzealous outlier
rejection is a common way to abuse R factors, and it is far too often
swept under the rug, sometimes without the user even knowing about
it. This is especially problematic when working in a regime where the
statistic of interest is unstable, and for R factors this is low
intensity data. Rejecting just the right "outliers" can make any R
factor look a lot better. Why would Rmeas be any more unstable than
Rmerge? Look at the formula. There is an "n-1" in the denominator,
where n is the multiplicity. So, what happens when n approaches 1 ?
What happens when n=1? This is not to say Rmerge is better than Rmeas.
In fact, I believe the latter is generally superior to the former,
unless you are working near n = 1. The sqrt(n/(n-1)) factor is trying to
correct for bias in the R statistic, but fighting one infinity with
another infinity is a dangerous game.
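To make that concrete, here is a minimal Python sketch of both
statistics, computed from lists of symmetry-equivalent measurements
(the function name and data layout are mine, purely for illustration,
not code from any actual data-processing package). Note how
reflections measured only once have to be skipped, because the
n/(n-1) term is undefined for them:

    from math import sqrt

    def rmerge_rmeas(reflections):
        """Rmerge and Rmeas over lists of symmetry-equivalent intensities.

        `reflections` is a list of lists; each inner list holds the n repeated
        measurements of one unique reflection.  Reflections with n = 1 are
        skipped: they contribute nothing to Rmerge, and the n/(n-1) factor in
        Rmeas is undefined for them -- exactly the instability discussed above.
        """
        num_merge = num_meas = denom = 0.0
        for obs in reflections:
            n = len(obs)
            if n < 2:
                continue
            mean_i = sum(obs) / n
            dev = sum(abs(i - mean_i) for i in obs)
            num_merge += dev
            num_meas += sqrt(n / (n - 1)) * dev
            denom += sum(obs)
        return num_merge / denom, num_meas / denom

    # one reflection measured 10 times, one measured only twice
    print(rmerge_rmeas([[100, 95, 105, 102, 98, 101, 99, 103, 97, 100],
                        [50, 60]]))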
My point is that neither Rmerge nor Rmeas are easily interpreted
without knowing the multiplicity. If you see Rmeas = 10% and the
multiplicity is 10, then you know what that means. Same for Rmerge,
since at n=10 both stats have nearly the same value. But if you have
Rmeas = 45% and multiplicity = 1.05, what does that mean? Rmeas will
be only 33% if the multiplicity is rounded up to 1.1. This is what I
mean by "numerical instability", the value of the R statistic itself
becomes sensitive to small amounts of noise, and behaves more and more
like a random number generator. And if you have Rmeas = 33% and no
indication of multiplicity, it is hard to know what is going on. I
personally am a lot more comfortable seeing qualitative agreement
between Rmerge and Rmeas, because that means the numerical instability
of the multiplicity correction didn't mess anything up.
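The numbers in that example follow directly from the correction
factor; a throwaway snippet (again just an illustration, same caveats
as above) shows how violently sqrt(n/(n-1)) swings as n approaches 1:

    from math import sqrt

    # the Rmeas multiplicity correction, evaluated at a few multiplicities
    for n in (1.05, 1.1, 2.0, 10.0):
        print(f"n = {n:5.2f}   sqrt(n/(n-1)) = {sqrt(n / (n - 1)):.2f}")
    # n =  1.05 -> 4.58,  n =  1.10 -> 3.32  (same data: Rmeas ~45% vs ~33%)
    # n =  2.00 -> 1.41,  n = 10.00 -> 1.05  (Rmerge and Rmeas nearly agree)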
Of course, when the intensity is weak R statistics in general are not
useful. Both Rmeas and Rmerge have the sum of all intensities in the
denominator, so when the bin-wide sum approaches zero you have another
infinity to contend with. This one starts to rear its ugly head once
I/sigma drops below about 3, and this is why our ancestors always
applied a sigma cutoff before computing an R factor. Our
small-molecule colleagues still do this! They call it "R1". And it
is an excellent indicator of the overall relative error. The relative
error in the outermost bin is not meaningful, and strangely enough
nobody ever reported the outer-resolution Rmerge before 1995.
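As a sketch of that old-fashioned practice (a hypothetical helper
reusing rmerge_rmeas from the first sketch, not the actual
small-molecule R1 recipe): discard the weak measurements before
merging, so the denominator never gets close enough to zero to
inflate the statistic.

    def filter_by_i_over_sigma(reflections, sigmas, cutoff=3.0):
        """Keep only measurements with I/sigma(I) >= cutoff.

        `reflections` and `sigmas` are parallel lists of lists, one inner list
        per unique reflection.  Everything below the cutoff is discarded before
        the R factors are computed.
        """
        kept = []
        for obs, sigs in zip(reflections, sigmas):
            strong = [i for i, s in zip(obs, sigs) if s > 0 and i / s >= cutoff]
            if len(strong) > 1:
                kept.append(strong)
        return kept

    # usage (with made-up numbers):
    # rmerge, rmeas = rmerge_rmeas(filter_by_i_over_sigma(intensities, sigmas))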
For weak signals, Correlation Coefficients are better, but for strong
signals CC pegs out at >95%, making it harder to see relative errors.
I/sigma is what we'd like to know, but the value of "sigma" is still
prone to manipulation by not just outlier rejection, but massaging the
so-called "error model". Suffice it to say, crystallographic data
contain more than one type of error. Some sources are important for
weak spots, others are important for strong spots, and still others
are only apparent in the mid-range. Some sources of error are only
important at low multiplicity, and others only manifest at high
multiplicity. There is no single number that can be used to evaluate
all aspects of data quality.
So, I remain a champion of reporting Rmerge. Not in the high-angle
bin, because that is essentially a random number, but overall Rmerge
and low-angle-bin Rmerge, reported next to multiplicity, Rmeas, CC1/2
and other statistics, are the only way you can glean enough information
about where the errors in the data are coming from. Rmeas is a useful
addition because it helps us correct for multiplicity without having
to do math in our head. Users generally thank you for that. Rmerge,
however, has served us well for more than half a century, and I
believe Uli Arndt knew what he was doing. I hope we all know enough
about history to realize that future generations seldom thank their
ancestors for "protecting" them from information.
-James Holton
MAD Scientist
On 7/5/2017 10:36 AM, Graeme Winter wrote:
Frank,
you are asking me to remove features that I like, so I would feel
that the challenge is for you to prove that this is harmful. However:
- at a minimum, I find it a useful checksum that the stats are
internally consistent (though I interpret it for lots of other
reasons too)
- it is faulty, I agree, but (with caveats) still useful IMHO
Sorry for being terse, but I remain to be convinced that removing it
increases the amount of information.
CC’ing BB as requested
Best wishes Graeme
On 5 Jul 2017, at 17:17, Frank von Delft
<frank.vonde...@sgc.ox.ac.uk> wrote:
You keep not answering the challenge.
It's really simple: what information does Rmerge provide that Rmeas
doesn't?
(If you answer, email to the BB.)
On 05/07/2017 16:04, graeme.win...@diamond.ac.uk wrote:
Dear Frank,
You are forcefully arguing essentially that others are wrong if we
feel an existing statistic continues to be useful, and instead
insist that it be outlawed so that we may not make use of it, just
in case someone misinterprets it.
Very well
I do however express disquiet that we as software developers feel
browbeaten to remove the output we find useful because “the
community” feel that it is obsolete.
I feel that Jacob’s short story on this thread illustrates that
educating the next generation of crystallographers to understand
what all of the numbers mean is critical, and that a numerological
approach of trying to optimise any one statistic is essentially
doomed. Precisely the same argument could be made for people
cutting the “resolution” at the wrong place in order to improve the
average I/sig(I) of the data set.
Denying access to information is not a solution to
misinterpretation, from where I am sat; however, I acknowledge that
other points of view exist.
Best wishes Graeme
On 5 Jul 2017, at 12:11, Frank von Delft
<frank.vonde...@sgc.ox.ac.uk> wrote:
Graeme, Andrew
Jacob is not arguing against an R-based statistic; he's pointing
out that leaving out the multiplicity-weighting is prehistoric
(Diederichs & Karplus published it 20 years ago!).
So indeed: Rmerge, Rpim and I/sigI give different information.
As you say.
But no: Rmerge and Rmeas and Rcryst do NOT give different
information. Except:
* Rmerge is a (potentially) misleading version of Rmeas.
* Rcryst and Rmerge and Rsym are terms that no longer have
significance in the single cryo-dataset world.
phx.
On 05/07/2017 09:43, Andrew Leslie wrote:
I would like to support Graeme in his wish to retain Rmerge in
Table 1, essentially for exactly the same reasons.
I also strongly support Francis Reyes's comment about the usefulness
of Rmerge at low resolution, and I would add to his list that it
can also, in some circumstances, be more indicative of the wrong
choice of symmetry (too high) than the statistics that come from
POINTLESS (excellent though that program is!).
Andrew
On 5 Jul 2017, at 05:44, Graeme Winter
<graeme.win...@gmail.com> wrote:
HI Jacob
Yes, I got this - and I appreciate the benefit of Rmeas for
measuring agreement among small-multiplicity observations.
Having this *as well* is very useful, and I agree Rmeas / Rpim /
CC-half should be the primary “quality” statistics.
However, you asked if there is any reason to *keep* rather than
*eliminate* Rmerge, and I offered one :o)
I do not see what harm there is in reporting Rmerge, even if it is
just used in the inner shell or just used to capture a flavour of
the data set overall. I also appreciate that Rmerge and Rmeas
converge to the same value at large multiplicity, i.e.:
                                          Overall  InnerShell  OuterShell
Low resolution limit                        39.02       39.02        1.39
High resolution limit                        1.35        6.04        1.35
Rmerge (within I+/I-)                       0.080       0.057       2.871
Rmerge (all I+ and I-)                      0.081       0.059       2.922
Rmeas (within I+/I-)                        0.081       0.058       2.940
Rmeas (all I+ & I-)                         0.082       0.059       2.958
Rpim (within I+/I-)                         0.013       0.009       0.628
Rpim (all I+ & I-)                          0.009       0.007       0.453
Rmerge in top intensity bin                 0.050           -           -
Total number of observations              1265512       16212       53490
Total number unique                         17515         224        1280
Mean((I)/sd(I))                              29.7       104.3         1.5
Mn(I) half-set correlation CC(1/2)          1.000       1.000       0.778
Completeness                                100.0        99.7       100.0
Multiplicity                                 72.3        72.4        41.8
Anomalous completeness                      100.0       100.0       100.0
Anomalous multiplicity                       37.2        42.7        21.0
DelAnom correlation between half-sets       0.497       0.766      -0.026
Mid-Slope of Anom Normal Probability        1.039           -           -
(this is a good case for Rpim & CC-half as resolution limit criteria)
If the statistics you want to use are there & some others also,
what is the pressure to remove them? Surely we want to educate on
how best to interpret the entire table above to get a fuller
picture of the overall quality of the data? My 0th-order request
would be to publish the three shells as above ;o)
Cheers Graeme
On 4 Jul 2017, at 22:09, Keller, Jacob
<kell...@janelia.hhmi.org> wrote:
I suggested replacing Rmerge/sym/cryst with Rmeas, not Rpim. Rmeas
is simply Rmerge * sqrt(n/(n-1)), where n is the number of
measurements of that reflection. It's merely a way of correcting
for the multiplicity-related artifact of Rmerge, which is becoming
even more of a problem with data sets of increasing variability in
multiplicity. Consider the case of comparing a data set with a
multiplicity of 2 versus one of 100: equivalent data quality would
yield Rmerges diverging by a factor of ~1.4 (sqrt(2/1) is about
1.41, while sqrt(100/99) is about 1.005). But this has all been
covered before in several papers. It can be and is reported in
resolution bins, so it can be used exactly as you say. So, why not
"disappear" Rmerge from the software?
The only reason I could come up with for keeping it is history, or
comparisons to previous datasets, but anyway those comparisons would
be confounded by variability in multiplicity and a hundred other
things, so come on, developers, just comment it out!
JPK
-----Original Message-----
From:
graeme.win...@diamond.ac.uk
Sent: Tuesday, July 04, 2017 4:37 PM
To: Keller, Jacob
<kell...@janelia.hhmi.org>
Cc: ccp4bb@jiscmail.ac.uk
Subject: Re: [ccp4bb] Rmergicide Through Programming
HI Jacob
An unbiased estimate of the true unmerged I/sig(I) of your data (I
find this particularly useful at low resolution), i.e. if your inner
shell Rmerge is 10% your data agree very poorly; if it is 2%, your
data agree very well, provided you have sensible multiplicity…
obviously this depends on sensible interpretation. Rpim hides this
(though it tells you more about the quality of the average measurement).
Essentially, for I/sig(I) you can (by and large) adjust your sig(I)
values however you like if you were so inclined. You can only
adjust Rmerge by excluding measurements.
I would therefore argue that - amongst the other stats you
enumerate below - it still has a place.
Cheers Graeme
On 4 Jul 2017, at 14:10, Keller, Jacob
<kell...@janelia.hhmi.org> wrote:
"Rmerge does contain information which complements the others."
What information? I was trying to think of a counterargument to
what I proposed, but could not think of a reason in the world to
keep reporting it.
JPK
On 4 Jul 2017, at 12:00, Keller, Jacob
<kell...@janelia.hhmi.org> wrote:
Dear Crystallographers,
Having been repeatedly chagrined by the continued use and
reporting of Rmerge rather than Rmeas or similar, I thought of a
potential way to promote the change: what if merging programs
completely omitted Rmerge/cryst/sym? Is there some reason to continue
to report these stats, or are they just grandfathered into the
software? I doubt that any journal or crystallographer would insist
on reporting Rmerge per se. So, I wonder what developers would
think about commenting out a few lines of their code and seeing what
happens? Maybe a comment to the effect of "Rmerge is now
deprecated; use Rmeas" would be useful as well. Would something
catastrophic happen?
All the best,
Jacob Keller
*******************************************
Jacob Pearson Keller, PhD
Research Scientist
HHMI Janelia Research Campus / Looger lab
Phone: (571)209-4000 x3159
Email: kell...@janelia.hhmi.org
*******************************************