That's exactly what I am doing... citing David: "I expect the A and B data sets to be quite similar, but I would like to evaluate which protocol was 'better', and I want to do this quickly, ideally looking at a single number." and "I do want to find a way to assess the various tweaks I can try in data processing for a single case."

Why not do all those things with Rwork? I thought comparing R-free rather than R-work was going to be easier, because last week the structure was dehydrated, so the refinement program added "strong waters": with a thousand or so extra reflections I could have a dozen or so extra waters, and the difference in R-work between protocols due to those extra waters was going to be a bit harder to compare. Now that I have the final structure, I could very well compare R-work by doing another round of refinement, maybe randomizing ADPs at the start or something.

Thanks a lot.

On Mon, 1 Nov 2021 at 03:22, David Waterman (dgwater...@gmail.com) wrote:

> Hi James,
>
> What you wrote makes a lot of sense. I had not heard about Rsleep, so that looks like interesting reading, thanks.
>
> I have often used Rfree as a simple tool to compare two protocols. If I am not actually optimising against Rfree but just using it for a one-off comparison, then that is okay, right?
>
> Let's say I have two data processing protocols, A and B. Between these I might be exploring some difference in options within one data processing program, perhaps different geometry refinement parameters or scaling options. I expect the A and B data sets to be quite similar, but I would like to evaluate which protocol was "better", and I want to do this quickly, ideally looking at a single number. I don't like I/sigI because I don't trust the sigmas, CC1/2 is often noisy, and I'm totally sworn off merging R statistics for these purposes. I tend to use Rfree as an easily available metric, independent of the data processing program and the merging stats. It also allows a comparison of A and B in terms of the "product" of crystallography, namely the refined structure.
> In this I am lucky because I'm not trying to solve a structure. I may be looking at lysozyme or proteinase K: something where I can download a pretty good approximation to the truth from the PDB.
>
> So, what I do is process the data by A and process by B, ensure the data sets have the same free set, then refine to convergence (or at least for a lot of cycles) starting from a PDB structure. I then evaluate A vs B in terms of Rfree, though without an error bar on Rfree I don't read too much into small differences.
>
> Does this procedure seem sound? Perhaps it could be improved by randomly jiggling the atoms in the starting structure, in case the PDB deposition had already followed an A- or B-like protocol. Perhaps the whole approach is suspect. Certainly I wouldn't want to generalise by saying that A or B is better in all cases, but I do want to find a way to assess the various tweaks I can try in data processing for a single case.
>
> Any thoughts? I appreciate the wisdom of the BB here.
>
> Cheers
>
> -- David
>
> On Fri, 29 Oct 2021 at 15:50, James Holton <jmhol...@lbl.gov> wrote:
>
>> Well, of all the possible metrics you could use to assess data quality, Rfree is probably the worst one. This is because it is a cross-validation metric, and cross-validations don't work if you use them as an optimization target. You can try, and might even make a little headway, but then your free set is burnt. If you have a third set of observations, as suggested for Rsleep (doi:10.1107/S0907444907033458), then you have a chance at another round of cross-validation. Crystallographers don't usually do this, but it has become standard practice in machine learning (training=Rwork, validation=Rfree, and testing=Rsleep).
>>
>> So, unless you have an Rsleep set, any time you contemplate doing a bunch of random things and picking the best Rfree ... don't. Just don't. There madness lies.
>> What happens after doing this is that you will initially be happy about your lower Rfree, but everything you do after that will make it go up more than it would have had you not performed your Rfree optimization. This is because the change in the data that made Rfree randomly better was actually noise, and as the structure becomes more correct it will move away from that noise. It's always better to optimize on something else and then check your Rfree as infrequently as possible. Remember, it is the control for your experiment. Never mix your positive control with your sample.
>>
>> As for the best metric to assess data quality? Well, what are you doing with the data? There are always compromises in data processing and reduction that favor one application over another. If this is an "I just want the structure" project, then score on the resolution where CC1/2 hits your favorite value. For some that is 0.5, for others 0.3. I tend to use 0.0 so I can cut it later without re-processing. Whatever you do, just be consistent about it.
>>
>> If it's for anomalous, score on CCanom, or if that's too noisy, the Imean/sigma in the lowest-angle resolution or highest-intensity bin. This is because for anomalous you want to minimize relative error. The end-all-be-all of anomalous signal strength is the phased anomalous difference Fourier. You need phases to do one, but if you have a structure, just omit an anomalous scatterer of interest, refine to convergence, and then measure the peak height at the position of the omitted anomalous atom. Instructions for doing anomalous refinement in refmac5 are here:
>>
>> https://www2.mrc-lmb.cam.ac.uk/groups/murshudov/content/refmac/refmac_keywords.html
>>
>> If you're looking for a ligand you probably want isomorphism, and in that case refining with a reference structure and looking for low Rwork is not a bad strategy.
>> This will tend to select for crystals containing a molecule that looks like the one you are refining. But be careful! If it is an apo structure, your ligand-bound crystals will have higher Rwork due to the very difference density you are looking for.
>>
>> But if it's the same data just being processed in different ways, first make a choice about what you are interested in, and then optimize on that. Just don't optimize on Rfree!
>>
>> -James Holton
>> MAD Scientist
>>
>> On 10/27/2021 8:44 AM, Murpholino Peligro wrote:
>>
>> Let's say I ran autoPROC with different combinations of options for a specific dataset, producing dozens of different (but not so different) mtz files... Then I ran phenix.refine with the same options for the same structure but with all my mtz zoo. What would be the best metric to say "hey, this combo works the best!"? R-free?
>>
>> Thanks
>>
>> M. Peligro

########################################################################

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

This message was issued to members of www.jiscmail.ac.uk/CCP4BB, a mailing list hosted by www.jiscmail.ac.uk, terms & conditions are available at https://www.jiscmail.ac.uk/policyandsecurity/
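[A postscript on the bookkeeping: the A-vs-B comparison David describes boils down to refining the same starting model against each data set, with the same free set, and then reading Rwork/Rfree out of each refined model. A minimal Python sketch of that last step, parsing the REMARK 3 records of a PDB header; the exact label wording varies between refinement programs, and the header lines and values below are made up for illustration:]

```python
# Pull (Rwork, Rfree) out of PDB REMARK 3 records so that two refinement
# runs (protocol A vs protocol B) can be compared with a single number each.
# NOTE: REMARK 3 wording differs between phenix.refine, REFMAC, etc. --
# treat these patterns as a starting point and check them against your files.
import re

def r_values(lines):
    """Return (Rwork, Rfree) parsed from an iterable of PDB header lines."""
    rwork = rfree = None
    for line in lines:
        if not line.startswith("REMARK   3"):
            continue
        m = re.search(r"R VALUE\s+\(WORKING SET\)\s*:\s*([0-9.]+)", line)
        if m:
            rwork = float(m.group(1))
        m = re.search(r"FREE R VALUE\s*:\s*([0-9.]+)", line)
        # Guard against "ERROR OF FREE R VALUE" and "FREE R VALUE TEST SET" lines.
        if m and "ERROR" not in line and "TEST" not in line:
            rfree = float(m.group(1))
    return rwork, rfree

# Demo on a synthetic header (values invented for illustration):
header = [
    "REMARK   3   R VALUE            (WORKING SET) : 0.172",
    "REMARK   3   FREE R VALUE                     : 0.198",
    "REMARK   3   FREE R VALUE TEST SET COUNT      : 1205",
]
print(r_values(header))  # -> (0.172, 0.198)
```

In practice one would call `r_values(open("refined_A.pdb"))` and the same for a hypothetical `refined_B.pdb`, remembering James's caveat: this is a one-off comparison, not an optimization target, and small Rfree differences without an error bar mean little.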