Dear Randy, Harry et al., sorry for replying to several emails in one message - but as always, everything is connected ;-)
On Thu, Apr 18, 2024 at 10:02:40AM +0000, Randy John Read wrote:

> I’d like to add my strong agreement to what Robbie said, but also point
> out a wrinkle. When the PDB runs validation, it just takes the data that
> are in the first reflection loop of the reflections.cif file.

Just a small adjustment of naming convention here (as I understand it): you are referring to so-called "data blocks" within an mmCIF file (delimited by a "data_XYZ" token), while a "loop" is just a CIF format construct. So a single data block can have multiple loops - as is visible in any model mmCIF file. Right?

> So if you want the validation statistics to match your reported
> refinement statistics, that loop should contain the set of data you gave
> the refinement program,

Couldn't agree more - but there are a few additional creases (see below), to stay in your picture ...

> especially if you’ve done something like apply an elliptical truncation,
> correct for anisotropy, or convert intensities to amplitudes, all of
> which change the data in ways that can’t be reversed later.

... some of those are actually applied by the refinement program itself (rejection of outlier reflections, or scaling of the observed amplitudes against the model using anisotropic B-factors). That immediately creates a problem: we require the output reflection data (containing map coefficients for e.g. the 2mFo-DFc and mFo-DFc maps), but we can't rely on the observed intensities/amplitudes in that file to accurately represent the input data. Therefore, we might also need to collect the observed data as-is from the input reflection file - and combine those two files into a single data block of a reflection mmCIF file intended for deposition.

There are a few points we could make about (a) ellipsoidal truncation and its relation to isotropic and anisotropic truncation, and (b) correction for anisotropy, but we have placed them on a separate page at [1] to keep this email thread reasonably short.
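To make the data-block-vs-loop terminology concrete, here is a minimal, stdlib-only sketch (deliberately not a real CIF parser; the file content and block name are invented for illustration) showing a single data block that contains two loops:

```python
# A hypothetical, heavily trimmed reflection mmCIF: one "data_" block,
# two "loop_" constructs inside it.
sample_cif = """\
data_rXXXXsf
_cell.length_a  78.4
loop_
_refln.index_h
_refln.index_k
_refln.index_l
_refln.intensity_meas
0 0 4 123.5
0 0 8  67.1
loop_
_diffrn_radiation_wavelength.id
_diffrn_radiation_wavelength.wavelength
1 0.9795
"""

# Data blocks are delimited by a "data_XYZ" token ...
blocks = [line[len("data_"):] for line in sample_cif.splitlines()
          if line.startswith("data_")]
# ... while "loop_" is just a format construct inside a block.
n_loops = sum(1 for line in sample_cif.splitlines()
              if line.strip() == "loop_")

print(blocks)   # one data block: ['rXXXXsf']
print(n_loops)  # ... holding two loops: 2
```

So "the first reflection loop" in the validation context really means "the first _refln loop of the first data block" - the distinction matters once files carry several blocks.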
Your remark about the conversion from intensities to amplitudes, in the context of multiple data blocks in mmCIF files, probably stems from a concern that one might end up with only amplitudes for a given PDB deposition - which indeed would not be ideal. That danger lies mostly in the way we now seem to split, shuffle, reassemble and push our original reflection data files around different programs, online services and GUIs. Originally, any program that we know of doing that I->F conversion would always carry the input intensities over into the output (together with the newly created amplitudes and other items such as anomalous data), so there shouldn't be a reason for losing the intensities at this point ... ideally.

The beauty of the good ol' MTZ files was that all those items were always kept together, and - once a test-set flag was added - one only ever needed to refer to that "master" reflection file in any subsequent steps to keep all those relations and the content intact ... no daisy-chaining of reflection data input/output channels with the resulting loss of information, provenance and metadata.

> Whenever you’ve done any of this (and many people are using StarAniso
> these days, which does all of those things),

Just as a clarification for less experienced users: there are lots of other programs that do exactly the same thing - or at least variants that follow the same underlying idea. Most (all?) data-processing programs/packages/pipelines will apply a data-truncation step, and several refinement programs will reject observations or apply anisotropic corrections to the observed data. STARANISO is not really unique in those underlying concepts as far as we can see.

> please put in a second reflection loop containing the whole set of
> intensities to the highest resolution you used, without any anisotropic
> scaling or elliptical cutoffs.
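Why the I->F conversion can't simply be undone later can be shown with a deliberately naive sketch (real programs use a Bayesian treatment such as French & Wilson (1978) rather than a plain square root; the numbers here are invented):

```python
import math

# Hypothetical measured intensities; weak reflections can legitimately
# come out negative after background subtraction.
intensities = [250.0, 12.0, -3.5, 0.8]

# Deliberately naive I -> F conversion: negative intensities have no
# real square root, so their information is destroyed here.
amplitudes = [math.sqrt(i) if i > 0 else 0.0 for i in intensities]

# Squaring the amplitudes does NOT recover the original measurements -
# which is exactly why the input intensities should be carried over
# into the deposited file alongside the derived amplitudes.
recovered = [f * f for f in amplitudes]
print(recovered == intensities)  # False: e.g. -3.5 came back as 0.0
```

A proper Bayesian conversion handles the negative-intensity case gracefully, but it is still a one-way transformation - hence the plea to keep the intensities in the deposited file.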
> Then anyone wanting to re-refine your
> structure or check your data for artefacts will have more information
> available.

Absolutely: we couldn't agree more, and we are trying to provide tools that are as simple to use as possible (not only for users of our software - see [2], online since October 2020). The hope is that the deposition-preparation step becomes as painless as possible, even if it can't (yet) be fully automatic, because (a) the processed diffraction data usually comes from a synchrotron system (provided as mmCIF for deposition, while the MTZ file is usually picked up for downstream work), and (b) the refinement is done in a separate system, resulting in a separate, related but not directly linked reflection file after model refinement.

> Of course, I hope we’re moving to a world in which we all also deposit
> the intensities before merging, which in principle allows even more
> quality control to be done.

For the last several years we have been providing that feature as part of our own software: scaled+unmerged data without cutoff, scaled+merged data without cutoff, and scaled+merged data after cutoff - all in a multi-datablock mmCIF ready for deposition and/or combination with the reflection data from refinement. One can 'see' that by using

  gemmi grep "_*details" some_refln.cif

which would then e.g. report:

  rXXXXAsf: merged and scaled data post-processed by STARANISO for conversion from intensities to structure factor amplitudes and anomalous data.
  rXXXXBsf: merged and scaled EARLY (potentially least radiation-damaged) data post-processed by STARANISO for conversion from intensities to structure factor amplitudes - useful for radiation-damage detection/description maps (as e.g. done in BUSTER).
  rXXXXCsf: merged and scaled LATE (potentially most radiation-damaged) data post-processed by STARANISO for conversion from intensities to structure factor amplitudes - useful for radiation-damage detection/description maps (as e.g. done in BUSTER).
  rXXXXDsf: merged and scaled data from AIMLESS without any post-processing and/or data cut-off.
  rXXXXEsf: unmerged and scaled data from AIMLESS without any post-processing and/or data cut-off.

If any developer/user is interested in using those files, or in investigating how other software could produce these combo files: we are very happy to help with or discuss any aspect :-).

On Thu, Apr 18, 2024 at 12:37:51PM +0000, Randy John Read wrote:

> I haven’t deposited any PDB entries for a while. The last time I did, I
> remember it not being completely trivial to add these loops. However, I
> was hoping that someone from wwPDB or CCP4 would weigh in with advice on
> how it can be done!

/If/ you used data from autoPROC/STARANISO /and/ saved the full output of that processing job, /then/ running

  aB_deposition_combine -aP /some/autoPROC/results/dir refine_model.cif refine_refl.cif

should work (not just for BUSTER, but also for REFMAC and Phenix refinements) and produce (nearly) deposition-ready files with a full set of reflection data blocks and the correct data-quality metrics pushed into the model mmCIF file. Of course, there are cases where this will still need some manual work afterwards (and some data is still being lost after deposition) ... but the alternative is often much worse: for instance, using some - not always up-to-date - harvesting tool to create a syntactically correct mmCIF file that doesn't trigger errors/warnings within the deposition system, but contains very little and often incorrect metadata. But this is for another, separate discussion.

So if we care about our (FAIR) data, we should all invest that extra bit of effort to debug/improve the modern systems we have (or are developing) for preparing rich deposition-ready files. The more feedback we developers get, the better the tools will get, the less painful deposition will become, and the richer and more useful our archived PDB entries will be.
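Coming back to the multi-datablock reflection files and the `gemmi grep "_*details"` query mentioned earlier: for readers without gemmi at hand, here is a rough stdlib-only sketch of the kind of per-block lookup that query performs. Real tooling should of course use gemmi itself; the tag name (_diffrn.details) and the file contents below are invented for illustration:

```python
# A hypothetical two-block reflection mmCIF, trimmed to the bare minimum.
sample = """\
data_rXXXXAsf
_diffrn.details 'merged and scaled data post-processed by STARANISO'
data_rXXXXDsf
_diffrn.details 'merged and scaled data from AIMLESS, no post-processing'
"""

# Walk the file, remembering the current data block and collecting any
# "...details" item found inside it (single-line quoted values only -
# a real parser must also handle multi-line semicolon values).
block = None
hits = {}
for line in sample.splitlines():
    tokens = line.split(None, 1)
    if not tokens:
        continue
    if tokens[0].startswith("data_"):
        block = tokens[0][len("data_"):]
    elif tokens[0].endswith("details") and block and len(tokens) > 1:
        hits[block] = tokens[1].strip("'")

for name, details in hits.items():
    print(f"{name}: {details}")
```

This is exactly the bookkeeping benefit of the multi-block layout: each block carries its own description, so the provenance of each dataset travels with the file.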
But it also has to be pushed for by the actual depositors, even if under stress during last-minute deposition processes ;-).

> For those who use StarAniso, the current version makes a CIF file with
> the required loops, and they now have advice on their website about
> this: https://staraniso.globalphasing.org/deposition_about.html.

Yes, we've had those pages up for several years now (see also [2]), in the hope that users would be able to follow the recommendations given there. And several users have been successful in depositing very rich sets of reflection data (see e.g. 7MBO): thanks to all of you!

On Thu, Apr 18, 2024 at 09:22:04AM +0100, Harry Powell wrote:

> Only comment is that (surely) any decent refinement program these days
> would down-weight any reflections with negligible I/sig(I) (for example,
> those in the “unobserved” high resolution regions) so that they do not
> contribute (significantly) to the refinement.

Not all refinement programs use sigmas, as far as we know.

> Doesn’t Aimless produce a table with the cumulative statistics at various
> resolution limits? So you don’t even need to re-scale & merge to get
> the stats to whatever your chosen high resolution limit is (I’d choose
> CC-1/2 = 0.30, as Doeke suggests)? Do HKL or XSCALE do the same (sorry, I
> haven’t looked at their output for quite some time)?

All true, but having data (reflections) and metadata (data-quality metrics) in separate files is not ideal at all:

* the average time between data collection (and probably processing) and deposition is roughly 2.5 years nowadays
* these two files might not stay close neighbours (on disk or in a user's mind) forever
* data-processing programs that combine data and metadata directly as part of the actual processing run have an advantage here. BUT: this combination can easily be done in mmCIF (a great archiving format) but not in MTZ (a great working format) ...
  so a user still needs to keep a handle on the bookkeeping
* after, say, ~2.5 years, and under publication/reviewer/PI pressure to get a PDB ID, one can easily fall into the trap of just getting "something" into the deposition system to get things done ... chasing around for logfiles, harvesting whatever is available, and ending up with (in the best case) incomplete metadata or (often) completely incorrect values. Not to speak of sometimes dropping references to the software packages used.

Anyway, we highly welcome any "call to arms" to improve the systems for preparing data for PDB deposition, so that they provide multiple data blocks of reflection data and rich metadata: the more users/depositors push for this to "just work", the better/easier it will get.

Cheers

Clemens

[1] https://www.globalphasing.com/autoproc/wiki/index.cgi?DataProcessingAndDeposition
[2] https://www.globalphasing.com/buster/wiki/index.cgi?DepositionMmCif