Dear Randy, Harry et al,

Sorry for replying to several emails in one message - but, as always,
everything is connected ;-)

On Thu, Apr 18, 2024 at 10:02:40AM +0000, Randy John Read wrote:
> I’d like to add my strong agreement to what Robbie said, but also point
> out a wrinkle. When the PDB runs validation, it just takes the data that
> are in the first reflection loop of the reflections.cif file.

Just a small clarification of naming conventions here (as I understand
them): you are referring to so-called "data blocks" within an mmCIF file
(delimited by a "data_XYZ" token), while a "loop" is just a CIF format
construct. So a single data block can have multiple loops - as is visible
in a model mmCIF file. Right?
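
To make this concrete, here is a minimal, heavily abbreviated (and
made-up) sketch of a single data block containing two loops, as one
would see in a model mmCIF file:

  data_XYZ
  #
  loop_
  _entity.id
  _entity.type
  1 polymer
  2 water
  #
  loop_
  _atom_site.group_PDB
  _atom_site.id
  _atom_site.type_symbol
  ATOM 1 N
  ATOM 2 C
  #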

> So if you want the validation statistics to match your reported
> refinement statistics, that loop should contain the set of data you gave
> the refinement program,

Couldn't agree more - but there are a few additional creases (see below),
to stay within your picture ...

> especially if you’ve done something like apply an elliptical truncation,
> correct for anisotropy, or convert intensities to amplitudes, all of
> which change the data in ways that can’t be reversed later.

... some of those are actually applied by the refinement program itself
(rejection of outlier reflections, or scaling of the observed amplitudes
against the model using anisotropic B-factors). That immediately creates a
problem: we require the output reflection data (containing map coefficients
for e.g. the 2mFo-DFc and mFo-DFc maps) but can't rely on the observed
intensities/amplitudes in that file to accurately represent the input
data. Therefore, we might also need to collect the observed data as-is from
the input reflection file - and combine those two files into a single
data block of a reflection mmCIF file intended for deposition.
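
As a rough illustration of why the refinement output alone is not enough,
here is a small Python sketch using the gemmi library (the file names, and
the assumption of single-block files carrying _refln.F_meas_au, are ours)
that compares the amplitudes given to the refinement program with those it
wrote back out:

  import gemmi

  def f_by_hkl(path):
      # collect F_meas_au keyed by Miller index from a single-block
      # reflection mmCIF file (file layout assumed, see above)
      block = gemmi.cif.read(path).sole_block()
      hkl = zip(block.find_loop('_refln.index_h'),
                block.find_loop('_refln.index_k'),
                block.find_loop('_refln.index_l'))
      fs = block.find_loop('_refln.F_meas_au')
      return {idx: v for idx, v in zip(hkl, fs) if v not in ('?', '.')}

  f_in = f_by_hkl('input_refl.cif')      # data given to refinement
  f_out = f_by_hkl('refine_refl.cif')    # reflection file written by it

  common = set(f_in) & set(f_out)
  rescaled = sum(1 for idx in common
                 if abs(float(f_in[idx]) - float(f_out[idx])) > 1e-3)
  print(len(f_in) - len(common), 'reflections rejected or dropped')
  print(rescaled, 'of', len(common), 'common amplitudes changed')

Any non-zero counts reported here are exactly the differences a validation
run would see if only the refinement output were deposited.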

There are a few points we could make about (a) ellipsoidal truncation and
its relation to isotropic and anisotropic truncation, and (b) correction
for anisotropy, but we have put them on a separate page at [1] to keep this
email thread reasonably short.

Your remark about the conversion from intensities to amplitudes, in the
context of multiple data blocks in mmCIF files, probably stems from a
concern that one might end up with only amplitudes for a given PDB
deposition - which indeed would not be ideal. That danger lies mostly in
the way we now seem to split, shuffle, reassemble and push our original
reflection data files around different programs, online services and
GUIs. Traditionally, every program we know of that does that I->F
conversion has always carried the input intensities over into the output
(together with the newly created amplitudes and other items, like anomalous
data), so there shouldn't be a reason for losing the intensities at this
point ... ideally.
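
Schematically, the output _refln loop of such a conversion then simply has
the new amplitude items next to the original intensity items (tags as in
the standard PDBx/mmCIF dictionary; data rows omitted here):

  loop_
  _refln.index_h
  _refln.index_k
  _refln.index_l
  _refln.intensity_meas
  _refln.intensity_sigma
  _refln.F_meas_au
  _refln.F_meas_sigma_au
  ...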

The beauty of the good ol' MTZ files was that all those items were always
kept together and - once a test-set flag was added - one only ever needed
to refer to that "master" reflection file for any subsequent steps to keep
all those relations and the content intact ... no daisy-chaining of
reflection data input/output channels with the subsequent loss of
information, provenance and meta-data.
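
As a quick illustration with gemmi's Python bindings (the file name
'master.mtz' and the column labels in the comment are hypothetical), one
look at such a master file shows everything side by side:

  import gemmi

  mtz = gemmi.read_mtz_file('master.mtz')
  # e.g. ['H', 'K', 'L', 'FreeR_flag', 'IMEAN', 'SIGIMEAN', 'F', 'SIGF']
  print([col.label for col in mtz.columns])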

> Whenever you’ve done any of this (and many people are using StarAniso
> these days, which does all of those things),

Just as a clarification for the less experienced users: there are lots of
other programs that do exactly the same thing - or at least variants that
follow the same underlying idea. Most (all?) data-processing
programs/packages/pipelines will apply a data truncation step and several
refinement programs will reject observations or apply anisotropic
corrections to the observed data. STARANISO is not really unique in those
underlying concepts as far as we can see.

> please put in a second reflection loop containing the whole set of
> intensities to the highest resolution you used, without any anisotropic
> scaling or elliptical cutoffs. Then anyone wanting to re-refine your
> structure or check your data for artefacts will have more information
> available.

Absolutely: we couldn't agree more, and we are trying to provide tools that
are as simple to use as possible (not only for users of our software - see
[2], available since October 2020). The hope is that the
deposition-preparation step becomes as painless as possible, even if it
can't (yet) be fully automatic because

  (a) the processed diffraction data usually comes from a synchrotron
      system (provided as mmCIF for deposition, but usually the MTZ file is
      picked up for downstream work), and

  (b) the refinement is done in a separate system resulting in a separate,
      related but not directly linked reflection file after model
      refinement.

> Of course, I hope we’re moving to a world in which we all also deposit
> the intensities before merging, which in principle allows even more
> quality control to be done.

For the last several years we have been providing that feature as part of
our own software: scaled+unmerged data without cut-off, scaled+merged data
without cut-off and scaled+merged data after cut-off - all in a
multi-datablock mmCIF ready for deposition and/or combination with the
reflection data from refinement. One can 'see' that by using

  gemmi grep "_*details"  some_refln.cif

that would then e.g. report

  rXXXXAsf: merged and scaled data post-processed by STARANISO for
            conversion from intensities to structure factor amplitudes
            and anomalous data.
  rXXXXBsf: merged and scaled EARLY (potentially least radiation-damaged)
            data post-processed by STARANISO for conversion from
            intensities to structure factor amplitudes - useful for
            radiation-damage detection/description maps (as e.g. done
            in BUSTER).
  rXXXXCsf: merged and scaled LATE (potentially most radiation-damaged)
            data post-processed by STARANISO for conversion from
            intensities to structure factor amplitudes - useful for
            radiation-damage detection/description maps (as e.g. done
            in BUSTER).
  rXXXXDsf: merged and scaled data from AIMLESS without any
            post-processing and/or data cut-off.
  rXXXXEsf: unmerged and scaled data from AIMLESS without any
            post-processing and/or data cut-off.
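
For scripting, roughly the same lookup can be done with gemmi's Python
bindings - a sketch, with the tag matching mimicking the "_*details" glob
used above:

  import gemmi

  doc = gemmi.cif.read('some_refln.cif')
  for block in doc:
      for item in block:
          # keep only simple tag-value pairs whose tag ends in 'details'
          if item.pair and item.pair[0].endswith('details'):
              print(block.name + ':', item.pair[1])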

If any developer/user is interested in using those files or investigating
how other software could produce these combo files: we are very happy to
help with or discuss any aspect :-).

On Thu, Apr 18, 2024 at 12:37:51PM +0000, Randy John Read wrote:
> I haven’t deposited any PDB entries for a while. The last time I did, I
> remember it not being completely trivial to add these loops. However, I
> was hoping that someone from wwPDB or CCP4 would weigh in with advice on
> how it can be done!

/If/ you used data from autoPROC/STARANISO /and/ saved the full output of
that processing job, /then/ running

  aB_deposition_combine -aP /some/autoPROC/results/dir \
      refine_model.cif refine_refl.cif

should work (not just for BUSTER, but also for REFMAC and Phenix
refinements) and produce (nearly) deposition-ready files with a full set of
reflection data blocks and the correct data-quality metrics pushed into the
model mmCIF file. Of course, there are cases where this will still need
some manual work afterwards (and some data is still being lost after
deposition) ... but the alternative is often much worse: for instance,
using some - not always up-to-date - harvesting tool to create a
syntactically correct mmCIF file that doesn't trigger errors/warnings
within the deposition system but contains very little, and often incorrect,
meta-data. But that is for another, separate discussion.

So if we care about our (FAIR) data, we should all invest that extra bit
of effort to debug/improve the modern systems we have (or are developing)
for preparing rich deposition-ready files. The more feedback we developers
get, the better the tools will become, the less painful deposition will be,
and the richer and more useful our archived PDB entries will get. But this
also has to be pushed for by the actual depositors, even when under stress
during a last-minute deposition ;-).

> For those who use StarAniso, the current version makes a CIF file with
> the required loops, and they now have advice on their website about
> this: https://staraniso.globalphasing.org/deposition_about.html.

Yes, we've had those pages up for several years now (see also [2]) in the
hope that users would be able to follow the recommendations given there.
And several users have been successful in depositing very rich sets of
reflection data (see e.g. PDB entry 7MBO): thanks to all of you!

On Thu, Apr 18, 2024 at 09:22:04AM +0100, Harry Powell wrote:
> Only comment is that (surely) any decent refinement program these days
> would down-weight any reflections with negligible I/sig(I) (for example,
> those in the “unobserved” high resolution regions) so that they do not
> contribute (significantly) to the refinement.

Not all refinement programs use sigmas as far as we know.

> Doesn’t Aimless produce a table with the cumulative statistics at various
> resolution limits? So you don’t even need to re-scale & merge to get
> the stats to whatever your chosen high resolution limit is (I’d choose
> CC1/2 = 0.30, as Doeke suggests)? Do HKL or XSCALE do the same (sorry, I
> haven’t looked at their output for quite some time)?

All true, but having data (reflections) and meta-data (data-quality
metrics) in separate files is not ideal at all:

  * the average time between data collection (and probably processing) and
    deposition is roughly 2.5 years nowadays

  * these two files might not stay close neighbours (on disk or in a user's
    mind) forever

  * data processing programs that combine data and meta-data directly as
    part of the actual processing run have an advantage here

    BUT: this combination can easily be done in mmCIF (a great archiving
    format) but not in MTZ (a great working format) ... so a user still
    needs to keep a handle on the bookkeeping (see the sketch after this
    list)

  * after say ~2.5 years, and under publication/reviewer/PI stress to get a
    PDB ID, one can easily fall into the trap of just getting "something"
    into the deposition system to get things done ... chasing around for
    logfiles, harvesting whatever is available and ending up with (in the
    best case) incomplete meta-data or (often) completely incorrect
    values. Not to mention sometimes dropping references to the software
    packages used.
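
The sketch referred to above: with gemmi's Python bindings, per-dataset
quality metrics can be stored as plain key/value pairs right next to the
_refln loop in the same data block (the file names and numbers below are
made up; the tags are standard PDBx/mmCIF _reflns items):

  import gemmi

  doc = gemmi.cif.read('my_refln.cif')   # hypothetical single-block file
  block = doc.sole_block()
  # the data-quality metrics now live next to the reflections themselves
  block.set_pair('_reflns.pdbx_Rmerge_I_obs', '0.085')
  block.set_pair('_reflns.pdbx_CC_half', '0.998')
  block.set_pair('_reflns.pdbx_netI_over_sigmaI', '12.3')
  doc.write_file('my_refln_plus_metadata.cif')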

Anyway, we very much welcome any "call to arms" to improve the systems for
preparing data for PDB deposition, so that they provide multiple data
blocks of reflection data together with rich meta-data: the more
users/depositors push for this to "just work", the better/easier it will
get.

Cheers

Clemens

[1] 
https://www.globalphasing.com/autoproc/wiki/index.cgi?DataProcessingAndDeposition
[2] https://www.globalphasing.com/buster/wiki/index.cgi?DepositionMmCif
