Re: [ccp4bb] Meaning of a pdb entry

Gergely Katona Wed, 02 Jun 2021 02:43:06 -0700

Dear Ethan,

This is an interesting discussion. I agree that the word "uncertainty" covers 
very broad concepts, so let me try to narrow down what I mean. 
My starting point is that a reflection file contains point estimates of 
diffraction intensities or structure factor amplitudes. Refinement and model 
building result in a point estimate of a structural model. These estimates can 
be calculated to arbitrary precision. Variation in these parameters is 
expected from sampling: sampling of experimental data and sampling of 
(pseudo)random events in refinement and model-building algorithms. I ignore the 
role of human influence and bias for now. Error models with different 
assumptions may help to quantify the expected variation, but if I want to 
verify these error models, or just have an alternative way of quantifying 
uncertainty, I have to go back to sampling.
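
A minimal sketch of what "going back to sampling" could look like, assuming 
replicate estimates of the same parameters are already available (for instance 
from refinements against resampled data sets). The coordinates below are 
simulated, not real data, and the function name is hypothetical. The point is 
only that a mean and a covariance matrix follow directly from the replicates, 
without assuming an error model.

```python
import random

def mean_and_covariance(samples):
    """Return the mean vector and the sample covariance matrix (n - 1 divisor)."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(dim)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(dim)]
           for i in range(dim)]
    return mean, cov

# Simulated replicate estimates of one atom's (x, y, z) position in Å.
# A shift shared by x and y makes the two coordinates covary, which is the
# kind of dependence a per-coordinate error bar cannot express.
rng = random.Random(0)
replicates = []
for _ in range(200):
    shared = rng.gauss(0.0, 0.05)
    replicates.append((10.0 + shared + rng.gauss(0.0, 0.02),
                       5.0 + shared + rng.gauss(0.0, 0.02),
                       -3.0 + rng.gauss(0.0, 0.03)))

mean, cov = mean_and_covariance(replicates)
print("mean position:", [round(m, 3) for m in mean])
for row in cov:
    print([round(c, 5) for c in row])
```

The diagonal of cov holds the variances per coordinate and the off-diagonal 
elements hold the covariances, so the matrix is symmetric by construction.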

1) This is definitely the mean position for this atom in this crystal but there 
is uncertainty in how much individual instances in different crystal unit cells 
within the lattice deviate from this mean.

Ultimately, this is the category "unknown" for me: I cannot narrow down the 
atomic positions towards a single point with sampling alone, or with any of the 
experimental methods that I am aware of. The best I can achieve is to improve 
the accuracy and precision of the model parameters that describe the 
distribution of atomic positions.

2) This is a best-effort description of the position of a ligand atom.  However 
it is uncertain what fraction of the unit cells contain the ligand at this 
position, or at all.

My uncertainty is tied to my model, but I can choose different models of 
course. I cannot tell the fraction of unit cells containing the ligand if it is 
not part of my model, and I cannot estimate the uncertainty of this parameter 
by sampling. Should I face such a question, I might try to compare the average 
B-factors of the ligand, as part of my model, across different samples and 
against a set of control measurements. The control model should also contain 
parameters for the ligand, otherwise I cannot perform a comparison. 
Clearly, this could be misinterpreted by someone else as a determined location 
of the ligand in the control group, because the crystallographic models in the 
PDB traditionally do not represent a tool for asking questions, but the 
best-effort determination of the "true" structure or ensemble. 

I could define a dependent model parameter which integrates the omit electron 
density in a certain region of the ASU, and compare it between the control and 
soaked groups of crystals. This is a dependent/deterministic parameter, because 
its variation will entirely depend on the variation of the reflection data and 
the variation of the non-omitted atom positions/parameters. This type of 
deterministic model parameter is not traditionally part of a crystallographic 
model in a PDB entry. Again, this could be misinterpreted as a lack of ligand 
atoms/lack of ligand in the treated group.
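
As a rough sketch of such a dependent parameter, assume the omit density has 
already been sampled on a regular grid (here a nested list, with made-up 
values in e/Å³). The grid layout, the region and the function names are all 
hypothetical; a real map would come from a crystallographic package. The 
integrated value is a pure function of the map, so it has no variation of its 
own beyond what the map inherits from the data and the remaining model.

```python
def integrated_density(grid, region, voxel_volume):
    """Sum the density over an inclusive index box and scale by voxel volume."""
    (i0, i1), (j0, j1), (k0, k1) = region
    total = 0.0
    for i in range(i0, i1 + 1):
        for j in range(j0, j1 + 1):
            for k in range(k0, k1 + 1):
                total += grid[i][j][k]
    return total * voxel_volume

def make_toy_grid(n, background, peak=0.0):
    """Build an n x n x n toy map: flat background plus an optional peak."""
    grid = [[[background for _ in range(n)] for _ in range(n)] for _ in range(n)]
    grid[2][2][2] += peak
    return grid

# Toy maps on a 0.5 Å grid (voxel volume 0.125 Å^3); values are invented.
voxel_volume = 0.125
region = ((1, 2), (1, 2), (1, 2))  # the binding-site box, in grid indices

control_value = integrated_density(make_toy_grid(4, 0.1), region, voxel_volume)
soaked_value = integrated_density(make_toy_grid(4, 0.1, peak=2.0), region,
                                  voxel_volume)
print(control_value, soaked_value)
```

Evaluating the same function on every replicate map in each group gives two 
samples of the integrated density that can then be compared directly.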

The purpose of a PDB entry evolved over time from a single type of 
crystallographic model to include multiple NMR models, different models, 
different methods, experimental data, validation, etc. I expect that this 
nearly imperceptible evolution will continue in different directions at 
different speeds. If what I try to achieve now deviates from the current 
perception of the purpose of a PDB entry, then of course I have to find other 
means to fulfill my obligation to make my data openly accessible. Fortunately, 
the data I plan to archive is not related to the determination of ligand 
occupancy and is perhaps more in line with the current purpose of the PDB.

3) It is likely that this sidechain/loop/subunit is present in different 
conformations in different copies of the unit cell.

This is again the unknown category that I cannot address by sampling. The model 
can be left open to grow (non-parametric) if the refinement is coupled to 
automated rebuilding. I have to define my question differently, for example 
ask how many conformations or water molecules were built in the control and 
treated groups. Interpretation may vary, but I may have sufficient evidence of 
significantly different models in the different groups.
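
One way to phrase that question, sketched under the assumption that each 
crystal in a group yields an independently rebuilt model and therefore one 
count (for example of built water molecules). The counts below are invented 
for illustration and the function name is hypothetical. A permutation test is 
used here because it needs no error model, only the samples themselves.

```python
import random

def permutation_p_value(control, treated, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference between group means."""
    def mean(values):
        return sum(values) / len(values)

    rng = random.Random(seed)
    observed = abs(mean(treated) - mean(control))
    pooled = list(control) + list(treated)
    n_control = len(control)
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Invented counts of built water molecules, one per replicate model.
control_counts = [212, 208, 215, 210, 209, 213]
treated_counts = [221, 225, 219, 224, 222, 220]

p_value = permutation_p_value(control_counts, treated_counts)
print("permutation p-value:", p_value)
```

A small p-value only says the two groups of models differ in this count; how 
to interpret the difference structurally remains a separate question.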

If the variation were better represented in PDB entries, machine learning 
algorithms could also achieve better predictions that are less biased towards 
an arbitrary model sample.

4) The coordinates of this specific atom/residue/conformation are well 
supported by the data for this particular crystal.
But it might be somewhere else in the next crystal from the same 
crystallization drop, or in a crystal from a different crystallization buffer, 
or at another temperature, or in solution, or in the presence of a ligand, etc.

I am interested in representing the type of variation I cannot control, and 
when designing the experiment it is in my best interest to limit the variation 
of experimental conditions between the samples as much as possible. I cannot 
control whether the different crystals in the same drop contain the ligand, but 
I can make sure that the control and treated groups of crystals grow at the 
same temperature. I would not combine data sets and models in the same PDB 
entry if the conditions I expect to control are different.

Best wishes,

Gergely

> 
> -----Original Message-----
> From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> On Behalf Of Ethan A 
> Merritt
> Sent: 29 May, 2021 19:16
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] AW: [ccp4bb] AW: [ccp4bb] (R)MS
> 
> On Saturday, 29 May 2021 02:12:16 PDT Gergely Katona wrote:
> [...snip...]
>  I think the assumption of independent variations per atom is too strong in 
> many cases and does not give an accurate picture of uncertainty.
> [...snip...]
> 
> 
> Gergely, you are revisiting a line of thought that historically led to the 
> introduction of more global treatments of atomic displacement.
> These have distinct statistical and interpretational advantages.
> 
> Several approaches have been tried over the past 40 years or so.
> The one that has proved most successful is the use of TLS
> (Translation/Libration/Screw) models of bulk displacement to 
> supplement or replace per-atom descriptions.  As you say, a per-atom 
> treatment is often too strong and is not statistically justified by 
> the experimental data.  I explored this with specific examples in
> 
>    "To B or not to B?" [Acta Cryst. 2012, D68, 468-477]
>     http://skuld.bmsc.washington.edu/~tlsmd/references.html
> 
> An NMR-style approach that constructs and refines multiple discrete models 
> has been re-invented several times. These treatments are generally 
> called "ensemble models".  IMHO they are statistically unjustified and 
> strictly worse than treatments based on higher level descriptions such as TLS 
> or normal-mode analysis.
> X-ray data is qualitatively different from NMR data, and optimal treatment of 
> uncertainty must take this into account.
> 
>       best regards
> 
>               Ethan
> 
> 
> > Hi,
> > 
> > It is enough to have Å² as the unit to express uncertainty in 3D, but one 
> > can express it with a single number only in the very specific case when 
> > the atom is isotropic. Few atoms have a naturally isotropic distribution 
> > around their mean position in very high resolution protein crystal 
> > structures. The anisotropic atoms can be described by a 3x3 matrix, where 
> > each row and column is associated with the uncertainty in a specific 
> > spatial direction. The matrix elements are the products of the 
> > uncertainties in these directions. The diagonal elements will be the 
> > square of the uncertainty in the same direction and they should always be 
> > positive; the off-diagonal combinations of directions are covariances 
> > (+, 0 or -). In the end, every element will have a unit of 
> > distance*distance and the matrix will be symmetric. We cannot just take 
> > the square root of the matrix elements and expect something meaningful, if 
> > for no other reason than the problem with negative covariances. To 
> > calculate the square root of the matrix itself one has to diagonalize it 
> > first. The height of a person in your example sounds easy to define, but 
> > the mathematical formalism will not decide that for me. I can also define 
> > height as the longest chord of a person or the maximum elevation of a car 
> > mechanic under a car. Through diagonalization one can at least extract 
> > some interesting, intuitive, principal directions. The final product, the 
> > sqrt(matrix), is not more intuitive to me. To convert it to something 
> > intuitive I would have to diagonalize the square-rooted matrix again. So 
> > shall we make an exception for the special, isotropic description? Or use 
> > general principles for isotropic and anisotropic treatments?
> > 
> > About what B-factors are, I like to think about them as necessary model 
> > parameters. Computational biologists also use them for benchmarking their 
> > molecular dynamics models. They are also reproducible to the extent that 
> > one can identify specific atoms just based on their anisotropic tensor from 
> > independent structure determinations in the same crystal form. They are of 
> > course not immune to errors and variation.
> > 
> > I also wonder how we can represent model parameter variation in the best 
> > way. I admire NMR spectroscopists' approach to deposit multiple samples 
> > from a structural distribution. One could reproduce their conclusions 
> > without assuming any sort of error model from these samples. In 
> > crystallography, we have more and more distributions to deal with because 
> > we are swimming in data. It is easy to sample/resample data sets from the 
> > same or different crystals (SFX for example), which can lead to many 
> > replicates of structural models. I cannot really justify creating 
> > multiple PDB entries for these replicates; it is not good for the reader 
> > to have to work out which PDB codes belong to which group of samples. 
> > Maybe it works for up to 10 structures, but how about 100? Is it possible 
> > to deposit crystal structures as a chain of model/data pairs under the 
> > same entry? It is possible to just make a tarball and deposit it in 
> > alternative services such as Zenodo, but it would be a pity to completely 
> > bypass the PDB. I can think of more compact descriptions of structural 
> > distributions, for example mean positions and mean B-factors of atoms with 
> > their associated covariance matrices, analogous to how MD trajectories can 
> > be described as average structures and covariance matrices.  I think the 
> > assumption of independent variations per atom is too strong in many cases 
> > and does not give an accurate picture of uncertainty.
> > 
> > Best wishes,
> > 
> > Gergely
> > 
> > Gergely Katona, Professor, Chairman of the Chemistry Program Council 
> > Department of Chemistry and Molecular Biology, University of 
> > Gothenburg Box 462, 40530 Göteborg, Sweden
> > Tel: +46-31-786-3959 / M: +46-70-912-3309 / Fax: +46-31-786-3910
> > Web: http://katonalab.eu, Email: gergely.kat...@gu.se

--
Ethan A Merritt
Biomolecular Structure Center,  K-428 Health Sciences Bldg
MS 357742,   University of Washington, Seattle 98195-7742
