Dear Ethan,

This is an interesting discussion. I agree that the word uncertainty covers very broad concepts, but I will try to narrow down what I mean. My starting point is that a reflection file contains point estimates of diffraction intensities or structure factor amplitudes. Refinement and model building result in a point estimate of a structural model. These estimates can be calculated to arbitrary precision. Variation in these parameters is expected from sampling: sampling of experimental data and sampling of (pseudo)random events in refinement and model building algorithms. I ignore the role of human influence and bias for now. Error models with different assumptions may help to quantify the expected variation, but if I want to verify these error models or just have an alternative way of quantifying uncertainty I have to go back to sampling.
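As a minimal sketch of what "going back to sampling" could look like, the Python snippet below bootstraps a handful of repeated observations of a single parameter and compares the resampled spread of the mean with the usual analytic standard error. The observation values, the function name bootstrap_std_error and the number of resamples are illustrative assumptions, not taken from any real data set or program.

```python
import numpy as np

def bootstrap_std_error(values, n_resamples=2000, seed=0):
    """Standard error of the mean estimated by resampling with replacement."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Each row of idx selects one bootstrap resample of the observations.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    resampled_means = values[idx].mean(axis=1)
    return resampled_means.std(ddof=1)

# Hypothetical repeated observations of one parameter (arbitrary units).
observations = np.array([101.2, 98.7, 100.4, 99.9, 102.1, 97.8, 100.9, 99.5])

se_boot = bootstrap_std_error(observations)
# Analytic standard error under the assumption of independent observations.
se_analytic = observations.std(ddof=1) / np.sqrt(len(observations))

print(f"bootstrap SE: {se_boot:.3f}, analytic SE: {se_analytic:.3f}")
```

If the two numbers disagree strongly for real data, that would be a hint that the assumptions of the error model deserve a closer look.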
1) This is definitely the mean position for this atom in this crystal but there is uncertainty in how much individual instances in different crystal unit cells within the lattice deviate from this mean.

Ultimately, this is the category "unknown" for me: I cannot narrow down the atomic positions towards a single point with just sampling or with any of the experimental methods that I am aware of. The best I can achieve is to improve the accuracy and precision of the model parameters that describe the distribution of atomic positions.

2) This is a best-effort description of the position of a ligand atom. However it is uncertain what fraction of the unit cells contain the ligand at this position, or at all.

My uncertainty is tied to my model, but I can choose different models of course. I cannot tell the fraction of unit cells containing the ligand if it is not part of my model, and I cannot estimate the uncertainty of this parameter by sampling. Should I face such a question, I might compare the average B-factors of the ligand in my model across different samples with a set of control measurements. The control model should also contain parameters for the ligand, otherwise I cannot perform a comparison. Clearly, this could be misinterpreted by someone else as a determined location of the ligand in the control group, because the crystallographic models in the PDB traditionally do not represent a tool for asking questions, but the best-effort determination of the "true" structure or ensemble. Alternatively, I could define a dependent model parameter that integrates the omit electron density in a certain region of the ASU and compare it between the control and soaked groups of crystals. This is a dependent/deterministic parameter, because its variation will entirely depend on the variation of the reflection data and the variation of the atom positions/parameters that were not omitted.
These types of deterministic model parameters are not traditionally part of a crystallographic model in a PDB entry. Again, this could be misinterpreted as a lack of ligand atoms/lack of ligand in the treated group. The purpose of a PDB entry evolved over time from a single type of crystallographic model to include multiple NMR models, different models, different methods, experimental data, validation etc. I expect that this nearly imperceptible evolution will continue in different directions at different speeds. If what I try to achieve now deviates from the current perception of the purpose of a PDB entry then of course I have to find other means to fulfill my obligation to make my data openly accessible. Fortunately, the data I plan to archive is not related to the determination of ligand occupancy and is perhaps more in line with the current purpose of the PDB.

3) It is likely that this sidechain/loop/subunit is present in different conformations in different copies of the unit cell.

This is again the unknown category that I cannot address by sampling. The model can be open to grow (non-parametric) if the refinement is coupled to automated rebuilding. I have to define my question differently, for example by asking how many conformations or water molecules were built in the control and treated groups. Interpretation may vary, but I may have sufficient evidence of significantly different models in the different groups. If the variation is better represented in PDB entries, machine learning algorithms can also achieve better predictions that are less biased towards an arbitrary model sample.

4) The coordinates of this specific atom/residue/conformation are well supported by the data for this particular crystal. But it might be somewhere else in the next crystal from the same crystallization drop, or in a crystal from a different crystallization buffer, or at another temperature, or in solution, or in the presence of a ligand, etc.
I am interested in representing the type of variation I cannot control, and when designing the experiment it is in my best interest to limit the variation of experimental conditions between the samples as much as possible. I cannot control whether the different crystals in the same drop contain the ligand, but I can make sure that the control and treated groups of crystals grow at the same temperature. I would not combine data sets and models in the same PDB entry if the conditions I expect to control are different.

Best wishes,

Gergely

> -----Original Message-----
> From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> On Behalf Of Ethan A Merritt
> Sent: 29 May, 2021 19:16
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] AW: [ccp4bb] AW: [ccp4bb] (R)MS
>
> On Saturday, 29 May 2021 02:12:16 PDT Gergely Katona wrote:
> [...snip...]
> I think the assumption of independent variations per atoms is too strong in
> many cases and does not give an accurate picture of uncertainty.
> [...snip...]
>
> Gergely, you are revisiting a line of thought that historically led to the
> introduction of more global treatments of atomic displacement.
> These have distinct statistical and interpretational advantages.
>
> Several approaches have been tried over the past 40 years or so.
> The one that has proved most successful is the use of TLS
> (Translation/Libration/Screw) models of bulk displacement to
> supplement or replace per-atom descriptions. As you say, a per-atom
> treatment is often too strong and is not statistically justified by
> the experimental data. I explored this with specific examples in
>
> "To B or not to B?" [Acta Cryst. 2012, D68, 468-477]
> http://skuld.bmsc.washington.edu/~tlsmd/references.html
>
> An NMR-style approach that constructs and refines multiple discrete models
> has been re-invented several times. These treatments are generally
> called "ensemble models".
> IMHO they are statistically unjustified and
> strictly worse than treatments based on higher level descriptions such as TLS
> or normal-mode analysis.
> X-ray data is qualitatively different from NMR data, and optimal treatment of
> uncertainty must take this into account.
>
> best regards
>
> Ethan
>
> > Hi,
> >
> > It is enough to have Ų as unit to express uncertainty in 3D, but one can
> > express it with a single number only in a very specific case when the atom
> > is isotropic. Few atoms have a naturally isotropic distribution around
> > their mean position in very high resolution protein crystal structures. The
> > anisotropic atoms can be described by a 3x3 matrix, where each row and
> > column is associated with the uncertainty in a specific spatial direction.
> > The matrix elements are the products of the uncertainties in these directions.
> > The diagonal elements will be the square of uncertainty in the same
> > direction and they should be always positive, the off-diagonal combinations
> > of directions are covariances (+, 0 or -). In the end, every element will
> > have a unit distance*distance and the matrix will be symmetric. We cannot
> > just take the square root of the matrix elements and expect something
> > meaningful, if for no other reason than the problem with negative covariances.
> > To calculate the square root of the matrix itself one has to diagonalize it
> > first. The height of a person in your example sounds easy to define, but
> > the mathematical formalism will not decide that for me. I can also define
> > height as the longest chord of a person or the maximum elevation of a car
> > mechanic under a car. Through diagonalization one can at least extract
> > some interesting, intuitive, principal directions. The final product, the
> > sqrt(matrix), is not more intuitive to me. To convert it to something
> > intuitive I would have to diagonalize the square-rooted matrix again.
> > So shall
> > we make an exception for the special, isotropic description? Or use general
> > principles for isotropic and anisotropic treatments?
> >
> > About what B-factors are, I like to think about them as necessary model
> > parameters. Computational biologists also use them for benchmarking their
> > molecular dynamics models. They are also reproducible to the extent that
> > one can identify specific atoms just based on their anisotropic tensor from
> > independent structure determinations in the same crystal form. They are of
> > course not immune to errors and variation.
> >
> > I also wonder how we can represent model parameter variation in the best
> > way. I admire NMR spectroscopists' approach to deposit multiple samples
> > from a structural distribution. One could reproduce their conclusions
> > without assuming any sort of error model from these samples. In
> > crystallography, we have more and more distributions to deal with because
> > we are swimming in data. It is easy to sample/resample data sets from the
> > same or different crystals (SFX for example), which can lead to many
> > replicates of structural models. I cannot really justify creating
> > multiple PDB entries for these replicates, it is not good for the reader to
> > try to understand which PDB codes belong to which group of samples. Maybe
> > it works for up to 10 structures, but how about 100? Is it possible to
> > deposit crystal structures as a chain of model/data pairs under the same
> > entry? It is possible to just make a tarball and deposit it in alternative
> > services such as Zenodo, but it would be a pity to completely bypass the
> > PDB. I can think of more compact descriptions of structural distributions,
> > for example mean positions and mean B-factors of atoms with their
> > associated covariance matrices, analogously to how MD trajectories can be
> > described as average structures and covariance matrices.
> > I think the
> > assumption of independent variations per atoms is too strong in many cases
> > and does not give an accurate picture of uncertainty.
> >
> > Best wishes,
> >
> > Gergely
> >
> > Gergely Katona, Professor, Chairman of the Chemistry Program Council
> > Department of Chemistry and Molecular Biology, University of
> > Gothenburg Box 462, 40530 Göteborg, Sweden
> > Tel: +46-31-786-3959 / M: +46-70-912-3309 / Fax: +46-31-786-3910
> > Web: http://katonalab.eu, Email: gergely.kat...@gu.se

--
Ethan A Merritt
Biomolecular Structure Center, K-428 Health Sciences Bldg
MS 357742, University of Washington, Seattle 98195-7742
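The point in the quoted message about anisotropic atoms can be illustrated with a short sketch: the symmetric 3x3 displacement matrix is diagonalized to obtain principal directions and root-mean-square displacements, and the matrix square root is rebuilt from that decomposition. The matrix values below are hypothetical and chosen only so that the matrix is symmetric, positive definite and has one negative covariance. The variable names are illustrative as well.

```python
import numpy as np

# Hypothetical anisotropic displacement matrix in Å^2 (symmetric, positive definite).
U = np.array([
    [ 0.040, 0.010, -0.005],
    [ 0.010, 0.025,  0.002],
    [-0.005, 0.002,  0.030],
])

# An element-wise square root is not meaningful: negative covariances have no real root.
has_negative_covariance = bool(np.any(U < 0))

# Diagonalize: eigenvalues are variances along the principal directions (Å^2),
# columns of eigvecs are the corresponding orthonormal directions.
eigvals, eigvecs = np.linalg.eigh(U)

# Principal root-mean-square displacements in Å.
rms = np.sqrt(eigvals)

# Matrix square root rebuilt in the original coordinate frame.
sqrt_U = eigvecs @ np.diag(rms) @ eigvecs.T

print("negative covariance present:", has_negative_covariance)
print("principal RMS displacements (Å):", rms)
```

The principal values in rms are arguably the intuitive quantities here. sqrt_U itself shares the eigenvectors of U, so interpreting it still requires the same diagonalization.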