I agree with Colin here. Framing is simply a process of sampling the original signal at some 'frequency' (set by the phi width of each frame). At some point delta-phi is small enough that the original signal is oversampled and can be reconstructed _within the bounds of noise_. Beyond that point I see no advantage in sampling any finer, and certainly none in going to the limit of representing the data in some unframed, continuous-readout form.
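To make that concrete, here is a deliberately crude toy model (entirely my own, with made-up numbers, and nothing to do with any real integration program): one reflection's rocking curve is taken to be a Gaussian of assumed 0.3 deg FWHM, a single Poisson realisation of the "continuous" readout is drawn, and then the very same photons are re-binned into frames of various delta-phi.

import numpy as np

rng = np.random.default_rng(0)

# Assumed numbers, for illustration only
fwhm = 0.3             # deg, rocking width of one reflection
sigma = fwhm / 2.355
total_photons = 1e4    # total counts in the reflection
fine = 1e-4            # deg, the "continuous" readout step
phi = np.arange(-1.0, 1.0, fine)

# One Poisson realisation of the continuous signal
rate = total_photons * np.exp(-0.5 * (phi / sigma) ** 2) * fine / (sigma * np.sqrt(2 * np.pi))
counts = rng.poisson(rate)

def rebin(counts, phi, dphi):
    """Sum the fine trace into frames of width dphi; return frame centres and counts."""
    n = int(round(dphi / fine))
    m = (len(counts) // n) * n
    return phi[:m].reshape(-1, n).mean(axis=1), counts[:m].reshape(-1, n).sum(axis=1)

for dphi in (0.01, 0.05, 0.15):
    centres, frames = rebin(counts, phi, dphi)
    total = int(frames.sum())
    centroid = float((frames * centres).sum() / frames.sum())
    print(f"dphi={dphi:4.2f} deg  frames={len(frames):4d}  I={total:6d}  centroid={centroid:+.4f} deg")

The integrated intensity is identical by construction, and the recovered centroid agrees to within counting statistics for any delta-phi comfortably below the rocking width; slicing finer than that only re-divides the same photons.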
Perhaps I am missing something, and I realise this is another OT diversion from this most fruitful of threads.

Cheers
-- David

On 27 October 2011 08:55, Martin M. Ripoll <xmar...@iqfr.csic.es> wrote:
> Dear Colin,
>
> I think you understood perfectly what George was saying regarding the loss of information, but he will probably answer better than I.
>
> In any case, and for those who did not understand it, what George was saying is that a data collection made with continuous crystal rotation contains more information than when that information is transformed into frames. The loss of information we are referring to has the same meaning as when we calculate electron density maps with different grid sizes: the finer the grid, the greater the information on the map.
>
> But you are right that the shorter the interval between produced frames, the lower the loss of information. However, the procedure that you are suggesting should have some limits... otherwise the amount of information would grow dramatically.
>
> All the best,
> Martin
> ________________________________________
> Dr. Martin Martinez-Ripoll
> Research Professor
> xmar...@iqfr.csic.es
> Department of Crystallography & Structural Biology
> www.xtal.iqfr.csic.es
> Tel.: +34 917459550
> Consejo Superior de Investigaciones Científicas
> Spanish National Research Council
> www.csic.es
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Colin Nave
> Sent: Thursday, 27 October 2011 0:49
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] IUCr committees, depositing images
>
> Dear George, Martin
>
> I don't understand the point that one is throwing away information by storing in frames. If the frames have sufficiently fine intervals (given by some sampling-theorem consideration) I can't see how one loses information. Can one of you explain?
> Thanks
> Colin
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Martin M. Ripoll
> Sent: 26 October 2011 22:50
> To: ccp4bb
> Subject: Re: [ccp4bb] IUCr committees, depositing images
>
> Dear George, dear all,
>
> I was just trying to summarize my point of view regarding this important issue when I got your e-mail, which reflects exactly my own opinion!
>
> Martin
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of George M. Sheldrick
> Sent: Wednesday, 26 October 2011 11:52
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] IUCr committees, depositing images
>
> This raises an important point. The new continuous-readout detectors such as the Pilatus for beamlines or the Bruker Photon for in-house use enable the crystal to be rotated at constant velocity, eliminating the mechanical errors associated with 'stop and go' data collection. Storing their data in 'frames' is an artificial construction that is currently required for the established data integration programs but is in fact throwing away information. Maybe in 10 years' time 'frames' will be as obsolete as punched cards!
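[For what it's worth, the asymmetry George describes is easy to state in code. The sketch below is only my own illustration with invented counts: frames of any width can always be summed into coarser ones, but nothing can recover how the photons inside one frame were distributed in phi, which is exactly the detail a frameless (event-mode) record would keep.]

import numpy as np

# Hypothetical fine-sliced data: counts in 0.05-deg frames for one pixel.
rng = np.random.default_rng(1)
fine_frames = rng.poisson(lam=20.0, size=40)           # 40 frames x 0.05 deg = 2 deg

# Coarsening is always possible: sum pairs of frames to get 0.10-deg frames.
coarse_frames = fine_frames.reshape(-1, 2).sum(axis=1)

# The reverse is not: given one 0.10-deg count there is no way to know how the
# photons were split between its two 0.05-deg halves, let alone their individual
# arrival angles.  Any choice of frame width discards that intra-frame detail.
print(fine_frames.sum(), coarse_frames.sum())           # same total counts either way

[Whether that discarded detail ever matters is, I think, exactly Colin's point: once the frames already oversample the rocking width, it is detail about nothing but photon-counting noise.]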
> George
>
> On Wed, Oct 26, 2011 at 09:39:40AM +0100, Graeme Winter wrote:
> > Hi James,
> >
> > Just to pick up on your point about the Pilatus detectors. Yesterday, in 2 hours of giving a beamline a workout (admittedly with Thaumatin), we acquired 400+ GB of data*. Now I appreciate that this is not really routine operation, but it does raise an interesting point - if you have loaded a sample and centred it, collected test shots and decided it's not that great, why not collect anyway, as it may later prove to be useful?
> >
> > Bzzt. 2 minutes or less later you have a full data set, and barely even time to go get a cup of tea.
> >
> > This does to some extent move the goalposts, as you can acquire far more data than you need. You never know, you may learn something interesting from it - perhaps it has different symmetry or packing? What it does mean is that if we can have a method of tagging this data, there may be massively more opportunity to get also-ran data sets for methods-development types. What it also means, however, is that the cost of curating this data is then an order of magnitude higher.
> >
> > Moving it around is also rather more painful.
> >
> > Anyhow, I would try to avoid dismissing the effect that new continuous-readout detectors will have on data rates; from experience it is pretty substantial.
> >
> > Cheerio,
> >
> > Graeme
> >
> > *by "data" here what I mean is images, rather than information, which is rather more time-consuming to acquire. I would argue you get that from processing / analysing the data...
> >
> > On 24 October 2011 22:56, James Holton <jmhol...@lbl.gov> wrote:
> > > The Pilatus is fast, but for decades now we have had detectors that can read out in ~1 s. This means that you can collect a typical ~100-image dataset in a few minutes (if flux is not limiting). Since there are ~150 beamlines currently operating around the world and they are open about 200 days/year, we should be collecting ~20,000,000 datasets each year.
> > >
> > > We're not.
> > >
> > > The PDB only gets about 8000 depositions per year, which means that either we throw away 99.96% of our images or we don't actually collect images anywhere near the ultimate capacity of the equipment we have. In my estimation, both of these play about equal roles, with ~50-fold attrition between ultimate data-collection capacity and actually collected data, and another ~50-fold attrition between collected data sets and published structures.
> > >
> > > Personally, I think this means that the time it takes to collect the final dataset is not rate-limiting in a "typical" structural biology project/paper. This does not mean that the dataset is of little value. Quite the opposite! About 3000x more time and energy is expended preparing for the final dataset than is spent collecting it, and these efforts require experimental feedback. The trick is figuring out how best to compress the "data used to solve a structure" for archival storage. Do the "previous data sets" count? Or should the compression be "lossy" about such historical details? Does the stuff between the spots matter? After all, h,k,l,F,sigF is really just a form of data compression. In fact, there is no such thing as "raw" data. Even "raw" diffraction images are a simplification of the signals that came out of the detector electronics.
> > > But we round off and average over a lot of things to remove "noise", largely because "noise" is difficult to compress. The question of how much compression is too much compression depends on which information (aka noise) you think could be important in the future.
> > >
> > > When it comes to fine-sliced data, such as that from the Pilatus, the main reason why it doesn't compress very well is not because of the spots, but the background: it occupies thousands of times more pixels than the spots. Yes, there is diffuse-scattering information in the background pixels, but this kind of data is MUCH smoother than the spot data (by definition), and is therefore optimally stored in larger pixels. Last year I messed around a bit with applying different compression protocols to the spots and the background, and found that ~30-fold compression can easily be achieved if you apply h264 to the background and store the "spots" with lossless PNG compression:
> > >
> > > http://bl831.als.lbl.gov/~jamesh/lossy_compression/
> > >
> > > I think these results "speak" to the relative information content of the spots and the pixels between them. Perhaps at least the "online version" of archived images could be in some sort of lossy-background format, with the "real images" in some sort of slower storage (like a room full of tapes that are available upon request)? Would 30-fold compression make the storage of image data tractable enough for some entity like the PDB to be able to afford it?
> > >
> > > I go to a lot of methods meetings, and it pains me to see the most brilliant minds in the field starved for "interesting" data sets. The problem is that it is very easy to get people to send you data that is so bad that it can't be solved by any software imaginable (I've got piles of that!). As a developer, what you really need is a "right answer" so you can come up with better metrics for how close you are to it. Ironically, bad, unsolvable data that is connected to a right answer (aka a PDB ID) is very difficult to obtain. The explanations usually involve protestations about being in the middle of writing up the paper, or the student graduated and we don't understand how he/she labeled the tapes, or the RAID crashed and we lost it all, etc. Then again, just finding someone who has a data set with the kind of problem you are interested in is a lot of work! So is figuring out which problem affects the most people, and is therefore "interesting".
> > >
> > > Is this not exactly the kind of thing that publicly accessible centralized scientific databases are created to address?
> > >
> > > -James Holton
> > > MAD Scientist
> > >
> > > On 10/16/2011 11:38 AM, Frank von Delft wrote:
> > >> On the deposition of raw data:
> > >>
> > >> I recommend to the committee that before it convenes again, every member should go collect some data on a beamline with a Pilatus detector [feel free to join us at Diamond]. Because by the probable time any recommendations actually emerge, most beamlines will have one of those (or similar), we'll be generating more data than the LHC, and users will be happy just to have it integrated, never mind worry about its fate.
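[Going back to James's background/spot split a few paragraphs up: the real numbers live at the URL he gives; the fragment below is only a crude stand-in I put together to show the shape of the idea, with 8x8 block medians plus zlib doing duty for h264 and PNG, and a synthetic image rather than real detector data.]

import zlib
import numpy as np

rng = np.random.default_rng(2)

# Fake 1024x1024 detector image: smooth background plus a few hundred sharp spots.
ny = nx = 1024
y, x = np.mgrid[0:ny, 0:nx]
true_bg = 50.0 + 30.0 * np.exp(-((x - nx / 2) ** 2 + (y - ny / 2) ** 2) / (2 * 300.0 ** 2))
image = rng.poisson(true_bg).astype(np.int32)
for _ in range(300):
    cy, cx = rng.integers(10, ny - 10), rng.integers(10, nx - 10)
    image[cy - 1:cy + 2, cx - 1:cx + 2] += rng.integers(200, 5000)

# "Background channel": keep only the 8x8 block medians (lossy - the per-pixel
# background noise is thrown away, as in any smooth/lossy representation).
blocks = image.reshape(ny // 8, 8, nx // 8, 8)
bg_small = np.median(blocks, axis=(1, 3)).astype(np.float32)

# "Spot channel": keep, losslessly, only the pixels far above the local background.
bg_full = np.repeat(np.repeat(bg_small, 8, axis=0), 8, axis=1)
spot_mask = image > bg_full + 6.0 * np.sqrt(np.maximum(bg_full, 1.0))
spot_idx = np.flatnonzero(spot_mask).astype(np.int32)
spot_val = image[spot_mask]

raw = image.tobytes()
packed = (zlib.compress(bg_small.tobytes(), 9)
          + zlib.compress(spot_idx.tobytes(), 9)
          + zlib.compress(spot_val.tobytes(), 9))
print(f"raw {len(raw) / 1e6:.1f} MB -> packed {len(packed) / 1e6:.2f} MB "
      f"(~{len(raw) / len(packed):.0f}x), spot pixels kept exactly: {int(spot_mask.sum())}")

[The printed ratio applies only to this synthetic image, but the shape of the result is the point: the smooth channel tolerates very aggressive lossy treatment, while the comparatively few spot pixels can be kept exactly at negligible cost.]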
> > >> That's not an endorsement, btw, just an observation/prediction.
> > >>
> > >> phx.
> > >>
> > >> On 14/10/2011 23:56, Thomas C. Terwilliger wrote:
> > >>> For those who have strong opinions on what data should be deposited...
> > >>>
> > >>> The IUCr is just starting a serious discussion of this subject. Two committees, the "Data Deposition Working Group", led by John Helliwell, and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su), are working on this.
> > >>>
> > >>> Two key issues are (1) the feasibility and importance of deposition of raw images and (2) deposition of sufficient information to fully reproduce the crystallographic analysis.
> > >>>
> > >>> I am on both committees and would be happy to hear your ideas (off-list). I am sure the other members of the committees would welcome your thoughts as well.
> > >>>
> > >>> -Tom T
> > >>>
> > >>> Tom Terwilliger
> > >>> terwilli...@lanl.gov
> > >>>
> > >>>>> This is a follow-up (or a digression) to James comparing the test set to missing reflections. I have also heard this issue mentioned before but was always too lazy to actually pursue it.
> > >>>>>
> > >>>>> So.
> > >>>>>
> > >>>>> The role of the test set is to prevent overfitting. Let's say I have the final model, I monitored the Rfree every step of the way, and I can conclude that there is no overfitting. Should I do the final refinement against the complete dataset?
> > >>>>>
> > >>>>> IMCO, I absolutely should. The test-set reflections contain information, and the "final" model is actually biased towards the working set. Refining using all the data can only improve the accuracy of the model, if only slightly.
> > >>>>>
> > >>>>> The second question is practical. Let's say I want to deposit the results of the refinement against the full dataset as my final model. Should I not report the Rfree and instead insert a remark explaining the situation? If I report the Rfree prior to the test-set removal, it is certain that every validation tool will report a mismatch. It does not seem that the PDB has a mechanism to deal with this.
> > >>>>>
> > >>>>> Cheers,
> > >>>>>
> > >>>>> Ed.
> > >>>>>
> > >>>>> --
> > >>>>> Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
> > >>>>>                                            Julian, King of Lemurs
>
> --
> Prof. George M. Sheldrick FRS
> Dept. Structural Chemistry,
> University of Goettingen,
> Tammannstr. 4,
> D-37077 Goettingen, Germany
> Tel. +49-551-39-3021 or -3068
> Fax. +49-551-39-22582
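PS: on Ed's side-question buried at the bottom of the quoted thread, the bookkeeping at stake is only the following (a toy calculation with made-up amplitudes and a made-up error model, not the output of any refinement program):

import numpy as np

def r_factor(f_obs, f_calc, sel):
    """Standard crystallographic R = sum(||Fo| - |Fc||) / sum(|Fo|) over a selection."""
    fo, fc = np.abs(f_obs[sel]), np.abs(f_calc[sel])
    return np.sum(np.abs(fo - fc)) / np.sum(fo)

# Toy arrays standing in for a real reflection file, with the usual free-R flag
# marking ~5% of reflections as the test set.
rng = np.random.default_rng(3)
n = 20000
f_obs = rng.gamma(2.0, 50.0, n)
f_calc = f_obs * (1.0 + rng.normal(0.0, 0.18, n))   # imperfect model, assumed 18% error
free_flag = rng.random(n) < 0.05

r_work = r_factor(f_obs, f_calc, ~free_flag)
r_free = r_factor(f_obs, f_calc, free_flag)
print(f"R_work = {r_work:.3f}   R_free = {r_free:.3f}   test reflections = {int(free_flag.sum())}")

If the model is then refined once more against *all* reflections, the same calculation run against the new Fcalc no longer gives an unbiased Rfree - the test set has become part of the working set - which, as far as I can see, is exactly the mismatch Ed expects validation tools to complain about, and why a remark rather than a bare number seems to be needed on deposition.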