I agree with Colin here. Framing is simply a process of sampling the original signal at some 'frequency' (set by the phi width of each frame). At some point delta-phi is small enough that the original signal is oversampled and can be reconstructed _within the bounds of noise_. Beyond that point I see no advantage in sampling any finer, and certainly none in going to the limit of representing the data in some unframed, continuous-readout form.
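To make that concrete, here is a deliberately crude toy model (entirely my own, with made-up numbers, and nothing to do with any real integration program): one reflection's rocking curve is taken to be a Gaussian of assumed 0.3 deg FWHM, a single Poisson realisation of the "continuous" readout is drawn, and then the very same photons are re-binned into frames of various delta-phi.

import numpy as np

rng = np.random.default_rng(0)

# Assumed numbers, for illustration only
fwhm = 0.3             # deg, rocking width of one reflection
sigma = fwhm / 2.355
total_photons = 1e4    # total counts in the reflection
fine = 1e-4            # deg, the "continuous" readout step
phi = np.arange(-1.0, 1.0, fine)

# One Poisson realisation of the continuous signal
rate = total_photons * np.exp(-0.5 * (phi / sigma) ** 2) * fine / (sigma * np.sqrt(2 * np.pi))
counts = rng.poisson(rate)

def rebin(counts, phi, dphi):
    """Sum the fine trace into frames of width dphi; return frame centres and counts."""
    n = int(round(dphi / fine))
    m = (len(counts) // n) * n
    return phi[:m].reshape(-1, n).mean(axis=1), counts[:m].reshape(-1, n).sum(axis=1)

for dphi in (0.01, 0.05, 0.15):
    centres, frames = rebin(counts, phi, dphi)
    total = int(frames.sum())
    centroid = float((frames * centres).sum() / frames.sum())
    print(f"dphi={dphi:4.2f} deg  frames={len(frames):4d}  I={total:6d}  centroid={centroid:+.4f} deg")

The integrated intensity is identical by construction, and the recovered centroid agrees to within counting statistics for any delta-phi comfortably below the rocking width; slicing finer than that only re-divides the same photons.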
Perhaps I am missing something, and I realise this is another OT diversion from this most fruitful of threads.

Cheers
-- David

On 27 October 2011 08:55, Martin M. Ripoll <xmar...@iqfr.csic.es> wrote:
> Dear Colin,
>
> I think you understood perfectly what George was saying regarding the loss of information, but he will probably answer better than I.
>
> In any case, and for those who did not understand it, what George was saying is that a data collection made with continuous crystal rotation contains more information than when that information is transformed into frames. The loss of information we are referring to has the same meaning as when we calculate electron density maps with different grid sizes: the finer the grid, the greater the information on the map.
>
> But you are right that the shorter the interval between produced frames, the lower the loss of information. However, the procedure that you are suggesting should have some limits... otherwise the amount of information would grow dramatically.
>
> All the best,
> Martin
> ________________________________________
> Dr. Martin Martinez-Ripoll
> Research Professor
> xmar...@iqfr.csic.es
> Department of Crystallography & Structural Biology
> www.xtal.iqfr.csic.es
> Tel.: +34 917459550
> Consejo Superior de Investigaciones Científicas
> Spanish National Research Council
> www.csic.es
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Colin Nave
> Sent: Thursday, 27 October 2011 0:49
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] IUCr committees, depositing images
>
> Dear George, Martin
>
> I don't understand the point that one is throwing away information by storing in frames. If the frames have sufficiently fine intervals (given by some sampling-theorem consideration) I can't see how one loses information. Can one of you explain?
> Thanks
> Colin
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Martin M. Ripoll
> Sent: 26 October 2011 22:50
> To: ccp4bb
> Subject: Re: [ccp4bb] IUCr committees, depositing images
>
> Dear George, dear all,
>
> I was just trying to summarize my point of view regarding this important issue when I got your e-mail, which reflects exactly my own opinion!
>
> Martin
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of George M. Sheldrick
> Sent: Wednesday, 26 October 2011 11:52
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] IUCr committees, depositing images
>
> This raises an important point. The new continuous-readout detectors such as the Pilatus for beamlines or the Bruker Photon for in-house use enable the crystal to be rotated at constant velocity, eliminating the mechanical errors associated with 'stop and go' data collection. Storing their data in 'frames' is an artificial construction that is currently required for the established data integration programs but is in fact throwing away information. Maybe in 10 years' time 'frames' will be as obsolete as punched cards!
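[For what it's worth, the asymmetry George describes is easy to state in code. The sketch below is only my own illustration with invented counts: frames of any width can always be summed into coarser ones, but nothing can recover how the photons inside one frame were distributed in phi, which is exactly the detail a frameless (event-mode) record would keep.]

import numpy as np

# Hypothetical fine-sliced data: counts in 0.05-deg frames for one pixel.
rng = np.random.default_rng(1)
fine_frames = rng.poisson(lam=20.0, size=40)           # 40 frames x 0.05 deg = 2 deg

# Coarsening is always possible: sum pairs of frames to get 0.10-deg frames.
coarse_frames = fine_frames.reshape(-1, 2).sum(axis=1)

# The reverse is not: given one 0.10-deg count there is no way to know how the
# photons were split between its two 0.05-deg halves, let alone their individual
# arrival angles.  Any choice of frame width discards that intra-frame detail.
print(fine_frames.sum(), coarse_frames.sum())           # same total counts either way

[Whether that discarded detail ever matters is, I think, exactly Colin's point: once the frames already oversample the rocking width, it is detail about nothing but photon-counting noise.]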
> George
>
> On Wed, Oct 26, 2011 at 09:39:40AM +0100, Graeme Winter wrote:
> > Hi James,
> >
> > Just to pick up on your point about the Pilatus detectors. Yesterday, in 2 hours of giving a beamline a workout (admittedly with Thaumatin), we acquired 400+ GB of data*. Now I appreciate that this is not really routine operation, but it does raise an interesting point - if you have loaded a sample and centred it, collected test shots and decided it's not that great, why not collect anyway, as it may later prove to be useful?
> >
> > Bzzt. 2 minutes or less later you have a full data set, and barely even time to go get a cup of tea.
> >
> > This does to some extent move the goalposts, as you can acquire far more data than you need. You never know, you may learn something interesting from it - perhaps it has different symmetry or packing? What it does mean is that if we can have a method of tagging this data, there may be massively more opportunity to get also-ran data sets for methods-development types. What it also means, however, is that the cost of curating this data is then an order of magnitude higher.
> >
> > Moving it around is also rather more painful.
> >
> > Anyhow, I would try to avoid dismissing the effect that new continuous-readout detectors will have on data rates; from experience it is pretty substantial.
> >
> > Cheerio,
> >
> > Graeme
> >
> > *by "data" here what I mean is images, rather than information, which is rather more time-consuming to acquire. I would argue you get that from processing / analysing the data...
> >
> > On 24 October 2011 22:56, James Holton <jmhol...@lbl.gov> wrote:
> > > The Pilatus is fast, but for decades now we have had detectors that can read out in ~1 s. This means that you can collect a typical ~100-image dataset in a few minutes (if flux is not limiting). Since there are ~150 beamlines currently operating around the world and they are open about 200 days/year, we should be collecting ~20,000,000 datasets each year.
> > >
> > > We're not.
> > >
> > > The PDB only gets about 8000 depositions per year, which means that either we throw away 99.96% of our images or we don't actually collect images anywhere near the ultimate capacity of the equipment we have. In my estimation, both of these play about equal roles, with ~50-fold attrition between ultimate data-collection capacity and actually collected data, and another ~50-fold attrition between collected data sets and published structures.
> > >
> > > Personally, I think this means that the time it takes to collect the final dataset is not rate-limiting in a "typical" structural biology project/paper. This does not mean that the dataset is of little value. Quite the opposite! About 3000x more time and energy is expended preparing for the final dataset than is spent collecting it, and these efforts require experimental feedback. The trick is figuring out how best to compress the "data used to solve a structure" for archival storage. Do the "previous data sets" count? Or should the compression be "lossy" about such historical details? Does the stuff between the spots matter? After all, h,k,l,F,sigF is really just a form of data compression. In fact, there is no such thing as "raw" data. Even "raw" diffraction images are a simplification of the signals that came out of the detector electronics.
> > > But we round off and average over a lot of things to remove "noise", largely because "noise" is difficult to compress. The question of how much compression is too much compression depends on which information (aka noise) you think could be important in the future.
> > >
> > > When it comes to fine-sliced data, such as that from the Pilatus, the main reason why it doesn't compress very well is not because of the spots, but the background: it occupies thousands of times more pixels than the spots. Yes, there is diffuse-scattering information in the background pixels, but this kind of data is MUCH smoother than the spot data (by definition), and is therefore optimally stored in larger pixels. Last year I messed around a bit with applying different compression protocols to the spots and the background, and found that ~30-fold compression can easily be achieved if you apply h264 to the background and store the "spots" with lossless PNG compression:
> > >
> > > http://bl831.als.lbl.gov/~jamesh/lossy_compression/
> > >
> > > I think these results "speak" to the relative information content of the spots and the pixels between them. Perhaps at least the "online version" of archived images could be in some sort of lossy-background format, with the "real images" in some sort of slower storage (like a room full of tapes that are available upon request)? Would 30-fold compression make the storage of image data tractable enough for some entity like the PDB to be able to afford it?
> > >
> > > I go to a lot of methods meetings, and it pains me to see the most brilliant minds in the field starved for "interesting" data sets. The problem is that it is very easy to get people to send you data that is so bad that it can't be solved by any software imaginable (I've got piles of that!). As a developer, what you really need is a "right answer" so you can come up with better metrics for how close you are to it. Ironically, bad, unsolvable data that is connected to a right answer (aka a PDB ID) is very difficult to obtain. The explanations usually involve protestations about being in the middle of writing up the paper, or the student graduated and we don't understand how he/she labeled the tapes, or the RAID crashed and we lost it all, etc. Then again, just finding someone who has a data set with the kind of problem you are interested in is a lot of work! So is figuring out which problem affects the most people, and is therefore "interesting".
> > >
> > > Is this not exactly the kind of thing that publicly accessible centralized scientific databases are created to address?
> > >
> > > -James Holton
> > > MAD Scientist
> > >
> > > On 10/16/2011 11:38 AM, Frank von Delft wrote:
> > >> On the deposition of raw data:
> > >>
> > >> I recommend to the committee that before it convenes again, every member should go collect some data on a beamline with a Pilatus detector [feel free to join us at Diamond]. Because by the probable time any recommendations actually emerge, most beamlines will have one of those (or similar), we'll be generating more data than the LHC, and users will be happy just to have it integrated, never mind worry about its fate.
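[Going back to James's background/spot split a few paragraphs up: the real numbers live at the URL he gives; the fragment below is only a crude stand-in I put together to show the shape of the idea, with 8x8 block medians plus zlib doing duty for h264 and PNG, and a synthetic image rather than real detector data.]

import zlib
import numpy as np

rng = np.random.default_rng(2)

# Fake 1024x1024 detector image: smooth background plus a few hundred sharp spots.
ny = nx = 1024
y, x = np.mgrid[0:ny, 0:nx]
true_bg = 50.0 + 30.0 * np.exp(-((x - nx / 2) ** 2 + (y - ny / 2) ** 2) / (2 * 300.0 ** 2))
image = rng.poisson(true_bg).astype(np.int32)
for _ in range(300):
    cy, cx = rng.integers(10, ny - 10), rng.integers(10, nx - 10)
    image[cy - 1:cy + 2, cx - 1:cx + 2] += rng.integers(200, 5000)

# "Background channel": keep only the 8x8 block medians (lossy - the per-pixel
# background noise is thrown away, as in any smooth/lossy representation).
blocks = image.reshape(ny // 8, 8, nx // 8, 8)
bg_small = np.median(blocks, axis=(1, 3)).astype(np.float32)

# "Spot channel": keep, losslessly, only the pixels far above the local background.
bg_full = np.repeat(np.repeat(bg_small, 8, axis=0), 8, axis=1)
spot_mask = image > bg_full + 6.0 * np.sqrt(np.maximum(bg_full, 1.0))
spot_idx = np.flatnonzero(spot_mask).astype(np.int32)
spot_val = image[spot_mask]

raw = image.tobytes()
packed = (zlib.compress(bg_small.tobytes(), 9)
          + zlib.compress(spot_idx.tobytes(), 9)
          + zlib.compress(spot_val.tobytes(), 9))
print(f"raw {len(raw) / 1e6:.1f} MB -> packed {len(packed) / 1e6:.2f} MB "
      f"(~{len(raw) / len(packed):.0f}x), spot pixels kept exactly: {int(spot_mask.sum())}")

[The printed ratio applies only to this synthetic image, but the shape of the result is the point: the smooth channel tolerates very aggressive lossy treatment, while the comparatively few spot pixels can be kept exactly at negligible cost.]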
> > >> That's not an endorsement, btw, just an observation/prediction.
> > >>
> > >> phx.
> > >>
> > >> On 14/10/2011 23:56, Thomas C. Terwilliger wrote:
> > >>> For those who have strong opinions on what data should be deposited...
> > >>>
> > >>> The IUCr is just starting a serious discussion of this subject. Two committees, the "Data Deposition Working Group", led by John Helliwell, and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su), are working on this.
> > >>>
> > >>> Two key issues are (1) the feasibility and importance of deposition of raw images and (2) deposition of sufficient information to fully reproduce the crystallographic analysis.
> > >>>
> > >>> I am on both committees and would be happy to hear your ideas (off-list). I am sure the other members of the committees would welcome your thoughts as well.
> > >>>
> > >>> -Tom T
> > >>>
> > >>> Tom Terwilliger
> > >>> terwilli...@lanl.gov
> > >>>
> > >>>>> This is a follow-up (or a digression) to James comparing the test set to missing reflections. I have also heard this issue mentioned before but was always too lazy to actually pursue it.
> > >>>>>
> > >>>>> So.
> > >>>>>
> > >>>>> The role of the test set is to prevent overfitting. Let's say I have the final model, I monitored the Rfree every step of the way, and I can conclude that there is no overfitting. Should I do the final refinement against the complete dataset?
> > >>>>>
> > >>>>> IMCO, I absolutely should. The test-set reflections contain information, and the "final" model is actually biased towards the working set. Refining using all the data can only improve the accuracy of the model, if only slightly.
> > >>>>>
> > >>>>> The second question is practical. Let's say I want to deposit the results of the refinement against the full dataset as my final model. Should I not report the Rfree and instead insert a remark explaining the situation? If I report the Rfree prior to the test-set removal, it is certain that every validation tool will report a mismatch. It does not seem that the PDB has a mechanism to deal with this.
> > >>>>>
> > >>>>> Cheers,
> > >>>>>
> > >>>>> Ed.
> > >>>>>
> > >>>>> --
> > >>>>> Oh, suddenly throwing a giraffe into a volcano to make water is crazy?
> > >>>>>                                            Julian, King of Lemurs
>
> --
> Prof. George M. Sheldrick FRS
> Dept. Structural Chemistry,
> University of Goettingen,
> Tammannstr. 4,
> D-37077 Goettingen, Germany
> Tel. +49-551-39-3021 or -3068
> Fax. +49-551-39-22582
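PS: on Ed's side-question buried at the bottom of the quoted thread, the bookkeeping at stake is only the following (a toy calculation with made-up amplitudes and a made-up error model, not the output of any refinement program):

import numpy as np

def r_factor(f_obs, f_calc, sel):
    """Standard crystallographic R = sum(||Fo| - |Fc||) / sum(|Fo|) over a selection."""
    fo, fc = np.abs(f_obs[sel]), np.abs(f_calc[sel])
    return np.sum(np.abs(fo - fc)) / np.sum(fo)

# Toy arrays standing in for a real reflection file, with the usual free-R flag
# marking ~5% of reflections as the test set.
rng = np.random.default_rng(3)
n = 20000
f_obs = rng.gamma(2.0, 50.0, n)
f_calc = f_obs * (1.0 + rng.normal(0.0, 0.18, n))   # imperfect model, assumed 18% error
free_flag = rng.random(n) < 0.05

r_work = r_factor(f_obs, f_calc, ~free_flag)
r_free = r_factor(f_obs, f_calc, free_flag)
print(f"R_work = {r_work:.3f}   R_free = {r_free:.3f}   test reflections = {int(free_flag.sum())}")

If the model is then refined once more against *all* reflections, the same calculation run against the new Fcalc no longer gives an unbiased Rfree - the test set has become part of the working set - which, as far as I can see, is exactly the mismatch Ed expects validation tools to complain about, and why a remark rather than a bare number seems to be needed on deposition.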