I believe that archiving original images for published data sets could be very useful, if linked to the PDB. I have downloaded SFs from the PDB to re-refine a published model when I thought the electron density maps were misinterpreted, and have personally arrived at a different interpretation of the density (ion vs. small ligand). With that in mind, re-processing from the original images could be useful for catching mistakes in processing (especially if a high R-factor or low I/sigma is reported), albeit only a small percentage of the time.
As for difficult data sets, problematic cases, etc., I can see the importance of their availability from the preceding arguments. That seems most useful for software developers. In that case, I would suggest that software developers publicly request our difficult-to-process images, or create their own repository. Then they can store and use the data as they like. I would happily upload a few data sets.

(Just a suggestion)

Best Wishes,
Kelly Daughtry

*******************************************************
Kelly Daughtry, Ph.D.
Post-Doctoral Fellow, Raetz Lab
Biochemistry Department
Duke University
Alex H. Sands, Jr. Building
303 Research Drive RM 250
Durham, NC 27710
P: 919-684-5178
*******************************************************

On Mon, Oct 31, 2011 at 12:01 PM, Martin Kollmar <m...@nmr.mpibpc.mpg.de> wrote:

> The point is that science is not collecting stamps. Therefore the first question should always be "Why". If you start with "What" the discussion immediately switches to technical issues like how many TB, PB etc., $/€, manpower. And all the intense discussion will blow out by one single "Why". Nothing is for free. But if it would help science and mankind, nobody would hesitate to spend millions of $/€.
>
> Supporting software development / software developers is a different question. If this were the first question that someone had asked, the answer would never have been "archiving all datasets worldwide / deposited structures", but how could we, the community, build up a resource with different kinds of problems (e.g. space groups, twinning, overlapping lattices, etc.).
>
> I still haven't got an answer for "Why".
>
> Best regards,
> Martin
>
> On 31.10.2011 16:18, Oganesyan, Vaheh wrote:
>
> I was hesitant to add my opinion so far because I'm more used to listening to this forum than telling others what I think.
> "Why" and "what" to deposit are absolutely interconnected.
> Once you decide why you want to do it, then you will probably know what will be the best format and *vice versa*.
> Whether this deposition of raw images will or will not help in understanding the biology better in the future, I'm not sure.
> But to store those difficult datasets to help future software development sounds really far-fetched. This assumes that in the future crystallographers will never grow crystals that will deliver difficult datasets. If that is the case, and in 10-20-30 years the next generation will be growing much better crystals, then they won't need such software development.
> If that is not the case, and once in a while (or more often) they will be getting something out of the ordinary, then software developers will take those datasets and develop whatever they need to handle such cases.
>
> Am I missing a point of discussion here?
>
> Regards,
>
> Vaheh
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Robert Esnouf
> Sent: Monday, October 31, 2011 10:31 AM
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
>
> Dear All,
>
> As someone who recently left crystallography for sequencing, I should modify Tassos's point...
>
> "A full data-set is a few terabytes, but post-processing reduces it to sub-Gb size."
>
> My experience from HiSeqs is that this "full" here means the base calls - equivalent to the unmerged HKLs - hardly raw data. NGS (short-read) sequencing is an imaging technique and the images are more like >100TB for a 15-day run on a single flow cell. The raw base calls are about 5TB. The compressed, mapped data (BAM file, for a human genome, 30x coverage) is about 120GB. It is only a variant call file (VCF, difference from a stated human reference genome) that is sub-Gb and these files are - unsurprisingly - unsuited to detailed statistical analysis.
> Also $1k is not yet an economic cost...
>
> The DNA information capacity in a single human body dwarfs the entire world disk capacity, so storing DNA is a no-brainer here. Sequencing groups are making very hard-nosed economic decisions about what to store - indeed it is a source of research in itself - but the scale of the problem is very much bigger.
>
> My tuppence ha'penny is that depositing "raw" images along with everything else in the PDB is a nice idea but would have little impact on science (human/animal/plant health or understanding of biology).
>
> 1) If confined to structures in the PDB, the images would just be the ones giving the final best data - hence the ones least likely to have been problematic. I'd be more interested in SFs/maps for looking at ligand-binding etc...
>
> 2) Unless this were done before paper acceptance they would be of little use to referees seeking to review important structural papers. I'd like to see PDB validation reports (which could include automated data processing, perhaps culled from synchrotron sites, SFs and/or maps) made available to referees in advance of publication. This would be enabled by deposition, but could be achieved in other ways.
>
> 3) The datasets of interest to methods developers are unlikely to be the ones deposited. They should be in contact with synchrotron archives directly. Processing multiple lattices is a case in point here.
>
> 4) Remember the "average consumer" of a PDB file is not a crystallographer. More likely to be a graduate student in a clinical lab. For him/her things like occupancies and B-factors are far more serious concerns... I'm not trivializing the issue, but importance is always relative. Are there "outsiders" on the panel to keep perspective?
>
> Robert
>
> --
>
> Dr. Robert Esnouf,
> University Research Lecturer, ex-crystallographer
> and Head of Research Computing,
> Wellcome Trust Centre for Human Genetics,
> Roosevelt Drive, Oxford OX3 7BN, UK
>
> Emails: rob...@strubi.ox.ac.uk Tel: (+44) - 1865 - 287783
> and rob...@esnouf.com Fax: (+44) - 1865 - 287547
>
> ---- Original message ----
> >Date: Mon, 31 Oct 2011 11:37:47 +0100
> >From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> (on behalf of Anastassis Perrakis <a.perra...@nki.nl>)
> >Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
> >To: CCP4BB@JISCMAIL.AC.UK
> >
> > Dear all,
> > The discussion about keeping primary data, and what level of data can be considered 'primary', has - rather unsurprisingly - come up also in areas other than structural biology.
> > An example is next generation sequencing. A full-dataset is a few terabytes, but post-processing reduces it to sub-Gb size. However, the post-processed data, as in our case, have suffered the inadequacy of computational "reduction" ... At least our institute has decided to create double back-up of the primary data in triplicate. For that reason our facility bought three -80 freezers, one on site in the basement, one on the top floor, and one off-site, and they keep the DNA to be sequenced. A sequencing run is already sub-1k$ and it will not become more expensive. So, if it's important, do it again. It's cheaper and it's better.
> > At first sight, that does not apply to MX. Or does it?
> > So, maybe the question is not "To archive or not to archive" but "What to archive".
> > (similarly, it never crossed my mind if I should "be or not be" - I always wondered "what to be")
> > A.
> > On Oct 30, 2011, at 11:59, Kay Diederichs wrote:
> >
> > At 20:59, Jrh wrote:
> > ...
> > > > So:- Universities are now establishing their own institutional repositories, driven largely by Open Access demands of funders. For these to host raw datasets that underpin publications is a reasonable role in my view and indeed they already have this category in the University of Manchester eScholar system, for example. I am set to explore locally here whether they would accommodate all our Lab's raw Xray images datasets per annum that underpin our published crystal structures.
> > > > It would be helpful if readers of this CCP4bb could kindly also explore with their own universities if they have such an institutional repository and if raw data sets could be accommodated. Please do email me off list with this information if you prefer but within the CCP4bb is also good.
> >
> > Dear John,
> >
> > I'm pretty sure that there exists no consistent policy to provide an "institutional repository" for deposition of scientific data at German universities or Max-Planck institutes or Helmholtz institutions, at least I never heard of something like this. More specifically, our University of Konstanz certainly does not have the infrastructure to provide this.
> >
> > I don't think that Germany is the only country which is the exception to any rule of availability of "institutional repository". Rather, I'm almost amazed that British and American institutions seem to support this.
> >
> > Thus I suggest to not focus exclusively on official institutional repositories, but to explore alternatives: distributed filestores like Google's BigTable, Bittorrent or others might be just as suitable - check out http://en.wikipedia.org/wiki/Distributed_data_store.
> > I guess that any crystallographic lab could easily sacrifice/donate a TB of storage for the purposes of this project in 2011 (and maybe 2 TB in 2012, 3 in 2013, ...), but clearly the level of work to set this up should be kept as low as possible (a bittorrent daemon seems simple enough).
> >
> > Just my 2 cents,
> >
> > Kay
> >
> > Anastassis (Tassos) Perrakis, Principal Investigator / Staff Member
> > Department of Biochemistry (B8)
> > Netherlands Cancer Institute,
> > Dept. B8, 1066 CX Amsterdam, The Netherlands
> > Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile / SMS: +31 6 28 597791