Re: [ccp4bb] To archive or not to archive, that's the question!

Kelly Daughtry Mon, 31 Oct 2011 10:44:19 -0700

I believe that archiving original images for published data sets could be
very useful, if linked to the PDB.
I have downloaded SFs from the PDB to use for re-refinement of the
published model (if I think the electron density maps are misinterpreted)
and personally had a different interpretation of the density (ion vs small
ligand). With that in mind, re-processing from the original images could be
useful for catching mistakes in processing (especially if a high R-factor
or low I/sigma are reported), albeit a small percentage of the time.


As for difficult data sets, problematic cases, etc, I can see the
importance of their availability by the preceding arguments.
It seems to be most useful for software developers. In that case, I would
suggest that software developers publicly request our difficult-to-process
images, or create their own repository. Then they can store and use the
data as they like.  I would happily upload a few data sets.
(Just a suggestion)

Best Wishes,
Kelly Daughtry

*******************************************************
Kelly Daughtry, Ph.D.
Post-Doctoral Fellow, Raetz Lab
Biochemistry Department
Duke University
Alex H. Sands, Jr. Building
303 Research Drive
RM 250
Durham, NC 27710
P: 919-684-5178
*******************************************************


On Mon, Oct 31, 2011 at 12:01 PM, Martin Kollmar <m...@nmr.mpibpc.mpg.de> wrote:

>  The point is that science is not collecting stamps. Therefore the first
> question should always be "Why". If you start with "What" the discussion
> immediately switches to technical issues like how many TB, PB etc. $/€,
> manpower. And all the intense discussion will be blown out by one single "Why".
> Nothing is for free. But if it would help science and mankind, nobody would
> hesitate to spend millions of $/€.
>
> Supporting software development / software developers is a different
> question. If this had been the first question asked, the answer would
> never have been "archiving all datasets worldwide /
> deposited structures", but rather how we, the community, could build up a
> resource covering different kinds of problems (e.g. space groups, twinning,
> overlapping lattices, etc.).
>
> I still haven't got an answer to "Why".
>
> Best regards,
> Martin
>
>
>
> On 31.10.2011 16:18, Oganesyan, Vaheh wrote:
>
>
> I was hesitant to add my opinion so far because I'm more used to listening
> to this forum than to telling others what I think.
> "Why" and "what" to deposit are absolutely interconnected. Once you decide
> why you want to do it, then you will probably know what will be the best
> format and *vice versa*.
> Whether this deposition of raw images will help us understand the biology
> better in the future, I'm not sure.
> But storing those difficult datasets to help future software
> development sounds really far-fetched. This assumes that in the future
> crystallographers will never grow crystals that deliver difficult
> datasets. If that is the case, and in 10, 20 or 30 years the next generation
> is growing much better crystals, then they won't need such software
> development.
> If that is not the case, and once in a while (or more often) they get
> something out of the ordinary, then software developers will take those
> datasets and develop whatever they need to handle such cases.
>
> Am I missing a point of discussion here?
>
> Regards,
>
>       Vaheh
>
>
>
>
> -----Original Message-----
> From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK]
> On Behalf Of Robert Esnouf
> Sent: Monday, October 31, 2011 10:31 AM
> To: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
>
> Dear All,
>
> As someone who recently left crystallography for sequencing, I
> should modify Tassos's point...
>
> "A full data-set is a few terabytes, but post-processing
> reduces it to sub-Gb size."
>
> My experience from HiSeqs is that this "full" here means the
> base calls - equivalent to the unmerged HKLs - hardly raw
> data. NGS (short-read) sequencing is an imaging technique and
> the images are more like >100TB for a 15-day run on a single
> flow cell. The raw base calls are about 5TB. The compressed,
> mapped data (BAM file, for a human genome, 30x coverage) is
> about 120GB. It is only a variant call file (VCF, difference
> from a stated human reference genome) that is sub-Gb and these
> files are - unsurprisingly - unsuited to detailed statistical
> analysis. Also $1k is not yet an economic cost...
>
> The DNA information capacity in a single human body dwarfs the
> entire world disk capacity, so storing DNA is a no brainer
> here. Sequencing groups are making very hard-nosed economic
> decisions about what to store - indeed it is a source of
> research in itself - but the scale of the problem is very much
> bigger.
>
> My tuppence ha'penny is that depositing "raw" images along
> with everything else in the PDB is a nice idea but would have
> little impact on science (human/animal/plant health or
> understanding of biology).
>
> 1) If confined to structures in the PDB, the images would just
> be the ones giving the final best data - hence the ones least
> likely to have been problematic. I'd be more interested in
> SFs/maps for looking at ligand-binding etc...
>
> 2) Unless this were done before paper acceptance they would be
> of little use to referees seeking to review important
> structural papers. I'd like to see PDB validation reports
> (which could include automated data processing, perhaps culled
> from synchrotron sites, SFs and/or maps) made available to
> referees in advance of publication. This would be enabled by
> deposition, but could be achieved in other ways.
>
> 3) The datasets of interest to methods developers are unlikely
> to be the ones deposited. They should be in contact with
> synchrotron archives directly. Processing multiple lattices is
> a case in point here.
>
> 4) Remember the "average consumer" of a PDB file is not a
> crystallographer. More likely to be a graduate student in a
> clinical lab. For him/her things like occupancies and B-
> factors are far more serious concerns... I'm not trivializing
> the issue, but importance is always relative. Are there
> "outsiders" on the panel to keep perspective?
>
> Robert
>
>
> --
>
> Dr. Robert Esnouf,
> University Research Lecturer, ex-crystallographer
> and Head of Research Computing,
> Wellcome Trust Centre for Human Genetics,
> Roosevelt Drive, Oxford OX3 7BN, UK
>
> Emails: rob...@strubi.ox.ac.uk   Tel: (+44) - 1865 - 287783
>     and rob...@esnouf.com        Fax: (+44) - 1865 - 287547
>
>
> ---- Original message ----
> >Date: Mon, 31 Oct 2011 11:37:47 +0100
> >From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> (on behalf of
> >Anastassis Perrakis <a.perra...@nki.nl>)
> >Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
> >To: CCP4BB@JISCMAIL.AC.UK
> >
> >   Dear all,
> >   The discussion about keeping primary data, and what
> >   level of data can be considered 'primary', has -
> >   rather unsurprisingly - come up also in areas other
> >   than structural biology.
> >   An example is next generation sequencing. A
> >   full dataset is a few terabytes, but
> >   post-processing reduces it to sub-Gb size. However,
> >   the post-processed data, as in our case,
> >   have suffered the inadequacy of computational
> >   "reduction" ... At least our institute has decided
> >   to create double back-up of the primary data in
> >   triplicate. For that reason our facility bought
> >   three -80 freezers, one on site in the basement, one
> >   on the top floor, and one off-site, and they keep
> >   the DNA to be sequenced. A sequencing run is already
> >   sub-1k$ and it will not become
> >   more expensive. So, if it's important, do it again.
> >   It's cheaper and it's better.
> >   At first sight, that does not apply to MX. Or does
> >   it?
> >   So, maybe the question is not "To archive or not to
> >   archive" but "What to archive".
> >   (similarly, it never crossed my mind if I should "be
> >   or not be" - I always wondered "what to be")
> >   A.
> >   On Oct 30, 2011, at 11:59, Kay Diederichs wrote:
> >
> >     At 20:59, Jrh wrote:
> >     ...
> >
> >       So:-  Universities are now establishing their
> >       own institutional repositories, driven largely
> >       by Open Access demands of funders. For these to
> >       host raw datasets that underpin publications is
> >       a reasonable role in my view and indeed they
> >       already have this category in the University of
> >       Manchester eScholar system, for example.  I am
> >       set to explore locally here whether they would
> >       accommodate all our Lab's raw Xray images
> >       datasets per annum that underpin our published
> >       crystal structures.
> >
> >       It would be helpful if readers of this CCP4bb
> >       could kindly also explore with their own
> >       universities if they have such an institutional
> >       repository and if raw data sets could be
> >       accommodated. Please do email me off list with
> >       this information if you prefer but within the
> >       CCP4bb is also good.
> >
> >     Dear John,
> >
> >     I'm pretty sure that there exists no consistent
> >     policy to provide an "institutional repository"
> >     for deposition of scientific data at German
> >     universities or Max-Planck institutes or Helmholtz
> >     institutions, at least I never heard of something
> >     like this. More specifically, our University of
> >     Konstanz certainly does not have the
> >     infrastructure to provide this.
> >
> >     I don't think that Germany is the only country
> >     which is the exception to any rule of availability
> >     of "institutional repository". Rather, I'm almost
> >     amazed that British and American institutions seem
> >     to support this.
> >
> >     Thus I suggest not focusing exclusively on
> >     official institutional repositories, but to
> >     explore alternatives: distributed filestores like
> >     Google's BigTable, Bittorrent or others might be
> >     just as suitable - check out
> >     http://en.wikipedia.org/wiki/Distributed_data_store.
> >     I guess that any crystallographic lab could easily
> >     sacrifice/donate a TB of storage for the purposes
> >     of this project in 2011 (and maybe 2 TB in 2012, 3
> >     in 2013, ...), but clearly the level of work to
> >     set this up should be kept as low as possible (a
> >     bittorrent daemon seems simple enough).
> >
> >     Just my 2 cents,
> >
> >     Kay
> >
> >   Please don't print this e-mail unless you really
> >   need to
> >   Anastassis (Tassos) Perrakis, Principal Investigator
> >   / Staff Member
> >   Department of Biochemistry (B8)
> >   Netherlands Cancer Institute,
> >   Dept. B8, 1066 CX Amsterdam, The Netherlands
> >   Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile /
> >   SMS: +31 6 28 597791
>
> To the extent this electronic communication or any of its attachments
> contain information that is not in the public domain, such information is
> considered by MedImmune to be confidential and proprietary. This
> communication is expected to be read and/or used only by the individual(s)
> for whom it is intended. If you have received this electronic communication
> in error, please reply to the sender advising of the error in transmission
> and delete the original message and any accompanying documents from your
> system immediately, without copying, reviewing or otherwise using them for
> any purpose. Thank you for your cooperation.
>
>
