Dear All,

As someone who recently left crystallography for sequencing, I 
should modify Tassos's point...

"A full data-set is a few terabytes, but post-processing 
reduces it to sub-Gb size."

My experience from HiSeqs is that this "full" here means the 
base calls - equivalent to the unmerged HKLs - hardly raw 
data. NGS (short-read) sequencing is an imaging technique and 
the images are more like >100TB for a 15-day run on a single 
flow cell. The raw base calls are about 5TB. The compressed, 
mapped data (BAM file, for a human genome, 30x coverage) is 
about 120GB. It is only a variant call file (VCF, difference 
from a stated human reference genome) that is sub-Gb and these 
files are - unsurprisingly - unsuited to detailed statistical 
analysis. Also $1k is not yet an economic cost...
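
To put rough numbers on the reduction at each step, here is a minimal Python sketch. It uses only the approximate sizes quoted above. The decimal units (1 TB = 1000 GB), the 1 GB ceiling standing in for "sub-Gb", and the names STAGES and reduction_factors are assumptions made for illustration, not measured values.

```python
# Approximate sizes quoted above, in GB (assuming decimal units: 1 TB = 1000 GB).
STAGES = [
    ("raw images", 100_000.0),   # ">100TB": treated as a lower bound
    ("base calls", 5_000.0),     # "about 5TB"
    ("BAM (30x human)", 120.0),  # "about 120GB"
    ("VCF", 1.0),                # "sub-Gb": assumed upper bound of 1 GB
]


def reduction_factors(stages):
    """Return (source, target, factor) for each consecutive pair of stages."""
    return [
        (src_name, dst_name, src_size / dst_size)
        for (src_name, src_size), (dst_name, dst_size) in zip(stages, stages[1:])
    ]


if __name__ == "__main__":
    for src, dst, factor in reduction_factors(STAGES):
        print(f"{src} -> {dst}: ~{factor:.0f}x smaller")
    overall = STAGES[0][1] / STAGES[-1][1]
    print(f"raw images -> VCF: ~{overall:,.0f}x smaller")
```

Even taking the image figure as a lower bound, going from images to a VCF is a reduction of roughly five orders of magnitude, which is why the choice of what counts as "raw" matters so much.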

The DNA information capacity in a single human body dwarfs the 
entire world's disk capacity, so storing DNA is a no-brainer 
here. Sequencing groups are making very hard-nosed economic 
decisions about what to store - indeed it is a source of 
research in itself - but the scale of the problem is very much 
bigger.

My tuppence ha'penny is that depositing "raw" images along 
with everything else in the PDB is a nice idea but would have 
little impact on science (human/animal/plant health or 
understanding of biology).

1) If confined to structures in the PDB, the images would just 
be the ones giving the final best data - hence the ones least 
likely to have been problematic. I'd be more interested in 
SFs/maps for looking at ligand-binding etc...

2) Unless this were done before paper acceptance they would be 
of little use to referees seeking to review important 
structural papers. I'd like to see PDB validation reports 
(which could include automated data processing, perhaps culled 
from synchrotron sites, SFs and/or maps) made available to 
referees in advance of publication. This would be enabled by 
deposition, but could be achieved in other ways.

3) The datasets of interest to methods developers are unlikely 
to be the ones deposited. They should be in contact with 
synchrotron archives directly. Processing multiple lattices is 
a case in point here.

4) Remember the "average consumer" of a PDB file is not a 
crystallographer. More likely to be a graduate student in a 
clinical lab. For him/her things like occupancies and B-
factors are far more serious concerns... I'm not trivializing 
the issue, but importance is always relative. Are there 
"outsiders" on the panel to keep perspective?

Robert


--

Dr. Robert Esnouf,
University Research Lecturer, ex-crystallographer
and Head of Research Computing,
Wellcome Trust Centre for Human Genetics,
Roosevelt Drive, Oxford OX3 7BN, UK

Emails: rob...@strubi.ox.ac.uk   Tel: (+44) - 1865 - 287783
    and rob...@esnouf.com        Fax: (+44) - 1865 - 287547


---- Original message ----
>Date: Mon, 31 Oct 2011 11:37:47 +0100
>From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> (on behalf of Anastassis Perrakis <a.perra...@nki.nl>)
>Subject: Re: [ccp4bb] To archive or not to archive, that's the question!
>To: CCP4BB@JISCMAIL.AC.UK
>
>   Dear all,
>   The discussion about keeping primary data, and what
>   level of data can be considered 'primary', has -
>   rather unsurprisingly - come up also in areas other
>   than structural biology.
>   An example is next generation sequencing. A
>   full dataset is a few terabytes, but
>   post-processing reduces it to sub-Gb size. However,
>   the post-processed data, as in our case,
>   have suffered the inadequacy of computational
>   "reduction" ... At least our institute has decided
>   to create double back-up of the primary data in
>   triplicate. For that reason our facility bought
>   three -80 freezers, one on site in the basement, one
>   on the top floor, and one off-site, and they keep
>   the DNA to be sequenced. A sequencing run is already
>   sub-1k$ and it will not become
>   more expensive. So, if it's important, do it again.
>   It's cheaper and it's better.
>   At first sight, that does not apply to MX. Or does
>   it?
>   So, maybe the question is not "To archive or not to
>   archive" but "What to archive".
>   (similarly, it never crossed my mind if I should "be
>   or not be" - I always wondered "what to be")
>   A.
>   On Oct 30, 2011, at 11:59, Kay Diederichs wrote:
>
>     At 20:59, Jrh wrote:
>     ...
>
>       So:- Universities are now establishing their own
>       institutional repositories, driven largely by Open
>       Access demands of funders. For these to host raw
>       datasets that underpin publications is a reasonable
>       role in my view and indeed they already have this
>       category in the University of Manchester eScholar
>       system, for example. I am set to explore locally
>       here whether they would accommodate all our Lab's
>       raw Xray images datasets per annum that underpin our
>       published crystal structures.
>
>       It would be helpful if readers of this CCP4bb could
>       kindly also explore with their own universities if
>       they have such an institutional repository and if
>       raw data sets could be accommodated. Please do email
>       me off list with this information if you prefer but
>       within the CCP4bb is also good.
>
>     Dear John,
>
>     I'm pretty sure that there exists no consistent
>     policy to provide an "institutional repository"
>     for deposition of scientific data at German
>     universities or Max-Planck institutes or Helmholtz
>     institutions, at least I never heard of something
>     like this. More specifically, our University of
>     Konstanz certainly does not have the
>     infrastructure to provide this.
>
>     I don't think that Germany is the only country
>     which is the exception to any rule of availability
>     of "institutional repository". Rather, I'm almost
>     amazed that British and American institutions seem
>     to support this.
>
>     Thus I suggest to not focus exclusively on
>     official institutional repositories, but to
>     explore alternatives: distributed filestores like
>     Google's BigTable, Bittorrent or others might be
>     just as suitable - check out
>     http://en.wikipedia.org/wiki/Distributed_data_store.
>     I guess that any crystallographic lab could easily
>     sacrifice/donate a TB of storage for the purposes
>     of this project in 2011 (and maybe 2 TB in 2012, 3
>     in 2013, ...), but clearly the level of work to
>     set this up should be kept as low as possible (a
>     bittorrent daemon seems simple enough).
>
>     Just my 2 cents,
>
>     Kay
>
>   Anastassis (Tassos) Perrakis, Principal Investigator
>   / Staff Member
>   Department of Biochemistry (B8)
>   Netherlands Cancer Institute,
>   Dept. B8, 1066 CX Amsterdam, The Netherlands
>   Tel: +31 20 512 1951 Fax: +31 20 512 1954 Mobile /
>   SMS: +31 6 28 597791
