Still, after hundreds (?) of emails on this topic, I haven't seen any
convincing argument in favor of archiving the data. The only convincing
arguments are against it, and they come from Gerard K and Tassos.
Why?
The question is not what to archive, but why we should archive all the
data in the first place.
Because software developers need more data? Should we take on all this
effort and cost because 10 developers worldwide need the data for ALL
protein structures? Do they really need that much data? Wouldn't a
repository of maybe 1000 datasets be enough for development purposes?
Does anyone really believe that our view of the actual problem, the
function of the proteins, will change through analysis of whatever
scattering is still in the images but not used by today's software?
Crystal structures are static snapshots, obtained under artificial
conditions. In solution (still the physiological state) the proteins
might look different, not much, but certainly far more dynamic. Does it
therefore matter whether we know some side-chain positions slightly
better (in the crystal structure) after re-analysing the data?
Conversely, is our current software really so bad that we would expect
strong differences upon re-analysis? No. And if the structures do change
upon re-analysis (more or less), who re-interprets the structures and
re-writes the papers?
There are many, many cases where researchers have re-determined
structures (or determined structures closely related to already
available ones, such as mutants, structures from closely related
species, etc.), even after 10 years. Presumably they used the latest
software in each case, and thus incorporated all the software
development of those 10 years. Are the structures really different
(beyond the deliberately introduced changes, mutations, etc.)? Different
because of the software used?
The comparison with next-generation sequencing data is useful here, but
only in the sense Tassos explained. Of course, not every position in a
genomic sequence is fixed. Therefore it is sometimes useful to look at
the original data (the traces, as Gerard B pointed out). But we already
know that every single organism is different (especially eukaryotes),
and therefore it is entirely sufficient to store the computationally
reduced and merged data. If one needs better, position-specific data,
sequencing and comparing individual strains becomes necessary, as in the
ENCODE project, the sequencing of about 100 Saccharomyces strains, the
sequencing of 1000 Arabidopsis strains, etc.
Discussions about single positions are useless if they are not
statistically relevant. They need to be analysed in the context of
populations, large cohorts of patients, etc. If we want personalized
medicine adapted to personal genomes, we would also need personal sets
of protein structures, which we cannot yet provide. Therefore, storing
the DNA in the freezer is better and cheaper than storing all the raw
sequencing data. Do you think a reviewer re-sequences, re-assembles, or
re-annotates a genome, even if access to the raw reads were available?
If we trust those data, why don't we trust our structure factors? Do you
trust electron microscopy images, or movies of GFP-tagged proteins? Do
you think that what is presented for a single cell or a few visible
cells is also found in all cells?
And now, how many of you (if not everybody) use structures from yeast,
Drosophila, mouse, etc. as MODELS for human proteins? If we accept this
way of thinking, who would care about potential minor changes in the
structures upon re-analysis (and, in the light of this discussion,
arguing about specific genomic sequence positions becomes unimportant as
well)?
Would any of the archived data be useful without manual evaluation at
the time of archiving? This is especially relevant for structures that
have not yet been solved. Do the images belong to the structure factors?
If only images are available, where is the corresponding protein
sequence, has it been sequenced, what was in the buffer/crystallization
condition, what was used during protein purification, and what was the
intention behind the crystallization - e.g. a certain functional state
that the protein was forced into by artificial conditions, etc. etc.?
Who wants to evaluate all that, and how? The question is not whether we
could do it. We could, but wouldn't it advance science far more if we
spent the time and money on new projects rather than on evaluation,
administration, etc.?
Be honest: how many of you have really, and completely, re-analysed your
own data, deposited 10 years ago, with the latest software? What changes
did you find? Did you have to re-write the discussions in your earlier
publications? Do you think the changes justify the effort and cost of
archiving all data worldwide?
Of course, for all of these points there are always individual cases
(some have been mentioned in earlier emails) where these things matter
or mattered. But does that really justify all the future effort and cost
of archiving an exponentially (!) increasing amount of data? Do we need
all this effort just for better statistics tables? Do you believe the
average lab biologist will look at the images at all? Is the effort just
for us crystallographers? As long as only a few dozen users would
re-analyse the data, it is not worth it.
I like question marks, and maybe someone can still give me a convincing
argument for archiving images. At the moment I would vote against
archiving.
With best regards,
Martin
P.S. For next-gen sequencing data, a new way of transferring the data
has been found, called VAN (the newbies might google it), by analogy
with the old-fashioned and slow LAN and WLAN. Maybe we will also adopt
this when archiving our data?
--
Priv. Doz. Dr. Martin Kollmar
Max-Planck-Institute for Biophysical Chemistry
Group Systems Biology of Motor Proteins
Department NMR-based Structural Biology
Am Fassberg 11
37077 Goettingen
Deutschland
Tel.: +49 551 2012260 / 2235
Fax.: +49 551 2012202
www.motorprotein.de (Homepage)
www.cymobase.org (Database of Cytoskeletal and Motor Proteins)
www.diark.org (diArk - a resource for eukaryotic genome research)
www.webscipio.org (Scipio - eukaryotic gene identification)