Still, after hundreds (?) of emails on this topic, I haven't seen any convincing argument in favor of archiving data. The only convincing arguments are against, and are from Gerard K and Tassos.

Why?
The question is not what to archive, but still why we should archive all the data.

Because software developers need more data? Should we take on all this effort and cost because 10 developers worldwide need the data for ALL protein structures? Do they really need so much data? Wouldn't it be enough to build a repository of maybe 1000 datasets for development?

Does anyone really believe that our view of the actual problem, the function of the proteins, changes with the analysis of whatever scattering is still in the images but not used by today's software? Crystal structures are static snapshots, obtained under artificial conditions. In solution (still the physiological state) they might look different; not much, but at least far more dynamic. Does it therefore matter whether we know some sidechain positions better (in the crystal structure) when re-analysing the data? In turn, are our current software programs so bad that we would expect strong differences when re-analysing the data? No. And if the structures change upon reanalysis (more or less), who re-interprets the structures and re-writes the papers?

There are many, many cases where researchers re-did structures (or solved structures closely related to already available ones, such as mutants, structures from closely related species, etc.), some of them 10 years later. I guess they used the latest software in each case, and thus incorporated all the software development of those 10 years. And are the structures really different (beyond the introduced changes, mutations, etc.)? Different because of the software used?

The comparison with next-generation sequencing data is useful here, but only in the sense Tassos explained. Well, of course not every position in the genomic sequence is fixed. Therefore it is sometimes useful to look at the original data (the traces, as Gerard B pointed out). But we already know that every single organism is different (especially eukaryotes), and therefore it is absolutely enough to store the computationally reduced and merged data. If one needs better, position-specific data, sequencing and comparing single species becomes necessary, as in the ENCODE project, the sequencing of about 100 Saccharomyces strains, the sequencing of 1000 Arabidopsis strains, etc. Discussions about single positions are useless if they are not statistically relevant. They need to be analysed in the context of populations, large cohorts of patients, etc. If we need personalized medicine adapted to personal genomes, we would also need personal sets of protein structures, which we cannot provide yet. Therefore, storing the DNA in the freezer is better and cheaper than storing all the raw sequencing data. Do you think a reviewer re-sequences, or re-assembles, or re-annotates a genome, even if access to the raw reads were available? If you trust these data, why don't we trust our structure factors? Do you trust electron microscopy images, or movies of GFP-tagged proteins? Do you think what is presented for a single or a few visible cells is also found in all cells?

And now, how many of you (if not everybody) use structures from yeast, Drosophila, mouse, etc. as MODELS for human proteins? If we stick to this thinking, who would care about potential minor changes in the structures upon re-analysis (and in the light of this discussion, arguing about specific genomic sequence positions becomes unimportant as well)?

Is any of the archived data useful without manual evaluation upon archiving? This is especially relevant for structures not solved yet. Do the images belong to the structure factors? If only images are available, where is the corresponding protein sequence? Has it been sequenced? What was in the buffer/crystallization condition? What was used during protein purification? What was the intention of the crystallization, e.g. a certain functional state that the protein was forced into by artificial conditions? And so on. Who wants to evaluate that, and how? The question is not whether we could do it. We could do it, but wouldn't it advance science far more if we spent the time and money on new projects rather than on evaluation, administration, etc.?

Be honest: how many of you have really, and completely, reanalysed your own data that you deposited 10 years ago, with the latest software? What changes did you find? Did you have to re-write the discussions in your earlier publications? Do you think that the changes justify the effort and cost of worldwide archiving of all data?

Well, there are always single cases where these things matter or mattered (and they have been mentioned in earlier emails). But does this really justify all the future effort and cost of archiving the exponentially (!) increasing amount of data? Do we need all this effort for better statistics tables? Do you believe the standard lab biologist will look at the images at all? Is the effort just for us crystallographers? As long as only a few dozen users would re-analyse the data, it is not worth it.

I like question marks, and maybe someone can give me an argument for archiving images. At the moment I would vote for not archiving.

With best regards,

Martin


P.S. For the next-gen sequencing data, they have found a new way of transferring the data, called VAN (the newbies might google it), in analogy to the old-fashioned and slow LAN and WLAN. Maybe we will also adopt this when archiving our data?

--
Priv. Doz. Dr. Martin Kollmar

Max-Planck-Institute for Biophysical Chemistry
Group Systems Biology of Motor Proteins
Department NMR-based Structural Biology
Am Fassberg 11
37077 Goettingen
Deutschland

Tel.: +49 551 2012260 / 2235
Fax.: +49 551 2012202

www.motorprotein.de (Homepage)
www.cymobase.org (Database of Cytoskeletal and Motor Proteins)
www.diark.org (diArk - a resource for eukaryotic genome research)
www.webscipio.org (Scipio - eukaryotic gene identification)
