Re: [ccp4bb] To archive or not to archive, that's the question!

Martin Kollmar Mon, 31 Oct 2011 07:46:06 -0700

Still, after hundreds (?) of emails on this topic, I haven't seen any convincing argument in favor of archiving data. The only convincing arguments are against, and are from Gerard K and Tassos.


Why?

The question is not what to archive, but still why we should archive all the data.

Because software developers need more data? Should we take on all this effort and cost because 10 developers worldwide need the data for ALL protein structures? Do they really need so much data? Wouldn't it be enough to build a repository of maybe 1000 datasets for development?

Does anyone really believe that our view of the actual problem, the function of the proteins, changes with the analysis of whatever scattering is still in the images but not used by today's software? Crystal structures are static snapshots, obtained under artificial conditions. In solution (still the physiological state) they might look different, not much, but at least far more dynamic. Does it therefore matter whether we know some side-chain positions better (in the crystal structure) when re-analysing the data? In turn, are our current software programs so bad that we would expect strong differences when re-analysing the data? No. And if the structures change upon reanalysis (more or less), who re-interprets the structures and re-writes the papers?

There are many, many cases where researchers re-did structures (or did structures closely related to already available ones, such as mutants, structures from closely related species, etc.), also after 10 years. I guess they used the latest software in the different cases, thus they incorporated all the software development of those 10 years. And are the structures really different (beyond the introduced changes, mutations, etc.)? Different because of the software used?

The comparison with next-generation sequencing data is useful here, but only in the sense Tassos explained. Well, of course not every position in the genomic sequence is fixed. Therefore it is sometimes useful to look at the original data (the traces, as Gerard B pointed out). But we already know that every single organism is different (especially eukaryotes) and therefore it is absolutely enough to store the computationally reduced and merged data. If one needs better, position-specific data, sequencing and comparing single species becomes necessary, like in the ENCODE project, the sequencing of about 100 Saccharomyces strains, the sequencing of 1000 Arabidopsis strains, etc. Discussions about single positions are useless if they are not statistically relevant. They need to be analysed in the context of populations, large cohorts of patients, etc. If we need personalized medicine adapted to personal genomes, we would also need personal sets of protein structures, which we cannot provide yet. Therefore, storing the DNA in the freezer is better and cheaper than storing all the sequencing raw data. Do you think a reviewer re-sequences, or re-assembles, or re-annotates a genome, even if access to the raw reads were available? If you trust these data, why don't we trust our structure factors? Do you trust electron microscopy images, movies of GFP-tagged proteins? Do you think what is presented for a single or a few visible cells is also found in all cells?

And now, how many of you (if not everybody) use structures from yeast, Drosophila, mouse etc. as a MODEL for human proteins? If we stick to this thinking, who would care about potential minor changes in the structures upon re-analysis (and in the light of this discussion, arguing about specific genomic sequence positions becomes unimportant as well)?

Is any of the archived data useful without manual evaluation upon archiving? This is especially relevant for structures not solved yet. Do the images belong to the structure factors? If only images are available, where is the corresponding protein sequence, has it been sequenced, what was in the buffer/crystallization condition, what was used during protein purification, what was the intention for crystallization (e.g. a certain functional state that the protein was forced into by artificial conditions), etc. etc.? Who wants to evaluate that, and how? The question is not whether we could do it. We could do it, but wouldn't it advance science far more if we spent the time and money on new projects rather than on evaluation, administration, etc.?

Be honest: How many of you have really, and completely, reanalysed your own data that you deposited 10 years ago, with the latest software? What changes did you find? Did you have to re-write your former discussions in the publications? Do you think that the changes justify the efforts and costs of worldwide archiving of all data?

Well, there are always single cases (and they have been mentioned in earlier emails) where these things matter or mattered. But does this really justify all the future efforts and costs to archive the exponentially (!) increasing amount of data? Do we need all this effort for better statistics tables? Do you believe the standard lab biologist will look into all the images at all? Is the effort just for us crystallographers? As long as just a few dozen users would re-analyse the data, it is not worth it.

I like question marks, and maybe someone can give me an argument for archiving images. At the moment I would vote for not archiving.


With best regards,

Martin

P.S. For the next-gen sequencing data, they have found a new way of transferring the data, called VAN (the newbies might google for it), in analogy to the old-fashioned and slow LAN and WLAN. Maybe we will also adopt this when archiving our data?


--
Priv. Doz. Dr. Martin Kollmar

Max-Planck-Institute for Biophysical Chemistry
Group Systems Biology of Motor Proteins
Department NMR-based Structural Biology
Am Fassberg 11
37077 Goettingen
Deutschland

Tel.: +49 551 2012260 / 2235
Fax.: +49 551 2012202

www.motorprotein.de (Homepage)
www.cymobase.org (Database of Cytoskeletal and Motor Proteins)
www.diark.org (diArk - a resource for eukaryotic genome research)
www.webscipio.org (Scipio - eukaryotic gene identification)
