Hello, sorry, but our e-mail system doesn't support the usual way of quoting the message being replied to.

First, we use svn in a chemistry laboratory to store, archive and version the data and methods of our measurements. We must ensure that the data in the repository is, without any doubt, the data we once measured or wrote. So only a totally reliable solution for the rep-sharing cache would be acceptable to us.

Yes, I am interested in helping you to improve Subversion by writing the needed code. But I am not sure that I will be able to compile Subversion completely here at work; I could try. Perhaps someone is willing to help me test my code?

Thanks for the hint about svn_fs_fs__set_rep_reference, because I didn't expect the additional checks to be there. I looked there, but couldn't tell at first glance when this check is performed. I will dig deeper later.

I think it's better to add a check on the md5 than on any part of the fulltext, because the md5 is calculated over the whole data, too. But that isn't important to me: it only reduces the risk, it does not eliminate it.

Greetings

P.S. I am also sorry for the signature, which we are advised to use.

Michael Felke
Phone +49 2151 38-1453
Fax +49 2151 38-1094
michael.fe...@evonik.com
Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld
http://www.evonik.com

Management: Gunther Wittmer (Spokesman), Willibrord Lampen
Registered office: Krefeld
Register court: Amtsgericht Krefeld; commercial register HRB 5791


Daniel Shahaf <d...@daniel.shahaf.name>
24.06.2010 16:38
To: Julian Foad <julian.f...@wandisco.com>
Cc: michael.fe...@evonik.com, dev@subversion.apache.org
Subject: Re: dangerous implementation of rep-sharing cache for fsfs

Julian Foad wrote on Thu, 24 Jun 2010 at 17:21 -0000:
> I am not sure whether the "representation" whose SHA-1 sum is stored is
> ever an exact copy of the user's file. If it is - if it does not
> include an extra header and is not stored in a delta format - then the

That is not the case:

[[[
A representation begins with a line containing either "PLAIN\n" or
"DELTA\n" or "DELTA <rev> <offset> <length>\n", where <rev>, <offset>,
and <length> give the location of the delta base of the representation
and the amount of data it contains (not counting the header or
trailer). If no base location is given for a delta, the base is the
empty stream. After the initial line comes raw svndiff data, followed
by a cosmetic trailer "ENDREP\n".
]]]

So, there are header, trailer, and it's possibly deltified or self-deltified.

> chance of collision would depend directly on the content of the user's
> files. If that is the case, it *might* be advisable to disable the
> rep-cache feature if you are storing files that have a higher than usual
> chance of SHA-1 collisions - data files for SHA-1 research, for example.
>
> We should find out the answer to that question before going further.
>
> > Indeed, the number of hash collisions is only finite for a given file
> > size, but is still increasing dramatically with the file size.
> > So additional checking of the file size helps but is not a completely
> > satisfying solution.
> >
> > The number of undetected hash collisions could be reduced easily by also
> > checking the md5-checksum, the size and the expanded-size.
>

Check svn_fs_fs__set_rep_reference in rep-cache.c; we already assert that
the size and expanded size match. It's indeed possible to also use md5 there.

Another option is to use practically any statistic about the fulltext:
the first N bytes, the number of '#' characters, ...

> True. This approach could be beneficial if there are cases where the
> perfect solution (below) is not feasible.
>
> > To make this feature totally reliable, a complete comparison of the files
> > content with the content of the old representation found, is necessary
>
> Yes, it would be good if Subversion could do this extra check. Would
> you be interested in helping to improve Subversion by writing code to do
> this? If so, you will be very welcome and we will try to help you.
>

+1 from me too.

> (I recall reading about an option in Git (?) to switch on full-text
> comparisons to check for SHA-1 collisions. I can't find a reference to
> it now.)
>
> Regards,
> - Julian
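
For illustration only, the layered check discussed in this thread (SHA-1 lookup, then size and md5, then a full content comparison) could look roughly like the Python sketch below. It is not Subversion code and does not use Subversion's API: `RepCache`, `CollisionError` and the `sha1_hex` parameter are hypothetical names, and the toy cache keeps fulltexts in memory, whereas FSFS stores representations on disk with a header and possibly in delta form.

```python
import hashlib


class CollisionError(Exception):
    """Raised when two different fulltexts map to the same SHA-1 key."""


class RepCache:
    """Toy in-memory stand-in for a rep-sharing cache (hypothetical)."""

    def __init__(self):
        self._by_sha1 = {}  # SHA-1 hex digest -> stored fulltext bytes

    def store(self, fulltext, sha1_hex=None):
        """Return True if an existing entry was shared, False if a new one was stored.

        sha1_hex may be passed explicitly, only so that a collision can be
        simulated; normally it is computed from the fulltext.
        """
        key = sha1_hex or hashlib.sha1(fulltext).hexdigest()
        old = self._by_sha1.get(key)
        if old is None:
            self._by_sha1[key] = fulltext
            return False

        # Cheap checks first: they reduce the risk but cannot eliminate it.
        if len(old) != len(fulltext):
            raise CollisionError("size mismatch for SHA-1 " + key)
        if hashlib.md5(old).digest() != hashlib.md5(fulltext).digest():
            raise CollisionError("md5 mismatch for SHA-1 " + key)

        # Only a complete comparison makes sharing fully reliable.
        if old != fulltext:
            raise CollisionError("content mismatch for SHA-1 " + key)
        return True
```

In this sketch the size and md5 checks are only shortcuts that reject most mismatches early; the final byte-for-byte comparison is the step that would actually rule out sharing a wrong representation, at the cost of reading the old content again.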