Julian Foad wrote on Thu, 24 Jun 2010 at 17:21 -0000:
> I am not sure whether the "representation" whose SHA-1 sum is stored is
> ever an exact copy of the user's file.  If it is - if it does not
> include an extra header and is not stored in a delta format - then the

That is not the case:

[[[
A representation begins with a line containing either "PLAIN\n" or
"DELTA\n" or "DELTA <rev> <offset> <length>\n", where <rev>, <offset>,
and <length> give the location of the delta base of the representation
and the amount of data it contains (not counting the header or
trailer).  If no base location is given for a delta, the base is the
empty stream.  After the initial line comes raw svndiff data, followed
by a cosmetic trailer "ENDREP\n".
]]]

So there is a header and a trailer, and the data is possibly deltified
or self-deltified.

> chance of collision would depend directly on the content of the user's
> files.  If that is the case, it *might* be advisable to disable the
> rep-cache feature if you are storing files that have a higher than usual
> chance of SHA-1 collisions - data files for SHA-1 research, for example.
>
> We should find out the answer to that question before going further.
>
>
> > Indeed, the number of hash collisions is only finite for a given file
> > size, but is still increasing dramatically with the file size.
> > So additional checking of the file size helps but is not a completely
> > satisfying solution.
> >
> > The number of undetected hash collisions could be reduced easily by also
> > checking the md5-checksum, the size and the expanded-size.
>

Check svn_fs_fs__set_rep_reference in rep-cache.c; we already assert
that the size and expanded size match.  It's indeed possible to also use
md5 there.  Another option is to use practically any statistic about the
fulltext: the first N bytes, the number of '#' characters, ...

> True.  This approach could be beneficial if there are cases where the
> perfect solution (below) is not feasible.
>
> > To make this feature totally reliable, a complete comparison of the files
> > content with the content of the old representation found, is necessary
>
> Yes, it would be good if Subversion could do this extra check.
> Would you be interested in helping to improve Subversion by writing
> code to do this?  If so, you will be very welcome and we will try to
> help you.
>

+1 from me too.

> (I recall reading about an option in Git (?) to switch on full-text
> comparisons to check for SHA-1 collisions.  I can't find a reference to
> it now.)
>
>
> Regards,
> - Julian
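
To make the layered check discussed above concrete, here is a minimal
sketch in Python.  It is not Subversion code and does not model FSFS or
rep-cache.c: the RepCache class, its store() method and the key_func
parameter are all hypothetical names invented for illustration.  The idea
is only that a SHA-1 hit is treated as a candidate, cheap statistics
(length, MD5) rule out obvious mismatches, and a byte-for-byte comparison
of the fulltext makes the final decision.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Default lookup key: SHA-1 of the fulltext."""
    return hashlib.sha1(data).hexdigest()


class RepCache:
    """Toy content-addressed store (hypothetical; not the FSFS rep-cache)."""

    def __init__(self, key_func=sha1_hex):
        # key_func is an assumption of this sketch: it lets a test swap in
        # a deliberately weak hash to simulate a collision.
        self._key_func = key_func
        # lookup key -> list of (md5 hex digest, fulltext) candidates
        self._buckets = {}

    def store(self, fulltext: bytes):
        """Store fulltext and return (reference, reused).

        reused is True only when an existing entry is byte-for-byte
        identical to fulltext; a mere hash match is never enough.
        """
        key = self._key_func(fulltext)
        md5 = hashlib.md5(fulltext).hexdigest()
        bucket = self._buckets.setdefault(key, [])
        for index, (old_md5, old_text) in enumerate(bucket):
            # Cheap checks first: a different length or MD5 rules out a match.
            if len(old_text) != len(fulltext) or old_md5 != md5:
                continue
            # Only the full comparison is conclusive.
            if old_text == fulltext:
                return (key, index), True
        # No identical entry: keep the new fulltext alongside any colliding one.
        bucket.append((md5, fulltext))
        return (key, len(bucket) - 1), False
```

With the default key function, storing the same bytes twice returns the
same reference and reused=True the second time.  With a key function that
maps everything to one value, two different fulltexts land in the same
bucket but are still kept apart, which is the behaviour the full
comparison is meant to guarantee.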