Re: dangerous implementation of rep-sharing cache for fsfs

Daniel Shahaf Thu, 24 Jun 2010 07:38:12 -0700

Julian Foad wrote on Thu, 24 Jun 2010 at 17:21 -0000:
> I am not sure whether the "representation" whose SHA-1 sum is stored is
> ever an exact copy of the user's file.  If it is - if it does not
> include an extra header and is not stored in a delta format - then the


That is not the case:

    [[[
    A representation begins with a line containing either "PLAIN\n" or
    "DELTA\n" or "DELTA <rev> <offset> <length>\n", where <rev>, <offset>,
    and <length> give the location of the delta base of the representation
    and the amount of data it contains (not counting the header or
    trailer).  If no base location is given for a delta, the base is the
    empty stream.  After the initial line comes raw svndiff data, followed
    by a cosmetic trailer "ENDREP\n".
    ]]]

So there is a header and a trailer, and the data is possibly deltified or
self-deltified.
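
To make the three header forms quoted above concrete, here is a minimal
sketch of a parser for that first line.  It is illustration only, not the
code in libsvn_fs_fs; the names RepHeader and parse_rep_header are
hypothetical, and the only input it assumes is the header line exactly as
the quoted text describes it.

```python
import re
from typing import NamedTuple, Optional


class RepHeader(NamedTuple):
    """Parsed first line of a representation (hypothetical helper type)."""
    kind: str                    # "PLAIN" or "DELTA"
    base_rev: Optional[int]      # None when no delta base is given
    base_offset: Optional[int]
    base_length: Optional[int]


# "DELTA <rev> <offset> <length>\n" with decimal fields.
_DELTA_WITH_BASE = re.compile(rb"DELTA (\d+) (\d+) (\d+)\n")


def parse_rep_header(line: bytes) -> RepHeader:
    """Parse one header line, including its trailing newline."""
    if line == b"PLAIN\n":
        return RepHeader("PLAIN", None, None, None)
    if line == b"DELTA\n":
        # No base location given: the base is the empty stream.
        return RepHeader("DELTA", None, None, None)
    match = _DELTA_WITH_BASE.fullmatch(line)
    if match is None:
        raise ValueError("unrecognized representation header: %r" % (line,))
    rev, offset, length = (int(field) for field in match.groups())
    return RepHeader("DELTA", rev, offset, length)
```

For example, parse_rep_header(b"DELTA 3 120 456\n") yields a DELTA header
whose base is at revision 3, offset 120, length 456.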

> chance of collision would depend directly on the content of the user's
> files.  If that is the case, it *might* be advisable to disable the
> rep-cache feature if you are storing files that have a higher than usual
> chance of SHA-1 collisions - data files for SHA-1 research, for example.
> 
> We should find out the answer to that question before going further.
> 
> 
> > Indeed, the number of hash collisions is only finite for a given file 
> > size, but is still increasing dramatically with the file size.
> > So additional checking of the file size helps but is not a completely 
> > satisfying solution.
> > 
> > The number of undetected hash collisions could be reduced easily by also 
> > checking the md5-checksum, the size and the expanded-size.
> 

Check svn_fs_fs__set_rep_reference in rep-cache.c; we already assert
that the size and expanded size match.

It's indeed possible to also use md5 there.  Another option is to use
practically any statistic about the fulltext: the first N bytes, the
number of '#' characters, ...
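
A rough sketch of that idea, to show how cheap the extra checks are.  This
is not what rep-cache.c does; RepFingerprint, fingerprint and
safe_to_share are hypothetical names, and the choice of md5 plus a
64-byte prefix is only an assumption for the example.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RepFingerprint:
    """Statistics recorded about a fulltext (hypothetical structure)."""
    sha1: str
    md5: str
    expanded_size: int
    prefix: bytes


def fingerprint(fulltext: bytes, prefix_len: int = 64) -> RepFingerprint:
    """Compute the SHA-1 key plus the extra statistics for a fulltext."""
    return RepFingerprint(
        sha1=hashlib.sha1(fulltext).hexdigest(),
        md5=hashlib.md5(fulltext).hexdigest(),
        expanded_size=len(fulltext),
        prefix=fulltext[:prefix_len],
    )


def safe_to_share(cached: RepFingerprint, new: RepFingerprint) -> bool:
    """Return True only if the SHA-1 key and every extra statistic agree.

    A SHA-1 match with any differing statistic indicates a collision,
    so the cached representation must not be reused.
    """
    if cached.sha1 != new.sha1:
        return False  # not even a cache hit
    return (cached.md5 == new.md5
            and cached.expanded_size == new.expanded_size
            and cached.prefix == new.prefix)
```

This only lowers the odds of an undetected collision; it does not remove
them.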

> True.  This approach could be beneficial if there are cases where the
> perfect solution (below) is not feasible.
> 
> > To make this feature totally reliable, a complete comparison of the files 
> > content with the content of the old representation found, is necessary
> 
> Yes, it would be good if Subversion could do this extra check.  Would
> you be interested in helping to improve Subversion by writing code to do
> this?  If so, you will be very welcome and we will try to help you.
> 

+1 from me too.
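
For what it's worth, the comparison itself is simple once both fulltexts
are available as streams.  A minimal sketch, assuming two readable binary
streams whose read(n) returns fewer than n bytes only at end of stream
(true for io.BytesIO and ordinary buffered files); streams_equal is a
hypothetical name, not an existing Subversion function.

```python
import io
from typing import BinaryIO


def streams_equal(a: BinaryIO, b: BinaryIO, chunk_size: int = 64 * 1024) -> bool:
    """Compare two binary streams chunk by chunk without loading them whole.

    Assumes read(n) is only short at end of stream, so equal content
    always produces equal chunks.
    """
    while True:
        chunk_a = a.read(chunk_size)
        chunk_b = b.read(chunk_size)
        if chunk_a != chunk_b:
            return False  # content or length differs
        if not chunk_a:
            return True   # both streams ended at the same point
```

The real cost is not this loop but reconstructing the old fulltext from
its (possibly deltified) representation on every cache hit.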

> (I recall reading about an option in Git (?) to switch on full-text
> comparisons to check for SHA-1 collisions.  I can't find a reference to
> it now.)
> 
> 
> Regards,
> - Julian
> 
> 
>
