The whole idea of the pristine cache is handling duplicate files… I don’t see what you are trying to solve by adding a salt.
The pristine store is not a password store, where expensive hashing is a good feature rather than a user-facing slowdown, and we never designed Subversion as a storage area for colliding files. That there is a collision now doesn't change that we always assumed there would be collisions, and we designed the current behavior with that in mind.

Bert

From: Stefan Sperling
Sent: Friday, 24 February 2017 21:47
To: Mark Phippard
Cc: Øyvind A. Holm; Subversion Development
Subject: Re: Files with identical SHA1 breaks the repo

On Fri, Feb 24, 2017 at 01:03:09PM -0500, Mark Phippard wrote:
> Note that while this does fix the error, because of the sha1 storage
> sharing in the working copy you actually do not get the correct files.
> Both PDFs wind up being the same file; I imagine whichever one you
> receive first is the one you get.
>
> So not only does rep sharing need to be fixed, the WC pristine storage
> is also broken by this.

Yes, indeed.

I believe we should prepare a new working copy format for 1.10.0 which addresses this problem. I don't see a good way of fixing it without a format bump. The bright side is that it gives us a good reason to get 1.10.0 ready ASAP.

We can switch to a better hash algorithm with a WC format bump. If we are willing to give up de-duplication in the pristine store, we could make the pristine store future-proof by adding a "salt" to each row in the pristine table: say, 64 bytes of data prepended to the file content, which are random but stay fixed throughout the lifetime of a pristine. That way, 64 bytes of data not controlled by repository content affect the hash algorithm's result before any data from repository content gets mixed in, and hash collisions in repository content become much less of a problem for the working copy. However, the pristine store would stop de-duplicating content, so perhaps this is not the best approach. (See the sketch below.)

The rep-cache uses hashes only for de-duplication, so it very much relies on hash collisions being negligible. We should upgrade the hashing algorithm in a way that 'svnadmin upgrade' can take care of (for new revisions). Perhaps we should disable the feature by default in a 1.9.x patch release and advise users to turn it off until they can upgrade to 1.10.

We might have to give up on ra_serf's approach of avoiding retransmission of content which is already stored in the pristine store. This is now just as broken as the rep-cache is. We might be able to salvage it for future clients, but we should probably send multiple hashes and make it as easy as possible to add newer hash algorithms in future versions without disturbing older clients. Perhaps, as a first step, we should just disable this feature?
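
For illustration, a minimal sketch in Python of the salted pristine-hash scheme described above: each pristine row gets its own 64 random bytes, fixed for the row's lifetime, and the digest is computed over salt plus content. The helper name make_pristine_key and the placeholder byte strings are hypothetical, not Subversion code; the point is only to show why a collision crafted in repository content stops mattering and why de-duplication is lost.

import hashlib
import os

SALT_LEN = 64  # 64 random bytes per pristine row, fixed for that row's lifetime

def make_pristine_key(content):
    """Return (salt, digest), where digest = SHA-1(salt || content).

    The salt would be generated once, when the pristine row is created,
    and never changed afterwards, so the row's key stays stable.
    """
    salt = os.urandom(SALT_LEN)
    return salt, hashlib.sha1(salt + content).digest()

# Stand-ins for the two colliding PDFs (the real files share one plain
# SHA-1 digest): with a per-row salt each pristine gets its own key, so a
# collision in repository content no longer merges the files in the WC.
salt_1, key_1 = make_pristine_key(b"contents of shattered-1.pdf")
salt_2, key_2 = make_pristine_key(b"contents of shattered-2.pdf")
assert key_1 != key_2  # distinct with overwhelming probability

# The cost Stefan points out: identical content also hashes to different
# keys, so the pristine store can no longer de-duplicate it.
salt_3, key_3 = make_pristine_key(b"identical contents")
salt_4, key_4 = make_pristine_key(b"identical contents")
assert key_3 != key_4

In a real implementation the salt would have to be stored alongside the checksum in each pristine row, since the same salt is needed to re-verify the file later.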