I really like this idea. And we could take a copy of APR's sha1 code, and rejigger it to perform *both* hashes during the same scan of the raw bytes. I would expect the total time to grow to (say) 1.1X rather than a full 2X. The inner loop might cost a bit more, but we'd only scan the bytes once. Very handy when you're talking about megabytes in a stream-y environment.
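Something along these lines, as a rough sketch only (the dual_sha1_* names are made up, and to really get that 1.1X figure you'd fold the second digest into the sha1 transform loop itself rather than keeping two separate contexts; this just shows the shape of the API and the every-other-byte selection Garance describes below):

#include "apr_sha1.h"

/* Hypothetical dual-digest context: one SHA-1 over the full stream plus
   a second SHA-1 over every other byte (Garance's scheme below). */
typedef struct dual_sha1_ctx_t {
  apr_sha1_ctx_t full;    /* digest of all bytes */
  apr_sha1_ctx_t sparse;  /* digest of every other byte */
  int parity;             /* parity of the next byte's stream offset */
} dual_sha1_ctx_t;

void dual_sha1_init(dual_sha1_ctx_t *ctx)
{
  apr_sha1_init(&ctx->full);
  apr_sha1_init(&ctx->sparse);
  ctx->parity = 0;
}

/* Feed one chunk to both digests.  The real win comes from doing the
   byte selection inside the sha1 transform so the data is walked just
   once; this version still walks the chunk twice, but reads it from
   the stream only once. */
void dual_sha1_update(dual_sha1_ctx_t *ctx,
                      const unsigned char *input,
                      unsigned int len)
{
  unsigned char half[8192];
  unsigned int i, n = 0;

  apr_sha1_update_binary(&ctx->full, input, len);

  for (i = 0; i < len; i++)
    {
      if (((ctx->parity + i) & 1) == 0)
        {
          half[n++] = input[i];
          if (n == sizeof(half))
            {
              apr_sha1_update_binary(&ctx->sparse, half, n);
              n = 0;
            }
        }
    }
  if (n > 0)
    apr_sha1_update_binary(&ctx->sparse, half, n);

  ctx->parity = (ctx->parity + len) & 1;
}

void dual_sha1_final(unsigned char full_digest[APR_SHA1_DIGESTSIZE],
                     unsigned char sparse_digest[APR_SHA1_DIGESTSIZE],
                     dual_sha1_ctx_t *ctx)
{
  apr_sha1_final(full_digest, &ctx->full);
  apr_sha1_final(sparse_digest, &ctx->sparse);
}

Every 3rd byte, or every other 32-bit word, is just a different selection test in that loop.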
(and medium-term, push this dual-sha1 computation back into APR)

On Sun, Feb 26, 2017 at 10:08 AM, Garance A Drosehn <dro...@rpi.edu> wrote:
> On 24 Feb 2017, at 15:46, Stefan Sperling wrote:
> >
> > I believe we should prepare a new working format for 1.10.0
> > which addresses this problem. I don't see a good way of fixing
> > it without a format bump. The bright side of this is that it
> > gives us a good reason to get 1.10.0 ready ASAP.
> >
> > We can switch to a better hash algorithm with a WC format
> > bump.
>
> One of the previous messages mentioned that better hash
> algorithms are more expensive. So let me mention a tactic
> that I used many years ago, when MD5 was the best digest
> algorithm that I knew of, and I didn't trust it for the
> larger files I was working with at the time:
>
> Instead of going with a completely different hash algorithm,
> just double-down on the one you're using. What I did was to
> calculate one digest the standard way, and then a second one
> which summed up every-other-byte (or every 3rd byte, or ...).
> So to get a collision, not only do two files have to get the
> same digest-result for all their data, but they have to also
> get the same digest-result when exactly half the data is
> skipped over.
>
> (I did this a long time ago, and forget the details. What
> I may have done for performance reasons was every-other-word,
> not every-other-byte)
>
> My thinking was that *any* single algorithm which processes
> all the data is going to get collisions, eventually. But it
> will be much harder for someone to generate a duplicate file
> where there will also be a collision when summing up only
> half of the data.
>
> I'm not claiming this is a great cure-all solution, but just
> that it's an alternate tactic which might be interesting.
> People could create repositories with just the one digest,
> or upgrade them to use multiple digests if they have the need.
>
> I found a few benchmarks which suggest that sha-256 is maybe
> twice as expensive as sha-1, so calculating two sha-1 digests
> might be a reasonable alternative.
>
> --
> Garance Alistair Drosehn             =     dro...@rpi.edu
> Senior Systems Programmer            or   g...@freebsd.org
> Rensselaer Polytechnic Institute;    Troy, NY;  USA
>