On Fri, 25 Jun 2010, Mark Phippard wrote:

On Fri, Jun 25, 2010 at 8:45 AM,  <michael.fe...@evonik.com> wrote:
4. you underestimate the error introduced by misusing mathematical methods.

  As I already said in my first e-mail, SHA-1 was designed
  to detect random and willful data manipulation.
  It's a cryptographic hash, so there is a low chance of
  guessing or calculating a derived data sequence
  which generates the same hash value as the original data.
  But this is the only thing it ensures.
  There is no evidence that the hash values are
  equally distributed over the data sets, which is important for
  the use of hashing methods for data lookup.
  In fact, as it's a cryptographic hash,
  you should not be able to calculate such a distribution,
  because that would mean you are able
  to calculate sets of data resulting in the same hash value.
  So you can't conclude from the low chance of
  guessing or calculating a derived data sequence that
  there is a low chance of hash collisions in general.

I am in favor of making our software more reliable, I just do not want
to see us handicap ourselves by programming against a problem that is
unlikely to ever happen.  If this is so risky, then why are so many
people using git?  Isn't it built entirely on this concept of using
SHA-1 hashes to identify content?  I notice that if you Google for
this you can find plenty of flame wars over the topic with Git, but I
also notice blog posts like this one:

http://theblogthatnoonereads.davegrijalva.com/2009/09/25/sha-1-collision-probability/
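
The back-of-the-envelope number such posts rely on is the birthday bound. Just to have the figure in front of us, here is a quick sketch of my own (purely illustrative, and assuming SHA-1 outputs really behave like uniformly distributed 160-bit values, which is exactly the assumption questioned above):

import math

def collision_probability(n_objects, hash_bits=160):
    """Approximate P(at least one collision) among n_objects hashes,
    assuming they are uniformly distributed (the birthday bound)."""
    space = 2.0 ** hash_bits
    # 1 - exp(-n(n-1) / (2 * 2^bits)); expm1 keeps precision for
    # probabilities this close to zero.
    return -math.expm1(-n_objects * (n_objects - 1) / (2.0 * space))

for n in (10**6, 10**9, 10**12):
    print("{:>16,} objects: p ~= {:.3e}".format(n, collision_probability(n)))

With that assumption the numbers are indeed tiny, even for a trillion objects.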

It's not the probability which concerns me, it's what happens when a file collides. If I understood the current algorithm right, the new file will be silently replaced by an unrelated one, and there will be no error and no warning at all. If it's some kind of machine-verifiable file like source code, the next build in a different working copy will notice. But if it's something else, like documents or images, it can go unnoticed for a very long time. The work may be lost by then.
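
To spell out the failure mode I mean, here is a deliberately simplified toy store keyed only by the hash (my own illustration, not the actual rep-cache code); the second of two colliding files simply vanishes, without any error:

import hashlib

class HashKeyedStore:
    """Toy content store keyed purely by SHA-1 (illustration only,
    not the actual Subversion rep-cache)."""

    def __init__(self):
        self._reps = {}  # hex digest -> stored bytes

    def add(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        if key in self._reps:
            # Hash already known: the new content is assumed to be
            # identical and is silently discarded.  If the two files
            # merely collide, the second one is lost right here, with
            # no error and no warning.
            return key
        self._reps[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._reps[key]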

That would be a reason to use CRC32 instead of SHA-1, since then users would get used to losing files and to checking for themselves that the contents of their repos are what they expect ;o>

We are already performance-challenged.  Doing extra hash calculations
for a problem that is not going to happen does not seem like a sound
decision.

No extra hash calculations are needed. What's needed are extra file comparisons with the already existing files that have the same hash. I guess that's more expensive than calculating a hash, since you have to read the existing file from disk, which may mean applying lots of deltas etc.
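
Roughly like this, again only as an illustrative sketch and not the real rep-cache code: whenever the hash is already known, compare the actual bytes before sharing the stored representation, and keep the new content if they differ:

import hashlib

def dedup_add(reps: dict, data: bytes) -> str:
    """Add data to the hash-keyed dict `reps`, comparing the real bytes
    whenever the hash is already present (sketch only)."""
    key = hashlib.sha1(data).hexdigest()
    existing = reps.get(key)
    if existing is None:
        reps[key] = data               # first time we see this hash
        return key
    if existing == data:
        return key                     # genuine duplicate, safe to share
    # Verified collision: keep the new content under a distinct key
    # instead of silently dropping it.  The ":1" suffix is purely
    # illustrative; a real fix would need a proper keying scheme.
    alt_key = key + ":1"
    reps[alt_key] = data
    return alt_key

The expensive part is producing the existing file's bytes for that comparison, which in Subversion would mean reconstructing the full text from its deltas.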

ZFS does a similar thing which they call deduplication:
http://blogs.sun.com/bonwick/entry/zfs_dedup

The 'verify' feature is optional. With a faster but weaker hash, performance could be regained:
http://valhenson.livejournal.com/48227.html
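
The point of that post, as I read it, is that once every hash match is verified byte-by-byte, the hash no longer has to be cryptographic at all; it only has to spread keys well. A tiny sketch of the idea (CRC32 purely as a stand-in for some faster checksum; illustrative only):

import hashlib
import zlib

def content_key(data: bytes, fast: bool = False) -> str:
    """Return a lookup key for `data`: SHA-1 by default, or a cheap
    CRC32 (stand-in for any fast checksum) when `fast` is set."""
    if fast:
        return format(zlib.crc32(data), "08x")
    return hashlib.sha1(data).hexdigest()

With verification in place, a weaker key function just means the occasional comparison against a non-matching file, not data loss.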

An optional 'verify' feature would be a nice way to silence paranoid people like me and keep the performance the same for those who blindly trust hash functions.


Martin
