On Fri, 25 Jun 2010, Mark Phippard wrote:

On Fri, Jun 25, 2010 at 8:45 AM,  <michael.fe...@evonik.com> wrote:
4. you underestimate the error introduced by misusing mathematical methods.

  As I already said in my first e-mail, SHA-1 was designed
  to detect random and willful data manipulation.
  It's a cryptographic hash, so there is a low chance of
  guessing or calculating a derived data sequence
  which generates the same hash value as the original data.
  But this is the only thing it ensures.
  There is no evidence that the hash values are
  equally distributed over the data sets, which is important for
  the use of hashing methods for data lookup.
  In fact, as it's a cryptographic hash,
  you should not be able to calculate such a distribution,
  because that would mean you are able
  to calculate sets of data resulting in the same hash value.
  So you can't conclude from the low chance of
  guessing or calculating a derived data sequence that
  there is a low chance of hash collisions in general.

I am in favor of making our software more reliable, I just do not want
to see us handicap ourselves by programming against a problem that is
unlikely to ever happen.  If this is so risky, then why are so many
people using git?  Isn't it built entirely on this concept of using
SHA-1 hashes to identify content?  I notice that if you Google for
this you can find plenty of flame wars over the topic with Git, but I
also notice blog posts like this one:

http://theblogthatnoonereads.davegrijalva.com/2009/09/25/sha-1-collision-probability/
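
The back-of-the-envelope number such posts rely on is the birthday bound. Just to have the figure in front of us, here is a quick sketch of my own (purely illustrative, and assuming SHA-1 outputs really behave like uniformly distributed 160-bit values, which is exactly the assumption questioned above):

import math

def collision_probability(n_objects, hash_bits=160):
    """Approximate P(at least one collision) among n_objects hashes,
    assuming they are uniformly distributed (the birthday bound)."""
    space = 2.0 ** hash_bits
    # 1 - exp(-n(n-1) / (2 * 2^bits)); expm1 keeps precision for
    # probabilities this close to zero.
    return -math.expm1(-n_objects * (n_objects - 1) / (2.0 * space))

for n in (10**6, 10**9, 10**12):
    print("{:>16,} objects: p ~= {:.3e}".format(n, collision_probability(n)))

With that assumption the numbers are indeed tiny, even for a trillion objects.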

It's not the probability which concerns me, it's what happens when a file collides. If I understood the current algorithm right, the new file will be silently replaced by an unrelated one, and there will be no error and no warning at all. If it's some kind of machine-verifiable file like source code, the next build in a different working copy will notice. But if it's something else, like documents or images, it can go unnoticed for a very long time. The work may be lost by then.
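
To spell out the failure mode I mean, here is a deliberately simplified toy store keyed only by the hash (my own illustration, not the actual rep-cache code); the second of two colliding files simply vanishes, without any error:

import hashlib

class HashKeyedStore:
    """Toy content store keyed purely by SHA-1 (illustration only,
    not the actual Subversion rep-cache)."""

    def __init__(self):
        self._reps = {}  # hex digest -> stored bytes

    def add(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        if key in self._reps:
            # Hash already known: the new content is assumed to be
            # identical and is silently discarded.  If the two files
            # merely collide, the second one is lost right here, with
            # no error and no warning.
            return key
        self._reps[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._reps[key]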

That would be a reason to use CRC32 instead of SHA-1, since then users would get used to losing files and to checking for themselves that the contents of their repos are what they expect ;o>

We are already performance-challenged.  Doing extra hash calculations
for a problem that is not going to happen does not seem like a sound
decision.

No extra hash calculations are needed. What's needed are extra file comparisons with the already existing files that have the same hash. I guess that's more expensive than calculating a hash, since you have to read the existing file from disk, which may mean applying lots of deltas etc.
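
Roughly like this, again only as an illustrative sketch and not the real rep-cache code: whenever the hash is already known, compare the actual bytes before sharing the stored representation, and keep the new content if they differ:

import hashlib

def dedup_add(reps: dict, data: bytes) -> str:
    """Add data to the hash-keyed dict `reps`, comparing the real bytes
    whenever the hash is already present (sketch only)."""
    key = hashlib.sha1(data).hexdigest()
    existing = reps.get(key)
    if existing is None:
        reps[key] = data               # first time we see this hash
        return key
    if existing == data:
        return key                     # genuine duplicate, safe to share
    # Verified collision: keep the new content under a distinct key
    # instead of silently dropping it.  The ":1" suffix is purely
    # illustrative; a real fix would need a proper keying scheme.
    alt_key = key + ":1"
    reps[alt_key] = data
    return alt_key

The expensive part is producing the existing file's bytes for that comparison, which in Subversion would mean reconstructing the full text from its deltas.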

ZFS does a similar thing which they call deduplication:
http://blogs.sun.com/bonwick/entry/zfs_dedup

The 'verify' feature is optional. With a faster but weaker hash, performance could be regained:
http://valhenson.livejournal.com/48227.html
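
The point of that post, as I read it, is that once every hash match is verified byte-by-byte, the hash no longer has to be cryptographic at all; it only has to spread keys well. A tiny sketch of the idea (CRC32 purely as a stand-in for some faster checksum; illustrative only):

import hashlib
import zlib

def content_key(data: bytes, fast: bool = False) -> str:
    """Return a lookup key for `data`: SHA-1 by default, or a cheap
    CRC32 (stand-in for any fast checksum) when `fast` is set."""
    if fast:
        return format(zlib.crc32(data), "08x")
    return hashlib.sha1(data).hexdigest()

With verification in place, a weaker key function just means the occasional comparison against a non-matching file, not data loss.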

An optional 'verify' feature would be a nice way to silence paranoid people like me and keep the performance the same for those who blindly trust hash functions.


Martin
