Hello,

OK, it seems there really is a need to discuss the problem of SHA-1 collisions more deeply.
Mark has already pointed out most of the necessary points:

1. A collision is unlikely to come from small differences. I agree with all of you: a SHA-1 collision is practically impossible between data derived from one another. It's designed that way.

2. Usable data is not randomly chosen from the sample set of all possible data. Every kind of data, e.g. C code, is always a very tiny sample set, and SHA-1 is not likely to collide within such a tiny sample set of one kind of data.

3. Data is always highly structured according to the kind it belongs to. That means that the whole set of data of one kind, even if it is an infinite space in itself, is always a narrow set, each element mostly resembling the others.

But one point is missing:

4. The set of one kind of data and the set of another kind overlap very infrequently, if at all. They could be seen as highly discriminable, separated parts of the sample set of all possible data. So SHA-1 hashes will spread widely over the first set, doing the best of its job, and also, but independently, spread over the other set as widely as expected. What is the result when two or more sets of hash values, each widely spread over the same value range, are used together in one fetch index? Perhaps some can see a danger now, too.

I am working on a practical demonstration, which everybody could reproduce with his or her spreadsheet program. But please be patient, I have other things to do as well.

Greetings,

P.S. Thanks for the warning; we are not going to use 1.7. At the moment we are not using 1.6 either, because of the SHA-1 rep-sharing cache.
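The spreadsheet demonstration mentioned above is not included in the mail, but the underlying question (how SHA-1 values from two distinct "kinds" of data spread over the output range) can be sketched in a few lines of Python. This is a hypothetical illustration, not the author's demonstration; the two data kinds chosen here (C-like statements and numeric strings) are assumptions for the example.

```python
import hashlib

def sha1_fraction(data: bytes) -> float:
    """Map a SHA-1 digest to a number in [0, 1) for easy comparison."""
    digest = hashlib.sha1(data).digest()  # 20 bytes = 160 bits
    return int.from_bytes(digest, "big") / 2**160

# Two structurally different "kinds" of data, as in points 2-4 above.
kind_a = [f"int var_{i} = {i};".encode() for i in range(10_000)]  # C-like
kind_b = [f"{i:08d}".encode() for i in range(10_000)]             # numeric

frac_a = [sha1_fraction(d) for d in kind_a]
frac_b = [sha1_fraction(d) for d in kind_b]

# Each set spreads roughly uniformly over [0, 1); the sample means of
# both sets sit near 0.5, i.e. the hashes of both kinds occupy the same
# value range independently.
print(sum(frac_a) / len(frac_a))
print(sum(frac_b) / len(frac_b))
```

Whether sharing one index between such sets is actually dangerous is exactly the point under debate in this thread; the sketch only shows that both sets do spread over the same range.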
Michael Felke
Telefon +49 2151 38-1453
Telefax +49 2151 38-1094
michael.fe...@evonik.com

Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld
http://www.evonik.com
Geschäftsführung: Gunther Wittmer (Sprecher), Willibrord Lampen
Sitz der Gesellschaft: Krefeld
Registergericht: Amtsgericht Krefeld; Handelsregister HRB 5791


Mark Mielke <m...@mark.mielke.cc>
26.06.2010 00:13
To: Daniel Shahaf <d...@daniel.shahaf.name>
Cc: michael.fe...@evonik.com, m...@rola.ch, "dev@subversion.apache.org" <dev@subversion.apache.org>, ghud...@mit.edu, Mark Phippard <markp...@gmail.com>
Subject: Re: Antwort: Re: Re: dangerous implementation of rep-sharing cache for fsfs

On 06/25/2010 03:34 PM, Daniel Shahaf wrote:
> [1] apparently, no SHA-1 collisions have been found to date. (see
> #svn-dev log today)

We know SHA-1 collisions must exist; however, they are also likely to take an unlikely form. The algorithms were specifically chosen so that small changes in bits result in major changes to the resulting digest. A collision is unlikely to come from a single-character difference. It is far more likely to come from a completely different bit set, likely one that isn't even used in practical, real-world applications.
File data tends to take a highly structured form, whether it is C code or a Microsoft Office document. Huge portions of the sample set will NEVER be used, because they are not highly structured documents of value to anybody. Take C code: it is likely to be a restricted set of 7-bit data, with characters weighted towards alphanumerics and certain symbols. All the C code in the world does not represent a large fraction of the sample set; all the C code in a particular repository is a tiny sample set.

Images have a similar pattern. One could say that image data is random, but it's not. Only certain images, which contain data, are worth saving. That means only a subset of the bit patterns is even considered valuable and worth storing.

Pick a repository with 1,000,000 commits and 1,000 new file versions in each commit. That is 1 billion samples. 1 billion samples / 2^160 is still an incredibly small number: about 6.8 x 10^-40. What real-life repositories come close to this size? We work with some very large repositories in ClearCase, and they don't come close to this...

It only takes one, you say? How are hard disks, memory, and other factors considered acceptable, then? All of these have documented chances of failure. Nothing guarantees that when you write a certain block to disk, reading it back will either fail or return the original data; some small percentage of the time, it will return new data. With 2-TByte disks, the chances are becoming high enough that one can almost statistically guarantee a single bit error over the entire surface of the disk that will go undetected and uncorrected. Again, though, most people don't use the entire disk, and much of the data stored won't even be noticed if a single bit error is introduced.

Personally, I don't want a performance hit introduced due to paranoia.
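The 6.8 x 10^-40 figure above is 10^9 / 2^160, i.e. the fraction of the hash space that 1 billion samples occupy. A slightly more pessimistic estimate, not given in the mail but supporting the same conclusion, is the standard birthday bound for a collision among all pairs of those samples. A minimal sketch, assuming the mail's numbers:

```python
import math

def collision_probability(samples: int, bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-k*(k-1) / 2^(n+1))
    for k samples drawn uniformly from an n-bit hash space."""
    return -math.expm1(-samples * (samples - 1) / 2 ** (bits + 1))

# The mail's scenario: 10^9 file versions against SHA-1's 160-bit space.
fraction = 10**9 / 2**160          # the mail's figure, ~6.8e-40
p = collision_probability(10**9, 160)  # birthday bound, ~3.4e-31

print(fraction)
print(p)
```

Even under the pairwise birthday bound, the probability stays on the order of 10^-31, which is the point being made: accidental SHA-1 collisions in repositories of realistic size are vanishingly unlikely compared to hardware failure rates.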
If a patch is introduced, I'd like it to be optional, so people can choose whether or not to take the verification hit. I remain unconvinced that rep-sharing is the greatest source of detectable or undetectable fsfs corruption. I think it is firmly in the realm of theory, and other products such as Git have all but proven this.

Cheers,
mark

--
Mark Mielke <m...@mielke.cc>