Hello,

OK, it seems there really is a need to discuss the problem of SHA-1 collisions more deeply.
Mark has already pointed out most of the necessary points:

1. A collision is unlikely to come from small differences. I agree with all of you: a SHA-1 collision is practically impossible between data derived from one another. It's designed that way.

2. Usable data is not randomly chosen from the sample set of all possible data. Every kind of data, e.g. C code, is always a very tiny sample set, and SHA-1 is not likely to collide within such a tiny sample set of one kind of data.

3. Data is always highly structured according to the kind it belongs to. That means that the whole set of data of one kind, even if it is an infinite space in itself, is always a narrow set, each element mostly resembling the others.

But one point is missing:

4. The set of one kind of data and the set of another kind overlap very infrequently, if at all. They could be seen as highly discriminable, separated parts of the sample set of all possible data. So SHA-1 hashes will spread widely over the first set, doing the best of its job, and also, but independently, spread over the other set as widely as expected. What is the result when two or more sets of hash values, each widely spread over the same value range, are used together in one fetch index? Perhaps some can see a danger now, too.

I am working on a practical demonstration, which everybody could reproduce with his or her spreadsheet program. But please be patient, I have other things to do as well.

Greetings,

P.S. Thanks for the warning; we are not going to use 1.7. At the moment we are not using 1.6 either, because of the SHA-1 rep-sharing cache.
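The spreadsheet demonstration mentioned above is not included in the mail, but the underlying question (how SHA-1 values from two distinct "kinds" of data spread over the output range) can be sketched in a few lines of Python. This is a hypothetical illustration, not the author's demonstration; the two data kinds chosen here (C-like statements and numeric strings) are assumptions for the example.

```python
import hashlib

def sha1_fraction(data: bytes) -> float:
    """Map a SHA-1 digest to a number in [0, 1) for easy comparison."""
    digest = hashlib.sha1(data).digest()  # 20 bytes = 160 bits
    return int.from_bytes(digest, "big") / 2**160

# Two structurally different "kinds" of data, as in points 2-4 above.
kind_a = [f"int var_{i} = {i};".encode() for i in range(10_000)]  # C-like
kind_b = [f"{i:08d}".encode() for i in range(10_000)]             # numeric

frac_a = [sha1_fraction(d) for d in kind_a]
frac_b = [sha1_fraction(d) for d in kind_b]

# Each set spreads roughly uniformly over [0, 1); the sample means of
# both sets sit near 0.5, i.e. the hashes of both kinds occupy the same
# value range independently.
print(sum(frac_a) / len(frac_a))
print(sum(frac_b) / len(frac_b))
```

Whether sharing one index between such sets is actually dangerous is exactly the point under debate in this thread; the sketch only shows that both sets do spread over the same range.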
Michael Felke
Telefon +49 2151 38-1453
Telefax +49 2151 38-1094
michael.fe...@evonik.com

Evonik Stockhausen GmbH
Bäkerpfad 25
47805 Krefeld
http://www.evonik.com
Geschäftsführung: Gunther Wittmer (Sprecher), Willibrord Lampen
Sitz der Gesellschaft: Krefeld
Registergericht: Amtsgericht Krefeld; Handelsregister HRB 5791


Mark Mielke <m...@mark.mielke.cc>
26.06.2010 00:13
To: Daniel Shahaf <d...@daniel.shahaf.name>
Cc: michael.fe...@evonik.com, m...@rola.ch, "dev@subversion.apache.org" <dev@subversion.apache.org>, ghud...@mit.edu, Mark Phippard <markp...@gmail.com>
Subject: Re: Antwort: Re: Re: dangerous implementation of rep-sharing cache for fsfs

On 06/25/2010 03:34 PM, Daniel Shahaf wrote:
> [1] apparently, no SHA-1 collisions have been found to date. (see
> #svn-dev log today)

We know SHA-1 collisions must exist; however, they are also likely to take an unlikely form. The algorithms were specifically chosen so that small changes in bits result in major changes to the resulting digest. A collision is unlikely to come from a single-character difference. It is far more likely to come from a completely different bit set, likely one that isn't even used in practical, real-world applications.
File data tends to take a highly structured form, whether it is C code or a Microsoft Office document. Huge portions of the sample set will NEVER be used, because they are not highly structured documents of value to anybody. Take C code: it is likely to be a restricted set of 7-bit data, with characters weighted towards alphanumerics and certain symbols. All the C code in the world does not represent a large fraction of the sample set; all the C code in a particular repository is a tiny sample set.

Images have a similar pattern. One could say that image data is random, but it's not. Only certain images, which contain data, are worth saving. That means only a subset of the bit patterns is even considered valuable and worth storing.

Pick a repository with 1,000,000 commits and 1,000 new file versions in each commit. That is 1 billion samples. 1 billion samples / 2^160 is still an incredibly small number: about 6.8 x 10^-40. What real-life repositories come close to this size? We work with some very large repositories in ClearCase, and they don't come close to this...

It only takes one, you say? How are hard disks, memory, and other factors considered acceptable, then? All of these have documented chances of failure. Nothing guarantees that when you write a certain block to disk, reading it back will either fail or return the original data; some small percentage of the time, it will return new data. With 2-TByte disks, the chances are becoming high enough that one can almost statistically guarantee a single bit error over the entire surface of the disk that will go undetected and uncorrected. Again, though, most people don't use the entire disk, and much of the data stored won't even be noticed if a single bit error is introduced.

Personally, I don't want a performance hit introduced due to paranoia.
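The 6.8 x 10^-40 figure above is 10^9 / 2^160, i.e. the fraction of the hash space that 1 billion samples occupy. A slightly more pessimistic estimate, not given in the mail but supporting the same conclusion, is the standard birthday bound for a collision among all pairs of those samples. A minimal sketch, assuming the mail's numbers:

```python
import math

def collision_probability(samples: int, bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-k*(k-1) / 2^(n+1))
    for k samples drawn uniformly from an n-bit hash space."""
    return -math.expm1(-samples * (samples - 1) / 2 ** (bits + 1))

# The mail's scenario: 10^9 file versions against SHA-1's 160-bit space.
fraction = 10**9 / 2**160          # the mail's figure, ~6.8e-40
p = collision_probability(10**9, 160)  # birthday bound, ~3.4e-31

print(fraction)
print(p)
```

Even under the pairwise birthday bound, the probability stays on the order of 10^-31, which is the point being made: accidental SHA-1 collisions in repositories of realistic size are vanishingly unlikely compared to hardware failure rates.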
If a patch is introduced, I'd like it to be optional, so people can choose whether or not to take the verification hit. I remain unconvinced that rep-sharing is the greatest source of detectable or undetectable fsfs corruption. I think it is firmly in the realm of theory, and other products such as Git have all but proven this.

Cheers,
mark

--
Mark Mielke <m...@mielke.cc>