On 06/25/2010 03:34 PM, Daniel Shahaf wrote:
> [1] apparently, no SHA-1 collisions have been found to date. (see #svn-dev log today)
We know SHA-1 collisions must exist, since 160-bit digests summarize inputs of arbitrary length - but they are likely to take an unlikely form. The algorithm was specifically designed so that a small change in the input bits produces a large, unpredictable change in the resulting digest. A collision is unlikely to come from a single-character difference; it is far more likely to involve a completely different bit pattern, quite possibly one that never even occurs in practical, real-world data.
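(A quick sketch of that avalanche behaviour, using Python's hashlib purely as an illustration - the two input strings are made up for the example:)

    import hashlib

    # Two inputs differing in a single character...
    a = hashlib.sha1(b"int main(void) { return 0; }").digest()
    b = hashlib.sha1(b"int main(void) { return 1; }").digest()

    # ...still differ in roughly half of the 160 digest bits.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    print("digest bits that differ:", differing, "of 160")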
File data tends to take a highly structured form - whether it is C code or a Microsoft Office document. Huge portions of the sample space will NEVER be used, because they do not correspond to structured documents of value to anybody. Take C code: it is largely restricted to 7-bit data, with characters weighted towards alphanumerics and certain symbols. All the C code in the world does not represent a meaningful fraction of the sample space, and all the C code in a particular repository is a tiny sample set. Images follow a similar pattern. One could say that image data is random - but it's not. Only certain images, those that actually contain meaningful data, are worth saving, so only a subset of the possible bit patterns is ever considered valuable and worth storing.
Pick a repository with 1,000,000 commits and 1000 new file versions in each commit.
That is 1 billion samples. 1 billion / (2^160) is still an incredibly small number - about 6.8 x 10^-40.
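(A back-of-envelope check of that figure, plus - as an extra comparison that isn't in the numbers above - the birthday-style count of expected colliding pairs; a rough sketch only:)

    n = 1_000_000 * 1000              # 10^9 new file versions
    space = 2 ** 160                  # size of the SHA-1 digest space

    print(n / space)                  # ~6.8e-40, the figure quoted above
    print(n * (n - 1) // 2 / space)   # ~3.4e-31, expected colliding pairs (birthday estimate)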
What real-life repositories come close to this size? We work with some very large repositories in ClearCase, and they don't come close to this...
It only takes one, you say? How are hard disks, memory, and other components considered acceptable, then? All of these have documented failure rates. Nothing guarantees that a block written to disk will, when read back, either fail outright or return the original data; some small percentage of the time it returns different data. With 2 TB disks, the odds are high enough that one can almost statistically guarantee at least one bit error somewhere on the disk that goes undetected and uncorrected. Again, though - most people don't use the entire disk, and for much of the stored data a single bit error would never even be noticed.
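(Rough arithmetic behind that claim, using the commonly quoted consumer-drive spec of about one unrecoverable read error per 10^14 bits read - both the error rate and the lifetime read volume below are assumptions for illustration, not measurements:)

    import math

    disk_bits = 2 * 10**12 * 8        # a 2 TB disk, in bits
    error_rate = 1e-14                # assumed unrecoverable-read-error rate per bit read

    per_full_read = disk_bits * error_rate
    print(per_full_read)              # ~0.16 expected errors per full read of the disk

    lifetime_reads = 50               # assumed number of full-disk reads over its life
    print(1 - math.exp(-per_full_read * lifetime_reads))   # ~0.9997: at least one error is near-certain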
Personally, I don't want a performance hit introduced out of paranoia. If a patch is introduced, I'd like it to be optional, so people can choose whether or not to take the verification hit. I remain unconvinced that rep-sharing is the most likely source of detectable or undetectable FSFS corruption. I think it is firmly in the realm of theory, and that other products such as Git have all but proven this in practice.
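(For what it's worth, rep-sharing itself is already a per-repository toggle in db/fsfs.conf as of 1.6; presumably any extra verification pass could be exposed the same way - the snippet below is just the existing knob, not a proposed new option:)

    [rep-sharing]
    enable-rep-sharing = true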
Cheers,
mark

--
Mark Mielke <m...@mielke.cc>