On 06/25/2010 03:34 PM, Daniel Shahaf wrote:
> [1] apparently, no SHA-1 collisions have been found to date. (see #svn-dev log today)
We know SHA-1 collisions must exist, since 160-bit digests summarize inputs of arbitrary length - but they are likely to take an unlikely form. The algorithm was specifically designed so that a small change in the input bits produces a large, unpredictable change in the resulting digest. A collision is unlikely to come from a single-character difference; it is far more likely to involve a completely different bit pattern, quite possibly one that never even occurs in practical, real-world data.
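(A quick sketch of that avalanche behaviour, using Python's hashlib purely as an illustration - the two input strings are made up for the example:)

    import hashlib

    # Two inputs differing in a single character...
    a = hashlib.sha1(b"int main(void) { return 0; }").digest()
    b = hashlib.sha1(b"int main(void) { return 1; }").digest()

    # ...still differ in roughly half of the 160 digest bits.
    differing = sum(bin(x ^ y).count("1") for x, y in zip(a, b))
    print("digest bits that differ:", differing, "of 160")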
File data tends to take a highly structured form - whether it is C code or a Microsoft Office document. Huge portions of the sample space will NEVER be used, because they do not correspond to structured documents of value to anybody. Take C code: it is largely restricted to 7-bit data, with characters weighted towards alphanumerics and certain symbols. All the C code in the world does not represent a meaningful fraction of the sample space, and all the C code in a particular repository is a tiny sample set. Images follow a similar pattern. One could say that image data is random - but it's not. Only certain images, those that actually contain meaningful data, are worth saving, so only a subset of the possible bit patterns is ever considered valuable and worth storing.
Pick a repository with 1,000,000 commits and 1000 new file versions in each commit.
That is 1 billion samples. 1 billion / (2^160) is still an incredibly small number - about 6.8 x 10^-40.
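(A back-of-envelope check of that figure, plus - as an extra comparison that isn't in the numbers above - the birthday-style count of expected colliding pairs; a rough sketch only:)

    n = 1_000_000 * 1000              # 10^9 new file versions
    space = 2 ** 160                  # size of the SHA-1 digest space

    print(n / space)                  # ~6.8e-40, the figure quoted above
    print(n * (n - 1) // 2 / space)   # ~3.4e-31, expected colliding pairs (birthday estimate)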
What real-life repositories come close to this size? We work with some very large repositories in ClearCase, and they don't come close to this...
It only takes one, you say? How are hard disks, memory, and other components considered acceptable, then? All of these have documented failure rates. Nothing guarantees that a block written to disk will, when read back, either fail outright or return the original data; some small percentage of the time it returns different data. With 2 TB disks, the odds are high enough that one can almost statistically guarantee at least one bit error somewhere on the disk that goes undetected and uncorrected. Again, though - most people don't use the entire disk, and for much of the stored data a single bit error would never even be noticed.
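(Rough arithmetic behind that claim, using the commonly quoted consumer-drive spec of about one unrecoverable read error per 10^14 bits read - both the error rate and the lifetime read volume below are assumptions for illustration, not measurements:)

    import math

    disk_bits = 2 * 10**12 * 8        # a 2 TB disk, in bits
    error_rate = 1e-14                # assumed unrecoverable-read-error rate per bit read

    per_full_read = disk_bits * error_rate
    print(per_full_read)              # ~0.16 expected errors per full read of the disk

    lifetime_reads = 50               # assumed number of full-disk reads over its life
    print(1 - math.exp(-per_full_read * lifetime_reads))   # ~0.9997: at least one error is near-certain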
Personally, I don't want a performance hit introduced out of paranoia. If a patch is introduced, I'd like it to be optional, so people can choose whether or not to take the verification hit. I remain unconvinced that rep-sharing is the most likely source of detectable or undetectable FSFS corruption. I think it is firmly in the realm of theory, and that other products such as Git have all but proven this in practice.
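(For what it's worth, rep-sharing itself is already a per-repository toggle in db/fsfs.conf as of 1.6; presumably any extra verification pass could be exposed the same way - the snippet below is just the existing knob, not a proposed new option:)

    [rep-sharing]
    enable-rep-sharing = true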
Cheers,
mark

--
Mark Mielke <m...@mielke.cc>