On Wed, Feb 27, 2013 at 05:40:53PM +0100, Kevin Wolf wrote:
> On 27.02.2013 at 16:58, Benoît Canet wrote:
> > > > The current prototype of the QCOW2 deduplication uses 32 bytes SHA256
> > > > or SKEIN
> > > > hashes to identify each 4KB clusters with a very low probability of
> > > > collisions.
> > >
> > > How do you handle the rare collision cases? Do you read the original
> > > cluster and compare the exact contents when the hashes match?
> >
> > Stefan found a paper with the math required to compute the collision
> > probability: http://http://plan9.bell-labs.com/sys/doc/venti/venti.html
> > (Section 3.1)
> > Doing the math for 1 Exabyte of stored data with 4KB clusters and 256 bits
> > hashes gives a probability of 2.57E-49.
> > The probability being low enough I plan to code the read/compare as an
> > option that the users would toggle.
> > The people who wrote the deduplication in ZFS have done it this way.
>
> Fair enough. If you want to gamble with your data for some more
> performance, you can turn it off. Should we add some compatible taint
> flag after the image has been used without collision detection?

If the verification setting is stored in the qcow2 image header, then it's
essentially a taint flag.

Stefan
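
For reference, the 2.57E-49 figure quoted above can be reproduced with the
birthday-style bound p <= n(n-1)/2 * 2^-b from Section 3.1 of the Venti paper.
This is only a rough sketch: the function name is made up for illustration, and
it assumes 1 Exabyte means 10^18 bytes and 4KB means 4096 bytes, which is the
reading that appears to match the quoted number.

```python
from fractions import Fraction


def collision_bound(stored_bytes: int, cluster_bytes: int, hash_bits: int) -> Fraction:
    """Upper bound n(n-1)/2 * 2^-b on the chance that any two
    distinct clusters end up with the same b-bit hash."""
    n = stored_bytes // cluster_bytes   # number of clusters stored
    pairs = n * (n - 1) // 2            # number of distinct cluster pairs
    return Fraction(pairs, 2 ** hash_bits)


# Assumptions (not stated in the thread): 1 Exabyte = 10**18 bytes,
# 4KB cluster = 4096 bytes, 256-bit hash (SHA256 or SKEIN).
p = collision_bound(10**18, 4096, 256)
print(f"{float(p):.2e}")  # prints 2.57e-49
```

Exact rational arithmetic is used so the tiny result does not depend on
floating-point rounding until the final conversion for printing.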