On Sat, Jan 15, 2011 at 10:19:23AM -0600, Bob Friesenhahn wrote:
> On Fri, 14 Jan 2011, Peter Taps wrote:
>
> > Thank you for sharing the calculations. In lay terms, for Sha256,
> > how many blocks of data would be needed to have one collision?
>
> Two.
Pretty funny.

In this thread some of you are treating SHA-256 as an idealized hash function. The odds of accidentally finding collisions in an idealized 256-bit hash function are minute, because the distribution of hash function outputs over inputs is random (or, rather, pseudo-random); a back-of-the-envelope sketch of the birthday-bound arithmetic is appended below. But cryptographic hash functions are generally only approximations of idealized hash functions. There's nothing to say that there aren't pathological corner cases where a given hash function produces lots of collisions that would be semantically meaningful to people -- i.e., a set of inputs over which the outputs are not randomly distributed. Of course, we don't know of such pathological corner cases for SHA-256, but not that long ago we didn't know of any for SHA-1 or MD5 either.

The question of whether disabling verification would improve performance is pretty simple: if you have highly deduplicatious, _synchronous_ writes (or nearly so, due to frequent fsync()s or NFS close operations), and the working set does not fit in the ARC or the L2ARC, then yes, disabling verification will help significantly, by removing an average of at least half a disk rotation from the write latency. The same goes for a workload of asynchronous writes that might as well be synchronous because the cache is undersized relative to the workload. Otherwise the cost of verification should be hidden by caching.

Another way to put this: first determine that verification is actually affecting performance, and only _then_ consider disabling it. But if you want the freedom to disable verification, then you should be using SHA-256 (or switch to it when disabling verification); the corresponding dedup property settings are also shown below. A safety feature that costs nothing is not worth turning off, so make sure its cost is significant before even thinking about turning it off.

Similarly, the cost of SHA-256 vs. Fletcher should be lost in the noise if the system has enough CPU, but if the choice of hash function could make the system CPU-bound instead of I/O-bound, then that choice would have a real impact on performance. The choice of hash function also has a different performance profile than verification: a slower hash function affects non-deduplicatious workloads more than highly deduplicatious ones, since the latter require more I/O for verification, which overwhelms the cost of the hash function. Again, measure first.

Nico
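
For the curious, here is a rough sketch of the birthday-bound arithmetic behind the "idealized hash" point above. The Python below is only illustrative, and the 2^35-block pool size in it is a made-up figure, not anything measured:

    from math import log, sqrt, expm1

    d = 2 ** 256                       # possible outputs of an idealized 256-bit hash

    # Blocks hashed before a collision becomes more likely than not (~50%):
    # birthday bound, n ~= sqrt(2 * d * ln 2)
    n_half = sqrt(2 * d * log(2))
    print("blocks for ~50%% collision odds: %.2e" % n_half)    # ~4.0e38

    # Collision odds for a large but hypothetical pool of 2^35 unique blocks
    # (roughly 4 PiB of 128 KiB blocks):
    n = 2 ** 35
    p = -expm1(-n * (n - 1) / (2.0 * d))   # 1 - exp(-n(n-1)/(2d))
    print("collision odds for 2^35 blocks: %.1e" % p)          # ~5e-57

Even at 2^35 unique blocks the accidental-collision odds are absurdly small -- which is exactly why the interesting risk is the non-idealized, pathological-input case rather than random chance.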
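
And for concreteness, the verification switch being discussed is the ZFS dedup dataset property; the dataset name below is made up:

    # SHA-256 plus byte-for-byte verification on checksum match
    zfs set dedup=sha256,verify tank/data

    # SHA-256 alone, trusting the hash (no verification)
    zfs set dedup=sha256 tank/data

    # dedup off for this dataset
    zfs set dedup=off tank/data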