> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Gregg Wonderly
>
> But this is precisely the kind of "observation" that some people seem to miss
> out on the importance of. As Tomas suggested in his post, if this was true,
> then we could have a huge compression ratio as well. And even if there was
> 10% of the bit patterns that created non-unique hashes, you could use the
> fact that a block hashed to a known bit pattern that didn't have collisions, to
> compress the other 90% of your data.

In fact, if you were to assume hash value X corresponded to data block Y, you could use the hash function as a form of compression, but for decompression you would need a lookup table. So if you were only ever going to compress Y once, you would gain nothing. But if you had a data stream with a bunch of duplicates in it, you might hash Y many times; once X is in your lookup table, every additional occurrence of Y only costs you another copy of X rather than another copy of the block. So in fact, dedup is a form of hash-based compression (a toy sketch of the idea is appended at the end of this message). But we know it is not guaranteed to be lossless, so the decision to verify or not to verify comes down to the probability of collision. Many of us have done the math on that probability (a back-of-the-envelope estimate is also appended below), and many of us are comfortable skipping verify, because we know the probability of collision is so small. But the decision to verify or not verify very much depends on your intended purpose (the type of data you're storing, and what its significance is). More importantly, it depends on who you would need to justify your decision to. Nobody will ever get in trouble for enabling verify. But if you have to convince your CEO that you made the right choice by disabling verify ... you might just choose to enable verify for the sake of not having to explain your decision.

> I'm serious about this from a number of perspectives. We worry about the
> time it would take to reverse SHA or RSA hashes to passwords, not even
> thinking that what if someone has been quietly computing all possible hashes
> for the past 10-20 years into a database some where, with every 5-16
> character password, and now has an instantly searchable hash-to-password
> database.

That is called a rainbow table, and it's commonly used for cracking poorly implemented password schemes. Even with massively powerful computers and a whole lot of storage, you only have enough time and space to compute a rainbow table for commonly used (easily guessable) passwords. To prevent attackers from successfully using rainbow tables, even when users choose bad passwords, a security-conscious developer will use salting and stretching (a small example is appended below). Salting adds per-user randomness, so a precomputed table no longer applies, and stretching increases the time it takes to compute each password hash; together they thwart rainbow tables.

> If no one has computed the hashes for every single 4K and 8K block, then
> fine. But, if that was done, and we had that data, we'd know for sure which
> algorithm was going to work the best for the number of bits we are
> considering.

Let's imagine computing a 256-bit hash (32 bytes = 2^5 bytes) for every possible 1K (8192-bit) block. Storage for this table would require 2^5 * 2^8192 bytes = 2^8197 bytes, which is roughly 3.5 * 10^2467 bytes (the arithmetic is spelled out at the end as well). Recall that a petabyte is 10^15 bytes. So ... storing the rainbow table for merely the 1K block size would not fit in all the storage on planet Earth; it exceeds it by thousands of orders of magnitude. And having the rainbow table for *precisely* a 1K block size isn't useful unless you happen to have precisely a 1K block you want to look up. If you're looking up a 1025-byte data hash, or a 4K data hash ... you're out of luck; this table won't help you. It's physically impossible, even for the supremely simplified problem of a 1K block size and a 256-bit hash.
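Appendix for the curious: here is a toy Python sketch of the "dedup as hash-keyed compression" idea above, with a verify knob. Everything in it (the 4K block size, SHA-256, the in-memory dict) is an illustrative stand-in of my own; it is not ZFS's actual dedup table or on-disk logic.

# Toy sketch of dedup as hash-keyed "compression" with an optional verify step.
# Block size, SHA-256, and the in-memory dict are illustrative stand-ins; this
# is not ZFS's on-disk dedup logic.
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, for the example below

def dedup_store(blocks, verify=True):
    """Store each distinct block once; the 'compressed' stream is a list of hashes."""
    table = {}   # hash -> block data: the lookup table needed for decompression
    refs = []    # one 32-byte hash per input block
    for block in blocks:
        digest = hashlib.sha256(block).digest()
        if digest in table:
            # verify=True: confirm a hash match really is the same data.
            # verify=False: trust that a collision is astronomically unlikely.
            if verify and table[digest] != block:
                raise RuntimeError("hash collision: same digest, different data")
        else:
            table[digest] = block
        refs.append(digest)
    return table, refs

def dedup_restore(table, refs):
    """'Decompress' by replaying the stored hashes through the lookup table."""
    return [table[d] for d in refs]

# Duplicates cost only a hash reference, not another full block:
blocks = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"A" * BLOCK_SIZE]
table, refs = dedup_store(blocks)
assert dedup_restore(table, refs) == blocks
assert len(table) == 2 and len(refs) == 3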
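And the kind of back-of-the-envelope collision math mentioned above. The inputs (a petabyte of unique 4K blocks, a 256-bit hash) are my own illustrative assumptions; the approximation is the standard birthday bound.

# Birthday bound: with n unique blocks and a b-bit hash, the probability of
# any collision is roughly n^2 / 2^(b+1) when n is far below 2^(b/2).
import math

bits = 256
n_blocks = 10**15 // 4096   # ~2.4e11 blocks: a petabyte of unique 4K blocks (illustrative)

log2_p = 2 * math.log2(n_blocks) - (bits + 1)
print(f"P(collision) is about 2^{log2_p:.0f}")   # roughly 2^-181 -- negligible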
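For anyone who hasn't seen salting and stretching in practice, a minimal sketch using PBKDF2 from Python's standard library. The salt length and iteration count are illustrative choices, not recommendations from this thread.

# Salting + stretching via PBKDF2. The per-user random salt defeats
# precomputed (rainbow) tables; the iteration count slows each guess.
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 200_000):
    salt = os.urandom(16)  # random salt: the same password hashes differently every time
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, iterations, digest

def check_password(password: str, salt: bytes, iterations: int, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, digest)

salt, iters, digest = hash_password("correct horse battery staple")
assert check_password("correct horse battery staple", salt, iters, digest)
assert not check_password("hunter2", salt, iters, digest)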
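Finally, the table-size arithmetic from the last paragraph, checked mechanically (same numbers as in the text, nothing new):

# 2^8192 possible 1K blocks, 2^5 bytes of hash stored per block.
import math

log10_bytes = (5 + 8192) * math.log10(2)              # log10 of 2^8197
mantissa, exponent = 10 ** (log10_bytes % 1), int(log10_bytes)
print(f"about {mantissa:.1f} * 10^{exponent} bytes")  # about 3.5 * 10^2467 bytes
print(f"that is ~10^{exponent - 15} petabytes")       # vs. 10^15 bytes per petabyte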