> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Gregg Wonderly
> 
> But this is precisely the kind of "observation" that some people seem to
> miss out on the importance of.  As Tomas suggested in his post, if this
> was true, then we could have a huge compression ratio as well.  And even
> if there was 10% of the bit patterns that created non-unique hashes, you
> could use the fact that a block hashed to a known bit pattern that didn't
> have collisions, to compress the other 90% of your data.

In fact, if you were to assume hash value X corresponded to data block Y,
you could use the hash function as a form of compression, but for
decompression you would need a lookup table.  So if you were only going to
compress Y once, you would gain nothing.  But if you had a data stream with
a bunch of duplicates in it, you might hash Y many times, discover that X is
already in your lookup table, and then only need to store another copy of X
instead of another copy of Y.
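
To make that concrete, here is a minimal Python sketch of a hash-keyed block
store (purely illustrative; it is not how ZFS maintains its dedup table):

    import hashlib

    store = {}      # the lookup table: 32-byte hash -> block data
    stream = []     # the "compressed" stream: one hash per block written

    def write_block(block: bytes) -> bytes:
        h = hashlib.sha256(block).digest()
        if h not in store:      # first time this content is seen: keep it
            store[h] = block
        stream.append(h)        # a duplicate costs only another 32-byte hash
        return h

    def read_block(h: bytes) -> bytes:
        return store[h]         # "decompression" requires the lookup table

    for block in (b"blockAAA", b"blockBBB", b"blockAAA", b"blockAAA"):
        write_block(block)

    print(len(store), "unique blocks kept for", len(stream), "writes")  # 2 for 4

Four writes, two stored blocks; the savings come entirely from duplicates
that hash to an X already in the table.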

So dedup is, in fact, a form of hash-based compression.  But we know it is
not guaranteed to be lossless, so the decision to verify or not to verify is
a matter of the probability of collision.  Many of us have done the math on
this probability, and many of us are comfortable skipping verify, because we
know the probability of collision is so small.  But the decision to verify
or not verify depends very much on your intended purpose (the type of data
you're storing, and how significant it is).
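
For a rough sense of the numbers, the birthday bound puts the chance of any
collision among n random 256-bit hashes at roughly n*(n-1)/2 / 2^256.  A
back-of-envelope in Python (the trillion-block pool size is just an
illustrative assumption, not anything from a real pool):

    n = 10**12                      # assume ~a trillion unique blocks in the pool
    p = n * (n - 1) // 2 / 2**256   # birthday-bound collision probability
    print(p)                        # ~4.3e-54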

More importantly, the decision to verify or not verify depends on who you
would need to justify your decision to.  Nobody will ever get in trouble for
enabling verify.  But if you would have to convince your CEO that you made
the right choice by disabling verify ... you might just choose to enable it
for the sake of not having to explain yourself.


> I'm serious about this from a number of perspectives.  We worry about the
> time it would take to reverse SHA or RSA hashes to passwords, not even
> thinking that what if someone has been quietly computing all possible
> hashes for the past 10-20 years into a database some where, with every
> 5-16 character password, and now has an instantly searchable
> hash-to-password database.

That is called a rainbow table, and it's commonly used for cracking poorly
implemented password schemes.  Even with massively powerful computers and a
whole lot of storage, you only have enough time and space to compute a
rainbow table for commonly used (easily guessable) passwords.  To prevent
attackers from successfully using rainbow tables, even when users choose bad
passwords, a security-conscious developer will use salting and stretching.
Salting adds per-password randomness, so a precomputed table no longer
applies, and stretching increases the time needed to compute each hash,
making such a table impractical to build in the first place.
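
A minimal sketch of salting and stretching, here using PBKDF2 from Python's
standard library (the iteration count and salt size are illustrative
choices, not a recommendation for any particular system):

    import hashlib, os

    def hash_password(password: str, salt: bytes = None):
        salt = salt or os.urandom(16)   # salt: random per password, stored alongside
        digest = hashlib.pbkdf2_hmac(
            "sha256", password.encode(), salt,
            200_000)                    # stretching: 200k iterations per guess
        return salt, digest

    salt, stored = hash_password("correct horse battery staple")
    # Verification re-derives the digest with the stored salt:
    print(hash_password("correct horse battery staple", salt)[1] == stored)  # True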


> If no one has computed the hashes for every single 4K and 8K block, then
> fine.  But, if that was done, and we had that data, we'd know for sure
> which algorithm was going to work the best for the number of bits we are
> considering.

Let's imagine computing a 256-bit hash (32 bytes = 2^5 bytes) for every
possible 1K (8192-bit) block.  There are 2^8192 such blocks, so storing this
table would require 2^5 * 2^8192 bytes = 2^8197 bytes, which is roughly
3.5 * 10^2467 bytes.  Recall that a petabyte is 10^15 bytes.  So ... storing
the rainbow table for merely the 1K block size would not come remotely close
to fitting in all the storage on planet Earth.  And even if it did, a table
for *precisely* the 1K block size isn't useful unless you happen to have
precisely a 1K block you want to look up...  If you happen to be looking up
the hash of 1025 bytes of data, or of a 4K block ... you're out of luck.
This table won't help you.
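
The arithmetic is easy to check; a quick Python back-of-envelope using
exactly the numbers above:

    entries = 2 ** 8192          # every possible 1 KiB (8192-bit) block
    table_bytes = entries * 32   # 32 bytes of SHA-256 output per entry = 2**8197
    print(f"~10^{len(str(table_bytes)) - 1} bytes")  # ~10^2467; a petabyte is 10^15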

It's physically impossible to do, even on the supremely simplified problem
of 1K block size and 256-bit hash.
