On Tue, 8 Jul 2008, Nathan Kroenert wrote:
> Even better would be using the ZFS block checksums (assuming we are only
> summing the data, not it's position or time :)...
>
> Then we could have two files that have 90% the same blocks, and still
> get some dedup value... ;)
It seems that the hard problem is not whether ZFS has the structure to support it (the implementation seems pretty obvious), but rather that ZFS is supposed to scale to extremely large sizes. If you have a petabyte of storage in the pool, then the data structure needed to keep track of block similarity could grow exceedingly large. The block checksums are designed to be as random as possible, so their values suggest nothing about the similarity of the data unless they are identical. And the checksums have enough bits and randomness that keeping them all in a lookup structure such as a binary tree would not scale.

Except for the special case of backups or cloned server footprints, it does not seem that data deduplication is going to save the 90% (or more) space that Quantum claims at http://www.quantum.com/Solutions/datadeduplication/Index.aspx. ZFS clones already provide a form of data deduplication.

The actual benefit of data deduplication to an enterprise seems negligible unless the backup system directly supports it. In the enterprise, the cost of storage has more to do with backing up the data than with the amount of storage media consumed.

Bob

======================================
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
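P.S. Purely to illustrate the bookkeeping a checksum-keyed dedup feature implies (a toy sketch, not anything from ZFS itself; the 128K block size, SHA-256 checksum, and in-memory dict are all my own assumptions), the logic is simple. The scaling problem is the size of the table, worked out in the comment at the bottom:

import hashlib

BLOCK_SIZE = 128 * 1024  # assuming the ZFS default 128K recordsize

def dedup_blocks(blocks):
    """Return (unique_blocks, refs): refs maps each input block to the
    index of the unique block that holds its data."""
    table = {}           # checksum -> index into unique_blocks
    unique_blocks = []
    refs = []
    for block in blocks:
        csum = hashlib.sha256(block).digest()
        if csum not in table:
            # first time we see this content: store it
            table[csum] = len(unique_blocks)
            unique_blocks.append(block)
        # duplicate or not, record a reference to the stored copy
        refs.append(table[csum])
    return unique_blocks, refs

# Rough arithmetic for the lookup table alone: a petabyte of 128K blocks
# is 2**50 / 2**17 = roughly 8.6 billion entries; at ~40 bytes per entry
# (32-byte checksum plus a pointer) that is on the order of 300+ GB of
# table before counting any on-disk block pointers.
if __name__ == "__main__":
    data = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"A" * BLOCK_SIZE]
    uniq, refs = dedup_blocks(data)
    print(len(uniq), refs)   # 2 [0, 1, 0]

Either that table fits in RAM, or every write has to go to disk just to consult it, which is exactly the scaling concern above.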