Good points. I see the archival process as a good candidate for adding dedup because it is essentially doing what a stage/release archiving system already does - "faking" the existence of data via metadata. Those blocks aren't actually there, but they're still "accessible" because they're *somewhere* the system knows about (i.e. the "other twin").
Currently in SAMFS, if I store two identical files on the archiving filesystem and my policy generates 4 copies, I will have created 8 copies of the file (albeit with different metadata). Dedup would help immensely here (a rough sketch of the idea follows Bob's message below). And since archiving (data management) is inherently a "costly" operation, it is used where potentially slower access to data is acceptable.

Another system that comes to mind that uses dedup is Xythos WebFS. As Bob points out, keeping track of dupes is a chore. IIRC, WebFS uses a relational database to track this (among much of its other metadata).

Charles

On 7/7/08 7:40 PM, "Bob Friesenhahn" <[EMAIL PROTECTED]> wrote:

> On Tue, 8 Jul 2008, Nathan Kroenert wrote:
>
>> Even better would be using the ZFS block checksums (assuming we are only
>> summing the data, not its position or time :)...
>>
>> Then we could have two files that have 90% the same blocks, and still
>> get some dedup value... ;)
>
> It seems that the hard problem is not whether ZFS has the structure to
> support it (the implementation seems pretty obvious), but rather that
> ZFS is supposed to be able to scale to extremely large sizes. If you
> have a petabyte of storage in the pool, then the data structure to
> keep track of block similarity could grow exceedingly large. The
> block checksums are designed to be as random as possible, so their
> value does not suggest anything regarding the similarity of the data
> unless the values are identical. The checksums have enough bits and
> randomness that binary trees would not scale.
>
> Except for the special case of backups or cloned server footprints,
> it does not seem that data deduplication is going to save the 90% (or
> more) space that Quantum claims at
> http://www.quantum.com/Solutions/datadeduplication/Index.aspx.
>
> ZFS clones already provide a form of data deduplication.
>
> The actual benefit of data deduplication to an enterprise seems
> negligible unless the backup system directly supports it. In the
> enterprise, the cost of storage has more to do with backing up the data
> than the amount of storage media consumed.
>
> Bob
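For what it's worth, here is a rough Python sketch of the kind of checksum-keyed table being discussed. It is purely illustrative, not anything ZFS (or SAMFS) actually implements: the class and method names and the use of SHA-256 are my own assumptions. Identical blocks hash to the same key, so the "2 identical files x 4 archive copies" case collapses to a single stored block, and the lookup structure is a hash table rather than a comparison tree, since random checksums give a tree no useful ordering to exploit.

    import hashlib

    # Sketch of a checksum-keyed dedup table: each unique block is stored
    # once, keyed by its checksum, with a reference count per block.
    # A hash table gives O(1) average lookup regardless of how random the
    # checksums are, which is the scaling concern Bob raises for trees.
    class DedupStore:
        def __init__(self):
            self.blocks = {}    # checksum (bytes) -> block data
            self.refcount = {}  # checksum (bytes) -> number of references

        def write_block(self, data: bytes) -> bytes:
            """Store a block; return its checksum (acting as a block pointer)."""
            csum = hashlib.sha256(data).digest()
            if csum in self.blocks:
                self.refcount[csum] += 1   # duplicate: just bump the refcount
            else:
                self.blocks[csum] = data   # first copy: store the data once
                self.refcount[csum] = 1
            return csum

        def read_block(self, csum: bytes) -> bytes:
            return self.blocks[csum]

        def free_block(self, csum: bytes) -> None:
            """Drop one reference; reclaim the block when none remain."""
            self.refcount[csum] -= 1
            if self.refcount[csum] == 0:
                del self.refcount[csum]
                del self.blocks[csum]

    # Eight logical copies of the same 8 KB block consume the space of one.
    store = DedupStore()
    block = b"x" * 8192
    pointers = [store.write_block(block) for _ in range(8)]
    assert len(store.blocks) == 1

The obvious catch, as Bob notes, is that at petabyte scale this table itself becomes enormous, and any real design has to decide how much of it can live in memory versus on disk.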