Good points.  I see the archival process as a good candidate for adding
dedup because it is essentially doing what a stage/release archiving system
already does: "faking" the existence of data via metadata.  Those blocks
aren't actually there on disk, but they're still "accessible" because
they're *somewhere* the system knows about (i.e. on the "other twin").

Currently in SAMFS, if I store two identical files on the archiving
filesystem and my policy generates 4 copies, I will have created 8 copies of
the same data (albeit with different metadata).  Dedup would help immensely
here.  And since archiving (data management) is inherently a "costly"
operation, it's used where potentially slower access to data is acceptable.
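
To make that concrete, here's a minimal sketch (plain Python, and nothing to
do with how SAMFS is actually implemented; the file names, hash choice, and
helper names are just my assumptions) of keying archive copies by a content
hash so that byte-identical files share one set of physical copies:

import hashlib

def content_key(data: bytes) -> str:
    """Byte-identical contents map to the same key."""
    return hashlib.sha256(data).hexdigest()

archive_copies = {}   # content key -> number of physical copies written
catalog = {}          # file name   -> content key (per-file metadata stays per-file)

def archive(name: str, data: bytes, policy_copies: int = 4) -> None:
    key = content_key(data)
    if key not in archive_copies:
        archive_copies[key] = policy_copies   # only the first instance pays for the copies
    catalog[name] = key

payload = b"identical contents"
archive("fileA", payload)
archive("fileB", payload)   # byte-identical twin: no new physical copies
print(sum(archive_copies.values()), "physical copies instead of",
      4 * len(catalog))

With two identical files and a 4-copy policy, that's 4 physical copies
instead of 8, while each file still keeps its own catalog (metadata) entry.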

Another dedup-using system that comes to mind is Xythos WebFS.  As Bob
points out, keeping track of dupes is a chore.  IIRC, WebFS uses a
relational database to track this (among much of its other metadata).
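
I don't know WebFS's actual schema, but the tracking it needs is roughly a
refcounted table keyed by content hash, something like this hypothetical
sketch (sqlite here; the table and column names are made up):

import hashlib, sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE blobs (
    hash     TEXT PRIMARY KEY,   -- content checksum used as the dedup key
    refcount INTEGER NOT NULL,   -- how many files currently point at this content
    data     BLOB NOT NULL)""")

def store(data: bytes) -> str:
    """Insert new content, or just bump the refcount if it's already stored."""
    key = hashlib.sha256(data).hexdigest()
    cur = db.execute("UPDATE blobs SET refcount = refcount + 1 WHERE hash = ?", (key,))
    if cur.rowcount == 0:
        db.execute("INSERT INTO blobs VALUES (?, 1, ?)", (key, data))
    return key

def release(key: str) -> None:
    """Drop one reference; reclaim the content once nothing points at it."""
    db.execute("UPDATE blobs SET refcount = refcount - 1 WHERE hash = ?", (key,))
    db.execute("DELETE FROM blobs WHERE hash = ? AND refcount <= 0", (key,))

k = store(b"same bytes")
store(b"same bytes")    # duplicate upload: refcount becomes 2, no second blob row
release(k)
print(db.execute("SELECT refcount FROM blobs").fetchone())   # -> (1,)

The chore Bob mentions is exactly keeping that refcount correct across
creates, overwrites, and deletes.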

Charles

On 7/7/08 7:40 PM, "Bob Friesenhahn" <[EMAIL PROTECTED]> wrote:

> On Tue, 8 Jul 2008, Nathan Kroenert wrote:
> 
>> Even better would be using the ZFS block checksums (assuming we are only
>> summing the data, not its position or time :)...
>> 
>> Then we could have two files that have 90% the same blocks, and still
>> get some dedup value... ;)
> 
> It seems that the hard problem is not whether ZFS has the structure to
> support it (the implementation seems pretty obvious), but rather that
> ZFS is supposed to be able to scale to extremely large sizes.  If you
> have a petabyte of storage in the pool, then the data structure to
> keep track of block similarity could grow exceedingly large.  The
> block checksums are designed to be as random as possible so their
> value does not suggest anything regarding the similarity of the data
> unless the values are identical.  The checksums have enough bits and
> randomness that binary trees would not scale.
> 
> Except for the special case of backups or cloned server footprints,
> it does not seem that data deduplication is going to save the 90% (or
> more) space that Quantum claims at
> http://www.quantum.com/Solutions/datadeduplication/Index.aspx.
> 
> ZFS clones already provide a form of data deduplication.
> 
> The actual benefit of data deduplication to an enterprise seems
> negligible unless the backup system directly supports it.  In the
> enterprise, the cost of storage has more to do with backing up the data
> than with the amount of storage media consumed.
> 
> Bob


