On Mon, 7 Jul 2008, Mike Gerdts wrote:
>
> As I have considered deduplication for application data I see several
> things happen in various areas.

You have provided an excellent description of gross inefficiencies in 
the way systems and software are deployed today, resulting in massive 
duplication.  That duplication eases service deployment and 
management, but most of it is not technically necessary.

>  There tend to be organizational walls between those that manage
>  storage and those that consume it.  As storage is distributed across
>  a network (NFS, iSCSI, FC) things like delegated datasets and RBAC
>  are of limited practical use.  Due to these factors and likely

It seems that deduplication on the server does not provide much 
benefit to the client, since the client still sees each duplicate as 
distinct data.  It does not know that it could avoid caching or 
copying a block twice because that block is a duplicate.  Only the 
server benefits from the deduplication, except that server-side 
caching may improve and provide the client with a bit more 
performance.

While deduplication can obviously save server storage space, it does 
not seem to help much with backups, and it does not really help the 
user manage all of that data.  It does help the user by consuming 
less raw storage space, but there is surely a substantial run-time 
cost associated with the deduplication mechanism.  None of the 
existing applications (based on POSIX standards) has any 
understanding of deduplication, so they won't benefit from it.  If 
you use tar, cpio, or 'cp -r' to copy the contents of a directory 
tree, they will transmit just as much data as before, and if the 
destination does real-time deduplication, then the copy will be 
slower.  If the copy is to another server, then the copy time will be 
just as long as before.
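To illustrate the point, here is a toy sketch (hypothetical, not ZFS 
code) of a server doing block-level deduplication behind a 
POSIX-style write path.  The client still transmits every byte of 
every copy; only the server's stored footprint shrinks:

```python
# Toy sketch of server-side block deduplication (illustrative only).
# The "client" sends every byte; the "server" hashes each fixed-size
# block and stores only blocks it has not seen before.
import hashlib

BLOCK_SIZE = 4096

class DedupStore:
    def __init__(self):
        self.blocks = {}        # hash -> block data (stored once)
        self.files = {}         # name -> list of block hashes
        self.bytes_received = 0

    def write(self, name, data):
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            self.bytes_received += len(block)  # full copy still "crosses the wire"
            h = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(h, block)   # duplicate blocks stored only once
            hashes.append(h)
        self.files[name] = hashes

    def bytes_stored(self):
        return sum(len(b) for b in self.blocks.values())

store = DedupStore()
payload = b"x" * (8 * BLOCK_SIZE)
store.write("copy1", payload)
store.write("copy2", payload)       # e.g. 'cp -r' of the same tree
print(store.bytes_received)         # 65536 -- both copies transmitted in full
print(store.bytes_stored())         # 4096  -- only one unique block retained
```

Note that tar or cp gets no say in any of this: the dedup happens 
after the data has already been read, copied, and transmitted in 
full, which is exactly why the copy is no faster (and the hashing 
makes it slower).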

Unless the backup system fully understands and has access to the 
filesystem deduplication mechanism, it will be grossly inefficient 
just like before.  Recovery from a backup stored in a sequential (e.g. 
tape) format which does understand deduplication would be quite 
interesting indeed.

Raw storage space is cheap.  Managing the data is what is expensive.

Perhaps deduplication is a response to an issue which should be solved 
elsewhere?

Bob
======================================
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
