On Mon, Jul 7, 2008 at 7:40 PM, Bob Friesenhahn <[EMAIL PROTECTED]> wrote:
> The actual benefit of data deduplication to an enterprise seems
> negligible unless the backup system directly supports it. In the
> enterprise the cost of storage has more to do with backing up the data
> than the amount of storage media consumed.
Real data...

I did a survey of about 120 (mostly sparse) zone roots installed over an
18 month period and used for normal enterprise activity. Each zone root
is installed into its own SVM soft partition, with a strong effort to
isolate application data elsewhere. Each zone's /var (including
/var/tmp) was included in the survey.

My mechanism involved calculating the md5 checksum of every 4 KB block
read from the SVM raw device. This block size was chosen because it is
the fixed block size used by the one player in the market that does
deduplication of live data today.

I found 75% duplicate data - with no special effort made to minimize
duplication. If other techniques were applied to minimize duplicate
data (e.g. periodically writing zeros over free space, extending the
file system to do the same for freed blocks, mounting with noatime,
etc.), or if full root zones (or LDoms) were the subject of the test, I
would expect an even higher level of duplication.

Supposition...

As I have considered deduplication for application data, I see several
things happening in various areas.

- Multiple business application areas use very similar software.
When looking at applications that directly (a conscious choice) or
indirectly (embedded in some other product) use various web servers,
application servers, databases, etc., each application administrator
uses the same installation media to perform an installation into a
private (but commonly NFS-mounted) area. Many or most of these
applications do a full installation of Java, which accounts for a
significant fraction of the installed size.

- Maintenance activity creates duplicate data.
When patching, upgrading, or otherwise performing maintenance, it is
common to make a full copy or a fresh installation of the software.
This allows most of the maintenance activity to be performed while the
workload is live, and allows rapid fallback by making small
configuration changes. The vast majority of the data across these
multiple versions is identical (e.g. a small percentage of jars
updated, maybe a bit of the included documentation, etc.).

- Application distribution tools create duplicate data.
Some application-level clustering technologies push a significant
amount of data from the administrative server to the various cluster
members. By application server design, this is duplicate data. If that
data all resides on the same highly redundant storage frame, it could
be reduced back down to one copy (or fewer).

- Multiple development and release trees are duplicates.
When various developers check out code from a source code repository,
or a single developer keeps multiple copies to work on different
releases, the checked-out code is nearly 100% duplicate, and the
objects created during builds may be highly duplicate as well.

- Relying on storage-based snapshots and clones is impractical.
There tend to be organizational walls between those who manage storage
and those who consume it. As storage is distributed across a network
(NFS, iSCSI, FC), things like delegated datasets and RBAC are of
limited practical use. Due to these factors and likely others, storage
snapshots and clones are used only in the few cases where there is a
huge financial incentive with minimal administrative effort.
Deduplication could be deployed on the back end to do what clones
can't do, for non-technical reasons.

- Clones diverge permanently but shouldn't.
If I have a 3 GB OS image (inside an 8 GB block device) that I am
patching, there is a reasonable chance that I unzip 500 MB of patches
onto the system, apply the patches, then remove them. If deduplication
is done at the block device level (e.g. iSCSI LUNs shared from a
storage server), the space "uncloned" by extracting the patches
remains per-server used space even after the patch files are deleted.
Additionally, the space consumed by the installed patches remains in
use. Deduplication can reclaim the majority of this space.
--
Mike Gerdts
http://mgerdts.blogspot.com/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss