On Mon, Jul 7, 2008 at 7:40 PM, Bob Friesenhahn <[EMAIL PROTECTED]> wrote:
> The actual benefit of data deduplication to an enterprise seems
> negligible unless the backup system directly supports it. In the
> enterprise the cost of storage has more to do with backing up the data
> than the amount of storage media consumed.
Real data...

I did a survey of about 120 (mostly sparse) zone roots installed over an
18 month period and used for normal enterprise activity. Each zone root
is installed into its own SVM soft partition, with a strong effort to
isolate application data elsewhere. Each zone's /var (including
/var/tmp) was included in the survey.

My mechanism involved calculating the md5 checksum of every 4 KB block
read from the SVM raw device. This block size was chosen because it is
the fixed block size used by the one player in the market that does
deduplication of live data today.

I found 75% duplicate data - with no special effort made to minimize
duplication. If other techniques were applied to minimize duplicate
data (e.g. periodically writing zeros over free space, extending the
file system to do the same for freed blocks, mounting with noatime,
etc.), or if full root zones (or LDoms) were the subject of the test, I
would expect an even higher level of duplication.

Supposition...

As I have considered deduplication for application data, I see several
things happening in various areas.

- Multiple business application areas use very similar software.
When looking at applications that directly (a conscious choice) or
indirectly (embedded in some other product) use various web servers,
application servers, databases, etc., each application administrator
uses the same installation media to perform an installation into a
private (but commonly NFS-mounted) area. Many or most of these
applications do a full installation of Java, which accounts for a
significant fraction of the installed size.

- Maintenance activity creates duplicate data.
When patching, upgrading, or otherwise performing maintenance, it is
common to make a full copy or a fresh installation of the software.
This allows most of the maintenance activity to be performed while the
workload is live, and allows rapid fallback by making small
configuration changes. The vast majority of the data across these
multiple versions is identical (e.g. a small percentage of jars
updated, maybe a bit of the included documentation, etc.).

- Application distribution tools create duplicate data.
Some application-level clustering technologies push a significant
amount of data from the administrative server to the various cluster
members. By application server design, this is duplicate data. If that
data all resides on the same highly redundant storage frame, it could
be reduced back down to one copy (or fewer).

- Multiple development and release trees are duplicates.
When various developers check out code from a source code repository,
or a single developer keeps multiple copies to work on different
releases, the checked-out code is nearly 100% duplicate, and the
objects created during builds may be highly duplicate as well.

- Relying on storage-based snapshots and clones is impractical.
There tend to be organizational walls between those who manage storage
and those who consume it. As storage is distributed across a network
(NFS, iSCSI, FC), things like delegated datasets and RBAC are of
limited practical use. Due to these factors and likely others, storage
snapshots and clones are used only in the few cases where there is a
huge financial incentive with minimal administrative effort.
Deduplication could be deployed on the back end to do what clones
can't do, for non-technical reasons.

- Clones diverge permanently but shouldn't.
If I have a 3 GB OS image (inside an 8 GB block device) that I am
patching, there is a reasonable chance that I unzip 500 MB of patches
onto the system, apply the patches, then remove them. If deduplication
is done at the block device level (e.g. iSCSI LUNs shared from a
storage server), the space "uncloned" by extracting the patches
remains per-server used space even after the patch files are deleted.
Additionally, the space consumed by the installed patches remains in
use. Deduplication can reclaim the majority of this space.
--
Mike Gerdts
http://mgerdts.blogspot.com/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss