Item :  Support for file-system / volume / san dedup for file devices

Date:   10 Feb 2010

Origin: Darren Mackay (Velitium)

Status:

What:   File devices should provide support for block based deduplication
provided by the underlying file-systems / volume manager / san.

Why:    A number of file-systems / volume managers / sans now provide block
based deduplication. For block level dedup, it is not uncommon for
deduplication ratios to be 3x, 4x, or 5x for unstructured data.

Currently it appears (forgive me and advise if this is actually incorrect,
as this is drawn from a number of forum posts) that the bacula storage
daemon is packing the data-stream back-to-back, which prevents block based
deduplication as the data-stream is not aligned to blocks as defined by the
underlying storage device. I have also read several posts that indicate that
bacula may multiplex data streams, which in the case of underlying dedup,
would further prevent dedup from being performed.
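
To illustrate the alignment problem described above, here is a minimal
sketch in Python. It is not bacula code: the 8 KiB block size, the header
lengths and the pack() helper are all assumptions made up for the example.
It packs the same files back-to-back twice, with per-file headers that
differ by a single byte, and counts how many fixed-size blocks the two
streams have in common.

```python
import hashlib
import os

BLOCK = 8192  # assumed dedup block size of the underlying storage


def block_hashes(stream, block=BLOCK):
    """Hash every fixed-size block, roughly as a block-level dedup engine would."""
    return {
        hashlib.sha256(stream[i:i + block]).digest()
        for i in range(0, len(stream), block)
    }


def pack(files, header_len):
    """Hypothetical packer: each file is preceded by header_len header bytes."""
    out = bytearray()
    for data in files:
        out += b"H" * header_len
        out += data
    return bytes(out)


files = [os.urandom(64 * 1024) for _ in range(4)]  # four files of 64 KiB each

first = pack(files, header_len=100)
second = pack(files, header_len=101)  # same file data, headers one byte longer

shared = block_hashes(first) & block_hashes(second)
print(len(shared), "of", len(block_hashes(first)), "blocks are shared")
```

Although the file data is identical in both streams, every block boundary
in the second stream falls at a different offset within the data, so no
block matches.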

Allowing for dedup in the underlying file-system / volume / san would also
alleviate the need for sysadmins to tune baselines between different hosts
which use the same storage daemon file device(s).

Notes:

Based on limited testing, some dedup can be performed, but the number of
duplicate blocks detected is limited. For instance, consecutive full backups
from a single client machine (approx 200GB, both o/s and unstructured file
data) with only a single concurrent job should have resulted in a significant
portion of the backup being detected as duplicate blocks by the underlying
storage (OpenSolaris ZFS in this case). However, the actual amount of dedup
detected for the 2nd full backup was approx 70k blocks (~ 8.5GB). Subsequent
runs of the full backup yielded similar results. Allowing for metadata, I
would have expected at least 80% of the full backup to dedup.
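
For scale, the figures above can be put side by side with a few lines of
Python. The 128 KiB record size is an assumption (it is consistent with
70k blocks being roughly 8.5GB), and sizes are treated as binary units.

```python
GIB = 1024 ** 3

dup_blocks = 70_000        # duplicate blocks detected (approximate figure)
record_size = 128 * 1024   # ASSUMPTION: 128 KiB ZFS records
backup_size = 200 * GIB    # approximate size of one full backup
expected_share = 0.80      # share of the backup expected to dedup

dup_bytes = dup_blocks * record_size
observed_share = dup_bytes / backup_size

print(f"deduplicated:   {dup_bytes / GIB:.1f} GiB")   # deduplicated:   8.5 GiB
print(f"observed share: {observed_share:.1%}")        # observed share: 4.3%
print(f"expected share: {expected_share:.0%}")        # expected share: 80%
```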

Several levels of dedup support are possible, which could be implemented in
a staged approach.

Phase 1 - File device dedup support
- This would allow for dedup between file devices on the same system.
- Add padding at the end of each file to a user configurable block size.

   DedupBlockSize = 8k (configurable, in bytes)

- If the configuration option is missing, then disable all support for
underlying dedup for file devices.
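
A minimal sketch of the Phase 1 padding rule in Python (not bacula code).
The helper names and the zero fill byte are assumptions; only the
DedupBlockSize option and its "8k" example come from the proposal above.
A block size of None stands for the option being missing.

```python
def parse_size(text):
    """Parse sizes such as '8k', '1m' or '8192' into a number of bytes."""
    units = {"k": 1024, "m": 1024 ** 2}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def padding_needed(length, block_size):
    """Bytes to append so that length becomes a multiple of block_size."""
    return -length % block_size


def pad_file_data(data, block_size):
    """Pad one file's data to the block size; None means dedup support is off."""
    if block_size is None:
        return data  # option missing: write the data unpadded, as today
    return data + b"\0" * padding_needed(len(data), block_size)


block_size = parse_size("8k")                 # DedupBlockSize = 8k
print(block_size)                             # 8192
print(padding_needed(100, block_size))        # 8092
print(padding_needed(8192, block_size))       # 0
```

The cost is at most one block of padding per file, so on average small
files waste proportionally more space than large ones.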

Phase 2 - Autodetection of dedup supported file-systems
- When dedup is provided by the host o/s of the file system device, the
storage daemon should detect if dedup is enabled for the file device
location. For Solaris / OpenSolaris ZFS, this value is available through the
filesystem extended properties. In this case, if dedup is enabled for the
ZFS filesystem, the storage daemon should read the filesystem block size and
use this value. (note - ZFS also uses variable block sizes, and thus will
only allocate the required size if the requirement is less than the actual
block size)
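
As a rough sketch of how the Phase 2 detection could look for ZFS, the
Python below reads the dedup and recordsize properties with the zfs get
command. It is not bacula code; the function names and the dataset name
are hypothetical, and it assumes the installed zfs supports the -H
(script-friendly output) and -p (exact numeric values) options.

```python
import subprocess


def parse_zfs_props(output):
    """Parse tab-separated 'property<TAB>value' lines into a dict."""
    props = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, value = line.split("\t", 1)
        props[name] = value.strip()
    return props


def dedup_block_size(props):
    """Return the record size in bytes if dedup is enabled, else None."""
    if props.get("dedup", "off") == "off":
        return None
    return int(props["recordsize"])  # exact byte count, thanks to -p


def query_dataset(dataset):
    """Ask zfs for the dedup settings of one dataset (needs zfs installed)."""
    result = subprocess.run(
        ["zfs", "get", "-H", "-p", "-o", "property,value",
         "dedup,recordsize", dataset],
        check=True, capture_output=True, text=True,
    )
    return dedup_block_size(parse_zfs_props(result.stdout))


# Parsing can be exercised without a ZFS system by using sample output:
sample = "dedup\ton\nrecordsize\t131072\n"
print(dedup_block_size(parse_zfs_props(sample)))  # 131072
```

On a host with ZFS, query_dataset("tank/bacula") would be the entry point,
where "tank/bacula" is only a placeholder for the dataset holding the file
device.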

Phase 3 - Alignment of the datastream to underlying file-system blocks and
separation of bacula metadata into separate blocks
- This would allow for underlying storage system deduplication between both
bacula file devices and real data stored elsewhere on the file-system /
volume / san.
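
The Phase 3 idea can be sketched in Python as follows. This is only an
illustration under assumptions: the record layout, the metadata string and
the 8 KiB block size are invented for the example and do not reflect the
actual bacula volume format. Metadata and file data are each padded to a
block boundary, so the data blocks in the volume are byte-for-byte the same
as the blocks of the original file.

```python
import hashlib
import os

BLOCK = 8192  # assumed block size of the underlying storage


def pad_to_block(data, block=BLOCK):
    """Zero-pad data so that its length is a multiple of the block size."""
    return data + b"\0" * (-len(data) % block)


def write_aligned(records, block=BLOCK):
    """Hypothetical layout: metadata and payload each start on a block boundary."""
    out = bytearray()
    for metadata, payload in records:
        out += pad_to_block(metadata, block)
        out += pad_to_block(payload, block)
    return bytes(out)


def block_hashes(stream, block=BLOCK):
    """Hash every fixed-size block of a byte string."""
    return {
        hashlib.sha256(stream[i:i + block]).digest()
        for i in range(0, len(stream), block)
    }


payload = os.urandom(5 * BLOCK)  # stands in for a file stored elsewhere
volume = write_aligned([(b"job=1 file=/data/a", payload)])

# Every block of the original file also appears in the volume.
print(block_hashes(payload) <= block_hashes(volume))  # True
```

Because the metadata sits in its own blocks, a change in its length between
jobs no longer shifts the file data that follows it.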