On 5/4/2011 9:57 AM, Ray Van Dolson wrote:
There are a number of threads (this one[1] for example) that describe
memory requirements for deduplication. They're pretty high.
I'm trying to get a better understanding... on our NetApps we use 4K
block sizes with their post-process deduplication and get pretty good
dedupe ratios for VM content.
With ZFS we are using 128K record sizes by default, which nets us less
impressive savings... however, to drop to a 4K record size would
theoretically require that we have nearly 40GB of memory for only 1TB
of storage (based on 150 bytes per block for the DDT).
This obviously becomes prohibitive for 10+ TB file systems.
I will note that our NetApps are using only 2TB FlexVols, but would
like to better understand ZFS's (apparently) higher memory
requirements... or maybe I'm missing something entirely.
Thanks,
Ray
[1] http://markmail.org/message/wile6kawka6qnjdw
I'm not familiar with NetApp's implementation, so I can't speak to why
it might appear to use fewer resources.
However, there are a couple of possible issues here:
(1) Pre-write vs Post-write Deduplication.
ZFS does pre-write dedup, where it looks for duplicates before
it writes anything to disk. In order to do pre-write dedup, you really
have to store the ENTIRE deduplication block lookup table in some sort
of fast (random) access media, realistically Flash or RAM. The win is
that you get significantly lower disk utilization (i.e. better I/O
performance), as (potentially) much less data is actually written to disk.
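To make that concrete, here is a toy Python sketch of an inline (pre-write)
dedup write path - purely an illustration of the technique, not ZFS code; the
checksum choice and the in-memory dict stand in for the real DDT:

    import hashlib

    ddt = {}          # checksum -> (block address, refcount); must be fast to probe
    next_addr = 0

    def write_block(data):
        """Return the address holding 'data', writing it only if it is new."""
        global next_addr
        key = hashlib.sha256(data).digest()
        if key in ddt:                        # duplicate: no data write at all
            addr, refs = ddt[key]
            ddt[key] = (addr, refs + 1)
            return addr
        addr, next_addr = next_addr, next_addr + 1
        ddt[key] = (addr, 1)                  # unique: allocate, write, record it
        return addr

The point is that the table gets probed on every single write, which is why
the whole thing has to sit somewhere with fast random access.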
Post-write Dedup is done via batch processing - that is, such a
design has the system periodically scan the saved data, looking for
duplicates. While this method also greatly benefits from being able to
store the dedup table in fast random storage, it's not anywhere near as
critical. The downside here is that you see much higher disk utilization
- the system must first write all new data to disk (without looking for
dedup), and then must also perform significant I/O later on to do the dedup.
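By contrast, a post-process scheme looks roughly like this sketch (again, an
illustration only, not NetApp's implementation): everything is written first,
and a later pass re-reads the data to find and collapse duplicates, which is
where the extra I/O comes from:

    import hashlib

    def dedup_pass(blocks):
        """blocks: dict of address -> data already on disk.
        Returns a remap of duplicate addresses to the surviving copy."""
        seen = {}     # checksum -> first address seen with that content
        remap = {}
        for addr, data in blocks.items():
            key = hashlib.sha256(data).digest()
            if key in seen:
                remap[addr] = seen[key]       # free this block, reference the copy
            else:
                seen[key] = addr
        return remap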
(2) Block size: a 4k block size will yield better dedup than a 128k
block size, presuming reasonable data turnover. This is inherent, as
any single bit change in a block will make it non-duplicated. With 32x
the block size, there is a much greater chance that a small change in
data will cause a large loss of dedup ratio. That is, 4k blocks
should almost always yield much better dedup ratios than larger ones.
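A quick back-of-the-envelope illustration (the workload numbers are invented):
suppose a fully deduplicated set of VM images takes 1000 scattered 4 KiB guest
writes, each landing in a different record. Every touched record becomes
unique data:

    KiB = 1 << 10
    writes = 1000                       # assumed scattered 4 KiB modifications

    for recordsize in (4 * KiB, 128 * KiB):
        unique = writes * recordsize    # worst case: one whole record per write
        print("%3d KiB records: %6d KiB of formerly-deduped data now unique"
              % (recordsize // KiB, unique // KiB))

Same writes, 32x the dedup loss with 128k records.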
Also, remember that the ZFS block size is a SUGGESTION for zfs
filesystems (i.e. it will use UP TO that block size, but not always that
size), but is FIXED for zvols.
(3) Method of storing (and data stored in) the dedup table.
ZFS's current design is (IMHO) rather piggy on DDT and L2ARC
lookup requirements. Right now, ZFS requires a record in the ARC (RAM)
for each L2ARC (cache) entry, PLUS the actual L2ARC entry. So, it
boils down to 500+ bytes of combined L2ARC & RAM usage per block entry
in the DDT. Also, the actual DDT entry itself is perhaps larger than
absolutely necessary.
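Plugging in the per-entry figures from this thread (Ray's ~150 bytes per DDT
entry and the ~500 bytes of combined ARC + L2ARC usage above - rules of thumb,
not exact structure sizes):

    TiB = 1 << 40
    GiB = 1 << 30

    def ddt_cost(pool_bytes, block_size, bytes_per_entry):
        entries = pool_bytes // block_size      # worst case: every block unique
        return entries * bytes_per_entry

    for bs in (4 * 1024, 128 * 1024):
        bare = ddt_cost(1 * TiB, bs, 150)       # DDT entry alone
        full = ddt_cost(1 * TiB, bs, 500)       # plus ARC header + L2ARC entry
        print("%4dK blocks: ~%5.1f GiB of DDT, ~%5.1f GiB ARC+L2ARC"
              % (bs // 1024, bare / GiB, full / GiB))

That's where Ray's ~40GB-per-TB figure for 4k blocks comes from, and why the
combined ARC + L2ARC overhead is even worse.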
I suspect that NetApp does the following to limit their
resource usage: they presume the presence of some sort of cache that
can be dedicated to the DDT (and, since they also control the hardware,
they can make sure there is always one present). Thus, they can make
their code completely avoid the need for an equivalent to the ARC-based
lookup. In addition, I suspect they have a smaller DDT entry itself.
That probably boils down to needing about 50% of ZFS's total resource
consumption, with NO (or an extremely small, fixed) RAM requirement.
Honestly, ZFS's cache (L2ARC) requirements aren't really a problem. The
big issue is the ARC requirement, which, until it can be seriously
reduced (or, best case, simply eliminated), really is a significant
barrier to adoption of ZFS dedup.
Right now, ZFS treats DDT entries like any other data or metadata in how
they age from ARC to L2ARC to gone. IMHO, the better way to do this is
to simply require the DDT to be entirely stored on the L2ARC (if present),
and never keep any DDT info in the ARC at all (that is, the ARC
should contain a pointer to the DDT in the L2ARC, and that's it,
regardless of the amount or frequency of access of the DDT). Frankly,
at this point, I'd almost change the design to REQUIRE an L2ARC device in
order to turn on Dedup.
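Rough numbers on what that would save, using the same 1 TB / 4k example (the
split of the ~500 combined bytes between the RAM header and the L2ARC-resident
entry is my guess, purely for illustration):

    GiB = 1 << 30
    entries = (1 << 40) // 4096     # 1 TiB of 4 KiB blocks, worst case all unique
    arc_header  = 180               # assumed per-entry RAM overhead today
    l2arc_entry = 320               # assumed on-L2ARC DDT entry size

    ram_today    = entries * arc_header      # grows linearly with the DDT
    ram_proposed = 64                        # a pointer/index to the L2ARC-resident DDT
    print("today:    ~%.1f GiB of ARC just for DDT bookkeeping" % (ram_today / GiB))
    print("proposed: ~%d bytes of ARC; per-entry cost stays on the L2ARC device"
          % ram_proposed)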
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss