On 12/12/2011 12:23 PM, Richard Elling wrote:
On Dec 11, 2011, at 2:59 PM, Mertol Ozyoney wrote:

Not exactly. What is dedup'ed is the stream only, which is in fact not very
efficient. Real dedup-aware replication takes the necessary steps to
avoid sending a block that already exists on the other storage system.
These exist outside of ZFS (e.g., rsync) and scale poorly.

Given that dedup is done at the pool level and ZFS send/receive is done at
the dataset level, how would you propose implementing a dedup-aware
ZFS send command?
  -- richard

I'm with Richard.

There is no practical "optimally efficient" way to dedup a stream from one system to another. The only way to do so would be to have total information about the pool composition on BOTH the receiver and sender sides. That would involve exchanging the checksums of all the blocks in the pool between receiver and sender, which is a non-trivial overhead and, indeed, would usually be far worse than simply doing what 'zfs send -D' does now (dedup the send stream against itself). The only way such a scheme could work is if the receiver and sender were the same machine (note: not VMs or Zones on the same machine, but the same OS instance, since the DDT would have to be shared). And that's not a use case 'zfs send' is generally optimized for - that is, while it's entirely possible, it's not the primary use case for 'zfs send'.
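
For reference, the stream-side dedup that 'zfs send -D' does is conceptually simple. Here's a minimal Python sketch of the idea - the record names are simplified, but the real dedup stream works along these lines (SHA-256 checksums, with small by-reference records for repeated blocks); treat the details as illustrative, not as the actual stream format:

    import hashlib

    def dedup_stream(blocks):
        # Yield records for a self-deduped send stream: the first
        # occurrence of a checksum carries the data; repeats carry
        # only a small by-reference record.
        seen = {}                                  # checksum -> first index
        for i, block in enumerate(blocks):
            csum = hashlib.sha256(block).digest()
            if csum in seen:
                yield ("WRITE_BYREF", seen[csum])  # tiny record, no data
            else:
                seen[csum] = i
                yield ("WRITE", block)             # full block payload

    # Example: three blocks, one duplicate
    for rec in dedup_stream([b"A" * 128, b"B" * 128, b"A" * 128]):
        print(rec[0])                              # WRITE, WRITE, WRITE_BYREF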

Given the overhead of network communications, there's no way that sending block checksums between hosts can ever be more efficient than just sending the self-deduped whole stream (except in contrived corner cases). Let's look at the possible implementations (all of which assume the sending machine does its own dedup first - that is, the stream-to-be-sent is already deduped within itself):

(1) When constructing the stream, every time a block is read from a fileset (or volume), its checksum is sent to the receiving machine. The receiver looks that checksum up in its DDT and sends back a "needed" or "not-needed" reply. While this lookup is in flight, the sender must hold the original block in RAM and cannot write it out to the to-be-sent stream. (See the sketch after this list.)

(2) The sending machine reads all the to-be-sent blocks, creates a stream, AND creates a checksum table (a mini-DDT, if you will). The sender ships this mini-DDT to the receiver, which diffs it against its own master pool DDT and sends back an edited mini-DDT containing only the checksums of blocks it doesn't already have. The sender must then go back and reconstruct the stream (either as a whole, or by filtering it as it is sent) to leave out the unneeded blocks.

(3) Some combination of #1 and #2, where several checksums are batched into a single packet and sent over the wire to be checked at the destination, with the receiver sending back only those that need to be included in the stream.
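
To make #1 concrete, here's a minimal Python sketch of that per-block query protocol. Everything here is hypothetical - the receiver's DDT is stood in for by a local set, and no such protocol exists in ZFS - but it shows the structural problem: one synchronous round-trip per block.

    import hashlib

    def send_with_remote_dedup(blocks, receiver_ddt):
        # Scenario #1: ask the receiver about every block before
        # sending it. The membership test below stands in for a full
        # network round-trip, during which the block sits in RAM.
        stream = []
        for block in blocks:
            csum = hashlib.sha256(block).digest()
            if csum in receiver_ddt:               # <-- RTT stall here
                stream.append(("REF", csum))       # receiver has it already
            else:
                stream.append(("DATA", block))
                receiver_ddt.add(csum)             # receiver will store it
        return stream

    # Example: the receiver already holds one of the three blocks
    remote = {hashlib.sha256(b"B" * 128).digest()}
    kinds = [k for k, _ in send_with_remote_dedup(
        [b"A" * 128, b"B" * 128, b"C" * 128], remote)]
    print(kinds)                                   # ['DATA', 'REF', 'DATA']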


In the first scenario, you produce a huge amount of small-packet network traffic, which trashes network throughput, with no real expectation that the reduction in the send stream will be worth it. In the second, you add a large amount of latency to the construction of the send stream - the sender has to wait around and then spend a non-trivial amount of processing power essentially double-processing the stream, where the current implementation just sends data out as soon as it reads it. The third scenario is only an optimization of #1 and #2, and doesn't avoid the pitfalls of either.
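
Some back-of-the-envelope arithmetic makes the point for #1. All of these numbers are assumptions picked purely for illustration, but even ignoring per-packet overhead entirely, the serialized round-trip stalls can easily exceed the transfer time they save:

    # Illustrative costs for scenario #1 (all values assumed).
    blocks       = 1_000_000         # one million 128K blocks (~122 GB)
    block_size   = 128 * 1024        # bytes
    rtt          = 0.0005            # assume 0.5 ms per round-trip
    link_Bps     = 1e9 / 8           # assume a 1 Gbit/s link
    dup_fraction = 0.10              # assume 10% of blocks exist remotely

    stall = blocks * rtt                                   # waiting on replies
    saved = blocks * dup_fraction * block_size / link_Bps  # transfer avoided

    print(f"round-trip stalls: {stall:6.0f} s")            # ~500 s
    print(f"transfer avoided:  {saved:6.0f} s")            # ~105 s

With these assumed numbers, the scheme spends roughly five seconds of stalling for every second of transfer it saves; only a very high dedup ratio over a very slow link flips that.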

That is, even if ZFS did pool-level sends, you're still trapped by the need to share the DDT, which carries an overhead that can't reasonably be made up versus simply sending an internally-deduped source stream in the first place. I'm sure I can construct a case where such DDT sharing would beat the current 'zfs send' implementation; I'm just as sure that such a case would be a small minority of usage, and that an implementation built around it would radically hurt the performance of the typical use case.

In any case, since 'zfs send' works on filesets and volumes while ZFS maintains DDT information at the pool level, there's no way to share an existing whole DDT between two systems (and, given the potential size of a pool-level DDT, that would be a bad idea anyway).
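
For a sense of scale, using the commonly cited rule of thumb of roughly 320 bytes per in-core DDT entry (the pool numbers are made up for illustration):

    # Rough size of a whole-pool DDT (illustrative pool).
    unique_blocks = 100_000_000      # ~12 TB of unique 128K blocks
    entry_bytes   = 320              # commonly cited in-core entry size
    print(f"{unique_blocks * entry_bytes / 2**30:.0f} GiB")   # ~30 GiB

That's a lot of data to ship between hosts before the first real block ever moves.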

I see no practical way to optimize the 'zfs send/receive' concept beyond what is currently done.

-Erik