On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens <mahr...@delphix.com> wrote:
> On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble <tr...@netdemons.com> wrote:
>> (1) when constructing the stream, every time a block is read from a fileset
>> (or volume), its checksum is sent to the receiving machine. The receiving
>> machine then looks up that checksum in its DDT, and sends back a "needed" or
>> "not-needed" reply to the sender. While this lookup is being done, the
>> sender must hold the original block in RAM, and cannot write it out to the
>> to-be-sent-stream.
> ...
>> you produce a huge amount of small network packet
>> traffic, which trashes network throughput
>
> This seems like a valid approach to me. When constructing the stream,
> the sender need not read the actual data, just the checksum in the
> indirect block. So there is nothing that the sender "must hold in
> RAM". There is no need to create small (or synchronous) network
> packets, because sender need not wait for the receiver to determine if
> it needs the block or not. There can be multiple asynchronous
> communication streams: one where the sender sends all the checksums
> to the receiver; another where the receiver requests blocks that it
> does not have from the sender; and another where the sender sends
> requested blocks back to the receiver. Implementing this may not be
> trivial, and in some cases it will not improve on the current
> implementation. But in others it would be a considerable improvement.
Right, you'd want to let the socket/transport buffer and flow-control the
writes of "I have this new block checksum" messages from the zfs sender and
"I need the block with this checksum" messages from the zfs receiver. I like
this. (Rough sketches of the three-stream exchange and of the receive-side
rendezvous are in the postscripts below.)

A separate channel for the bulk data is definitely recommended for flow
control reasons, but if you do that then securing the transport gets
complicated: you couldn't just zfs send .. | ssh ... zfs receive. You could
use SSH channel multiplexing, but that will net you lousy performance (well,
no lousier than one already gets with SSH anyway)[*]. (And SunSSH lacks this
feature anyway.) It'd then begin to pay to have a bona fide zfs send network
protocol, and now we're talking about significantly more work.

Another option would be to have send/receive options to create the two
separate channels, so one would do something like:

  % zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs receive --dedup-control-channel ... &
  % zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive --dedup-bulk-channel
  % wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it layers flow-controlled channels over a
flow-controlled, congestion-controlled connection (TCP), and not enough
information flows up from TCP to SSHv2 to make this work well. Worse, an
SSHv2 channel's window cannot shrink except by the sender consuming it, which
makes it impossible to mix high-bandwidth bulk data and small control
messages over a congested link. In practice, then, SSHv2 channels have to
have relatively small windows, and since a channel can move at most one
window of data per round trip, small windows force the protocol to work very
synchronously (i.e., with effectively synchronous ACKs of bulk data): a
64 KiB window over a 50 ms round trip caps the channel at roughly 1.3 MB/s
no matter how fat the pipe is. I now believe the idea of mixing bulk and
non-bulk data over a single TCP connection in SSHv2 is a failure. SSHv2 over
SCTP, or over multiple TCP connections, would be much better.

Nico
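
P.S. For the archives, here's a rough sketch of the three-stream exchange
Matthew describes. It's Go rather than anything in the tree, every name in
it is invented, and buffered channels stand in for the flow-controlled
transports; the only point is that no side ever stops for a per-block round
trip:

  package main

  import (
      "fmt"
      "sync"
  )

  // cksum stands in for the block checksum the sender already has on hand
  // in the indirect block; no block data is read just to announce it.
  type cksum string

  func main() {
      // Hypothetical sender-side dataset: checksum -> block contents.
      sendSide := map[cksum][]byte{
          "aaa": []byte("block one"),
          "bbb": []byte("block two"),
          "ccc": []byte("block three"),
      }
      // The receiver's DDT already holds "bbb", so it will only ever
      // ask for the other two blocks.
      recvDDT := map[cksum]bool{"bbb": true}

      // Three independent streams; the buffering plays the role the
      // socket/transport flow control would play on the wire.
      announce := make(chan cksum, 16) // sender -> receiver: checksums
      request := make(chan cksum, 16)  // receiver -> sender: misses
      bulk := make(chan []byte, 16)    // sender -> receiver: block data

      var wg sync.WaitGroup
      wg.Add(3)

      // Sender, stream 1: announce every checksum; never wait for a reply.
      go func() {
          defer wg.Done()
          for c := range sendSide {
              announce <- c
          }
          close(announce)
      }()

      // Receiver: look each checksum up in the DDT, request only misses.
      go func() {
          defer wg.Done()
          for c := range announce {
              if !recvDDT[c] {
                  request <- c
              }
          }
          close(request)
      }()

      // Sender, stream 3: read and ship only the blocks actually asked for.
      go func() {
          defer wg.Done()
          for c := range request {
              bulk <- sendSide[c]
          }
          close(bulk)
      }()

      // Receiver's bulk side: write out the new blocks as they arrive.
      for data := range bulk {
          fmt.Printf("received new block: %q\n", data)
      }
      wg.Wait()
  }

Each of the three loops runs as fast as its own stream's flow control
allows, which is exactly the property you lose if the sender has to wait
for a needed/not-needed answer per block.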
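
P.P.S. The rendezvous between the two receives could be as dumb as a
well-known local socket. A sketch, again in Go, and again with everything
(the socket path especially) invented purely for illustration:

  package main

  import (
      "bufio"
      "fmt"
      "net"
      "os"
  )

  // Invented rendezvous point; a real implementation would presumably
  // derive it from the target dataset name.
  const sockPath = "/tmp/zfs-recv-rendezvous.sock"

  func main() {
      ln, err := net.Listen("unix", sockPath)
      if err == nil {
          // We got here first: act as the control-channel receive and
          // wait for the bulk-channel receive to check in.
          defer os.Remove(sockPath)
          conn, err := ln.Accept()
          if err != nil {
              panic(err)
          }
          msg, _ := bufio.NewReader(conn).ReadString('\n')
          fmt.Printf("control receive: peer checked in: %s", msg)
          conn.Close()
      } else {
          // Listen failed, so (simplistically) we assume the control
          // receive is already there: check in as the bulk side.
          conn, err := net.Dial("unix", sockPath)
          if err != nil {
              panic(err)
          }
          fmt.Fprintln(conn, "bulk channel ready")
          conn.Close()
      }
  }

Run it twice and the second invocation finds the first; from there the two
sides would exchange whatever state the control and bulk halves of the
stream need to agree on before data starts flowing.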