On Thu, Dec 29, 2011 at 6:44 PM, Matthew Ahrens <mahr...@delphix.com> wrote:
> On Mon, Dec 12, 2011 at 11:04 PM, Erik Trimble <tr...@netdemons.com> wrote:
>> (1) when constructing the stream, every time a block is read from a fileset
>> (or volume), its checksum is sent to the receiving machine. The receiving
>> machine then looks up that checksum in its DDT, and sends back a "needed" or
>> "not-needed" reply to the sender. While this lookup is being done, the
>> sender must hold the original block in RAM, and cannot write it out to the
>> to-be-sent-stream.
> ...
>> you produce a huge amount of small network packet
>> traffic, which trashes network throughput
>
> This seems like a valid approach to me. When constructing the stream,
> the sender need not read the actual data, just the checksum in the
> indirect block. So there is nothing that the sender "must hold in
> RAM". There is no need to create small (or synchronous) network
> packets, because sender need not wait for the receiver to determine if
> it needs the block or not. There can be multiple asynchronous
> communication streams: one where the sender sends all the checksums
> to the receiver; another where the receiver requests blocks that it
> does not have from the sender; and another where the sender sends
> requested blocks back to the receiver. Implementing this may not be
> trivial, and in some cases it will not improve on the current
> implementation. But in others it would be a considerable improvement.
Right, you'd want to let the socket/transport buffer and flow-control the
writes of "I have this new block checksum" messages from the zfs sender and
"I need the block with this checksum" messages from the zfs receiver. I like
this. (Rough sketches of the three-stream exchange and of the receive-side
rendezvous are in the postscripts below.)

A separate channel for the bulk data is definitely recommended for flow
control reasons, but if you do that then securing the transport gets
complicated: you couldn't just zfs send .. | ssh ... zfs receive. You could
use SSH channel multiplexing, but that will net you lousy performance (well,
no lousier than one already gets with SSH anyway)[*]. (And SunSSH lacks this
feature anyway.) It'd then begin to pay to have a bona fide zfs send network
protocol, and now we're talking about significantly more work.

Another option would be to have send/receive options to create the two
separate channels, so one would do something like:

  % zfs send --dedup-control-channel ... | ssh-or-netcat-or... zfs receive --dedup-control-channel ... &
  % zfs send --dedup-bulk-channel ... | ssh-or-netcat-or... zfs receive --dedup-bulk-channel
  % wait

The second zfs receive would rendezvous with the first and go from there.

[*] The problem with SSHv2 is that it layers flow-controlled channels over a
flow-controlled, congestion-controlled connection (TCP), and not enough
information flows up from TCP to SSHv2 to make this work well. Worse, an
SSHv2 channel's window cannot shrink except by the sender consuming it, which
makes it impossible to mix high-bandwidth bulk data and small control
messages over a congested link. In practice, then, SSHv2 channels have to
have relatively small windows, and since a channel can move at most one
window of data per round trip, small windows force the protocol to work very
synchronously (i.e., with effectively synchronous ACKs of bulk data): a
64 KiB window over a 50 ms round trip caps the channel at roughly 1.3 MB/s
no matter how fat the pipe is. I now believe the idea of mixing bulk and
non-bulk data over a single TCP connection in SSHv2 is a failure. SSHv2 over
SCTP, or over multiple TCP connections, would be much better.

Nico
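
P.S. For the archives, here's a rough sketch of the three-stream exchange
Matthew describes. It's Go rather than anything in the tree, every name in
it is invented, and buffered channels stand in for the flow-controlled
transports; the only point is that no side ever stops for a per-block round
trip:

  package main

  import (
      "fmt"
      "sync"
  )

  // cksum stands in for the block checksum the sender already has on hand
  // in the indirect block; no block data is read just to announce it.
  type cksum string

  func main() {
      // Hypothetical sender-side dataset: checksum -> block contents.
      sendSide := map[cksum][]byte{
          "aaa": []byte("block one"),
          "bbb": []byte("block two"),
          "ccc": []byte("block three"),
      }
      // The receiver's DDT already holds "bbb", so it will only ever
      // ask for the other two blocks.
      recvDDT := map[cksum]bool{"bbb": true}

      // Three independent streams; the buffering plays the role the
      // socket/transport flow control would play on the wire.
      announce := make(chan cksum, 16) // sender -> receiver: checksums
      request := make(chan cksum, 16)  // receiver -> sender: misses
      bulk := make(chan []byte, 16)    // sender -> receiver: block data

      var wg sync.WaitGroup
      wg.Add(3)

      // Sender, stream 1: announce every checksum; never wait for a reply.
      go func() {
          defer wg.Done()
          for c := range sendSide {
              announce <- c
          }
          close(announce)
      }()

      // Receiver: look each checksum up in the DDT, request only misses.
      go func() {
          defer wg.Done()
          for c := range announce {
              if !recvDDT[c] {
                  request <- c
              }
          }
          close(request)
      }()

      // Sender, stream 3: read and ship only the blocks actually asked for.
      go func() {
          defer wg.Done()
          for c := range request {
              bulk <- sendSide[c]
          }
          close(bulk)
      }()

      // Receiver's bulk side: write out the new blocks as they arrive.
      for data := range bulk {
          fmt.Printf("received new block: %q\n", data)
      }
      wg.Wait()
  }

Each of the three loops runs as fast as its own stream's flow control
allows, which is exactly the property you lose if the sender has to wait
for a needed/not-needed answer per block.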
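
P.P.S. The rendezvous between the two receives could be as dumb as a
well-known local socket. A sketch, again in Go, and again with everything
(the socket path especially) invented purely for illustration:

  package main

  import (
      "bufio"
      "fmt"
      "net"
      "os"
  )

  // Invented rendezvous point; a real implementation would presumably
  // derive it from the target dataset name.
  const sockPath = "/tmp/zfs-recv-rendezvous.sock"

  func main() {
      ln, err := net.Listen("unix", sockPath)
      if err == nil {
          // We got here first: act as the control-channel receive and
          // wait for the bulk-channel receive to check in.
          defer os.Remove(sockPath)
          conn, err := ln.Accept()
          if err != nil {
              panic(err)
          }
          msg, _ := bufio.NewReader(conn).ReadString('\n')
          fmt.Printf("control receive: peer checked in: %s", msg)
          conn.Close()
      } else {
          // Listen failed, so (simplistically) we assume the control
          // receive is already there: check in as the bulk side.
          conn, err := net.Dial("unix", sockPath)
          if err != nil {
              panic(err)
          }
          fmt.Fprintln(conn, "bulk channel ready")
          conn.Close()
      }
  }

Run it twice and the second invocation finds the first; from there the two
sides would exchange whatever state the control and bulk halves of the
stream need to agree on before data starts flowing.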