On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasow...@redhat.com> wrote: > > On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <j...@nutanix.com> wrote: > > > > > > > > > On Apr 6, 2025, at 7:14 PM, Jason Wang <jasow...@redhat.com> wrote: > > > > > > !-------------------------------------------------------------------| > > > CAUTION: External Email > > > > > > |-------------------------------------------------------------------! > > > > > > On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <j...@nutanix.com> wrote: > > >> > > >> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled > > >> the module parameter for the handle_tx_zerocopy path back in 2019, > > >> nothing that many downstream distributions (e.g., RHEL7 and later) had > > >> already done the same. > > >> > > >> Both upstream and downstream disablement suggest this path is rarely > > >> used. > > >> > > >> Testing the module parameter shows that while the path allows packet > > >> forwarding, the zerocopy functionality itself is broken. On outbound > > >> traffic (guest TX -> external), zerocopy SKBs are orphaned by either > > >> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit()) > > > > > > This is by design to avoid DOS. > > > > I understand that, but it makes ZC non-functional in general, as ZC fails > > and immediately increments the error counters. > > The main issue is HOL, but zerocopy may still work in some setups that > don't need to care about HOL. One example the macvtap passthrough > mode. > > > > > > > > >> or > > >> skb_orphan_frags() elsewhere in the stack, > > > > > > Basically zerocopy is expected to work for guest -> remote case, so > > > could we still hit skb_orphan_frags() in this case? > > > > Yes, you’d hit that in tun_net_xmit(). > > Only for local VM to local VM communication. > > > If you punch a hole in that *and* in the > > zc error counter (such that failed ZC doesn’t disable ZC in vhost), you get > > ZC > > from vhost; however, the network interrupt handler under net_tx_action and > > eventually incurs the memcpy under dev_queue_xmit_nit(). > > Well, yes, we need a copy if there's a packet socket. But if there's > no network interface taps, we don't need to do the copy here. >
Hi! I need more time diving into the issues. As Jon mentioned, vhost ZC is so little used I didn't have the chance to experiment with this until now :). But yes, I expect to be able to overcome these for pasta, by adapting buffer sizes or modifying code etc. > > > > This is no more performant, and in fact is actually worse since the time > > spent > > waiting on that memcpy to resolve is longer. > > > > > > > >> as vhost_net does not set > > >> SKBFL_DONT_ORPHAN. > > Maybe we can try to set this as vhost-net can hornor ulimit now. > > > >> > > >> Orphaning enforces a memcpy and triggers the completion callback, which > > >> increments the failed TX counter, effectively disabling zerocopy again. > > >> > > >> Even after addressing these issues to prevent SKB orphaning and error > > >> counter increments, performance remains poor. By default, only 64 > > >> messages can be zerocopied, which is immediately exhausted by workloads > > >> like iperf, resulting in most messages being memcpy'd anyhow. > > >> > > >> Additionally, memcpy'd messages do not benefit from the XDP batching > > >> optimizations present in the handle_tx_copy path. > > >> > > >> Given these limitations and the lack of any tangible benefits, remove > > >> zerocopy entirely to simplify the code base. > > >> > > >> Signed-off-by: Jon Kohler <j...@nutanix.com> > > > > > > Any chance we can fix those issues? Actually, we had a plan to make > > > use of vhost-net and its tx zerocopy (or even implement the rx > > > zerocopy) in pasta. > > > > Happy to take direction and ideas here, but I don’t see a clear way to fix > > these > > issues, without dealing with the assertions that skb_orphan_frags_rx calls > > out. > > > > Said another way, I’d be interested in hearing if there is a config where > > ZC in > > current host-net implementation works, as I was driving myself crazy trying > > to > > reverse engineer. > > See above. > > > > > Happy to collaborate if there is something we could do here. > > Great, we can start here by seeking a way to fix the known issues of > the vhost-net zerocopy code. > Happy to help here :). Jon, could you share more details about the orphan problem so I can speed up the help? For example, can you describe the code changes and the code path that would lead to that assertion of skb_orphan_frags_rx? Thanks!