> On Apr 30, 2025, at 9:21 PM, Jon Kohler <j...@nutanix.com> wrote:
>
>> On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin <epere...@redhat.com>
>> wrote:
>>
>> On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasow...@redhat.com> wrote:
>>>
>>> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <j...@nutanix.com> wrote:
>>>>
>>>>> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasow...@redhat.com> wrote:
>>>>>
>>>>> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <j...@nutanix.com> wrote:
>>>>>>
>>>>>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
>>>>>> the module parameter for the handle_tx_zerocopy path back in 2019,
>>>>>> noting that many downstream distributions (e.g., RHEL7 and later) had
>>>>>> already done the same.
>>>>>>
>>>>>> Both upstream and downstream disablement suggest this path is rarely
>>>>>> used.
>>>>>>
>>>>>> Testing the module parameter shows that while the path allows packet
>>>>>> forwarding, the zerocopy functionality itself is broken. On outbound
>>>>>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
>>>>>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
>>>>>
>>>>> This is by design to avoid DOS.
>>>>
>>>> I understand that, but it makes ZC non-functional in general, as ZC
>>>> fails and immediately increments the error counters.
>>>
>>> The main issue is HOL (head-of-line blocking), but zerocopy may still
>>> work in some setups that don't need to care about HOL. One example is
>>> the macvtap passthrough mode.
>>>
>>>>>> or skb_orphan_frags() elsewhere in the stack,
>>>>>
>>>>> Basically zerocopy is expected to work for the guest -> remote case,
>>>>> so could we still hit skb_orphan_frags() in this case?
>>>>
>>>> Yes, you'd hit that in tun_net_xmit().
>>>
>>> Only for local VM to local VM communication.
>
> Sure, but the tricky bit here is that if you have a mix of VM-VM and
> VM-external traffic patterns, any time the error path is hit, the zc
> error counter will go up.
>
> When that happens, ZC will get silently disabled anyhow, so it leads to
> sporadic success / non-deterministic performance.
>
>>>> If you punch a hole in that *and* in the zc error counter (such that
>>>> failed ZC doesn't disable ZC in vhost), you get ZC from vhost;
>>>> however, the network interrupt handler under net_tx_action eventually
>>>> incurs the memcpy under dev_queue_xmit_nit().
>>>
>>> Well, yes, we need a copy if there's a packet socket. But if there are
>>> no network interface taps, we don't need to do the copy here.
>
> Agreed on the packet socket side. I recently fixed an issue in lldpd [1]
> that was defeating this specific case; however, there are still other
> trip wires spread out across the stack that would need to be addressed.
>
> [1] https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717
>
>> Hi!
>>
>> I need more time diving into the issues. As Jon mentioned, vhost ZC is
>> so little used I didn't have the chance to experiment with this until
>> now :). But yes, I expect to be able to overcome these for pasta, by
>> adapting buffer sizes or modifying code etc.
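(Adding a note inline while pinging this thread, in case it helps frame the
"silently disabled" point above: below is a small userspace toy model of how
I read the zerocopy selection logic in drivers/vhost/net.c, i.e. the goodcopy
length check, the in-flight cap, and the error-ratio check. The names, the
limits, and the "/ 64" divisor are paraphrased stand-ins from memory, so
treat this strictly as a sketch, not the actual kernel code.)

#include <stdbool.h>
#include <stdio.h>

#define GOODCOPY_LEN 256 /* below this, a copy is assumed to be cheaper */
#define MAX_PEND      64 /* rough stand-in for the in-flight zerocopy cap */

struct toy_vhost_net {
	unsigned long tx_packets;   /* all TX packets seen */
	unsigned long tx_zcopy_err; /* zerocopy sends that ended up copied */
	unsigned int  pending;      /* zerocopy sends still in flight */
};

/* one "error" per 64 packets is enough to push us back to the copy path */
static bool select_zcopy(const struct toy_vhost_net *n)
{
	return n->tx_packets / 64 >= n->tx_zcopy_err;
}

static bool use_zcopy(struct toy_vhost_net *n, unsigned int len)
{
	return len >= GOODCOPY_LEN && n->pending < MAX_PEND && select_zcopy(n);
}

int main(void)
{
	struct toy_vhost_net n = { 0 };

	/* burst of large packets; any packet that does go zerocopy gets
	 * orphaned, so the completion callback counts it as a failure */
	for (int i = 0; i < 10; i++) {
		n.tx_packets++;
		if (use_zcopy(&n, 64 * 1024))
			n.tx_zcopy_err++;
	}
	printf("zcopy still selected? %s\n",
	       use_zcopy(&n, 64 * 1024) ? "yes" : "no (silently disabled)");
	return 0;
}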
>
> Another tricky bit here is that it has been disabled both upstream and
> downstream for so long, the code naturally has a bit of a
> wrench-in-the-engine problem.
>
> RE buffer sizes: I tried this as well, because I think on sufficiently
> fast systems, zero copy gets especially interesting in GSO/TSO cases
> where you have mega payloads.
>
> I tried playing around with the good copy value such that ZC restricted
> itself to only, let's say, 32K payloads and above, and while it *does*
> work (with enough holes punched in), absolute t-put doesn't actually go
> up; it's just that CPU utilization goes down a pinch. Not a bad thing
> for certain, but still not great.
>
> In fact, I found that t-put actually went down with this path, even with
> ZC occurring successfully, as there was still a mix of ZC and non-ZC
> because you can only have so many pending at any given time before the
> copy path kicks in again.
>
>>>>
>>>> This is no more performant, and in fact is actually worse, since the
>>>> time spent waiting on that memcpy to resolve is longer.
>>>>
>>>>>> as vhost_net does not set SKBFL_DONT_ORPHAN.
>>>
>>> Maybe we can try to set this, as vhost-net can honor ulimit now.
>
> Yeah, I tried that, and while it helps kick things further down the
> stack, it's not actually faster in any testing I've drummed up.
>
>>>>>>
>>>>>> Orphaning enforces a memcpy and triggers the completion callback,
>>>>>> which increments the failed TX counter, effectively disabling
>>>>>> zerocopy again.
>>>>>>
>>>>>> Even after addressing these issues to prevent SKB orphaning and
>>>>>> error counter increments, performance remains poor. By default, only
>>>>>> 64 messages can be zerocopied, which is immediately exhausted by
>>>>>> workloads like iperf, resulting in most messages being memcpy'd
>>>>>> anyhow.
>>>>>>
>>>>>> Additionally, memcpy'd messages do not benefit from the XDP batching
>>>>>> optimizations present in the handle_tx_copy path.
>>>>>>
>>>>>> Given these limitations and the lack of any tangible benefits,
>>>>>> remove zerocopy entirely to simplify the code base.
>>>>>>
>>>>>> Signed-off-by: Jon Kohler <j...@nutanix.com>
>>>>>
>>>>> Any chance we can fix those issues? Actually, we had a plan to make
>>>>> use of vhost-net and its tx zerocopy (or even implement the rx
>>>>> zerocopy) in pasta.
>>>>
>>>> Happy to take direction and ideas here, but I don't see a clear way to
>>>> fix these issues without dealing with the assertions that
>>>> skb_orphan_frags_rx calls out.
>>>>
>>>> Said another way, I'd be interested in hearing if there is a config
>>>> where ZC in the current vhost-net implementation works, as I was
>>>> driving myself crazy trying to reverse engineer it.
>>>
>>> See above.
>>>
>>>> Happy to collaborate if there is something we could do here.
>>>
>>> Great, we can start here by seeking a way to fix the known issues of
>>> the vhost-net zerocopy code.
>>
>> Happy to help here :).
>>
>> Jon, could you share more details about the orphan problem so I can
>> speed up the help? For example, can you describe the code changes and
>> the code path that would lead to that assertion of
>> skb_orphan_frags_rx?
>>
>> Thanks!
>
> Sorry for the slow response, getting back from holiday and catching up.
>
> When running through tun.c, there are a handful of places where ZC turns
> into a full copy, whether that is in the tun code itself or in the
> interrupt handler when tun xmit is running.
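(Another inline note: since the two orphan helpers keep coming up, here is a
tiny userspace model of how I read skb_orphan_frags() vs
skb_orphan_frags_rx() in include/linux/skbuff.h. The struct and field names
are simplified stand-ins rather than the real skb layout, so again this is
just a sketch of the semantics, not the kernel code.)

#include <stdbool.h>
#include <stdio.h>

struct toy_skb {
	bool zcopy_frags;  /* stands in for skb_zcopy(skb) being true */
	bool dont_orphan;  /* stands in for the SKBFL_DONT_ORPHAN hint */
};

/* stand-in for skb_copy_ubufs(): this is where the memcpy happens and the
 * zerocopy completion later reports "copied", bumping the error counter */
static int copy_ubufs(struct toy_skb *skb)
{
	printf("copying userspace frags -> zerocopy benefit lost\n");
	skb->zcopy_frags = false;
	return 0;
}

/* skb_orphan_frags(): honors the don't-orphan hint */
static int orphan_frags(struct toy_skb *skb)
{
	if (!skb->zcopy_frags || skb->dont_orphan)
		return 0;
	return copy_ubufs(skb);
}

/* skb_orphan_frags_rx(): used when the skb may loop back toward an RX/tap
 * path (e.g. tun_net_xmit, packet taps); it copies regardless of the hint */
static int orphan_frags_rx(struct toy_skb *skb)
{
	if (!skb->zcopy_frags)
		return 0;
	return copy_ubufs(skb);
}

int main(void)
{
	struct toy_skb skb = { .zcopy_frags = true, .dont_orphan = true };

	orphan_frags(&skb);    /* no copy: the hint is respected */
	orphan_frags_rx(&skb); /* copy anyway: this is the tun.c catch-22 */
	return 0;
}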
> For example, tun_net_xmit unconditionally calls skb_orphan_frags_rx.
> Anything with frags will get this memcpy, and those are of course the
> "juicy" targets here, as they take up the most memory bandwidth in
> general. Nasty catch-22 :)
>
> There are also plenty of places that call normal skb_orphan_frags, which
> triggers because vhost doesn't set SKBFL_DONT_ORPHAN. That's an easy
> fix, but still something to think about.
>
> Then there is the issue of packet sockets, which throw a king-sized
> wrench into this. It's slightly insidious: it isn't directly apparent
> that loading some user space app nukes zero copy, but it happens.
>
> See my previous comment about lldpd, where a simple compiler snafu
> caused one socket option to silently break, which in turn ripped out the
> ZC capability. Easy fix, but it's an example of how this can fall over.
>
> Bottom line, I'd *love* to have ZC work, work well, and so on. I'm open
> to ideas here :) (up to and including both A) fixing it and B) deleting
> it).
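(One last inline note, on the packet socket trap specifically: the sketch
below is not lldpd's actual code, just a minimal illustration of the failure
mode. Any AF_PACKET socket opened with ETH_P_ALL also taps locally generated
traffic, so every TX skb gets run through dev_queue_xmit_nit() and its
zerocopy frags get copied. An app that only cares about inbound frames can
opt out with PACKET_IGNORE_OUTGOING on kernels/headers that have it (4.20+).
Needs CAP_NET_RAW to run.)

#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int one = 1;
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

	if (fd < 0) {
		perror("socket(AF_PACKET)");
		return 1;
	}

	/* Without this, the mere existence of the socket forces the copy
	 * path for all outgoing zerocopy skbs on every device. */
	if (setsockopt(fd, SOL_PACKET, PACKET_IGNORE_OUTGOING,
		       &one, sizeof(one)) < 0)
		perror("PACKET_IGNORE_OUTGOING (pre-4.20 kernel?)");

	puts("tap open; outgoing traffic is now ignored by this socket");
	pause();	/* park here; closing the socket removes the tap */
	close(fd);
	return 0;
}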
Hey Eugenio - wondering if you had a chance to check out my notes on this?