On Mon, May 12, 2025 at 5:21 PM Jon Kohler <j...@nutanix.com> wrote:
>
> > On Apr 30, 2025, at 9:21 PM, Jon Kohler <j...@nutanix.com> wrote:
> >
> >> On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin <epere...@redhat.com> wrote:
> >>
> >> On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>
> >>> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <j...@nutanix.com> wrote:
> >>>>
> >>>>> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasow...@redhat.com> wrote:
> >>>>>
> >>>>> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <j...@nutanix.com> wrote:
> >>>>>>
> >>>>>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
> >>>>>> the module parameter for the handle_tx_zerocopy path back in 2019,
> >>>>>> noting that many downstream distributions (e.g., RHEL7 and later) had
> >>>>>> already done the same.
> >>>>>>
> >>>>>> Both upstream and downstream disablement suggest this path is rarely
> >>>>>> used.
> >>>>>>
> >>>>>> Testing the module parameter shows that while the path allows packet
> >>>>>> forwarding, the zerocopy functionality itself is broken. On outbound
> >>>>>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
> >>>>>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
> >>>>>
> >>>>> This is by design to avoid DoS.
> >>>>
> >>>> I understand that, but it makes ZC non-functional in general, as ZC fails
> >>>> and immediately increments the error counters.
> >>>
> >>> The main issue is HOL, but zerocopy may still work in some setups that
> >>> don't need to care about HOL.
> >>> One example is the macvtap passthrough mode.
> >>>
> >>>>>> or skb_orphan_frags() elsewhere in the stack,
> >>>>>
> >>>>> Basically zerocopy is expected to work for the guest -> remote case, so
> >>>>> could we still hit skb_orphan_frags() in this case?
> >>>>
> >>>> Yes, you'd hit that in tun_net_xmit().
> >>>
> >>> Only for local VM to local VM communication.
> >
> > Sure, but the tricky bit here is that if you have a mix of VM-to-VM and
> > VM-to-external traffic patterns, any time the error path is hit, the ZC
> > error counter will go up.
> >
> > When that happens, ZC will get silently disabled anyhow, so it leads to
> > sporadic success / non-deterministic performance.
> >
> >>>> If you punch a hole in that *and* in the ZC error counter (such that
> >>>> failed ZC doesn't disable ZC in vhost), you get ZC from vhost; however,
> >>>> the network interrupt handler, under net_tx_action, eventually incurs
> >>>> the memcpy under dev_queue_xmit_nit().
> >>>
> >>> Well, yes, we need a copy if there's a packet socket. But if there are
> >>> no network interface taps, we don't need to do the copy here.
> >
> > Agreed on the packet socket side. I recently fixed an issue in lldpd [1]
> > that prevented this specific case; however, there are still other trip
> > wires spread out across the stack that would need to be addressed.
> >
> > [1] https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717
> >
> >> Hi!
> >>
> >> I need more time diving into the issues. As Jon mentioned, vhost ZC is
> >> so little used I didn't have the chance to experiment with this until
> >> now :). But yes, I expect to be able to overcome these for pasta, by
> >> adapting buffer sizes or modifying code, etc.
> >
> > Another tricky bit here is that it has been disabled both upstream and
> > downstream for so long that the code naturally has a bit of a
> > wrench-in-the-engine.
> > RE buffer sizes: I tried this as well, because I think on sufficiently
> > fast systems, zero copy gets especially interesting in GSO/TSO cases
> > where you have mega payloads.
> >
> > I tried playing around with the good-copy value such that ZC restricted
> > itself to only, say, 32K payloads and above, and while it *does* work
> > (with enough holes punched in), absolute throughput doesn't actually go
> > up; it's just that CPU utilization goes down a pinch. Not a bad thing
> > for certain, but still not great.
> >
> > In fact, I found that throughput actually went down with this path, even
> > with ZC occurring successfully, as there was still a mix of ZC and
> > non-ZC, because you can only have so many pending at any given time
> > before the copy path kicks in again.
> >
> >>>> This is no more performant, and in fact is actually worse, since the
> >>>> time spent waiting on that memcpy to resolve is longer.
> >>>>
> >>>>>> as vhost_net does not set
> >>>>>> SKBFL_DONT_ORPHAN.
> >>>
> >>> Maybe we can try to set this, as vhost-net can honor ulimit now.
> >
> > Yeah, I tried that, and while it helps kick things further down the
> > stack, it's not actually faster in any testing I've drummed up.
> >
> >>>>>> Orphaning enforces a memcpy and triggers the completion callback, which
> >>>>>> increments the failed TX counter, effectively disabling zerocopy again.
> >>>>>>
> >>>>>> Even after addressing these issues to prevent SKB orphaning and error
> >>>>>> counter increments, performance remains poor. By default, only 64
> >>>>>> messages can be zerocopied, which is immediately exhausted by workloads
> >>>>>> like iperf, resulting in most messages being memcpy'd anyhow.
> >>>>>>
> >>>>>> Additionally, memcpy'd messages do not benefit from the XDP batching
> >>>>>> optimizations present in the handle_tx_copy path.
> >>>>>> Given these limitations and the lack of any tangible benefits, remove
> >>>>>> zerocopy entirely to simplify the code base.
> >>>>>>
> >>>>>> Signed-off-by: Jon Kohler <j...@nutanix.com>
> >>>>>
> >>>>> Any chance we can fix those issues? Actually, we had a plan to make
> >>>>> use of vhost-net and its tx zerocopy (or even implement the rx
> >>>>> zerocopy) in pasta.
> >>>>
> >>>> Happy to take direction and ideas here, but I don't see a clear way to
> >>>> fix these issues without dealing with the assertions that
> >>>> skb_orphan_frags_rx calls out.
> >>>>
> >>>> Said another way, I'd be interested in hearing if there is a config
> >>>> where ZC in the current vhost-net implementation works, as I was
> >>>> driving myself crazy trying to reverse engineer it.
> >>>
> >>> See above.
> >>>
> >>>> Happy to collaborate if there is something we could do here.
> >>>
> >>> Great, we can start here by seeking a way to fix the known issues of
> >>> the vhost-net zerocopy code.
> >>
> >> Happy to help here :).
> >>
> >> Jon, could you share more details about the orphan problem so I can
> >> speed up the help? For example, can you describe the code changes and
> >> the code path that would lead to that assertion of
> >> skb_orphan_frags_rx?
> >>
> >> Thanks!
> >
> > Sorry for the slow response; getting back from holiday and catching up.
> >
> > When running through tun.c, there are a handful of places where ZC turns
> > into a full copy, whether that is in the tun code itself or in the
> > interrupt handler when tun xmit is running.
> >
> > For example, tun_net_xmit mandatorily calls skb_orphan_frags_rx. Anything
> > with frags will get this memcpy, and those are of course the "juicy"
> > targets here, as they would take up the most memory bandwidth in general.
> > Nasty catch-22 :)
> >
> > There are also plenty of places that call normal skb_orphan_frags, which
> > triggers because vhost doesn't set SKBFL_DONT_ORPHAN.
> > That’s an easy fix, but still something to think about.
> >
> > Then there is the issue of packet sockets, which throw a king-sized
> > wrench into this. It’s slightly insidious: it isn’t directly apparent
> > that loading some user space app nukes zero copy, but it happens.
> >
> > See my previous comment about lldpd, where a simple compiler snafu
> > caused one socket option to silently break, which in turn ripped out ZC
> > capability. Easy fix, but it’s an example of how this can fall over.
> >
> > Bottom line, I’d *love* to have ZC work, work well, and so on. I’m open
> > to ideas here :) (up to and including both A) fixing it and B) deleting
> > it)
>
> Hey Eugenio - wondering if you had a chance to check out my notes on this?
>
Sorry, I thought I was going to have the code ready by now :). I'll need more time to go through the items.