On Mon, May 12, 2025 at 5:21 PM Jon Kohler <j...@nutanix.com> wrote:
>
>
>
> > On Apr 30, 2025, at 9:21 PM, Jon Kohler <j...@nutanix.com> wrote:
> >
> >
> >
> >> On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin <epere...@redhat.com> 
> >> wrote:
> >>
> >>
> >> On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasow...@redhat.com> wrote:
> >>>
> >>> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <j...@nutanix.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasow...@redhat.com> wrote:
> >>>>>
> >>>>>
> >>>>> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <j...@nutanix.com> wrote:
> >>>>>>
> >>>>>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
> >>>>>> the module parameter for the handle_tx_zerocopy path back in 2019,
> >>>>>> noting that many downstream distributions (e.g., RHEL7 and later) had
> >>>>>> already done the same.
> >>>>>>
> >>>>>> Both upstream and downstream disablement suggest this path is rarely
> >>>>>> used.
> >>>>>>
> >>>>>> Testing the module parameter shows that while the path allows packet
> >>>>>> forwarding, the zerocopy functionality itself is broken. On outbound
> >>>>>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
> >>>>>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
> >>>>>
> >>>>> This is by design to avoid DOS.
> >>>>
> >>>> I understand that, but it makes ZC non-functional in general, as ZC fails
> >>>> and immediately increments the error counters.
> >>>
> >>> The main issue is HOL (head-of-line) blocking, but zerocopy may still
> >>> work in some setups that don't need to care about HOL. One example is
> >>> the macvtap passthrough mode.
> >>>
> >>>>
> >>>>>
> >>>>>> or
> >>>>>> skb_orphan_frags() elsewhere in the stack,
> >>>>>
> >>>>> Basically zerocopy is expected to work for guest -> remote case, so
> >>>>> could we still hit skb_orphan_frags() in this case?
> >>>>
> >>>> Yes, you’d hit that in tun_net_xmit().
> >>>
> >>> Only for local VM to local VM communication.
> >
> > Sure, but the tricky bit here is that if you have a mix of VM-VM and
> > VM-external traffic patterns, any time the error path is hit, the zc error
> > counter will go up.
> >
> > When that happens, ZC will get silently disabled anyhow, so it leads to
> > sporadic success / non-deterministic performance.
> >
> >>>
> >>>> If you punch a hole in that *and* in the zc error counter (such that
> >>>> failed ZC doesn't disable ZC in vhost), you get ZC from vhost; however,
> >>>> the packet still hits the network interrupt handler under net_tx_action
> >>>> and eventually incurs the memcpy under dev_queue_xmit_nit().
> >>>
> >>> Well, yes, we need a copy if there's a packet socket. But if there are
> >>> no network interface taps, we don't need to do the copy here.
> >>>
> >
> > Agreed on the packet socket side. I recently fixed an issue in lldpd [1]
> > that prevented this specific case from applying; however, there are still
> > other trip wires spread out across the stack that would need to be
> > addressed.
> >
> > [1] 
> > https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717
> >
> >>
> >> Hi!
> >>
> >> I need more time diving into the issues. As Jon mentioned, vhost ZC is
> >> so little used I didn't have the chance to experiment with this until
> >> now :). But yes, I expect to be able to overcome these for pasta, by
> >> adapting buffer sizes or modifying code etc.
> >
> > Another tricky bit here is that it has been disabled both upstream and
> > downstream for so long that the code naturally has a bit of a wrench in
> > the engine.
> >
> > RE Buffer sizes: I tried this as well, because I think on sufficiently
> > fast systems, zero copy gets especially interesting in GSO/TSO cases where
> > you have mega payloads.
> >
> > I tried playing around with the good copy value such that ZC restricted
> > itself to only, let's say, 32K payloads and above, and while it *does*
> > work (with enough holes punched in), absolute t-put doesn't actually go
> > up, it's just that CPU utilization goes down a pinch. Not a bad thing for
> > certain, but still not great.
> >
> > In fact, I found that tput actually went down with this path, even with
> > ZC occurring successfully, as there was still a mix of ZC and non-ZC
> > because you can only have so many pending at any given time before the
> > copy path kicks in again.
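> >
> > Back-of-envelope, just to illustrate the scale (the traffic numbers are
> > assumptions on my part): at ~10 Gbit/s of guest TX with 64 KB GSO sends,
> > 64 in-flight zerocopy slots cover roughly 4 MB, i.e. only ~3 ms worth of
> > traffic, so unless completions come back faster than that, the path keeps
> > falling back to copy.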
> >
> >
> >>
> >>>>
> >>>> This is no more performant, and in fact is actually worse since the
> >>>> time spent waiting on that memcpy to resolve is longer.
> >>>>
> >>>>>
> >>>>>> as vhost_net does not set
> >>>>>> SKBFL_DONT_ORPHAN.
> >>>
> >>> Maybe we can try to set this, as vhost-net can honor ulimit now.
> >
> > Yeah, I tried that, and while it helps kick things further down the
> > stack, it's not actually faster in any testing I've drummed up.
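> >
> > Roughly the kind of change in question -- purely illustrative, and the
> > exact hook point in the tun zerocopy setup path is an assumption on my
> > part:
> >
> >   /* hypothetical: mark vhost-net zerocopy frags as safe to keep pinned,
> >    * so plain skb_orphan_frags() no longer forces a copy */
> >   if (zerocopy)
> >           skb_shinfo(skb)->flags |= SKBFL_DONT_ORPHAN;
> >
> > It gets the skb past skb_orphan_frags(), but as noted, it didn't move the
> > needle in my testing.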
> >
> >>>
> >>>>>>
> >>>>>> Orphaning enforces a memcpy and triggers the completion callback, which
> >>>>>> increments the failed TX counter, effectively disabling zerocopy again.
> >>>>>>
> >>>>>> Even after addressing these issues to prevent SKB orphaning and error
> >>>>>> counter increments, performance remains poor. By default, only 64
> >>>>>> messages can be zerocopied, which is immediately exhausted by workloads
> >>>>>> like iperf, resulting in most messages being memcpy'd anyhow.
> >>>>>>
> >>>>>> Additionally, memcpy'd messages do not benefit from the XDP batching
> >>>>>> optimizations present in the handle_tx_copy path.
> >>>>>>
> >>>>>> Given these limitations and the lack of any tangible benefits, remove
> >>>>>> zerocopy entirely to simplify the code base.
> >>>>>>
> >>>>>> Signed-off-by: Jon Kohler <j...@nutanix.com>
> >>>>>
> >>>>> Any chance we can fix those issues? Actually, we had a plan to make
> >>>>> use of vhost-net and its tx zerocopy (or even implement the rx
> >>>>> zerocopy) in pasta.
> >>>>
> >>>> Happy to take direction and ideas here, but I don't see a clear way to
> >>>> fix these issues without dealing with the assertions that
> >>>> skb_orphan_frags_rx calls out.
> >>>>
> >>>> Said another way, I'd be interested in hearing if there is a config
> >>>> where ZC in the current vhost-net implementation works, as I was
> >>>> driving myself crazy trying to reverse engineer it.
> >>>
> >>> See above.
> >>>
> >>>>
> >>>> Happy to collaborate if there is something we could do here.
> >>>
> >>> Great, we can start here by seeking a way to fix the known issues of
> >>> the vhost-net zerocopy code.
> >>>
> >>
> >> Happy to help here :).
> >>
> >> Jon, could you share more details about the orphan problem so I can
> >> speed up the help? For example, can you describe the code changes and
> >> the code path that would lead to that assertion of
> >> skb_orphan_frags_rx?
> >>
> >> Thanks!
> >>
> >
> > Sorry for the slow response, getting back from holiday and catching up.
> >
> > When running through tun.c, there are a handful of places where ZC turns
> > into a full copy, whether that is in the tun code itself or in the
> > interrupt handler when tun xmit is running.
> >
> > For example, tun_net_xmit mandatorily calls skb_orphan_frags_rx. Anything
> > with zerocopy frags will get this memcpy, and those are of course the
> > "juicy" targets here, as they would take up the most memory bandwidth in
> > general. Nasty catch-22 :)
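> >
> > For reference, a simplified sketch of what that helper boils down to
> > (from my reading of include/linux/skbuff.h, not the verbatim source):
> >
> >   /* Any skb carrying userspace (zerocopy) frags must have those frags
> >    * copied into kernel memory before it can go down a path that may
> >    * loop back to RX, e.g. tun_net_xmit(); there is no opt-out flag. */
> >   static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
> >   {
> >           if (likely(!skb_zcopy(skb)))
> >                   return 0;
> >           return skb_copy_ubufs(skb, gfp_mask);
> >   }
> >
> > So every zerocopy skb that vhost hands to tun pays the copy anyway.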
> >
> > There are also plenty of places that call normal skb_orphan_frags, which
> > triggers a copy because vhost doesn't set SKBFL_DONT_ORPHAN. That's an
> > easy fix, but still something to think about.
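> >
> > The non-_rx variant is the one that flag would help with; roughly (again
> > simplified from my reading of the header, not the verbatim source):
> >
> >   static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
> >   {
> >           if (likely(!skb_zcopy(skb)))
> >                   return 0;
> >           /* the submitter opted in to keeping user pages pinned */
> >           if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
> >                   return 0;
> >           return skb_copy_ubufs(skb, gfp_mask);
> >   }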
> >
> > Then there is the issue of packet sockets, which throw a king-sized
> > wrench into this. It's slightly insidious: it isn't directly apparent
> > that loading some user space app nukes zero copy, but it happens.
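> >
> > To see the effect, any AF_PACKET listener on the box is enough. A toy
> > sniffer like the sketch below (illustrative only, needs CAP_NET_RAW) is
> > the kind of thing such apps open under the hood:
> >
> >   /* While this socket exists, outgoing skbs are delivered to it too,
> >    * and per the above that delivery forces zerocopy frags to be
> >    * copied, silently defeating ZC for everything on the host. */
> >   #include <arpa/inet.h>
> >   #include <linux/if_ether.h>
> >   #include <stdio.h>
> >   #include <sys/socket.h>
> >   #include <unistd.h>
> >
> >   int main(void)
> >   {
> >           int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> >
> >           if (fd < 0) {
> >                   perror("socket(AF_PACKET)");
> >                   return 1;
> >           }
> >           pause();        /* just hold the socket open */
> >           close(fd);
> >           return 0;
> >   }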
> >
> > See my previous comment about LLDPD, where a simple compiler snafu caused
> > one socket option to silently break, which then ripped out ZC capability.
> > Easy fix, but it's an example of how this can fall over.
> >
> > Bottom line, I'd *love* to have ZC work, work well, and so on. I'm open
> > to ideas here :) (up to and including both A) fixing it and B) deleting
> > it)
>
> Hey Eugenio - wondering if you had a chance to check out my notes on this?
>

Sorry, I thought I was going to have the code ready by now :). I'll
need more time to go through the items.

