On Thu, May 1, 2025 at 9:21 AM Jon Kohler <j...@nutanix.com> wrote:
>
>
>
> > On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin <epere...@redhat.com> 
> > wrote:
> >
> > On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasow...@redhat.com> wrote:
> >>
> >> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <j...@nutanix.com> wrote:
> >>>
> >>>
> >>>
> >>>> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasow...@redhat.com> wrote:
> >>>>
> >>>> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <j...@nutanix.com> wrote:
> >>>>>
> >>>>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
> >>>>> the module parameter for the handle_tx_zerocopy path back in 2019,
> >>>>> noting that many downstream distributions (e.g., RHEL7 and later) had
> >>>>> already done the same.
> >>>>>
> >>>>> Both upstream and downstream disablement suggest this path is rarely
> >>>>> used.
> >>>>>
> >>>>> Testing the module parameter shows that while the path allows packet
> >>>>> forwarding, the zerocopy functionality itself is broken. On outbound
> >>>>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
> >>>>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
> >>>>
> >>>> This is by design to avoid DOS.
> >>>
> >>> I understand that, but it makes ZC non-functional in general, as ZC fails
> >>> and immediately increments the error counters.
> >>
> >> The main issue is HOL (head-of-line blocking), but zerocopy may still
> >> work in setups that don't need to care about HOL. One example is the
> >> macvtap passthrough mode.
> >>
> >>>
> >>>>
> >>>>> or
> >>>>> skb_orphan_frags() elsewhere in the stack,
> >>>>
> >>>> Basically zerocopy is expected to work for guest -> remote case, so
> >>>> could we still hit skb_orphan_frags() in this case?
> >>>
> >>> Yes, you’d hit that in tun_net_xmit().
> >>
> >> Only for local VM to local VM communication.
>
> Sure, but the tricky bit here is that if you have a mix of VM-VM and
> VM-external traffic patterns, any time the error path is hit, the zc
> error counter will go up.
>
> When that happens, ZC will get silently disabled anyhow, so it leads to
> sporadic success / non-deterministic performance.
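>
> (For reference, the "silent disable" comes from a heuristic in
> drivers/vhost/net.c. Paraphrased; the exact form varies by kernel
> version:
>
>	static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
>	{
>		/* TX flush waits for outstanding DMAs to be done.
>		 * Don't start new DMAs.
>		 */
>		return !net->tx_flush &&
>			net->tx_packets / 64 >= net->tx_zcopy_err;
>	}
>
> i.e., once more than roughly 1 in 64 recent packets has hit the error
> path, new packets quietly fall back to the copy path.)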

As discussed, the main issue is safety: a malicious VM can refuse to
receive packets, which may cause tx to be stuck. So sacrificing
performance for safety is the only way to go. If we can find a way to
fix these issues, it could be relaxed.

>
> >>
> >>> If you punch a hole in that *and* in the zc error counter (such that
> >>> failed ZC doesn't disable ZC in vhost), you get ZC from vhost; however,
> >>> the skb still passes through the network interrupt handler under
> >>> net_tx_action and eventually incurs the memcpy under
> >>> dev_queue_xmit_nit().
> >>
> >> Well, yes, we need a copy if there's a packet socket. But if there are
> >> no network interface taps, we don't need to do the copy here.
> >>
>
> Agreed on the packet socket side. I recently fixed an issue in lldpd [1]
> that avoids triggering this specific case; however, there are still other
> trip wires spread across the stack that would need to be addressed.
>
> [1] https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717
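>
> For reference, the copy in the packet socket case comes from
> deliver_skb() in net/core/dev.c, which dev_queue_xmit_nit() uses to feed
> taps. Roughly (paraphrased; the exact form varies by kernel version):
>
>	static inline int deliver_skb(struct sk_buff *skb,
>				      struct packet_type *pt_prev,
>				      struct net_device *orig_dev)
>	{
>		if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
>			return -ENOMEM;
>		refcount_inc(&skb->users);
>		return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>	}
>
> So any tap that sees outgoing packets (e.g., a packet socket that does
> not set PACKET_IGNORE_OUTGOING) forces skb_copy_ubufs() on every
> zerocopy skb.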

I see.

>
> >
> > Hi!
> >
> > I need more time diving into the issues. As Jon mentioned, vhost ZC is
> > so little used that I didn't have the chance to experiment with it until
> > now :). But yes, I expect to be able to overcome these for pasta, by
> > adapting buffer sizes, modifying code, etc.
>
> Another tricky bit here is that it has been disabled both upstream and
> downstream for so long that the code naturally has a bit of
> wrench-in-the-engine.
>
> RE buffer sizes: I tried this as well, because I think on sufficiently
> fast systems, zero copy gets especially interesting in GSO/TSO cases
> where you have mega payloads.

Yes.

>
> I tried playing around with the goodcopy value such that ZC restricted
> itself to only, let's say, 32K payloads and above, and while it *does*
> work (with enough holes punched in), absolute t-put doesn't actually go
> up; it's just that CPU utilization goes down a pinch. Not a bad thing
> for certain, but still not great.
>
> In fact, I found that tput actually went down with this path, even with
> ZC occurring successfully, as there was still a mix of ZC and non-ZC,
> because you can only have so many pending at any given time before the
> copy path kicks in again.
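>
> (Both knobs live in drivers/vhost/net.c. Paraphrased, and exact values
> and names vary by kernel version:
>
>	/* MAX number of TX used buffers for outstanding zerocopy */
>	#define VHOST_MAX_PEND 128
>	#define VHOST_GOODCOPY_LEN 256
>
>	zcopy_used = len >= VHOST_GOODCOPY_LEN &&
>		     !vhost_exceeds_maxpend(net) &&
>		     vhost_net_tx_select_zcopy(net);
>
> where vhost_exceeds_maxpend() caps in-flight zerocopy buffers at
> min(VHOST_MAX_PEND, vq->num >> 2), i.e., 64 for a typical 256-entry
> queue, which is the limit the commit message refers to; everything past
> it takes the copy path.)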

TCP has heuristics that may try to coalesce more writes into a larger
packet, so we might get "worse" throughput when vhost becomes faster.
For example, in my testing environment, qemu is faster than vhost in
single TCP stream netperf testing.

>
>
> >
> >>>
> >>> This is no more performant, and in fact is actually worse, since the
> >>> time spent waiting on that memcpy to resolve is longer.
> >>>
> >>>>
> >>>>> as vhost_net does not set
> >>>>> SKBFL_DONT_ORPHAN.
> >>
> >> Maybe we can try to set this, as vhost-net can honor ulimit now.
>
> Yea, I tried that, and while it helps kick things further down the
> stack, it's not actually faster in any testing I've drummed up.

Can I see the code?

>
> >>
> >>>>>
> >>>>> Orphaning enforces a memcpy and triggers the completion callback, which
> >>>>> increments the failed TX counter, effectively disabling zerocopy again.
> >>>>>
> >>>>> Even after addressing these issues to prevent SKB orphaning and error
> >>>>> counter increments, performance remains poor. By default, only 64
> >>>>> messages can be zerocopied, which is immediately exhausted by workloads
> >>>>> like iperf, resulting in most messages being memcpy'd anyhow.
> >>>>>
> >>>>> Additionally, memcpy'd messages do not benefit from the XDP batching
> >>>>> optimizations present in the handle_tx_copy path.
> >>>>>
> >>>>> Given these limitations and the lack of any tangible benefits, remove
> >>>>> zerocopy entirely to simplify the code base.
> >>>>>
> >>>>> Signed-off-by: Jon Kohler <j...@nutanix.com>
> >>>>
> >>>> Any chance we can fix those issues? Actually, we had a plan to make
> >>>> use of vhost-net and its tx zerocopy (or even implement the rx
> >>>> zerocopy) in pasta.
> >>>
> >>> Happy to take direction and ideas here, but I don't see a clear way
> >>> to fix these issues without dealing with the assertions that
> >>> skb_orphan_frags_rx calls out.
> >>>
> >>> Said another way, I'd be interested in hearing if there is a config
> >>> where ZC in the current vhost-net implementation works, as I was
> >>> driving myself crazy trying to reverse engineer it.
> >>
> >> See above.
> >>
> >>>
> >>> Happy to collaborate if there is something we could do here.
> >>
> >> Great, we can start here by seeking a way to fix the known issues of
> >> the vhost-net zerocopy code.
> >>
> >
> > Happy to help here :).
> >
> > Jon, could you share more details about the orphan problem so I can
> > speed up the help? For example, can you describe the code changes and
> > the code path that would lead to that assertion of
> > skb_orphan_frags_rx?
> >
> > Thanks!
> >
>
> Sorry for the slow response, getting back from holiday and catching up.
>
> When running through tun.c, there are a handful of places where ZC turns
> into a full copy, whether that is in the tun code itself or in the
> interrupt handler when tun xmit is running.
>
> For example, tun_net_xmit unconditionally calls skb_orphan_frags_rx.
> Anything with frags will get this memcpy, and those are of course the
> "juicy" targets here, as they take up the most memory bandwidth in
> general. Nasty catch-22 :)
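>
> The call in question is roughly this, from tun_net_xmit() in
> drivers/net/tun.c (paraphrased):
>
>	if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
>		goto drop;
>
> and skb_orphan_frags_rx() (include/linux/skbuff.h) copies unconditionally
> for any zerocopy skb; unlike plain skb_orphan_frags(), there is no flag
> to opt out:
>
>	static inline int skb_orphan_frags_rx(struct sk_buff *skb,
>					      gfp_t gfp_mask)
>	{
>		if (likely(!skb_zcopy(skb)))
>			return 0;
>		return skb_copy_ubufs(skb, gfp_mask);
>	}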
>
> There are also plenty of places that call plain skb_orphan_frags, which
> triggers the copy because vhost doesn't set SKBFL_DONT_ORPHAN. That's an
> easy fix (see the sketch below), but still something to think about.
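>
> The easy fix would be roughly one line in the zerocopy TX setup in
> drivers/vhost/net.c (a sketch; field names vary by kernel version):
>
>	/* sketch: also set SKBFL_DONT_ORPHAN so that plain
>	 * skb_orphan_frags() leaves the zerocopy frags in place;
>	 * skb_orphan_frags_rx() still copies regardless */
>	ubuf->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
>
> since skb_orphan_frags() returns early (no copy) when SKBFL_DONT_ORPHAN
> is set.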
>
> Then there is the issue of packet sockets, which throw a king-sized
> wrench into this. It's slightly insidious: it isn't directly apparent
> that loading some user space app nukes zero copy, but it happens.
>
> See my previous comment about lldpd, where a simple compiler snafu
> caused one socket option to silently break, which then ripped out the ZC
> capability. Easy fix, but it's an example of how this can fall over.
>
> Bottom line, I'd *love* to have ZC work, and work well. I'm open to
> ideas here :) (up to and including both A) fixing it and B) deleting it)

Zerocopy still has its advantages in some specific setups. Maybe a way
to start is to make it work automatically depending on the lower device?
For example, enable it by default with macvtap.
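
Hypothetically, the existing selection heuristic could take the lower
device into account rather than only the error counter. A sketch, not
actual kernel code, and vhost_sock_is_passthru_macvtap() is a made-up
helper:

	/* hypothetical: only attempt zerocopy when the backing device is
	 * a passthru macvtap, where no local receiver can hold onto the
	 * guest pages or force a copy */
	static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
	{
		if (!vhost_sock_is_passthru_macvtap(net))
			return false;
		return !net->tx_flush &&
		       net->tx_packets / 64 >= net->tx_zcopy_err;
	}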

Thanks

>

