On Thu, May 1, 2025 at 9:21 AM Jon Kohler <j...@nutanix.com> wrote:
>
>
> > On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin <epere...@redhat.com>
> > wrote:
> >
> > On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasow...@redhat.com> wrote:
> >>
> >> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <j...@nutanix.com> wrote:
> >>>
> >>>> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasow...@redhat.com> wrote:
> >>>>
> >>>> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <j...@nutanix.com> wrote:
> >>>>>
> >>>>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
> >>>>> the module parameter for the handle_tx_zerocopy path back in 2019,
> >>>>> noting that many downstream distributions (e.g., RHEL7 and later) had
> >>>>> already done the same.
> >>>>>
> >>>>> Both upstream and downstream disablement suggest this path is rarely
> >>>>> used.
> >>>>>
> >>>>> Testing the module parameter shows that while the path allows packet
> >>>>> forwarding, the zerocopy functionality itself is broken. On outbound
> >>>>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
> >>>>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
> >>>>
> >>>> This is by design to avoid DOS.
> >>>
> >>> I understand that, but it makes ZC non-functional in general, as ZC fails
> >>> and immediately increments the error counters.
> >>
> >> The main issue is HOL, but zerocopy may still work in some setups that
> >> don't need to care about HOL. One example is the macvtap passthrough
> >> mode.
> >>
> >>>
> >>>>
> >>>>> or
> >>>>> skb_orphan_frags() elsewhere in the stack,
> >>>>
> >>>> Basically zerocopy is expected to work for the guest -> remote case, so
> >>>> could we still hit skb_orphan_frags() in this case?
> >>>
> >>> Yes, you'd hit that in tun_net_xmit().
> >>
> >> Only for local VM to local VM communication.
>
> Sure, but the tricky bit here is that if you have a mix of VM-VM and
> VM-external traffic patterns, any time the error path is hit, the zc
> error counter will go up.
>
> When that happens, ZC will get silently disabled anyhow, so it leads to
> sporadic success / non-deterministic performance.
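
For reference, the accounting behind that silent disable looks roughly
like this (paraphrased from drivers/vhost/net.c, from memory rather
than a verbatim quote of the current tree):

        /* Every failed zerocopy completion (e.g. the skb was orphaned
         * and copied, so the DMA is reported as VHOST_DMA_FAILED_LEN)
         * ends up bumping the error counter...
         */
        static void vhost_net_tx_err(struct vhost_net *net)
        {
                ++net->tx_zcopy_err;
        }

        /* ...and the TX path stops selecting zerocopy once the errors
         * exceed roughly one per 64 transmitted packets:
         */
        static bool vhost_net_tx_select_zcopy(struct vhost_net *net)
        {
                /* TX flush waits for outstanding DMAs; don't start new ones. */
                return !net->tx_flush &&
                       net->tx_packets / 64 >= net->tx_zcopy_err;
        }

So once the orphan-and-copy fallback fires for more than roughly one in
64 packets, zerocopy stays off until the ratio recovers, which matches
the sporadic behaviour you describe.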
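
The two orphan helpers also differ in exactly the way that matters here
(again paraphrased, this time from include/linux/skbuff.h):

        static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
        {
                if (likely(!skb_zcopy(skb)))
                        return 0;
                /* Senders that manage their own accounting can opt out. */
                if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
                        return 0;
                return skb_copy_ubufs(skb, gfp_mask);
        }

        /* Frags must be orphaned, even if refcounted, if the skb might
         * loop back to an rx path.
         */
        static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
        {
                if (likely(!skb_zcopy(skb)))
                        return 0;
                return skb_copy_ubufs(skb, gfp_mask);
        }

So setting SKBFL_DONT_ORPHAN would only exempt the plain
skb_orphan_frags() callers; the skb_orphan_frags_rx() copy in
tun_net_xmit() remains unconditional for zerocopy skbs, which is the
safety guard at issue.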
As discussed, the main issue is safety: a malicious VM can refuse to
receive packets, which may cause tx to be stuck. So sacrificing
performance for safety is the only way to go. If we can find a way to
fix that, it could be relaxed.

> >>>
> >>> If you punch a hole in that *and* in the zc error counter (such that
> >>> failed ZC doesn't disable ZC in vhost), you get ZC from vhost;
> >>> however, the network stack, under net_tx_action, eventually incurs
> >>> the memcpy under dev_queue_xmit_nit().
> >>
> >> Well, yes, we need a copy if there's a packet socket. But if there's
> >> no network interface taps, we don't need to do the copy here.
> >>
>
> Agreed on the packet socket side. I recently fixed an issue in lldpd [1]
> that prevented this specific case; however, there are still other trip
> wires spread out across the stack that would need to be addressed.
>
> [1] https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717

I see.

> >
> > Hi!
> >
> > I need more time diving into the issues. As Jon mentioned, vhost ZC is
> > so little used I didn't have the chance to experiment with this until
> > now :). But yes, I expect to be able to overcome these for pasta, by
> > adapting buffer sizes or modifying code etc.
>
> Another tricky bit here is that it has been disabled both upstream and
> downstream for so long that the code naturally has a bit of a
> wrench-in-the-engine.
>
> RE buffer sizes: I tried this as well, because I think on sufficiently
> fast systems, zero copy gets especially interesting in GSO/TSO cases
> where you have mega payloads.

Yes.

>
> I tried playing around with the good copy value such that ZC restricted
> itself to only, let's say, 32K payloads and above, and while it *does*
> work (with enough holes punched in), absolute t-put doesn't actually go
> up, it's just that CPU utilization goes down a pinch. Not a bad thing
> for certain, but still not great.
>
> In fact, I found that tput actually went down with this path, even with
> ZC occurring successfully, as there was still a mix of ZC and non-ZC
> because you can only have so many pending at any given time before the
> copy path kicks in again.

TCP has heuristics that may try to coalesce more writes into a larger
packet, so we might get "worse" throughput when vhost becomes faster.
For example, in my testing environment, qemu is faster than vhost in
single TCP stream testing with netperf.

>
> >>>
> >>> This is no more performant, and in fact is actually worse, since the
> >>> time spent waiting on that memcpy to resolve is longer.
> >>>
> >>>>
> >>>>> as vhost_net does not set
> >>>>> SKBFL_DONT_ORPHAN.
> >>
> >> Maybe we can try to set this as vhost-net can honor ulimit now.
>
> Yeah, I tried that, and while it helps kick things further down the
> stack, it's not actually faster in any testing I've drummed up.

Can I see the code?

>
> >>>>>
> >>>>> Orphaning enforces a memcpy and triggers the completion callback, which
> >>>>> increments the failed TX counter, effectively disabling zerocopy again.
> >>>>>
> >>>>> Even after addressing these issues to prevent SKB orphaning and error
> >>>>> counter increments, performance remains poor. By default, only 64
> >>>>> messages can be zerocopied, which is immediately exhausted by workloads
> >>>>> like iperf, resulting in most messages being memcpy'd anyhow.
> >>>>>
> >>>>> Additionally, memcpy'd messages do not benefit from the XDP batching
> >>>>> optimizations present in the handle_tx_copy path.
> >>>>>
> >>>>> Given these limitations and the lack of any tangible benefits, remove
> >>>>> zerocopy entirely to simplify the code base.
> >>>>>
> >>>>> Signed-off-by: Jon Kohler <j...@nutanix.com>
> >>>>
> >>>> Any chance we can fix those issues? Actually, we had a plan to make
> >>>> use of vhost-net and its tx zerocopy (or even implement the rx
> >>>> zerocopy) in pasta.
> >>>
> >>> Happy to take direction and ideas here, but I don't see a clear way to
> >>> fix these issues without dealing with the assertions that
> >>> skb_orphan_frags_rx calls out.
> >>>
> >>> Said another way, I'd be interested in hearing if there is a config
> >>> where ZC in the current vhost-net implementation works, as I was
> >>> driving myself crazy trying to reverse engineer it.
> >>
> >> See above.
> >>
> >>>
> >>> Happy to collaborate if there is something we could do here.
> >>
> >> Great, we can start here by seeking a way to fix the known issues of
> >> the vhost-net zerocopy code.
> >>
> >
> > Happy to help here :).
> >
> > Jon, could you share more details about the orphan problem so I can
> > speed up the help? For example, can you describe the code changes and
> > the code path that would lead to that assertion of
> > skb_orphan_frags_rx?
> >
> > Thanks!
> >
>
> Sorry for the slow response, getting back from holiday and catching up.
>
> When running through tun.c, there are a handful of places where ZC turns
> into a full copy, whether that is in the tun code itself or in the
> interrupt handler when tun xmit is running.
>
> For example, tun_net_xmit mandatorily calls skb_orphan_frags_rx. Anything
> with frags will get this memcpy, and those are of course the "juicy"
> targets here, as they take up the most memory bandwidth in general.
> Nasty catch-22 :)
>
> There are also plenty of places that call normal skb_orphan_frags, which
> triggers because vhost doesn't set SKBFL_DONT_ORPHAN. That's an easy
> fix, but still something to think about.
>
> Then there is the issue of packet sockets, which throw a king-sized
> wrench into this. It's slightly insidious: it isn't directly apparent
> that loading some user space app nukes zero copy, but it happens.
>
> See my previous comment about lldpd, where a simple compiler snafu
> caused one socket option to silently break, and that in turn ripped out
> ZC capability. Easy fix, but it's an example of how this can fall over.
>
> Bottom line, I'd *love* to have ZC work, work well, and so on. I'm open
> to ideas here :) (up to and including both A) fixing it and B) deleting
> it)

Zerocopy still has its advantages in some specific setups. Maybe a way
to start is to make it work automatically depending on the lower setups?
For example, enable it by default with macvtap (a rough, untested sketch
of that idea is below).

Thanks
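
Purely to illustrate the idea (untested; the zcopy_ok field and its
plumbing are made up, while tun_get_socket()/tap_get_socket() are the
helpers vhost_net already uses to identify the backend):

        /* Sketch: when the backend fd is attached, note whether it is a
         * tap/macvtap socket (where the passthrough case avoids the HOL
         * concern) rather than a tun socket, and gate zerocopy on that.
         */
        sock = tun_get_socket(file);
        if (!IS_ERR(sock)) {
                nvq->zcopy_ok = false;          /* tun: keep the copy path */
        } else {
                sock = tap_get_socket(file);
                if (!IS_ERR(sock))
                        nvq->zcopy_ok = true;   /* macvtap: allow zerocopy */
        }

        /* Then fold it into the existing per-packet decision in
         * handle_tx_zerocopy():
         */
        zcopy_used = nvq->zcopy_ok &&
                     len >= VHOST_GOODCOPY_LEN &&
                     !vhost_exceeds_maxpend(net) &&
                     vhost_net_tx_select_zcopy(net);

That way the risky configurations keep the copy path by default, and we
only pay the zerocopy bookkeeping where it has a chance of helping.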