> On Apr 30, 2025, at 9:21 PM, Jon Kohler <j...@nutanix.com> wrote:
> 
> 
> 
>> On Apr 16, 2025, at 6:15 AM, Eugenio Perez Martin <epere...@redhat.com> 
>> wrote:
>> 
>> 
>> On Tue, Apr 8, 2025 at 8:28 AM Jason Wang <jasow...@redhat.com> wrote:
>>> 
>>> On Tue, Apr 8, 2025 at 9:18 AM Jon Kohler <j...@nutanix.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Apr 6, 2025, at 7:14 PM, Jason Wang <jasow...@redhat.com> wrote:
>>>>> 
>>>>> 
>>>>> On Fri, Apr 4, 2025 at 10:24 PM Jon Kohler <j...@nutanix.com> wrote:
>>>>>> 
>>>>>> Commit 098eadce3c62 ("vhost_net: disable zerocopy by default") disabled
>>>>>> the module parameter for the handle_tx_zerocopy path back in 2019,
>>>>>> noting that many downstream distributions (e.g., RHEL7 and later) had
>>>>>> already done the same.
>>>>>> 
>>>>>> Both upstream and downstream disablement suggest this path is rarely
>>>>>> used.
>>>>>> 
>>>>>> Testing the module parameter shows that while the path allows packet
>>>>>> forwarding, the zerocopy functionality itself is broken. On outbound
>>>>>> traffic (guest TX -> external), zerocopy SKBs are orphaned by either
>>>>>> skb_orphan_frags_rx() (used with the tun driver via tun_net_xmit())
>>>>> 
>>>>> This is by design to avoid DOS.
>>>> 
>>>> I understand that, but it makes ZC non-functional in general, as ZC fails
>>>> and immediately increments the error counters.
>>> 
>>> The main issue is HOL (head-of-line blocking), but zerocopy may still
>>> work in some setups that don't need to care about HOL. One example is the
>>> macvtap passthrough mode.
>>> 
>>>> 
>>>>> 
>>>>>> or
>>>>>> skb_orphan_frags() elsewhere in the stack,
>>>>> 
>>>>> Basically zerocopy is expected to work for guest -> remote case, so
>>>>> could we still hit skb_orphan_frags() in this case?
>>>> 
>>>> Yes, you’d hit that in tun_net_xmit().
>>> 
>>> Only for local VM to local VM communication.
> 
> Sure, but the tricky bit here is that if you have a mix of VM-VM and
> VM-external traffic patterns, any time the error path is hit, the zc error
> counter will go up.
> 
> When that happens, ZC will get silently disabled anyhow, so it leads to
> sporadic success / non-deterministic performance.
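> 
> To make that concrete, here is a tiny userspace toy (not kernel code;
> identifiers simplified, and the /64 heuristic is from my memory of
> vhost_net_tx_select_zcopy() in drivers/vhost/net.c, so treat it as a
> sketch). A modest fraction of orphaned VM-VM frames is enough to keep
> zerocopy switched off for everything:
> 
>   #include <stdbool.h>
>   #include <stdio.h>
> 
>   struct zc_stats {
>           unsigned long tx_packets;
>           unsigned long tx_zcopy_err;
>   };
> 
>   /* roughly the shape of the select-zcopy heuristic */
>   static bool select_zcopy(const struct zc_stats *s)
>   {
>           return s->tx_packets / 64 >= s->tx_zcopy_err;
>   }
> 
>   int main(void)
>   {
>           struct zc_stats s = { 0, 0 };
> 
>           for (int i = 0; i < 1000; i++) {
>                   s.tx_packets++;
>                   if (i % 50 == 0)        /* 2% of frames are VM-VM and get orphaned */
>                           s.tx_zcopy_err++;
>           }
>           printf("zerocopy still selected: %s\n",
>                  select_zcopy(&s) ? "yes" : "no");    /* prints "no" */
>           return 0;
>   }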
> 
>>> 
>>>> If you punch a hole in that *and* in the zc error counter (such that
>>>> failed ZC doesn’t disable ZC in vhost), you get ZC from vhost; however,
>>>> the network interrupt handler under net_tx_action eventually incurs the
>>>> memcpy under dev_queue_xmit_nit().
>>> 
>>> Well, yes, we need a copy if there's a packet socket. But if there are
>>> no network interface taps, we don't need to do the copy here.
>>> 
> 
> Agreed on the packet socket side. I recently fixed an issue in lldpd [1]
> that was tripping this specific case; however, there are still other trip
> wires spread out across the stack that would need to be addressed.
> 
> [1] https://github.com/lldpd/lldpd/commit/622a91144de4ae487ceebdb333863e9f660e0717
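> 
> For context on why a single packet socket is enough to force the copy: if
> I'm reading net/core/dev.c right, the tap delivery helper used by
> dev_queue_xmit_nit() looks roughly like this (paraphrased from memory, not a
> verbatim quote):
> 
>   static inline int deliver_skb(struct sk_buff *skb,
>                                 struct packet_type *pt_prev,
>                                 struct net_device *orig_dev)
>   {
>           /* zerocopy frags must be copied out before a tap can see them */
>           if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
>                   return -ENOMEM;
>           refcount_inc(&skb->users);
>           return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>   }
> 
> So as soon as anything registers a ptype_all handler, every zerocopy frame
> eats an skb_copy_ubufs() on the way out.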
> 
>> 
>> Hi!
>> 
>> I need more time diving into the issues. As Jon mentioned, vhost ZC is
>> so little used I didn't have the chance to experiment with this until
>> now :). But yes, I expect to be able to overcome these for pasta, by
>> adapting buffer sizes or modifying code etc.
> 
> Another tricky bit here is that it has been disabled both upstream and
> downstream for so long that the code naturally has a bit of a wrench in the
> engine.
> 
> RE buffer sizes: I tried this as well, because I think on sufficiently fast
> systems, zero copy gets especially interesting in GSO/TSO cases where you
> have mega payloads.
> 
> I tried playing around with the goodcopy value such that ZC restricted
> itself to only, let's say, 32K payloads and above, and while it *does* work
> (with enough holes punched in), absolute throughput doesn’t actually go up;
> it's just that CPU utilization goes down a pinch. Not a bad thing for
> certain, but still not great.
> 
> In fact, I found that throughput actually went down with this path, even
> with ZC occurring successfully, as there was still a mix of ZC and non-ZC
> because you can only have so many pending at any given time before the copy
> path kicks in again.
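> 
> For anyone following along, the per-packet decision in handle_tx_zerocopy()
> is roughly this shape (paraphrased from memory of drivers/vhost/net.c, so
> take it as a sketch rather than verbatim):
> 
>   zcopy_used = len >= VHOST_GOODCOPY_LEN &&     /* payload big enough */
>                !vhost_exceeds_maxpend(net) &&   /* few enough completions in flight */
>                vhost_net_tx_select_zcopy(net);  /* error counter hasn't tripped */
> 
> Raising the goodcopy threshold only changes the first term; the in-flight
> cap (the ~64 the commit message mentions) still pushes the rest of the
> stream back onto the copy path.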
> 
> 
>> 
>>>> 
>>>> This is no more performant, and in fact is actually worse since the time
>>>> spent waiting on that memcpy to resolve is longer.
>>>> 
>>>>> 
>>>>>> as vhost_net does not set
>>>>>> SKBFL_DONT_ORPHAN.
>>> 
>>> Maybe we can try to set this, as vhost-net can honor ulimit now.
> 
> Yeah, I tried that, and while it helps kick things further down the stack,
> it's not actually faster in any testing I’ve drummed up.
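> 
> (For concreteness, the flag itself is a one-liner; where exactly to hook it
> when vhost/tun builds the zerocopy skb is the part I'm hand-waving here:
> 
>   /* hypothetical hook point: keep skb_orphan_frags() from copying our frags */
>   skb_shinfo(skb)->flags |= SKBFL_DONT_ORPHAN;
> 
> but as noted, it only moves the copy further down the stack.)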
> 
>>> 
>>>>>> 
>>>>>> Orphaning enforces a memcpy and triggers the completion callback, which
>>>>>> increments the failed TX counter, effectively disabling zerocopy again.
>>>>>> 
>>>>>> Even after addressing these issues to prevent SKB orphaning and error
>>>>>> counter increments, performance remains poor. By default, only 64
>>>>>> messages can be zerocopied, which is immediately exhausted by workloads
>>>>>> like iperf, resulting in most messages being memcpy'd anyhow.
>>>>>> 
>>>>>> Additionally, memcpy'd messages do not benefit from the XDP batching
>>>>>> optimizations present in the handle_tx_copy path.
>>>>>> 
>>>>>> Given these limitations and the lack of any tangible benefits, remove
>>>>>> zerocopy entirely to simplify the code base.
>>>>>> 
>>>>>> Signed-off-by: Jon Kohler <j...@nutanix.com>
>>>>> 
>>>>> Any chance we can fix those issues? Actually, we had a plan to make
>>>>> use of vhost-net and its tx zerocopy (or even implement the rx
>>>>> zerocopy) in pasta.
>>>> 
>>>> Happy to take direction and ideas here, but I don’t see a clear way to
>>>> fix these issues without dealing with the assertions that
>>>> skb_orphan_frags_rx calls out.
>>>> 
>>>> Said another way, I’d be interested in hearing if there is a config where
>>>> ZC in the current vhost-net implementation works, as I was driving myself
>>>> crazy trying to reverse engineer it.
>>> 
>>> See above.
>>> 
>>>> 
>>>> Happy to collaborate if there is something we could do here.
>>> 
>>> Great, we can start here by seeking a way to fix the known issues of
>>> the vhost-net zerocopy code.
>>> 
>> 
>> Happy to help here :).
>> 
>> Jon, could you share more details about the orphan problem so I can
>> speed up the help? For example, can you describe the code changes and
>> the code path that would lead to that assertion of
>> skb_orphan_frags_rx?
>> 
>> Thanks!
>> 
> 
> Sorry for the slow response, getting back from holiday and catching up.
> 
> When running through tun.c, there are a handful of places where ZC turns
> into a full copy, whether that is in the tun code itself or in the interrupt
> handler when tun xmit is running.
> 
> For example, tun_net_xmit unconditionally calls skb_orphan_frags_rx.
> Anything with frags gets this memcpy, and those are of course the “juicy”
> targets here, as they would take up the most memory bandwidth in general.
> Nasty catch-22 :)
> 
> There are also plenty of places that call normal skb_orphan_frags, which
> triggers the copy because vhost doesn’t set SKBFL_DONT_ORPHAN. That’s an
> easy fix, but still something to think about.
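> 
> For reference, the two helpers differ roughly like this (paraphrasing
> include/linux/skbuff.h from memory, so double-check against the actual
> tree):
> 
>   static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
>   {
>           if (likely(!skb_zcopy(skb)))
>                   return 0;
>           if (skb_shinfo(skb)->flags & SKBFL_DONT_ORPHAN)
>                   return 0;
>           return skb_copy_ubufs(skb, gfp_mask);   /* the memcpy */
>   }
> 
>   /* the _rx variant has no opt-out: zerocopy frags always get copied */
>   static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
>   {
>           if (likely(!skb_zcopy(skb)))
>                   return 0;
>           return skb_copy_ubufs(skb, gfp_mask);
>   }
> 
> So setting SKBFL_DONT_ORPHAN only helps with the first one; tun_net_xmit and
> the tap delivery path use the _rx variant.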
> 
> Then there is the issue of packet sockets, which throw a king-sized wrench
> into this. It's slightly insidious: it isn't directly apparent that loading
> some user space app nukes zero copy, but it happens.
> 
> See my previous comment about lldpd, where a simple compiler snafu caused
> one socket option to silently break, which then ripped out ZC capability.
> Easy fix, but it's an example of how this can fall over.
> 
> Bottom line, I’d *love* to have ZC work, work well, and so on. I’m open to
> ideas here :) (up to and including both A) fixing it and B) deleting it)

Hey Eugenio - wondering if you had a chance to check out my notes on this?
