On 18/05/2026 14:08, Stefano Garzarella wrote:
> On Mon, May 18, 2026 at 11:50:05AM +0100, David Laight wrote:
>> On Mon, 18 May 2026 11:54:19 +0200
>> Stefano Garzarella <[email protected]> wrote:
>>
>>> On Mon, May 18, 2026 at 05:33:08AM -0400, Michael S. Tsirkin wrote:
>>> >On Mon, May 18, 2026 at 11:18:24AM +0200, Stefano Garzarella wrote:
>>> >> On Sat, May 16, 2026 at 12:53:29PM +0100, David Laight wrote:
>>> >> > On Thu, 14 May 2026 11:29:48 +0200
>>> >> > Stefano Garzarella <[email protected]> wrote:
>>> >> >
>>> >> > > From: Stefano Garzarella <[email protected]>
>>> >> > >
>>> >> > > When a large message is fragmented into multiple skbs, the zerocopy
>>> >> > > uarg is only allocated and attached to the last skb in the loop.
>>> >> > > Non-final skbs carry pinned user pages with no completion tracking,
>>> >> > > so the kernel has no way to notify userspace when those pages are 
>>> >> > > safe
>>> >> > > to reuse. If the loop breaks early the uarg is never allocated at 
>>> >> > > all,
>>> >> > > leaking pinned pages with no completion notification.
>>> >> > >
>>> >> > > Fix this by following the approach used by TCP: allocate the zerocopy
>>> >> > > uarg (if not provided by the caller) before the send loop and attach
>>> >> > > it to every skb via skb_zcopy_set(), which takes a reference per skb.
>>> >> > > Each skb's completion properly decrements the refcount, and the
>>> >> > > notification only fires after the last skb is freed.
>>> >> > > On failure, if no data was sent, the uarg is cleanly aborted via
>>> >> > > net_zcopy_put_abort().
>>> >> > >
>>> >> > > This issue was initially discovered by sashiko while reviewing commit
>>> >> > > 1cb36e252211 ("vsock/virtio: fix MSG_ZEROCOPY pinned-pages 
>>> >> > > accounting")
>>> >> > > but was pre-existing.
>>> >> > >
>>> >> > > Fixes: 581512a6dc93 ("vsock/virtio: MSG_ZEROCOPY flag support")
>>> >> > > Cc: Arseniy Krasnov <[email protected]>
>>> >> > > Closes: 
>>> >> > > https://sashiko.dev/#/patchset/20260420132051.217589-1-sgarzare%40redhat.com
>>> >> > > Reported-by: Maher Azzouzi <[email protected]>
>>> >> > > Signed-off-by: Stefano Garzarella <[email protected]>
>>> >> > > ---
>>> >> > >  net/vmw_vsock/virtio_transport_common.c | 83 
>>> >> > >++++++++++---------------
>>> >> > >  1 file changed, 34 insertions(+), 49 deletions(-)
>>> >> > >
>>> >> > > diff --git a/net/vmw_vsock/virtio_transport_common.c 
>>> >> > > b/net/vmw_vsock/virtio_transport_common.c
>>> >> > > index 989cc252d3d3..1e3409d28164 100644
>>> >> > > --- a/net/vmw_vsock/virtio_transport_common.c
>>> >> > > +++ b/net/vmw_vsock/virtio_transport_common.c
>>> >> > > @@ -70,34 +70,6 @@ static bool virtio_transport_can_zcopy(const 
>>> >> > > struct virtio_transport *t_ops,
>>> >> > >      return true;
>>> >> > >  }
>>> >> > >
>>> >> > > -static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
>>> >> > > -                       struct sk_buff *skb,
>>> >> > > -                       struct msghdr *msg,
>>> >> > > -                       size_t pkt_len,
>>> >> > > -                       bool zerocopy)
>>> >> > > -{
>>> >> > > -    struct ubuf_info *uarg;
>>> >> > > -
>>> >> > > -    if (msg->msg_ubuf) {
>>> >> > > -        uarg = msg->msg_ubuf;
>>> >> > > -        net_zcopy_get(uarg);
>>> >> > > -    } else {
>>> >> > > -        struct ubuf_info_msgzc *uarg_zc;
>>> >> > > -
>>> >> > > -        uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>> >> > > -                        pkt_len, NULL, false);
>>> >> > > -        if (!uarg)
>>> >> > > -            return -1;
>>> >> > > -
>>> >> > > -        uarg_zc = uarg_to_msgzc(uarg);
>>> >> > > -        uarg_zc->zerocopy = zerocopy ? 1 : 0;
>>> >> > > -    }
>>> >> > > -
>>> >> > > -    skb_zcopy_init(skb, uarg);
>>> >> > > -
>>> >> > > -    return 0;
>>> >> > > -}
>>> >> > > -
>>> >> > >  static int virtio_transport_fill_skb(struct sk_buff *skb,
>>> >> > >                       struct virtio_vsock_pkt_info *info,
>>> >> > >                       size_t len,
>>> >> > > @@ -317,8 +289,10 @@ static int 
>>> >> > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>> >> > >      u32 src_cid, src_port, dst_cid, dst_port;
>>> >> > >      const struct virtio_transport *t_ops;
>>> >> > >      struct virtio_vsock_sock *vvs;
>>> >> > > +    struct ubuf_info *uarg = NULL;
>>> >> > >      u32 pkt_len = info->pkt_len;
>>> >> > >      bool can_zcopy = false;
>>> >> > > +    bool have_uref = false;
>>> >> > >      u32 rest_len;
>>> >> > >      int ret;
>>> >> > >
>>> >> > > @@ -360,6 +334,25 @@ static int 
>>> >> > > virtio_transport_send_pkt_info(struct vsock_sock *vsk,
>>> >> > >          if (can_zcopy)
>>> >> > >              max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
>>> >> > >                          (MAX_SKB_FRAGS * PAGE_SIZE));
>>> >> > > +
>>> >> > > +        if (info->msg->msg_flags & MSG_ZEROCOPY &&
>>> >> > > +            info->op == VIRTIO_VSOCK_OP_RW) {
>>> >> > > +            uarg = info->msg->msg_ubuf;
>>> >> > > +
>>> >> > > +            if (!uarg) {
>>> >> > > +                uarg = msg_zerocopy_realloc(sk_vsock(vsk),
>>> >> > > +                                pkt_len, NULL, false);
>>> >> > > +                if (!uarg) {
>>> >> > > +                    virtio_transport_put_credit(vvs, pkt_len);
>>> >> > > +                    return -ENOMEM;
>>> >> > > +                }
>>> >> > > +
>>> >> > > +                if (!can_zcopy)
>>> >> > > +                    uarg_to_msgzc(uarg)->zerocopy = 0;
>>> >> > > +
>>> >> > > +                have_uref = true;
>>> >> > > +            }
>>> >> > > +        }
>>> >> >

Hi guys! Just some replies after quick look:

>>> >> > Surely that block should only be done if can_zcopy is true?

I guess no, since TCP also allocates uarg even if zerocopy is impossible -
it just sets uarg_to_msgzc(uarg)->zerocopy = 0;

>>> >> > And shouldn't something unset it if info->op != VIRTIO_VSOCK_OP_RW ?

Hm, we can't enter block 'if (info->msg) {' when 'info->op' is not equal to 
'VIRTIO_VSOCK_OP_RW',
because 'msg' is not NULL only for 'VIRTIO_VSOCK_OP_RW'. But anyway You point 
to right thing - check for
'VIRTIO_VSOCK_OP_RW' could be removed here. Just with comment why.

>>> >> > If the msg_zerocopy_realloc() fails then can't you just set can_zcopy 
>>> >> > to false.

Here I guess it is better to follow TCP way to make same behaviour, because 
exact
logic for MSG_ZEROCOPY is not documented. TCP also returns error and stops tx 
loop.

>>> >> >
>>> >> > It info->msg->msg_buf is already set then I think you have to disable 
>>> >> > zero-copy.
>>> >> > The caller has already requested a callback - and you can't add 
>>> >> > another.

In TCP implementation if 'msg_ubuf' is set they just use it and check for 
zerocopy in
the same way as 'msg_ubuf' is NULL.

>>> >> >
>>> >> > In any case by the end of this can_zcopy and have_uref are really the 
>>> >> > same flag.
>>> >>

Need to check it more. But 'can_zcopy' means that we fill frags in skb, 
have_uref means that we
allocated completion (but it could be reported with not set 
'SO_EE_CODE_ZEROCOPY_COPIED' if
'can_zcopy' was false).


@Stefano, I guess current implementation differs from TCP in two cases (at 
least from first
view):
1) When 'msg_ubuf' is set: in TCP, already set 'msg_ubuf' is passed to 
'skb_zerocopy_iter_stream()'
   (if zerocopy is possible) where it is used as in vsock in call 
'skb_zcopy_set()'. In vsock case if
   'msg_ubuf' is not NULL we will pass just NULL to 'skb_zcopy_set()'. Yes this 
is will be
   no-effect call today (due to checks in 'skb_zcopy_set()'), but anyway - in 
future may be not.
2) Also i see that 'skb_zerocopy_iter_stream()' in TCP version has some extra 
checks which are
   missed in vsock - we only just call '__zerocopy_sg_from_iter()' to fill skb 
in zerocopy way.

But, I think, instead of trying to compare vsock and TCP versions best way is 
to just copy
current TCP flow as close as possible:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/tree/net/ipv4/tcp.c#n1144
1) Use same flow of checks for 'msg_flags', 'msg_ubuf', SOCK_ZEROCOPY etc.
2) To copy data we can use 'skb_zerocopy_iter_stream()'.
3) The only thing that we don't need in vsock is dev mem related code from TCP 
implementation.

I can take this task, but pls need some time - may be two/three weeks due to 
another tasks. 

What do You think?

Thanks

>>> >> I kept the same approach we had before, trying to make as few changes as
>>> >> possible.
>>> >>
>>> >> All these potential issues seem to be pre-existing and should be 
>>> >> eventually
>>> >> addressed in other patches IMHO. This patch one only resolves the main 
>>> >> issue
>>> >> of calling `skb_zcopy_set()` for every skb to avoid leaking pages, etc.
>>> >
>>> >the patch is upstream now, right? So pretty much have to be patches on
>>> >top.
>>>
>>> If those are actual issues, then yes. TBH, I didn’t look into that
>>> aspect and left it as it was before. We should take a closer look at how
>>> MSG_ZEROCOPY is handled.
>>>
>>> David, if you think it needs fixing and you have time, feel free to send
>>> patches on top.
>>
>> I'm not fully sure how it all works.
> 
> Same here, so I pinged Arseniy who worked on that, since it seemed deliberate 
> to have `can_zcopy` (and set `uarg->zerocopy` accordingly) only when it was 
> supported by the transport.
> 
>> Especially the paths where msg->msg_ubuf is non-NULL, I suspect it should
>> be added to all the skb even if the ZEROCOPY flag isn't set.
>> I was just reading the one function.
>> But there did look like some very dodgy conditionals.
> 
> I see, let's wait for Arseniy's feedback; otherwise, I'll try to fix it next 
> week. As mentioned, this issue existed before this patch, so it shouldn't be 
> a regression.
> 
> Thanks,
> Stefano
> 


Reply via email to