On Fri, May 2, 2025 at 4:47 AM Paolo Abeni <pab...@redhat.com> wrote:
>
> Hi,
>
> On 4/29/25 5:26 AM, Mina Almasry wrote:
> > Augment dmabuf binding to be able to handle TX. In addition to all the
> > RX binding, we also create the tx_vec needed for the TX path.
> >
> > Provide API for sendmsg to be able to send dmabufs bound to this device:
> >
> > - Provide a new dmabuf_tx_cmsg which includes the dmabuf to send from.
> > - MSG_ZEROCOPY with SCM_DEVMEM_DMABUF cmsg indicates send from dma-buf.
> >
> > Devmem is uncopyable, so piggyback off the existing MSG_ZEROCOPY
> > implementation, while disabling instances where MSG_ZEROCOPY falls back
> > to copying.
> >
> > We additionally pipe the binding down to the new
> > zerocopy_fill_skb_from_devmem which fills a TX skb with net_iov netmems
> > instead of the traditional page netmems.
> >
> > We also special case skb_frag_dma_map to return the dma-address of these
> > dmabuf net_iovs instead of attempting to map pages.
> >
> > The TX path may release the dmabuf in a context where we cannot wait.
> > This happens when the user unbinds a TX dmabuf while there are still
> > references to its netmems in the TX path. In that case, the netmems will
> > be put_netmem'd from a context where we can't unmap the dmabuf. Resolve
> > this by making __net_devmem_dmabuf_binding_free schedule_work'd.
> >
> > Based on work by Stanislav Fomichev <s...@fomichev.me>. A lot of the meat
> > of the implementation came from devmem TCP RFC v1[1], which included the
> > TX path, but Stan did all the rebasing on top of netmem/net_iov.
> >
> > Cc: Stanislav Fomichev <s...@fomichev.me>
> > Signed-off-by: Kaiyuan Zhang <kaiyu...@google.com>
> > Signed-off-by: Mina Almasry <almasrym...@google.com>
> > Acked-by: Stanislav Fomichev <s...@fomichev.me>
>
> I'm sorry for the late feedback. A bunch of things I did not notice
> before...
>
> > @@ -701,6 +743,8 @@ int __zerocopy_sg_from_iter(struct msghdr *msg, struct
> > sock *sk,
> >
> >       if (msg && msg->msg_ubuf && msg->sg_from_iter)
> >               ret = msg->sg_from_iter(skb, from, length);
> > +     else if (unlikely(binding))
>
> I'm unsure if the unlikely() here (and in similar tests below) is worth
> it: depending on the actual workload this condition could be very
> likely.
>
Right, for now I'm guessing the MSG_ZEROCOPY use case in the else is the
more common workload, and I'm putting the devmem use case in the
unlikely path so I don't regress other use cases. We could revisit this
in the future. In my tests, the devmem workload doesn't seem affected by
this unlikely().

> [...]
> > @@ -1066,11 +1067,24 @@ int tcp_sendmsg_locked(struct sock *sk, struct
> > msghdr *msg, size_t size)
> >       int flags, err, copied = 0;
> >       int mss_now = 0, size_goal, copied_syn = 0;
> >       int process_backlog = 0;
> > +     bool sockc_valid = true;
> >       int zc = 0;
> >       long timeo;
> >
> >       flags = msg->msg_flags;
> >
> > +     sockc = (struct sockcm_cookie){ .tsflags = READ_ONCE(sk->sk_tsflags),
> > +                                     .dmabuf_id = 0 };
>
> the '.dmabuf_id = 0' part is not needed, and possibly the code is
> clearer without it.
>
> > +     if (msg->msg_controllen) {
> > +             err = sock_cmsg_send(sk, msg, &sockc);
> > +             if (unlikely(err))
> > +                     /* Don't return error until MSG_FASTOPEN has been
> > +                      * processed; that may succeed even if the cmsg is
> > +                      * invalid.
> > +                      */
> > +                     sockc_valid = false;
> > +     }
> > +
> >       if ((flags & MSG_ZEROCOPY) && size) {
> >               if (msg->msg_ubuf) {
> >                       uarg = msg->msg_ubuf;
> > @@ -1078,7 +1092,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr
> > *msg, size_t size)
> >                       zc = MSG_ZEROCOPY;
> >               } else if (sock_flag(sk, SOCK_ZEROCOPY)) {
> >                       skb = tcp_write_queue_tail(sk);
> > -                     uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
> > +                     uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb),
> > +                                                 sockc_valid &&
> > !!sockc.dmabuf_id);
>
> If sock_cmsg_send() failed and the user did not provide a dmabuf_id,
> memory accounting will be incorrect.
>

Forgive me, but I don't see it. sockc_valid will be false, so
msg_zerocopy_realloc will do the normal MSG_ZEROCOPY accounting. Then
below we check sockc_valid in place of where we did the sock_cmsg_send
before, and goto out_err. I assume the goto out_err should undo the
memory accounting in the new code, as in the old code.

Can you elaborate on the bug you see?

> >               if (!uarg) {
> >                       err = -ENOBUFS;
> >                       goto out_err;
> > @@ -1087,12 +1102,27 @@ int tcp_sendmsg_locked(struct sock *sk, struct
> > msghdr *msg, size_t size)
> >                       zc = MSG_ZEROCOPY;
> >               else
> >                       uarg_to_msgzc(uarg)->zerocopy = 0;
> > +
> > +             if (sockc_valid && sockc.dmabuf_id) {
> > +                     binding = net_devmem_get_binding(sk,
> > sockc.dmabuf_id);
> > +                     if (IS_ERR(binding)) {
> > +                             err = PTR_ERR(binding);
> > +                             binding = NULL;
> > +                             goto out_err;
> > +                     }
> > +             }
> >               }
> >       } else if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES) && size) {
> >               if (sk->sk_route_caps & NETIF_F_SG)
> >                       zc = MSG_SPLICE_PAGES;
> >       }
> >
> > +     if (sockc_valid && sockc.dmabuf_id &&
> > +         (!(flags & MSG_ZEROCOPY) || !sock_flag(sk, SOCK_ZEROCOPY))) {
> > +             err = -EINVAL;
> > +             goto out_err;
> > +     }
> > +
> >       if (unlikely(flags & MSG_FASTOPEN ||
> >                    inet_test_bit(DEFER_CONNECT, sk)) &&
> >           !tp->repair) {
> > @@ -1131,14 +1161,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct
> > msghdr *msg, size_t size)
> >               /* 'common' sending to sendq */
> >       }
> >
> > -     sockc = (struct sockcm_cookie) { .tsflags =
> > READ_ONCE(sk->sk_tsflags)};
> > -     if (msg->msg_controllen) {
> > -             err = sock_cmsg_send(sk, msg, &sockc);
> > -             if (unlikely(err)) {
> > -                     err = -EINVAL;
> > -                     goto out_err;
> > -             }
> > -     }
> > +     if (!sockc_valid)
> > +             goto out_err;
>
> Here 'err' could have been zeroed by tcp_sendmsg_fastopen(), and out_err
> could emit a wrong return value.
>

Good point indeed.
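
One way to avoid that (just an untested sketch on top of this patch)
could be to re-set the error at the new check site, preserving the old
behavior of failing the send with -EINVAL when the cmsg was bad:

	if (!sockc_valid) {
		/* sock_cmsg_send() failed earlier; 'err' may have been
		 * overwritten since (e.g. zeroed by tcp_sendmsg_fastopen()),
		 * so set it explicitly before bailing out.
		 */
		err = -EINVAL;
		goto out_err;
	}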

> Possibly it's better to keep the 'dmabuf_id' initialization out of
> sock_cmsg_send(); a separate helper could simplify the handling here?
>

This should be possible as well.

--
Thanks,
Mina