On Tue, 19 May 2026 at 17:36, Stefano Garzarella <[email protected]> wrote: > > On Mon, May 18, 2026 at 08:39:37PM +0800, ziyu zhang wrote: > >On Mon, 18 May 2026 at 18:16, Stefano Garzarella <[email protected]> wrote: > >> > >> On Sat, May 16, 2026 at 11:47:45AM +0800, Ziyu Zhang wrote: > >> >vsock_poll() reads vsk->peer_shutdown before taking the socket > >> >lock to set EPOLLHUP and EPOLLRDHUP, then reads it again under the > >> >lock to report EOF readability. A shutdown packet can update > >> >peer_shutdown while poll is waiting for the lock, so one poll invocation > >> >can report EPOLLIN without the corresponding HUP/RDHUP bits. > >> > > >> >Keep non-connectible sockets on a single lockless READ_ONCE() > >> > >> Should this be paired with WRITE_ONCE() on writers? > > > >Yes, since the poll path uses lockless READ_ONCE() snapshots of > >peer_shutdown, the writer side should be annotated with WRITE_ONCE() as > >well. I will add that in v2. > > > >> > >> >snapshot. For connectible sockets, defer shutdown-derived poll bits > >> >until after lock_sock() and use one READ_ONCE() snapshot for both EOF > >> >readability and HUP/RDHUP. This preserves shutdowns that arrive before > >> >the lock is acquired and keeps all peer-shutdown-derived bits consistent > >> >for a poll pass. > >> > > >> >Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") > >> >Signed-off-by: Ziyu Zhang <[email protected]> > >> >--- > >> > net/vmw_vsock/af_vsock.c | 40 ++++++++++++++++++++++++++-------------- > >> > 1 file changed, 26 insertions(+), 14 deletions(-) > >> > > >> >diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c > >> >index adcba1b7b..bed42347b 100644 > >> >--- a/net/vmw_vsock/af_vsock.c > >> >+++ b/net/vmw_vsock/af_vsock.c > >> >@@ -1122,6 +1122,25 @@ static int vsock_shutdown(struct socket *sock, int > >> >mode) > >> > return err; > >> > } > >> > > >> >+static __poll_t vsock_poll_shutdown(struct sock *sk, u32 peer_shutdown) > >> >+{ > >> >+ __poll_t mask = 0; > >> >+ > >> >+ /* INET sockets treat local write shutdown and peer write shutdown > >> >as a > >> >+ * case of EPOLLHUP set. > >> >+ */ > >> >+ if (sk->sk_shutdown == SHUTDOWN_MASK || > >> >+ ((sk->sk_shutdown & SEND_SHUTDOWN) && > >> >+ (peer_shutdown & SEND_SHUTDOWN))) > >> >+ mask |= EPOLLHUP; > >> >+ > >> >+ if (sk->sk_shutdown & RCV_SHUTDOWN || > >> >+ peer_shutdown & SEND_SHUTDOWN) > >> >+ mask |= EPOLLRDHUP; > >> >+ > >> >+ return mask; > >> >+} > >> >+ > >> > static __poll_t vsock_poll(struct file *file, struct socket *sock, > >> > poll_table *wait) > >> > { > >> >@@ -1139,19 +1158,9 @@ static __poll_t vsock_poll(struct file *file, > >> >struct socket *sock, > >> > /* Signify that there has been an error on this socket. */ > >> > mask |= EPOLLERR; > >> > > >> >- /* INET sockets treat local write shutdown and peer write shutdown > >> >as a > >> >- * case of EPOLLHUP set. > >> >- */ > >> >- if ((sk->sk_shutdown == SHUTDOWN_MASK) || > >> >- ((sk->sk_shutdown & SEND_SHUTDOWN) && > >> >- (vsk->peer_shutdown & SEND_SHUTDOWN))) { > >> >- mask |= EPOLLHUP; > >> >- } > >> >- > >> >- if (sk->sk_shutdown & RCV_SHUTDOWN || > >> >- vsk->peer_shutdown & SEND_SHUTDOWN) { > >> >- mask |= EPOLLRDHUP; > >> >- } > >> >+ if (!sock_type_connectible(sk->sk_type)) > >> >+ mask |= vsock_poll_shutdown(sk, > >> >+ READ_ONCE(vsk->peer_shutdown)); > >> > >> Can we move this in the `if (sock->type == SOCK_DGRAM)` branch ? > >> > >> Not a strong opinion about that, but in any case IMO we should add a > >> comment here to explain why we are doing only for not connectible > >> sockets. > >> > >> That said, if we use WRITE_ONCE in the writers, do we really need to > >> move this after the lock_sock for the connectable ones? > > > >Yes, I will move the non-connectible handling into the SOCK_DGRAM branch > >and add a comment there. > > > >For connectible sockets, my current understanding is that > >READ_ONCE()/WRITE_ONCE() make the individual lockless accesses explicit, but > >they do not make two separate reads in one vsock_poll() invocation observe > >the > >same peer_shutdown value. So I still think using one peer_shutdown snapshot > >after lock_sock() is useful for keeping the returned mask internally > >consistent. Please let me know if you think WRITE_ONCE() is enough for this > >case. > > What will be the issue of "do not make two separate reads in one > vsock_poll() invocation observe the same peer_shutdown value." ? > > In any case, I'm not against it; I just want to understand the issue > better.
The issue I was trying to avoid is a transiently inconsistent poll mask from one vsock_poll() pass. For example, the early peer_shutdown read can see 0, so the shutdown-derived bits are not set, especially EPOLLRDHUP, and EPOLLHUP in the cases where the existing logic would set it. Then, before lock_sock() succeeds, the peer shutdown can be processed. The later read after lock_sock() can see SEND_SHUTDOWN and set EPOLLIN|EPOLLRDNORM for EOF readability. So the returned mask can say that EOF is readable, but miss the shutdown-derived indication from the same peer_shutdown state. If userspace is waiting specifically for EPOLLRDHUP, that notification can be missed or delayed for that poll pass. I agree this is small and transient. A following read() or another poll pass will observe the shutdown state. My intention with the single snapshot is only to avoid mixing old and new peer_shutdown values in one returned mask. I will send a v2 as a new thread with WRITE_ONCE() on the writers, the DGRAM comment, and your ordering suggestion. Thanks, Ziyu > > Thanks, > Stefano >

