On Fri, Jul 18, 2025 at 8:52 AM Nam Cao <nam...@linutronix.de> wrote:
>
> ep_events_available() checks for available events by looking at
> ep->rdllist and ep->ovflist. However, this is done without a lock, so the
> returned value is not reliable: it is possible for both checks on
> ep->rdllist and ep->ovflist to be false while ep_start_scan() or
> ep_done_scan() is being executed on another CPU, even though events are
> available.
>
> This bug can be observed by:
>
> 1. Create an eventpoll with at least one ready level-triggered event
>
> 2. Create multiple threads that call epoll_wait() with a zero timeout.
>    The threads do not consume the events, therefore every epoll_wait()
>    should return at least one event.
>
> If one thread is executing ep_events_available() while another thread is
> executing ep_start_scan() or ep_done_scan(), epoll_wait() may wrongly
> return no event for the former thread.
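
For reference, the scenario above can be driven from userspace roughly
like this (a rough sketch, not the actual TEST(epoll65) reproducer; the
pipe-based event source and the thread/iteration counts are made up for
illustration; build with -lpthread):

/*
 * A pipe with unread data is level-triggered, so every zero-timeout
 * epoll_wait() should report it ready; with the racy
 * ep_events_available() check, some calls can return 0 instead.
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

static int epfd;

static void *waiter(void *arg)
{
        struct epoll_event ev;

        for (int i = 0; i < 100000; i++) {
                /* The data is never consumed, so the fd stays ready. */
                if (epoll_wait(epfd, &ev, 1, 0) == 0)
                        fprintf(stderr, "missed a ready event\n");
        }
        return NULL;
}

int main(void)
{
        struct epoll_event ev = { .events = EPOLLIN };
        pthread_t threads[4];
        int pipefd[2];

        if (pipe(pipefd) || write(pipefd[1], "x", 1) != 1)
                return 1;

        epfd = epoll_create1(0);
        ev.data.fd = pipefd[0];
        if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev))
                return 1;

        for (int i = 0; i < 4; i++)
                pthread_create(&threads[i], NULL, waiter, NULL);
        for (int i = 0; i < 4; i++)
                pthread_join(threads[i], NULL);
        return 0;
}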
That is the whole point of epoll_wait with a zero timeout: we want to
opportunistically poll without much overhead, which will have more false
positives. A caller that uses a zero timeout should retry later, and will
at some point observe the event. (A sketch of this caller pattern appears
below the quoted patch.)

I'm not sure we would want to add much more overhead for higher
precision.

Thanks,
Soheil

> This reproducer is implemented as TEST(epoll65) in
> tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
>
> Fix it by skipping ep_events_available() and calling
> ep_try_send_events() directly.
>
> epoll_sendevents() (io_uring) suffers from the same problem; fix that as
> well.
>
> There is still ep_busy_loop(), which uses ep_events_available() without
> a lock, but that is probably okay (?) for busy-polling.
>
> Fixes: c5a282e9635e ("fs/epoll: reduce the scope of wq lock in epoll_wait()")
> Fixes: e59d3c64cba6 ("epoll: eliminate unnecessary lock for zero timeout")
> Fixes: ae3a4f1fdc2c ("eventpoll: add epoll_sendevents() helper")
> Signed-off-by: Nam Cao <nam...@linutronix.de>
> Cc: sta...@vger.kernel.org
> ---
>  fs/eventpoll.c | 16 ++--------------
>  1 file changed, 2 insertions(+), 14 deletions(-)
>
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 0fbf5dfedb24..541481eafc20 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -2022,7 +2022,7 @@ static int ep_schedule_timeout(ktime_t *to)
>  static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>                     int maxevents, struct timespec64 *timeout)
>  {
> -        int res, eavail, timed_out = 0;
> +        int res, eavail = 1, timed_out = 0;
>          u64 slack = 0;
>          wait_queue_entry_t wait;
>          ktime_t expires, *to = NULL;
> @@ -2041,16 +2041,6 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>                  timed_out = 1;
>          }
>
> -        /*
> -         * This call is racy: We may or may not see events that are being added
> -         * to the ready list under the lock (e.g., in IRQ callbacks). For cases
> -         * with a non-zero timeout, this thread will check the ready list under
> -         * lock and will add to the wait queue. For cases with a zero
> -         * timeout, the user by definition should not care and will have to
> -         * recheck again.
> -         */
> -        eavail = ep_events_available(ep);
> -
>          while (1) {
>                  if (eavail) {
>                          res = ep_try_send_events(ep, events, maxevents);
> @@ -2496,9 +2486,7 @@ int epoll_sendevents(struct file *file, struct epoll_event __user *events,
>          /*
>           * Racy call, but that's ok - it should get retried based on
>           * poll readiness anyway.
>           */
> -        if (ep_events_available(ep))
> -                return ep_try_send_events(ep, events, maxevents);
> -        return 0;
> +        return ep_try_send_events(ep, events, maxevents);
>  }
>
> --
> 2.39.5
>
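
To make the caller contract mentioned above concrete, here is a minimal
sketch (wait_for_events() is a hypothetical helper, not an existing API):
a zero-timeout epoll_wait() is only an opportunistic check, so a caller
treats a return of 0 as "recheck later", e.g. by falling back to a
blocking wait:

#include <sys/epoll.h>

/*
 * Hypothetical helper: poll opportunistically, then fall back to a
 * blocking wait if nothing was reported.
 */
static int wait_for_events(int epfd, struct epoll_event *events,
                           int maxevents)
{
        /*
         * Cheap, racy fast path: 0 here means "maybe nothing", not
         * "definitely nothing".
         */
        int n = epoll_wait(epfd, events, maxevents, 0);

        if (n != 0)
                return n;

        /* Precise path: block until an event actually arrives. */
        return epoll_wait(epfd, events, maxevents, -1);
}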