On 1/24/25 1:33 PM, Salvatore Bonaccorso wrote: > Hi Pavel, > > On Fri, Jan 24, 2025 at 06:40:51PM +0000, Pavel Begunkov wrote: >> On 1/24/25 16:30, Xan Charbonnet wrote: >>> On 1/24/25 04:33, Pavel Begunkov wrote: >>>> Thanks for narrowing it down. Xan, can you try this change please? >>>> Waiters can miss wake ups without it, seems to match the description. >>>> >>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c >>>> index 9b58ba4616d40..e5a8ee944ef59 100644 >>>> --- a/io_uring/io_uring.c >>>> +++ b/io_uring/io_uring.c >>>> @@ -592,8 +592,10 @@ static inline void __io_cq_unlock_post_flush(struct >>>> io_ring_ctx *ctx) >>>> io_commit_cqring(ctx); >>>> spin_unlock(&ctx->completion_lock); >>>> io_commit_cqring_flush(ctx); >>>> - if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) >>>> + if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN)) { >>>> + smp_mb(); >>>> __io_cqring_wake(ctx); >>>> + } >>>> } >>>> void io_cq_unlock_post(struct io_ring_ctx *ctx) >>>> >>> >>> >>> Thanks Pavel! Early results look very good for this change. I'm now >>> running 6.1.120 with your added smp_mb() call. The backup process which >>> had been quickly triggering the issue has been running longer than it ever >>> did when it would ultimately fail. So that's great! >>> >>> One sour note: overnight, replication hung on this machine, which is >>> another failure that started happening with the jump from 6.1.119 to >>> 6.1.123. The machine was running 6.1.124 with the >>> __io_cq_unlock_post_flush function removed completely. That's the kernel >>> we had celebrated yesterday for running the backup process successfully. >>> >>> So, we might have two separate issues to deal with, unfortunately. >> >> Possible, but it could also be a side effect of reverting the patch. >> As usual, in most cases patches are ported either because they're >> fixing sth or other fixes depend on it, and it's not yet apparent >> to me what happened with this one. > > I researched bit the lists, and there was the inclusion request on the > stable list itself. Looking into the io-uring list I found > https://lore.kernel.org/io-uring/CADZouDRFJ9jtXHqkX-PTKeT=gxswdmc42zesakr34psug9t...@mail.gmail.com/ > which I think was the trigger to later on include in fact the commit > in 6.1.120.
Yep indeed, was just looking for the backstory and that is why it got backported. Just missed the fact that it should've been an io_cqring_wake() rather than __io_cqring_wake()... -- Jens Axboe