On Tue, Mar 07, 2017 at 08:05:19AM +0100, Dmitry Vyukov wrote:
> On Tue, Mar 7, 2017 at 12:08 AM, Paul E. McKenney
> <paul...@linux.vnet.ibm.com> wrote:
> >> >> >> >> Hello,
> >> >> >> >>
> >> >> >> >> Paul, you wanted bugs in rcu.
> >> >> >> >
> >> >> >> > Well, whether I want them or not, I must deal with them.  ;-)
> >> >> >> >
> >> >> >> >> I've got this WARNING while running syzkaller fuzzer on
> >> >> >> >> 86292b33d4b79ee03e2f43ea0381ef85f077c760:
> >> >> >> >>
> >> >> >> >> ------------[ cut here ]------------
> >> >> >> >> WARNING: CPU: 0 PID: 4832 at kernel/rcu/tree.c:3533
> >> >> >> >> rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
> >> >> >> >> Kernel panic - not syncing: panic_on_warn set ...
> >> >> >> >> CPU: 0 PID: 4832 Comm: kworker/0:3 Not tainted 4.10.0+ #276
> >> >> >> >> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
> >> >> >> >> 01/01/2011
> >> >> >> >> Workqueue: events wait_rcu_exp_gp
> >> >> >> >> Call Trace:
> >> >> >> >>  __dump_stack lib/dump_stack.c:15 [inline]
> >> >> >> >>  dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
> >> >> >> >>  panic+0x1fb/0x412 kernel/panic.c:179
> >> >> >> >>  __warn+0x1c4/0x1e0 kernel/panic.c:540
> >> >> >> >>  warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
> >> >> >> >>  rcu_seq_end+0x110/0x140 kernel/rcu/tree.c:3533
> >> >> >> >>  rcu_exp_gp_seq_end kernel/rcu/tree_exp.h:36 [inline]
> >> >> >> >>  rcu_exp_wait_wake+0x8a9/0x1330 kernel/rcu/tree_exp.h:517
> >> >> >> >>  rcu_exp_sel_wait_wake kernel/rcu/tree_exp.h:559 [inline]
> >> >> >> >>  wait_rcu_exp_gp+0x83/0xc0 kernel/rcu/tree_exp.h:570
> >> >> >> >>  process_one_work+0xc06/0x1c20 kernel/workqueue.c:2096
> >> >> >> >>  worker_thread+0x223/0x19c0 kernel/workqueue.c:2230
> >> >> >> >>  kthread+0x326/0x3f0 kernel/kthread.c:227
> >> >> >> >>  ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
> >> >> >> >> Dumping ftrace buffer:
> >> >> >> >>    (ftrace buffer empty)
> >> >> >> >> Kernel Offset: disabled
> >> >> >> >> Rebooting in 86400 seconds..
> >> >> >> >>
> >> >> >> >> Not reproducible.  But looking at the code, shouldn't it be:
> >> >> >> >>
> >> >> >> >> static void rcu_seq_end(unsigned long *sp)
> >> >> >> >> {
> >> >> >> >>         smp_mb(); /* Ensure update-side operation before counter increment. */
> >> >> >> >> +       WARN_ON_ONCE(!(*sp & 0x1));
> >> >> >> >>         WRITE_ONCE(*sp, *sp + 1);
> >> >> >> >> -       WARN_ON_ONCE(*sp & 0x1);
> >> >> >> >> }
> >> >> >> >>
> >> >> >> >> ?
> >> >> >> >>
> >> >> >> >> Otherwise wait_event in _synchronize_rcu_expedited can return as
> >> >> >> >> soon as WRITE_ONCE(*sp, *sp + 1) finishes.  As far as I understand,
> >> >> >> >> this consequently can allow the next grace period to start, which
> >> >> >> >> in turn can make the warning fire.  Am I missing something?
> >> >> >> >>
> >> >> >> >> I don't see any other bad consequences of this.  The rest of
> >> >> >> >> rcu_exp_wait_wake can proceed after _synchronize_rcu_expedited has
> >> >> >> >> returned and destroyed the work on its stack and the next period
> >> >> >> >> has started and ended, but that seems OK.
> >> >> >> >
> >> >> >> > I believe that this is a good change, but I don't see how it will
> >> >> >> > help in this case.  BTW, may I have your Signed-off-by?
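For readers following along, the low bit of the sequence counter is the
grace-period-in-flight flag: odd means a grace period is in progress,
even means idle.  Below is a minimal user-space sketch of that protocol
and of why the relocated WARN_ON_ONCE() is the right assertion; this is
illustrative only, not the actual kernel/rcu/tree.c code.

#include <stdio.h>

static unsigned long seq;	/* toy stand-in for ->expedited_sequence */

static void seq_start(unsigned long *sp)
{
	(*sp)++;		/* Low bit 0 -> 1: grace period begins. */
	/* The kernel does smp_mb() here. */
}

static void seq_end(unsigned long *sp)
{
	/* The kernel does smp_mb() here. */
	if (!(*sp & 0x1))	/* Dmitry's relocated WARN_ON_ONCE(). */
		fprintf(stderr, "WARN: seq_end() with no GP in flight\n");
	(*sp)++;		/* Low bit 1 -> 0: grace period complete.
				 * A waiter polling this counter may return
				 * the instant this store is visible, so a
				 * check placed after the store races with
				 * the next seq_start(). */
}

int main(void)
{
	seq_start(&seq);
	seq_end(&seq);
	seq_end(&seq);		/* Double end: the sketch's WARN fires. */
	return 0;
}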
> >> >> >> >
> >> >> >> > The reason I don't believe that it will help is that the
> >> >> >> > rcu_exp_gp_seq_end() function is called from a workqueue handler that
> >> >> >> > is invoked holding ->exp_mutex, and this mutex is not released until
> >> >> >> > after the handler invokes rcu_seq_end() and then wakes up the task that
> >> >> >> > scheduled the workqueue handler.  So the ordering above should not
> >> >> >> > matter (but I agree that your ordering is cleaner).
> >> >> >> >
> >> >> >> > That said, it looks like I am missing some memory barriers, please
> >> >> >> > see the following patch.
> >> >> >> >
> >> >> >> > But what architecture did you see this on?
> >> >> >>
> >> >> >> This is just x86.
> >> >> >>
> >> >> >> You seem to assume that wait_event() waits for the wakeup.  It does not
> >> >> >> work this way.  It can return as soon as the condition becomes true,
> >> >> >> without ever waiting:
> >> >> >>
> >> >> >> 305 #define wait_event(wq, condition)                              \
> >> >> >> 306 do {                                                           \
> >> >> >> 307         might_sleep();                                         \
> >> >> >> 308         if (condition)                                         \
> >> >> >> 309                 break;                                         \
> >> >> >> 310         __wait_event(wq, condition);                           \
> >> >> >> 311 } while (0)
> >> >> >
> >> >> > Agreed, hence my patch in the previous email.  I guess I knew that, but
> >> >>
> >> >> Ah, you meant to synchronize rcu_seq_end with rcu_seq_done?
> >> >
> >> > No, there is a mutex release and acquisition that do the synchronization,
> >> > but only if the wakeup has appropriate barriers.  The issue is that
> >> > part of the mutex's critical section executes in a workqueue, possibly
> >> > on some other CPU.
> >>
> >> What is that mutex?  And what locks/unlocks provide synchronization?  I
> >> see that one uses exp_mutex and another uses exp_wake_mutex.
> >
> > Both of them.
> >
> > ->exp_mutex is acquired by the task requesting the grace period, and
> > the counter's first increment is done by that task under that mutex.
> > This task then schedules a workqueue, which drives forward the grace
> > period.  Upon grace-period completion, the workqueue handler does the
> > second increment (the one that your patch addressed).  The workqueue
> > handler then acquires ->exp_wake_mutex and wakes the task that holds
> > ->exp_mutex (along with all other tasks waiting for this grace period),
> > and that task releases ->exp_mutex, which allows the next grace period to
> > start (and the first increment for that next grace period to be carried
> > out under that lock).  The workqueue handler releases ->exp_wake_mutex
> > after finishing its wakeups.
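Restated as code, the handshake just described looks roughly like the
following pthreads sketch.  The names and structure are made up for
illustration; the real logic lives in kernel/rcu/tree_exp.h.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t exp_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t exp_wake_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long exp_seq;	/* even: idle, odd: GP in flight */

/* Stand-in for the workqueue handler, possibly on another CPU. */
static void *gp_handler(void *unused)
{
	/* ... drive the grace period to completion ... */
	__sync_synchronize();			/* stand-in for smp_mb() */
	exp_seq++;				/* second increment: GP done */
	pthread_mutex_lock(&exp_wake_mutex);
	/* Wake the requester and all other waiters for this GP. */
	pthread_mutex_unlock(&exp_wake_mutex);
	return NULL;
}

/* Stand-in for the task requesting the expedited grace period. */
static void synchronize_expedited(void)
{
	pthread_t wq;

	pthread_mutex_lock(&exp_mutex);		/* serializes grace periods */
	exp_seq++;				/* first increment: GP starts */
	pthread_create(&wq, NULL, gp_handler, NULL);
	/* The real code wait_event()s on the second increment, and so may
	 * return without ever sleeping; a join is this sketch's stand-in. */
	pthread_join(wq, NULL);
	pthread_mutex_unlock(&exp_mutex);	/* next GP may now start */
}

int main(void)
{
	synchronize_expedited();
	printf("exp_seq = %lu\n", exp_seq);	/* prints 2 */
	return 0;
}

The point of the sketch: the second increment always happens while the
requester still holds ->exp_mutex, which is why the ordering of that
increment and the WARN_ON_ONCE() should not matter for this warning.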
First and foremost, thank you for reviewing this stuff!!!

> Then we need the following for the case when the task requesting the
> grace period does not block, right?
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index d80c2587bed8..aa7ba83f6a56 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3534,7 +3534,7 @@ static void rcu_seq_start(unsigned long *sp)
>  static void rcu_seq_end(unsigned long *sp)
>  {
>         smp_mb(); /* Ensure update-side operation before counter increment. */
> -       WRITE_ONCE(*sp, *sp + 1);
> +       smp_store_release(sp, *sp + 1);
>         WARN_ON_ONCE(*sp & 0x1);
>  }

The above is not needed, as the smp_mb() covers it completely.

> @@ -3554,7 +3554,7 @@ static unsigned long rcu_seq_snap(unsigned long *sp)
>   */
>  static bool rcu_seq_done(unsigned long *sp, unsigned long s)
>  {
> -       return ULONG_CMP_GE(READ_ONCE(*sp), s);
> +       return ULONG_CMP_GE(smp_load_acquire(sp), s);

And this one is covered by the smp_mb__before_atomic() inside the "if"
statement in sync_exp_work_done(), via rcu_exp_gp_seq_done().  That is,
sync_exp_work_done() invokes rcu_exp_gp_seq_done(), which in turn
invokes rcu_seq_done().

Why smp_mb() instead of release-acquire?  Because the grace-period
relationship is quite strong: it cannot be implemented with
release-acquire on all platforms.

							Thanx, Paul
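P.S.  For anyone mapping the above onto portable primitives, the pattern
being described is a full fence on each side rather than a
release/acquire pair.  A rough C11 sketch follows, illustrative only;
the kernel's actual barriers and callers differ.

#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong gp_seq;

/* Update side, as in rcu_seq_end(): full fence, then a plain store. */
static void seq_end(void)
{
	atomic_thread_fence(memory_order_seq_cst);	/* smp_mb() */
	atomic_store_explicit(&gp_seq,
		atomic_load_explicit(&gp_seq, memory_order_relaxed) + 1,
		memory_order_relaxed);
}

/* Poll side, as in sync_exp_work_done(): a plain load, with the full
 * fence supplied inside the "if" only once the GP is seen as done. */
static int seq_done(unsigned long s)
{
	unsigned long cur = atomic_load_explicit(&gp_seq,
						 memory_order_relaxed);

	if ((long)(cur - s) >= 0) {		/* ULONG_CMP_GE() */
		/* smp_mb__before_atomic() et al. in the kernel */
		atomic_thread_fence(memory_order_seq_cst);
		return 1;
	}
	return 0;
}

int main(void)
{
	unsigned long snap = atomic_load_explicit(&gp_seq,
						  memory_order_relaxed) + 2;

	seq_end();	/* start..end compressed into two bumps */
	seq_end();
	printf("done = %d\n", seq_done(snap));	/* prints 1 */
	return 0;
}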