On Fri, Jun 28, 2019 at 07:12:41PM -0400, Joel Fernandes wrote: > On Fri, Jun 28, 2019 at 03:25:47PM -0700, Paul E. McKenney wrote: > > On Fri, Jun 28, 2019 at 05:40:18PM -0400, Joel Fernandes wrote: > > > Hi Paul, > > > > > > On Fri, Jun 28, 2019 at 01:04:23PM -0700, Paul E. McKenney wrote: > > > [snip] > > > > > > > Commit > > > > > > > - 23634ebc1d946 ("rcu: Check for wakeup-safe conditions in > > > > > > > rcu_read_unlock_special()") does not trigger the bug within 94 > > > > > > > attempts. > > > > > > > > > > > > > > - 48d07c04b4cc1 ("rcu: Enable elimination of Tree-RCU softirq > > > > > > > processing") needed 12 attempts to trigger the bug. > > > > > > > > > > > > That matches my belief that 23634ebc1d946 ("rcu: Check for > > > > > > wakeup-safe > > > > > > conditions in rcu_read_unlock_special()") will at least greatly > > > > > > decrease > > > > > > the probability of this bug occurring. > > > > > > > > > > I was just typing a reply that I can't reproduce it with: > > > > > rcu: Check for wakeup-safe conditions in rcu_read_unlock_special() > > > > > > > > > > I am trying to revert enough of this patch to see what would break > > > > > things, > > > > > however I think a better exercise might be to understand more what > > > > > the patch > > > > > does why it fixes things in the first place ;-) It is probably the > > > > > deferred_qs thing. > > > > > > > > The deferred_qs flag is part of it! Looking forward to hearing what > > > > you come up with as being the critical piece of this commit. > > > > > > The new deferred_qs flag indeed saves the machine from the dead-lock. > > > > > > If we don't want the deferred_qs, then the below patch also fixes the > > > issue. > > > However, I am more sure than not that it does not handle all cases (such > > > as > > > what if we previously had an expedited grace period IPI in a previous > > > reader > > > section and had to to defer processing. Then it seems a similar deadlock > > > would present. But anyway, the below patch does fix it for me! It is > > > based on > > > your -rcu tree commit 23634ebc1d946f19eb112d4455c1d84948875e31 (rcu: Check > > > for wakeup-safe conditions in rcu_read_unlock_special()). > > > > The point here being that you rely on .b.blocked rather than > > .b.deferred_qs. Hmmm... There are a number of places that check all > > the bits via the .s leg of the rcu_special union. The .s check in > > rcu_preempt_need_deferred_qs() should be OK because it is conditioned > > on t->rcu_read_lock_nesting of zero or negative. > > Do rest of those also work out OK? > > > > It would be nice to remove the flag, but doing so clearly needs careful > > review and testing. > > Agreed. I am planning to do an audit of this code within the next couple of > weeks so I will be on the look out for any optimization opportunities related > to this. Will let you know if this can work. For now I like your patch better > because it is more conservative and doesn't cause any space overhead.
Fixing the bug in a maintainable manner is the priority, to be sure. However, simplifications, assuming that they work, are very much worth considering as well. And Murphy says that there are still a number of bugs and optimization opportunities. ;-) > If you'd like, please free to included my Tested-by on it: > > Tested-by: Joel Fernandes (Google) <j...@joelfernandes.org> Will do, thank you! > If you had a chance, could you also point to me any tests that show > performance improvement with the irqwork patch, on the expedited GP usecase? > I'd like to try it out as well. I guess rcuperf should have some? As a first thing to try, I suggest running rcuperf with both readers and writers, with only expedited grace periods, and with most (or maybe even all) CPUs having nohz_full enabled. Thanx, Paul