On Fri, Nov 02, 2018 at 01:00:00PM -0700, Paul E. McKenney wrote:
[..]
> > I think it would make sense also to combine it with other memory-ordering
> > topics like the memory model and rseq/cpu-opv things that Mathieu was doing
> > (if it makes sense to combine). But yes, I am definitely interested in an
> > RCU-implementation BoF session.
>
> There is an LKMM kernel summit track presentation. I believe that
> Mathieu's rseq/cpu-opv would be a good one as well, but Mathieu needs
> to lead this up and it should be a separate BoF. Please do feel free
> to reach out to him. I am sure that he would be particularly interested
> in potential uses of rseq and especially cpu-opv.
Cool! Looking forward to the LKMM talk, and I'll keep in mind to reach out
to Mathieu about rseq use cases.

> > > > > > Also about GP memory ordering and RCU-tree-locking, I think you
> > > > > > mentioned to me that the RCU reader-sections are virtually extended
> > > > > > both forward and backward, and wherever they end, those paths do
> > > > > > heavy-weight synchronization that should be sufficient to prevent
> > > > > > memory ordering issues (such as those you mentioned in the
> > > > > > Requirements document). That is exactly why we don't need explicit
> > > > > > barriers during rcu_read_unlock. If I recall, I asked you why those
> > > > > > are not needed, and that answer made sense. But now, going through
> > > > > > the 'Memory Ordering' document, I see that you mentioned there is
> > > > > > reliance on the locking. Is that reliance on locking necessary to
> > > > > > maintain ordering, then?
> > > > >
> > > > > There is a "network" of locking augmented by smp_mb__after_unlock_lock()
> > > > > that implements the all-to-all memory ordering mentioned above. But it
> > > > > also needs to handle all the possible complete()/wait_for_completion()
> > > > > races, even those assisted by hypervisor vCPU preemption.
> > > >
> > > > I see, so it sounds like the lock network is just a partial solution.
> > > > For some reason I thought that before complete() was even called on the
> > > > CPU executing the callback, all the CPUs would have acquired and released
> > > > a lock in the "lock network" at least once, thus ensuring the ordering
> > > > (due to the fact that the quiescent-state reporting has to travel up the
> > > > tree starting from the leaves), but I think that's not necessarily true,
> > > > so I see your point now.
> > >
> > > There is indeed a lock that is unconditionally acquired and released by
> > > wait_for_completion(), but it lacks the smp_mb__after_unlock_lock() that
> > > is required to get full-up any-to-any ordering. And unfortunate timing
> > > (as well as spurious wakeups) allows the interaction to have only normal
> > > lock-release/acquire ordering, which does not suffice in all cases.
> > >
> > > SRCU and expedited RCU grace periods handle this correctly. Only the
> > > normal grace periods are missing the needed barrier. The probability of
> > > failure is extremely low in the common case, which involves all sorts
> > > of synchronization on the wakeup path. It would be quite strange (but
> > > not impossible) for the wait_for_completion() exit path to -not- do
> > > a full wakeup. Plus the bug requires a reader before the grace period
> > > to do a store to some location that post-grace-period code loads from,
> > > which is a very rare use case.
> > >
> > > But it still should be fixed. ;-)
> > >
> > > > Did you feel this will violate condition 1 or condition 2 in
> > > > "Memory-Barrier Guarantees"? Or both?
> > > > https://www.kernel.org/doc/Documentation/RCU/Design/Requirements/Requirements.html#Memory-Barrier%20Guarantees
> > >
> > > Condition 1. There might be some strange combination of events that
> > > could cause it to also violate condition 2, but I am not immediately
> > > seeing it.
> >
> > In the previous paragraph, you mentioned that the bug "requires a reader
> > before the GP to do a store". However, condition 1 is really different:
> > it is about a reader holding a reference to a pointer that is used *after*
> > synchronize_rcu() returns. So that reader's load of the pointer should
> > have completed by the time the GP ends, otherwise the reader can look at
> > kfree'd data. That's different, right?
>
> More specifically, the fix prevents a prior reader's -store- within its
> critical section from being seen as happening after a load that follows
> the end of the grace period. So I stand by Condition 1. ;-)
>
> And again, a store within an RCU read-side critical section is a bit
> on the strange side, but this sort of thing is perfectly legal and
> is used, albeit rather rarely.

Cool :) I never thought about condition 1 this way, but it is good to know
that's possible :)
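To make that store-in-a-reader case concrete, here is a rough sketch of the
scenario under discussion (the variable x and the reader()/updater() functions
are made-up names for illustration; this is not code from the thread or from
the patch):

#include <linux/rcupdate.h>

static int x;	/* hypothetical location stored to by a reader */

static void reader(void)
{
	rcu_read_lock();
	WRITE_ONCE(x, 1);	/* the unusual part: a store in the reader */
	rcu_read_unlock();
}

static void updater(void)
{
	int r1;

	synchronize_rcu();	/* waits for pre-existing readers */
	r1 = READ_ONCE(x);	/*
				 * If the reader's critical section began
				 * before the grace period started, this
				 * load must see x == 1.  The smp_mb()
				 * being discussed closes the "fly-by"
				 * wakeup hole in that guarantee.
				 */
}

The only point of the sketch is that the reader performs a store rather than
the usual load, which is the rare case the fix is about.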
> > For condition 2, I analyzed it below, let me know what you think:
> >
> > > 						Thanx, Paul
> > >
> > > ------------------------------------------------------------------------
> > >
> > > commit bf3c11b7b9789283f993d9beb80caaabc4403916
> > > Author: Paul E. McKenney <paul...@linux.ibm.com>
> > > Date:   Thu Nov 1 09:05:02 2018 -0700
> > >
> > >     rcu: Add full memory barrier in __wait_rcu_gp()
> > >
> > >     RCU grace periods have extremely strong any-to-any ordering
> > >     requirements that are met by placing full barriers in various places
> > >     in the grace-period computation.  However, normal grace period
> > >     requests might be subjected to a "fly-by" wakeup in which the
> > >     requesting process doesn't actually sleep and in which the
> > >     corresponding CPU is not yet aware that the grace period has ended.
> > >     In this case, loads in the code immediately following the
> > >     synchronize_rcu() call might potentially see the values that were
> > >     present before stores preceding the grace period on other CPUs.
> > >
> > >     This is an unusual use case, because RCU readers normally read.
> > >     However, they can do writes, and if they do, we need
> > >     post-grace-period reads to see those writes.
> > >
> > >     This commit therefore adds an smp_mb() to the end of __wait_rcu_gp().
> > >
> > >     Many thanks to Joel Fernandes for the series of questions leading to
> > >     me realizing that this bug exists!
> > >
> > >     Signed-off-by: Paul E. McKenney <paul...@linux.ibm.com>
> > >
> > > diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
> > > index 1971869c4072..74020b558216 100644
> > > --- a/kernel/rcu/update.c
> > > +++ b/kernel/rcu/update.c
> > > @@ -360,6 +360,7 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
> > >  		wait_for_completion(&rs_array[i].completion);
> > >  		destroy_rcu_head_on_stack(&rs_array[i].head);
> > >  	}
> > > +	smp_mb(); /* Provide ordering in case of fly-by wakeup. */
> > >  }
> > >  EXPORT_SYMBOL_GPL(__wait_rcu_gp);
> >
> > The fix looks fine to me. Thanks.
> >
> > If I understand correctly, the wait_for_completion() is an ACQUIRE
> > operation and the complete() is a RELEASE operation, aka the "MP pattern".
> > The ACQUIRE/RELEASE semantics allow any writes that happened before the
> > ACQUIRE to get ordered after it. So that would actually imply it is not
> > strong enough for ordering purposes during a "fly-by" wakeup scenario, and
> > would be a violation of CONDITION 2, I think (not only condition 1 as you
> > said). This is because future readers may accidentally see the writes
> > that happened *before* the synchronize_rcu, which is CONDITION 2 in the
> > requirements: https://goo.gl/8mrDHN (I had to shortlink it ;))
>
> I do appreciate the shorter link. ;-)
>
> A write happening before the grace period is ordered by the grace period's
> network of strong barriers, so the fix does not matter in that case.

I was talking about the acquire/release pairs in the spin_lock() and
spin_unlock() calls in wait_for_completion(), not the grace period's network
of rnp locks. Does that make sense? I thought that during a "fly-by" wakeup,
that network of strong barriers doesn't really trigger, and that that's the
problematic scenario. Did I miss something? Again, I was talking about the
acquire/release pair in wait_for_completion() during that fly-by scenario.
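For what it's worth, the ACQUIRE property referred to above can be sketched
roughly as follows (the lock and the variables x, y and r1 are made-up names;
this is a generic illustration of lock-acquire ordering, not the actual
completion code path):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);	/* hypothetical lock */
static int x, y;			/* hypothetical shared variables */

static void acquire_is_not_a_full_barrier(void)
{
	int r1;

	WRITE_ONCE(x, 1);	/* store issued before the ACQUIRE ... */
	spin_lock(&my_lock);	/*
				 * ACQUIRE: does not prevent the store above
				 * from completing after this point, ...
				 */
	r1 = READ_ONCE(y);	/*
				 * ... so this load can be satisfied before
				 * the store to x is visible to other CPUs.
				 * A full smp_mb() between the two (the patch
				 * adds such a barrier after the
				 * wait_for_completion() loop) would forbid
				 * that reordering.
				 */
	spin_unlock(&my_lock);	/* RELEASE */
}

This matches memory-barriers.txt, which notes that memory operations issued
before an ACQUIRE may be completed after the ACQUIRE has completed.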
> Also, the exact end of the grace period is irrelevant for Condition 2;
> it is instead the beginning of the grace period compared to the beginning
> of later RCU read-side critical sections.
>
> Not saying that Condition 2 cannot somehow happen without the memory
> barrier, just saying that it will take quite a bit more creativity to
> find a relevant scenario.
>
> Please see below for the updated patch, containing only the typo fix.
> Please let me know if I messed anything up.

Looks good to me, thanks!

- Joel