On Sun, Nov 04, 2018 at 07:43:30PM -0800, Paul E. McKenney wrote:
[...]
> > > > > > > > Also about GP memory ordering and RCU-tree-locking, I
> > > > > > > > think you mentioned to me that the RCU reader-sections
> > > > > > > > are virtually extended both forward and backward, and
> > > > > > > > wherever they end, those paths do heavy-weight
> > > > > > > > synchronization that should be sufficient to prevent
> > > > > > > > memory ordering issues (such as those you mentioned in
> > > > > > > > the Requirements document). That is exactly why we don't
> > > > > > > > need explicit barriers during rcu_read_unlock(). If I
> > > > > > > > recall, I asked you why those are not needed. That answer
> > > > > > > > made sense, but now, going through the 'Memory Ordering'
> > > > > > > > document, I see that you mentioned there is reliance on
> > > > > > > > the locking. Is that reliance on locking necessary to
> > > > > > > > maintain ordering then?
> > > > > > >
> > > > > > > There is a "network" of locking augmented by
> > > > > > > smp_mb__after_unlock_lock() that implements the all-to-all
> > > > > > > memory ordering mentioned above. But it also needs to
> > > > > > > handle all the possible complete()/wait_for_completion()
> > > > > > > races, even those assisted by hypervisor vCPU preemption.
> > > > > >
> > > > > > I see, so it sounds like the lock network is just a partial
> > > > > > solution. For some reason I thought that, before complete()
> > > > > > was even called on the CPU executing the callback, all the
> > > > > > CPUs would have acquired and released a lock in the "lock
> > > > > > network" at least once, thus ensuring the ordering (due to
> > > > > > the fact that the quiescent-state reporting has to travel up
> > > > > > the tree starting from the leaves), but I think that's not
> > > > > > necessarily true, so I see your point now.
> > > > >
> > > > > There is indeed a lock that is unconditionally acquired and
> > > > > released by wait_for_completion(), but it lacks the
> > > > > smp_mb__after_unlock_lock() that is required to get full-up
> > > > > any-to-any ordering. And unfortunate timing (as well as
> > > > > spurious wakeups) allows the interaction to have only normal
> > > > > lock-release/acquire ordering, which does not suffice in all
> > > > > cases.
> > > >
> > > > Sorry to be so persistent, but I did spend some time on this
> > > > and I still don't get why every CPU would _not_ have executed
> > > > smp_mb__after_unlock_lock() at least once before
> > > > wait_for_completion() returns, because every CPU should have
> > > > called rcu_report_qs_rdp() -> rcu_report_qs_rnp() at least once
> > > > to report its QS up the tree, right? Before that happens, the
> > > > complete() cannot run, because the complete() itself is in an
> > > > RCU callback which is executed only once all the QSes have been
> > > > reported.
> > > >
> > > > So I still couldn't see how synchronize_rcu() can return
> > > > without rcu_report_qs_rnp() being called at least once on the
> > > > CPU reporting its QS during a grace period.
> > > >
> > > > Would it be possible to provide a small example showing this in
> > > > the least number of steps? I appreciate your time and it would
> > > > be really helpful. If you feel it's too complicated, then feel
> > > > free to keep this for LPC discussion :)
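To make my question concrete, here is the pattern I had in mind when I
said every CPU passes through the lock network (sketching from memory
of kernel/rcu/rcu.h and kernel/rcu/tree.c, so the details may well be
off):

/*
 * Every acquisition of an rcu_node lock goes through a wrapper that
 * upgrades the preceding unlock+lock pair to a full memory barrier
 * on architectures (e.g. PowerPC) where it would otherwise be weaker:
 */
#define raw_spin_lock_rcu_node(p)					\
do {									\
	raw_spin_lock(&ACCESS_PRIVATE(p, lock));			\
	smp_mb__after_unlock_lock();					\
} while (0)

/*
 * And my (possibly wrong) mental model of the QS report is a walk
 * from the leaf rcu_node toward the root, taking each node's lock
 * through such a wrapper, so every level executes the full barrier:
 */
	for (; rnp != NULL; rnp = rnp->parent) {
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		/* ...record the quiescent state in rnp... */
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	}

My (apparently mistaken) assumption was that this walk alone hands
every CPU its barrier before the complete() can possibly run.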
> > > The key point is that "at least once" does not suffice, other
> > > than for the CPU that detects the end of the grace period. The
> > > rest of the CPUs must do at least -two- full barriers, which
> > > could of course be either smp_mb() on the one hand or
> > > smp_mb__after_unlock_lock() after a lock on the other.
> >
> > I thought I'd at least get an understanding of the "at least two
> > full barriers" point and then ask you any questions at LPC, because
> > that's what I'm missing, I think: trying to understand what can go
> > wrong without two full barriers. I'm sure an RCU-implementation BoF
> > could really help in this regard.
> >
> > I guess it's also documented somewhere in
> > Tree-RCU-Memory-Ordering.html, but a quick search through that
> > document didn't show a mention of the need for two full barriers. I
> > think it's also a great idea for us to document it there and/or
> > discuss it during the conference.
> >
> > I went through the litmus test here for some hints on the two
> > barriers, but couldn't find any:
> > https://lkml.org/lkml/2017/10/5/636
> >
> > At least this commit made me think no extra memory barrier is
> > needed for Tree RCU: :-\
> > https://lore.kernel.org/patchwork/patch/813386/
> >
> > I'm sure your last email will be useful to me in the future once I
> > can make more sense of the ordering and the need for two full
> > barriers, so thanks a lot for writing it!
>
> Hmmm... I have had this argument before, haven't I? Perhaps I should
> take some time and get my story straight. ;-)
>
> In my defense, I have been doing RCU since the early 1990s, long
> before executable formal memory models. I kept it working through
> sheer paranoia, and that is a hard habit to shake...
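In the meantime, here is my guess (quite possibly wrong, and purely my
own sketch rather than anything from the documentation) at why a CPU
needs one full barrier for each of its two roles, as a writer before
the grace period and as a reader after it:

int a;

/* CPU 0: pre-GP write, then its quiescent-state report. */
void cpu0(void)
{
	WRITE_ONCE(a, 1);	/* pre-GP access */
	smp_mb();		/* barrier #1: supplied with the QS
				 * report, e.g. by
				 * smp_mb__after_unlock_lock() when
				 * taking the leaf rcu_node lock, so the
				 * write above propagates before the QS
				 * report travels up the tree */
	/* ...QS report propagates; the grace period eventually ends... */
}

/* CPU 1: learns that the grace period ended, then reads post-GP. */
void cpu1(void)
{
	int r1;

	/* ...GP-end notification arrives, e.g. via complete()... */
	smp_mb();		/* barrier #2: orders the notification
				 * before the read below; plain
				 * release/acquire ordering from
				 * wait_for_completion()'s lock would
				 * presumably not be enough on all
				 * architectures */
	r1 = READ_ONCE(a);	/* post-GP access: must observe a == 1 */
}

Since the guarantee is all-to-all, a CPU that both reports a QS and
then runs post-GP code seems to need barrier #1 at its QS report and
barrier #2 after it learns of the GP's end, which I take to be the
"at least -two-" point, with the detecting CPU getting away with one
because it sits at the point where the GP's end is decided. Is that
roughly the right picture?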
Sure, no problem, thanks a lot for taking another look into it!

Regards,

- Joel