On Sun, Nov 04, 2018 at 07:43:30PM -0800, Paul E. McKenney wrote:
[...]
> > > > > > > > Also about GP memory ordering and RCU-tree-locking, I think you 
> > > > > > > > mentioned to
> > > > > > > > me that the RCU reader-sections are virtually extended both 
> > > > > > > > forward and
> > > > > > > > backward and whereever it ends, those paths do heavy-weight 
> > > > > > > > synchronization
> > > > > > > > that should be sufficient to prevent memory ordering issues 
> > > > > > > > (such as those
> > > > > > > > you mentioned in the Requierments document). That is exactly 
> > > > > > > > why we don't
> > > > > > > > need explicit barriers during rcu_read_unlock. If I recall I 
> > > > > > > > asked you why
> > > > > > > > those are not needed. So that answer made sense, but then now 
> > > > > > > > on going
> > > > > > > > through the 'Memory Ordering' document, I see that you 
> > > > > > > > mentioned there is
> > > > > > > > reliance on the locking. Is that reliance on locking necessary 
> > > > > > > > to maintain
> > > > > > > > ordering then?
> > > > > > > 
> > > > > > > There is a "network" of locking augmented by 
> > > > > > > smp_mb__after_unlock_lock()
> > > > > > > that implements the all-to-all memory ordering mentioned above.  
> > > > > > > But it
> > > > > > > also needs to handle all the possible 
> > > > > > > complete()/wait_for_completion()
> > > > > > > races, even those assisted by hypervisor vCPU preemption.
> > > > > > 
> > > > > > I see, so it sounds like the lock network is just a partial 
> > > > > > solution. For
> > > > > > some reason I thought before that complete() was even called on the 
> > > > > > CPU
> > > > > > executing the callback, all the CPUs would have acquired and 
> > > > > > released a lock
> > > > > > in the "lock network" atleast once thus ensuring the ordering (due 
> > > > > > to the
> > > > > > fact that the quiescent state reporting has to travel up the tree 
> > > > > > starting
> > > > > > from the leaves), but I think that's not necessarily true so I see 
> > > > > > your point
> > > > > > now.
> > > > > 
> > > > > There is indeed a lock that is unconditionally acquired and released 
> > > > > by
> > > > > wait_for_completion(), but it lacks the smp_mb__after_unlock_lock() 
> > > > > that
> > > > > is required to get full-up any-to-any ordering.  And unfortunate 
> > > > > timing
> > > > > (as well as spurious wakeups) allow the interaction to have only 
> > > > > normal
> > > > > lock-release/acquire ordering, which does not suffice in all cases.
> > > > 
> > > > Sorry to be so persistent, but I did spend some time on this and I still
> > > > don't get why every CPU would _not_ have executed 
> > > > smp_mb__after_unlock_lock at least
> > > > once before the wait_for_completion() returns, because every CPU should 
> > > > have
> > > > atleast called rcu_report_qs_rdp() -> rcu_report_qs_rnp() atleast once 
> > > > to
> > > > report its QS up the tree right?. Before that procedure, the complete()
> > > > cannot happen because the complete() itself is in an RCU callback which 
> > > > is
> > > > executed only once all the QS(s) have been reported.
> > > > 
> > > > So I still couldn't see how the synchronize_rcu can return without the
> > > > rcu_report_qs_rnp called atleast once on the CPU reporting its QS 
> > > > during a
> > > > grace period.
> > > > 
> > > > Would it be possible to provide a small example showing this in least 
> > > > number
> > > > of steps? I appreciate your time and it would be really helpful. If you 
> > > > feel
> > > > its too complicated, then feel free to keep this for LPC discussion :)
> > > 
> > > The key point is that "at least once" does not suffice, other than for the
> > > CPU that detects the end of the grace period.  The rest of the CPUs must
> > > do at least -two- full barriers, which could of course be either smp_mb()
> > > on the one hand or smp_mb__after_unlock_lock() after a lock on the other.
> > 
> > I thought I'll atleast get an understanding of the "atleast two full
> > barriers" point and ask you any questions at LPC, because that's what I'm
> > missing I think. Trying to understand what can go wrong without two full
> > barriers. I'm sure an RCU implementation BoF could really in this regard.
> > 
> > I guess its also documented somewhere in Tree-RCU-Memory-Ordering.html but a
> > quick search through that document didn't show a mention of the two full
> > barriers need.. I think its also a great idea for us to document it there
> > and/or discuss it during the conference.
> > 
> > I went through the litmus test here for some hints on the two-barriers but
> > couldn't find any:
> > https://lkml.org/lkml/2017/10/5/636
> > 
> > Atleast this commit made me think no extra memory barrier is needed for
> > tree RCU:  :-\
> > https://lore.kernel.org/patchwork/patch/813386/
> > 
> > I'm sure your last email will be useful to me in the future once I can make
> > more sense of the ordering and the need for two full barriers, so thanks a
> > lot for writing it!
> 
> Hmmm...  I have had this argument before, haven't I?  Perhaps I should
> take some time and get my story straight.  ;-)
> 
> In my defense, I have been doing RCU since the early 1990s, long before
> executable formal memory models.  I kept it working through sheer
> paranoia, and that is a hard habit to shake...

Sure no problem, thanks a lot for taking another look into it!

Regards,

- Joel

Reply via email to