On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
[..]
> > > > > CPU will with high probability report its own quiescent state before 
> > > > > three
> > > > > jiffies pass, in which case the cache misses on the rcu_data 
> > > > > structures
> > > > > would be wasted motion.
> > > > 
> > > > If all the CPUs are busy and reporting their QS themselves, then I 
> > > > think the
> > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > rcu_data right?
> > > 
> > > Yes, but assuming that all CPUs report their quiescent states before
> > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > looping in the kernel for many milliseconds without passing through a
> > > quiescent state.  This is because for recent kernels, cond_resched()
> > > is not a quiescent state until the grace period is something like 100
> > > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > > quiescent state unless it actually scheduled.)
> > > 
> > > Why wait 100 milliseconds?  Because otherwise the increase in
> > > cond_resched() overhead shows up all too well, causing 0day test robot
> > > to complain bitterly.  Besides, I would expect that in the common case,
> > > CPUs would be executing usermode code.
> > 
> > Makes sense. I was also wondering about this other thing you mentioned about
> > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > that mean that even if a single CPU is dyntick-idle for a long period of
> > time, then the minimum grace period duration would be atleast 3 jiffies? In
> > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > consumption low. Not that I'm saying its an issue or anything (since IIUC if
> > someone wants shorter grace periods, they should just use expedited GPs), 
> > but
> > it sounds like it would be shorter GP if we just set the qsmask early on 
> > some
> > how and we can manage the overhead of doing so.
> 
> First, there is some autotuning of the delay based on HZ:
> 
> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> 
> So at HZ=300, you should be seeing a two-jiffy delay rather than the
> usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
> is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> rates often mean slower instruction execution and thus a desire for
> lower RCU overhead.  There is further autotuning based on number of
> CPUs, but this does not kick in until you have 256 CPUs on your system,
> and I bet that smartphones aren't there yet.  Nevertheless, check out
> RCU_JIFFIES_FQS_DIV for more info on this.

Got it. I agree with that heuristic.

> But you can always override this autotuning using the following kernel
> boot paramters:
> 
> rcutree.jiffies_till_first_fqs
> rcutree.jiffies_till_next_fqs
> 
> You can even set the first one to zero if you want the effect of pre-scanning
> for idle CPUs.  ;-)
> 
> The second must be set to one or greater.
> 
> Both are capped at one second (HZ).

Got it. Thanks a lot for the explanations.

> > > > Anyway it was just an idea that popped up when I was going through 
> > > > traces :)
> > > > Thanks for the discussion and happy to discuss further or try out 
> > > > anything.
> > > 
> > > Either way, I do appreciate your going through this.  People have found
> > > RCU bugs this way, one of which involved RCU uselessly calling a 
> > > particular
> > > function twice in quick succession.  ;-)
> >  
> > Thanks.  It is my pleasure and happy to help :) I'll keep digging into it.
> 
> Looking forward to further questions and patches.  ;-)

Will do! thanks,

 - Joel

Reply via email to