> On Mar 23, 2025, at 11:05 PM, Frederic Weisbecker <frede...@kernel.org> wrote: > > Le Sat, Mar 22, 2025 at 03:40:53PM +0100, Joel Fernandes a écrit : >> >> >>> On 3/22/2025 3:20 PM, Joel Fernandes wrote: >>> >>> On 3/22/2025 11:25 AM, Frederic Weisbecker wrote: >>>> Le Sat, Mar 22, 2025 at 03:06:08AM +0100, Joel Fernandes a écrit : >>>>> Insomnia kicked in, so 3 am reply here (Zurich local time) ;-): >>>>> >>>>> On 3/20/2025 3:15 PM, Frederic Weisbecker wrote: >>>>>> Le Wed, Mar 19, 2025 at 03:38:31PM -0400, Joel Fernandes a écrit : >>>>>>> On Tue, Mar 18, 2025 at 11:37:38AM -0700, Paul E. McKenney wrote: >>>>>>>> On Tue, Mar 18, 2025 at 02:56:18PM +0100, Frederic Weisbecker wrote: >>>>>>>>> The numbers used in rcu_seq_done_exact() lack some explanation behind >>>>>>>>> their magic. Especially after the commit: >>>>>>>>> >>>>>>>>> 85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() GP-start >>>>>>>>> detection") >>>>>>>>> >>>>>>>>> which reported a subtle issue where a new GP sequence snapshot was >>>>>>>>> taken >>>>>>>>> on the root node state while a grace period had already been started >>>>>>>>> and >>>>>>>>> reflected on the global state sequence but not yet on the root node >>>>>>>>> sequence, making a polling user waiting on a wrong already started >>>>>>>>> grace >>>>>>>>> period that would ignore freshly online CPUs. >>>>>>>>> >>>>>>>>> The fix involved taking the snaphot on the global state sequence and >>>>>>>>> waiting on the root node sequence. And since a grace period is first >>>>>>>>> started on the global state and only afterward reflected on the root >>>>>>>>> node, a snapshot taken on the global state sequence might be two full >>>>>>>>> grace periods ahead of the root node as in the following example: >>>>>>>>> >>>>>>>>> rnp->gp_seq = rcu_state.gp_seq = 0 >>>>>>>>> >>>>>>>>> CPU 0 CPU 1 >>>>>>>>> ----- ----- >>>>>>>>> // rcu_state.gp_seq = 1 >>>>>>>>> rcu_seq_start(&rcu_state.gp_seq) >>>>>>>>> // snap = 8 >>>>>>>>> snap = >>>>>>>>> rcu_seq_snap(&rcu_state.gp_seq) >>>>>>>>> // Two full GP >>>>>>>>> differences >>>>>>>>> >>>>>>>>> rcu_seq_done_exact(&rnp->gp_seq, snap) >>>>>>>>> // rnp->gp_seq = 1 >>>>>>>>> WRITE_ONCE(rnp->gp_seq, rcu_state.gp_seq); >>>>>>>>> >>>>>>>>> Add a comment about those expectations and to clarify the magic within >>>>>>>>> the relevant function. >>>>>>>>> >>>>>>>>> Signed-off-by: Frederic Weisbecker <frede...@kernel.org> >>>>>>>> Reviewed-by: Paul E. McKenney <paul...@kernel.org> >>>>>>>> >>>>>>>> But it would of course be good to get reviews from the others. >>>>>>> I actually don't agree that the magic in the rcu_seq_done_exact() >>>>>>> function about the >>>>>>> ~2 GPs is related to the lag between rcu_state.gp_seq and root >>>>>>> rnp->gp_seq, >>>>>>> because the small lag can just as well survive with the rcu_seq_done() >>>>>>> function in the above sequence right? >>>>>>> >>>>>>> The rcu_seq_done_exact() function on the other hand is more about not >>>>>>> being >>>>>>> stuck in the ULONG_MAX/2 guard band, but to actually get to that, you >>>>>>> need a >>>>>>> wrap around to happen and the delta between "rnp->gp_seq" and "snap" to >>>>>>> be at >>>>>>> least ULONG_MAX/2 AFAIU. >>>>>>> >>>>>>> So the only time this magic will matter is if you have a huge delta >>>>>>> between >>>>>>> what is being compared, not just 2 GPs. >>>>>> You're right, and perhaps I should have made it more specific that my >>>>>> comment >>>>>> only explains the magic "3" number here, in that if it were "2" instead, >>>>>> there >>>>>> could be accidents with 2 full GPs difference (which is possible) >>>>>> spuriously >>>>>> accounted as a wrap around. >>>>> Ahh, so I guess I get it now and we are both right. The complete picture >>>>> is - We >>>>> are trying to handle the case of "very large wrap" around but as a part >>>>> of that, >>>>> we don't want to create false-positives for this "snap" case. >>>>> >>>>> A "snap" can be atmost (2 * RCU_SEQ_STATE_MASK + 1) away from a gp_seq. >>>>> >>>>> That's within "2 GPs" worth of counts (about 8 counts) >>>>> >>>>> Taking some numbers: >>>>> >>>>> cur_s s delta (s - cur_s) >>>>> 0 4 4 >>>>> 1 8 7 >>>>> 2 8 6 >>>>> 3 8 5 >>>>> 4 8 4 >>>>> 5 12 7 >>>>> >>>>> The maximum delta of a snap from actual gp_seq can be (2 * >>>>> RCU_SEQ_STATE_MASK + >>>>> 1) which in this case is 7. >>>>> >>>>> So we adjust the comparison by adding the ULONG_CMP_LT(cur_s, s - (2 * >>>>> RCU_SEQ_STATE_MASK + 1)). i.e. >>>> 3, right? >>> Just to be absolutely sure, are you talking about the value of >>> RCU_SEQ_STATE_MASK ? >>> >>> That is indeed 3 (RCU_SEQ_STATE_MASK). >>> >>> But if we're talking about number of GPs, my understanding is a count of 4 >>> is >>> one GP worth. Per the above table, the delta between gp_seq and is snap is >>> always a count of 7 (hence less than 2 GPs). >>> >>> Agreed? >>> >>> If RCU_SEQ_STATE_MASK was 0x1 instead of 0x11, that is a single bit (or a >>> count >>> of 2 instead of 4, for a GP), then the table would be: >>> >>> cur_s s (snap) delta (s - cur_s) >>> 0 2 2 >>> 1 4 3 >>> 2 4 2 >>> 3 6 3 >>> 4 6 2 >>> 5 8 3 >>> >>> So delta is always <= 3, Or more generally: <= (RCU_SEQ_STATE_MASK * 2) + 1 >> >> Oh man, I am wondering if we are on to a bug here: >> >> From your example: >> >> CPU 0 CPU 1 >> ----- ----- >> // rcu_state.gp_seq = 1 >> rcu_seq_start(&rcu_state.gp_seq) >> // snap = 8 >> snap = rcu_seq_snap(&rcu_state.gp_seq) >> // Two full GP >> rcu_seq_done_exact(&rnp->gp_seq, snap) >> >> >> Here, the >> ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1)); >> >> Will be >> ULONG_CMP_LT(0, 8 - (2 * RCU_SEQ_STATE_MASK + 1)); >> >> = ULONG_CMP_LT(0, 8 - 7) >> >> = TRUE. >> >> Which means rcu_seq_done_exact() will return a false positive saying the GP >> has >> completed even though it has not. >> >> I think rcu_seq_done_exact() is off by one and should be doing: >> >> ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 2)); >> >> ? > > But it's ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1) now since: > > 85aad7cc4178 ("rcu: Fix get_state_synchronize_rcu_full() GP-start > detection") > > That's 10 so we are good. > > However that magic value is arbitrary and doesn't mean much. It should be > like you said. Or rather for clarity: > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h > index 7acf1f36dd6c..e53f0b687a83 100644 > --- a/kernel/rcu/rcu.h > +++ b/kernel/rcu/rcu.h > @@ -57,6 +57,10 @@ > /* Low-order bit definition for polled grace-period APIs. */ > #define RCU_GET_STATE_COMPLETED 0x1 > > +/* A complete grace period count */ > +#define RCU_SEQ_GP (RCU_SEQ_STATE_MASK + 1) > + > + > extern int sysctl_sched_rt_runtime; > > /* > @@ -169,7 +173,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, > unsigned long s) > { > unsigned long cur_s = READ_ONCE(*sp); > > - return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * > RCU_SEQ_STATE_MASK + 1)); > + return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * > RCU_SEQ_GP)); > }
Ah, my kernel did not have the change at the time I commented, sorry. I agree that this is a much better and meaningful expression than the existing one. I shall create a patch with it and send it out with my other series on forcing the wrap for testing (along with the modification for the new comments you added). Thanks! > > /* > >