On Sun, Mar 02, 2025 at 09:39:44AM -0800, Paul E. McKenney wrote:
> On Sun, Mar 02, 2025 at 11:19:44AM +0100, Uladzislau Rezki wrote:
> > On Fri, Feb 28, 2025 at 05:08:49PM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 28, 2025 at 11:59:55AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 28, 2025 at 08:12:51PM +0100, Uladzislau Rezki wrote:
> > > > > Hello, Paul!
> > > > > 
> > > > > > > > > > 
> > > > > > > > > > Except that I got this from overnight testing of rcu/dev
> > > > > > > > > > on the shared RCU tree:
> > > > > > > > > > 
> > > > > > > > > > WARNING: CPU: 5 PID: 14 at kernel/rcu/tree.c:1636 rcu_sr_normal_complete+0x5c/0x80
> > > > > > > > > > 
> > > > > > > > > > I see this only on TREE05.  Which should not be too
> > > > > > > > > > surprising, given that this is the scenario that tests it.
> > > > > > > > > > It happened within five minutes on all 14 of the TREE05 runs.
> > > > > > > > > > 
> > > > > > > > > Hm.. This is not fun. I tested this on my system and I did
> > > > > > > > > not manage to trigger this, whereas you do. Something is wrong.
> > > > > > > > 
> > > > > > > > If you have a debug patch, I would be happy to give it a go.
> > > > > > > > 
> > > > > > > I can trigger it. But.
> > > > > > > 
> > > > > > > Some background. I tested those patches for many hours on the
> > > > > > > stable kernel, which is 6.13. On that kernel I was not able to
> > > > > > > trigger it. Running rcutorture on our shared "dev" tree, which
> > > > > > > I did now, triggers this right away.
> > > > > > 
> > > > > > Bisection?  (Hey, you knew that was coming!)
> > > > > > 
> > > > > Looks like this: rcu: Fix get_state_synchronize_rcu_full() GP-start detection
> > > > > 
> > > > > After reverting it in the dev tree, rcutorture passes TREE05, 16 instances.
> > > > 
> > > > Huh.  We sure don't get to revert that one...
> > > > 
> > > > Do we have a problem with the ordering in rcu_gp_init() between the
> > > > calls to rcu_seq_start() and portions of rcu_sr_normal_gp_init()?
> > > > For example, do we need to capture the relevant portion of the list
> > > > before the call to rcu_seq_start(), and do the grace-period-start
> > > > work afterwards?
> > > 
> > > I tried moving the call to rcu_sr_normal_gp_init() before the call to
> > > rcu_seq_start() and got no failures in a one-hour run of 200*TREE05.
> > > Which does not necessarily mean that this is the correct fix, but I
> > > figured that it might at least provide food for thought.
> > > 
> > >                                                   Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 48384fa2eaeb8..d3efeff7740e7 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1819,10 +1819,10 @@ static noinline_for_stack bool rcu_gp_init(void)
> > >  
> > >   /* Advance to a new grace period and initialize state. */
> > >   record_gp_stall_check_time();
> > > + start_new_poll = rcu_sr_normal_gp_init();
> > >   /* Record GP times before starting GP, hence rcu_seq_start(). */
> > >   rcu_seq_start(&rcu_state.gp_seq);
> > >   ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
> > > - start_new_poll = rcu_sr_normal_gp_init();
> > >   trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
> > >   rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
> > >   raw_spin_unlock_irq_rcu_node(rnp);
> > >
> > I have been running this for 24 hours already, TREE05 * 16 scenario, and
> > I do not see any warnings yet. There is a race, indeed. The gp_seq is
> > moved forward, whereas clients can still arrive until
> > rcu_sr_normal_gp_init() places a dummy-wait-head for this GP.
> > 
> > Thank you for testing, Paul, and for looking into this :)
> 
> Very good!  This is a bug in this commit of mine:
> 
> 012f47f0f806 ("rcu: Fix get_state_synchronize_rcu_full() GP-start detection")
> 
> Boqun, could you please fold this into that commit with something like
> this added to the commit log just before the paragraph starting with
> "Although this fixes 91a967fd6934"?
> 
>       However, simply changing the get_state_synchronize_rcu_full()
>       function to use rcu_state.gp_seq instead of the root rcu_node
>       structure's ->gp_seq field results in a theoretical bug in kernels
>       booted with rcutree.rcu_normal_wake_from_gp=1 due to the following
>       sequence of events:
> 
>       o       The rcu_gp_init() function invokes rcu_seq_start()
>               to officially start a new grace period.
> 
>       o       A new RCU reader begins, referencing X from some
>               RCU-protected list.  The new grace period is not
>               obligated to wait for this reader.
> 
>       o       An updater removes X, then calls synchronize_rcu(),
>               which queues a wait element.
> 
>       o       The grace period ends, awakening the updater, which
>               frees X while the reader is still referencing it.
> 
>       The reason that this is theoretical is that although the
>       grace period has officially started, none of the CPUs are
>       officially aware of this, and thus will have to assume that
>       the RCU reader pre-dated the start of the grace period.
> 
>       Except for kernels built with CONFIG_PROVE_RCU=y, which use the
>       polled grace-period APIs, which can and do complain bitterly when
>       this sequence of events occurs.  Not only that, there might be
>       some future RCU grace-period mechanism that pulls this sequence
>       of events from theory into practice.  This commit therefore
>       also pulls the call to rcu_sr_normal_gp_init() to precede that
>       to rcu_seq_start().
> 
> I will let you guys decide whether the call to rcu_sr_normal_gp_init()
> needs a comment, and, if so, what that comment should say.  ;-)
> 

Please see the updated patch below (next and rcu/dev branches are
updated too). For the comment, I think we can add something like

        /*
         * A new wait segment must be started before gp_seq is advanced, so
         * that previous GP waiters won't observe the new gp_seq.
         */

but I will let Ulad decide ;-)
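
In context, that would sit right above the moved rcu_sr_normal_gp_init()
call in rcu_gp_init(), roughly like this (just a sketch of the hunk below
with the proposed comment dropped in; final wording is up to Ulad):

        /* Advance to a new grace period and initialize state. */
        record_gp_stall_check_time();
        /*
         * A new wait segment must be started before gp_seq is advanced, so
         * that previous GP waiters won't observe the new gp_seq.
         */
        start_new_poll = rcu_sr_normal_gp_init();
        /* Record GP times before starting GP, hence rcu_seq_start(). */
        rcu_seq_start(&rcu_state.gp_seq);
        ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);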

Regards,
Boqun

------------------------>8
Subject: [PATCH] rcu: Fix get_state_synchronize_rcu_full() GP-start detection

The get_state_synchronize_rcu_full() and poll_state_synchronize_rcu_full()
functions use the root rcu_node structure's ->gp_seq field to detect
the beginnings and ends of grace periods, respectively.  This choice is
necessary for the poll_state_synchronize_rcu_full() function because
(give or take counter wrap) the following sequence is guaranteed not
to trigger:

        get_state_synchronize_rcu_full(&rgos);
        synchronize_rcu();
        WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&rgos));

The RCU callbacks that awaken synchronize_rcu() instances are
guaranteed not to be invoked before the root rcu_node structure's
->gp_seq field is updated to indicate the end of the grace period.
However, these callbacks might start being invoked immediately
thereafter, in particular, before rcu_state.gp_seq has been updated.
Therefore, poll_state_synchronize_rcu_full() must refer to the
root rcu_node structure's ->gp_seq field.  Because this field is
updated under this structure's ->lock, any code following a call to
poll_state_synchronize_rcu_full() will be fully ordered after the
full grace-period computation, as is required by RCU's memory-ordering
semantics.

By symmetry, the get_state_synchronize_rcu_full() function should also
use this same root rcu_node structure's ->gp_seq field.  But it turns out
that symmetry is profoundly (though extremely infrequently) destructive
in this case.  To see this, consider the following sequence of events:

1.      CPU 0 starts a new grace period, and updates rcu_state.gp_seq
        accordingly.

2.      As its first step of grace-period initialization, CPU 0 examines
        the current CPU hotplug state and decides that it need not wait
        for CPU 1, which is currently offline.

3.      CPU 1 comes online and updates its state.  However, this does not
        affect the current grace period, but rather the one after that.
        After all, CPU 1 was offline when the current grace period
        started, so all pre-existing RCU readers on CPU 1 must have
        completed or been preempted before it last went offline.
        The current grace period therefore has nothing it needs to wait
        for on CPU 1.

4.      CPU 1 switches to an rcutorture kthread which is running
        rcutorture's rcu_torture_reader() function, which starts a new
        RCU reader.

5.      CPU 2 is running rcutorture's rcu_torture_writer() function
        and collects a new polled grace-period "cookie" using
        get_state_synchronize_rcu_full().  Because the newly started
        grace period has not completed initialization, the root rcu_node
        structure's ->gp_seq field has not yet been updated to indicate
        that this new grace period has already started.

        This cookie is therefore set up for the end of the current grace
        period (rather than the end of the following grace period).

6.      CPU 0 finishes grace-period initialization.

7.      If CPU 1's rcutorture reader is preempted, it will be added to
        the ->blkd_tasks list, but because CPU 1's ->qsmask bit is not
        set in CPU 1's leaf rcu_node structure, the ->gp_tasks pointer
        will not be updated.  Thus, this grace period will not wait on
        it.  Which is only fair, given that the CPU did not come online
        until after the grace period officially started.

8.      CPUs 0 and 2 then detect the new grace period and report
        a quiescent state to the RCU core.

9.      Because CPU 1 was offline at the start of the current grace
        period, CPUs 0 and 2 are the only CPUs that this grace period
        needs to wait on.  So the grace period ends and post-grace-period
        cleanup starts.  In particular, the root rcu_node structure's
        ->gp_seq field is updated to indicate that this grace period
        has now ended.

10.     CPU 2 continues running rcu_torture_writer() and sees that,
        from the viewpoint of the root rcu_node structure consulted by
        the poll_state_synchronize_rcu_full() function, the grace period
        has ended.  It therefore updates state accordingly.

11.     CPU 1 is still running the same RCU reader, which notices this
        update and thus complains about the too-short grace period.

The fix is for the get_state_synchronize_rcu_full() function to use
rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq field.
With this change in place, if step 5's cookie indicates that the grace
period has not yet started, then any prior code executed by CPU 2 must
have happened before CPU 1 came online.  This will in turn prevent CPU
1's code in steps 3 and 11 from spanning CPU 2's grace-period wait,
thus preventing CPU 1 from being subjected to a too-short grace period.

This commit therefore makes this change.  Note that there is no change to
the poll_state_synchronize_rcu_full() function, which as noted above,
must continue to use the root rcu_node structure's ->gp_seq field.
This is of course an asymmetry between these two functions, but is an
asymmetry that is absolutely required for correct operation.  It is a
common human tendency to greatly value symmetry, and sometimes symmetry
is a wonderful thing.  Other times, symmetry results in poor performance.
But in this case, symmetry is just plain wrong.
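
Conceptually, and ignoring the expedited-sequence handling, the resulting
asymmetric pair looks roughly like this (a simplified sketch, not the
literal code; rnp_root denotes the root rcu_node structure, and the
poll-side check is unchanged by this commit):

        /* get_state_synchronize_rcu_full(): snapshot rcu_state.gp_seq. */
        rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);

        /*
         * poll_state_synchronize_rcu_full(): still compare against the
         * root rcu_node structure's ->gp_seq, whose update marks the
         * end of the grace period.
         */
        done = rcu_seq_done_exact(&rnp_root->gp_seq, rgosp->rgos_norm);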

Nevertheless, the asymmetry does require an additional adjustment.
It is possible for get_state_synchronize_rcu_full() to see a given
grace period as having started, but for an immediately following
poll_state_synchronize_rcu_full() to see it as having not yet started.
Given the current rcu_seq_done_exact() implementation, this will
result in a false-positive indication that the grace period is done
from poll_state_synchronize_rcu_full().  This is dealt with by making
rcu_seq_done_exact() reach back three grace periods rather than just
two of them.
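
To see roughly why two grace periods' worth of reach-back is not enough,
assume the usual two state bits (RCU_SEQ_STATE_MASK == 3) and an idle
gp_seq value of 0x100 in both rcu_state and the root rcu_node structure
(a back-of-the-envelope sketch, not the literal code):

        rcu_seq_start(&rcu_state.gp_seq);     /* rcu_state.gp_seq == 0x101 */
        /* get_state_synchronize_rcu_full() then snapshots rcu_state.gp_seq: */
        s = rcu_seq_snap(&rcu_state.gp_seq);  /* (0x101 + 7) & ~3 == 0x108 */
        /* poll_state_synchronize_rcu_full(): root ->gp_seq is still 0x100. */
        cur_s = READ_ONCE(rnp_root->gp_seq);  /* 0x100 */
        /* Old check: 0x100 < 0x108 - 7 == 0x101, hence a false "done". */
        /* New check: 0x100 < 0x108 - 10 == 0xfe is false, hence not done. */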

However, simply changing the get_state_synchronize_rcu_full() function
to use rcu_state.gp_seq instead of the root rcu_node structure's ->gp_seq
field results in a theoretical bug in kernels booted with
rcutree.rcu_normal_wake_from_gp=1 due to the following sequence of
events:

o       The rcu_gp_init() function invokes rcu_seq_start() to officially
        start a new grace period.

o       A new RCU reader begins, referencing X from some RCU-protected
        list.  The new grace period is not obligated to wait for this
        reader.

o       An updater removes X, then calls synchronize_rcu(), which queues
        a wait element.

o       The grace period ends, awakening the updater, which frees X
        while the reader is still referencing it.

The reason that this is theoretical is that although the grace period
has officially started, none of the CPUs are officially aware of this,
and thus will have to assume that the RCU reader pre-dated the start of
the grace period.

Except for kernels built with CONFIG_PROVE_RCU=y, which use the polled
grace-period APIs, which can and do complain bitterly when this sequence
of events occurs.  Not only that, there might be some future RCU
grace-period mechanism that pulls this sequence of events from theory
into practice.  This commit therefore also pulls the call to
rcu_sr_normal_gp_init() to precede that to rcu_seq_start().

Although this fixes 91a967fd6934 ("rcu: Add full-sized polling for
get_completed*() and poll_state*()"), it is not clear that it is
worth backporting this commit.  First, it took me many weeks to
convince rcutorture to reproduce this more frequently than once
per year.  Second, this cannot be reproduced at all without frequent
CPU-hotplug operations, as in waiting all of 50 milliseconds from the
end of the previous operation until starting the next one.  Third,
the TREE03.boot settings cause multi-millisecond delays during RCU
grace-period initialization, which greatly increase the probability of
the above sequence of events.  (Don't do this in production workloads!)
Fourth, the TREE03 rcutorture scenario was modified to use four-CPU
guest OSes, to have a single-rcu_node combining tree, no testing of RCU
priority boosting, and no random preemption, and these modifications
were necessary to reproduce this issue in a reasonable timeframe.
Fifth, extremely heavy use of get_state_synchronize_rcu_full() and/or
poll_state_synchronize_rcu_full() is required to reproduce this, and as of
v6.12, only kfree_rcu() uses it, and even then not particularly heavily.

[boqun: Apply the fix [1]]

Signed-off-by: Paul E. McKenney <paul...@kernel.org>
Reviewed-by: Frederic Weisbecker <frede...@kernel.org>
Reviewed-by: Joel Fernandes (Google) <j...@joelfernandes.org>
Tested-by: Uladzislau Rezki (Sony) <ure...@gmail.com>
Link: https://lore.kernel.org/rcu/d90bd6d9-d15c-4b9b-8a69-95336e74e8f4@paulmck-laptop/ [1]
Signed-off-by: Boqun Feng <boqun.f...@gmail.com>
---
 kernel/rcu/rcu.h  |  2 +-
 kernel/rcu/tree.c | 11 +++++++----
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index feb3ac1dc5d5..f87c9d6d36fc 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -162,7 +162,7 @@ static inline bool rcu_seq_done_exact(unsigned long *sp, unsigned long s)
 {
        unsigned long cur_s = READ_ONCE(*sp);
 
-       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (2 * RCU_SEQ_STATE_MASK + 1));
+       return ULONG_CMP_GE(cur_s, s) || ULONG_CMP_LT(cur_s, s - (3 * RCU_SEQ_STATE_MASK + 1));
 }
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e4c0ce600b2b..f80cfdc3ee3e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1801,10 +1801,10 @@ static noinline_for_stack bool rcu_gp_init(void)
 
        /* Advance to a new grace period and initialize state. */
        record_gp_stall_check_time();
+       start_new_poll = rcu_sr_normal_gp_init();
        /* Record GP times before starting GP, hence rcu_seq_start(). */
        rcu_seq_start(&rcu_state.gp_seq);
        ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
-       start_new_poll = rcu_sr_normal_gp_init();
        trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("start"));
        rcu_poll_gp_seq_start(&rcu_state.gp_seq_polled_snap);
        raw_spin_unlock_irq_rcu_node(rnp);
@@ -3357,14 +3357,17 @@ EXPORT_SYMBOL_GPL(get_state_synchronize_rcu);
  */
 void get_state_synchronize_rcu_full(struct rcu_gp_oldstate *rgosp)
 {
-       struct rcu_node *rnp = rcu_get_root();
-
        /*
         * Any prior manipulation of RCU-protected data must happen
         * before the loads from ->gp_seq and ->expedited_sequence.
         */
        smp_mb();  /* ^^^ */
-       rgosp->rgos_norm = rcu_seq_snap(&rnp->gp_seq);
+
+       // Yes, rcu_state.gp_seq, not rnp_root->gp_seq, the latter's use
+       // in poll_state_synchronize_rcu_full() notwithstanding.  Use of
+       // the latter here would result in too-short grace periods due to
+       // interactions with newly onlined CPUs.
+       rgosp->rgos_norm = rcu_seq_snap(&rcu_state.gp_seq);
        rgosp->rgos_exp = rcu_seq_snap(&rcu_state.expedited_sequence);
 }
 EXPORT_SYMBOL_GPL(get_state_synchronize_rcu_full);
-- 
2.45.1

