On Mon, Apr 14, 2025 at 10:56:24AM -0400, Joel Fernandes wrote:
> On 4/14/2025 10:24 AM, Paul E. McKenney wrote:
> > On Mon, Apr 14, 2025 at 08:07:24AM -0400, Joel Fernandes wrote:
> >>> On Apr 11, 2025, at 3:18 PM, Paul E. McKenney <paul...@kernel.org> wrote:
> >>> On Fri, Apr 11, 2025 at 05:36:32AM -0000, Joel Fernandes wrote:
> >>>> Hello, Paul,
> >>>>
> >>>>> On Fri, 11 Apr 2025 05:33:16 GMT, "Paul E. McKenney" wrote:
> >>>>> On Thu, Apr 10, 2025 at 11:54:13AM -0700, Paul E. McKenney wrote:
> >>>>>> On Thu, Apr 10, 2025 at 11:29:03AM -0700, Paul E. McKenney wrote:
> >>>>>>> On Thu, Apr 10, 2025 at 11:03:27AM -0400, Joel Fernandes wrote: >
> >>>>>>> Currently, the ->gpwrap is not tested (at all per my testing) due to
> >>>>>>> the > requirement of a large delta between a CPU's rdp->gp_seq and its
> >>>>>>> node's > rnp->gpseq.  > > This results in no testing of ->gpwrap being
> >>>>>>> set. This patch by default > adds 5 minutes of testing with ->gpwrap
> >>>>>>> forced by lowering the delta > between rdp->gp_seq and rnp->gp_seq to
> >>>>>>> just 8 GPs. All of this is > configurable, including the active time 
> >>>>>>> for
> >>>>>>> the setting and a full > testing cycle.  > > By default, the first 25
> >>>>>>> minutes of a test will have the _default_ > behavior there is right 
> >>>>>>> now
> >>>>>>> (ULONG_MAX / 4) delta. Then for 5 minutes, > we switch to a smaller 
> >>>>>>> delt
> >>>>> a
> >>>>>>> causing 1-2 wraps in 5 minutes. I believe > this is reasonable since 
> >>>>>>> we
> >>>>>>> at least add a little bit of testing for > usecases where ->gpwrap is 
> >>>>>>> se
> >>>>> t.
> >>>>>>>>> Signed-off-by: Joel Fernandes <joelagn...@nvidia.com>
> >>>>>>>
> >>>>>>> Much better, thank you!
> >>>>>>>
> >>>>>>> One potential nit below.  I will run some tests on this version.
> >>>>>>
> >>>>>> And please feel free to apply the following to both:
> >>>>>>
> >>>>>> Tested-by: Paul E. McKenney <paul...@kernel.org>
> >>>>>
> >>>>> And this happy situation lasted only until I rebased onto v6.15-rc1 and
> >>>>> on top of this commit:
> >>>>>
> >>>>> 1342aec2e442 ("Merge branches 'rcu/misc-for-6.16', 
> >>>>> 'rcu/seq-counters-for-6.1
> >>>>> 6' and 'rcu/torture-for-6.16' into rcu/for-next")
> >>>>>
> >>>>> This got me the splat shown below when running rcutorture scenario 
> >>>>> SRCU-N.
> >>>>> I reverted this commit and tests pass normally.
> >>>>>
> >>>>> Your other commit (ARM64 images) continues working fine.
> >>>>
> >>>> Interesting.. it seems to be crashing during statistics printing.
> >>>>
> >>>> I am wondering if the test itself uncovered a bug or the bug is in the 
> >>>> test
> >>>> itself.
> >>>
> >>> Both are quite possible, also a bug somewhere else entirely.
> >>
> >> I may not get to debugging it for this merge window so I am leaning to 
> >> defer it.
> > 
> > The usual cause is use of an rcu_torture_ops function pointer without
> > having first checked that it is non-NULL.  But I suspect that you already
> > checked for this.
> 
> Oops, I am not! You are right I think it broke since the movement into ops and
> needs this which I missed for this call site (though I did it for the other). 
> I
> could repro SRCU-N without it and with the fix it passes:
> 
> diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
> index 74de92c3a9ab..df6504a855aa 100644
> --- a/kernel/rcu/rcutorture.c
> +++ b/kernel/rcu/rcutorture.c
> @@ -2412,7 +2412,8 @@ rcu_torture_stats_print(void)
>                         pipesummary[i] += READ_ONCE(per_cpu(rcu_torture_count,
> cpu)[i]);
>                         batchsummary[i] += 
> READ_ONCE(per_cpu(rcu_torture_batch,
> cpu)[i]);
>                 }
> -               n_gpwraps += cur_ops->get_gpwrap_count(cpu);
> +               if (cur_ops->get_gpwrap_count)
> +                       n_gpwraps += cur_ops->get_gpwrap_count(cpu);
>         }
>         for (i = RCU_TORTURE_PIPE_LEN; i >= 0; i--) {
>                 if (pipesummary[i] != 0)
> 
> I will squash the fix into the patch and repost as v4.

Been there, done that!  And looks like a proper fix to me.

                                                        Thanx, Paul

> >>>> Looking forward to your test with the other patch and we could hold off 
> >>>> on this
> >>>> one till we have more data about what is going on.
> >>>
> >>> This one got lot of OOMs when tests of RCU priority boosting overlapped
> >>> with testing of RCU callback flooding on TREE03, as in 13 of the 14
> >>> 9-hour runs.  Back on v6.14-rc1, these were quite rare.
> >>>
> >>> Ah, and I am carrying this as an experimental patch:
> >>>
> >>> 269b9b5be09d ("EXP sched: Disable DL server if sysctl_sched_rt_runtime is 
> >>> -1")
> >>>
> >>> Just checking to see if this is still something I should be carrying.
> >>
> >> I think since it exposing boost issues, we should carry it! However since 
> >> it is also noisy, maybe for short term we not carry it in any trees since 
> >> we are getting close to posting the topic branches.
> > 
> > I am carrying it in -rcu, but marked "EXP" so that I don't post it or
> > send it along in a pull request.
> 
> Sounds good to me.
> 
> >> Do you see the same boost issues or frequency of them when carrying it on 
> >> 6.15-rc1 without any of this merge windows changes?
> >>
> >> By the way I have to rewrite that EXP patch at some point based on a 
> >> review of it but functionally that patch is good.
> > 
> > I just now started a short test with it reverted.
> > 
> > Oh, and yours and Boqun's latest passed overnight tests except for a
> > few Kconfig issues including the PREEMPT_RT pair:
> > 
> > 75cf58ef310a ("Merge branches 'rcu/misc-for-6.16', 
> > 'rcu/seq-counters-for-6.16' and 'rcu/torture-for-6.16' into rcu/for-next")
> > 
> > This is a known Kconfig issue in torture.sh, fixed by this -rcu commit:
> > 
> > 2e26af16b7b6 ("torture.sh: Force CONFIG_RCU_NOCB_CPU=y for --do-rt 
> > configurations")
> 
> I already merged this change
> https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=rcu/for-next&id=a3204f778cf7e37c7344404768398b4f9d43a368
> 
> But you saw issues in your testing even with this?
> 
> Could you rebase on top of my for-next branch so we are on the same page? Tag
> next.2025.04.11a
> 
> I believe you said you were going to rebuild your tree, but were waiting on 
> testing?
> 
> > 
> > There are also Kconfig issues with a few of the KCSAN rcutorture scenarios
> > that I am looking into.  And torture.sh needs to be more aggressive about
> > reporting these...
> Ok sounds good, happy to add these to my torture-for-6.16 topic branch once 
> you
> post them.
> 
> thanks,
> 
>  - Joel
> 
> 

Reply via email to