[PATCH clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. The first step is a way of injecting such delays, and this commit therefore provides a clocksource.inject_delay_freq and clocksource.inject_delay_run kernel boot parameters that specify that sufficient delay be injected to cause the clocksource_watchdog() function to mark a clock unstable. This delay is injected every Nth set of M calls to clocksource_watchdog(), where N is the value specified for the inject_delay_freq boot parameter and M is the value specified for the inject_delay_run boot parameter. Values of zero or less for either parameter disable delay injection, and the default for clocksource.inject_delay_freq is zero, that is, disabled. The default for clocksource.inject_delay_run is the value one, that is single-call runs. This facility is intended for diagnostic use only, and should be avoided on production systems. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen [ paulmck: Apply Rik van Riel feedback. ] Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 22 kernel/time/clocksource.c | 27 + 2 files changed, 49 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a10b545..9965266 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -577,6 +577,28 @@ loops can be debugged more effectively on production systems. + clocksource.inject_delay_freq= [KNL] + Number of runs of calls to clocksource_watchdog() + before delays are injected between reads from the + two clocksources. Values less than or equal to + zero disable this delay injection. These delays + can cause clocks to be marked unstable, so use + of this parameter should therefore be avoided on + production systems. Defaults to zero (disabled). + + clocksource.inject_delay_run= [KNL] + Run lengths of clocksource_watchdog() delay + injections. Specifying the value 8 will result + in eight consecutive delays followed by eight + times the value specified for inject_delay_freq + of consecutive non-delays. + + clocksource.max_read_retries= [KNL] + Number of clocksource_watchdog() retries due to + external delays before the clock will be marked + unstable. Defaults to three retries, that is, + four attempts to read the clock under test. + clearcpuid=BITNUM[,BITNUM...] [X86] Disable CPUID feature X for the kernel. 
See arch/x86/include/asm/cpufeatures.h for the valid bit diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cce484a..4be4391 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -14,6 +14,7 @@ #include /* for spin_unlock_irq() using preempt_count() m68k */ #include #include +#include #include "tick-internal.h" #include "timekeeping_internal.h" @@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs) spin_unlock_irqrestore(&watchdog_lock, flags); } +static int inject_delay_freq; +module_param(inject_delay_freq, int, 0644); +static int inject_delay_run = 1; +module_param(inject_delay_run, int, 0644); +static int max_read_retries = 3; +module_param(max_read_retries, int, 0644); + +static void clocksource_watchdog_inject_delay(void) +{ + int i; + static int injectfail = -1; + + if (inject_delay_freq <= 0 || inject_delay_run <= 0) + return; + if (injectfail < 0 || injectfail > INT_MAX / 2) + injectfail = inject_delay_run; + if (!(++injectfail / inject_delay_run % inject_delay_freq)) { + pr_warn("%s(): Injecting delay.\n", __func__); + for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++) + udelay(1000); + pr_warn("%s(): Done injecting delay.\n", __func_
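The delay cadence above ("every Nth set of M calls") is easier to see in isolation. The following standalone user-space C sketch, not kernel code, reuses the patch's injectfail expression with made-up example values (inject_delay_freq=3, inject_delay_run=2) to show which calls would be delayed:

/*
 * User-space sketch of the injection cadence used by
 * clocksource_watchdog_inject_delay(): with inject_delay_freq=N and
 * inject_delay_run=M, every Nth group of M consecutive calls is delayed.
 */
#include <stdio.h>
#include <limits.h>

static int inject_delay_freq = 3;   /* N: delay every 3rd run of calls */
static int inject_delay_run = 2;    /* M: each run is 2 consecutive calls */
static int injectfail = -1;

static int would_inject_delay(void)
{
    if (inject_delay_freq <= 0 || inject_delay_run <= 0)
        return 0;
    if (injectfail < 0 || injectfail > INT_MAX / 2)
        injectfail = inject_delay_run;
    /* Same expression as the patch: groups of M calls, every Nth group. */
    return !(++injectfail / inject_delay_run % inject_delay_freq);
}

int main(void)
{
    for (int call = 1; call <= 12; call++)
        printf("call %2d: %s\n", call,
               would_inject_delay() ? "inject delay" : "no delay");
    return 0;
}

With these example values, calls 4-5 and 10-11 are delayed, matching the "every Nth set of M calls" description.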
[PATCH clocksource 2/5] clocksource: Retry clock read if long delays detected
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. This commit therefore re-reads the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If retries were required, a message is printed on the console. If the number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ] [ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 4be4391..3f734c6 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating); */ #define WATCHDOG_INTERVAL (HZ >> 1) #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4) +#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6) static void clocksource_watchdog_work(struct work_struct *work) { @@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void) static void clocksource_watchdog(struct timer_list *unused) { struct clocksource *cs; - u64 csnow, wdnow, cslast, wdlast, delta; - int64_t wd_nsec, cs_nsec; + u64 csnow, wdnow, wdagain, cslast, wdlast, delta; + int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec; int next_cpu, reset_pending; + int nretries; spin_lock(&watchdog_lock); if (!watchdog_running) @@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused) reset_pending = atomic_read(&watchdog_reset_pending); list_for_each_entry(cs, &watchdog_list, wd_list) { + nretries = 0; /* Clocksource already marked unstable? 
*/ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { @@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused) continue; } +retry: local_irq_disable(); - csnow = cs->read(cs); - clocksource_watchdog_inject_delay(); wdnow = watchdog->read(watchdog); + clocksource_watchdog_inject_delay(); + csnow = cs->read(cs); + wdagain = watchdog->read(watchdog); local_irq_enable(); + delta = clocksource_delta(wdagain, wdnow, watchdog->mask); + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift); + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) { + wderr_nsec = wdagain_nsec; + if (nretries++ < max_read_retries) + goto retry; + } + if (nretries) + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n", + smp_processor_id(), watchdog->name, wderr_nsec, nretries); /* Clocksource initialized ? */ if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || -- 2.9.5
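As a rough illustration of the read-back retry pattern this patch adds, the following user-space sketch uses clock_gettime(CLOCK_MONOTONIC) as a stand-in for both the watchdog clock and the clock under test, with a MAX_SKEW_NS constant mirroring WATCHDOG_MAX_SKEW; the kernel code instead converts watchdog cycles via clocksource_cyc2ns():

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define MAX_SKEW_NS      (1000000000LL >> 6)  /* mirrors WATCHDOG_MAX_SKEW */
#define MAX_READ_RETRIES 3                    /* mirrors max_read_retries */

static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    int64_t wdnow, csnow, wdagain, delay_ns;
    int nretries = 0;

retry:
    wdnow = now_ns();      /* first watchdog read */
    csnow = now_ns();      /* read of the clock under test */
    wdagain = now_ns();    /* watchdog read-back */
    delay_ns = wdagain - wdnow;
    if (delay_ns < 0 || delay_ns > MAX_SKEW_NS) {
        if (nretries++ < MAX_READ_RETRIES)
            goto retry;    /* excessive delay: measurement is suspect, retry */
    }
    if (nretries)
        printf("read-back delay of %lldns, attempt %d\n",
               (long long)delay_ns, nretries);
    else
        printf("clean read: %lldns between watchdog reads (cs sample %lld)\n",
               (long long)delay_ns, (long long)csnow);
    return 0;
}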
[PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking
From: "Paul E. McKenney" Code that checks for clock desynchronization must itself be tested, so this commit creates a new clocksource.inject_delay_shift_percpu= kernel boot parameter that adds or subtracts a large value from the check read, using the specified bit of the CPU ID to determine whether to add or to subtract. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Apply Randy Dunlap feedback. ] Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 16 kernel/time/clocksource.c | 10 +- 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9965266..628e87f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -593,6 +593,22 @@ times the value specified for inject_delay_freq of consecutive non-delays. + clocksource.inject_delay_shift_percpu= [KNL] + Clocksource delay injection partitions the CPUs + into two sets, one whose clocks are moved ahead + and the other whose clocks are moved behind. + This kernel parameter selects the CPU-number + bit that determines which of these two sets the + corresponding CPU is placed into. For example, + setting this parameter to the value 4 will result + in the first set containing alternating groups + of 16 CPUs whose clocks are moved ahead, while + the second set will contain the rest of the CPUs, + whose clocks are moved behind. + + The default value of -1 disables this type of + error injection. + clocksource.max_read_retries= [KNL] Number of clocksource_watchdog() retries due to external delays before the clock will be marked diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 663bc53..df48416 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -190,6 +190,8 @@ static int inject_delay_freq; module_param(inject_delay_freq, int, 0644); static int inject_delay_run = 1; module_param(inject_delay_run, int, 0644); +static int inject_delay_shift_percpu = -1; +module_param(inject_delay_shift_percpu, int, 0644); static int max_read_retries = 3; module_param(max_read_retries, int, 0644); @@ -219,8 +221,14 @@ static cpumask_t cpus_behind; static void clocksource_verify_one_cpu(void *csin) { struct clocksource *cs = (struct clocksource *)csin; + s64 delta = 0; + int sign; - __this_cpu_write(csnow_mid, cs->read(cs)); + if (inject_delay_shift_percpu >= 0) { + sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; + delta = sign * NSEC_PER_SEC; + } + __this_cpu_write(csnow_mid, cs->read(cs) + delta); } static void clocksource_verify_percpu_wq(struct work_struct *unused) -- 2.9.5
[PATCH clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable
From: "Paul E. McKenney" Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedy been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know? This commit therefore adds CPU-to-CPU synchronization checking for newly unstable clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- arch/x86/kernel/kvmclock.c | 2 +- arch/x86/kernel/tsc.c | 3 +- include/linux/clocksource.h | 2 +- kernel/time/clocksource.c | 73 + 4 files changed, 77 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index aa59374..337bb2c 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -169,7 +169,7 @@ struct clocksource kvm_clock = { .read = kvm_clock_get_cycles, .rating = 400, .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU, .enable = kvm_cs_enable, }; EXPORT_SYMBOL_GPL(kvm_clock); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index f70dffc..5628917 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = { .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VALID_FOR_HRES | - CLOCK_SOURCE_MUST_VERIFY, + CLOCK_SOURCE_MUST_VERIFY | + CLOCK_SOURCE_VERIFY_PERCPU, .vdso_clock_mode= VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 86d143d..83a3ebf 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -131,7 +131,7 @@ struct clocksource { #define CLOCK_SOURCE_UNSTABLE 0x40 #define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80 #define CLOCK_SOURCE_RESELECT 0x100 - +#define CLOCK_SOURCE_VERIFY_PERCPU 0x200 /* simplify initialization of mask field */ #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 3f734c6..663bc53 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void) WARN_ON_ONCE(injectfail < 0); } +static struct clocksource *clocksource_verify_work_cs; +static DEFINE_PER_CPU(u64, csnow_mid); +static cpumask_t cpus_ahead; +static cpumask_t cpus_behind; + +static void clocksource_verify_one_cpu(void *csin) +{ + struct clocksource *cs = (struct clocksource *)csin; + + __this_cpu_write(csnow_mid, cs->read(cs)); +} + +static void clocksource_verify_percpu_wq(struct work_struct *unused) +{ + int cpu; + struct clocksource *cs; + int64_t cs_nsec; + u64 csnow_begin; + u64 csnow_end; + u64 delta; + + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release + if (WARN_ON_ONCE(!cs)) + return; + 
pr_warn("Checking clocksource %s synchronization from CPU %d.\n", + cs->name, smp_processor_id()); + cpumask_clear(&cpus_ahead); + cpumask_clear(&cpus_behind); + csnow_begin = cs->read(cs); + smp_call_function(clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + for_each_online_cpu(cpu) { + if (cpu == smp_processor_id()) + continue; + delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_behind); + delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_ahead); + } + if (!cpuma
[PATCH clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking
From: "Paul E. McKenney" Although smp_call_function() has the advantage of simplicity, using it to check for cross-CPU clock desynchronization means that any CPU being slow reduces the sensitivity of the checking across all CPUs. And it is not uncommon for smp_call_function() latencies to be in the hundreds of microseconds. This commit therefore switches to smp_call_function_single(), so that delays from a given CPU affect only those measurements involving that particular CPU. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 41 + 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index df48416..4161c84 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void) } static struct clocksource *clocksource_verify_work_cs; -static DEFINE_PER_CPU(u64, csnow_mid); +static u64 csnow_mid; static cpumask_t cpus_ahead; static cpumask_t cpus_behind; @@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin) sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; delta = sign * NSEC_PER_SEC; } - __this_cpu_write(csnow_mid, cs->read(cs) + delta); + csnow_mid = cs->read(cs) + delta; } static void clocksource_verify_percpu_wq(struct work_struct *unused) @@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) int cpu; struct clocksource *cs; int64_t cs_nsec; + int64_t cs_nsec_max; + int64_t cs_nsec_min; u64 csnow_begin; u64 csnow_end; - u64 delta; + s64 delta; + bool firsttime = 1; cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release if (WARN_ON_ONCE(!cs)) @@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) cs->name, smp_processor_id()); cpumask_clear(&cpus_ahead); cpumask_clear(&cpus_behind); - csnow_begin = cs->read(cs); - smp_call_function(clocksource_verify_one_cpu, cs, 1); - csnow_end = cs->read(cs); + preempt_disable(); for_each_online_cpu(cpu) { if (cpu == smp_processor_id()) continue; - delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; - if ((s64)delta < 0) + csnow_begin = cs->read(cs); + smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + delta = (s64)((csnow_mid - csnow_begin) & cs->mask); + if (delta < 0) cpumask_set_cpu(cpu, &cpus_behind); - delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; - if ((s64)delta < 0) + delta = (csnow_end - csnow_mid) & cs->mask; + if (delta < 0) cpumask_set_cpu(cpu, &cpus_ahead); + delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); + cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); + if (firsttime || cs_nsec > cs_nsec_max) + cs_nsec_max = cs_nsec; + if (firsttime || cs_nsec < cs_nsec_min) + cs_nsec_min = cs_nsec; + firsttime = 0; } + preempt_enable(); if (!cpumask_empty(&cpus_ahead)) pr_warn("CPUs %*pbl ahead of CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_ahead), @@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) pr_warn("CPUs %*pbl behind CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_behind), smp_processor_id(), cs->name); - if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) { - delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); - cs_nsec = clocksource_cyc2ns(delta, cs->mult, 
cs->shift); - pr_warn("CPU %d duration %lldns for clocksource %s.\n", - smp_processor_id(), cs_nsec, cs->name); - } + if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind))) + pr_warn("CPU %d check durations %lldns - %lldns for clocksource %s.\n", + smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name); smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire. } -- 2.9.5
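The point of switching to per-CPU begin/end pairs can be seen with a small user-space sketch using invented readings: one slow measurement widens only its own window, and the reported min/max durations bound each individual check rather than the whole pass:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Invented per-CPU (begin, end) readings, already converted to ns. */
struct window { int64_t begin, end; };

int main(void)
{
    struct window w[] = { {0, 1800}, {2000, 2950}, {3000, 45000}, {46000, 47300} };
    int64_t cs_nsec_min = 0, cs_nsec_max = 0;
    bool firsttime = true;

    for (int cpu = 0; cpu < 4; cpu++) {
        int64_t cs_nsec = w[cpu].end - w[cpu].begin;   /* this CPU's check duration */

        if (firsttime || cs_nsec > cs_nsec_max)
            cs_nsec_max = cs_nsec;
        if (firsttime || cs_nsec < cs_nsec_min)
            cs_nsec_min = cs_nsec;
        firsttime = false;
    }
    /* The one slow check (42000ns) does not loosen the other CPUs' bounds. */
    printf("per-CPU check durations %lldns - %lldns; whole pass %lldns\n",
           (long long)cs_nsec_min, (long long)cs_nsec_max,
           (long long)(w[3].end - w[0].begin));
    return 0;
}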
[PATCH clocksource 2/5] clocksource: Retry clock read if long delays detected
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. This commit therefore re-reads the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If retries were required, a message is printed on the console. If the number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ] [ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 545889c..4663b86 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating); */ #define WATCHDOG_INTERVAL (HZ >> 1) #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4) +#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6) static void clocksource_watchdog_work(struct work_struct *work) { @@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void) static void clocksource_watchdog(struct timer_list *unused) { struct clocksource *cs; - u64 csnow, wdnow, cslast, wdlast, delta; - int64_t wd_nsec, cs_nsec; + u64 csnow, wdnow, wdagain, cslast, wdlast, delta; + int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec; int next_cpu, reset_pending; + int nretries; spin_lock(&watchdog_lock); if (!watchdog_running) @@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused) reset_pending = atomic_read(&watchdog_reset_pending); list_for_each_entry(cs, &watchdog_list, wd_list) { + nretries = 0; /* Clocksource already marked unstable? 
*/ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { @@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused) continue; } +retry: local_irq_disable(); - csnow = cs->read(cs); - clocksource_watchdog_inject_delay(); wdnow = watchdog->read(watchdog); + clocksource_watchdog_inject_delay(); + csnow = cs->read(cs); + wdagain = watchdog->read(watchdog); local_irq_enable(); + delta = clocksource_delta(wdagain, wdnow, watchdog->mask); + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift); + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) { + wderr_nsec = wdagain_nsec; + if (nretries++ < max_read_retries) + goto retry; + } + if (nretries) + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n", + smp_processor_id(), watchdog->name, wderr_nsec, nretries); /* Clocksource initialized ? */ if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || -- 2.9.5
[PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking
From: "Paul E. McKenney" Code that checks for clock desynchronization must itself be tested, so this commit creates a new clocksource.inject_delay_shift_percpu= kernel boot parameter that adds or subtracts a large value from the check read, using the specified bit of the CPU ID to determine whether to add or to subtract. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 9 + kernel/time/clocksource.c | 10 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9965266..f561e94 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -593,6 +593,15 @@ times the value specified for inject_delay_freq of consecutive non-delays. + clocksource.inject_delay_shift_percpu= [KNL] + Shift count to obtain bit from CPU number to + determine whether to shift the time of the per-CPU + clock under test ahead or behind. For example, + setting this to the value four will result in + alternating groups of 16 CPUs shifting ahead and + the rest of the CPUs shifting behind. The default + value of -1 disable this type of error injection. + clocksource.max_read_retries= [KNL] Number of clocksource_watchdog() retries due to external delays before the clock will be marked diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 23bcefe..67cf41c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -190,6 +190,8 @@ static int inject_delay_freq; module_param(inject_delay_freq, int, 0644); static int inject_delay_run = 1; module_param(inject_delay_run, int, 0644); +static int inject_delay_shift_percpu = -1; +module_param(inject_delay_shift_percpu, int, 0644); static int max_read_retries = 3; module_param(max_read_retries, int, 0644); @@ -219,8 +221,14 @@ static cpumask_t cpus_behind; static void clocksource_verify_one_cpu(void *csin) { struct clocksource *cs = (struct clocksource *)csin; + s64 delta = 0; + int sign; - __this_cpu_write(csnow_mid, cs->read(cs)); + if (inject_delay_shift_percpu >= 0) { + sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; + delta = sign * NSEC_PER_SEC; + } + __this_cpu_write(csnow_mid, cs->read(cs) + delta); } static void clocksource_verify_percpu_wq(struct work_struct *unused) -- 2.9.5
[PATCH clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable
From: "Paul E. McKenney" Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedy been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know? This commit therefore adds CPU-to-CPU synchronization checking for newly unstable clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- arch/x86/kernel/kvmclock.c | 2 +- arch/x86/kernel/tsc.c | 3 +- include/linux/clocksource.h | 2 +- kernel/time/clocksource.c | 73 + 4 files changed, 77 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index aa59374..337bb2c 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -169,7 +169,7 @@ struct clocksource kvm_clock = { .read = kvm_clock_get_cycles, .rating = 400, .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU, .enable = kvm_cs_enable, }; EXPORT_SYMBOL_GPL(kvm_clock); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index f70dffc..5628917 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = { .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VALID_FOR_HRES | - CLOCK_SOURCE_MUST_VERIFY, + CLOCK_SOURCE_MUST_VERIFY | + CLOCK_SOURCE_VERIFY_PERCPU, .vdso_clock_mode= VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 86d143d..83a3ebf 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -131,7 +131,7 @@ struct clocksource { #define CLOCK_SOURCE_UNSTABLE 0x40 #define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80 #define CLOCK_SOURCE_RESELECT 0x100 - +#define CLOCK_SOURCE_VERIFY_PERCPU 0x200 /* simplify initialization of mask field */ #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 4663b86..23bcefe 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void) WARN_ON_ONCE(injectfail < 0); } +static struct clocksource *clocksource_verify_work_cs; +static DEFINE_PER_CPU(u64, csnow_mid); +static cpumask_t cpus_ahead; +static cpumask_t cpus_behind; + +static void clocksource_verify_one_cpu(void *csin) +{ + struct clocksource *cs = (struct clocksource *)csin; + + __this_cpu_write(csnow_mid, cs->read(cs)); +} + +static void clocksource_verify_percpu_wq(struct work_struct *unused) +{ + int cpu; + struct clocksource *cs; + int64_t cs_nsec; + u64 csnow_begin; + u64 csnow_end; + u64 delta; + + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release + if (WARN_ON_ONCE(!cs)) + return; + 
pr_warn("Checking clocksource %s synchronization from CPU %d.\n", + cs->name, smp_processor_id()); + cpumask_clear(&cpus_ahead); + cpumask_clear(&cpus_behind); + csnow_begin = cs->read(cs); + smp_call_function(clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + for_each_online_cpu(cpu) { + if (cpu == smp_processor_id()) + continue; + delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_behind); + delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_ahead); + } + if (!cpuma
[PATCH clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. The first step is a way of injecting such delays, and this commit therefore provides a clocksource.inject_delay_freq and clocksource.inject_delay_run kernel boot parameters that specify that sufficient delay be injected to cause the clocksource_watchdog() function to mark a clock unstable. This delay is injected every Nth set of M calls to clocksource_watchdog(), where N is the value specified for the inject_delay_freq boot parameter and M is the value specified for the inject_delay_run boot parameter. Values of zero or less for either parameter disable delay injection, and the default for clocksource.inject_delay_freq is zero, that is, disabled. The default for clocksource.inject_delay_run is the value one, that is single-call runs. This facility is intended for diagnostic use only, and should be avoided on production systems. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen [ paulmck: Apply Rik van Riel feedback. ] Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 22 kernel/time/clocksource.c | 27 + 2 files changed, 49 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a10b545..9965266 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -577,6 +577,28 @@ loops can be debugged more effectively on production systems. + clocksource.inject_delay_freq= [KNL] + Number of runs of calls to clocksource_watchdog() + before delays are injected between reads from the + two clocksources. Values less than or equal to + zero disable this delay injection. These delays + can cause clocks to be marked unstable, so use + of this parameter should therefore be avoided on + production systems. Defaults to zero (disabled). + + clocksource.inject_delay_run= [KNL] + Run lengths of clocksource_watchdog() delay + injections. Specifying the value 8 will result + in eight consecutive delays followed by eight + times the value specified for inject_delay_freq + of consecutive non-delays. + + clocksource.max_read_retries= [KNL] + Number of clocksource_watchdog() retries due to + external delays before the clock will be marked + unstable. Defaults to three retries, that is, + four attempts to read the clock under test. + clearcpuid=BITNUM[,BITNUM...] [X86] Disable CPUID feature X for the kernel. 
See arch/x86/include/asm/cpufeatures.h for the valid bit diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cce484a..545889c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -14,6 +14,7 @@ #include /* for spin_unlock_irq() using preempt_count() m68k */ #include #include +#include #include "tick-internal.h" #include "timekeeping_internal.h" @@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs) spin_unlock_irqrestore(&watchdog_lock, flags); } +static int inject_delay_freq; +module_param(inject_delay_freq, int, 0644); +static int inject_delay_run = 1; +module_param(inject_delay_run, int, 0644); +static int max_read_retries = 3; +module_param(max_read_retries, int, 0644); + +static void clocksource_watchdog_inject_delay(void) +{ + int i; + static int injectfail = -1; + + if (inject_delay_freq <= 0 || inject_delay_run <= 0) + return; + if (injectfail < 0 || injectfail > INT_MAX / 2) + injectfail = inject_delay_run; + if (!(++injectfail / inject_delay_run % inject_delay_freq)) { + printk("%s(): Injecting delay.\n", __func__); + for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++) + udelay(1000); + printk("%s(): Done injecting delay.\n", __func_
[PATCH clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking
From: "Paul E. McKenney" Although smp_call_function() has the advantage of simplicity, using it to check for cross-CPU clock desynchronization means that any CPU being slow reduces the sensitivity of the checking across all CPUs. And it is not uncommon for smp_call_function() latencies to be in the hundreds of microseconds. This commit therefore switches to smp_call_function_single(), so that delays from a given CPU affect only those measurements involving that particular CPU. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 41 + 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 67cf41c..3bae5fb 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void) } static struct clocksource *clocksource_verify_work_cs; -static DEFINE_PER_CPU(u64, csnow_mid); +static u64 csnow_mid; static cpumask_t cpus_ahead; static cpumask_t cpus_behind; @@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin) sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; delta = sign * NSEC_PER_SEC; } - __this_cpu_write(csnow_mid, cs->read(cs) + delta); + csnow_mid = cs->read(cs) + delta; } static void clocksource_verify_percpu_wq(struct work_struct *unused) @@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) int cpu; struct clocksource *cs; int64_t cs_nsec; + int64_t cs_nsec_max; + int64_t cs_nsec_min; u64 csnow_begin; u64 csnow_end; - u64 delta; + s64 delta; + bool firsttime = 1; cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release if (WARN_ON_ONCE(!cs)) @@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) cs->name, smp_processor_id()); cpumask_clear(&cpus_ahead); cpumask_clear(&cpus_behind); - csnow_begin = cs->read(cs); - smp_call_function(clocksource_verify_one_cpu, cs, 1); - csnow_end = cs->read(cs); + preempt_disable(); for_each_online_cpu(cpu) { if (cpu == smp_processor_id()) continue; - delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; - if ((s64)delta < 0) + csnow_begin = cs->read(cs); + smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + delta = (s64)((csnow_mid - csnow_begin) & cs->mask); + if (delta < 0) cpumask_set_cpu(cpu, &cpus_behind); - delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; - if ((s64)delta < 0) + delta = (csnow_end - csnow_mid) & cs->mask; + if (delta < 0) cpumask_set_cpu(cpu, &cpus_ahead); + delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); + cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); + if (firsttime || cs_nsec > cs_nsec_max) + cs_nsec_max = cs_nsec; + if (firsttime || cs_nsec < cs_nsec_min) + cs_nsec_min = cs_nsec; + firsttime = 0; } + preempt_enable(); if (!cpumask_empty(&cpus_ahead)) pr_warn("CPUs %*pbl ahead of CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_ahead), @@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) pr_warn("CPUs %*pbl behind CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_behind), smp_processor_id(), cs->name); - if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) { - delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); - cs_nsec = clocksource_cyc2ns(delta, cs->mult, 
cs->shift); - pr_warn("CPU %d duration %lldns for clocksource %s.\n", - smp_processor_id(), cs_nsec, cs->name); - } + if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind))) + pr_warn("CPU %d check durations %lldns - %lldns for clocksource %s.\n", + smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name); smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire. } -- 2.9.5
[PATCH tip/core/rcu 11/12] rcu: Execute RCU reader shortly after rcu_core for strict GPs
From: "Paul E. McKenney" A kernel built with CONFIG_RCU_STRICT_GRACE_PERIOD=y needs a quiescent state to appear very shortly after a CPU has noticed a new grace period. Placing an RCU reader immediately after this point is ineffective because this normally happens in softirq context, which acts as a big RCU reader. This commit therefore introduces a new per-CPU work_struct, which is used at the end of rcu_core() processing to schedule an RCU read-side critical section from within a clean environment. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 + kernel/rcu/tree.h | 1 + 2 files changed, 14 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index dd7af40..ac37343 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2634,6 +2634,14 @@ void rcu_force_quiescent_state(void) } EXPORT_SYMBOL_GPL(rcu_force_quiescent_state); +// Workqueue handler for an RCU reader for kernels enforcing struct RCU +// grace periods. +static void strict_work_handler(struct work_struct *work) +{ + rcu_read_lock(); + rcu_read_unlock(); +} + /* Perform RCU core processing work for the current CPU. */ static __latent_entropy void rcu_core(void) { @@ -2678,6 +2686,10 @@ static __latent_entropy void rcu_core(void) /* Do any needed deferred wakeups of rcuo kthreads. */ do_nocb_deferred_wakeup(rdp); trace_rcu_utilization(TPS("End RCU core")); + + // If strict GPs, schedule an RCU reader in a clean environment. + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + queue_work_on(rdp->cpu, rcu_gp_wq, &rdp->strict_work); } static void rcu_core_si(struct softirq_action *h) @@ -3874,6 +3886,7 @@ rcu_boot_init_percpu_data(int cpu) /* Set up local state, ensuring consistent view of global state. */ rdp->grpmask = leaf_node_cpu_bit(rdp->mynode, cpu); + INIT_WORK(&rdp->strict_work, strict_work_handler); WARN_ON_ONCE(rdp->dynticks_nesting != 1); WARN_ON_ONCE(rcu_dynticks_in_eqs(rcu_dynticks_snap(rdp))); rdp->rcu_ofl_gp_seq = rcu_state.gp_seq; diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 309bc7f..e4f66b8 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -165,6 +165,7 @@ struct rcu_data { /* period it is aware of. */ struct irq_work defer_qs_iw;/* Obtain later scheduler attention. */ bool defer_qs_iw_pending; /* Scheduler attention pending? */ + struct work_struct strict_work; /* Schedule readers for strict GPs. */ /* 2) batch handling */ struct rcu_segcblist cblist;/* Segmented callback list, with */ -- 2.9.5
[PATCH tip/core/rcu 05/12] rcu: Always set .need_qs from __rcu_read_lock() for strict GPs
From: "Paul E. McKenney" The ->rcu_read_unlock_special.b.need_qs field in the task_struct structure indicates that the RCU core needs a quiscent state from the corresponding task. The __rcu_read_unlock() function checks this (via an eventual call to rcu_preempt_deferred_qs_irqrestore()), and if set reports a quiscent state immediately upon exit from the outermost RCU read-side critical section. Currently, this flag is only set when the scheduling-clock interrupt decides that the current RCU grace period is too old, as in about one full second too old. But if the kernel has been built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, we clearly do not want to wait that long. This commit therefore sets the .need_qs field immediately at the start of the RCU read-side critical section from within __rcu_read_lock() in order to unconditionally enlist help from __rcu_read_unlock(). But note the additional check for rcu_state.gp_kthread, which prevents attempts to awaken RCU's grace-period kthread during early boot before there is a scheduler. Leaving off this check results in early boot hangs. So early that there is no console output. Thus, this additional check fails until such time as RCU's grace-period kthread has been created, avoiding these empty-console hangs. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 5c0c580..7ed55c5 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -376,6 +376,8 @@ void __rcu_read_lock(void) rcu_preempt_read_enter(); if (IS_ENABLED(CONFIG_PROVE_LOCKING)) WARN_ON_ONCE(rcu_preempt_depth() > RCU_NEST_PMAX); + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) && rcu_state.gp_kthread) + WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true); barrier(); /* critical section after entry code. */ } EXPORT_SYMBOL_GPL(__rcu_read_lock); -- 2.9.5
[PATCH tip/core/rcu 10/12] rcu: Provide optional RCU-reader exit delay for strict GPs
From: "Paul E. McKenney" The goal of this series is to increase the probability of tools like KASAN detecting that an RCU-protected pointer was used outside of its RCU read-side critical section. Thus far, the approach has been to make grace periods and callback processing happen faster. Another approach is to delay the pointer leaker. This commit therefore allows a delay to be applied to exit from RCU read-side critical sections. This slowdown is specified by a new rcutree.rcu_unlock_delay kernel boot parameter that specifies this delay in microseconds, defaulting to zero. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 9 + kernel/rcu/tree_plugin.h| 12 ++-- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 60e2c6e..c532c70 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4128,6 +4128,15 @@ This wake_up() will be accompanied by a WARN_ONCE() splat and an ftrace_dump(). + rcutree.rcu_unlock_delay= [KNL] + In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, + this specifies an rcu_read_unlock()-time delay + in microseconds. This defaults to zero. + Larger delays increase the probability of + catching RCU pointer leaks, that is, buggy use + of RCU-protected pointers after the relevant + rcu_read_unlock() has completed. + rcutree.sysrq_rcu= [KNL] Commandeer a sysrq key to dump out Tree RCU's rcu_node tree with an eye towards determining diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 1761ff4..25c9ee4 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -430,6 +430,12 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp) return !list_empty(&rnp->blkd_tasks); } +// Add delay to rcu_read_unlock() for strict grace periods. +static int rcu_unlock_delay; +#ifdef CONFIG_RCU_STRICT_GRACE_PERIOD +module_param(rcu_unlock_delay, int, 0444); +#endif + /* * Report deferred quiescent states. The deferral time can * be quite short, for example, in the case of the call from @@ -460,10 +466,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags) } t->rcu_read_unlock_special.s = 0; if (special.b.need_qs) { - if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) { rcu_report_qs_rdp(rdp->cpu, rdp); - else + udelay(rcu_unlock_delay); + } else { rcu_qs(); + } } /* -- 2.9.5
[PATCH tip/core/rcu 12/12] rcu: Report QS for outermost PREEMPT=n rcu_read_unlock() for strict GPs
From: "Paul E. McKenney" The CONFIG_PREEMPT=n instance of rcu_read_unlock is even more aggressively than that of CONFIG_PREEMPT=y in deferring reporting quiescent states to the RCU core. This is just what is wanted in normal use because it reduces overhead, but the resulting delay is not what is wanted for kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. This commit therefore adds an rcu_read_unlock_strict() function that checks for exceptional conditions, and reports the newly started quiescent state if it is safe to do so, also doing a spin-delay if requested via rcutree.rcu_unlock_delay. This commit also adds a call to rcu_read_unlock_strict() from the CONFIG_PREEMPT=n instance of __rcu_read_unlock(). [ paulmck: Fixed bug located by kernel test robot ] Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 7 +++ kernel/rcu/tree.c| 6 ++ kernel/rcu/tree_plugin.h | 23 +-- 3 files changed, 30 insertions(+), 6 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index b47d6b6..7c1ceff 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -55,6 +55,12 @@ void __rcu_read_unlock(void); #else /* #ifdef CONFIG_PREEMPT_RCU */ +#ifdef CONFIG_TINY_RCU +#define rcu_read_unlock_strict() do { } while (0) +#else +void rcu_read_unlock_strict(void); +#endif + static inline void __rcu_read_lock(void) { preempt_disable(); @@ -63,6 +69,7 @@ static inline void __rcu_read_lock(void) static inline void __rcu_read_unlock(void) { preempt_enable(); + rcu_read_unlock_strict(); } static inline int rcu_preempt_depth(void) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ac37343..78852ef 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -164,6 +164,12 @@ module_param(gp_init_delay, int, 0444); static int gp_cleanup_delay; module_param(gp_cleanup_delay, int, 0444); +// Add delay to rcu_read_unlock() for strict grace periods. +static int rcu_unlock_delay; +#ifdef CONFIG_RCU_STRICT_GRACE_PERIOD +module_param(rcu_unlock_delay, int, 0444); +#endif + /* * This rcu parameter is runtime-read-only. It reflects * a minimum allowed number of objects which can be cached diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 25c9ee4..0881aef 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -430,12 +430,6 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp) return !list_empty(&rnp->blkd_tasks); } -// Add delay to rcu_read_unlock() for strict grace periods. -static int rcu_unlock_delay; -#ifdef CONFIG_RCU_STRICT_GRACE_PERIOD -module_param(rcu_unlock_delay, int, 0444); -#endif - /* * Report deferred quiescent states. The deferral time can * be quite short, for example, in the case of the call from @@ -785,6 +779,23 @@ dump_blkd_tasks(struct rcu_node *rnp, int ncheck) #else /* #ifdef CONFIG_PREEMPT_RCU */ /* + * If strict grace periods are enabled, and if the calling + * __rcu_read_unlock() marks the beginning of a quiescent state, immediately + * report that quiescent state and, if requested, spin for a bit. + */ +void rcu_read_unlock_strict(void) +{ + struct rcu_data *rdp; + + if (!IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) || + irqs_disabled() || preempt_count() || !rcu_state.gp_kthread) + return; + rdp = this_cpu_ptr(&rcu_data); + rcu_report_qs_rdp(rdp->cpu, rdp); + udelay(rcu_unlock_delay); +} + +/* * Tell them what RCU they are running. */ static void __init rcu_bootup_announce(void) -- 2.9.5
[PATCH tip/core/rcu 03/12] rcu: Restrict default jiffies_till_first_fqs for strict RCU GPs
From: "Paul E. McKenney" If there are idle CPUs, RCU's grace-period kthread will wait several jiffies before even thinking about polling them. This promotes efficiency, which is normally a good thing, but when the kernel has been built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, we care more about short grace periods. This commit therefore restricts the default jiffies_till_first_fqs value to zero in kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, which causes RCU's grace-period kthread to poll for idle CPUs immediately after starting a grace period. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 65e1b5e..d333f1b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -471,7 +471,7 @@ module_param(qhimark, long, 0444); module_param(qlowmark, long, 0444); module_param(qovld, long, 0444); -static ulong jiffies_till_first_fqs = ULONG_MAX; +static ulong jiffies_till_first_fqs = IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ? 0 : ULONG_MAX; static ulong jiffies_till_next_fqs = ULONG_MAX; static bool rcu_kick_kthreads; static int rcu_divisor = 7; -- 2.9.5
[PATCH tip/core/rcu 09/12] rcu: IPI all CPUs at GP end for strict GPs
From: "Paul E. McKenney" Currently, each CPU discovers the end of a given grace period on its own time, which is again good for efficiency but bad for fast grace periods, given that it is things like kfree() within the RCU callbacks that will cause trouble for pointers leaked from RCU read-side critical sections. This commit therefore uses on_each_cpu() to IPI each CPU after grace-period cleanup in order to inform each CPU of the end of the old grace period in a timely manner, but only in kernels build with CONFIG_RCU_STRICT_GRACE_PERIOD=y. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 4 1 file changed, 4 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index a30d6f3..dd7af40 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2034,6 +2034,10 @@ static void rcu_gp_cleanup(void) rcu_state.gp_flags & RCU_GP_FLAG_INIT); } raw_spin_unlock_irq_rcu_node(rnp); + + // If strict, make all CPUs aware of the end of the old grace period. + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + on_each_cpu(rcu_strict_gp_boundary, NULL, 0); } /* -- 2.9.5
[PATCH tip/core/rcu 08/12] rcu: IPI all CPUs at GP start for strict GPs
From: "Paul E. McKenney" Currently, each CPU discovers the beginning of a given grace period on its own time, which is again good for efficiency but bad for fast grace periods. This commit therefore uses on_each_cpu() to IPI each CPU after grace-period initialization in order to inform each CPU of the new grace period in a timely manner, but only in kernels build with CONFIG_RCU_STRICT_GRACE_PERIOD=y. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 + 1 file changed, 13 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 4353a1a..a30d6f3 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1678,6 +1678,15 @@ static void rcu_gp_torture_wait(void) } /* + * Handler for on_each_cpu() to invoke the target CPU's RCU core + * processing. + */ +static void rcu_strict_gp_boundary(void *unused) +{ + invoke_rcu_core(); +} + +/* * Initialize a new grace period. Return false if no grace period required. */ static bool rcu_gp_init(void) @@ -1805,6 +1814,10 @@ static bool rcu_gp_init(void) WRITE_ONCE(rcu_state.gp_activity, jiffies); } + // If strict, make all CPUs aware of new grace period. + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + on_each_cpu(rcu_strict_gp_boundary, NULL, 0); + return true; } -- 2.9.5
[PATCH tip/core/rcu 06/12] rcu: Do full report for .need_qs for strict GPs
From: "Paul E. McKenney" The rcu_preempt_deferred_qs_irqrestore() function is invoked at the end of an RCU read-side critical section (for example, directly from rcu_read_unlock()) and, if .need_qs is set, invokes rcu_qs() to report the new quiescent state. This works, except that rcu_qs() only updates per-CPU state, leaving reporting of the actual quiescent state to a later call to rcu_report_qs_rdp(), for example from within a later RCU_SOFTIRQ instance. Although this approach is exactly what you want if you are more concerned about efficiency than about short grace periods, in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, short grace periods are the name of the game. This commit therefore makes rcu_preempt_deferred_qs_irqrestore() directly invoke rcu_report_qs_rdp() in CONFIG_RCU_STRICT_GRACE_PERIOD=y, thus shortening grace periods. Historical note: To the best of my knowledge, causing rcu_read_unlock() to directly report a quiescent state first appeared in Jim Houston's and Joe Korty's JRCU. This is the second instance of a Linux-kernel RCU feature being inspired by JRCU, the first being RCU callback offloading (as in the RCU_NOCB_CPU Kconfig option). Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 7ed55c5..1761ff4 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -459,8 +459,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags) return; } t->rcu_read_unlock_special.s = 0; - if (special.b.need_qs) - rcu_qs(); + if (special.b.need_qs) { + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + rcu_report_qs_rdp(rdp->cpu, rdp); + else + rcu_qs(); + } /* * Respond to a request by an expedited grace period for a -- 2.9.5
[PATCH tip/core/rcu 04/12] rcu: Force DEFAULT_RCU_BLIMIT to 1000 for strict RCU GPs
From: "Paul E. McKenney" The value of DEFAULT_RCU_BLIMIT is normally set to 10, the idea being to avoid needless response-time degradation due to RCU callback invocation. However, when CONFIG_RCU_STRICT_GRACE_PERIOD=y it is better to avoid throttling callback execution in order to better detect pointer leaks from RCU read-side critical sections. This commit therefore sets the value of DEFAULT_RCU_BLIMIT to 1000 in kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d333f1b..08cc91c 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -454,17 +454,18 @@ static int rcu_is_cpu_rrupt_from_idle(void) return __this_cpu_read(rcu_data.dynticks_nesting) == 0; } -#define DEFAULT_RCU_BLIMIT 10 /* Maximum callbacks per rcu_do_batch ... */ -#define DEFAULT_MAX_RCU_BLIMIT 1 /* ... even during callback flood. */ +#define DEFAULT_RCU_BLIMIT (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ? 1000 : 10) + // Maximum callbacks per rcu_do_batch ... +#define DEFAULT_MAX_RCU_BLIMIT 1 // ... even during callback flood. static long blimit = DEFAULT_RCU_BLIMIT; -#define DEFAULT_RCU_QHIMARK 1 /* If this many pending, ignore blimit. */ +#define DEFAULT_RCU_QHIMARK 1 // If this many pending, ignore blimit. static long qhimark = DEFAULT_RCU_QHIMARK; -#define DEFAULT_RCU_QLOMARK 100 /* Once only this many pending, use blimit. */ +#define DEFAULT_RCU_QLOMARK 100 // Once only this many pending, use blimit. static long qlowmark = DEFAULT_RCU_QLOMARK; #define DEFAULT_RCU_QOVLD_MULT 2 #define DEFAULT_RCU_QOVLD (DEFAULT_RCU_QOVLD_MULT * DEFAULT_RCU_QHIMARK) -static long qovld = DEFAULT_RCU_QOVLD; /* If this many pending, hammer QS. */ -static long qovld_calc = -1; /* No pre-initialization lock acquisitions! */ +static long qovld = DEFAULT_RCU_QOVLD; // If this many pending, hammer QS. +static long qovld_calc = -1; // No pre-initialization lock acquisitions! module_param(blimit, long, 0444); module_param(qhimark, long, 0444); -- 2.9.5
[PATCH tip/core/rcu 02/12] rcu: Reduce leaf fanout for strict RCU grace periods
From: "Paul E. McKenney" Because strict RCU grace periods will complete more quickly, they will experience greater lock contention on each leaf rcu_node structure's ->lock. This commit therefore reduces the leaf fanout in order to reduce this lock contention. Note that this also has the effect of reducing the number of CPUs supported to 16 in the case of CONFIG_RCU_FANOUT_LEAF=2 or 81 in the case of CONFIG_RCU_FANOUT_LEAF=3. However, greater numbers of CPUs are probably a bad idea when using CONFIG_RCU_STRICT_GRACE_PERIOD=y. Those wishing to live dangerously are free to edit their kernel/rcu/Kconfig files accordingly. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/Kconfig | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index 0ebe15a..b71e21f 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -135,10 +135,12 @@ config RCU_FANOUT config RCU_FANOUT_LEAF int "Tree-based hierarchical RCU leaf-level fanout value" - range 2 64 if 64BIT - range 2 32 if !64BIT + range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD + range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD + range 2 3 if RCU_STRICT_GRACE_PERIOD depends on TREE_RCU && RCU_EXPERT - default 16 + default 16 if !RCU_STRICT_GRACE_PERIOD + default 2 if RCU_STRICT_GRACE_PERIOD help This option controls the leaf-level fanout of hierarchical implementations of RCU, and allows trading off cache misses -- 2.9.5
[PATCH tip/core/rcu 07/12] rcu: Attempt QS when CPU discovers GP for strict GPs
From: "Paul E. McKenney" A given CPU normally notes a new grace period during one RCU_SOFTIRQ, but avoids reporting the corresponding quiescent state until some later RCU_SOFTIRQ. This leisurly approach improves efficiency by increasing the number of update requests served by each grace period, but is not what is needed for kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. This commit therefore adds a new rcu_strict_gp_check_qs() function which, in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, simply enters and immediately exist an RCU read-side critical section. If the CPU is in a quiescent state, the rcu_read_unlock() will attempt to report an immediate quiescent state. This rcu_strict_gp_check_qs() function is invoked from note_gp_changes(), so that a CPU just noticing a new grace period might immediately report a quiescent state for that grace period. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 08cc91c..4353a1a 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1557,6 +1557,19 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp, } /* + * In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, attempt to generate a + * quiescent state. This is intended to be invoked when the CPU notices + * a new grace period. + */ +static void rcu_strict_gp_check_qs(void) +{ + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) { + rcu_read_lock(); + rcu_read_unlock(); + } +} + +/* * Update CPU-local rcu_data state to record the beginnings and ends of * grace periods. The caller must hold the ->lock of the leaf rcu_node * structure corresponding to the current CPU, and must have irqs disabled. @@ -1626,6 +1639,7 @@ static void note_gp_changes(struct rcu_data *rdp) } needwake = __note_gp_changes(rnp, rdp); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + rcu_strict_gp_check_qs(); if (needwake) rcu_gp_kthread_wake(); } -- 2.9.5
[PATCH tip/core/rcu 01/12] rcu: Add Kconfig option for strict RCU grace periods
From: "Paul E. McKenney" People running automated tests have asked for a way to make RCU minimize grace-period duration in order to increase the probability of KASAN detecting a pointer being improperly leaked from an RCU read-side critical section, for example, like this: rcu_read_lock(); p = rcu_dereference(gp); do_something_with(p); // OK rcu_read_unlock(); do_something_else_with(p); // BUG!!! The rcupdate.rcu_expedited boot parameter is a start in this direction, given that it makes calls to synchronize_rcu() instead invoke the faster (and more wasteful) synchronize_rcu_expedited(). However, this does nothing to shorten RCU grace periods that are instead initiated by call_rcu(), and RCU pointer-leak bugs can involve call_rcu() just as surely as they can synchronize_rcu(). This commit therefore adds a RCU_STRICT_GRACE_PERIOD Kconfig option that will be used to shorten normal (non-expedited) RCU grace periods. This commit also dumps out a message when this option is in effect. Later commits will actually shorten grace periods. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/Kconfig.debug | 15 +++ kernel/rcu/tree_plugin.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 3cf6132..cab5a4b 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -114,4 +114,19 @@ config RCU_EQS_DEBUG Say N here if you need ultimate kernel/user switch latencies Say Y if you are unsure +config RCU_STRICT_GRACE_PERIOD + bool "Provide debug RCU implementation with short grace periods" + depends on DEBUG_KERNEL && RCU_EXPERT + default n + select PREEMPT_COUNT if PREEMPT=n + help + Select this option to build an RCU variant that is strict about + grace periods, making them as short as it can. This limits + scalability, destroys real-time response, degrades battery + lifetime and kills performance. Don't try this on large + machines, as in systems with more than about 10 or 20 CPUs. + But in conjunction with tools like KASAN, it can be helpful + when looking for certain types of RCU usage bugs, for example, + too-short RCU read-side critical sections. + endmenu # "RCU Debugging" diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index cb1e8c8..5c0c580 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -36,6 +36,8 @@ static void __init rcu_bootup_announce_oddness(void) pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n"); if (IS_ENABLED(CONFIG_PROVE_RCU)) pr_info("\tRCU lockdep checking is enabled.\n"); + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + pr_info("\tRCU strict (and thus non-scalable) grace periods enabled.\n"); if (RCU_NUM_LVLS >= 4) pr_info("\tFour(or more)-level hierarchy is enabled.\n"); if (RCU_FANOUT_LEAF != 16) -- 2.9.5
[PATCH v2 clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking
From: "Paul E. McKenney" Although smp_call_function() has the advantage of simplicity, using it to check for cross-CPU clock desynchronization means that any CPU being slow reduces the sensitivity of the checking across all CPUs. And it is not uncommon for smp_call_function() latencies to be in the hundreds of microseconds. This commit therefore switches to smp_call_function_single(), so that delays from a given CPU affect only those measurements involving that particular CPU. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 41 + 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 67cf41c..3bae5fb 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void) } static struct clocksource *clocksource_verify_work_cs; -static DEFINE_PER_CPU(u64, csnow_mid); +static u64 csnow_mid; static cpumask_t cpus_ahead; static cpumask_t cpus_behind; @@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin) sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; delta = sign * NSEC_PER_SEC; } - __this_cpu_write(csnow_mid, cs->read(cs) + delta); + csnow_mid = cs->read(cs) + delta; } static void clocksource_verify_percpu_wq(struct work_struct *unused) @@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) int cpu; struct clocksource *cs; int64_t cs_nsec; + int64_t cs_nsec_max; + int64_t cs_nsec_min; u64 csnow_begin; u64 csnow_end; - u64 delta; + s64 delta; + bool firsttime = 1; cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release if (WARN_ON_ONCE(!cs)) @@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) cs->name, smp_processor_id()); cpumask_clear(&cpus_ahead); cpumask_clear(&cpus_behind); - csnow_begin = cs->read(cs); - smp_call_function(clocksource_verify_one_cpu, cs, 1); - csnow_end = cs->read(cs); + preempt_disable(); for_each_online_cpu(cpu) { if (cpu == smp_processor_id()) continue; - delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; - if ((s64)delta < 0) + csnow_begin = cs->read(cs); + smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + delta = (s64)((csnow_mid - csnow_begin) & cs->mask); + if (delta < 0) cpumask_set_cpu(cpu, &cpus_behind); - delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; - if ((s64)delta < 0) + delta = (csnow_end - csnow_mid) & cs->mask; + if (delta < 0) cpumask_set_cpu(cpu, &cpus_ahead); + delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); + cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); + if (firsttime || cs_nsec > cs_nsec_max) + cs_nsec_max = cs_nsec; + if (firsttime || cs_nsec < cs_nsec_min) + cs_nsec_min = cs_nsec; + firsttime = 0; } + preempt_enable(); if (!cpumask_empty(&cpus_ahead)) pr_warn("CPUs %*pbl ahead of CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_ahead), @@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) pr_warn("CPUs %*pbl behind CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_behind), smp_processor_id(), cs->name); - if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) { - delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); - cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); - 
pr_warn("CPU %d duration %lldns for clocksource %s.\n", - smp_processor_id(), cs_nsec, cs->name); - } + if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind))) + pr_warn("CPU %d check durations %lldns - %lldns for clocksource %s.\n", + smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name); smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire. } -- 2.9.5
[PATCH v2 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable
From: "Paul E. McKenney" Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedy been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know? This commit therefore adds CPU-to-CPU synchronization checking for newly unstable clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Reported-by: Chris Mason [ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- arch/x86/kernel/kvmclock.c | 2 +- arch/x86/kernel/tsc.c | 3 +- include/linux/clocksource.h | 2 +- kernel/time/clocksource.c | 73 + 4 files changed, 77 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index aa59374..337bb2c 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -169,7 +169,7 @@ struct clocksource kvm_clock = { .read = kvm_clock_get_cycles, .rating = 400, .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU, .enable = kvm_cs_enable, }; EXPORT_SYMBOL_GPL(kvm_clock); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index f70dffc..5628917 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = { .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VALID_FOR_HRES | - CLOCK_SOURCE_MUST_VERIFY, + CLOCK_SOURCE_MUST_VERIFY | + CLOCK_SOURCE_VERIFY_PERCPU, .vdso_clock_mode= VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 86d143d..83a3ebf 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -131,7 +131,7 @@ struct clocksource { #define CLOCK_SOURCE_UNSTABLE 0x40 #define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80 #define CLOCK_SOURCE_RESELECT 0x100 - +#define CLOCK_SOURCE_VERIFY_PERCPU 0x200 /* simplify initialization of mask field */ #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 4663b86..23bcefe 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void) WARN_ON_ONCE(injectfail < 0); } +static struct clocksource *clocksource_verify_work_cs; +static DEFINE_PER_CPU(u64, csnow_mid); +static cpumask_t cpus_ahead; +static cpumask_t cpus_behind; + +static void clocksource_verify_one_cpu(void *csin) +{ + struct clocksource *cs = (struct clocksource *)csin; + + __this_cpu_write(csnow_mid, cs->read(cs)); +} + +static void clocksource_verify_percpu_wq(struct work_struct *unused) +{ + int cpu; + struct clocksource *cs; + int64_t cs_nsec; + u64 csnow_begin; + u64 csnow_end; + u64 delta; + + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release + if (WARN_ON_ONCE(!cs)) + return; + pr_warn("Checking clocksource 
%s synchronization from CPU %d.\n", + cs->name, smp_processor_id()); + cpumask_clear(&cpus_ahead); + cpumask_clear(&cpus_behind); + csnow_begin = cs->read(cs); + smp_call_function(clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + for_each_online_cpu(cpu) { + if (cpu == smp_processor_id()) + continue; + delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_behind); + delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_ahead); + } + if (!cpuma
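For illustration only (not part of the patch series): the ahead/behind classification above relies on a wraparound-safe comparison of two counter reads. A minimal userspace sketch of that arithmetic, assuming a full 64-bit clocksource mask as used by the TSC and kvm-clock entries in this patch; all values are made up.

#include <stdint.h>
#include <stdio.h>

/*
 * Wraparound-safe "who is ahead?" check.  CS_MASK stands in for cs->mask;
 * with all 64 bits in the mask, the masked difference is an ordinary
 * two's-complement subtraction and its sign says whether the first read
 * is ahead of or behind the second, even across a counter wrap.
 */
#define CS_MASK UINT64_MAX

static int64_t cs_delta(uint64_t later, uint64_t earlier)
{
	return (int64_t)((later - earlier) & CS_MASK);
}

int main(void)
{
	uint64_t csnow_begin = UINT64_MAX - 5;	/* local read just before the wrap */
	uint64_t csnow_mid = 10;		/* remote read just after the wrap */

	if (cs_delta(csnow_mid, csnow_begin) < 0)
		printf("remote CPU behind\n");
	else
		printf("remote CPU ahead by %lld counts\n",
		       (long long)cs_delta(csnow_mid, csnow_begin));
	return 0;
}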
[PATCH v2 clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking
From: "Paul E. McKenney" Code that checks for clock desynchronization must itself be tested, so this commit creates a new clocksource.inject_delay_shift_percpu= kernel boot parameter that adds or subtracts a large value from the check read, using the specified bit of the CPU ID to determine whether to add or to subtract. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 9 + kernel/time/clocksource.c | 10 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 4c59813..ca64b0c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -593,6 +593,15 @@ times the value specified for inject_delay_freq of consecutive non-delays. + clocksource.inject_delay_shift_percpu= [KNL] + Shift count to obtain bit from CPU number to + determine whether to shift the time of the per-CPU + clock under test ahead or behind. For example, + setting this to the value four will result in + alternating groups of 16 CPUs shifting ahead and + the rest of the CPUs shifting behind. The default + value of -1 disable this type of error injection. + clocksource.max_read_retries= [KNL] Number of clocksource_watchdog() retries due to external delays before the clock will be marked diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 23bcefe..67cf41c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -190,6 +190,8 @@ static int inject_delay_freq; module_param(inject_delay_freq, int, 0644); static int inject_delay_run = 1; module_param(inject_delay_run, int, 0644); +static int inject_delay_shift_percpu = -1; +module_param(inject_delay_shift_percpu, int, 0644); static int max_read_retries = 3; module_param(max_read_retries, int, 0644); @@ -219,8 +221,14 @@ static cpumask_t cpus_behind; static void clocksource_verify_one_cpu(void *csin) { struct clocksource *cs = (struct clocksource *)csin; + s64 delta = 0; + int sign; - __this_cpu_write(csnow_mid, cs->read(cs)); + if (inject_delay_shift_percpu >= 0) { + sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; + delta = sign * NSEC_PER_SEC; + } + __this_cpu_write(csnow_mid, cs->read(cs) + delta); } static void clocksource_verify_percpu_wq(struct work_struct *unused) -- 2.9.5
[PATCH v2 clocksource 2/5] clocksource: Retry clock read if long delays detected
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. This commit therefore re-reads the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If retries were required, a message is printed on the console. If the number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Reported-by: Chris Mason [ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ] [ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 545889c..4663b86 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating); */ #define WATCHDOG_INTERVAL (HZ >> 1) #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4) +#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6) static void clocksource_watchdog_work(struct work_struct *work) { @@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void) static void clocksource_watchdog(struct timer_list *unused) { struct clocksource *cs; - u64 csnow, wdnow, cslast, wdlast, delta; - int64_t wd_nsec, cs_nsec; + u64 csnow, wdnow, wdagain, cslast, wdlast, delta; + int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec; int next_cpu, reset_pending; + int nretries; spin_lock(&watchdog_lock); if (!watchdog_running) @@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused) reset_pending = atomic_read(&watchdog_reset_pending); list_for_each_entry(cs, &watchdog_list, wd_list) { + nretries = 0; /* Clocksource already marked unstable? 
*/ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { @@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused) continue; } +retry: local_irq_disable(); - csnow = cs->read(cs); - clocksource_watchdog_inject_delay(); wdnow = watchdog->read(watchdog); + clocksource_watchdog_inject_delay(); + csnow = cs->read(cs); + wdagain = watchdog->read(watchdog); local_irq_enable(); + delta = clocksource_delta(wdagain, wdnow, watchdog->mask); + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift); + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) { + wderr_nsec = wdagain_nsec; + if (nretries++ < max_read_retries) + goto retry; + } + if (nretries) + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n", + smp_processor_id(), watchdog->name, wderr_nsec, nretries); /* Clocksource initialized ? */ if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || -- 2.9.5
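For illustration only (not part of the patch): a userspace model of the bracketed read, that is, two "watchdog" reads around the read of the clock under test, with a retry when the bracket itself took suspiciously long. Here fake_wd_read() is a hypothetical stand-in that occasionally adds a large delay, playing the role of an SMI or vCPU preemption; the constants mirror WATCHDOG_MAX_SKEW and the default max_read_retries.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WATCHDOG_MAX_SKEW_NS	(1000000000LL >> 6)	/* mirrors NSEC_PER_SEC >> 6 */
#define MAX_READ_RETRIES	3

/* Hypothetical nanosecond counter with occasional large jumps. */
static int64_t fake_wd_read(void)
{
	static int64_t now;

	now += 100 + (rand() % 8 == 0 ? 200 * 1000 * 1000 : 0);
	return now;
}

int main(void)
{
	int64_t wdnow, wdagain, wd_delay;
	int nretries = 0;

	do {
		wdnow = fake_wd_read();
		/* ... the clock under test would be read here ... */
		wdagain = fake_wd_read();
		wd_delay = wdagain - wdnow;
	} while ((wd_delay < 0 || wd_delay > WATCHDOG_MAX_SKEW_NS) &&
		 nretries++ < MAX_READ_RETRIES);

	if (nretries)
		printf("read-back delay %lld ns, retried %d time(s)\n",
		       (long long)wd_delay, nretries);
	else
		printf("clean read, delay %lld ns\n", (long long)wd_delay);
	return 0;
}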
[PATCH v2 clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. The first step is a way of injecting such delays, and this commit therefore provides a clocksource.inject_delay_freq and clocksource.inject_delay_run kernel boot parameters that specify that sufficient delay be injected to cause the clocksource_watchdog() function to mark a clock unstable. This delay is injected every Nth set of M calls to clocksource_watchdog(), where N is the value specified for the inject_delay_freq boot parameter and M is the value specified for the inject_delay_run boot parameter. Values of zero or less for either parameter disable delay injection, and the default for clocksource.inject_delay_freq is zero, that is, disabled. The default for clocksource.inject_delay_run is the value one, that is single-call runs. This facility is intended for diagnostic use only, and should be avoided on production systems. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier [ paulmck: Apply Rik van Riel feedback. ] Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 22 kernel/time/clocksource.c | 27 + 2 files changed, 49 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9e3cdb2..4c59813 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -577,6 +577,28 @@ loops can be debugged more effectively on production systems. + clocksource.inject_delay_freq= [KNL] + Number of runs of calls to clocksource_watchdog() + before delays are injected between reads from the + two clocksources. Values less than or equal to + zero disable this delay injection. These delays + can cause clocks to be marked unstable, so use + of this parameter should therefore be avoided on + production systems. Defaults to zero (disabled). + + clocksource.inject_delay_run= [KNL] + Run lengths of clocksource_watchdog() delay + injections. Specifying the value 8 will result + in eight consecutive delays followed by eight + times the value specified for inject_delay_freq + of consecutive non-delays. + + clocksource.max_read_retries= [KNL] + Number of clocksource_watchdog() retries due to + external delays before the clock will be marked + unstable. Defaults to three retries, that is, + four attempts to read the clock under test. + clearcpuid=BITNUM[,BITNUM...] [X86] Disable CPUID feature X for the kernel. 
See arch/x86/include/asm/cpufeatures.h for the valid bit diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cce484a..545889c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -14,6 +14,7 @@ #include /* for spin_unlock_irq() using preempt_count() m68k */ #include #include +#include #include "tick-internal.h" #include "timekeeping_internal.h" @@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs) spin_unlock_irqrestore(&watchdog_lock, flags); } +static int inject_delay_freq; +module_param(inject_delay_freq, int, 0644); +static int inject_delay_run = 1; +module_param(inject_delay_run, int, 0644); +static int max_read_retries = 3; +module_param(max_read_retries, int, 0644); + +static void clocksource_watchdog_inject_delay(void) +{ + int i; + static int injectfail = -1; + + if (inject_delay_freq <= 0 || inject_delay_run <= 0) + return; + if (injectfail < 0 || injectfail > INT_MAX / 2) + injectfail = inject_delay_run; + if (!(++injectfail / inject_delay_run % inject_delay_freq)) { + printk("%s(): Injecting delay.\n", __func__); + for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++) + udelay(1000); + printk("%s(): Done injecting delay.\n", __func__); + } + WARN_O
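For illustration only (not part of the patch): the "every Nth set of M calls" cadence is easiest to see with a small userspace model of the counter arithmetic in clocksource_watchdog_inject_delay(); the parameter values below are arbitrary examples.

#include <limits.h>
#include <stdio.h>

int main(void)
{
	int inject_delay_freq = 2;	/* N: delay every 2nd group of calls */
	int inject_delay_run = 3;	/* M: each group is 3 calls long */
	int injectfail = -1;
	int call;

	for (call = 1; call <= 12; call++) {
		if (injectfail < 0 || injectfail > INT_MAX / 2)
			injectfail = inject_delay_run;
		if (!(++injectfail / inject_delay_run % inject_delay_freq))
			printf("call %2d: delay injected\n", call);
		else
			printf("call %2d: no delay\n", call);
	}
	return 0;
}

The output shows runs of inject_delay_run delayed calls separated by (inject_delay_freq - 1) * inject_delay_run undelayed calls, which is exactly the cadence described in the commit log above.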
[PATCH v3 sl-b 1/6] mm: Add mem_dump_obj() to print source of memory block
From: "Paul E. McKenney" There are kernel facilities such as per-CPU reference counts that give error messages in generic handlers or callbacks, whose messages are unenlightening. In the case of per-CPU reference-count underflow, this is not a problem when creating a new use of this facility because in that case the bug is almost certainly in the code implementing that new use. However, trouble arises when deploying across many systems, which might exercise corner cases that were not seen during development and testing. Here, it would be really nice to get some kind of hint as to which of several uses the underflow was caused by. This commit therefore exposes a mem_dump_obj() function that takes a pointer to memory (which must still be allocated if it has been dynamically allocated) and prints available information on where that memory came from. This pointer can reference the middle of the block as well as the beginning of the block, as needed by things like RCU callback functions and timer handlers that might not know where the beginning of the memory block is. These functions and handlers can use mem_dump_obj() to print out better hints as to where the problem might lie. The information printed can depend on kernel configuration. For example, the allocation return address can be printed only for slab and slub, and even then only when the necessary debug has been enabled. For slab, build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space to the next power of two or use the SLAB_STORE_USER when creating the kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create() if more focused use is desired. Also for slub, use CONFIG_STACKTRACE to enable printing of the allocation-time stack trace. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko [ paulmck: Convert to printing and change names per Joonsoo Kim. ] [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ] [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ] [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ] Signed-off-by: Paul E. 
McKenney --- include/linux/mm.h | 2 ++ include/linux/slab.h | 2 ++ mm/slab.c| 20 ++ mm/slab.h| 12 + mm/slab_common.c | 74 mm/slob.c| 6 + mm/slub.c| 36 + mm/util.c| 24 + 8 files changed, 176 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index ef360fe..1eea266 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3153,5 +3153,7 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping, extern int sysctl_nr_trim_pages; +void mem_dump_obj(void *object); + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/include/linux/slab.h b/include/linux/slab.h index dd6897f..169b511 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -186,6 +186,8 @@ void kfree(const void *); void kfree_sensitive(const void *); size_t __ksize(const void *); size_t ksize(const void *); +bool kmem_valid_obj(void *object); +void kmem_dump_obj(void *object); #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR void __check_heap_object(const void *ptr, unsigned long n, struct page *page, diff --git a/mm/slab.c b/mm/slab.c index b111356..66f00ad 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3633,6 +3633,26 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t flags, EXPORT_SYMBOL(__kmalloc_node_track_caller); #endif /* CONFIG_NUMA */ +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page) +{ + struct kmem_cache *cachep; + unsigned int objnr; + void *objp; + + kpp->kp_ptr = object; + kpp->kp_page = page; + cachep = page->slab_cache; + kpp->kp_slab_cache = cachep; + objp = object - obj_offset(cachep); + kpp->kp_data_offset = obj_offset(cachep); + page = virt_to_head_page(objp); + objnr = obj_to_index(cachep, page, objp); + objp = index_to_obj(cachep, page, objnr); + kpp->kp_objp = objp; + if (DEBUG && cachep->flags & SLAB_STORE_USER) + kpp->kp_ret = *dbg_userword(cachep, objp); +} + /** * __do_kmalloc - allocate memory * @size: how many bytes of memory are required. diff --git a/mm/slab.h b/mm/slab.h index 6d7c6a5..0dc705b 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -630,4 +630,16 @@ static inline bool slab_want_init_on_free(struct kmem_cache *c) return false; } +#define KS_ADDRS_COUNT 16 +struct kmem_obj_info { + void *kp_ptr; + struct page *kp_page; + void *kp_objp; + unsigned long kp_data_offset; + struct kmem_cache *kp_slab_cache; + void
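As an illustration of the intended use (hypothetical caller, not part of the patch): an RCU callback or timer handler that only receives an interior pointer can hand that pointer straight to mem_dump_obj() when something looks wrong. The structure, state value, and callback below are invented for the example.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int state;
	struct rcu_head rh;	/* interior pointer handed to the callback */
};

static void foo_rcu_cb(struct rcu_head *rhp)
{
	struct foo *fp = container_of(rhp, struct foo, rh);

	if (WARN_ON_ONCE(fp->state != 42)) {
		pr_err("unexpected foo state %d", fp->state);
		mem_dump_obj(fp);	/* hint at where this block was allocated */
	}
	kfree(fp);
}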
[PATCH v3 sl-b 5/6] rcu: Make call_rcu() print mem_dump_obj() info for double-freed callback
From: "Paul E. McKenney" The debug-object double-free checks in __call_rcu() print out the RCU callback function, which is usually sufficient to track down the double free. However, all uses of things like queue_rcu_work() will have the same RCU callback function (rcu_work_rcufn() in this case), so a diagnostic message for a double queue_rcu_work() needs more than just the callback function. This commit therefore calls mem_dump_obj() to dump out any additional available information on the double-freed callback. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b408dca..80ceee5 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2959,6 +2959,7 @@ static void check_cb_ovld(struct rcu_data *rdp) static void __call_rcu(struct rcu_head *head, rcu_callback_t func) { + static atomic_t doublefrees; unsigned long flags; struct rcu_data *rdp; bool was_alldone; @@ -2972,8 +2973,10 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) * Use rcu:rcu_callback trace event to find the previous * time callback was passed to __call_rcu(). */ - WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pS()!!!\n", - head, head->func); + if (atomic_inc_return(&doublefrees) < 4) { + pr_err("%s(): Double-freed CB %p->%pS()!!! ", __func__, head, head->func); + mem_dump_obj(head); + } WRITE_ONCE(head->func, rcu_leak_callback); return; } -- 2.9.5
[PATCH v3 sl-b 2/6] mm: Make mem_dump_obj() handle NULL and zero-sized pointers
From: "Paul E. McKenney" This commit makes mem_dump_obj() call out NULL and zero-sized pointers specially instead of classifying them as non-paged memory. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko Acked-by: Vlastimil Babka Signed-off-by: Paul E. McKenney --- mm/util.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/util.c b/mm/util.c index f2e0c4d9..f7c94c8 100644 --- a/mm/util.c +++ b/mm/util.c @@ -985,7 +985,12 @@ int __weak memcmp_pages(struct page *page1, struct page *page2) void mem_dump_obj(void *object) { if (!virt_addr_valid(object)) { - pr_cont(" non-paged (local) memory.\n"); + if (object == NULL) + pr_cont(" NULL pointer.\n"); + else if (object == ZERO_SIZE_PTR) + pr_cont(" zero-size pointer.\n"); + else + pr_cont(" non-paged (local) memory.\n"); return; } if (kmem_valid_obj(object)) { -- 2.9.5
[PATCH v3 sl-b 3/6] mm: Make mem_dump_obj() handle vmalloc() memory
From: "Paul E. McKenney" This commit adds vmalloc() support to mem_dump_obj(). Note that the vmalloc_dump_obj() function combines the checking and dumping, in contrast with the split between kmem_valid_obj() and kmem_dump_obj(). The reason for the difference is that the checking in the vmalloc() case involves acquiring a global lock, and redundant acquisitions of global locks should be avoided, even on not-so-fast paths. Note that this change causes on-stack variables to be reported as vmalloc() storage from kernel_clone() or similar, depending on the degree of inlining that your compiler does. This is likely more helpful than the earlier "non-paged (local) memory". Cc: Andrew Morton Cc: Joonsoo Kim Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- include/linux/vmalloc.h | 6 ++ mm/util.c | 14 -- mm/vmalloc.c| 12 3 files changed, 26 insertions(+), 6 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 938eaf9..c89c2be 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -248,4 +248,10 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) int register_vmap_purge_notifier(struct notifier_block *nb); int unregister_vmap_purge_notifier(struct notifier_block *nb); +#ifdef CONFIG_MMU +bool vmalloc_dump_obj(void *object); +#else +static inline bool vmalloc_dump_obj(void *object) { return false; } +#endif + #endif /* _LINUX_VMALLOC_H */ diff --git a/mm/util.c b/mm/util.c index f7c94c8..dcde696 100644 --- a/mm/util.c +++ b/mm/util.c @@ -984,18 +984,20 @@ int __weak memcmp_pages(struct page *page1, struct page *page2) */ void mem_dump_obj(void *object) { + if (kmem_valid_obj(object)) { + kmem_dump_obj(object); + return; + } + if (vmalloc_dump_obj(object)) + return; if (!virt_addr_valid(object)) { if (object == NULL) pr_cont(" NULL pointer.\n"); else if (object == ZERO_SIZE_PTR) pr_cont(" zero-size pointer.\n"); else - pr_cont(" non-paged (local) memory.\n"); - return; - } - if (kmem_valid_obj(object)) { - kmem_dump_obj(object); + pr_cont(" non-paged memory.\n"); return; } - pr_cont(" non-slab memory.\n"); + pr_cont(" non-slab/vmalloc memory.\n"); } diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6ae491a..7421719 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3431,6 +3431,18 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) } #endif /* CONFIG_SMP */ +bool vmalloc_dump_obj(void *object) +{ + struct vm_struct *vm; + void *objp = (void *)PAGE_ALIGN((unsigned long)object); + + vm = find_vm_area(objp); + if (!vm) + return false; + pr_cont(" vmalloc allocated at %pS\n", vm->caller); + return true; +} + #ifdef CONFIG_PROC_FS static void *s_start(struct seq_file *m, loff_t *pos) __acquires(&vmap_purge_lock) -- 2.9.5
[PATCH v3 sl-b 4/6] mm: Make mem_dump_obj() vmalloc() dumps include start and length
From: "Paul E. McKenney" This commit adds the starting address and number of pages to the vmalloc() information dumped by way of vmalloc_dump_obj(). Cc: Andrew Morton Cc: Joonsoo Kim Cc: Reported-by: Andrii Nakryiko Suggested-by: Vlastimil Babka Signed-off-by: Paul E. McKenney --- mm/vmalloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 7421719..77b1100 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3439,7 +3439,8 @@ bool vmalloc_dump_obj(void *object) vm = find_vm_area(objp); if (!vm) return false; - pr_cont(" vmalloc allocated at %pS\n", vm->caller); + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", + vm->nr_pages, (unsigned long)vm->addr, vm->caller); return true; } -- 2.9.5
[PATCH v3 sl-b 6/6] percpu_ref: Dump mem_dump_obj() info upon reference-count underflow
From: "Paul E. McKenney" Reference-count underflow for percpu_ref is detected in the RCU callback percpu_ref_switch_to_atomic_rcu(), and the resulting warning does not print anything allowing easy identification of which percpu_ref use case is underflowing. This is of course not normally a problem when developing a new percpu_ref use case because it is most likely that the problem resides in this new use case. However, when deploying a new kernel to a large set of servers, the underflow might well be a new corner case in any of the old percpu_ref use cases. This commit therefore calls mem_dump_obj() to dump out any additional available information on the underflowing percpu_ref instance. Cc: Ming Lei Cc: Jens Axboe Cc: Joonsoo Kim Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- lib/percpu-refcount.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c index e59eda0..a1071cd 100644 --- a/lib/percpu-refcount.c +++ b/lib/percpu-refcount.c @@ -5,6 +5,7 @@ #include #include #include +#include #include /* @@ -168,6 +169,7 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) struct percpu_ref_data, rcu); struct percpu_ref *ref = data->ref; unsigned long __percpu *percpu_count = percpu_count_ptr(ref); + static atomic_t underflows; unsigned long count = 0; int cpu; @@ -191,9 +193,13 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) */ atomic_long_add((long)count - PERCPU_COUNT_BIAS, &data->count); - WARN_ONCE(atomic_long_read(&data->count) <= 0, - "percpu ref (%ps) <= 0 (%ld) after switching to atomic", - data->release, atomic_long_read(&data->count)); + if (WARN_ONCE(atomic_long_read(&data->count) <= 0, + "percpu ref (%ps) <= 0 (%ld) after switching to atomic", + data->release, atomic_long_read(&data->count)) && + atomic_inc_return(&underflows) < 4) { + pr_err("%s(): percpu_ref underflow", __func__); + mem_dump_obj(data); + } /* @ref is viewed as dead on all CPUs, send out switch confirmation */ percpu_ref_call_confirm_rcu(rcu); -- 2.9.5
[PATCH v2 sl-b 2/5] mm: Make mem_dump_obj() handle NULL and zero-sized pointers
From: "Paul E. McKenney" This commit makes mem_dump_obj() call out NULL and zero-sized pointers specially instead of classifying them as non-paged memory. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- mm/util.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/util.c b/mm/util.c index d0e60d2..8c2449f 100644 --- a/mm/util.c +++ b/mm/util.c @@ -985,7 +985,12 @@ int __weak memcmp_pages(struct page *page1, struct page *page2) void mem_dump_obj(void *object) { if (!virt_addr_valid(object)) { - pr_cont(" non-paged (local) memory.\n"); + if (object == NULL) + pr_cont(" NULL pointer.\n"); + else if (object == ZERO_SIZE_PTR) + pr_cont(" zero-size pointer.\n"); + else + pr_cont(" non-paged (local) memory.\n"); return; } if (kmem_valid_obj(object)) { -- 2.9.5
[PATCH v2 sl-b 4/5] rcu: Make call_rcu() print mem_dump_obj() info for double-freed callback
From: "Paul E. McKenney" The debug-object double-free checks in __call_rcu() print out the RCU callback function, which is usually sufficient to track down the double free. However, all uses of things like queue_rcu_work() will have the same RCU callback function (rcu_work_rcufn() in this case), so a diagnostic message for a double queue_rcu_work() needs more than just the callback function. This commit therefore calls mem_dump_obj() to dump out any additional available information on the double-freed callback. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b6c9c49..464cf14 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2957,6 +2957,7 @@ static void check_cb_ovld(struct rcu_data *rdp) static void __call_rcu(struct rcu_head *head, rcu_callback_t func) { + static atomic_t doublefrees; unsigned long flags; struct rcu_data *rdp; bool was_alldone; @@ -2970,8 +2971,10 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) * Use rcu:rcu_callback trace event to find the previous * time callback was passed to __call_rcu(). */ - WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pS()!!!\n", - head, head->func); + if (atomic_inc_return(&doublefrees) < 4) { + pr_err("%s(): Double-freed CB %p->%pS()!!! ", __func__, head, head->func); + mem_dump_obj(head); + } WRITE_ONCE(head->func, rcu_leak_callback); return; } -- 2.9.5
[PATCH v2 sl-b 5/5] percpu_ref: Dump mem_dump_obj() info upon reference-count underflow
From: "Paul E. McKenney" Reference-count underflow for percpu_ref is detected in the RCU callback percpu_ref_switch_to_atomic_rcu(), and the resulting warning does not print anything allowing easy identification of which percpu_ref use case is underflowing. This is of course not normally a problem when developing a new percpu_ref use case because it is most likely that the problem resides in this new use case. However, when deploying a new kernel to a large set of servers, the underflow might well be a new corner case in any of the old percpu_ref use cases. This commit therefore calls mem_dump_obj() to dump out any additional available information on the underflowing percpu_ref instance. Cc: Ming Lei Cc: Jens Axboe Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- lib/percpu-refcount.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c index e59eda0..a1071cd 100644 --- a/lib/percpu-refcount.c +++ b/lib/percpu-refcount.c @@ -5,6 +5,7 @@ #include #include #include +#include #include /* @@ -168,6 +169,7 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) struct percpu_ref_data, rcu); struct percpu_ref *ref = data->ref; unsigned long __percpu *percpu_count = percpu_count_ptr(ref); + static atomic_t underflows; unsigned long count = 0; int cpu; @@ -191,9 +193,13 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) */ atomic_long_add((long)count - PERCPU_COUNT_BIAS, &data->count); - WARN_ONCE(atomic_long_read(&data->count) <= 0, - "percpu ref (%ps) <= 0 (%ld) after switching to atomic", - data->release, atomic_long_read(&data->count)); + if (WARN_ONCE(atomic_long_read(&data->count) <= 0, + "percpu ref (%ps) <= 0 (%ld) after switching to atomic", + data->release, atomic_long_read(&data->count)) && + atomic_inc_return(&underflows) < 4) { + pr_err("%s(): percpu_ref underflow", __func__); + mem_dump_obj(data); + } /* @ref is viewed as dead on all CPUs, send out switch confirmation */ percpu_ref_call_confirm_rcu(rcu); -- 2.9.5
[PATCH v2 sl-b 3/5] mm: Make mem_dump_obj() handle vmalloc() memory
From: "Paul E. McKenney" This commit adds vmalloc() support to mem_dump_obj(). Note that the vmalloc_dump_obj() function combines the checking and dumping, in contrast with the split between kmem_valid_obj() and kmem_dump_obj(). The reason for the difference is that the checking in the vmalloc() case involves acquiring a global lock, and redundant acquisitions of global locks should be avoided, even on not-so-fast paths. Note that this change causes on-stack variables to be reported as vmalloc() storage from kernel_clone() or similar, depending on the degree of inlining that your compiler does. This is likely more helpful than the earlier "non-paged (local) memory". Cc: Andrew Morton Cc: Joonsoo Kim Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- include/linux/vmalloc.h | 6 ++ mm/util.c | 12 +++- mm/vmalloc.c| 12 3 files changed, 25 insertions(+), 5 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 938eaf9..c89c2be 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -248,4 +248,10 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) int register_vmap_purge_notifier(struct notifier_block *nb); int unregister_vmap_purge_notifier(struct notifier_block *nb); +#ifdef CONFIG_MMU +bool vmalloc_dump_obj(void *object); +#else +static inline bool vmalloc_dump_obj(void *object) { return false; } +#endif + #endif /* _LINUX_VMALLOC_H */ diff --git a/mm/util.c b/mm/util.c index 8c2449f..ee99a0a 100644 --- a/mm/util.c +++ b/mm/util.c @@ -984,6 +984,12 @@ int __weak memcmp_pages(struct page *page1, struct page *page2) */ void mem_dump_obj(void *object) { + if (kmem_valid_obj(object)) { + kmem_dump_obj(object); + return; + } + if (vmalloc_dump_obj(object)) + return; if (!virt_addr_valid(object)) { if (object == NULL) pr_cont(" NULL pointer.\n"); @@ -993,10 +999,6 @@ void mem_dump_obj(void *object) pr_cont(" non-paged (local) memory.\n"); return; } - if (kmem_valid_obj(object)) { - kmem_dump_obj(object); - return; - } - pr_cont(" non-slab memory.\n"); + pr_cont(" non-slab/vmalloc memory.\n"); } EXPORT_SYMBOL_GPL(mem_dump_obj); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6ae491a..7421719 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3431,6 +3431,18 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) } #endif /* CONFIG_SMP */ +bool vmalloc_dump_obj(void *object) +{ + struct vm_struct *vm; + void *objp = (void *)PAGE_ALIGN((unsigned long)object); + + vm = find_vm_area(objp); + if (!vm) + return false; + pr_cont(" vmalloc allocated at %pS\n", vm->caller); + return true; +} + #ifdef CONFIG_PROC_FS static void *s_start(struct seq_file *m, loff_t *pos) __acquires(&vmap_purge_lock) -- 2.9.5
[PATCH v2 sl-b 1/5] mm: Add mem_dump_obj() to print source of memory block
From: "Paul E. McKenney" There are kernel facilities such as per-CPU reference counts that give error messages in generic handlers or callbacks, whose messages are unenlightening. In the case of per-CPU reference-count underflow, this is not a problem when creating a new use of this facility because in that case the bug is almost certainly in the code implementing that new use. However, trouble arises when deploying across many systems, which might exercise corner cases that were not seen during development and testing. Here, it would be really nice to get some kind of hint as to which of several uses the underflow was caused by. This commit therefore exposes a mem_dump_obj() function that takes a pointer to memory (which must still be allocated if it has been dynamically allocated) and prints available information on where that memory came from. This pointer can reference the middle of the block as well as the beginning of the block, as needed by things like RCU callback functions and timer handlers that might not know where the beginning of the memory block is. These functions and handlers can use mem_dump_obj() to print out better hints as to where the problem might lie. The information printed can depend on kernel configuration. For example, the allocation return address can be printed only for slab and slub, and even then only when the necessary debug has been enabled. For slab, build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space to the next power of two or use the SLAB_STORE_USER when creating the kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create() if more focused use is desired. Also for slub, use CONFIG_STACKTRACE to enable printing of the allocation-time stack trace. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko [ paulmck: Convert to printing and change names per Joonsoo Kim. ] [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ] Signed-off-by: Paul E. 
McKenney --- include/linux/mm.h | 2 ++ include/linux/slab.h | 2 ++ mm/slab.c| 28 + mm/slab.h| 11 + mm/slab_common.c | 69 mm/slob.c| 7 ++ mm/slub.c| 40 ++ mm/util.c| 25 +++ 8 files changed, 184 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index ef360fe..1eea266 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3153,5 +3153,7 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping, extern int sysctl_nr_trim_pages; +void mem_dump_obj(void *object); + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/include/linux/slab.h b/include/linux/slab.h index dd6897f..169b511 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -186,6 +186,8 @@ void kfree(const void *); void kfree_sensitive(const void *); size_t __ksize(const void *); size_t ksize(const void *); +bool kmem_valid_obj(void *object); +void kmem_dump_obj(void *object); #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR void __check_heap_object(const void *ptr, unsigned long n, struct page *page, diff --git a/mm/slab.c b/mm/slab.c index b111356..72b6743 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3602,6 +3602,34 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep, EXPORT_SYMBOL(kmem_cache_alloc_node_trace); #endif +void kmem_provenance(struct kmem_provenance *kpp) +{ +#ifdef DEBUG + struct kmem_cache *cachep; + void *object = kpp->kp_ptr; + unsigned int objnr; + void *objp; + struct page *page = kpp->kp_page; + + cachep = page->slab_cache; + if (!(cachep->flags & SLAB_STORE_USER)) { + kpp->kp_ret = NULL; + goto nodebug; + } + objp = object - obj_offset(cachep); + page = virt_to_head_page(objp); + objnr = obj_to_index(cachep, page, objp); + objp = index_to_obj(cachep, page, objnr); + kpp->kp_objp = objp; + kpp->kp_ret = *dbg_userword(cachep, objp); +nodebug: +#else + kpp->kp_ret = NULL; +#endif + if (kpp->kp_nstack) + kpp->kp_stack[0] = NULL; +} + static __always_inline void * __do_kmalloc_node(size_t size, gfp_t flags, int node, unsigned long caller) { diff --git a/mm/slab.h b/mm/slab.h index 6d7c6a5..28a41d5 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -630,4 +630,15 @@ static inline bool slab_want_init_on_free(struct kmem_cache *c) return false; } +#define KS_ADDRS_COUNT 16 +struct kmem_provenance { + void *kp_ptr; + struct page *kp_page; + void *kp_objp; + void *kp_ret; + void *kp_stack[KS_ADDRS_COUNT]; + int kp_nstack; +}; +void kmem_proven
[PATCH tip/core/rcu 3/4] rcu: Run rcuo kthreads at elevated priority in CONFIG_RCU_BOOST kernels
From: "Paul E. McKenney" The priority level of the rcuo kthreads is the system administrator's responsibility, but kernels that priority-boost RCU readers probably need the rcuo kthreads running at the rcutree.kthread_prio level. This commit therefore sets these kthreads to that priority level at creation time, providing a sensible default. The system administrator is free to adjust as needed at any time. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index fca31c6..7e33dae0 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2197,6 +2197,7 @@ static int rcu_nocb_gp_kthread(void *arg) { struct rcu_data *rdp = arg; + rcu_cpu_kthread_setup(-1); for (;;) { WRITE_ONCE(rdp->nocb_gp_loops, rdp->nocb_gp_loops + 1); nocb_gp_wait(rdp); @@ -2298,6 +2299,7 @@ static int rcu_nocb_cb_kthread(void *arg) // Each pass through this loop does one callback batch, and, // if there are no more ready callbacks, waits for them. + rcu_cpu_kthread_setup(-1); for (;;) { nocb_cb_wait(rdp); cond_resched_tasks_rcu_qs(); -- 2.9.5
[PATCH tip/core/rcu 4/4] rcutorture: Fix testing of RCU priority boosting
From: "Paul E. McKenney" Currently, rcutorture refuses to test RCU priority boosting in CONFIG_HOTPLUG_CPU=y kernels, which are the only kind normally built on x86 these days. This commit therefore updates rcutorture's tests of RCU priority boosting to make them safe for CPU hotplug. However, these tests will fail unless TIMER_SOFTIRQ runs at realtime priority, which does not happen in current mainline. This commit therefore also refuses to test RCU priority boosting except in kernels built with CONFIG_PREEMPT_RT=y. While in the area, this commt adds some debug output at boost-fail time that helps diagnose the cause of the failure, for example, failing to run TIMER_SOFTIRQ at realtime priority. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 36 ++-- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 8e93f2e..2440f89 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -245,11 +245,11 @@ static const char *rcu_torture_writer_state_getname(void) return rcu_torture_writer_state_names[i]; } -#if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) -#define rcu_can_boost() 1 -#else /* #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */ -#define rcu_can_boost() 0 -#endif /* #else #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */ +#if defined(CONFIG_RCU_BOOST) && defined(CONFIG_PREEMPT_RT) +# define rcu_can_boost() 1 +#else +# define rcu_can_boost() 0 +#endif #ifdef CONFIG_RCU_TRACE static u64 notrace rcu_trace_clock_local(void) @@ -923,9 +923,13 @@ static void rcu_torture_enable_rt_throttle(void) static bool rcu_torture_boost_failed(unsigned long start, unsigned long end) { + static int dbg_done; + if (end - start > test_boost_duration * HZ - HZ / 2) { VERBOSE_TOROUT_STRING("rcu_torture_boost boosting failed"); n_rcu_torture_boost_failure++; + if (!xchg(&dbg_done, 1) && cur_ops->gp_kthread_dbg) + cur_ops->gp_kthread_dbg(); return true; /* failed */ } @@ -948,8 +952,8 @@ static int rcu_torture_boost(void *arg) init_rcu_head_on_stack(&rbi.rcu); /* Each pass through the following loop does one boost-test cycle. */ do { - /* Track if the test failed already in this test interval? */ - bool failed = false; + bool failed = false; // Test failed already in this test interval + bool firsttime = true; /* Increment n_rcu_torture_boosts once per boost-test */ while (!kthread_should_stop()) { @@ -975,18 +979,17 @@ static int rcu_torture_boost(void *arg) /* Do one boost-test interval. */ endtime = oldstarttime + test_boost_duration * HZ; - call_rcu_time = jiffies; while (time_before(jiffies, endtime)) { /* If we don't have a callback in flight, post one. */ if (!smp_load_acquire(&rbi.inflight)) { /* RCU core before ->inflight = 1. */ smp_store_release(&rbi.inflight, 1); - call_rcu(&rbi.rcu, rcu_torture_boost_cb); + cur_ops->call(&rbi.rcu, rcu_torture_boost_cb); /* Check if the boost test failed */ - failed = failed || -rcu_torture_boost_failed(call_rcu_time, -jiffies); + if (!firsttime && !failed) + failed = rcu_torture_boost_failed(call_rcu_time, jiffies); call_rcu_time = jiffies; + firsttime = false; } if (stutter_wait("rcu_torture_boost")) sched_set_fifo_low(current); @@ -999,7 +1002,7 @@ static int rcu_torture_boost(void *arg) * this case the boost check would never happen in the above * loop so do another one here. 
*/ - if (!failed && smp_load_acquire(&rbi.inflight)) + if (!firsttime && !failed && smp_load_acquire(&rbi.inflight)) rcu_torture_boost_failed(call_rcu_time, jiffies); /* @@ -1025,6 +1028,9 @@ checkwait:if (stutter_wait("rcu_torture_boost")) sched_set_fifo_low(current); } while (!torture_must_stop()); + while (smp_load_acquire(&rbi.inflight)) + schedule_timeout_uninterruptible(1); // rcu_barrier() deadlocks. + /* Clean up and exit. */ wh
[PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
From: "Paul E. McKenney" Historically, a task that has been subjected to RCU priority boosting is deboosted at rcu_read_unlock() time. However, with the advent of deferred quiescent states, if the outermost rcu_read_unlock() was invoked with either bottom halves, interrupts, or preemption disabled, the deboosting will be delayed for some time. During this time, a low-priority process might be incorrectly running at a high real-time priority level. Fortunately, rcu_read_unlock_special() already provides mechanisms for forcing a minimal deferral of quiescent states, at least for kernels built with CONFIG_IRQ_WORK=y. These mechanisms are currently used when expedited grace periods are pending that might be blocked by the current task. This commit therefore causes those mechanisms to also be used in cases where the current task has been or might soon be subjected to RCU priority boosting. Note that this applies to all kernels built with CONFIG_RCU_BOOST=y, regardless of whether or not they are also built with CONFIG_PREEMPT_RT=y. This approach assumes that kernels build for use with aggressive real-time applications are built with CONFIG_IRQ_WORK=y. It is likely to be far simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting scheme that works correctly in its absence. While in the area, alphabetize the rcu_preempt_deferred_qs_handler() function's local variables. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Lai Jiangshan Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 8b0feb2..fca31c6 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp) static void rcu_read_unlock_special(struct task_struct *t) { unsigned long flags; + bool irqs_were_disabled; bool preempt_bh_were_disabled = !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)); - bool irqs_were_disabled; /* NMI handlers cannot block and cannot safely manipulate state. */ if (in_nmi()) @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t) local_irq_save(flags); irqs_were_disabled = irqs_disabled_flags(flags); if (preempt_bh_were_disabled || irqs_were_disabled) { - bool exp; + bool expboost; // Expedited GP in flight or possible boosting. struct rcu_data *rdp = this_cpu_ptr(&rcu_data); struct rcu_node *rnp = rdp->mynode; - exp = (t->rcu_blocked_node && - READ_ONCE(t->rcu_blocked_node->exp_tasks)) || - (rdp->grpmask & READ_ONCE(rnp->expmask)); + expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) || + (rdp->grpmask & READ_ONCE(rnp->expmask)) || + (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && + t->rcu_blocked_node); // Need to defer quiescent state until everything is enabled. - if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) { + if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) { // Using softirq, safe to awaken, and either the - // wakeup is free or there is an expedited GP. + // wakeup is free or there is either an expedited + // GP in flight or a potential need to deboost. raise_softirq_irqoff(RCU_SOFTIRQ); } else { // Enabling BH or preempt does reschedule, so... - // Also if no expediting, slow is OK. - // Plus nohz_full CPUs eventually get tick enabled. + // Also if no expediting and no possible deboosting, + // slow is OK. 
Plus nohz_full CPUs eventually get + // tick enabled. set_tsk_need_resched(current); set_preempt_need_resched(); if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && - !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) { + expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) { // Get scheduler to re-evaluate and call hooks. // If !IRQ_WORK, FQS scan will eventually IPI. - init_irq_work(&rdp->defer_qs_iw, - rcu_preempt_deferred_qs_handler); + init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
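A minimal sketch of the IRQ_WORK pattern this patch leans on, using only calls that appear in the diff above (init_irq_work(), irq_work_queue(), set_tsk_need_resched(), set_preempt_need_resched()); the demo_* names are invented and this is not the patch's code. The point is that the self-interrupt arrives as soon as interrupts are re-enabled, at which point the scheduler runs its hooks and can deboost the task.

#include <linux/irq_work.h>
#include <linux/sched.h>

static struct irq_work demo_deferred_work;
static bool demo_work_pending;

static void demo_deferred_handler(struct irq_work *iwp)
{
        /* Runs from a self-IPI shortly after interrupts are re-enabled. */
        demo_work_pending = false;
}

static void demo_request_deferred_reschedule(void)
{
        /* Ask for a reschedule at the next opportunity... */
        set_tsk_need_resched(current);
        set_preempt_need_resched();

        /* ...and use IRQ_WORK to make such an opportunity arrive promptly. */
        if (IS_ENABLED(CONFIG_IRQ_WORK) && !demo_work_pending) {
                demo_work_pending = true;
                init_irq_work(&demo_deferred_work, demo_deferred_handler);
                irq_work_queue(&demo_deferred_work);
        }
}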
[PATCH tip/core/rcu 2/4] rcutorture: Make TREE03 use real-time tree.use_softirq setting
From: "Paul E. McKenney" TREE03 tests RCU priority boosting, which is a real-time feature. It would also be good if it tested something closer to what is actually used by the real-time folks. This commit therefore adds tree.use_softirq=0 to the TREE03 kernel boot parameters in TREE03.boot. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot index 1c21894..64f864f1 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot @@ -4,3 +4,4 @@ rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3 rcutree.kthread_prio=2 threadirqs +tree.use_softirq=0 -- 2.9.5
[PATCH tip/core/rcu 04/10] rculist: Replace reference to atomic_ops.rst
From: Akira Yokosawa The hlist_nulls_for_each_entry_rcu() docbook header references the atomic_ops.rst file, which was removed in commit f0400a77ebdc ("atomic: Delete obsolete documentation"). This commit therefore substitutes a section in memory-barriers.txt discussing the use of barrier() in loops. Cc: Peter Zijlstra Signed-off-by: Akira Yokosawa Signed-off-by: Paul E. McKenney --- include/linux/rculist_nulls.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h index ff3e947..d8afdb8 100644 --- a/include/linux/rculist_nulls.h +++ b/include/linux/rculist_nulls.h @@ -161,7 +161,7 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n) * * The barrier() is needed to make sure compiler doesn't cache first element [1], * as this loop can be restarted [2] - * [1] Documentation/core-api/atomic_ops.rst around line 114 + * [1] Documentation/memory-barriers.txt around line 1533 * [2] Documentation/RCU/rculist_nulls.rst around line 146 */ #define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member) \ -- 2.9.5
[PATCH tip/core/rcu 01/10] rcu: Remove superfluous rdp fetch
From: Frederic Weisbecker Cc: Rafael J. Wysocki Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f521..cdf091f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -648,7 +648,6 @@ static noinstr void rcu_eqs_enter(bool user) instrumentation_begin(); trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks)); WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current)); - rdp = this_cpu_ptr(&rcu_data); rcu_prepare_for_idle(); rcu_preempt_deferred_qs(current); -- 2.9.5
[PATCH tip/core/rcu 03/10] rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
From: Zhouyi Zhou In rcu_nmi_enter(), there is an erroneous instrumentation_end() in the second branch of the "if" statement. Oddly enough, "objtool check -f vmlinux.o" fails to complain because it is unable to correctly cover all cases. Instead, objtool visits the third branch first, which marks the following trace_rcu_dyntick() as visited. This commit therefore removes the spurious instrumentation_end(). Fixes: 04b25a495bd6 ("rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr") Reported-by: Neeraj Upadhyay Signed-off-by: Zhouyi Zhou Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e62c2de..4d90f20 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1076,7 +1076,6 @@ noinstr void rcu_nmi_enter(void) } else if (!in_nmi()) { instrumentation_begin(); rcu_irq_enter_check_tick(); - instrumentation_end(); } else { instrumentation_begin(); } -- 2.9.5
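For context, a generic and deliberately simplified illustration of the rule the removed line violated: inside a noinstr function, instrumentable work is bracketed by instrumentation_begin()/instrumentation_end(), and every path must keep the two calls balanced. This is only a sketch of the annotation pattern, not a reconstruction of rcu_nmi_enter().

#include <linux/instrumentation.h>

noinstr void demo_noinstr_entry(bool do_instrumented_work)
{
        /* Non-instrumentable entry work goes here. */

        if (do_instrumented_work) {
                instrumentation_begin();
                /* Tracing, lockdep, and other instrumentable calls. */
                instrumentation_end();
        }

        /* An unbalanced instrumentation_end() on any path is a bug,
         * even if objtool happens not to flag it. */
}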
[PATCH tip/core/rcu 02/10] rcu: Fix CPU-offline trace in rcutree_dying_cpu
From: Neeraj Upadhyay The condition in the trace_rcu_grace_period() in rcutree_dying_cpu() is backwards, so that it uses the string "cpuofl" when the offline CPU is blocking the current grace period and "cpuofl-bgp" otherwise. Given that the "-bgp" stands for "blocking grace period", this is at best misleading. This commit therefore switches these strings in order to correctly trace whether the outgoing cpu blocks the current grace period. Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index cdf091f..e62c2de 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2413,7 +2413,7 @@ int rcutree_dying_cpu(unsigned int cpu) blkd = !!(rnp->qsmask & rdp->grpmask); trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq), - blkd ? TPS("cpuofl") : TPS("cpuofl-bgp")); + blkd ? TPS("cpuofl-bgp") : TPS("cpuofl")); return 0; } -- 2.9.5
[PATCH tip/core/rcu 05/10] rcu: Fix kfree_rcu() docbook errors
From: Mauro Carvalho Chehab After commit 5130b8fd0690 ("rcu: Introduce kfree_rcu() single-argument macro"), kernel-doc now emits two warnings: ./include/linux/rcupdate.h:884: warning: Excess function parameter 'ptr' description in 'kfree_rcu' ./include/linux/rcupdate.h:884: warning: Excess function parameter 'rhf' description in 'kfree_rcu' That commit added some macro magic in order to call two different versions of kfree_rcu(), the first having just one argument and the second having two arguments. That makes it difficult to document the kfree_rcu() arguments in the docbook header. In order to make it clearer that this macro accepts optional arguments, this commit uses macro concatenation so that this macro changes from: #define kfree_rcu kvfree_rcu to: #define kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf) That not only helps kernel-doc understand the macro arguments, but also provides a better C definition that makes clearer that the first argument is mandatory and the second one is optional. Fixes: 5130b8fd0690 ("rcu: Introduce kfree_rcu() single-argument macro") Tested-by: Uladzislau Rezki (Sony) Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index bd04f72..5cc6dea 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -881,7 +881,7 @@ static inline notrace void rcu_read_unlock_sched_notrace(void) * The BUILD_BUG_ON check must not involve any function calls, hence the * checks are done in macros here. */ -#define kfree_rcu kvfree_rcu +#define kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf) /** * kvfree_rcu() - kvfree an object after a grace period. -- 2.9.5
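A standalone, compile-and-run illustration (plain userspace C, not kernel code) of why the ", ## rhf" form is friendlier to tools: one macro accepts either one or two arguments and forwards to an argument-counting dispatcher, which is the same general trick kvfree_rcu() itself uses. All demo_* names are invented.

#include <stdio.h>

/* Single entry point: the second argument is optional, as with kfree_rcu(). */
#define demo_free_rcu(ptr, rhf...) demo_kvfree_rcu(ptr, ## rhf)

#define demo_kvfree_rcu_arg_1(ptr) \
        printf("one arg:  free %s after a grace period\n", #ptr)
#define demo_kvfree_rcu_arg_2(ptr, rhf) \
        printf("two args: free %s via rcu_head field %s\n", #ptr, #rhf)

/* Pick the implementation according to how many arguments were passed. */
#define demo_select(_1, _2, NAME, ...) NAME
#define demo_kvfree_rcu(...) \
        demo_select(__VA_ARGS__, demo_kvfree_rcu_arg_2, \
                    demo_kvfree_rcu_arg_1)(__VA_ARGS__)

int main(void)
{
        int obj;

        demo_free_rcu(&obj);            /* expands to the one-argument form */
        demo_free_rcu(&obj, rh);        /* expands to the two-argument form */
        return 0;
}

Built with gcc or clang (named variadic arguments are a GNU extension), this prints the one-argument expansion first and the two-argument expansion second.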
[PATCH tip/core/rcu 07/10] rcu: Prevent dyntick-idle until ksoftirqd has been spawned
From: "Paul E. McKenney" After interrupts have enabled at boot but before some random point in early_initcall() processing, softirq processing is unreliable. If softirq sees a need to push softirq-handler invocation to ksoftirqd during this time, then those handlers can be delayed until the ksoftirqd kthreads have been spawned, which happens at some random point in the early_initcall() processing. In many cases, this delay is just fine. However, if the boot sequence blocks waiting for a wakeup from a softirq handler, this delay will result in a silent-hang deadlock. This commit therefore prevents these hangs by ensuring that the tick stays active until after the ksoftirqd kthreads have been spawned. This change causes the tick to eventually drain the backlog of delayed softirq handlers, breaking this deadlock. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 11 +++ 1 file changed, 11 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2d60377..36212de 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1255,6 +1255,11 @@ static void rcu_prepare_kthreads(int cpu) */ int rcu_needs_cpu(u64 basemono, u64 *nextevt) { + /* Through early_initcall(), need tick for softirq handlers. */ + if (!IS_ENABLED(CONFIG_HZ_PERIODIC) && !this_cpu_ksoftirqd()) { + *nextevt = 1; + return 1; + } *nextevt = KTIME_MAX; return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) && !rcu_segcblist_is_offloaded(&this_cpu_ptr(&rcu_data)->cblist); @@ -1350,6 +1355,12 @@ int rcu_needs_cpu(u64 basemono, u64 *nextevt) lockdep_assert_irqs_disabled(); + /* Through early_initcall(), need tick for softirq handlers. */ + if (!IS_ENABLED(CONFIG_HZ_PERIODIC) && !this_cpu_ksoftirqd()) { + *nextevt = 1; + return 1; + } + /* If no non-offloaded callbacks, RCU doesn't need the CPU. */ if (rcu_segcblist_empty(&rdp->cblist) || rcu_segcblist_is_offloaded(&this_cpu_ptr(&rcu_data)->cblist)) { -- 2.9.5
[PATCH tip/core/rcu 10/10] rcu/tree: Add a trace event for RCU CPU stall warnings
From: Sangmoon Kim This commit adds a trace event which allows tracing the beginnings of RCU CPU stall warnings on systems where sysctl_panic_on_rcu_stall is disabled. The first parameter is the name of RCU flavor like other trace events. The second parameter indicates whether this is a stall of an expedited grace period, a self-detected stall of a normal grace period, or a stall of a normal grace period detected by some CPU other than the one that is stalled. RCU CPU stall warnings are often caused by external-to-RCU issues, for example, in interrupt handling or task scheduling. Therefore, this event uses TRACE_EVENT, not TRACE_EVENT_RCU, to avoid requiring those interested in tracing RCU CPU stalls to rebuild their kernels with CONFIG_RCU_TRACE=y. Reviewed-by: Uladzislau Rezki (Sony) Reviewed-by: Neeraj Upadhyay Signed-off-by: Sangmoon Kim Signed-off-by: Paul E. McKenney --- include/trace/events/rcu.h | 28 kernel/rcu/tree_exp.h | 1 + kernel/rcu/tree_stall.h| 2 ++ 3 files changed, 31 insertions(+) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 5fc2940..c7711e9 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -432,6 +432,34 @@ TRACE_EVENT_RCU(rcu_fqs, __entry->cpu, __entry->qsevent) ); +/* + * Tracepoint for RCU stall events. Takes a string identifying the RCU flavor + * and a string identifying which function detected the RCU stall as follows: + * + * "StallDetected": Scheduler-tick detects other CPU's stalls. + * "SelfDetected": Scheduler-tick detects a current CPU's stall. + * "ExpeditedStall": Expedited grace period detects stalls. + */ +TRACE_EVENT(rcu_stall_warning, + + TP_PROTO(const char *rcuname, const char *msg), + + TP_ARGS(rcuname, msg), + + TP_STRUCT__entry( + __field(const char *, rcuname) + __field(const char *, msg) + ), + + TP_fast_assign( + __entry->rcuname = rcuname; + __entry->msg = msg; + ), + + TP_printk("%s %s", + __entry->rcuname, __entry->msg) +); + #endif /* #if defined(CONFIG_TREE_RCU) */ /* diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 6c6ff06..2796084 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -521,6 +521,7 @@ static void synchronize_rcu_expedited_wait(void) if (rcu_stall_is_suppressed()) continue; panic_on_rcu_stall(); + trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {", rcu_state.name); ndetected = 0; diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 475b261..59b95cc 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -536,6 +536,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) * See Documentation/RCU/stallwarn.rst for info on how to debug * RCU CPU stall warnings. */ + trace_rcu_stall_warning(rcu_state.name, TPS("StallDetected")); pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name); rcu_for_each_leaf_node(rnp) { raw_spin_lock_irqsave_rcu_node(rnp, flags); @@ -606,6 +607,7 @@ static void print_cpu_stall(unsigned long gps) * See Documentation/RCU/stallwarn.rst for info on how to debug * RCU CPU stall warnings. */ + trace_rcu_stall_warning(rcu_state.name, TPS("SelfDetected")); pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name); raw_spin_lock_irqsave_rcu_node(rdp->mynode, flags); print_cpu_stall_info(smp_processor_id()); -- 2.9.5
[PATCH tip/core/rcu 09/10] rcu: Add explicit barrier() to __rcu_read_unlock()
From: "Paul E. McKenney" Because preemptible RCU's __rcu_read_unlock() is an external function, the rough equivalent of an implicit barrier() is inserted by the compiler. Except that there is a direct call to __rcu_read_unlock() in that same file, and compilers are getting to the point where they might choose to inline the fastpath of the __rcu_read_unlock() function. This commit therefore adds an explicit barrier() to the very beginning of __rcu_read_unlock(). Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 36212de..d9495de 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -393,8 +393,9 @@ void __rcu_read_unlock(void) { struct task_struct *t = current; + barrier(); // critical section before exit code. if (rcu_preempt_read_exit() == 0) { - barrier(); /* critical section before exit code. */ + barrier(); // critical-section exit before .s check. if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s))) rcu_read_unlock_special(t); } -- 2.9.5
[PATCH tip/core/rcu 08/10] docs: Correctly spell Stephen Hemminger's name
From: "Paul E. McKenney" This commit replaces "Steve" with the his real name, which is "Stephen". Reported-by: Stephen Hemminger Signed-off-by: Paul E. McKenney --- Documentation/RCU/RTFP.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt index 3b0876c..588d973 100644 --- a/Documentation/RCU/RTFP.txt +++ b/Documentation/RCU/RTFP.txt @@ -847,7 +847,7 @@ Symposium on Distributed Computing} 'It's entirely possible that the current user could be replaced by RCU and/or seqlocks, and we could get rid of brlocks entirely.' . - Steve Hemminger responds by replacing them with RCU. + Stephen Hemminger responds by replacing them with RCU. } } -- 2.9.5
[PATCH tip/core/rcu 06/10] softirq: Don't try waking ksoftirqd before it has been spawned
From: "Paul E. McKenney" If there is heavy softirq activity, the softirq system will attempt to awaken ksoftirqd and will stop the traditional back-of-interrupt softirq processing. This is all well and good, but only if the ksoftirqd kthreads already exist, which is not the case during early boot, in which case the system hangs. One reproducer is as follows: tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y CONFIG_NO_HZ_IDLE=y CONFIG_HZ_PERIODIC=n" --bootargs "threadirqs=1" --trust-make This commit therefore adds a couple of existence checks for ksoftirqd and forces back-of-interrupt softirq processing when ksoftirqd does not yet exist. With this change, the above test passes. Reported-by: Sebastian Andrzej Siewior Reported-by: Uladzislau Rezki Cc: Peter Zijlstra Cc: Thomas Gleixner [ paulmck: Remove unneeded check per Sebastian Siewior feedback. ] Signed-off-by: Paul E. McKenney --- kernel/softirq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/softirq.c b/kernel/softirq.c index 9908ec4a..bad14ca 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -211,7 +211,7 @@ static inline void invoke_softirq(void) if (ksoftirqd_running(local_softirq_pending())) return; - if (!force_irqthreads) { + if (!force_irqthreads || !__this_cpu_read(ksoftirqd)) { #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK /* * We can safely execute softirq on the current stack if -- 2.9.5
[PATCH tip/core/rcu 6/6] rcuscale: Add kfree_rcu() single-argument scale test
From: "Uladzislau Rezki (Sony)" The single-argument variant of kfree_rcu() is currently not tested by any member of the rcutoture test suite. This commit therefore adds rcuscale code to test it. This testing is controlled by two new boolean module parameters, kfree_rcu_test_single and kfree_rcu_test_double. If one is set and the other not, only the corresponding variant is tested, otherwise both are tested, with the variant to be tested determined randomly on each invocation. Both of these module parameters are initialized to false, so setting either to true will test only that variant. Suggested-by: Paul E. McKenney Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 12 kernel/rcu/rcuscale.c | 15 ++- 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0454572..84fce41 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4259,6 +4259,18 @@ rcuscale.kfree_rcu_test= [KNL] Set to measure performance of kfree_rcu() flooding. + rcuscale.kfree_rcu_test_double= [KNL] + Test the double-argument variant of kfree_rcu(). + If this parameter has the same value as + rcuscale.kfree_rcu_test_single, both the single- + and double-argument variants are tested. + + rcuscale.kfree_rcu_test_single= [KNL] + Test the single-argument variant of kfree_rcu(). + If this parameter has the same value as + rcuscale.kfree_rcu_test_double, both the single- + and double-argument variants are tested. + rcuscale.kfree_nthreads= [KNL] The number of threads running loops of kfree_rcu(). diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c index 06491d5..dca51fe 100644 --- a/kernel/rcu/rcuscale.c +++ b/kernel/rcu/rcuscale.c @@ -625,6 +625,8 @@ rcu_scale_shutdown(void *arg) torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu()."); torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration."); torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees."); +torture_param(bool, kfree_rcu_test_double, false, "Do we run a kfree_rcu() double-argument scale test?"); +torture_param(bool, kfree_rcu_test_single, false, "Do we run a kfree_rcu() single-argument scale test?"); static struct task_struct **kfree_reader_tasks; static int kfree_nrealthreads; @@ -644,10 +646,13 @@ kfree_scale_thread(void *arg) struct kfree_obj *alloc_ptr; u64 start_time, end_time; long long mem_begin, mem_during = 0; + bool kfree_rcu_test_both; + DEFINE_TORTURE_RANDOM(tr); VERBOSE_SCALEOUT_STRING("kfree_scale_thread task started"); set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids)); set_user_nice(current, MAX_NICE); + kfree_rcu_test_both = (kfree_rcu_test_single == kfree_rcu_test_double); start_time = ktime_get_mono_fast_ns(); @@ -670,7 +675,15 @@ kfree_scale_thread(void *arg) if (!alloc_ptr) return -ENOMEM; - kfree_rcu(alloc_ptr, rh); + // By default kfree_rcu_test_single and kfree_rcu_test_double are + // initialized to false. If both have the same value (false or true) + // both are randomly tested, otherwise only the one with value true + // is tested. + if ((kfree_rcu_test_single && !kfree_rcu_test_double) || + (kfree_rcu_test_both && torture_random(&tr) & 0x800)) + kfree_rcu(alloc_ptr); + else + kfree_rcu(alloc_ptr, rh); } cond_resched(); -- 2.9.5
[PATCH tip/core/rcu 2/6] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
From: "Paul E. McKenney" This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations carried out by the single-argument variant of kvfree_rcu(), thus avoiding this can-sleep code path from dipping into the emergency reserves. Acked-by: Michal Hocko Suggested-by: Michal Hocko Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 1f8c980..08b5044 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3519,7 +3519,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp, if (!bnode && can_alloc) { krc_this_cpu_unlock(*krcp, *flags); bnode = (struct kvfree_rcu_bulk_data *) - __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN); + __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN); *krcp = krc_this_cpu_lock(flags); } -- 2.9.5
[PATCH tip/core/rcu 4/6] kvfree_rcu: Replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
From: "Uladzislau Rezki (Sony)" __GFP_RETRY_MAYFAIL can spend quite a bit of time reclaiming, and this can be wasted effort given that there is a fallback code path in case memory allocation fails. __GFP_NORETRY does perform some light-weight reclaim, but it will fail under OOM conditions, allowing the fallback to be taken as an alternative to hard-OOMing the system. There is a four-way tradeoff that must be balanced: 1) Minimize use of the fallback path; 2) Avoid full-up OOM; 3) Do a light-wait allocation request; 4) Avoid dipping into the emergency reserves. Signed-off-by: Uladzislau Rezki (Sony) Acked-by: Michal Hocko Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 7ee83f3..0ecc1fb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3517,8 +3517,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp, bnode = get_cached_bnode(*krcp); if (!bnode && can_alloc) { krc_this_cpu_unlock(*krcp, *flags); + + // __GFP_NORETRY - allows a light-weight direct reclaim + // what is OK from minimizing of fallback hitting point of + // view. Apart of that it forbids any OOM invoking what is + // also beneficial since we are about to release memory soon. + // + // __GFP_NOMEMALLOC - prevents from consuming of all the + // memory reserves. Please note we have a fallback path. + // + // __GFP_NOWARN - it is supposed that an allocation can + // be failed under low memory or high memory pressure + // scenarios. bnode = (struct kvfree_rcu_bulk_data *) - __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN); + __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); *krcp = krc_this_cpu_lock(flags); } -- 2.9.5
[PATCH tip/core/rcu 5/6] kvfree_rcu: Use same set of GFP flags as does single-argument
From: "Uladzislau Rezki (Sony)" Running an rcuscale stress-suite can lead to "Out of memory" of a system. This can happen under high memory pressure with a small amount of physical memory. For example, a KVM test configuration with 64 CPUs and 512 megabytes can result in OOM when running rcuscale with below parameters: ../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig CONFIG_NR_CPUS=64 \ --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 \ rcuscale.kfree_loops=1 torture.disable_onoff_at_boot" --trust-make [ 12.054448] kworker/1:1H invoked oom-killer: gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0 [ 12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510 [ 12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 [ 12.056485] Workqueue: events_highpri fill_page_cache_func [ 12.056485] Call Trace: [ 12.056485] dump_stack+0x57/0x6a [ 12.056485] dump_header+0x4c/0x30a [ 12.056485] ? del_timer_sync+0x20/0x30 [ 12.056485] out_of_memory.cold.47+0xa/0x7e [ 12.056485] __alloc_pages_slowpath.constprop.123+0x82f/0xc00 [ 12.056485] __alloc_pages_nodemask+0x289/0x2c0 [ 12.056485] __get_free_pages+0x8/0x30 [ 12.056485] fill_page_cache_func+0x39/0xb0 [ 12.056485] process_one_work+0x1ed/0x3b0 [ 12.056485] ? process_one_work+0x3b0/0x3b0 [ 12.060485] worker_thread+0x28/0x3c0 [ 12.060485] ? process_one_work+0x3b0/0x3b0 [ 12.060485] kthread+0x138/0x160 [ 12.060485] ? kthread_park+0x80/0x80 [ 12.060485] ret_from_fork+0x22/0x30 [ 12.062156] Mem-Info: [ 12.062350] active_anon:0 inactive_anon:0 isolated_anon:0 [ 12.062350] active_file:0 inactive_file:0 isolated_file:0 [ 12.062350] unevictable:0 dirty:0 writeback:0 [ 12.062350] slab_reclaimable:2797 slab_unreclaimable:80920 [ 12.062350] mapped:1 shmem:2 pagetables:8 bounce:0 [ 12.062350] free:10488 free_pcp:1227 free_cma:0 ... [ 12.101610] Out of memory and no killable processes... [ 12.102042] Kernel panic - not syncing: System is deadlocked on memory [ 12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510 [ 12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 Because kvfree_rcu() has a fallback path, memory allocation failure is not the end of the world. Furthermore, the added overhead of aggressive GFP settings must be balanced against the overhead of the fallback path, which is a cache miss for double-argument kvfree_rcu() and a call to synchronize_rcu() for single-argument kvfree_rcu(). The current choice of GFP_KERNEL|__GFP_NOWARN can result in longer latencies than a call to synchronize_rcu(), so less-tenacious GFP flags would be helpful. Here is the tradeoff that must be balanced: a) Minimize use of the fallback path, b) Avoid pushing the system into OOM, c) Bound allocation latency to that of synchronize_rcu(), and d) Leave the emergency reserves to use cases lacking fallbacks. This commit therefore changes GFP flags from GFP_KERNEL|__GFP_NOWARN to GFP_KERNEL|__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_NOWARN. This combination leaves the emergency reserves alone and can initiate reclaim, but will not invoke the OOM killer. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. 
McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 0ecc1fb..4120d4b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3463,7 +3463,7 @@ static void fill_page_cache_func(struct work_struct *work) for (i = 0; i < rcu_min_cached_objs; i++) { bnode = (struct kvfree_rcu_bulk_data *) - __get_free_page(GFP_KERNEL | __GFP_NOWARN); + __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); if (bnode) { raw_spin_lock_irqsave(&krcp->lock, flags); -- 2.9.5
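Taken together, the __GFP_NOMEMALLOC, __GFP_NORETRY, and fill_page_cache_func() patches in this series converge on a single allocation recipe. A hedged sketch of that recipe follows; demo_alloc_block() is an invented wrapper name, not kvfree_rcu() code.

#include <linux/gfp.h>

static void *demo_alloc_block(void)
{
        unsigned long page;

        /* May reclaim lightly, never OOM-kills, never taps the emergency reserves. */
        page = __get_free_page(GFP_KERNEL | __GFP_NORETRY |
                               __GFP_NOMEMALLOC | __GFP_NOWARN);
        if (!page)
                return NULL;    /* caller falls back, e.g. to synchronize_rcu() */

        return (void *)page;
}

The design choice is that a failed allocation here is cheap to tolerate because kvfree_rcu() always has a fallback, whereas an OOM kill or a raid on the reserves is not.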
[PATCH tip/core/rcu 3/6] kvfree_rcu: Make krc_this_cpu_unlock() use raw_spin_unlock_irqrestore()
From: "Paul E. McKenney" The krc_this_cpu_unlock() function does a raw_spin_unlock() immediately followed by a local_irq_restore(). This commit saves a line of code by merging them into a raw_spin_unlock_irqrestore(). This transformation also reduces scheduling latency because raw_spin_unlock_irqrestore() responds immediately to a reschedule request. In contrast, local_irq_restore() does a scheduling-oblivious enabling of interrupts. Reported-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 08b5044..7ee83f3 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3229,8 +3229,7 @@ krc_this_cpu_lock(unsigned long *flags) static inline void krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags) { - raw_spin_unlock(&krcp->lock); - local_irq_restore(flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); } static inline struct kvfree_rcu_bulk_data * -- 2.9.5
[PATCH tip/core/rcu 1/6] kvfree_rcu: Directly allocate page for single-argument case
From: "Uladzislau Rezki (Sony)" Single-argument kvfree_rcu() must be invoked from sleepable contexts, so we can directly allocate pages. Furthermmore, the fallback in case of page-allocation failure is the high-latency synchronize_rcu(), so it makes sense to do these page allocations from the fastpath, and even to permit limited sleeping within the allocator. This commit therefore allocates if needed on the fastpath using GFP_KERNEL|__GFP_RETRY_MAYFAIL. This also has the beneficial effect of leaving kvfree_rcu()'s per-CPU caches to the double-argument variant of kvfree_rcu(), given that the double-argument variant cannot directly invoke the allocator. [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ] Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 42 ++ 1 file changed, 26 insertions(+), 16 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f521..1f8c980 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3493,37 +3493,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp) } } +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock() +// state specified by flags. If can_alloc is true, the caller must +// be schedulable and not be holding any locks or mutexes that might be +// acquired by the memory allocator or anything that it might invoke. +// Returns true if ptr was successfully recorded, else the caller must +// use a fallback. static inline bool -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr) +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp, + unsigned long *flags, void *ptr, bool can_alloc) { struct kvfree_rcu_bulk_data *bnode; int idx; - if (unlikely(!krcp->initialized)) + *krcp = krc_this_cpu_lock(flags); + if (unlikely(!(*krcp)->initialized)) return false; - lockdep_assert_held(&krcp->lock); idx = !!is_vmalloc_addr(ptr); /* Check if a new block is required. */ - if (!krcp->bkvhead[idx] || - krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) { - bnode = get_cached_bnode(krcp); - /* Switch to emergency path. */ + if (!(*krcp)->bkvhead[idx] || + (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) { + bnode = get_cached_bnode(*krcp); + if (!bnode && can_alloc) { + krc_this_cpu_unlock(*krcp, *flags); + bnode = (struct kvfree_rcu_bulk_data *) + __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN); + *krcp = krc_this_cpu_lock(flags); + } + if (!bnode) return false; /* Initialize the new block. */ bnode->nr_records = 0; - bnode->next = krcp->bkvhead[idx]; + bnode->next = (*krcp)->bkvhead[idx]; /* Attach it to the head. */ - krcp->bkvhead[idx] = bnode; + (*krcp)->bkvhead[idx] = bnode; } /* Finally insert. */ - krcp->bkvhead[idx]->records - [krcp->bkvhead[idx]->nr_records++] = ptr; + (*krcp)->bkvhead[idx]->records + [(*krcp)->bkvhead[idx]->nr_records++] = ptr; return true; } @@ -3561,8 +3574,6 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) ptr = (unsigned long *) func; } - krcp = krc_this_cpu_lock(&flags); - // Queue the object but don't yet schedule the batch. if (debug_rcu_head_queue(ptr)) { // Probable double kfree_rcu(), just leak. @@ -3570,12 +3581,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) __func__, head); // Mark as success and leave. 
- success = true; - goto unlock_return; + return; } kasan_record_aux_stack(ptr); - success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); + success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head); if (!success) { run_page_cache_worker(krcp); -- 2.9.5
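The heart of add_ptr_to_bulk_krc_lock() is the unlock/allocate/relock dance, which is easier to see in isolation. The sketch below is hypothetical (demo_ names, a single global lock instead of the per-CPU krcp, no record bookkeeping) and only illustrates the shape of the pattern; the real code additionally re-reads its per-CPU state after relocking because the task may have migrated to another CPU.

#include <linux/gfp.h>
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(demo_lock);
static void *demo_cached_block;         /* stand-in for get_cached_bnode() */

static bool demo_record(void *ptr, bool can_alloc)
{
        unsigned long flags;
        void *blk;

        raw_spin_lock_irqsave(&demo_lock, flags);
        blk = demo_cached_block;
        if (!blk && can_alloc) {
                raw_spin_unlock_irqrestore(&demo_lock, flags);  /* sleeping is legal again */
                blk = (void *)__get_free_page(GFP_KERNEL | __GFP_NOWARN);
                raw_spin_lock_irqsave(&demo_lock, flags);       /* relock and re-validate */
        }
        if (!blk) {
                raw_spin_unlock_irqrestore(&demo_lock, flags);
                return false;           /* caller must take its fallback path */
        }
        /* ... record ptr in blk while still holding the lock ... */
        raw_spin_unlock_irqrestore(&demo_lock, flags);
        return true;
}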
[PATCH lib/bitmap 3/9] lib: test_bitmap: add more start-end:offset/len tests
From: Paul Gortmaker There are inputs to bitmap_parselist() that would probably never be entered manually by a person, but might result from some kind of automated input generator. Things like ranges of length 1, or group lengths longer than nbits, overlaps, or offsets of zero. Adding these tests serves two purposes: 1) document what might seem odd but nonetheless valid input. 2) don't regress from what we currently accept as valid. Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/test_bitmap.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 0f2e91d..3c1c46d 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -34,6 +34,8 @@ static const unsigned long exp1[] __initconst = { BITMAP_FROM_U64(0xULL), BITMAP_FROM_U64(0xULL), BITMAP_FROM_U64(0), + BITMAP_FROM_U64(0x8000), + BITMAP_FROM_U64(0x80000000), }; static const unsigned long exp2[] __initconst = { @@ -334,6 +336,26 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = { {0, " , ,, , , ", &exp1[12 * step], 8, 0}, {0, " , ,, , , \n", &exp1[12 * step], 8, 0}, + {0, "0-0", &exp1[0], 32, 0}, + {0, "1-1", &exp1[1 * step], 32, 0}, + {0, "15-15",&exp1[13 * step], 32, 0}, + {0, "31-31",&exp1[14 * step], 32, 0}, + + {0, "0-0:0/1", &exp1[12 * step], 32, 0}, + {0, "0-0:1/1", &exp1[0], 32, 0}, + {0, "0-0:1/31", &exp1[0], 32, 0}, + {0, "0-0:31/31",&exp1[0], 32, 0}, + {0, "1-1:1/1", &exp1[1 * step], 32, 0}, + {0, "0-15:16/31", &exp1[2 * step], 32, 0}, + {0, "15-15:1/2",&exp1[13 * step], 32, 0}, + {0, "15-15:31/31", &exp1[13 * step], 32, 0}, + {0, "15-31:1/31", &exp1[13 * step], 32, 0}, + {0, "16-31:16/31", &exp1[3 * step], 32, 0}, + {0, "31-31:31/31", &exp1[14 * step], 32, 0}, + + {0, "0-31:1/3,1-31:1/3,2-31:1/3", &exp1[8 * step], 32, 0}, + {0, "1-10:8/12,8-31:24/29,0-31:0/3",&exp1[9 * step], 32, 0}, + {-EINVAL, "-1", NULL, 8, 0}, {-EINVAL, "-0", NULL, 8, 0}, {-EINVAL, "10-1", NULL, 8, 0}, -- 2.9.5
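To see what these region strings actually select, here is a small userspace program that mirrors the expansion loop used by bitmap_set_region() (set min(end - start + 1, used) bits every group_len bits). It is an illustration only, not the kernel parser, and it skips the validation that bitmap_check_region() performs.

#include <stdio.h>

static void expand_region(unsigned int s, unsigned int e,
                          unsigned int used, unsigned int grp)
{
        printf("%u-%u:%u/%u ->", s, e, used, grp);
        for (unsigned int start = s; start <= e; start += grp)
                for (unsigned int b = start; b <= e && b < start + used; b++)
                        printf(" %u", b);
        printf("\n");
}

int main(void)
{
        expand_region(0, 15, 16, 31);   /* bits 0..15, as in "0-15:16/31" */
        expand_region(15, 15, 1, 2);    /* just bit 15, as in "15-15:1/2" */
        expand_region(16, 31, 16, 31);  /* bits 16..31, as in "16-31:16/31" */
        return 0;
}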
[PATCH lib/bitmap 6/9] lib: bitmap: support "N" as an alias for size of bitmap
From: Paul Gortmaker While this is done for all bitmaps, the original use case in mind was for CPU masks and cpulist_parse() as described below. It seems that a common configuration is to use the 1st couple cores for housekeeping tasks. This tends to leave the remaining ones to form a pool of similarly configured cores to take on the real workload of interest to the user. So on machine A - with 32 cores, it could be 0-3 for "system" and then 4-31 being used in boot args like nohz_full=, or rcu_nocbs= as part of setting up the worker pool of CPUs. But then newer machine B is added, and it has 48 cores, and so while the 0-3 part remains unchanged, the pool setup cpu list becomes 4-47. Multiple deployment becomes easier when we can just simply replace 31 and 47 with "N" and let the system substitute in the actual number at boot; a number that it knows better than we do. Cc: Yury Norov Cc: Peter Zijlstra Cc: "Paul E. McKenney" Cc: Rasmus Villemoes Cc: Andy Shevchenko Suggested-by: Yury Norov # move it from CPU code Acked-by: Yury Norov Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.rst | 7 +++ lib/bitmap.c| 22 +- 2 files changed, 24 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index 1132796..d6e3f67 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -68,6 +68,13 @@ For example one can add to the command line following parameter: where the final item represents CPUs 100,101,125,126,150,151,... +The value "N" can be used to represent the numerically last CPU on the system, +i.e "foo_cpus=16-N" would be equivalent to "16-31" on a 32 core system. + +Keep in mind that "N" is dynamic, so if system changes cause the bitmap width +to change, such as less cores in the CPU list, then N and any ranges using N +will also change. Use the same on a small 4 core system, and "16-N" becomes +"16-3" and now the same boot input will be flagged as invalid (start > end). This document may not be entirely up to date and comprehensive. 
The command diff --git a/lib/bitmap.c b/lib/bitmap.c index 833f152a..9f4626a 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -519,11 +519,17 @@ static int bitmap_check_region(const struct region *r) return 0; } -static const char *bitmap_getnum(const char *str, unsigned int *num) +static const char *bitmap_getnum(const char *str, unsigned int *num, +unsigned int lastbit) { unsigned long long n; unsigned int len; + if (str[0] == 'N') { + *num = lastbit; + return str + 1; + } + len = _parse_integer(str, 10, &n); if (!len) return ERR_PTR(-EINVAL); @@ -571,7 +577,9 @@ static const char *bitmap_find_region_reverse(const char *start, const char *end static const char *bitmap_parse_region(const char *str, struct region *r) { - str = bitmap_getnum(str, &r->start); + unsigned int lastbit = r->nbits - 1; + + str = bitmap_getnum(str, &r->start, lastbit); if (IS_ERR(str)) return str; @@ -581,7 +589,7 @@ static const char *bitmap_parse_region(const char *str, struct region *r) if (*str != '-') return ERR_PTR(-EINVAL); - str = bitmap_getnum(str + 1, &r->end); + str = bitmap_getnum(str + 1, &r->end, lastbit); if (IS_ERR(str)) return str; @@ -591,14 +599,14 @@ static const char *bitmap_parse_region(const char *str, struct region *r) if (*str != ':') return ERR_PTR(-EINVAL); - str = bitmap_getnum(str + 1, &r->off); + str = bitmap_getnum(str + 1, &r->off, lastbit); if (IS_ERR(str)) return str; if (*str != '/') return ERR_PTR(-EINVAL); - return bitmap_getnum(str + 1, &r->group_len); + return bitmap_getnum(str + 1, &r->group_len, lastbit); no_end: r->end = r->start; @@ -625,6 +633,10 @@ static const char *bitmap_parse_region(const char *str, struct region *r) * From each group will be used only defined amount of bits. * Syntax: range:used_size/group_size * Example: 0-1023:2/256 ==> 0,1,256,257,512,513,768,769 + * The value 'N' can be used as a dynamically substituted token for the + * maximum allowed value; i.e (nmaskbits - 1). Keep in mind that it is + * dynamic, so if system changes cause the bitmap width to change, such + * as more cores in a CPU list, then any ranges using N will also change. * * Returns: 0 on success, -errno on invalid input strings. Error values: * -- 2.9.5
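A userspace sketch of just the substitution step that the new lastbit argument enables: if the next token is the literal 'N', use nbits - 1 instead of parsing digits. The parsing below is simplified (strtoul rather than _parse_integer) and the function names are not the kernel's.

#include <stdio.h>
#include <stdlib.h>

static const char *demo_getnum(const char *str, unsigned int *num,
                               unsigned int lastbit)
{
        char *end;

        if (str[0] == 'N') {            /* alias for the last valid bit index */
                *num = lastbit;
                return str + 1;
        }
        *num = (unsigned int)strtoul(str, &end, 10);
        return end;
}

int main(void)
{
        unsigned int nbits = 32, start, stop;
        const char *s = "16-N";

        s = demo_getnum(s, &start, nbits - 1);
        s = demo_getnum(s + 1, &stop, nbits - 1);       /* skip the '-' */
        printf("\"16-N\" with %u bits -> %u-%u\n", nbits, start, stop);
        return 0;
}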
[PATCH lib/bitmap 8/9] rcu: deprecate "all" option to rcu_nocbs=
From: Paul Gortmaker With the core bitmap support now accepting "N" as a placeholder for the end of the bitmap, "all" can be represented as "0-N" and has the advantage of not being specific to RCU (or any other subsystem). So deprecate the use of "all" by removing documentation references to it. The support itself needs to remain for now, since we don't know how many people out there are using it currently, but since it is in an __init area anyway, it isn't worth losing sleep over. Cc: Yury Norov Cc: Peter Zijlstra Cc: "Paul E. McKenney" Cc: Josh Triplett Acked-by: Yury Norov Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 4 +--- kernel/rcu/tree_plugin.h| 6 ++ 2 files changed, 3 insertions(+), 7 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0454572..83e2ef1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4068,9 +4068,7 @@ see CONFIG_RAS_CEC help text. rcu_nocbs= [KNL] - The argument is a cpu list, as described above, - except that the string "all" can be used to - specify every CPU on the system. + The argument is a cpu list, as described above. In kernels built with CONFIG_RCU_NOCB_CPU=y, set the specified list of CPUs to be no-callback CPUs. diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2d60377..0b95562 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1464,14 +1464,12 @@ static void rcu_cleanup_after_idle(void) /* * Parse the boot-time rcu_nocb_mask CPU list from the kernel parameters. - * The string after the "rcu_nocbs=" is either "all" for all CPUs, or a - * comma-separated list of CPUs and/or CPU ranges. If an invalid list is - * given, a warning is emitted and all CPUs are offloaded. + * If the list is invalid, a warning is emitted and all CPUs are offloaded. */ static int __init rcu_nocb_setup(char *str) { alloc_bootmem_cpumask_var(&rcu_nocb_mask); - if (!strcasecmp(str, "all")) + if (!strcasecmp(str, "all"))/* legacy: use "0-N" instead */ cpumask_setall(rcu_nocb_mask); else if (cpulist_parse(str, rcu_nocb_mask)) { -- 2.9.5
[PATCH lib/bitmap 9/9] rcutorture: Use "all" and "N" in "nohz_full" and "rcu_nocbs"
From: "Paul E. McKenney" This commit uses the shiny new "all" and "N" cpumask options to decouple the "nohz_full" and "rcu_nocbs" kernel boot parameters in the TREE04.boot and TREE08.boot files from the CONFIG_NR_CPUS options in the TREE04 and TREE08 files. Reported-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot | 2 +- tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot index 5adc675..a8d94ca 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot @@ -1 +1 @@ -rcutree.rcu_fanout_leaf=4 nohz_full=1-7 +rcutree.rcu_fanout_leaf=4 nohz_full=1-N diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot index 22478fd..94d3844 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot @@ -1,3 +1,3 @@ rcupdate.rcu_self_test=1 rcutree.rcu_fanout_exact=1 -rcu_nocbs=0-7 +rcu_nocbs=all -- 2.9.5
[PATCH lib/bitmap 7/9] lib: test_bitmap: add tests for "N" alias
From: Paul Gortmaker These are copies of existing tests, with just 31 --> N. This ensures the recently added "N" alias transparently works in any normally numeric fields of a region specification. Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/test_bitmap.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 3c1c46d..9cd5755 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -353,6 +353,16 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = { {0, "16-31:16/31", &exp1[3 * step], 32, 0}, {0, "31-31:31/31", &exp1[14 * step], 32, 0}, + {0, "N-N", &exp1[14 * step], 32, 0}, + {0, "0-0:1/N", &exp1[0], 32, 0}, + {0, "0-0:N/N", &exp1[0], 32, 0}, + {0, "0-15:16/N",&exp1[2 * step], 32, 0}, + {0, "15-15:N/N",&exp1[13 * step], 32, 0}, + {0, "15-N:1/N", &exp1[13 * step], 32, 0}, + {0, "16-N:16/N",&exp1[3 * step], 32, 0}, + {0, "N-N:N/N", &exp1[14 * step], 32, 0}, + + {0, "0-N:1/3,1-N:1/3,2-N:1/3", &exp1[8 * step], 32, 0}, {0, "0-31:1/3,1-31:1/3,2-31:1/3", &exp1[8 * step], 32, 0}, {0, "1-10:8/12,8-31:24/29,0-31:0/3",&exp1[9 * step], 32, 0}, -- 2.9.5
[PATCH lib/bitmap 1/9] lib: test_bitmap: clearly separate ERANGE from EINVAL tests.
From: Paul Gortmaker This block of tests was meant to find/flag incorrect use of the ":" and "/" separators (syntax errors) and invalid (zero) group len. However they were specified with an 8 bit width and 32 bit operations, so they really contained two errors (EINVAL and ERANGE). Promote them to 32 bit so it is clear what they are meant to target. Then we can add tests specific for ERANGE (no syntax errors, just doing 32bit op on 8 bit width, plus a typical 9-on-8 fencepost error). Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/test_bitmap.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 0ea0e82..853a3a6 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -337,12 +337,12 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = { {-EINVAL, "-1", NULL, 8, 0}, {-EINVAL, "-0", NULL, 8, 0}, {-EINVAL, "10-1", NULL, 8, 0}, - {-EINVAL, "0-31:", NULL, 8, 0}, - {-EINVAL, "0-31:0", NULL, 8, 0}, - {-EINVAL, "0-31:0/", NULL, 8, 0}, - {-EINVAL, "0-31:0/0", NULL, 8, 0}, - {-EINVAL, "0-31:1/0", NULL, 8, 0}, - {-EINVAL, "0-31:10/1", NULL, 8, 0}, + {-EINVAL, "0-31:", NULL, 32, 0}, + {-EINVAL, "0-31:0", NULL, 32, 0}, + {-EINVAL, "0-31:0/", NULL, 32, 0}, + {-EINVAL, "0-31:0/0", NULL, 32, 0}, + {-EINVAL, "0-31:1/0", NULL, 32, 0}, + {-EINVAL, "0-31:10/1", NULL, 32, 0}, {-EOVERFLOW, "0-98765432123456789:10/1", NULL, 8, 0}, {-EINVAL, "a-31", NULL, 8, 0}, -- 2.9.5
[PATCH lib/bitmap 2/9] lib: test_bitmap: add tests to trigger ERANGE case.
From: Paul Gortmaker Add tests that specify a valid range, but one that is outside the width of the bitmap for which it is to be applied to. These should trigger an -ERANGE response from the code. Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/test_bitmap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 853a3a6..0f2e91d 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -337,6 +337,8 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = { {-EINVAL, "-1", NULL, 8, 0}, {-EINVAL, "-0", NULL, 8, 0}, {-EINVAL, "10-1", NULL, 8, 0}, + {-ERANGE, "8-8", NULL, 8, 0}, + {-ERANGE, "0-31", NULL, 8, 0}, {-EINVAL, "0-31:", NULL, 32, 0}, {-EINVAL, "0-31:0", NULL, 32, 0}, {-EINVAL, "0-31:0/", NULL, 32, 0}, -- 2.9.5
[PATCH lib/bitmap 4/9] lib: bitmap: fold nbits into region struct
From: Paul Gortmaker This will reduce parameter passing and enable using nbits as part of future dynamic region parameter parsing. Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Suggested-by: Yury Norov Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/bitmap.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/lib/bitmap.c b/lib/bitmap.c index 75006c4..162e285 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -487,24 +487,24 @@ EXPORT_SYMBOL(bitmap_print_to_pagebuf); /* * Region 9-38:4/10 describes the following bitmap structure: - * 0 9 1218 38 - * ....... - * ^ ^ ^ ^ - * start off group_lenend + * 0 9 1218 38 N + * ....... + * ^ ^ ^ ^ ^ + * start off group_lenend nbits */ struct region { unsigned int start; unsigned int off; unsigned int group_len; unsigned int end; + unsigned int nbits; }; -static int bitmap_set_region(const struct region *r, - unsigned long *bitmap, int nbits) +static int bitmap_set_region(const struct region *r, unsigned long *bitmap) { unsigned int start; - if (r->end >= nbits) + if (r->end >= r->nbits) return -ERANGE; for (start = r->start; start <= r->end; start += r->group_len) @@ -640,7 +640,8 @@ int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits) struct region r; long ret; - bitmap_zero(maskp, nmaskbits); + r.nbits = nmaskbits; + bitmap_zero(maskp, r.nbits); while (buf) { buf = bitmap_find_region(buf); @@ -655,7 +656,7 @@ int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits) if (ret) return ret; - ret = bitmap_set_region(&r, maskp, nmaskbits); + ret = bitmap_set_region(&r, maskp); if (ret) return ret; } -- 2.9.5
[PATCH lib/bitmap 5/9] lib: bitmap: move ERANGE check from set_region to check_region
From: Paul Gortmaker It makes sense to do all the checks in check_region() and not 1/2 in check_region and 1/2 in set_region. Since set_region is called immediately after check_region, the net effect on runtime is zero, but it gets rid of an if (...) return... Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/bitmap.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/lib/bitmap.c b/lib/bitmap.c index 162e285..833f152a 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -500,17 +500,12 @@ struct region { unsigned int nbits; }; -static int bitmap_set_region(const struct region *r, unsigned long *bitmap) +static void bitmap_set_region(const struct region *r, unsigned long *bitmap) { unsigned int start; - if (r->end >= r->nbits) - return -ERANGE; - for (start = r->start; start <= r->end; start += r->group_len) bitmap_set(bitmap, start, min(r->end - start + 1, r->off)); - - return 0; } static int bitmap_check_region(const struct region *r) @@ -518,6 +513,9 @@ static int bitmap_check_region(const struct region *r) if (r->start > r->end || r->group_len == 0 || r->off > r->group_len) return -EINVAL; + if (r->end >= r->nbits) + return -ERANGE; + return 0; } @@ -656,9 +654,7 @@ int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits) if (ret) return ret; - ret = bitmap_set_region(&r, maskp); - if (ret) - return ret; + bitmap_set_region(&r, maskp); } return 0; -- 2.9.5
[PATCH tip/core/rcu 02/12] timer: Report ignored local enqueue in nohz mode
From: Frederic Weisbecker Enqueuing a local timer after the tick has been stopped will result in the timer being ignored until the next random interrupt. Perform sanity checks to report these situations. Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Rafael J. Wysocki Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/sched/core.c | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ca2bb62..4822371 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -674,6 +674,26 @@ int get_nohz_timer_target(void) return cpu; } +static void wake_idle_assert_possible(void) +{ +#ifdef CONFIG_SCHED_DEBUG + /* Timers are re-evaluated after idle IRQs */ + if (in_hardirq()) + return; + /* +* Same as hardirqs, assuming they are executing +* on IRQ tail. Ksoftirqd shouldn't reach here +* as the timer base wouldn't be idle. And inline +* softirq processing after a call to local_bh_enable() +* within idle loop sound too fun to be considered here. +*/ + if (in_serving_softirq()) + return; + + WARN_ON_ONCE("Late timer enqueue may be ignored\n"); +#endif +} + /* * When add_timer_on() enqueues a timer into the timer wheel of an * idle CPU then this timer might expire before the next timer event @@ -688,8 +708,10 @@ static void wake_up_idle_cpu(int cpu) { struct rq *rq = cpu_rq(cpu); - if (cpu == smp_processor_id()) + if (cpu == smp_processor_id()) { + wake_idle_assert_possible(); return; + } if (set_nr_and_not_polling(rq->idle)) smp_send_reschedule(cpu); -- 2.9.5
[PATCH tip/core/rcu 01/12] rcu/nocb: Detect unsafe checks for offloaded rdp
From: Frederic Weisbecker Provide CONFIG_PROVE_RCU sanity checks to ensure we are always reading the offloaded state of an rdp in a safe and stable way and prevent from its value to be changed under us. We must either hold the barrier mutex, the cpu-hotplug lock (read or write) or the nocb lock. Local non-preemptible reads are also safe. NOCB kthreads and timers have their own means of synchronization against the offloaded state updaters. Cc: Josh Triplett Cc: Steven Rostedt Cc: Mathieu Desnoyers Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Thomas Gleixner Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c| 21 +-- kernel/rcu/tree_plugin.h | 90 2 files changed, 87 insertions(+), 24 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f521..03503e2 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -156,6 +156,7 @@ static void invoke_rcu_core(void); static void rcu_report_exp_rdp(struct rcu_data *rdp); static void sync_sched_exp_online_cleanup(int cpu); static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp); +static bool rcu_rdp_is_offloaded(struct rcu_data *rdp); /* rcuc/rcub kthread realtime priority */ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0; @@ -1672,7 +1673,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) { bool ret = false; bool need_qs; - const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_rdp_is_offloaded(rdp); raw_lockdep_assert_held_rcu_node(rnp); @@ -2128,7 +2129,7 @@ static void rcu_gp_cleanup(void) needgp = true; } /* Advance CBs to reduce false positives below. */ - offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); + offloaded = rcu_rdp_is_offloaded(rdp); if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) { WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); WRITE_ONCE(rcu_state.gp_req_activity, jiffies); @@ -2327,7 +2328,7 @@ rcu_report_qs_rdp(struct rcu_data *rdp) unsigned long flags; unsigned long mask; bool needwake = false; - const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_rdp_is_offloaded(rdp); struct rcu_node *rnp; WARN_ON_ONCE(rdp->cpu != smp_processor_id()); @@ -2497,7 +2498,7 @@ static void rcu_do_batch(struct rcu_data *rdp) int div; bool __maybe_unused empty; unsigned long flags; - const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_rdp_is_offloaded(rdp); struct rcu_head *rhp; struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); long bl, count = 0; @@ -3066,7 +3067,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCBQueued")); /* Go handle any RCU core processing required. */ - if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) { + if (unlikely(rcu_rdp_is_offloaded(rdp))) { __call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */ } else { __call_rcu_core(rdp, head, flags); @@ -3843,13 +3844,13 @@ static int rcu_pending(int user) return 1; /* Does this CPU have callbacks ready to invoke? */ - if (!rcu_segcblist_is_offloaded(&rdp->cblist) && + if (!rcu_rdp_is_offloaded(rdp) && rcu_segcblist_ready_cbs(&rdp->cblist)) return 1; /* Has RCU gone idle with this CPU needing another grace period? 
*/ if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) && - !rcu_segcblist_is_offloaded(&rdp->cblist) && + !rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) return 1; @@ -3968,7 +3969,7 @@ void rcu_barrier(void) for_each_possible_cpu(cpu) { rdp = per_cpu_ptr(&rcu_data, cpu); if (cpu_is_offline(cpu) && - !rcu_segcblist_is_offloaded(&rdp->cblist)) + !rcu_rdp_is_offloaded(rdp)) continue; if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) { rcu_barrier_trace(TPS("OnlineQ"), cpu, @@ -4291,7 +4292,7 @@ void rcutree_migrate_callbacks(int cpu) struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); bool needwake; - if (rcu_segcblist_is_offloaded(&rdp->cblist) || + if (rcu_rdp_is_offloaded(rdp) || rcu_segcblist_empty(&rdp->cblist)) return; /* No callbacks to migrate. */ @@ -4309,7 +4310,7 @@ void rcutree_migrate_callbacks(int cpu) rcu_segcblist_disable(&rdp->cbl
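A simplified sketch of the kind of wrapper this patch introduces around rcu_segcblist_is_offloaded(): warn under lockdep unless the caller holds one of the things that pin the offloaded state. The condition below approximates the rules stated in the commit message (barrier mutex, CPU-hotplug lock, nocb lock, or local non-preemptible access) and is meant to sit next to the code in kernel/rcu; the real rcu_rdp_is_offloaded() is more detailed and also accounts for the nocb kthreads and timers.

static bool demo_rdp_is_offloaded(struct rcu_data *rdp)
{
        RCU_LOCKDEP_WARN(!(lockdep_is_held(&rcu_state.barrier_mutex) ||
                           (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
                           lockdep_is_held(&rdp->nocb_lock) ||
                           (rdp == this_cpu_ptr(&rcu_data) &&
                            !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible()))),
                         "Unsafe read of RCU_NOCB offloaded state");

        return rcu_segcblist_is_offloaded(&rdp->cblist);
}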
[PATCH tip/core/rcu 05/12] rcu/nocb: Avoid confusing double write of rdp->nocb_cb_sleep
From: Frederic Weisbecker The nocb_cb_wait() function first sets the rdp->nocb_cb_sleep flag to true after invoking the callbacks, and then sets it back to false if it finds more callbacks that are ready to invoke. This is confusing and will become unsafe if this flag is ever read locklessly. This commit therefore writes it only once, based on the state after both callback invocation and checking. Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 9fd8588..6a7f77d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2230,6 +2230,7 @@ static void nocb_cb_wait(struct rcu_data *rdp) unsigned long flags; bool needwake_state = false; bool needwake_gp = false; + bool can_sleep = true; struct rcu_node *rnp = rdp->mynode; local_irq_save(flags); @@ -2253,8 +2254,6 @@ static void nocb_cb_wait(struct rcu_data *rdp) raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ } - WRITE_ONCE(rdp->nocb_cb_sleep, true); - if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) { if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) { rcu_segcblist_set_flags(cblist, SEGCBLIST_KTHREAD_CB); @@ -2262,7 +2261,7 @@ static void nocb_cb_wait(struct rcu_data *rdp) needwake_state = true; } if (rcu_segcblist_ready_cbs(cblist)) - WRITE_ONCE(rdp->nocb_cb_sleep, false); + can_sleep = false; } else { /* * De-offloading. Clear our flag and notify the de-offload worker. */ @@ -2275,6 +2274,8 @@ static void nocb_cb_wait(struct rcu_data *rdp) needwake_state = true; } + WRITE_ONCE(rdp->nocb_cb_sleep, can_sleep); + if (rdp->nocb_cb_sleep) trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep")); -- 2.9.5
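The idiom being adopted, in isolation: compute the final value into a local variable and publish it with a single WRITE_ONCE(), so a lockless READ_ONCE() reader can never observe the transient true-then-false sequence. The demo_ structure and helpers are invented for illustration.

#include <linux/compiler.h>
#include <linux/types.h>

struct demo_state {
        bool cb_sleep;
};

/* Writer: any number of local updates, but exactly one store at the end. */
static void demo_publish_sleep_state(struct demo_state *st, bool more_callbacks)
{
        bool can_sleep = true;

        if (more_callbacks)
                can_sleep = false;

        WRITE_ONCE(st->cb_sleep, can_sleep);
}

/* Lockless reader. */
static bool demo_sleep_state(struct demo_state *st)
{
        return READ_ONCE(st->cb_sleep);
}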
[PATCH tip/core/rcu 03/12] rcu/nocb: Comment the reason behind BH disablement on batch processing
From: Frederic Weisbecker This commit explains why softirqs need to be disabled while invoking callbacks, even when callback processing has been offloaded. After all, invoking callbacks concurrently is one thing, but concurrently invoking the same callback is quite another. Reported-by: Boqun Feng Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index cd513ea..013142d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2235,6 +2235,12 @@ static void nocb_cb_wait(struct rcu_data *rdp) local_irq_save(flags); rcu_momentary_dyntick_idle(); local_irq_restore(flags); + /* +* Disable BH to provide the expected environment. Also, when +* transitioning to/from NOCB mode, a self-requeuing callback might +* be invoked from softirq. A short grace period could cause both +* instances of this callback to execute concurrently. +*/ local_bh_disable(); rcu_do_batch(rdp); local_bh_enable(); -- 2.9.5
[PATCH tip/core/rcu 06/12] rcu/nocb: Only (re-)initialize segcblist when needed on CPU up
From: Frederic Weisbecker At the start of a CPU-hotplug operation, the incoming CPU's callback list can be in a number of states: 1. Disabled and empty. This is the case when the boot CPU has not invoked call_rcu(), when a non-boot CPU first comes online, and when a non-offloaded CPU comes back online. In this case, it is both necessary and permissible to initialize ->cblist. Because either the CPU is currently running with interrupts disabled (boot CPU) or is not yet running at all (other CPUs), it is not necessary to acquire ->nocb_lock. In this case, initialization is required. 2. Disabled and non-empty. This cannot occur, because early boot call_rcu() invocations enable the callback list before enqueuing their callback. 3. Enabled, whether empty or not. In this case, the callback list has already been initialized. This case occurs when the boot CPU has executed an early boot call_rcu() and also when an offloaded CPU comes back online. In both cases, there is no need to initialize the callback list: In the boot-CPU case, the CPU has not (yet) gone offline, and in the offloaded case, the rcuo kthreads are taking care of business. Because it is not necessary to initialize the callback list, it is also not necessary to acquire ->nocb_lock. Therefore, checking if the segcblist is enabled suffices. This commit therefore initializes the callback list at rcutree_prepare_cpu() time only if that list is disabled. Signed-off-by: Frederic Weisbecker Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ee77858..402ea36 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4084,14 +4084,13 @@ int rcutree_prepare_cpu(unsigned int cpu) rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */ rcu_dynticks_eqs_online(); raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ + /* -* Lock in case the CB/GP kthreads are still around handling -* old callbacks. +* Only non-NOCB CPUs that didn't have early-boot callbacks need to be +* (re-)initialized. */ - rcu_nocb_lock(rdp); - if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */ + if (!rcu_segcblist_is_enabled(&rdp->cblist)) rcu_segcblist_init(&rdp->cblist); /* Re-enable callbacks. */ - rcu_nocb_unlock(rdp); /* * Add CPU to leaf rcu_node pending-online bitmask. Any needed -- 2.9.5
[PATCH tip/core/rcu 04/12] rcu/nocb: Forbid NOCB toggling on offline CPUs
From: Frederic Weisbecker It makes no sense to de-offload an offline CPU because that CPU will never invoke any remaining callbacks. It also makes little sense to offload an offline CPU because any pending RCU callbacks were migrated when that CPU went offline. Yes, it is in theory possible to use a number of tricks to permit offloading and deoffloading offline CPUs in certain cases, but in practice it is far better to have the simple and deterministic rule "Toggling the offload state of an offline CPU is forbidden". For but one example, consider that an offloaded offline CPU might have millions of callbacks queued. Best to just say "no". This commit therefore forbids toggling of the offloaded state of offline CPUs. Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c| 3 +-- kernel/rcu/tree_plugin.h | 57 ++-- 2 files changed, 22 insertions(+), 38 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 03503e2..ee77858 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4086,8 +4086,7 @@ int rcutree_prepare_cpu(unsigned int cpu) raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ /* * Lock in case the CB/GP kthreads are still around handling -* old callbacks (longer term we should flush all callbacks -* before completing CPU offline) +* old callbacks. */ rcu_nocb_lock(rdp); if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 013142d..9fd8588 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2399,23 +2399,18 @@ static int rdp_offload_toggle(struct rcu_data *rdp, return 0; } -static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp) +static long rcu_nocb_rdp_deoffload(void *arg) { + struct rcu_data *rdp = arg; struct rcu_segcblist *cblist = &rdp->cblist; unsigned long flags; int ret; + WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id()); + pr_info("De-offloading %d\n", rdp->cpu); rcu_nocb_lock_irqsave(rdp, flags); - /* -* If there are still pending work offloaded, the offline -* CPU won't help much handling them. 
-*/ - if (cpu_is_offline(rdp->cpu) && !rcu_segcblist_empty(&rdp->cblist)) { - rcu_nocb_unlock_irqrestore(rdp, flags); - return -EBUSY; - } ret = rdp_offload_toggle(rdp, false, flags); swait_event_exclusive(rdp->nocb_state_wq, @@ -2446,14 +2441,6 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp) return ret; } -static long rcu_nocb_rdp_deoffload(void *arg) -{ - struct rcu_data *rdp = arg; - - WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id()); - return __rcu_nocb_rdp_deoffload(rdp); -} - int rcu_nocb_cpu_deoffload(int cpu) { struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); @@ -2466,12 +2453,14 @@ int rcu_nocb_cpu_deoffload(int cpu) mutex_lock(&rcu_state.barrier_mutex); cpus_read_lock(); if (rcu_rdp_is_offloaded(rdp)) { - if (cpu_online(cpu)) + if (cpu_online(cpu)) { ret = work_on_cpu(cpu, rcu_nocb_rdp_deoffload, rdp); - else - ret = __rcu_nocb_rdp_deoffload(rdp); - if (!ret) - cpumask_clear_cpu(cpu, rcu_nocb_mask); + if (!ret) + cpumask_clear_cpu(cpu, rcu_nocb_mask); + } else { + pr_info("NOCB: Can't CB-deoffload an offline CPU\n"); + ret = -EINVAL; + } } cpus_read_unlock(); mutex_unlock(&rcu_state.barrier_mutex); @@ -2480,12 +2469,14 @@ int rcu_nocb_cpu_deoffload(int cpu) } EXPORT_SYMBOL_GPL(rcu_nocb_cpu_deoffload); -static int __rcu_nocb_rdp_offload(struct rcu_data *rdp) +static long rcu_nocb_rdp_offload(void *arg) { + struct rcu_data *rdp = arg; struct rcu_segcblist *cblist = &rdp->cblist; unsigned long flags; int ret; + WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id()); /* * For now we only support re-offload, ie: the rdp must have been * offloaded on boot first. @@ -2525,14 +2516,6 @@ static int __rcu_nocb_rdp_offload(struct rcu_data *rdp) return ret; } -static long rcu_nocb_rdp_offload(void *arg) -{ - struct rcu_data *rdp = arg; - - WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id()); - return __rcu_nocb_rdp_offload(rdp); -} - int rcu_nocb_cpu_offload(int cpu) { struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); @@ -2541,12 +2524,14 @@ int rcu_nocb_cpu_offload(int c
[PATCH tip/core/rcu 07/12] rcu/nocb: Rename nocb_gp_update_state to nocb_gp_update_state_deoffloading
From: Frederic Weisbecker The name nocb_gp_update_state() is unenlightening, so this commit changes it to nocb_gp_update_state_deoffloading(). This function now does what its name says, updates state and returns true if the CPU corresponding to the specified rcu_data structure is in the process of being de-offloaded. Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 6a7f77d..93d3938 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2016,7 +2016,8 @@ static inline bool nocb_gp_enabled_cb(struct rcu_data *rdp) return rcu_segcblist_test_flags(&rdp->cblist, flags); } -static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool *needwake_state) +static inline bool nocb_gp_update_state_deoffloading(struct rcu_data *rdp, +bool *needwake_state) { struct rcu_segcblist *cblist = &rdp->cblist; @@ -2026,7 +2027,7 @@ static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool *needwake_sta if (rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) *needwake_state = true; } - return true; + return false; } /* @@ -2037,7 +2038,7 @@ static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool *needwake_sta rcu_segcblist_clear_flags(cblist, SEGCBLIST_KTHREAD_GP); if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) *needwake_state = true; - return false; + return true; } @@ -2075,7 +2076,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) continue; trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check")); rcu_nocb_lock_irqsave(rdp, flags); - if (!nocb_gp_update_state(rdp, &needwake_state)) { + if (nocb_gp_update_state_deoffloading(rdp, &needwake_state)) { rcu_nocb_unlock_irqrestore(rdp, flags); if (needwake_state) swake_up_one(&rdp->nocb_state_wq); -- 2.9.5
[PATCH tip/core/rcu 10/12] rcu/nocb: Disable bypass when CPU isn't completely offloaded
From: Frederic Weisbecker Currently, the bypass is flushed at the very last moment in the deoffloading procedure. However, this approach leads to a larger state space than would be preferred. This commit therefore disables the bypass at soon as the deoffloading procedure begins, then flushes it. This guarantees that the bypass remains empty and thus out of the way of the deoffloading procedure. Symmetrically, this commit waits to enable the bypass until the offloading procedure has completed. Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- include/linux/rcu_segcblist.h | 7 --- kernel/rcu/tree_plugin.h | 38 +- 2 files changed, 33 insertions(+), 12 deletions(-) diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h index 8afe886..3db96c4 100644 --- a/include/linux/rcu_segcblist.h +++ b/include/linux/rcu_segcblist.h @@ -109,7 +109,7 @@ struct rcu_cblist { * | SEGCBLIST_KTHREAD_GP | * | | * | Kthreads handle callbacks holding nocb_lock, local rcu_core() stops | - * | handling callbacks. | + * | handling callbacks. Enable bypass queueing. | * */ @@ -125,7 +125,7 @@ struct rcu_cblist { * | SEGCBLIST_KTHREAD_GP | * | | * | CB/GP kthreads handle callbacks holding nocb_lock, local rcu_core() | - * | ignores callbacks. | + * | ignores callbacks. Bypass enqueue is enabled. | * * | * v @@ -134,7 +134,8 @@ struct rcu_cblist { * | SEGCBLIST_KTHREAD_GP | * | | * | CB/GP kthreads and local rcu_core() handle callbacks concurrently | - * | holding nocb_lock. Wake up CB and GP kthreads if necessary. | + * | holding nocb_lock. Wake up CB and GP kthreads if necessary. Disable | + * | bypass enqueue. | * * | * v diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index e392bd1..b08564b 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1830,11 +1830,22 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, unsigned long j = jiffies; long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); + lockdep_assert_irqs_disabled(); + + // Pure softirq/rcuc based processing: no bypassing, no + // locking. if (!rcu_rdp_is_offloaded(rdp)) { *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + return false; + } + + // In the process of (de-)offloading: no bypassing, but + // locking. + if (!rcu_segcblist_completely_offloaded(&rdp->cblist)) { + rcu_nocb_lock(rdp); + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); return false; /* Not offloaded, no bypassing. */ } - lockdep_assert_irqs_disabled(); // Don't use ->nocb_bypass during early boot. if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) { @@ -2416,7 +2427,16 @@ static long rcu_nocb_rdp_deoffload(void *arg) pr_info("De-offloading %d\n", rdp->cpu); rcu_nocb_lock_irqsave(rdp, flags); - + /* +* Flush once and for all now. This suffices because we are +* running on the target CPU holding ->nocb_lock (thus having +* interrupts disabled), and because rdp_offload_toggle() +* invokes rcu_segcblist_offload(), which clears SEGCBLIST_OFFLOADED. +* Thus future calls to rcu_segcblist_completely_offloaded() will +* return false, which means that future calls to rcu_nocb_try_bypass() +* will refuse to put anything into the bypass. +*/ + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); ret = rdp_offload_toggle(rdp, false, flags); swait_event_exclusive(rdp->nocb_state_wq, !rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB | @@ -2428,21 +24
[PATCH tip/core/rcu 12/12] rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible
From: Frederic Weisbecker Those tracing calls don't need to be under ->nocb_lock. This commit therefore moves them outside of that lock. Signed-off-by: Frederic Weisbecker Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index b08564b..9846c8a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1703,9 +1703,9 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force, lockdep_assert_held(&rdp->nocb_lock); if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) { + rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("AlreadyAwake")); - rcu_nocb_unlock_irqrestore(rdp, flags); return false; } @@ -1955,9 +1955,9 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, // If we are being polled or there is no kthread, just leave. t = READ_ONCE(rdp->nocb_gp_kthread); if (rcu_nocb_poll || !t) { + rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNotPoll")); - rcu_nocb_unlock_irqrestore(rdp, flags); return; } // Need to actually to a wakeup. @@ -1992,8 +1992,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, TPS("WakeOvfIsDeferred")); rcu_nocb_unlock_irqrestore(rdp, flags); } else { - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); rcu_nocb_unlock_irqrestore(rdp, flags); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); } return; } -- 2.9.5
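The rule being applied here is general: if a diagnostic call needs only values the caller already has in hand, emit it after dropping the lock so the critical section stays as short as possible. A minimal userspace analogue of the pattern (illustrative only, not kernel code):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int state;

    static void update_and_trace(int new_state)
    {
        int traced;

        pthread_mutex_lock(&lock);
        state = new_state;
        traced = state;        /* capture what the trace needs */
        pthread_mutex_unlock(&lock);

        /* The trace uses only local copies, so it runs unlocked. */
        printf("state is now %d\n", traced);
    }

    int main(void)
    {
        update_and_trace(42);
        return 0;
    }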
[PATCH tip/core/rcu 08/12] rcu: Make nocb_nobypass_lim_per_jiffy static
From: Jiapeng Chong RCU triggers the following sparse warning: kernel/rcu/tree_plugin.h:1497:5: warning: symbol 'nocb_nobypass_lim_per_jiffy' was not declared. Should it be static? This commit therefore makes this variable static. Reported-by: Abaci Robot Frederic Weisbecker Signed-off-by: Jiapeng Chong Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 93d3938..a1a17ad 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1556,7 +1556,7 @@ early_param("rcu_nocb_poll", parse_rcu_nocb_poll); * After all, the main point of bypassing is to avoid lock contention * on ->nocb_lock, which only can happen at high call_rcu() rates. */ -int nocb_nobypass_lim_per_jiffy = 16 * 1000 / HZ; +static int nocb_nobypass_lim_per_jiffy = 16 * 1000 / HZ; module_param(nocb_nobypass_lim_per_jiffy, int, 0); /* -- 2.9.5
[PATCH tip/core/rcu 09/12] rcu/nocb: Fix missed nocb_timer requeue
From: Frederic Weisbecker This sequence of events can lead to a failure to requeue a CPU's ->nocb_timer: 1. There are no callbacks queued for any CPU covered by CPU 0-2's ->nocb_gp_kthread. Note that ->nocb_gp_kthread is associated with CPU 0. 2. CPU 1 enqueues its first callback with interrupts disabled, and thus must defer awakening its ->nocb_gp_kthread. It therefore queues its rcu_data structure's ->nocb_timer. At this point, CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE. 3. CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a callback, but with interrupts enabled, allowing it to directly awaken the ->nocb_gp_kthread. 4. The newly awakened ->nocb_gp_kthread associates both CPU 1's and CPU 2's callbacks with a future grace period and arranges for that grace period to be started. 5. This ->nocb_gp_kthread goes to sleep waiting for the end of this future grace period. 6. This grace period elapses before CPU 1's timer fires. This is normally improbable given that the timer is set for only one jiffy, but timers can be delayed. Besides, it is possible that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. 7. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 1's and CPU 2's ->nocb_cb_kthread. Then ->nocb_gp_kthread sleeps waiting for more newly queued callbacks. 8. CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps waiting for more invocable callbacks. 9. Note that neither kthread updated any ->nocb_timer state, so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE. 10. CPU 1 enqueues its second callback, this time with interrupts enabled so it can directly wake ->nocb_gp_kthread. It does so by calling wake_nocb_gp(), which also cancels the pending timer that got queued in step 2. But that doesn't reset CPU 1's ->nocb_defer_wakeup, which is still set to RCU_NOCB_WAKE. So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now desynchronized. 11. ->nocb_gp_kthread associates the callback queued in step 10 with a new grace period, arranges for that grace period to start and sleeps waiting for it to complete. 12. The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread, which in turn wakes up CPU 1's ->nocb_cb_kthread which then invokes the callback queued in step 10. 13. CPU 1 enqueues its third callback, this time with interrupts disabled so it must queue a timer for a deferred wakeup. However, the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE, which incorrectly indicates that a timer is already queued. Instead, CPU 1's ->nocb_timer was cancelled in step 10. CPU 1 therefore fails to queue the ->nocb_timer. 14. CPU 1's pending callback may go unnoticed until some other CPU eventually wakes up ->nocb_gp_kthread or until CPU 1 invokes an explicit deferred wakeup, for example, during idle entry. This commit fixes this bug by resetting rdp->nocb_defer_wakeup every time we delete the ->nocb_timer. It is quite possible that there is a similar scenario involving ->nocb_bypass_timer and ->nocb_defer_wakeup. However, despite some effort from several people, a failure scenario has not yet been located. But that by no means guarantees that no such scenario exists. Finding a failure scenario is left as an exercise for the reader, and the "Fixes:" tag below relates to ->nocb_bypass_timer instead of ->nocb_timer.
Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing) Cc: Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Boqun Feng Reviewed-by: Neeraj Upadhyay Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index a1a17ad..e392bd1 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1708,7 +1708,11 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force, rcu_nocb_unlock_irqrestore(rdp, flags); return false; } - del_timer(&rdp->nocb_timer); + + if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) { + WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); + del_timer(&rdp->nocb_timer); + } rcu_nocb_unlock_irqrestore(rdp, flags); raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) { @@ -2335,7 +2339,6 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp) return false; } ndw = READ_ONCE(rdp->nocb_defer_wakeup); - WRITE_ONCE(rdp->nocb_defer
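The invariant the fix restores is that ->nocb_defer_wakeup and the ->nocb_timer it describes always change together: cancel the timer and clear the flag as one step. A toy standalone model of that invariant (names illustrative, not kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    enum defer { WAKE_NOT, WAKE };

    struct cpu_state {
        enum defer defer_wakeup;    /* models rdp->nocb_defer_wakeup */
        bool timer_queued;          /* models the ->nocb_timer being armed */
    };

    /* The fix: whenever the timer is cancelled, reset the flag with it, so
     * a later "is a timer already queued?" check sees consistent state. */
    static void cancel_deferred_wakeup(struct cpu_state *cs)
    {
        if (cs->defer_wakeup > WAKE_NOT) {
            cs->defer_wakeup = WAKE_NOT;
            cs->timer_queued = false;    /* del_timer() */
        }
    }

    static void queue_deferred_wakeup(struct cpu_state *cs)
    {
        if (cs->defer_wakeup == WAKE_NOT) {    /* not already pending? */
            cs->defer_wakeup = WAKE;
            cs->timer_queued = true;    /* mod_timer() */
        }
    }

    int main(void)
    {
        struct cpu_state cs = { WAKE_NOT, false };

        queue_deferred_wakeup(&cs);    /* step 2: deferred wakeup armed */
        cancel_deferred_wakeup(&cs);   /* step 10: direct wakeup instead */
        queue_deferred_wakeup(&cs);    /* step 13: now succeeds again */
        printf("timer queued: %d\n", cs.timer_queued);
        return 0;
    }

With the pre-fix behavior (cancel without clearing the flag), the third call would see a stale WAKE value and skip arming the timer, which is the lost wakeup described above.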
[PATCH tip/core/rcu 11/12] rcu/nocb: Remove stale comment above rcu_segcblist_offload()
From: Frederic Weisbecker This commit removes a stale comment claiming that the cblist must be empty before changing the offloading state. This claim was correct back when the offloaded state was defined exclusively at boot. Reported-by: Paul E. McKenney Signed-off-by: Frederic Weisbecker Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 7f181c9..aaa1112 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -261,8 +261,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp) } /* - * Mark the specified rcu_segcblist structure as offloaded. This - * structure must be empty. + * Mark the specified rcu_segcblist structure as offloaded. */ void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload) { -- 2.9.5
[PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods
From: "Paul E. McKenney" There is a need for a non-blocking polling interface for RCU grace periods, so this commit supplies start_poll_synchronize_rcu() and poll_state_synchronize_rcu() for this purpose. Note that the existing get_state_synchronize_rcu() may be used if future grace periods are inevitable (perhaps due to a later call_rcu() invocation). The new start_poll_synchronize_rcu() is to be used if future grace periods might not otherwise happen. Finally, poll_state_synchronize_rcu() provides a lockless check for a grace period having elapsed since the corresponding call to either of the get_state_synchronize_rcu() or start_poll_synchronize_rcu(). As with get_state_synchronize_rcu(), the return value from either get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in to a later call to either poll_state_synchronize_rcu() or the existing (might_sleep) cond_synchronize_rcu(). Signed-off-by: Paul E. McKenney --- include/linux/rcutree.h | 2 ++ kernel/rcu/tree.c | 70 + 2 files changed, 67 insertions(+), 5 deletions(-) diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index df578b7..b89b541 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -41,6 +41,8 @@ void rcu_momentary_dyntick_idle(void); void kfree_rcu_scheduler_running(void); bool rcu_gp_might_be_stalled(void); unsigned long get_state_synchronize_rcu(void); +unsigned long start_poll_synchronize_rcu(void); +bool poll_state_synchronize_rcu(unsigned long oldstate); void cond_synchronize_rcu(unsigned long oldstate); void rcu_idle_enter(void); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f521..8e0a140 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3774,8 +3774,8 @@ EXPORT_SYMBOL_GPL(synchronize_rcu); * get_state_synchronize_rcu - Snapshot current RCU state * * Returns a cookie that is used by a later call to cond_synchronize_rcu() - * to determine whether or not a full grace period has elapsed in the - * meantime. + * or poll_state_synchronize_rcu() to determine whether or not a full + * grace period has elapsed in the meantime. */ unsigned long get_state_synchronize_rcu(void) { @@ -3789,13 +3789,73 @@ unsigned long get_state_synchronize_rcu(void) EXPORT_SYMBOL_GPL(get_state_synchronize_rcu); /** + * start_poll_state_synchronize_rcu - Snapshot and start RCU grace period + * + * Returns a cookie that is used by a later call to cond_synchronize_rcu() + * or poll_state_synchronize_rcu() to determine whether or not a full + * grace period has elapsed in the meantime. If the needed grace period + * is not already slated to start, notifies RCU core of the need for that + * grace period. + * + * Interrupts must be enabled for the case where it is necessary to awaken + * the grace-period kthread. + */ +unsigned long start_poll_synchronize_rcu(void) +{ + unsigned long flags; + unsigned long gp_seq = get_state_synchronize_rcu(); + bool needwake; + struct rcu_data *rdp; + struct rcu_node *rnp; + + lockdep_assert_irqs_enabled(); + local_irq_save(flags); + rdp = this_cpu_ptr(&rcu_data); + rnp = rdp->mynode; + raw_spin_lock_rcu_node(rnp); // irqs already disabled. 
+ needwake = rcu_start_this_gp(rnp, rdp, gp_seq); + raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + if (needwake) + rcu_gp_kthread_wake(); + return gp_seq; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); + +/** + * poll_state_synchronize_rcu - Conditionally wait for an RCU grace period + * + * @oldstate: return from call to get_state_synchronize_rcu() or start_poll_synchronize_rcu() + * + * If a full RCU grace period has elapsed since the earlier call from + * which oldstate was obtained, return @true, otherwise return @false. + * Otherwise, invoke synchronize_rcu() to wait for a full grace period. + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!). + * Those needing to keep oldstate values for very long time periods + * (many hours even on 32-bit systems) should check them occasionally + * and either refresh them or set a flag indicating that the grace period + * has completed. + */ +bool poll_state_synchronize_rcu(unsigned long oldstate) +{ + if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { + smp_mb(); /* Ensure GP ends before subsequent accesses. */ + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); + +/** * cond_synchronize_rcu - Conditionally wait for an RCU grace period * * @oldstate: return value from earlier call to get_state_synchronize_rcu() * * If a full RCU grace period has elapsed since the earlier call to - * get_state_synchronize_rcu(), just return. Otherwise, invoke - * syn
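A typical caller pairs one of the snapshot functions with a later poll or conditional wait. The sketch below is kernel-style code that builds only against a tree containing this patch; the wrapper names and the cookie variable are hypothetical, and only the *_synchronize_rcu() calls come from the patch:

    static unsigned long gp_cookie;

    static void record_update(void)
    {
        /*
         * No call_rcu() is queued here, so a future grace period is not
         * guaranteed: use start_poll_synchronize_rcu() rather than
         * get_state_synchronize_rcu() to make sure one gets underway.
         */
        gp_cookie = start_poll_synchronize_rcu();
    }

    static bool update_visible_to_all(void)
    {
        /* Lockless and non-blocking, so usable from most contexts. */
        return poll_state_synchronize_rcu(gp_cookie);
    }

    static void wait_for_update(void)
    {
        /* Sleeps only if the grace period has not yet completed. */
        cond_synchronize_rcu(gp_cookie);
    }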
[PATCH tip/core/rcu 3/3] rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
From: "Paul E. McKenney" This commit causes rcutorture to test the new start_poll_synchronize_rcu() and poll_state_synchronize_rcu() functions. Because of the difficulty of determining the nature of a synchronous RCU grace (expedited or not), the test that insisted that poll_state_synchronize_rcu() detect an intervening synchronize_rcu() had to be dropped. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 99657ff..956e6bf 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -494,6 +494,8 @@ static struct rcu_torture_ops rcu_ops = { .sync = synchronize_rcu, .exp_sync = synchronize_rcu_expedited, .get_gp_state = get_state_synchronize_rcu, + .start_gp_poll = start_poll_synchronize_rcu, + .poll_gp_state = poll_state_synchronize_rcu, .cond_sync = cond_synchronize_rcu, .call = call_rcu, .cb_barrier = rcu_barrier, @@ -1223,14 +1225,6 @@ rcu_torture_writer(void *arg) WARN_ON_ONCE(1); break; } - if (cur_ops->get_gp_state && cur_ops->poll_gp_state) - WARN_ONCE(rcu_torture_writer_state != RTWS_DEF_FREE && - !cur_ops->poll_gp_state(cookie), - "%s: Cookie check 2 failed %s(%d) %lu->%lu\n", - __func__, - rcu_torture_writer_state_getname(), - rcu_torture_writer_state, - cookie, cur_ops->get_gp_state()); } WRITE_ONCE(rcu_torture_current_version, rcu_torture_current_version + 1); @@ -1589,7 +1583,7 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid) preempt_enable(); if (cur_ops->get_gp_state && cur_ops->poll_gp_state) WARN_ONCE(cur_ops->poll_gp_state(cookie), - "%s: Cookie check 3 failed %s(%d) %lu->%lu\n", + "%s: Cookie check 2 failed %s(%d) %lu->%lu\n", __func__, rcu_torture_writer_state_getname(), rcu_torture_writer_state, -- 2.9.5
[PATCH tip/core/rcu 2/3] rcu: Provide polling interfaces for Tiny RCU grace periods
From: "Paul E. McKenney" There is a need for a non-blocking polling interface for RCU grace periods, so this commit supplies start_poll_synchronize_rcu() and poll_state_synchronize_rcu() for this purpose. Note that the existing get_state_synchronize_rcu() may be used if future grace periods are inevitable (perhaps due to a later call_rcu() invocation). The new start_poll_synchronize_rcu() is to be used if future grace periods might not otherwise happen. Finally, poll_state_synchronize_rcu() provides a lockless check for a grace period having elapsed since the corresponding call to either of the get_state_synchronize_rcu() or start_poll_synchronize_rcu(). As with get_state_synchronize_rcu(), the return value from either get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in to a later call to either poll_state_synchronize_rcu() or the existing (might_sleep) cond_synchronize_rcu(). Signed-off-by: Paul E. McKenney --- include/linux/rcutiny.h | 11 ++- kernel/rcu/tiny.c | 40 2 files changed, 46 insertions(+), 5 deletions(-) diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 2a97334..69108cf4 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -17,14 +17,15 @@ /* Never flag non-existent other CPUs! */ static inline bool rcu_eqs_special_set(int cpu) { return false; } -static inline unsigned long get_state_synchronize_rcu(void) -{ - return 0; -} +unsigned long get_state_synchronize_rcu(void); +unsigned long start_poll_synchronize_rcu(void); +bool poll_state_synchronize_rcu(unsigned long oldstate); static inline void cond_synchronize_rcu(unsigned long oldstate) { - might_sleep(); + if (poll_state_synchronize_rcu(oldstate)) + return; + synchronize_rcu(); } extern void rcu_barrier(void); diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c index aa897c3..c8a029f 100644 --- a/kernel/rcu/tiny.c +++ b/kernel/rcu/tiny.c @@ -32,12 +32,14 @@ struct rcu_ctrlblk { struct rcu_head *rcucblist; /* List of pending callbacks (CBs). */ struct rcu_head **donetail; /* ->next pointer of last "done" CB. */ struct rcu_head **curtail; /* ->next pointer of last CB. */ + unsigned long gp_seq; /* Grace-period counter. */ }; /* Definition for rcupdate control block. */ static struct rcu_ctrlblk rcu_ctrlblk = { .donetail = &rcu_ctrlblk.rcucblist, .curtail= &rcu_ctrlblk.rcucblist, + .gp_seq = 0 - 300UL, }; void rcu_barrier(void) @@ -56,6 +58,7 @@ void rcu_qs(void) rcu_ctrlblk.donetail = rcu_ctrlblk.curtail; raise_softirq_irqoff(RCU_SOFTIRQ); } + WRITE_ONCE(rcu_ctrlblk.gp_seq, rcu_ctrlblk.gp_seq + 1); local_irq_restore(flags); } @@ -177,6 +180,43 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) } EXPORT_SYMBOL_GPL(call_rcu); +/* + * Return a grace-period-counter "cookie". For more information, + * see the Tree RCU header comment. + */ +unsigned long get_state_synchronize_rcu(void) +{ + return READ_ONCE(rcu_ctrlblk.gp_seq); +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu); + +/* + * Return a grace-period-counter "cookie" and ensure that a future grace + * period completes. For more information, see the Tree RCU header comment. + */ +unsigned long start_poll_synchronize_rcu(void) +{ + unsigned long gp_seq = get_state_synchronize_rcu(); + + if (unlikely(is_idle_task(current))) { + /* force scheduling for rcu_qs() */ + resched_cpu(0); + } + return gp_seq; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); + +/* + * Return true if the grace period corresponding to oldstate has completed + * and false otherwise. For more information, see the Tree RCU header + * comment. 
+ */ +bool poll_state_synchronize_rcu(unsigned long oldstate) +{ + return READ_ONCE(rcu_ctrlblk.gp_seq) != oldstate; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); + void __init rcu_init(void) { open_softirq(RCU_SOFTIRQ, rcu_process_callbacks); -- 2.9.5
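Because Tiny RCU is single-CPU, its polling state reduces to a bare sequence counter: the cookie is a snapshot, and the poll succeeds once the counter has advanced past it. A runnable userspace model of that semantic (ordinary C, no kernel dependencies; names are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long gp_seq = 0 - 300UL;    /* start near wrap, as the patch does */

    /* A quiescent state ends the current grace period. */
    static void quiescent_state(void)
    {
        gp_seq++;
    }

    static unsigned long get_state(void)
    {
        return gp_seq;
    }

    static bool poll_state(unsigned long oldstate)
    {
        return gp_seq != oldstate;
    }

    int main(void)
    {
        unsigned long cookie = get_state();

        printf("done yet? %d\n", poll_state(cookie));    /* 0: no GP yet */
        quiescent_state();
        printf("done yet? %d\n", poll_state(cookie));    /* 1: GP elapsed */
        return 0;
    }

The inequality test keeps working across counter wrap, which is why the real code can start the counter a few hundred steps short of wrapping as a self-check.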
[PATCH tip/core/rcu 4/5] rcu: Make rcu_read_unlock_special() expedite strict grace periods
From: "Paul E. McKenney" In kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, every grace period is an expedited grace period. However, rcu_read_unlock_special() does not treat them that way, instead allowing the deferred quiescent state to be reported whenever. This commit therefore adds a check of this Kconfig option that causes rcu_read_unlock_special() to treat all grace periods as expedited for CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index e17cb23..a21c41c 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -615,6 +615,7 @@ static void rcu_read_unlock_special(struct task_struct *t) expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) || (rdp->grpmask & READ_ONCE(rnp->expmask)) || + IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) || (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node); // Need to defer quiescent state until everything is enabled. -- 2.9.5
[PATCH tip/core/rcu 1/5] rcu: Expedite deboost in case of deferred quiescent state
From: "Paul E. McKenney" Historically, a task that has been subjected to RCU priority boosting is deboosted at rcu_read_unlock() time. However, with the advent of deferred quiescent states, if the outermost rcu_read_unlock() was invoked with either bottom halves, interrupts, or preemption disabled, the deboosting will be delayed for some time. During this time, a low-priority process might be incorrectly running at a high real-time priority level. Fortunately, rcu_read_unlock_special() already provides mechanisms for forcing a minimal deferral of quiescent states, at least for kernels built with CONFIG_IRQ_WORK=y. These mechanisms are currently used when expedited grace periods are pending that might be blocked by the current task. This commit therefore causes those mechanisms to also be used in cases where the current task has been or might soon be subjected to RCU priority boosting. Note that this applies to all kernels built with CONFIG_RCU_BOOST=y, regardless of whether or not they are also built with CONFIG_PREEMPT_RT=y. This approach assumes that kernels build for use with aggressive real-time applications are built with CONFIG_IRQ_WORK=y. It is likely to be far simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting scheme that works correctly in its absence. While in the area, alphabetize the rcu_preempt_deferred_qs_handler() function's local variables. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Lai Jiangshan Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2d60377..e17cb23 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -598,9 +598,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp) static void rcu_read_unlock_special(struct task_struct *t) { unsigned long flags; + bool irqs_were_disabled; bool preempt_bh_were_disabled = !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)); - bool irqs_were_disabled; /* NMI handlers cannot block and cannot safely manipulate state. */ if (in_nmi()) @@ -609,30 +609,32 @@ static void rcu_read_unlock_special(struct task_struct *t) local_irq_save(flags); irqs_were_disabled = irqs_disabled_flags(flags); if (preempt_bh_were_disabled || irqs_were_disabled) { - bool exp; + bool expboost; // Expedited GP in flight or possible boosting. struct rcu_data *rdp = this_cpu_ptr(&rcu_data); struct rcu_node *rnp = rdp->mynode; - exp = (t->rcu_blocked_node && - READ_ONCE(t->rcu_blocked_node->exp_tasks)) || - (rdp->grpmask & READ_ONCE(rnp->expmask)); + expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) || + (rdp->grpmask & READ_ONCE(rnp->expmask)) || + (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && + t->rcu_blocked_node); // Need to defer quiescent state until everything is enabled. - if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) { + if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) { // Using softirq, safe to awaken, and either the - // wakeup is free or there is an expedited GP. + // wakeup is free or there is either an expedited + // GP in flight or a potential need to deboost. raise_softirq_irqoff(RCU_SOFTIRQ); } else { // Enabling BH or preempt does reschedule, so... - // Also if no expediting, slow is OK. - // Plus nohz_full CPUs eventually get tick enabled. + // Also if no expediting and no possible deboosting, + // slow is OK. 
Plus nohz_full CPUs eventually get + // tick enabled. set_tsk_need_resched(current); set_preempt_need_resched(); if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && - !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) { + expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) { // Get scheduler to re-evaluate and call hooks. // If !IRQ_WORK, FQS scan will eventually IPI. - init_irq_work(&rdp->defer_qs_iw, - rcu_preempt_deferred_qs_handler); + init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
[PATCH tip/core/rcu 3/5] rcutorture: Fix testing of RCU priority boosting
From: "Paul E. McKenney" Currently, rcutorture refuses to test RCU priority boosting in CONFIG_HOTPLUG_CPU=y kernels, which are the only kind normally built on x86 these days. This commit therefore updates rcutorture's tests of RCU priority boosting to make them safe for CPU hotplug. However, these tests will fail unless TIMER_SOFTIRQ runs at realtime priority, which does not happen in current mainline. This commit therefore also refuses to test RCU priority boosting except in kernels built with CONFIG_PREEMPT_RT=y. While in the area, this commt adds some debug output at boost-fail time that helps diagnose the cause of the failure, for example, failing to run TIMER_SOFTIRQ at realtime priority. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 36 ++-- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 99657ff..af64bd8 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -245,11 +245,11 @@ static const char *rcu_torture_writer_state_getname(void) return rcu_torture_writer_state_names[i]; } -#if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) -#define rcu_can_boost() 1 -#else /* #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */ -#define rcu_can_boost() 0 -#endif /* #else #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */ +#if defined(CONFIG_RCU_BOOST) && defined(CONFIG_PREEMPT_RT) +# define rcu_can_boost() 1 +#else +# define rcu_can_boost() 0 +#endif #ifdef CONFIG_RCU_TRACE static u64 notrace rcu_trace_clock_local(void) @@ -923,9 +923,13 @@ static void rcu_torture_enable_rt_throttle(void) static bool rcu_torture_boost_failed(unsigned long start, unsigned long end) { + static int dbg_done; + if (end - start > test_boost_duration * HZ - HZ / 2) { VERBOSE_TOROUT_STRING("rcu_torture_boost boosting failed"); n_rcu_torture_boost_failure++; + if (!xchg(&dbg_done, 1) && cur_ops->gp_kthread_dbg) + cur_ops->gp_kthread_dbg(); return true; /* failed */ } @@ -948,8 +952,8 @@ static int rcu_torture_boost(void *arg) init_rcu_head_on_stack(&rbi.rcu); /* Each pass through the following loop does one boost-test cycle. */ do { - /* Track if the test failed already in this test interval? */ - bool failed = false; + bool failed = false; // Test failed already in this test interval + bool firsttime = true; /* Increment n_rcu_torture_boosts once per boost-test */ while (!kthread_should_stop()) { @@ -975,18 +979,17 @@ static int rcu_torture_boost(void *arg) /* Do one boost-test interval. */ endtime = oldstarttime + test_boost_duration * HZ; - call_rcu_time = jiffies; while (time_before(jiffies, endtime)) { /* If we don't have a callback in flight, post one. */ if (!smp_load_acquire(&rbi.inflight)) { /* RCU core before ->inflight = 1. */ smp_store_release(&rbi.inflight, 1); - call_rcu(&rbi.rcu, rcu_torture_boost_cb); + cur_ops->call(&rbi.rcu, rcu_torture_boost_cb); /* Check if the boost test failed */ - failed = failed || -rcu_torture_boost_failed(call_rcu_time, -jiffies); + if (!firsttime && !failed) + failed = rcu_torture_boost_failed(call_rcu_time, jiffies); call_rcu_time = jiffies; + firsttime = false; } if (stutter_wait("rcu_torture_boost")) sched_set_fifo_low(current); @@ -999,7 +1002,7 @@ static int rcu_torture_boost(void *arg) * this case the boost check would never happen in the above * loop so do another one here. 
*/ - if (!failed && smp_load_acquire(&rbi.inflight)) + if (!firsttime && !failed && smp_load_acquire(&rbi.inflight)) rcu_torture_boost_failed(call_rcu_time, jiffies); /* @@ -1025,6 +1028,9 @@ checkwait:if (stutter_wait("rcu_torture_boost")) sched_set_fifo_low(current); } while (!torture_must_stop()); + while (smp_load_acquire(&rbi.inflight)) + schedule_timeout_uninterruptible(1); // rcu_barrier() deadlocks. + /* Clean up and exit. */ wh
[PATCH tip/core/rcu 2/5] rcutorture: Make TREE03 use real-time tree.use_softirq setting
From: "Paul E. McKenney" TREE03 tests RCU priority boosting, which is a real-time feature. It would also be good if it tested something closer to what is actually used by the real-time folks. This commit therefore adds tree.use_softirq=0 to the TREE03 kernel boot parameters in TREE03.boot. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot index 1c21894..64f864f1 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot @@ -4,3 +4,4 @@ rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3 rcutree.kthread_prio=2 threadirqs +tree.use_softirq=0 -- 2.9.5
[PATCH tip/core/rcu 1/2] rcu-tasks: Rectify kernel-doc for struct rcu_tasks
From: Lukas Bulwahn The command 'find ./kernel/rcu/ | xargs ./scripts/kernel-doc -none' reported an issue with the kernel-doc of struct rcu_tasks. This commit rectifies the kernel-doc, such that no issues remain for ./kernel/rcu/. Signed-off-by: Lukas Bulwahn Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index af7c194..17c8ebe 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -20,7 +20,7 @@ typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp); typedef void (*postgp_func_t)(struct rcu_tasks *rtp); /** - * Definition for a Tasks-RCU-like mechanism. + * struct rcu_tasks - Definition for a Tasks-RCU-like mechanism. * @cbs_head: Head of callback list. * @cbs_tail: Tail pointer for callback list. * @cbs_wq: Wait queue allowning new callback to get kthread's attention. @@ -38,7 +38,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp); * @pregp_func: This flavor's pre-grace-period function (optional). * @pertask_func: This flavor's per-task scan function (optional). * @postscan_func: This flavor's post-task scan function (optional). - * @holdout_func: This flavor's holdout-list scan function (optional). + * @holdouts_func: This flavor's holdout-list scan function (optional). * @postgp_func: This flavor's post-grace-period function (optional). * @call_func: This flavor's call_rcu()-equivalent function. * @name: This flavor's textual name. -- 2.9.5
[PATCH tip/core/rcu 2/2] rcu-tasks: Add block comment laying out RCU Tasks Trace design
From: "Paul E. McKenney" This commit adds a block comment that gives a high-level overview of how RCU tasks trace grace periods progress. It also adds a note about how exiting tasks are handled, plus it gives an overview of the memory ordering. Reported-by: Peter Zijlstra Reported-by: Mathieu Desnoyers [ paulmck: Fix commit log per Mathieu Desnoyers feedback. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 36 1 file changed, 36 insertions(+) diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index 17c8ebe..f818357 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -726,6 +726,42 @@ EXPORT_SYMBOL_GPL(show_rcu_tasks_rude_gp_kthread); // flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace // readers can operate from idle, offline, and exception entry/exit in no // way allows rcu_preempt and rcu_sched readers to also do so. +// +// The implementation uses rcu_tasks_wait_gp(), which relies on function +// pointers in the rcu_tasks structure. The rcu_spawn_tasks_trace_kthread() +// function sets these function pointers up so that rcu_tasks_wait_gp() +// invokes these functions in this order: +// +// rcu_tasks_trace_pregp_step(): +// Initialize the count of readers and block CPU-hotplug operations. +// rcu_tasks_trace_pertask(), invoked on every non-idle task: +// Initialize per-task state and attempt to identify an immediate +// quiescent state for that task, or, failing that, attempt to set +// that task's .need_qs flag so that that task's next outermost +// rcu_read_unlock_trace() will report the quiescent state (in which +// case the count of readers is incremented). If both attempts fail, +// the task is added to a "holdout" list. +// rcu_tasks_trace_postscan(): +// Initialize state and attempt to identify an immediate quiescent +// state as above (but only for idle tasks), unblock CPU-hotplug +// operations, and wait for an RCU grace period to avoid races with +// tasks that are in the process of exiting. +// check_all_holdout_tasks_trace(), repeatedly until holdout list is empty: +// Scans the holdout list, attempting to identify a quiescent state +// for each task on the list. If there is a quiescent state, the +// corresponding task is removed from the holdout list. +// rcu_tasks_trace_postgp(): +// Wait for the count of readers do drop to zero, reporting any stalls. +// Also execute full memory barriers to maintain ordering with code +// executing after the grace period. +// +// The exit_tasks_rcu_finish_trace() synchronizes with exiting tasks. +// +// Pre-grace-period update-side code is ordered before the grace +// period via the ->cbs_lock and barriers in rcu_tasks_kthread(). +// Pre-grace-period read-side code is ordered before the grace period by +// atomic_dec_and_test() of the count of readers (for IPIed readers) and by +// scheduler context-switch ordering (for locked-down non-running readers). // The lockdep state must be outside of #ifdef to be useful. #ifdef CONFIG_DEBUG_LOCK_ALLOC -- 2.9.5
[PATCH tip/core/rcu 2/2] rcutorture: Replace rcu_torture_stall string with %s
From: Stephen Zhang This commit replaces a hard-coded "rcu_torture_stall" string in a pr_alert() format with "%s" and __func__. Signed-off-by: Stephen Zhang Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 99657ff..271726e 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1971,8 +1971,8 @@ static int rcu_torture_stall(void *args) local_irq_disable(); else if (!stall_cpu_block) preempt_disable(); - pr_alert("rcu_torture_stall start on CPU %d.\n", -raw_smp_processor_id()); + pr_alert("%s start on CPU %d.\n", + __func__, raw_smp_processor_id()); while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(), stop_at)) if (stall_cpu_block) @@ -1983,7 +1983,7 @@ static int rcu_torture_stall(void *args) preempt_enable(); cur_ops->readunlock(idx); } - pr_alert("rcu_torture_stall end.\n"); + pr_alert("%s end.\n", __func__); torture_shutdown_absorb("rcu_torture_stall"); while (!kthread_should_stop()) schedule_timeout_interruptible(10 * HZ); -- 2.9.5
[PATCH tip/core/rcu 1/2] torture: Replace torture_init_begin string with %s
From: Stephen Zhang This commit replaces a hard-coded "torture_init_begin" string in a pr_alert() format with "%s" and __func__. Signed-off-by: Stephen Zhang Signed-off-by: Paul E. McKenney --- kernel/torture.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/torture.c b/kernel/torture.c index 01e336f..0a315c3 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -816,9 +816,9 @@ bool torture_init_begin(char *ttype, int v) { mutex_lock(&fullstop_mutex); if (torture_type != NULL) { - pr_alert("torture_init_begin: Refusing %s init: %s running.\n", -ttype, torture_type); - pr_alert("torture_init_begin: One torture test at a time!\n"); + pr_alert("%s: Refusing %s init: %s running.\n", + __func__, ttype, torture_type); + pr_alert("%s: One torture test at a time!\n", __func__); mutex_unlock(&fullstop_mutex); return false; } -- 2.9.5
[PATCH tip/core/rcu 01/28] torturescript: Don't rerun failed rcutorture builds
From: "Paul E. McKenney" If the build fails when running multiple instances of a given rcutorture scenario, for example, using the kvm.sh --configs "8*RUDE01" argument, the build will be rerun an additional seven times. This is in some sense correct, but it can waste significant time. This commit therefore checks for a prior failed build and simply copies over that build's output. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 536d103..9d8a82c 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -73,7 +73,7 @@ config_override_param "--kconfig argument" KcList "$TORTURE_KCONFIG_ARG" cp $T/KcList $resdir/ConfigFragment base_resdir=`echo $resdir | sed -e 's/\.[0-9]\+$//'` -if test "$base_resdir" != "$resdir" -a -f $base_resdir/bzImage -a -f $base_resdir/vmlinux +if test "$base_resdir" != "$resdir" && test -f $base_resdir/bzImage && test -f $base_resdir/vmlinux then # Rerunning previous test, so use that test's kernel. QEMU="`identify_qemu $base_resdir/vmlinux`" @@ -83,6 +83,17 @@ then ln -s $base_resdir/.config $resdir # for kvm-recheck.sh # Arch-independent indicator touch $resdir/builtkernel +elif test "$base_resdir" != "$resdir" +then + # Rerunning previous test for which build failed + ln -s $base_resdir/Make*.out $resdir # for kvm-recheck.sh + ln -s $base_resdir/.config $resdir # for kvm-recheck.sh + echo Initial build failed, not running KVM, see $resdir. + if test -f $builddir.wait + then + mv $builddir.wait $builddir.ready + fi + exit 1 elif kvm-build.sh $T/KcList $resdir then # Had to build a kernel for this test. -- 2.9.5
[PATCH tip/core/rcu 5/5] torture: Make jitter.sh handle large systems
From: "Paul E. McKenney" The current jitter.sh script expects cpumask bits to fit into whatever the awk interpreter uses for an integer, which clearly does not hold for even medium-sized systems these days. This means that on a large system, only the first 32 or 64 CPUs (depending) are subjected to jitter.sh CPU-time perturbations. This commit therefore computes a given CPU's cpumask using text manipulation rather than arithmetic shifts. Reported-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/jitter.sh | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/jitter.sh b/tools/testing/selftests/rcutorture/bin/jitter.sh index 188b864..3a856ec 100755 --- a/tools/testing/selftests/rcutorture/bin/jitter.sh +++ b/tools/testing/selftests/rcutorture/bin/jitter.sh @@ -67,10 +67,10 @@ do srand(n + me + systime()); ncpus = split(cpus, ca); curcpu = ca[int(rand() * ncpus + 1)]; - mask = lshift(1, curcpu); - if (mask + 0 <= 0) - mask = 1; - printf("%#x\n", mask); + z = ""; + for (i = 1; 4 * i <= curcpu; i++) + z = z "0"; + print "0x" 2 ^ (curcpu % 4) z; }' < /dev/null` n=$(($n+1)) if ! taskset -p $cpumask $$ > /dev/null 2>&1 -- 2.9.5
[PATCH tip/core/rcu 02/28] torture: Allow 1G of memory for torture.sh kvfree testing
From: "Paul E. McKenney" Yes, I do recall a time when 512MB of memory was a lot of mass storage, much less main memory, but the rcuscale kvfree_rcu() testing invoked by torture.sh can sometimes exceed it on large systems, resulting in OOM. This commit therefore causes torture.sh to pase the "--memory 1G" argument to kvm.sh to reserve a full gigabyte for this purpose. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/torture.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/torture.sh b/tools/testing/selftests/rcutorture/bin/torture.sh index ad7525b..56e2e1a 100755 --- a/tools/testing/selftests/rcutorture/bin/torture.sh +++ b/tools/testing/selftests/rcutorture/bin/torture.sh @@ -374,7 +374,7 @@ done if test "$do_kvfree" = "yes" then torture_bootargs="rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=1 torture.disable_onoff_at_boot" - torture_set "rcuscale-kvfree" tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig "CONFIG_NR_CPUS=$HALF_ALLOTED_CPUS" --trust-make + torture_set "rcuscale-kvfree" tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig "CONFIG_NR_CPUS=$HALF_ALLOTED_CPUS" --memory 1G --trust-make fi echo " --- " $scriptname $args -- 2.9.5
[PATCH tip/core/rcu 03/28] torture: Provide bare-metal modprobe-based advice
From: "Paul E. McKenney" In some environments, the torture-testing use of virtualization is inconvenient. In such cases, the modprobe and rmmod commands may be used to do torture testing, but significant setup is required to build, boot, and modprobe a kernel so as to match a given torture-test scenario. This commit therefore creates a "bare-metal" file in each results directory containing steps to run the corresponding scenario using the modprobe command on bare metal. For example, the contents of this file after using kvm.sh to build an rcutorture TREE01 kernel, perhaps with the --buildonly argument, is as follows: To run this scenario on bare metal: 1. Set your bare-metal build tree to the state shown in this file: /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19/testid.txt 2. Update your bare-metal build tree's .config based on this file: /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19/TREE01/ConfigFragment 3. Make the bare-metal kernel's build system aware of your .config updates: $ yes "" | make oldconfig 4. Build your bare-metal kernel. 5. Boot your bare-metal kernel with the following parameters: maxcpus=8 nr_cpus=43 rcutree.gp_preinit_delay=3 rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3 rcu_nocbs=0-1,3-7 6. Start the test with the following command: $ modprobe rcutorture nocbs_nthreads=8 nocbs_toggle=1000 fwd_progress=0 onoff_interval=1000 onoff_holdoff=30 n_barrier_cbs=4 stat_interval=15 shutdown_secs=120 test_no_idle_hz=1 verbose=1 7. After some time, end the test with the following command: $ rmmod rcutorture 8. Copy your bare-metal kernel's .config file, overwriting this file: /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19/TREE01/.config 9. Copy the console output from just before the modprobe to just after the rmmod into this file: /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19/TREE01/console.log 10. Check for runtime errors using the following command: $ tools/testing/selftests/rcutorture/bin/kvm-recheck.sh /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19 Signed-off-by: Paul E. McKenney --- .../selftests/rcutorture/bin/kvm-test-1-run.sh | 44 +++--- tools/testing/selftests/rcutorture/bin/kvm.sh | 4 ++ 2 files changed, 42 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 9d8a82c..03c0410 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -7,15 +7,15 @@ # Execute this in the source tree. Do not run it as a background task # because qemu does not seem to like that much. # -# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args +# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args_in # # qemu-args defaults to "-enable-kvm -nographic", along with arguments # specifying the number of CPUs and other options # generated from the underlying CPU architecture. -# boot_args defaults to value returned by the per_version_boot_params +# boot_args_in defaults to value returned by the per_version_boot_params # shell function. # -# Anything you specify for either qemu-args or boot_args is appended to +# Anything you specify for either qemu-args or boot_args_in is appended to # the default values. The "-smp" value is deduced from the contents of # the config fragment. 
# @@ -134,7 +134,7 @@ do done seconds=$4 qemu_args=$5 -boot_args=$6 +boot_args_in=$6 if test -z "$TORTURE_BUILDONLY" then @@ -144,7 +144,7 @@ fi # Generate -smp qemu argument. qemu_args="-enable-kvm -nographic $qemu_args" cpu_count=`configNR_CPUS.sh $resdir/ConfigFragment` -cpu_count=`configfrag_boot_cpus "$boot_args" "$config_template" "$cpu_count"` +cpu_count=`configfrag_boot_cpus "$boot_args_in" "$config_template" "$cpu_count"` if test "$cpu_count" -gt "$TORTURE_ALLOTED_CPUS" then echo CPU count limited from $cpu_count to $TORTURE_ALLOTED_CPUS | tee -a $resdir/Warnings @@ -160,13 +160,45 @@ qemu_args="$qemu_args `identify_qemu_args "$QEMU" "$resdir/console.log"`" qemu_append="`identify_qemu_append "$QEMU"`" # Pull in Kconfig-fragment boot parameters -boot_args="`configfrag_boot_params "$boot_args" "$config_template"`" +boot_args="`configfrag_boot_params "$boot_args_in" "$config_template"`" # Generate kernel-version-specific boot parameters boot_args="`per_version_boot_params "$boot_args" $resdir/.config $seconds`" if test -n "$TORTURE_BOOT_GDB_ARG" then boot_args="$boot_args $TORTURE_BOOT_GDB_ARG" fi + +# Give bare-metal advice +modprobe_args="`echo $boot_args | tr -s ' ' '\012' | grep "^$TORTURE_MOD\." | sed -e "s/$TORTURE_MOD\.//g"
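The heart of the bare-metal advice is splitting the accumulated boot parameters into those that belong to the torture module (handed to modprobe with their "rcutorture." prefix stripped) and those that must stay on the bare-metal kernel command line. A minimal sketch of that split, with illustrative parameter values and a hypothetical kernel-side filter rather than the exact script text:

  # Split boot parameters: "$TORTURE_MOD."-prefixed ones go to modprobe
  # (prefix stripped); everything else remains a kernel boot parameter.
  TORTURE_MOD=rcutorture
  boot_args="maxcpus=8 rcutree.gp_init_delay=3 rcutorture.stat_interval=15 rcutorture.verbose=1"
  modprobe_args="`echo $boot_args | tr -s ' ' '\012' | grep "^$TORTURE_MOD\." | sed -e "s/$TORTURE_MOD\.//g" | tr '\012' ' '`"
  kboot_args="`echo $boot_args | tr -s ' ' '\012' | grep -v "^$TORTURE_MOD\." | tr '\012' ' '`"
  echo "Boot parameters:  $kboot_args"
  echo "modprobe command: modprobe $TORTURE_MOD $modprobe_args"

The trailing tr that rejoins the filtered words and the grep -v pass that forms the kernel-side list are assumptions in this sketch; the in-tree script may assemble these lists differently.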
[PATCH tip/core/rcu 05/28] rcuscale: Disable verbose torture-test output
From: "Paul E. McKenney" Given large numbers of threads, the quantity of torture-test output is sufficient to sometimes result in RCU CPU stall warnings. The probability of these stall warnings was greatly reduced by batching the output, but the warnings were not eliminated. However, the actual test only depends on console output that is printed even when rcuscale.verbose=0. This commit therefore causes this test to run with rcuscale.verbose=0. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh index 0333e9b..ffbe151 100644 --- a/tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh +++ b/tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh @@ -12,5 +12,5 @@ # Adds per-version torture-module parameters to kernels supporting them. per_version_boot_params () { echo $1 rcuscale.shutdown=1 \ - rcuscale.verbose=1 + rcuscale.verbose=0 } -- 2.9.5
[PATCH tip/core/rcu 04/28] torture: Improve readability of the testid.txt file
From: "Paul E. McKenney" The testid.txt file was intended for occasional in extremis use, but now that the new "bare-metal" file references it, it might see more use. This commit therefore labels sections of output and adds spacing to make it easier to see what needs to be done to make a bare-metal build tree match an rcutorture build tree. Of course, you can avoid this whole issue by building your bare-metal kernel in the same directory in which you ran rcutorture, but that might not always be an option. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm.sh | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index 35a2132..1de198d 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -404,11 +404,16 @@ echo $scriptname $args touch $resdir/$ds/log echo $scriptname $args >> $resdir/$ds/log echo ${TORTURE_SUITE} > $resdir/$ds/TORTURE_SUITE -pwd > $resdir/$ds/testid.txt +echo Build directory: `pwd` > $resdir/$ds/testid.txt if test -d .git then + echo Current commit: `git rev-parse HEAD` >> $resdir/$ds/testid.txt + echo >> $resdir/$ds/testid.txt + echo ' ---' Output of "'"git status"'": >> $resdir/$ds/testid.txt git status >> $resdir/$ds/testid.txt - git rev-parse HEAD >> $resdir/$ds/testid.txt + echo >> $resdir/$ds/testid.txt + echo >> $resdir/$ds/testid.txt + echo ' ---' Output of "'"git diff HEAD"'": >> $resdir/$ds/testid.txt git diff HEAD >> $resdir/$ds/testid.txt fi ___EOF___ -- 2.9.5
[PATCH tip/core/rcu 16/28] torture: Record kvm-test-1-run.sh and kvm-test-1-run-qemu.sh PIDs
From: "Paul E. McKenney" This commit records the process IDs of the kvm-test-1-run.sh and kvm-test-1-run-qemu.sh scripts to ease monitoring of remotely running instances of these scripts. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh | 2 ++ tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh index 6b0d71b..576a9b7 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh @@ -33,6 +33,8 @@ then exit 1 fi +echo ' ---' `date`: Starting kernel, PID $$ + # Obtain settings from the qemu-cmd file. grep '^#' $resdir/qemu-cmd | sed -e 's/^# //' > $T/qemu-cmd-settings . $T/qemu-cmd-settings diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index a69f8ae..a386ca8d 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -41,7 +41,7 @@ then echo "kvm-test-1-run.sh :$resdir: Not a writable directory, cannot store results into it" exit 1 fi -echo ' ---' `date`: Starting build +echo ' ---' `date`: Starting build, PID $$ echo ' ---' Kconfig fragment at: $config_template >> $resdir/log touch $resdir/ConfigFragment.input -- 2.9.5
[PATCH tip/core/rcu 06/28] refscale: Disable verbose torture-test output
From: "Paul E. McKenney" Given large numbers of threads, the quantity of torture-test output is sufficient to sometimes result in RCU CPU stall warnings. The probability of these stall warnings was greatly reduced by batching the output, but the warnings were not eliminated. However, the actual test only depends on console output that is printed even when refscale.verbose=0. This commit therefore causes this test to run with refscale.verbose=0. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh index 321e826..f81fa2c 100644 --- a/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh +++ b/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh @@ -12,5 +12,5 @@ # Adds per-version torture-module parameters to kernels supporting them. per_version_boot_params () { echo $1 refscale.shutdown=1 \ - refscale.verbose=1 + refscale.verbose=0 } -- 2.9.5
[PATCH tip/core/rcu 07/28] torture: Move build/run synchronization files into scenario directories
From: "Paul E. McKenney" Currently the bN.ready and bN.wait files are placed in the rcutorture directory, which really is not at all a good place for run-specific files. This commit therefore renames these files to build.ready and build.wait and then moves them into the scenario directories within the "res" directory, for example, into tools/testing/selftests/rcutorture/res/2021.02.10-15.08.23/TINY01. Signed-off-by: Paul E. McKenney --- .../selftests/rcutorture/bin/kvm-test-1-run.sh | 25 +++--- tools/testing/selftests/rcutorture/bin/kvm.sh | 10 - 2 files changed, 16 insertions(+), 19 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 03c0410..91578d3 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -7,7 +7,7 @@ # Execute this in the source tree. Do not run it as a background task # because qemu does not seem to like that much. # -# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args_in +# Usage: kvm-test-1-run.sh config resdir seconds qemu-args boot_args_in # # qemu-args defaults to "-enable-kvm -nographic", along with arguments # specifying the number of CPUs and other options @@ -35,8 +35,7 @@ mkdir $T config_template=${1} config_dir=`echo $config_template | sed -e 's,/[^/]*$,,'` title=`echo $config_template | sed -e 's/^.*\///'` -builddir=${2} -resdir=${3} +resdir=${2} if test -z "$resdir" -o ! -d "$resdir" -o ! -w "$resdir" then echo "kvm-test-1-run.sh :$resdir: Not a writable directory, cannot store results into it" @@ -89,9 +88,9 @@ then ln -s $base_resdir/Make*.out $resdir # for kvm-recheck.sh ln -s $base_resdir/.config $resdir # for kvm-recheck.sh echo Initial build failed, not running KVM, see $resdir. - if test -f $builddir.wait + if test -f $resdir/build.wait then - mv $builddir.wait $builddir.ready + mv $resdir/build.wait $resdir/build.ready fi exit 1 elif kvm-build.sh $T/KcList $resdir @@ -118,23 +117,23 @@ else # Build failed. cp .config $resdir || : echo Build failed, not running KVM, see $resdir. - if test -f $builddir.wait + if test -f $resdir/build.wait then - mv $builddir.wait $builddir.ready + mv $resdir/build.wait $resdir/build.ready fi exit 1 fi -if test -f $builddir.wait +if test -f $resdir/build.wait then - mv $builddir.wait $builddir.ready + mv $resdir/build.wait $resdir/build.ready fi -while test -f $builddir.ready +while test -f $resdir/build.ready do sleep 1 done -seconds=$4 -qemu_args=$5 -boot_args_in=$6 +seconds=$3 +qemu_args=$4 +boot_args_in=$5 if test -z "$TORTURE_BUILDONLY" then diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index 1de198d..7944510 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -444,7 +444,6 @@ function dump(first, pastlast, batchnum) print "needqemurun=" jn=1 for (j = first; j < pastlast; j++) { - builddir=KVM "/b" j - first + 1 cpusr[jn] = cpus[j]; if (cfrep[cf[j]] == "") { cfr[jn] = cf[j]; @@ -453,15 +452,15 @@ function dump(first, pastlast, batchnum) cfrep[cf[j]]++; cfr[jn] = cf[j] "." cfrep[cf[j]]; } + builddir=rd cfr[jn] "/build"; if (cpusr[jn] > ncpus && ncpus != 0) ovf = "-ovf"; else ovf = ""; print "echo ", cfr[jn], cpusr[jn] ovf ": Starting build. 
`date` | tee -a " rd "log"; - print "rm -f " builddir ".*"; - print "touch " builddir ".wait"; print "mkdir " rd cfr[jn] " || :"; - print "kvm-test-1-run.sh " CONFIGDIR cf[j], builddir, rd cfr[jn], dur " \"" TORTURE_QEMU_ARG "\" \"" TORTURE_BOOTARGS "\" > " rd cfr[jn] "/kvm-test-1-run.sh.out 2>&1 &" + print "touch " builddir ".wait"; + print "kvm-test-1-run.sh " CONFIGDIR cf[j], rd cfr[jn], dur " \"" TORTURE_QEMU_ARG "\" \"" TORTURE_BOOTARGS "\" > " rd cfr[jn] "/kvm-test-1-run.sh.out 2>&1 &" print "echo ", cfr[jn], cpusr[jn] ovf ": Waiting for build to complete. `date` | tee -a " rd "log"; print "while test -f " builddir ".wait" print "do" @@ -471,7 +470,7 @@ function dump(first, pastlast, batchnum) jn++; } for (j = 1; j < jn; j++) { - builddir=KVM "/b" j + builddir=rd cfr[j] "/build"; print "rm -f " builddir "
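Taken together, the synchronization now uses two files in each scenario directory, with the generated batch script on one side and kvm-test-1-run.sh on the other. A sketch of the handshake with simplified paths (CONFIGS/TINY01 and the target of the final rm are illustrative assumptions, not literal script text):

  # Generated batch script, one scenario (TINY01) shown:
  mkdir $resdir/TINY01 || :
  touch $resdir/TINY01/build.wait
  kvm-test-1-run.sh CONFIGS/TINY01 $resdir/TINY01 $dur "$qemu_args" "$boot_args" \
          > $resdir/TINY01/kvm-test-1-run.sh.out 2>&1 &
  while test -f $resdir/TINY01/build.wait    # build still in progress
  do
          sleep 1
  done
  # ... and once every build in the batch has completed (the exact file
  # removed here is an assumption):
  rm -f $resdir/TINY01/build.ready           # let the kernel run proceed

  # Inside kvm-test-1-run.sh, $resdir is the scenario directory passed above:
  if test -f $resdir/build.wait              # signal that the build is done
  then
          mv $resdir/build.wait $resdir/build.ready
  fi
  while test -f $resdir/build.ready          # wait for the rest of the batch
  do
          sleep 1
  done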
[PATCH tip/core/rcu 11/28] torture: Reverse jittering and duration parameters for jitter.sh
From: "Paul E. McKenney" Remote rcutorture testing requires that jitter.sh continue to be invoked from the generated script for local runs, but that it instead be invoked on the remote system for distributed runs. This argues for common jitterstart and jitterstop scripts. But it would be good for jitterstart and jitterstop to control the name and location of the "jittering" file, while continuing to have the duration controlled by the caller of these new scripts. This commit therefore reverses the order of the jittering and duration parameters for jitter.sh, so that the jittering parameter precedes the duration parameter. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/jitter.sh | 6 +++--- tools/testing/selftests/rcutorture/bin/kvm.sh| 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/jitter.sh b/tools/testing/selftests/rcutorture/bin/jitter.sh index ed0ea86..ff1d3e4 100755 --- a/tools/testing/selftests/rcutorture/bin/jitter.sh +++ b/tools/testing/selftests/rcutorture/bin/jitter.sh @@ -5,7 +5,7 @@ # of this script is to inflict random OS jitter on a concurrently running # test. # -# Usage: jitter.sh me duration jittering-path [ sleepmax [ spinmax ] ] +# Usage: jitter.sh me jittering-path duration [ sleepmax [ spinmax ] ] # # me: Random-number-generator seed salt. # duration: Time to run in seconds. @@ -18,8 +18,8 @@ # Authors: Paul E. McKenney me=$(($1 * 1000)) -duration=$2 -jittering=$3 +jittering=$2 +duration=$3 sleepmax=${4-100} spinmax=${5-1000} diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index de93802..a2ee3f2 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -504,7 +504,7 @@ function dump(first, pastlast, batchnum) print "\techo Starting kernels. `date` | tee -a " rd "log"; print "\ttouch " rd "jittering" for (j = 0; j < njitter; j++) - print "\tjitter.sh " j " " dur " " rd "jittering " ja[2] " " ja[3] "&" + print "\tjitter.sh " j " " rd "jittering " dur " " ja[2] " " ja[3] "&" print "\twhile ls $runfiles > /dev/null 2>&1" print "\tdo" print "\t\t:" -- 2.9.5