[PATCH clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. The first step is a way of injecting such delays, and this commit therefore provides a clocksource.inject_delay_freq and clocksource.inject_delay_run kernel boot parameters that specify that sufficient delay be injected to cause the clocksource_watchdog() function to mark a clock unstable. This delay is injected every Nth set of M calls to clocksource_watchdog(), where N is the value specified for the inject_delay_freq boot parameter and M is the value specified for the inject_delay_run boot parameter. Values of zero or less for either parameter disable delay injection, and the default for clocksource.inject_delay_freq is zero, that is, disabled. The default for clocksource.inject_delay_run is the value one, that is single-call runs. This facility is intended for diagnostic use only, and should be avoided on production systems. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen [ paulmck: Apply Rik van Riel feedback. ] Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 22 kernel/time/clocksource.c | 27 + 2 files changed, 49 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a10b545..9965266 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -577,6 +577,28 @@ loops can be debugged more effectively on production systems. + clocksource.inject_delay_freq= [KNL] + Number of runs of calls to clocksource_watchdog() + before delays are injected between reads from the + two clocksources. Values less than or equal to + zero disable this delay injection. These delays + can cause clocks to be marked unstable, so use + of this parameter should therefore be avoided on + production systems. Defaults to zero (disabled). + + clocksource.inject_delay_run= [KNL] + Run lengths of clocksource_watchdog() delay + injections. Specifying the value 8 will result + in eight consecutive delays followed by eight + times the value specified for inject_delay_freq + of consecutive non-delays. + + clocksource.max_read_retries= [KNL] + Number of clocksource_watchdog() retries due to + external delays before the clock will be marked + unstable. Defaults to three retries, that is, + four attempts to read the clock under test. + clearcpuid=BITNUM[,BITNUM...] [X86] Disable CPUID feature X for the kernel. 
See arch/x86/include/asm/cpufeatures.h for the valid bit diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cce484a..4be4391 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -14,6 +14,7 @@ #include /* for spin_unlock_irq() using preempt_count() m68k */ #include #include +#include #include "tick-internal.h" #include "timekeeping_internal.h" @@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs) spin_unlock_irqrestore(&watchdog_lock, flags); } +static int inject_delay_freq; +module_param(inject_delay_freq, int, 0644); +static int inject_delay_run = 1; +module_param(inject_delay_run, int, 0644); +static int max_read_retries = 3; +module_param(max_read_retries, int, 0644); + +static void clocksource_watchdog_inject_delay(void) +{ + int i; + static int injectfail = -1; + + if (inject_delay_freq <= 0 || inject_delay_run <= 0) + return; + if (injectfail < 0 || injectfail > INT_MAX / 2) + injectfail = inject_delay_run; + if (!(++injectfail / inject_delay_run % inject_delay_freq)) { + pr_warn("%s(): Injecting delay.\n", __func__); + for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++) + udelay(1000); + pr_warn("%s(): Done injecting delay.\n", __func_
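The delay cadence above ("every Nth set of M calls") is easier to see in isolation. The following standalone user-space C sketch, not kernel code, reuses the patch's injectfail expression with made-up example values (inject_delay_freq=3, inject_delay_run=2) to show which calls would be delayed:

/*
 * User-space sketch of the injection cadence used by
 * clocksource_watchdog_inject_delay(): with inject_delay_freq=N and
 * inject_delay_run=M, every Nth group of M consecutive calls is delayed.
 */
#include <stdio.h>
#include <limits.h>

static int inject_delay_freq = 3;   /* N: delay every 3rd run of calls */
static int inject_delay_run = 2;    /* M: each run is 2 consecutive calls */
static int injectfail = -1;

static int would_inject_delay(void)
{
    if (inject_delay_freq <= 0 || inject_delay_run <= 0)
        return 0;
    if (injectfail < 0 || injectfail > INT_MAX / 2)
        injectfail = inject_delay_run;
    /* Same expression as the patch: groups of M calls, every Nth group. */
    return !(++injectfail / inject_delay_run % inject_delay_freq);
}

int main(void)
{
    for (int call = 1; call <= 12; call++)
        printf("call %2d: %s\n", call,
               would_inject_delay() ? "inject delay" : "no delay");
    return 0;
}

With these example values, calls 4-5 and 10-11 are delayed, matching the "every Nth set of M calls" description.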
[PATCH clocksource 2/5] clocksource: Retry clock read if long delays detected
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. This commit therefore re-reads the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If retries were required, a message is printed on the console. If the number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ] [ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 4be4391..3f734c6 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating); */ #define WATCHDOG_INTERVAL (HZ >> 1) #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4) +#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6) static void clocksource_watchdog_work(struct work_struct *work) { @@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void) static void clocksource_watchdog(struct timer_list *unused) { struct clocksource *cs; - u64 csnow, wdnow, cslast, wdlast, delta; - int64_t wd_nsec, cs_nsec; + u64 csnow, wdnow, wdagain, cslast, wdlast, delta; + int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec; int next_cpu, reset_pending; + int nretries; spin_lock(&watchdog_lock); if (!watchdog_running) @@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused) reset_pending = atomic_read(&watchdog_reset_pending); list_for_each_entry(cs, &watchdog_list, wd_list) { + nretries = 0; /* Clocksource already marked unstable? 
*/ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { @@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused) continue; } +retry: local_irq_disable(); - csnow = cs->read(cs); - clocksource_watchdog_inject_delay(); wdnow = watchdog->read(watchdog); + clocksource_watchdog_inject_delay(); + csnow = cs->read(cs); + wdagain = watchdog->read(watchdog); local_irq_enable(); + delta = clocksource_delta(wdagain, wdnow, watchdog->mask); + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift); + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) { + wderr_nsec = wdagain_nsec; + if (nretries++ < max_read_retries) + goto retry; + } + if (nretries) + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n", + smp_processor_id(), watchdog->name, wderr_nsec, nretries); /* Clocksource initialized ? */ if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || -- 2.9.5
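As a rough illustration of the read-back retry pattern this patch adds, the following user-space sketch uses clock_gettime(CLOCK_MONOTONIC) as a stand-in for both the watchdog clock and the clock under test, with a MAX_SKEW_NS constant mirroring WATCHDOG_MAX_SKEW; the kernel code instead converts watchdog cycles via clocksource_cyc2ns():

#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define MAX_SKEW_NS      (1000000000LL >> 6)  /* mirrors WATCHDOG_MAX_SKEW */
#define MAX_READ_RETRIES 3                    /* mirrors max_read_retries */

static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
    int64_t wdnow, csnow, wdagain, delay_ns;
    int nretries = 0;

retry:
    wdnow = now_ns();      /* first watchdog read */
    csnow = now_ns();      /* read of the clock under test */
    wdagain = now_ns();    /* watchdog read-back */
    delay_ns = wdagain - wdnow;
    if (delay_ns < 0 || delay_ns > MAX_SKEW_NS) {
        if (nretries++ < MAX_READ_RETRIES)
            goto retry;    /* excessive delay: measurement is suspect, retry */
    }
    if (nretries)
        printf("read-back delay of %lldns, attempt %d\n",
               (long long)delay_ns, nretries);
    else
        printf("clean read: %lldns between watchdog reads (cs sample %lld)\n",
               (long long)delay_ns, (long long)csnow);
    return 0;
}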
[PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking
From: "Paul E. McKenney" Code that checks for clock desynchronization must itself be tested, so this commit creates a new clocksource.inject_delay_shift_percpu= kernel boot parameter that adds or subtracts a large value from the check read, using the specified bit of the CPU ID to determine whether to add or to subtract. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Apply Randy Dunlap feedback. ] Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 16 kernel/time/clocksource.c | 10 +- 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9965266..628e87f 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -593,6 +593,22 @@ times the value specified for inject_delay_freq of consecutive non-delays. + clocksource.inject_delay_shift_percpu= [KNL] + Clocksource delay injection partitions the CPUs + into two sets, one whose clocks are moved ahead + and the other whose clocks are moved behind. + This kernel parameter selects the CPU-number + bit that determines which of these two sets the + corresponding CPU is placed into. For example, + setting this parameter to the value 4 will result + in the first set containing alternating groups + of 16 CPUs whose clocks are moved ahead, while + the second set will contain the rest of the CPUs, + whose clocks are moved behind. + + The default value of -1 disables this type of + error injection. + clocksource.max_read_retries= [KNL] Number of clocksource_watchdog() retries due to external delays before the clock will be marked diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 663bc53..df48416 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -190,6 +190,8 @@ static int inject_delay_freq; module_param(inject_delay_freq, int, 0644); static int inject_delay_run = 1; module_param(inject_delay_run, int, 0644); +static int inject_delay_shift_percpu = -1; +module_param(inject_delay_shift_percpu, int, 0644); static int max_read_retries = 3; module_param(max_read_retries, int, 0644); @@ -219,8 +221,14 @@ static cpumask_t cpus_behind; static void clocksource_verify_one_cpu(void *csin) { struct clocksource *cs = (struct clocksource *)csin; + s64 delta = 0; + int sign; - __this_cpu_write(csnow_mid, cs->read(cs)); + if (inject_delay_shift_percpu >= 0) { + sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; + delta = sign * NSEC_PER_SEC; + } + __this_cpu_write(csnow_mid, cs->read(cs) + delta); } static void clocksource_verify_percpu_wq(struct work_struct *unused) -- 2.9.5
[PATCH clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable
From: "Paul E. McKenney" Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedy been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know? This commit therefore adds CPU-to-CPU synchronization checking for newly unstable clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- arch/x86/kernel/kvmclock.c | 2 +- arch/x86/kernel/tsc.c | 3 +- include/linux/clocksource.h | 2 +- kernel/time/clocksource.c | 73 + 4 files changed, 77 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index aa59374..337bb2c 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -169,7 +169,7 @@ struct clocksource kvm_clock = { .read = kvm_clock_get_cycles, .rating = 400, .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU, .enable = kvm_cs_enable, }; EXPORT_SYMBOL_GPL(kvm_clock); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index f70dffc..5628917 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = { .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VALID_FOR_HRES | - CLOCK_SOURCE_MUST_VERIFY, + CLOCK_SOURCE_MUST_VERIFY | + CLOCK_SOURCE_VERIFY_PERCPU, .vdso_clock_mode= VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 86d143d..83a3ebf 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -131,7 +131,7 @@ struct clocksource { #define CLOCK_SOURCE_UNSTABLE 0x40 #define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80 #define CLOCK_SOURCE_RESELECT 0x100 - +#define CLOCK_SOURCE_VERIFY_PERCPU 0x200 /* simplify initialization of mask field */ #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 3f734c6..663bc53 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void) WARN_ON_ONCE(injectfail < 0); } +static struct clocksource *clocksource_verify_work_cs; +static DEFINE_PER_CPU(u64, csnow_mid); +static cpumask_t cpus_ahead; +static cpumask_t cpus_behind; + +static void clocksource_verify_one_cpu(void *csin) +{ + struct clocksource *cs = (struct clocksource *)csin; + + __this_cpu_write(csnow_mid, cs->read(cs)); +} + +static void clocksource_verify_percpu_wq(struct work_struct *unused) +{ + int cpu; + struct clocksource *cs; + int64_t cs_nsec; + u64 csnow_begin; + u64 csnow_end; + u64 delta; + + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release + if (WARN_ON_ONCE(!cs)) + return; + 
pr_warn("Checking clocksource %s synchronization from CPU %d.\n", + cs->name, smp_processor_id()); + cpumask_clear(&cpus_ahead); + cpumask_clear(&cpus_behind); + csnow_begin = cs->read(cs); + smp_call_function(clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + for_each_online_cpu(cpu) { + if (cpu == smp_processor_id()) + continue; + delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_behind); + delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_ahead); + } + if (!cpuma
[PATCH clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking
From: "Paul E. McKenney" Although smp_call_function() has the advantage of simplicity, using it to check for cross-CPU clock desynchronization means that any CPU being slow reduces the sensitivity of the checking across all CPUs. And it is not uncommon for smp_call_function() latencies to be in the hundreds of microseconds. This commit therefore switches to smp_call_function_single(), so that delays from a given CPU affect only those measurements involving that particular CPU. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 41 + 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index df48416..4161c84 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void) } static struct clocksource *clocksource_verify_work_cs; -static DEFINE_PER_CPU(u64, csnow_mid); +static u64 csnow_mid; static cpumask_t cpus_ahead; static cpumask_t cpus_behind; @@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin) sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; delta = sign * NSEC_PER_SEC; } - __this_cpu_write(csnow_mid, cs->read(cs) + delta); + csnow_mid = cs->read(cs) + delta; } static void clocksource_verify_percpu_wq(struct work_struct *unused) @@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) int cpu; struct clocksource *cs; int64_t cs_nsec; + int64_t cs_nsec_max; + int64_t cs_nsec_min; u64 csnow_begin; u64 csnow_end; - u64 delta; + s64 delta; + bool firsttime = 1; cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release if (WARN_ON_ONCE(!cs)) @@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) cs->name, smp_processor_id()); cpumask_clear(&cpus_ahead); cpumask_clear(&cpus_behind); - csnow_begin = cs->read(cs); - smp_call_function(clocksource_verify_one_cpu, cs, 1); - csnow_end = cs->read(cs); + preempt_disable(); for_each_online_cpu(cpu) { if (cpu == smp_processor_id()) continue; - delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; - if ((s64)delta < 0) + csnow_begin = cs->read(cs); + smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + delta = (s64)((csnow_mid - csnow_begin) & cs->mask); + if (delta < 0) cpumask_set_cpu(cpu, &cpus_behind); - delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; - if ((s64)delta < 0) + delta = (csnow_end - csnow_mid) & cs->mask; + if (delta < 0) cpumask_set_cpu(cpu, &cpus_ahead); + delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); + cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); + if (firsttime || cs_nsec > cs_nsec_max) + cs_nsec_max = cs_nsec; + if (firsttime || cs_nsec < cs_nsec_min) + cs_nsec_min = cs_nsec; + firsttime = 0; } + preempt_enable(); if (!cpumask_empty(&cpus_ahead)) pr_warn("CPUs %*pbl ahead of CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_ahead), @@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) pr_warn("CPUs %*pbl behind CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_behind), smp_processor_id(), cs->name); - if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) { - delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); - cs_nsec = clocksource_cyc2ns(delta, cs->mult, 
cs->shift); - pr_warn("CPU %d duration %lldns for clocksource %s.\n", - smp_processor_id(), cs_nsec, cs->name); - } + if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind))) + pr_warn("CPU %d check durations %lldns - %lldns for clocksource %s.\n", + smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name); smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire. } -- 2.9.5
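The point of switching to per-CPU begin/end pairs can be seen with a small user-space sketch using invented readings: one slow measurement widens only its own window, and the reported min/max durations bound each individual check rather than the whole pass:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Invented per-CPU (begin, end) readings, already converted to ns. */
struct window { int64_t begin, end; };

int main(void)
{
    struct window w[] = { {0, 1800}, {2000, 2950}, {3000, 45000}, {46000, 47300} };
    int64_t cs_nsec_min = 0, cs_nsec_max = 0;
    bool firsttime = true;

    for (int cpu = 0; cpu < 4; cpu++) {
        int64_t cs_nsec = w[cpu].end - w[cpu].begin;   /* this CPU's check duration */

        if (firsttime || cs_nsec > cs_nsec_max)
            cs_nsec_max = cs_nsec;
        if (firsttime || cs_nsec < cs_nsec_min)
            cs_nsec_min = cs_nsec;
        firsttime = false;
    }
    /* The one slow check (42000ns) does not loosen the other CPUs' bounds. */
    printf("per-CPU check durations %lldns - %lldns; whole pass %lldns\n",
           (long long)cs_nsec_min, (long long)cs_nsec_max,
           (long long)(w[3].end - w[0].begin));
    return 0;
}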
[PATCH clocksource 2/5] clocksource: Retry clock read if long delays detected
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. This commit therefore re-reads the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If retries were required, a message is printed on the console. If the number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ] [ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 545889c..4663b86 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating); */ #define WATCHDOG_INTERVAL (HZ >> 1) #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4) +#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6) static void clocksource_watchdog_work(struct work_struct *work) { @@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void) static void clocksource_watchdog(struct timer_list *unused) { struct clocksource *cs; - u64 csnow, wdnow, cslast, wdlast, delta; - int64_t wd_nsec, cs_nsec; + u64 csnow, wdnow, wdagain, cslast, wdlast, delta; + int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec; int next_cpu, reset_pending; + int nretries; spin_lock(&watchdog_lock); if (!watchdog_running) @@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused) reset_pending = atomic_read(&watchdog_reset_pending); list_for_each_entry(cs, &watchdog_list, wd_list) { + nretries = 0; /* Clocksource already marked unstable? 
*/ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { @@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused) continue; } +retry: local_irq_disable(); - csnow = cs->read(cs); - clocksource_watchdog_inject_delay(); wdnow = watchdog->read(watchdog); + clocksource_watchdog_inject_delay(); + csnow = cs->read(cs); + wdagain = watchdog->read(watchdog); local_irq_enable(); + delta = clocksource_delta(wdagain, wdnow, watchdog->mask); + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift); + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) { + wderr_nsec = wdagain_nsec; + if (nretries++ < max_read_retries) + goto retry; + } + if (nretries) + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n", + smp_processor_id(), watchdog->name, wderr_nsec, nretries); /* Clocksource initialized ? */ if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || -- 2.9.5
[PATCH clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking
From: "Paul E. McKenney" Code that checks for clock desynchronization must itself be tested, so this commit creates a new clocksource.inject_delay_shift_percpu= kernel boot parameter that adds or subtracts a large value from the check read, using the specified bit of the CPU ID to determine whether to add or to subtract. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 9 + kernel/time/clocksource.c | 10 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9965266..f561e94 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -593,6 +593,15 @@ times the value specified for inject_delay_freq of consecutive non-delays. + clocksource.inject_delay_shift_percpu= [KNL] + Shift count to obtain bit from CPU number to + determine whether to shift the time of the per-CPU + clock under test ahead or behind. For example, + setting this to the value four will result in + alternating groups of 16 CPUs shifting ahead and + the rest of the CPUs shifting behind. The default + value of -1 disable this type of error injection. + clocksource.max_read_retries= [KNL] Number of clocksource_watchdog() retries due to external delays before the clock will be marked diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 23bcefe..67cf41c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -190,6 +190,8 @@ static int inject_delay_freq; module_param(inject_delay_freq, int, 0644); static int inject_delay_run = 1; module_param(inject_delay_run, int, 0644); +static int inject_delay_shift_percpu = -1; +module_param(inject_delay_shift_percpu, int, 0644); static int max_read_retries = 3; module_param(max_read_retries, int, 0644); @@ -219,8 +221,14 @@ static cpumask_t cpus_behind; static void clocksource_verify_one_cpu(void *csin) { struct clocksource *cs = (struct clocksource *)csin; + s64 delta = 0; + int sign; - __this_cpu_write(csnow_mid, cs->read(cs)); + if (inject_delay_shift_percpu >= 0) { + sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; + delta = sign * NSEC_PER_SEC; + } + __this_cpu_write(csnow_mid, cs->read(cs) + delta); } static void clocksource_verify_percpu_wq(struct work_struct *unused) -- 2.9.5
[PATCH clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable
From: "Paul E. McKenney" Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedy been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know? This commit therefore adds CPU-to-CPU synchronization checking for newly unstable clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason [ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- arch/x86/kernel/kvmclock.c | 2 +- arch/x86/kernel/tsc.c | 3 +- include/linux/clocksource.h | 2 +- kernel/time/clocksource.c | 73 + 4 files changed, 77 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index aa59374..337bb2c 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -169,7 +169,7 @@ struct clocksource kvm_clock = { .read = kvm_clock_get_cycles, .rating = 400, .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU, .enable = kvm_cs_enable, }; EXPORT_SYMBOL_GPL(kvm_clock); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index f70dffc..5628917 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = { .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VALID_FOR_HRES | - CLOCK_SOURCE_MUST_VERIFY, + CLOCK_SOURCE_MUST_VERIFY | + CLOCK_SOURCE_VERIFY_PERCPU, .vdso_clock_mode= VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 86d143d..83a3ebf 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -131,7 +131,7 @@ struct clocksource { #define CLOCK_SOURCE_UNSTABLE 0x40 #define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80 #define CLOCK_SOURCE_RESELECT 0x100 - +#define CLOCK_SOURCE_VERIFY_PERCPU 0x200 /* simplify initialization of mask field */ #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 4663b86..23bcefe 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void) WARN_ON_ONCE(injectfail < 0); } +static struct clocksource *clocksource_verify_work_cs; +static DEFINE_PER_CPU(u64, csnow_mid); +static cpumask_t cpus_ahead; +static cpumask_t cpus_behind; + +static void clocksource_verify_one_cpu(void *csin) +{ + struct clocksource *cs = (struct clocksource *)csin; + + __this_cpu_write(csnow_mid, cs->read(cs)); +} + +static void clocksource_verify_percpu_wq(struct work_struct *unused) +{ + int cpu; + struct clocksource *cs; + int64_t cs_nsec; + u64 csnow_begin; + u64 csnow_end; + u64 delta; + + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release + if (WARN_ON_ONCE(!cs)) + return; + 
pr_warn("Checking clocksource %s synchronization from CPU %d.\n", + cs->name, smp_processor_id()); + cpumask_clear(&cpus_ahead); + cpumask_clear(&cpus_behind); + csnow_begin = cs->read(cs); + smp_call_function(clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + for_each_online_cpu(cpu) { + if (cpu == smp_processor_id()) + continue; + delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_behind); + delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_ahead); + } + if (!cpuma
[PATCH clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. The first step is a way of injecting such delays, and this commit therefore provides a clocksource.inject_delay_freq and clocksource.inject_delay_run kernel boot parameters that specify that sufficient delay be injected to cause the clocksource_watchdog() function to mark a clock unstable. This delay is injected every Nth set of M calls to clocksource_watchdog(), where N is the value specified for the inject_delay_freq boot parameter and M is the value specified for the inject_delay_run boot parameter. Values of zero or less for either parameter disable delay injection, and the default for clocksource.inject_delay_freq is zero, that is, disabled. The default for clocksource.inject_delay_run is the value one, that is single-call runs. This facility is intended for diagnostic use only, and should be avoided on production systems. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen [ paulmck: Apply Rik van Riel feedback. ] Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 22 kernel/time/clocksource.c | 27 + 2 files changed, 49 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a10b545..9965266 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -577,6 +577,28 @@ loops can be debugged more effectively on production systems. + clocksource.inject_delay_freq= [KNL] + Number of runs of calls to clocksource_watchdog() + before delays are injected between reads from the + two clocksources. Values less than or equal to + zero disable this delay injection. These delays + can cause clocks to be marked unstable, so use + of this parameter should therefore be avoided on + production systems. Defaults to zero (disabled). + + clocksource.inject_delay_run= [KNL] + Run lengths of clocksource_watchdog() delay + injections. Specifying the value 8 will result + in eight consecutive delays followed by eight + times the value specified for inject_delay_freq + of consecutive non-delays. + + clocksource.max_read_retries= [KNL] + Number of clocksource_watchdog() retries due to + external delays before the clock will be marked + unstable. Defaults to three retries, that is, + four attempts to read the clock under test. + clearcpuid=BITNUM[,BITNUM...] [X86] Disable CPUID feature X for the kernel. 
See arch/x86/include/asm/cpufeatures.h for the valid bit diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cce484a..545889c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -14,6 +14,7 @@ #include /* for spin_unlock_irq() using preempt_count() m68k */ #include #include +#include #include "tick-internal.h" #include "timekeeping_internal.h" @@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs) spin_unlock_irqrestore(&watchdog_lock, flags); } +static int inject_delay_freq; +module_param(inject_delay_freq, int, 0644); +static int inject_delay_run = 1; +module_param(inject_delay_run, int, 0644); +static int max_read_retries = 3; +module_param(max_read_retries, int, 0644); + +static void clocksource_watchdog_inject_delay(void) +{ + int i; + static int injectfail = -1; + + if (inject_delay_freq <= 0 || inject_delay_run <= 0) + return; + if (injectfail < 0 || injectfail > INT_MAX / 2) + injectfail = inject_delay_run; + if (!(++injectfail / inject_delay_run % inject_delay_freq)) { + printk("%s(): Injecting delay.\n", __func__); + for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++) + udelay(1000); + printk("%s(): Done injecting delay.\n", __func_
[PATCH clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking
From: "Paul E. McKenney" Although smp_call_function() has the advantage of simplicity, using it to check for cross-CPU clock desynchronization means that any CPU being slow reduces the sensitivity of the checking across all CPUs. And it is not uncommon for smp_call_function() latencies to be in the hundreds of microseconds. This commit therefore switches to smp_call_function_single(), so that delays from a given CPU affect only those measurements involving that particular CPU. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Cc: Andi Kleen Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 41 + 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 67cf41c..3bae5fb 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void) } static struct clocksource *clocksource_verify_work_cs; -static DEFINE_PER_CPU(u64, csnow_mid); +static u64 csnow_mid; static cpumask_t cpus_ahead; static cpumask_t cpus_behind; @@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin) sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; delta = sign * NSEC_PER_SEC; } - __this_cpu_write(csnow_mid, cs->read(cs) + delta); + csnow_mid = cs->read(cs) + delta; } static void clocksource_verify_percpu_wq(struct work_struct *unused) @@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) int cpu; struct clocksource *cs; int64_t cs_nsec; + int64_t cs_nsec_max; + int64_t cs_nsec_min; u64 csnow_begin; u64 csnow_end; - u64 delta; + s64 delta; + bool firsttime = 1; cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release if (WARN_ON_ONCE(!cs)) @@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) cs->name, smp_processor_id()); cpumask_clear(&cpus_ahead); cpumask_clear(&cpus_behind); - csnow_begin = cs->read(cs); - smp_call_function(clocksource_verify_one_cpu, cs, 1); - csnow_end = cs->read(cs); + preempt_disable(); for_each_online_cpu(cpu) { if (cpu == smp_processor_id()) continue; - delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; - if ((s64)delta < 0) + csnow_begin = cs->read(cs); + smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + delta = (s64)((csnow_mid - csnow_begin) & cs->mask); + if (delta < 0) cpumask_set_cpu(cpu, &cpus_behind); - delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; - if ((s64)delta < 0) + delta = (csnow_end - csnow_mid) & cs->mask; + if (delta < 0) cpumask_set_cpu(cpu, &cpus_ahead); + delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); + cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); + if (firsttime || cs_nsec > cs_nsec_max) + cs_nsec_max = cs_nsec; + if (firsttime || cs_nsec < cs_nsec_min) + cs_nsec_min = cs_nsec; + firsttime = 0; } + preempt_enable(); if (!cpumask_empty(&cpus_ahead)) pr_warn("CPUs %*pbl ahead of CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_ahead), @@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) pr_warn("CPUs %*pbl behind CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_behind), smp_processor_id(), cs->name); - if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) { - delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); - cs_nsec = clocksource_cyc2ns(delta, cs->mult, 
cs->shift); - pr_warn("CPU %d duration %lldns for clocksource %s.\n", - smp_processor_id(), cs_nsec, cs->name); - } + if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind))) + pr_warn("CPU %d check durations %lldns - %lldns for clocksource %s.\n", + smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name); smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire. } -- 2.9.5
[PATCH tip/core/rcu 11/12] rcu: Execute RCU reader shortly after rcu_core for strict GPs
From: "Paul E. McKenney" A kernel built with CONFIG_RCU_STRICT_GRACE_PERIOD=y needs a quiescent state to appear very shortly after a CPU has noticed a new grace period. Placing an RCU reader immediately after this point is ineffective because this normally happens in softirq context, which acts as a big RCU reader. This commit therefore introduces a new per-CPU work_struct, which is used at the end of rcu_core() processing to schedule an RCU read-side critical section from within a clean environment. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 + kernel/rcu/tree.h | 1 + 2 files changed, 14 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index dd7af40..ac37343 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2634,6 +2634,14 @@ void rcu_force_quiescent_state(void) } EXPORT_SYMBOL_GPL(rcu_force_quiescent_state); +// Workqueue handler for an RCU reader for kernels enforcing struct RCU +// grace periods. +static void strict_work_handler(struct work_struct *work) +{ + rcu_read_lock(); + rcu_read_unlock(); +} + /* Perform RCU core processing work for the current CPU. */ static __latent_entropy void rcu_core(void) { @@ -2678,6 +2686,10 @@ static __latent_entropy void rcu_core(void) /* Do any needed deferred wakeups of rcuo kthreads. */ do_nocb_deferred_wakeup(rdp); trace_rcu_utilization(TPS("End RCU core")); + + // If strict GPs, schedule an RCU reader in a clean environment. + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + queue_work_on(rdp->cpu, rcu_gp_wq, &rdp->strict_work); } static void rcu_core_si(struct softirq_action *h) @@ -3874,6 +3886,7 @@ rcu_boot_init_percpu_data(int cpu) /* Set up local state, ensuring consistent view of global state. */ rdp->grpmask = leaf_node_cpu_bit(rdp->mynode, cpu); + INIT_WORK(&rdp->strict_work, strict_work_handler); WARN_ON_ONCE(rdp->dynticks_nesting != 1); WARN_ON_ONCE(rcu_dynticks_in_eqs(rcu_dynticks_snap(rdp))); rdp->rcu_ofl_gp_seq = rcu_state.gp_seq; diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 309bc7f..e4f66b8 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -165,6 +165,7 @@ struct rcu_data { /* period it is aware of. */ struct irq_work defer_qs_iw;/* Obtain later scheduler attention. */ bool defer_qs_iw_pending; /* Scheduler attention pending? */ + struct work_struct strict_work; /* Schedule readers for strict GPs. */ /* 2) batch handling */ struct rcu_segcblist cblist;/* Segmented callback list, with */ -- 2.9.5
[PATCH tip/core/rcu 05/12] rcu: Always set .need_qs from __rcu_read_lock() for strict GPs
From: "Paul E. McKenney" The ->rcu_read_unlock_special.b.need_qs field in the task_struct structure indicates that the RCU core needs a quiscent state from the corresponding task. The __rcu_read_unlock() function checks this (via an eventual call to rcu_preempt_deferred_qs_irqrestore()), and if set reports a quiscent state immediately upon exit from the outermost RCU read-side critical section. Currently, this flag is only set when the scheduling-clock interrupt decides that the current RCU grace period is too old, as in about one full second too old. But if the kernel has been built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, we clearly do not want to wait that long. This commit therefore sets the .need_qs field immediately at the start of the RCU read-side critical section from within __rcu_read_lock() in order to unconditionally enlist help from __rcu_read_unlock(). But note the additional check for rcu_state.gp_kthread, which prevents attempts to awaken RCU's grace-period kthread during early boot before there is a scheduler. Leaving off this check results in early boot hangs. So early that there is no console output. Thus, this additional check fails until such time as RCU's grace-period kthread has been created, avoiding these empty-console hangs. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 5c0c580..7ed55c5 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -376,6 +376,8 @@ void __rcu_read_lock(void) rcu_preempt_read_enter(); if (IS_ENABLED(CONFIG_PROVE_LOCKING)) WARN_ON_ONCE(rcu_preempt_depth() > RCU_NEST_PMAX); + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) && rcu_state.gp_kthread) + WRITE_ONCE(current->rcu_read_unlock_special.b.need_qs, true); barrier(); /* critical section after entry code. */ } EXPORT_SYMBOL_GPL(__rcu_read_lock); -- 2.9.5
[PATCH tip/core/rcu 10/12] rcu: Provide optional RCU-reader exit delay for strict GPs
From: "Paul E. McKenney" The goal of this series is to increase the probability of tools like KASAN detecting that an RCU-protected pointer was used outside of its RCU read-side critical section. Thus far, the approach has been to make grace periods and callback processing happen faster. Another approach is to delay the pointer leaker. This commit therefore allows a delay to be applied to exit from RCU read-side critical sections. This slowdown is specified by a new rcutree.rcu_unlock_delay kernel boot parameter that specifies this delay in microseconds, defaulting to zero. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 9 + kernel/rcu/tree_plugin.h| 12 ++-- 2 files changed, 19 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 60e2c6e..c532c70 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4128,6 +4128,15 @@ This wake_up() will be accompanied by a WARN_ONCE() splat and an ftrace_dump(). + rcutree.rcu_unlock_delay= [KNL] + In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, + this specifies an rcu_read_unlock()-time delay + in microseconds. This defaults to zero. + Larger delays increase the probability of + catching RCU pointer leaks, that is, buggy use + of RCU-protected pointers after the relevant + rcu_read_unlock() has completed. + rcutree.sysrq_rcu= [KNL] Commandeer a sysrq key to dump out Tree RCU's rcu_node tree with an eye towards determining diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 1761ff4..25c9ee4 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -430,6 +430,12 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp) return !list_empty(&rnp->blkd_tasks); } +// Add delay to rcu_read_unlock() for strict grace periods. +static int rcu_unlock_delay; +#ifdef CONFIG_RCU_STRICT_GRACE_PERIOD +module_param(rcu_unlock_delay, int, 0444); +#endif + /* * Report deferred quiescent states. The deferral time can * be quite short, for example, in the case of the call from @@ -460,10 +466,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags) } t->rcu_read_unlock_special.s = 0; if (special.b.need_qs) { - if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) { rcu_report_qs_rdp(rdp->cpu, rdp); - else + udelay(rcu_unlock_delay); + } else { rcu_qs(); + } } /* -- 2.9.5
[PATCH tip/core/rcu 12/12] rcu: Report QS for outermost PREEMPT=n rcu_read_unlock() for strict GPs
From: "Paul E. McKenney" The CONFIG_PREEMPT=n instance of rcu_read_unlock is even more aggressively than that of CONFIG_PREEMPT=y in deferring reporting quiescent states to the RCU core. This is just what is wanted in normal use because it reduces overhead, but the resulting delay is not what is wanted for kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. This commit therefore adds an rcu_read_unlock_strict() function that checks for exceptional conditions, and reports the newly started quiescent state if it is safe to do so, also doing a spin-delay if requested via rcutree.rcu_unlock_delay. This commit also adds a call to rcu_read_unlock_strict() from the CONFIG_PREEMPT=n instance of __rcu_read_unlock(). [ paulmck: Fixed bug located by kernel test robot ] Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 7 +++ kernel/rcu/tree.c| 6 ++ kernel/rcu/tree_plugin.h | 23 +-- 3 files changed, 30 insertions(+), 6 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index b47d6b6..7c1ceff 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -55,6 +55,12 @@ void __rcu_read_unlock(void); #else /* #ifdef CONFIG_PREEMPT_RCU */ +#ifdef CONFIG_TINY_RCU +#define rcu_read_unlock_strict() do { } while (0) +#else +void rcu_read_unlock_strict(void); +#endif + static inline void __rcu_read_lock(void) { preempt_disable(); @@ -63,6 +69,7 @@ static inline void __rcu_read_lock(void) static inline void __rcu_read_unlock(void) { preempt_enable(); + rcu_read_unlock_strict(); } static inline int rcu_preempt_depth(void) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ac37343..78852ef 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -164,6 +164,12 @@ module_param(gp_init_delay, int, 0444); static int gp_cleanup_delay; module_param(gp_cleanup_delay, int, 0444); +// Add delay to rcu_read_unlock() for strict grace periods. +static int rcu_unlock_delay; +#ifdef CONFIG_RCU_STRICT_GRACE_PERIOD +module_param(rcu_unlock_delay, int, 0444); +#endif + /* * This rcu parameter is runtime-read-only. It reflects * a minimum allowed number of objects which can be cached diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 25c9ee4..0881aef 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -430,12 +430,6 @@ static bool rcu_preempt_has_tasks(struct rcu_node *rnp) return !list_empty(&rnp->blkd_tasks); } -// Add delay to rcu_read_unlock() for strict grace periods. -static int rcu_unlock_delay; -#ifdef CONFIG_RCU_STRICT_GRACE_PERIOD -module_param(rcu_unlock_delay, int, 0444); -#endif - /* * Report deferred quiescent states. The deferral time can * be quite short, for example, in the case of the call from @@ -785,6 +779,23 @@ dump_blkd_tasks(struct rcu_node *rnp, int ncheck) #else /* #ifdef CONFIG_PREEMPT_RCU */ /* + * If strict grace periods are enabled, and if the calling + * __rcu_read_unlock() marks the beginning of a quiescent state, immediately + * report that quiescent state and, if requested, spin for a bit. + */ +void rcu_read_unlock_strict(void) +{ + struct rcu_data *rdp; + + if (!IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) || + irqs_disabled() || preempt_count() || !rcu_state.gp_kthread) + return; + rdp = this_cpu_ptr(&rcu_data); + rcu_report_qs_rdp(rdp->cpu, rdp); + udelay(rcu_unlock_delay); +} + +/* * Tell them what RCU they are running. */ static void __init rcu_bootup_announce(void) -- 2.9.5
[PATCH tip/core/rcu 03/12] rcu: Restrict default jiffies_till_first_fqs for strict RCU GPs
From: "Paul E. McKenney" If there are idle CPUs, RCU's grace-period kthread will wait several jiffies before even thinking about polling them. This promotes efficiency, which is normally a good thing, but when the kernel has been built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, we care more about short grace periods. This commit therefore restricts the default jiffies_till_first_fqs value to zero in kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, which causes RCU's grace-period kthread to poll for idle CPUs immediately after starting a grace period. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 65e1b5e..d333f1b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -471,7 +471,7 @@ module_param(qhimark, long, 0444); module_param(qlowmark, long, 0444); module_param(qovld, long, 0444); -static ulong jiffies_till_first_fqs = ULONG_MAX; +static ulong jiffies_till_first_fqs = IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ? 0 : ULONG_MAX; static ulong jiffies_till_next_fqs = ULONG_MAX; static bool rcu_kick_kthreads; static int rcu_divisor = 7; -- 2.9.5
[PATCH tip/core/rcu 09/12] rcu: IPI all CPUs at GP end for strict GPs
From: "Paul E. McKenney" Currently, each CPU discovers the end of a given grace period on its own time, which is again good for efficiency but bad for fast grace periods, given that it is things like kfree() within the RCU callbacks that will cause trouble for pointers leaked from RCU read-side critical sections. This commit therefore uses on_each_cpu() to IPI each CPU after grace-period cleanup in order to inform each CPU of the end of the old grace period in a timely manner, but only in kernels build with CONFIG_RCU_STRICT_GRACE_PERIOD=y. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 4 1 file changed, 4 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index a30d6f3..dd7af40 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2034,6 +2034,10 @@ static void rcu_gp_cleanup(void) rcu_state.gp_flags & RCU_GP_FLAG_INIT); } raw_spin_unlock_irq_rcu_node(rnp); + + // If strict, make all CPUs aware of the end of the old grace period. + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + on_each_cpu(rcu_strict_gp_boundary, NULL, 0); } /* -- 2.9.5
[PATCH tip/core/rcu 08/12] rcu: IPI all CPUs at GP start for strict GPs
From: "Paul E. McKenney" Currently, each CPU discovers the beginning of a given grace period on its own time, which is again good for efficiency but bad for fast grace periods. This commit therefore uses on_each_cpu() to IPI each CPU after grace-period initialization in order to inform each CPU of the new grace period in a timely manner, but only in kernels build with CONFIG_RCU_STRICT_GRACE_PERIOD=y. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 + 1 file changed, 13 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 4353a1a..a30d6f3 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1678,6 +1678,15 @@ static void rcu_gp_torture_wait(void) } /* + * Handler for on_each_cpu() to invoke the target CPU's RCU core + * processing. + */ +static void rcu_strict_gp_boundary(void *unused) +{ + invoke_rcu_core(); +} + +/* * Initialize a new grace period. Return false if no grace period required. */ static bool rcu_gp_init(void) @@ -1805,6 +1814,10 @@ static bool rcu_gp_init(void) WRITE_ONCE(rcu_state.gp_activity, jiffies); } + // If strict, make all CPUs aware of new grace period. + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + on_each_cpu(rcu_strict_gp_boundary, NULL, 0); + return true; } -- 2.9.5
[PATCH tip/core/rcu 06/12] rcu: Do full report for .need_qs for strict GPs
From: "Paul E. McKenney" The rcu_preempt_deferred_qs_irqrestore() function is invoked at the end of an RCU read-side critical section (for example, directly from rcu_read_unlock()) and, if .need_qs is set, invokes rcu_qs() to report the new quiescent state. This works, except that rcu_qs() only updates per-CPU state, leaving reporting of the actual quiescent state to a later call to rcu_report_qs_rdp(), for example from within a later RCU_SOFTIRQ instance. Although this approach is exactly what you want if you are more concerned about efficiency than about short grace periods, in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, short grace periods are the name of the game. This commit therefore makes rcu_preempt_deferred_qs_irqrestore() directly invoke rcu_report_qs_rdp() in CONFIG_RCU_STRICT_GRACE_PERIOD=y, thus shortening grace periods. Historical note: To the best of my knowledge, causing rcu_read_unlock() to directly report a quiescent state first appeared in Jim Houston's and Joe Korty's JRCU. This is the second instance of a Linux-kernel RCU feature being inspired by JRCU, the first being RCU callback offloading (as in the RCU_NOCB_CPU Kconfig option). Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 7ed55c5..1761ff4 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -459,8 +459,12 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags) return; } t->rcu_read_unlock_special.s = 0; - if (special.b.need_qs) - rcu_qs(); + if (special.b.need_qs) { + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + rcu_report_qs_rdp(rdp->cpu, rdp); + else + rcu_qs(); + } /* * Respond to a request by an expedited grace period for a -- 2.9.5
[PATCH tip/core/rcu 04/12] rcu: Force DEFAULT_RCU_BLIMIT to 1000 for strict RCU GPs
From: "Paul E. McKenney" The value of DEFAULT_RCU_BLIMIT is normally set to 10, the idea being to avoid needless response-time degradation due to RCU callback invocation. However, when CONFIG_RCU_STRICT_GRACE_PERIOD=y it is better to avoid throttling callback execution in order to better detect pointer leaks from RCU read-side critical sections. This commit therefore sets the value of DEFAULT_RCU_BLIMIT to 1000 in kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index d333f1b..08cc91c 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -454,17 +454,18 @@ static int rcu_is_cpu_rrupt_from_idle(void) return __this_cpu_read(rcu_data.dynticks_nesting) == 0; } -#define DEFAULT_RCU_BLIMIT 10 /* Maximum callbacks per rcu_do_batch ... */ -#define DEFAULT_MAX_RCU_BLIMIT 1 /* ... even during callback flood. */ +#define DEFAULT_RCU_BLIMIT (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) ? 1000 : 10) + // Maximum callbacks per rcu_do_batch ... +#define DEFAULT_MAX_RCU_BLIMIT 1 // ... even during callback flood. static long blimit = DEFAULT_RCU_BLIMIT; -#define DEFAULT_RCU_QHIMARK 1 /* If this many pending, ignore blimit. */ +#define DEFAULT_RCU_QHIMARK 1 // If this many pending, ignore blimit. static long qhimark = DEFAULT_RCU_QHIMARK; -#define DEFAULT_RCU_QLOMARK 100 /* Once only this many pending, use blimit. */ +#define DEFAULT_RCU_QLOMARK 100 // Once only this many pending, use blimit. static long qlowmark = DEFAULT_RCU_QLOMARK; #define DEFAULT_RCU_QOVLD_MULT 2 #define DEFAULT_RCU_QOVLD (DEFAULT_RCU_QOVLD_MULT * DEFAULT_RCU_QHIMARK) -static long qovld = DEFAULT_RCU_QOVLD; /* If this many pending, hammer QS. */ -static long qovld_calc = -1; /* No pre-initialization lock acquisitions! */ +static long qovld = DEFAULT_RCU_QOVLD; // If this many pending, hammer QS. +static long qovld_calc = -1; // No pre-initialization lock acquisitions! module_param(blimit, long, 0444); module_param(qhimark, long, 0444); -- 2.9.5
[PATCH tip/core/rcu 02/12] rcu: Reduce leaf fanout for strict RCU grace periods
From: "Paul E. McKenney" Because strict RCU grace periods will complete more quickly, they will experience greater lock contention on each leaf rcu_node structure's ->lock. This commit therefore reduces the leaf fanout in order to reduce this lock contention. Note that this also has the effect of reducing the number of CPUs supported to 16 in the case of CONFIG_RCU_FANOUT_LEAF=2 or 81 in the case of CONFIG_RCU_FANOUT_LEAF=3. However, greater numbers of CPUs are probably a bad idea when using CONFIG_RCU_STRICT_GRACE_PERIOD=y. Those wishing to live dangerously are free to edit their kernel/rcu/Kconfig files accordingly. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/Kconfig | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig index 0ebe15a..b71e21f 100644 --- a/kernel/rcu/Kconfig +++ b/kernel/rcu/Kconfig @@ -135,10 +135,12 @@ config RCU_FANOUT config RCU_FANOUT_LEAF int "Tree-based hierarchical RCU leaf-level fanout value" - range 2 64 if 64BIT - range 2 32 if !64BIT + range 2 64 if 64BIT && !RCU_STRICT_GRACE_PERIOD + range 2 32 if !64BIT && !RCU_STRICT_GRACE_PERIOD + range 2 3 if RCU_STRICT_GRACE_PERIOD depends on TREE_RCU && RCU_EXPERT - default 16 + default 16 if !RCU_STRICT_GRACE_PERIOD + default 2 if RCU_STRICT_GRACE_PERIOD help This option controls the leaf-level fanout of hierarchical implementations of RCU, and allows trading off cache misses -- 2.9.5
[PATCH tip/core/rcu 07/12] rcu: Attempt QS when CPU discovers GP for strict GPs
From: "Paul E. McKenney" A given CPU normally notes a new grace period during one RCU_SOFTIRQ, but avoids reporting the corresponding quiescent state until some later RCU_SOFTIRQ. This leisurly approach improves efficiency by increasing the number of update requests served by each grace period, but is not what is needed for kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. This commit therefore adds a new rcu_strict_gp_check_qs() function which, in CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, simply enters and immediately exist an RCU read-side critical section. If the CPU is in a quiescent state, the rcu_read_unlock() will attempt to report an immediate quiescent state. This rcu_strict_gp_check_qs() function is invoked from note_gp_changes(), so that a CPU just noticing a new grace period might immediately report a quiescent state for that grace period. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 14 ++ 1 file changed, 14 insertions(+) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 08cc91c..4353a1a 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1557,6 +1557,19 @@ static void __maybe_unused rcu_advance_cbs_nowake(struct rcu_node *rnp, } /* + * In CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels, attempt to generate a + * quiescent state. This is intended to be invoked when the CPU notices + * a new grace period. + */ +static void rcu_strict_gp_check_qs(void) +{ + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) { + rcu_read_lock(); + rcu_read_unlock(); + } +} + +/* * Update CPU-local rcu_data state to record the beginnings and ends of * grace periods. The caller must hold the ->lock of the leaf rcu_node * structure corresponding to the current CPU, and must have irqs disabled. @@ -1626,6 +1639,7 @@ static void note_gp_changes(struct rcu_data *rdp) } needwake = __note_gp_changes(rnp, rdp); raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + rcu_strict_gp_check_qs(); if (needwake) rcu_gp_kthread_wake(); } -- 2.9.5
[PATCH tip/core/rcu 01/12] rcu: Add Kconfig option for strict RCU grace periods
From: "Paul E. McKenney" People running automated tests have asked for a way to make RCU minimize grace-period duration in order to increase the probability of KASAN detecting a pointer being improperly leaked from an RCU read-side critical section, for example, like this: rcu_read_lock(); p = rcu_dereference(gp); do_something_with(p); // OK rcu_read_unlock(); do_something_else_with(p); // BUG!!! The rcupdate.rcu_expedited boot parameter is a start in this direction, given that it makes calls to synchronize_rcu() instead invoke the faster (and more wasteful) synchronize_rcu_expedited(). However, this does nothing to shorten RCU grace periods that are instead initiated by call_rcu(), and RCU pointer-leak bugs can involve call_rcu() just as surely as they can synchronize_rcu(). This commit therefore adds a RCU_STRICT_GRACE_PERIOD Kconfig option that will be used to shorten normal (non-expedited) RCU grace periods. This commit also dumps out a message when this option is in effect. Later commits will actually shorten grace periods. Reported-by Jann Horn Signed-off-by: Paul E. McKenney --- kernel/rcu/Kconfig.debug | 15 +++ kernel/rcu/tree_plugin.h | 2 ++ 2 files changed, 17 insertions(+) diff --git a/kernel/rcu/Kconfig.debug b/kernel/rcu/Kconfig.debug index 3cf6132..cab5a4b 100644 --- a/kernel/rcu/Kconfig.debug +++ b/kernel/rcu/Kconfig.debug @@ -114,4 +114,19 @@ config RCU_EQS_DEBUG Say N here if you need ultimate kernel/user switch latencies Say Y if you are unsure +config RCU_STRICT_GRACE_PERIOD + bool "Provide debug RCU implementation with short grace periods" + depends on DEBUG_KERNEL && RCU_EXPERT + default n + select PREEMPT_COUNT if PREEMPT=n + help + Select this option to build an RCU variant that is strict about + grace periods, making them as short as it can. This limits + scalability, destroys real-time response, degrades battery + lifetime and kills performance. Don't try this on large + machines, as in systems with more than about 10 or 20 CPUs. + But in conjunction with tools like KASAN, it can be helpful + when looking for certain types of RCU usage bugs, for example, + too-short RCU read-side critical sections. + endmenu # "RCU Debugging" diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index cb1e8c8..5c0c580 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -36,6 +36,8 @@ static void __init rcu_bootup_announce_oddness(void) pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n"); if (IS_ENABLED(CONFIG_PROVE_RCU)) pr_info("\tRCU lockdep checking is enabled.\n"); + if (IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD)) + pr_info("\tRCU strict (and thus non-scalable) grace periods enabled.\n"); if (RCU_NUM_LVLS >= 4) pr_info("\tFour(or more)-level hierarchy is enabled.\n"); if (RCU_FANOUT_LEAF != 16) -- 2.9.5
[PATCH v2 clocksource 5/5] clocksource: Do pairwise clock-desynchronization checking
From: "Paul E. McKenney" Although smp_call_function() has the advantage of simplicity, using it to check for cross-CPU clock desynchronization means that any CPU being slow reduces the sensitivity of the checking across all CPUs. And it is not uncommon for smp_call_function() latencies to be in the hundreds of microseconds. This commit therefore switches to smp_call_function_single(), so that delays from a given CPU affect only those measurements involving that particular CPU. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 41 + 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 67cf41c..3bae5fb 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -214,7 +214,7 @@ static void clocksource_watchdog_inject_delay(void) } static struct clocksource *clocksource_verify_work_cs; -static DEFINE_PER_CPU(u64, csnow_mid); +static u64 csnow_mid; static cpumask_t cpus_ahead; static cpumask_t cpus_behind; @@ -228,7 +228,7 @@ static void clocksource_verify_one_cpu(void *csin) sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; delta = sign * NSEC_PER_SEC; } - __this_cpu_write(csnow_mid, cs->read(cs) + delta); + csnow_mid = cs->read(cs) + delta; } static void clocksource_verify_percpu_wq(struct work_struct *unused) @@ -236,9 +236,12 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) int cpu; struct clocksource *cs; int64_t cs_nsec; + int64_t cs_nsec_max; + int64_t cs_nsec_min; u64 csnow_begin; u64 csnow_end; - u64 delta; + s64 delta; + bool firsttime = 1; cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release if (WARN_ON_ONCE(!cs)) @@ -247,19 +250,28 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) cs->name, smp_processor_id()); cpumask_clear(&cpus_ahead); cpumask_clear(&cpus_behind); - csnow_begin = cs->read(cs); - smp_call_function(clocksource_verify_one_cpu, cs, 1); - csnow_end = cs->read(cs); + preempt_disable(); for_each_online_cpu(cpu) { if (cpu == smp_processor_id()) continue; - delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; - if ((s64)delta < 0) + csnow_begin = cs->read(cs); + smp_call_function_single(cpu, clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + delta = (s64)((csnow_mid - csnow_begin) & cs->mask); + if (delta < 0) cpumask_set_cpu(cpu, &cpus_behind); - delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; - if ((s64)delta < 0) + delta = (csnow_end - csnow_mid) & cs->mask; + if (delta < 0) cpumask_set_cpu(cpu, &cpus_ahead); + delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); + cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); + if (firsttime || cs_nsec > cs_nsec_max) + cs_nsec_max = cs_nsec; + if (firsttime || cs_nsec < cs_nsec_min) + cs_nsec_min = cs_nsec; + firsttime = 0; } + preempt_enable(); if (!cpumask_empty(&cpus_ahead)) pr_warn("CPUs %*pbl ahead of CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_ahead), @@ -268,12 +280,9 @@ static void clocksource_verify_percpu_wq(struct work_struct *unused) pr_warn("CPUs %*pbl behind CPU %d for clocksource %s.\n", cpumask_pr_args(&cpus_behind), smp_processor_id(), cs->name); - if (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind)) { - delta = clocksource_delta(csnow_end, csnow_begin, cs->mask); - cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift); - 
pr_warn("CPU %d duration %lldns for clocksource %s.\n", - smp_processor_id(), cs_nsec, cs->name); - } + if (!firsttime && (!cpumask_empty(&cpus_ahead) || !cpumask_empty(&cpus_behind))) + pr_warn("CPU %d check durations %lldns - %lldns for clocksource %s.\n", + smp_processor_id(), cs_nsec_min, cs_nsec_max, cs->name); smp_store_release(&clocksource_verify_work_cs, NULL); // pairs with acquire. } -- 2.9.5
[PATCH v2 clocksource 3/5] clocksource: Check per-CPU clock synchronization when marked unstable
From: "Paul E. McKenney" Some sorts of per-CPU clock sources have a history of going out of synchronization with each other. However, this problem has purportedy been solved in the past ten years. Except that it is all too possible that the problem has instead simply been made less likely, which might mean that some of the occasional "Marking clocksource 'tsc' as unstable" messages might be due to desynchronization. How would anyone know? This commit therefore adds CPU-to-CPU synchronization checking for newly unstable clocksource that are marked with the new CLOCK_SOURCE_VERIFY_PERCPU flag. Lists of desynchronized CPUs are printed, with the caveat that if it is the reporting CPU that is itself desynchronized, it will appear that all the other clocks are wrong. Just like in real life. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Reported-by: Chris Mason [ paulmck: Add "static" to clocksource_verify_one_cpu() per kernel test robot feedback. ] Signed-off-by: Paul E. McKenney --- arch/x86/kernel/kvmclock.c | 2 +- arch/x86/kernel/tsc.c | 3 +- include/linux/clocksource.h | 2 +- kernel/time/clocksource.c | 73 + 4 files changed, 77 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index aa59374..337bb2c 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -169,7 +169,7 @@ struct clocksource kvm_clock = { .read = kvm_clock_get_cycles, .rating = 400, .mask = CLOCKSOURCE_MASK(64), - .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VERIFY_PERCPU, .enable = kvm_cs_enable, }; EXPORT_SYMBOL_GPL(kvm_clock); diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c index f70dffc..5628917 100644 --- a/arch/x86/kernel/tsc.c +++ b/arch/x86/kernel/tsc.c @@ -1151,7 +1151,8 @@ static struct clocksource clocksource_tsc = { .mask = CLOCKSOURCE_MASK(64), .flags = CLOCK_SOURCE_IS_CONTINUOUS | CLOCK_SOURCE_VALID_FOR_HRES | - CLOCK_SOURCE_MUST_VERIFY, + CLOCK_SOURCE_MUST_VERIFY | + CLOCK_SOURCE_VERIFY_PERCPU, .vdso_clock_mode= VDSO_CLOCKMODE_TSC, .enable = tsc_cs_enable, .resume = tsc_resume, diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h index 86d143d..83a3ebf 100644 --- a/include/linux/clocksource.h +++ b/include/linux/clocksource.h @@ -131,7 +131,7 @@ struct clocksource { #define CLOCK_SOURCE_UNSTABLE 0x40 #define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80 #define CLOCK_SOURCE_RESELECT 0x100 - +#define CLOCK_SOURCE_VERIFY_PERCPU 0x200 /* simplify initialization of mask field */ #define CLOCKSOURCE_MASK(bits) GENMASK_ULL((bits) - 1, 0) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 4663b86..23bcefe 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -211,6 +211,78 @@ static void clocksource_watchdog_inject_delay(void) WARN_ON_ONCE(injectfail < 0); } +static struct clocksource *clocksource_verify_work_cs; +static DEFINE_PER_CPU(u64, csnow_mid); +static cpumask_t cpus_ahead; +static cpumask_t cpus_behind; + +static void clocksource_verify_one_cpu(void *csin) +{ + struct clocksource *cs = (struct clocksource *)csin; + + __this_cpu_write(csnow_mid, cs->read(cs)); +} + +static void clocksource_verify_percpu_wq(struct work_struct *unused) +{ + int cpu; + struct clocksource *cs; + int64_t cs_nsec; + u64 csnow_begin; + u64 csnow_end; + u64 delta; + + cs = smp_load_acquire(&clocksource_verify_work_cs); // pairs with release + if (WARN_ON_ONCE(!cs)) + return; + pr_warn("Checking clocksource 
%s synchronization from CPU %d.\n", + cs->name, smp_processor_id()); + cpumask_clear(&cpus_ahead); + cpumask_clear(&cpus_behind); + csnow_begin = cs->read(cs); + smp_call_function(clocksource_verify_one_cpu, cs, 1); + csnow_end = cs->read(cs); + for_each_online_cpu(cpu) { + if (cpu == smp_processor_id()) + continue; + delta = (per_cpu(csnow_mid, cpu) - csnow_begin) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_behind); + delta = (csnow_end - per_cpu(csnow_mid, cpu)) & cs->mask; + if ((s64)delta < 0) + cpumask_set_cpu(cpu, &cpus_ahead); + } + if (!cpuma
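For illustration only (not part of the patch series): the ahead/behind classification above relies on a wraparound-safe comparison of two counter reads. A minimal userspace sketch of that arithmetic, assuming a full 64-bit clocksource mask as used by the TSC and kvm-clock entries in this patch; all values are made up.

#include <stdint.h>
#include <stdio.h>

/*
 * Wraparound-safe "who is ahead?" check.  CS_MASK stands in for cs->mask;
 * with all 64 bits in the mask, the masked difference is an ordinary
 * two's-complement subtraction and its sign says whether the first read
 * is ahead of or behind the second, even across a counter wrap.
 */
#define CS_MASK UINT64_MAX

static int64_t cs_delta(uint64_t later, uint64_t earlier)
{
	return (int64_t)((later - earlier) & CS_MASK);
}

int main(void)
{
	uint64_t csnow_begin = UINT64_MAX - 5;	/* local read just before the wrap */
	uint64_t csnow_mid = 10;		/* remote read just after the wrap */

	if (cs_delta(csnow_mid, csnow_begin) < 0)
		printf("remote CPU behind\n");
	else
		printf("remote CPU ahead by %lld counts\n",
		       (long long)cs_delta(csnow_mid, csnow_begin));
	return 0;
}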
[PATCH v2 clocksource 4/5] clocksource: Provide a module parameter to fuzz per-CPU clock checking
From: "Paul E. McKenney" Code that checks for clock desynchronization must itself be tested, so this commit creates a new clocksource.inject_delay_shift_percpu= kernel boot parameter that adds or subtracts a large value from the check read, using the specified bit of the CPU ID to determine whether to add or to subtract. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 9 + kernel/time/clocksource.c | 10 +- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 4c59813..ca64b0c 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -593,6 +593,15 @@ times the value specified for inject_delay_freq of consecutive non-delays. + clocksource.inject_delay_shift_percpu= [KNL] + Shift count to obtain bit from CPU number to + determine whether to shift the time of the per-CPU + clock under test ahead or behind. For example, + setting this to the value four will result in + alternating groups of 16 CPUs shifting ahead and + the rest of the CPUs shifting behind. The default + value of -1 disable this type of error injection. + clocksource.max_read_retries= [KNL] Number of clocksource_watchdog() retries due to external delays before the clock will be marked diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 23bcefe..67cf41c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -190,6 +190,8 @@ static int inject_delay_freq; module_param(inject_delay_freq, int, 0644); static int inject_delay_run = 1; module_param(inject_delay_run, int, 0644); +static int inject_delay_shift_percpu = -1; +module_param(inject_delay_shift_percpu, int, 0644); static int max_read_retries = 3; module_param(max_read_retries, int, 0644); @@ -219,8 +221,14 @@ static cpumask_t cpus_behind; static void clocksource_verify_one_cpu(void *csin) { struct clocksource *cs = (struct clocksource *)csin; + s64 delta = 0; + int sign; - __this_cpu_write(csnow_mid, cs->read(cs)); + if (inject_delay_shift_percpu >= 0) { + sign = ((smp_processor_id() >> inject_delay_shift_percpu) & 0x1) * 2 - 1; + delta = sign * NSEC_PER_SEC; + } + __this_cpu_write(csnow_mid, cs->read(cs) + delta); } static void clocksource_verify_percpu_wq(struct work_struct *unused) -- 2.9.5
[PATCH v2 clocksource 2/5] clocksource: Retry clock read if long delays detected
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. This commit therefore re-reads the watchdog clock on either side of the read from the clock under test. If the watchdog clock shows an excessive time delta between its pair of reads, the reads are retried. The maximum number of retries is specified by a new kernel boot parameter clocksource.max_read_retries, which defaults to three, that is, up to four reads, one initial and up to three retries. If retries were required, a message is printed on the console. If the number of retries is exceeded, the clock under test will be marked unstable. However, the probability of this happening due to various sorts of delays is quite small. In addition, the reason (clock-read delays) for the unstable marking will be apparent. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier Reported-by: Chris Mason [ paulmck: Per-clocksource retries per Neeraj Upadhyay feedback. ] [ paulmck: Don't reset injectfail per Neeraj Upadhyay feedback. ] Signed-off-by: Paul E. McKenney --- kernel/time/clocksource.c | 23 +++ 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index 545889c..4663b86 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -124,6 +124,7 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating); */ #define WATCHDOG_INTERVAL (HZ >> 1) #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4) +#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6) static void clocksource_watchdog_work(struct work_struct *work) { @@ -213,9 +214,10 @@ static void clocksource_watchdog_inject_delay(void) static void clocksource_watchdog(struct timer_list *unused) { struct clocksource *cs; - u64 csnow, wdnow, cslast, wdlast, delta; - int64_t wd_nsec, cs_nsec; + u64 csnow, wdnow, wdagain, cslast, wdlast, delta; + int64_t wd_nsec, wdagain_nsec, wderr_nsec = 0, cs_nsec; int next_cpu, reset_pending; + int nretries; spin_lock(&watchdog_lock); if (!watchdog_running) @@ -224,6 +226,7 @@ static void clocksource_watchdog(struct timer_list *unused) reset_pending = atomic_read(&watchdog_reset_pending); list_for_each_entry(cs, &watchdog_list, wd_list) { + nretries = 0; /* Clocksource already marked unstable? 
*/ if (cs->flags & CLOCK_SOURCE_UNSTABLE) { @@ -232,11 +235,23 @@ static void clocksource_watchdog(struct timer_list *unused) continue; } +retry: local_irq_disable(); - csnow = cs->read(cs); - clocksource_watchdog_inject_delay(); wdnow = watchdog->read(watchdog); + clocksource_watchdog_inject_delay(); + csnow = cs->read(cs); + wdagain = watchdog->read(watchdog); local_irq_enable(); + delta = clocksource_delta(wdagain, wdnow, watchdog->mask); + wdagain_nsec = clocksource_cyc2ns(delta, watchdog->mult, watchdog->shift); + if (wdagain_nsec < 0 || wdagain_nsec > WATCHDOG_MAX_SKEW) { + wderr_nsec = wdagain_nsec; + if (nretries++ < max_read_retries) + goto retry; + } + if (nretries) + pr_warn("timekeeping watchdog on CPU%d: %s read-back delay of %lldns, attempt %d\n", + smp_processor_id(), watchdog->name, wderr_nsec, nretries); /* Clocksource initialized ? */ if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) || -- 2.9.5
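For illustration only (not part of the patch): a userspace model of the bracketed read, that is, two "watchdog" reads around the read of the clock under test, with a retry when the bracket itself took suspiciously long. Here fake_wd_read() is a hypothetical stand-in that occasionally adds a large delay, playing the role of an SMI or vCPU preemption; the constants mirror WATCHDOG_MAX_SKEW and the default max_read_retries.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WATCHDOG_MAX_SKEW_NS	(1000000000LL >> 6)	/* mirrors NSEC_PER_SEC >> 6 */
#define MAX_READ_RETRIES	3

/* Hypothetical nanosecond counter with occasional large jumps. */
static int64_t fake_wd_read(void)
{
	static int64_t now;

	now += 100 + (rand() % 8 == 0 ? 200 * 1000 * 1000 : 0);
	return now;
}

int main(void)
{
	int64_t wdnow, wdagain, wd_delay;
	int nretries = 0;

	do {
		wdnow = fake_wd_read();
		/* ... the clock under test would be read here ... */
		wdagain = fake_wd_read();
		wd_delay = wdagain - wdnow;
	} while ((wd_delay < 0 || wd_delay > WATCHDOG_MAX_SKEW_NS) &&
		 nretries++ < MAX_READ_RETRIES);

	if (nretries)
		printf("read-back delay %lld ns, retried %d time(s)\n",
		       (long long)wd_delay, nretries);
	else
		printf("clean read, delay %lld ns\n", (long long)wd_delay);
	return 0;
}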
[PATCH v2 clocksource 1/5] clocksource: Provide module parameters to inject delays in watchdog
From: "Paul E. McKenney" When the clocksource watchdog marks a clock as unstable, this might be due to that clock being unstable or it might be due to delays that happen to occur between the reads of the two clocks. Yes, interrupts are disabled across those two reads, but there are no shortage of things that can delay interrupts-disabled regions of code ranging from SMI handlers to vCPU preemption. It would be good to have some indication as to why the clock was marked unstable. The first step is a way of injecting such delays, and this commit therefore provides a clocksource.inject_delay_freq and clocksource.inject_delay_run kernel boot parameters that specify that sufficient delay be injected to cause the clocksource_watchdog() function to mark a clock unstable. This delay is injected every Nth set of M calls to clocksource_watchdog(), where N is the value specified for the inject_delay_freq boot parameter and M is the value specified for the inject_delay_run boot parameter. Values of zero or less for either parameter disable delay injection, and the default for clocksource.inject_delay_freq is zero, that is, disabled. The default for clocksource.inject_delay_run is the value one, that is single-call runs. This facility is intended for diagnostic use only, and should be avoided on production systems. Cc: John Stultz Cc: Thomas Gleixner Cc: Stephen Boyd Cc: Jonathan Corbet Cc: Mark Rutland Cc: Marc Zyngier [ paulmck: Apply Rik van Riel feedback. ] Reported-by: Chris Mason Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 22 kernel/time/clocksource.c | 27 + 2 files changed, 49 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 9e3cdb2..4c59813 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -577,6 +577,28 @@ loops can be debugged more effectively on production systems. + clocksource.inject_delay_freq= [KNL] + Number of runs of calls to clocksource_watchdog() + before delays are injected between reads from the + two clocksources. Values less than or equal to + zero disable this delay injection. These delays + can cause clocks to be marked unstable, so use + of this parameter should therefore be avoided on + production systems. Defaults to zero (disabled). + + clocksource.inject_delay_run= [KNL] + Run lengths of clocksource_watchdog() delay + injections. Specifying the value 8 will result + in eight consecutive delays followed by eight + times the value specified for inject_delay_freq + of consecutive non-delays. + + clocksource.max_read_retries= [KNL] + Number of clocksource_watchdog() retries due to + external delays before the clock will be marked + unstable. Defaults to three retries, that is, + four attempts to read the clock under test. + clearcpuid=BITNUM[,BITNUM...] [X86] Disable CPUID feature X for the kernel. 
See arch/x86/include/asm/cpufeatures.h for the valid bit diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c index cce484a..545889c 100644 --- a/kernel/time/clocksource.c +++ b/kernel/time/clocksource.c @@ -14,6 +14,7 @@ #include /* for spin_unlock_irq() using preempt_count() m68k */ #include #include +#include #include "tick-internal.h" #include "timekeeping_internal.h" @@ -184,6 +185,31 @@ void clocksource_mark_unstable(struct clocksource *cs) spin_unlock_irqrestore(&watchdog_lock, flags); } +static int inject_delay_freq; +module_param(inject_delay_freq, int, 0644); +static int inject_delay_run = 1; +module_param(inject_delay_run, int, 0644); +static int max_read_retries = 3; +module_param(max_read_retries, int, 0644); + +static void clocksource_watchdog_inject_delay(void) +{ + int i; + static int injectfail = -1; + + if (inject_delay_freq <= 0 || inject_delay_run <= 0) + return; + if (injectfail < 0 || injectfail > INT_MAX / 2) + injectfail = inject_delay_run; + if (!(++injectfail / inject_delay_run % inject_delay_freq)) { + printk("%s(): Injecting delay.\n", __func__); + for (i = 0; i < 2 * WATCHDOG_THRESHOLD / NSEC_PER_MSEC; i++) + udelay(1000); + printk("%s(): Done injecting delay.\n", __func__); + } + WARN_O
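For illustration only (not part of the patch): the "every Nth set of M calls" cadence is easiest to see with a small userspace model of the counter arithmetic in clocksource_watchdog_inject_delay(); the parameter values below are arbitrary examples.

#include <limits.h>
#include <stdio.h>

int main(void)
{
	int inject_delay_freq = 2;	/* N: delay every 2nd group of calls */
	int inject_delay_run = 3;	/* M: each group is 3 calls long */
	int injectfail = -1;
	int call;

	for (call = 1; call <= 12; call++) {
		if (injectfail < 0 || injectfail > INT_MAX / 2)
			injectfail = inject_delay_run;
		if (!(++injectfail / inject_delay_run % inject_delay_freq))
			printf("call %2d: delay injected\n", call);
		else
			printf("call %2d: no delay\n", call);
	}
	return 0;
}

The output shows runs of inject_delay_run delayed calls separated by (inject_delay_freq - 1) * inject_delay_run undelayed calls, which is exactly the cadence described in the commit log above.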
[PATCH v3 sl-b 1/6] mm: Add mem_dump_obj() to print source of memory block
From: "Paul E. McKenney" There are kernel facilities such as per-CPU reference counts that give error messages in generic handlers or callbacks, whose messages are unenlightening. In the case of per-CPU reference-count underflow, this is not a problem when creating a new use of this facility because in that case the bug is almost certainly in the code implementing that new use. However, trouble arises when deploying across many systems, which might exercise corner cases that were not seen during development and testing. Here, it would be really nice to get some kind of hint as to which of several uses the underflow was caused by. This commit therefore exposes a mem_dump_obj() function that takes a pointer to memory (which must still be allocated if it has been dynamically allocated) and prints available information on where that memory came from. This pointer can reference the middle of the block as well as the beginning of the block, as needed by things like RCU callback functions and timer handlers that might not know where the beginning of the memory block is. These functions and handlers can use mem_dump_obj() to print out better hints as to where the problem might lie. The information printed can depend on kernel configuration. For example, the allocation return address can be printed only for slab and slub, and even then only when the necessary debug has been enabled. For slab, build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space to the next power of two or use the SLAB_STORE_USER when creating the kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create() if more focused use is desired. Also for slub, use CONFIG_STACKTRACE to enable printing of the allocation-time stack trace. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko [ paulmck: Convert to printing and change names per Joonsoo Kim. ] [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ] [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ] [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ] Signed-off-by: Paul E. 
McKenney --- include/linux/mm.h | 2 ++ include/linux/slab.h | 2 ++ mm/slab.c| 20 ++ mm/slab.h| 12 + mm/slab_common.c | 74 mm/slob.c| 6 + mm/slub.c| 36 + mm/util.c| 24 + 8 files changed, 176 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index ef360fe..1eea266 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3153,5 +3153,7 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping, extern int sysctl_nr_trim_pages; +void mem_dump_obj(void *object); + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/include/linux/slab.h b/include/linux/slab.h index dd6897f..169b511 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -186,6 +186,8 @@ void kfree(const void *); void kfree_sensitive(const void *); size_t __ksize(const void *); size_t ksize(const void *); +bool kmem_valid_obj(void *object); +void kmem_dump_obj(void *object); #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR void __check_heap_object(const void *ptr, unsigned long n, struct page *page, diff --git a/mm/slab.c b/mm/slab.c index b111356..66f00ad 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3633,6 +3633,26 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t flags, EXPORT_SYMBOL(__kmalloc_node_track_caller); #endif /* CONFIG_NUMA */ +void kmem_obj_info(struct kmem_obj_info *kpp, void *object, struct page *page) +{ + struct kmem_cache *cachep; + unsigned int objnr; + void *objp; + + kpp->kp_ptr = object; + kpp->kp_page = page; + cachep = page->slab_cache; + kpp->kp_slab_cache = cachep; + objp = object - obj_offset(cachep); + kpp->kp_data_offset = obj_offset(cachep); + page = virt_to_head_page(objp); + objnr = obj_to_index(cachep, page, objp); + objp = index_to_obj(cachep, page, objnr); + kpp->kp_objp = objp; + if (DEBUG && cachep->flags & SLAB_STORE_USER) + kpp->kp_ret = *dbg_userword(cachep, objp); +} + /** * __do_kmalloc - allocate memory * @size: how many bytes of memory are required. diff --git a/mm/slab.h b/mm/slab.h index 6d7c6a5..0dc705b 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -630,4 +630,16 @@ static inline bool slab_want_init_on_free(struct kmem_cache *c) return false; } +#define KS_ADDRS_COUNT 16 +struct kmem_obj_info { + void *kp_ptr; + struct page *kp_page; + void *kp_objp; + unsigned long kp_data_offset; + struct kmem_cache *kp_slab_cache; + void
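As an illustration of the intended use (hypothetical caller, not part of the patch): an RCU callback or timer handler that only receives an interior pointer can hand that pointer straight to mem_dump_obj() when something looks wrong. The structure, state value, and callback below are invented for the example.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int state;
	struct rcu_head rh;	/* interior pointer handed to the callback */
};

static void foo_rcu_cb(struct rcu_head *rhp)
{
	struct foo *fp = container_of(rhp, struct foo, rh);

	if (WARN_ON_ONCE(fp->state != 42)) {
		pr_err("unexpected foo state %d", fp->state);
		mem_dump_obj(fp);	/* hint at where this block was allocated */
	}
	kfree(fp);
}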
[PATCH v3 sl-b 5/6] rcu: Make call_rcu() print mem_dump_obj() info for double-freed callback
From: "Paul E. McKenney" The debug-object double-free checks in __call_rcu() print out the RCU callback function, which is usually sufficient to track down the double free. However, all uses of things like queue_rcu_work() will have the same RCU callback function (rcu_work_rcufn() in this case), so a diagnostic message for a double queue_rcu_work() needs more than just the callback function. This commit therefore calls mem_dump_obj() to dump out any additional available information on the double-freed callback. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b408dca..80ceee5 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2959,6 +2959,7 @@ static void check_cb_ovld(struct rcu_data *rdp) static void __call_rcu(struct rcu_head *head, rcu_callback_t func) { + static atomic_t doublefrees; unsigned long flags; struct rcu_data *rdp; bool was_alldone; @@ -2972,8 +2973,10 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) * Use rcu:rcu_callback trace event to find the previous * time callback was passed to __call_rcu(). */ - WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pS()!!!\n", - head, head->func); + if (atomic_inc_return(&doublefrees) < 4) { + pr_err("%s(): Double-freed CB %p->%pS()!!! ", __func__, head, head->func); + mem_dump_obj(head); + } WRITE_ONCE(head->func, rcu_leak_callback); return; } -- 2.9.5
[PATCH v3 sl-b 2/6] mm: Make mem_dump_obj() handle NULL and zero-sized pointers
From: "Paul E. McKenney" This commit makes mem_dump_obj() call out NULL and zero-sized pointers specially instead of classifying them as non-paged memory. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko Acked-by: Vlastimil Babka Signed-off-by: Paul E. McKenney --- mm/util.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/util.c b/mm/util.c index f2e0c4d9..f7c94c8 100644 --- a/mm/util.c +++ b/mm/util.c @@ -985,7 +985,12 @@ int __weak memcmp_pages(struct page *page1, struct page *page2) void mem_dump_obj(void *object) { if (!virt_addr_valid(object)) { - pr_cont(" non-paged (local) memory.\n"); + if (object == NULL) + pr_cont(" NULL pointer.\n"); + else if (object == ZERO_SIZE_PTR) + pr_cont(" zero-size pointer.\n"); + else + pr_cont(" non-paged (local) memory.\n"); return; } if (kmem_valid_obj(object)) { -- 2.9.5
[PATCH v3 sl-b 3/6] mm: Make mem_dump_obj() handle vmalloc() memory
From: "Paul E. McKenney" This commit adds vmalloc() support to mem_dump_obj(). Note that the vmalloc_dump_obj() function combines the checking and dumping, in contrast with the split between kmem_valid_obj() and kmem_dump_obj(). The reason for the difference is that the checking in the vmalloc() case involves acquiring a global lock, and redundant acquisitions of global locks should be avoided, even on not-so-fast paths. Note that this change causes on-stack variables to be reported as vmalloc() storage from kernel_clone() or similar, depending on the degree of inlining that your compiler does. This is likely more helpful than the earlier "non-paged (local) memory". Cc: Andrew Morton Cc: Joonsoo Kim Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- include/linux/vmalloc.h | 6 ++ mm/util.c | 14 -- mm/vmalloc.c| 12 3 files changed, 26 insertions(+), 6 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 938eaf9..c89c2be 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -248,4 +248,10 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) int register_vmap_purge_notifier(struct notifier_block *nb); int unregister_vmap_purge_notifier(struct notifier_block *nb); +#ifdef CONFIG_MMU +bool vmalloc_dump_obj(void *object); +#else +static inline bool vmalloc_dump_obj(void *object) { return false; } +#endif + #endif /* _LINUX_VMALLOC_H */ diff --git a/mm/util.c b/mm/util.c index f7c94c8..dcde696 100644 --- a/mm/util.c +++ b/mm/util.c @@ -984,18 +984,20 @@ int __weak memcmp_pages(struct page *page1, struct page *page2) */ void mem_dump_obj(void *object) { + if (kmem_valid_obj(object)) { + kmem_dump_obj(object); + return; + } + if (vmalloc_dump_obj(object)) + return; if (!virt_addr_valid(object)) { if (object == NULL) pr_cont(" NULL pointer.\n"); else if (object == ZERO_SIZE_PTR) pr_cont(" zero-size pointer.\n"); else - pr_cont(" non-paged (local) memory.\n"); - return; - } - if (kmem_valid_obj(object)) { - kmem_dump_obj(object); + pr_cont(" non-paged memory.\n"); return; } - pr_cont(" non-slab memory.\n"); + pr_cont(" non-slab/vmalloc memory.\n"); } diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6ae491a..7421719 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3431,6 +3431,18 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) } #endif /* CONFIG_SMP */ +bool vmalloc_dump_obj(void *object) +{ + struct vm_struct *vm; + void *objp = (void *)PAGE_ALIGN((unsigned long)object); + + vm = find_vm_area(objp); + if (!vm) + return false; + pr_cont(" vmalloc allocated at %pS\n", vm->caller); + return true; +} + #ifdef CONFIG_PROC_FS static void *s_start(struct seq_file *m, loff_t *pos) __acquires(&vmap_purge_lock) -- 2.9.5
[PATCH v3 sl-b 4/6] mm: Make mem_dump_obj() vmalloc() dumps include start and length
From: "Paul E. McKenney" This commit adds the starting address and number of pages to the vmalloc() information dumped by way of vmalloc_dump_obj(). Cc: Andrew Morton Cc: Joonsoo Kim Cc: Reported-by: Andrii Nakryiko Suggested-by: Vlastimil Babka Signed-off-by: Paul E. McKenney --- mm/vmalloc.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 7421719..77b1100 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3439,7 +3439,8 @@ bool vmalloc_dump_obj(void *object) vm = find_vm_area(objp); if (!vm) return false; - pr_cont(" vmalloc allocated at %pS\n", vm->caller); + pr_cont(" %u-page vmalloc region starting at %#lx allocated at %pS\n", + vm->nr_pages, (unsigned long)vm->addr, vm->caller); return true; } -- 2.9.5
[PATCH v3 sl-b 6/6] percpu_ref: Dump mem_dump_obj() info upon reference-count underflow
From: "Paul E. McKenney" Reference-count underflow for percpu_ref is detected in the RCU callback percpu_ref_switch_to_atomic_rcu(), and the resulting warning does not print anything allowing easy identification of which percpu_ref use case is underflowing. This is of course not normally a problem when developing a new percpu_ref use case because it is most likely that the problem resides in this new use case. However, when deploying a new kernel to a large set of servers, the underflow might well be a new corner case in any of the old percpu_ref use cases. This commit therefore calls mem_dump_obj() to dump out any additional available information on the underflowing percpu_ref instance. Cc: Ming Lei Cc: Jens Axboe Cc: Joonsoo Kim Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- lib/percpu-refcount.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c index e59eda0..a1071cd 100644 --- a/lib/percpu-refcount.c +++ b/lib/percpu-refcount.c @@ -5,6 +5,7 @@ #include #include #include +#include #include /* @@ -168,6 +169,7 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) struct percpu_ref_data, rcu); struct percpu_ref *ref = data->ref; unsigned long __percpu *percpu_count = percpu_count_ptr(ref); + static atomic_t underflows; unsigned long count = 0; int cpu; @@ -191,9 +193,13 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) */ atomic_long_add((long)count - PERCPU_COUNT_BIAS, &data->count); - WARN_ONCE(atomic_long_read(&data->count) <= 0, - "percpu ref (%ps) <= 0 (%ld) after switching to atomic", - data->release, atomic_long_read(&data->count)); + if (WARN_ONCE(atomic_long_read(&data->count) <= 0, + "percpu ref (%ps) <= 0 (%ld) after switching to atomic", + data->release, atomic_long_read(&data->count)) && + atomic_inc_return(&underflows) < 4) { + pr_err("%s(): percpu_ref underflow", __func__); + mem_dump_obj(data); + } /* @ref is viewed as dead on all CPUs, send out switch confirmation */ percpu_ref_call_confirm_rcu(rcu); -- 2.9.5
[PATCH v2 sl-b 2/5] mm: Make mem_dump_obj() handle NULL and zero-sized pointers
From: "Paul E. McKenney" This commit makes mem_dump_obj() call out NULL and zero-sized pointers specially instead of classifying them as non-paged memory. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- mm/util.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/util.c b/mm/util.c index d0e60d2..8c2449f 100644 --- a/mm/util.c +++ b/mm/util.c @@ -985,7 +985,12 @@ int __weak memcmp_pages(struct page *page1, struct page *page2) void mem_dump_obj(void *object) { if (!virt_addr_valid(object)) { - pr_cont(" non-paged (local) memory.\n"); + if (object == NULL) + pr_cont(" NULL pointer.\n"); + else if (object == ZERO_SIZE_PTR) + pr_cont(" zero-size pointer.\n"); + else + pr_cont(" non-paged (local) memory.\n"); return; } if (kmem_valid_obj(object)) { -- 2.9.5
[PATCH v2 sl-b 4/5] rcu: Make call_rcu() print mem_dump_obj() info for double-freed callback
From: "Paul E. McKenney" The debug-object double-free checks in __call_rcu() print out the RCU callback function, which is usually sufficient to track down the double free. However, all uses of things like queue_rcu_work() will have the same RCU callback function (rcu_work_rcufn() in this case), so a diagnostic message for a double queue_rcu_work() needs more than just the callback function. This commit therefore calls mem_dump_obj() to dump out any additional available information on the double-freed callback. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index b6c9c49..464cf14 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2957,6 +2957,7 @@ static void check_cb_ovld(struct rcu_data *rdp) static void __call_rcu(struct rcu_head *head, rcu_callback_t func) { + static atomic_t doublefrees; unsigned long flags; struct rcu_data *rdp; bool was_alldone; @@ -2970,8 +2971,10 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) * Use rcu:rcu_callback trace event to find the previous * time callback was passed to __call_rcu(). */ - WARN_ONCE(1, "__call_rcu(): Double-freed CB %p->%pS()!!!\n", - head, head->func); + if (atomic_inc_return(&doublefrees) < 4) { + pr_err("%s(): Double-freed CB %p->%pS()!!! ", __func__, head, head->func); + mem_dump_obj(head); + } WRITE_ONCE(head->func, rcu_leak_callback); return; } -- 2.9.5
[PATCH v2 sl-b 5/5] percpu_ref: Dump mem_dump_obj() info upon reference-count underflow
From: "Paul E. McKenney" Reference-count underflow for percpu_ref is detected in the RCU callback percpu_ref_switch_to_atomic_rcu(), and the resulting warning does not print anything allowing easy identification of which percpu_ref use case is underflowing. This is of course not normally a problem when developing a new percpu_ref use case because it is most likely that the problem resides in this new use case. However, when deploying a new kernel to a large set of servers, the underflow might well be a new corner case in any of the old percpu_ref use cases. This commit therefore calls mem_dump_obj() to dump out any additional available information on the underflowing percpu_ref instance. Cc: Ming Lei Cc: Jens Axboe Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- lib/percpu-refcount.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c index e59eda0..a1071cd 100644 --- a/lib/percpu-refcount.c +++ b/lib/percpu-refcount.c @@ -5,6 +5,7 @@ #include #include #include +#include #include /* @@ -168,6 +169,7 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) struct percpu_ref_data, rcu); struct percpu_ref *ref = data->ref; unsigned long __percpu *percpu_count = percpu_count_ptr(ref); + static atomic_t underflows; unsigned long count = 0; int cpu; @@ -191,9 +193,13 @@ static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu) */ atomic_long_add((long)count - PERCPU_COUNT_BIAS, &data->count); - WARN_ONCE(atomic_long_read(&data->count) <= 0, - "percpu ref (%ps) <= 0 (%ld) after switching to atomic", - data->release, atomic_long_read(&data->count)); + if (WARN_ONCE(atomic_long_read(&data->count) <= 0, + "percpu ref (%ps) <= 0 (%ld) after switching to atomic", + data->release, atomic_long_read(&data->count)) && + atomic_inc_return(&underflows) < 4) { + pr_err("%s(): percpu_ref underflow", __func__); + mem_dump_obj(data); + } /* @ref is viewed as dead on all CPUs, send out switch confirmation */ percpu_ref_call_confirm_rcu(rcu); -- 2.9.5
[PATCH v2 sl-b 3/5] mm: Make mem_dump_obj() handle vmalloc() memory
From: "Paul E. McKenney" This commit adds vmalloc() support to mem_dump_obj(). Note that the vmalloc_dump_obj() function combines the checking and dumping, in contrast with the split between kmem_valid_obj() and kmem_dump_obj(). The reason for the difference is that the checking in the vmalloc() case involves acquiring a global lock, and redundant acquisitions of global locks should be avoided, even on not-so-fast paths. Note that this change causes on-stack variables to be reported as vmalloc() storage from kernel_clone() or similar, depending on the degree of inlining that your compiler does. This is likely more helpful than the earlier "non-paged (local) memory". Cc: Andrew Morton Cc: Joonsoo Kim Cc: Reported-by: Andrii Nakryiko Signed-off-by: Paul E. McKenney --- include/linux/vmalloc.h | 6 ++ mm/util.c | 12 +++- mm/vmalloc.c| 12 3 files changed, 25 insertions(+), 5 deletions(-) diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h index 938eaf9..c89c2be 100644 --- a/include/linux/vmalloc.h +++ b/include/linux/vmalloc.h @@ -248,4 +248,10 @@ pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) int register_vmap_purge_notifier(struct notifier_block *nb); int unregister_vmap_purge_notifier(struct notifier_block *nb); +#ifdef CONFIG_MMU +bool vmalloc_dump_obj(void *object); +#else +static inline bool vmalloc_dump_obj(void *object) { return false; } +#endif + #endif /* _LINUX_VMALLOC_H */ diff --git a/mm/util.c b/mm/util.c index 8c2449f..ee99a0a 100644 --- a/mm/util.c +++ b/mm/util.c @@ -984,6 +984,12 @@ int __weak memcmp_pages(struct page *page1, struct page *page2) */ void mem_dump_obj(void *object) { + if (kmem_valid_obj(object)) { + kmem_dump_obj(object); + return; + } + if (vmalloc_dump_obj(object)) + return; if (!virt_addr_valid(object)) { if (object == NULL) pr_cont(" NULL pointer.\n"); @@ -993,10 +999,6 @@ void mem_dump_obj(void *object) pr_cont(" non-paged (local) memory.\n"); return; } - if (kmem_valid_obj(object)) { - kmem_dump_obj(object); - return; - } - pr_cont(" non-slab memory.\n"); + pr_cont(" non-slab/vmalloc memory.\n"); } EXPORT_SYMBOL_GPL(mem_dump_obj); diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 6ae491a..7421719 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3431,6 +3431,18 @@ void pcpu_free_vm_areas(struct vm_struct **vms, int nr_vms) } #endif /* CONFIG_SMP */ +bool vmalloc_dump_obj(void *object) +{ + struct vm_struct *vm; + void *objp = (void *)PAGE_ALIGN((unsigned long)object); + + vm = find_vm_area(objp); + if (!vm) + return false; + pr_cont(" vmalloc allocated at %pS\n", vm->caller); + return true; +} + #ifdef CONFIG_PROC_FS static void *s_start(struct seq_file *m, loff_t *pos) __acquires(&vmap_purge_lock) -- 2.9.5
[PATCH v2 sl-b 1/5] mm: Add mem_dump_obj() to print source of memory block
From: "Paul E. McKenney" There are kernel facilities such as per-CPU reference counts that give error messages in generic handlers or callbacks, whose messages are unenlightening. In the case of per-CPU reference-count underflow, this is not a problem when creating a new use of this facility because in that case the bug is almost certainly in the code implementing that new use. However, trouble arises when deploying across many systems, which might exercise corner cases that were not seen during development and testing. Here, it would be really nice to get some kind of hint as to which of several uses the underflow was caused by. This commit therefore exposes a mem_dump_obj() function that takes a pointer to memory (which must still be allocated if it has been dynamically allocated) and prints available information on where that memory came from. This pointer can reference the middle of the block as well as the beginning of the block, as needed by things like RCU callback functions and timer handlers that might not know where the beginning of the memory block is. These functions and handlers can use mem_dump_obj() to print out better hints as to where the problem might lie. The information printed can depend on kernel configuration. For example, the allocation return address can be printed only for slab and slub, and even then only when the necessary debug has been enabled. For slab, build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space to the next power of two or use the SLAB_STORE_USER when creating the kmem_cache structure. For slub, build with CONFIG_SLUB_DEBUG=y and boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create() if more focused use is desired. Also for slub, use CONFIG_STACKTRACE to enable printing of the allocation-time stack trace. Cc: Christoph Lameter Cc: Pekka Enberg Cc: David Rientjes Cc: Joonsoo Kim Cc: Andrew Morton Cc: Reported-by: Andrii Nakryiko [ paulmck: Convert to printing and change names per Joonsoo Kim. ] [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ] Signed-off-by: Paul E. 
McKenney --- include/linux/mm.h | 2 ++ include/linux/slab.h | 2 ++ mm/slab.c| 28 + mm/slab.h| 11 + mm/slab_common.c | 69 mm/slob.c| 7 ++ mm/slub.c| 40 ++ mm/util.c| 25 +++ 8 files changed, 184 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index ef360fe..1eea266 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3153,5 +3153,7 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping, extern int sysctl_nr_trim_pages; +void mem_dump_obj(void *object); + #endif /* __KERNEL__ */ #endif /* _LINUX_MM_H */ diff --git a/include/linux/slab.h b/include/linux/slab.h index dd6897f..169b511 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -186,6 +186,8 @@ void kfree(const void *); void kfree_sensitive(const void *); size_t __ksize(const void *); size_t ksize(const void *); +bool kmem_valid_obj(void *object); +void kmem_dump_obj(void *object); #ifdef CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR void __check_heap_object(const void *ptr, unsigned long n, struct page *page, diff --git a/mm/slab.c b/mm/slab.c index b111356..72b6743 100644 --- a/mm/slab.c +++ b/mm/slab.c @@ -3602,6 +3602,34 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *cachep, EXPORT_SYMBOL(kmem_cache_alloc_node_trace); #endif +void kmem_provenance(struct kmem_provenance *kpp) +{ +#ifdef DEBUG + struct kmem_cache *cachep; + void *object = kpp->kp_ptr; + unsigned int objnr; + void *objp; + struct page *page = kpp->kp_page; + + cachep = page->slab_cache; + if (!(cachep->flags & SLAB_STORE_USER)) { + kpp->kp_ret = NULL; + goto nodebug; + } + objp = object - obj_offset(cachep); + page = virt_to_head_page(objp); + objnr = obj_to_index(cachep, page, objp); + objp = index_to_obj(cachep, page, objnr); + kpp->kp_objp = objp; + kpp->kp_ret = *dbg_userword(cachep, objp); +nodebug: +#else + kpp->kp_ret = NULL; +#endif + if (kpp->kp_nstack) + kpp->kp_stack[0] = NULL; +} + static __always_inline void * __do_kmalloc_node(size_t size, gfp_t flags, int node, unsigned long caller) { diff --git a/mm/slab.h b/mm/slab.h index 6d7c6a5..28a41d5 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -630,4 +630,15 @@ static inline bool slab_want_init_on_free(struct kmem_cache *c) return false; } +#define KS_ADDRS_COUNT 16 +struct kmem_provenance { + void *kp_ptr; + struct page *kp_page; + void *kp_objp; + void *kp_ret; + void *kp_stack[KS_ADDRS_COUNT]; + int kp_nstack; +}; +void kmem_proven
[PATCH tip/core/rcu 3/4] rcu: Run rcuo kthreads at elevated priority in CONFIG_RCU_BOOST kernels
From: "Paul E. McKenney" The priority level of the rcuo kthreads is the system administrator's responsibility, but kernels that priority-boost RCU readers probably need the rcuo kthreads running at the rcutree.kthread_prio level. This commit therefore sets these kthreads to that priority level at creation time, providing a sensible default. The system administrator is free to adjust as needed at any time. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index fca31c6..7e33dae0 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2197,6 +2197,7 @@ static int rcu_nocb_gp_kthread(void *arg) { struct rcu_data *rdp = arg; + rcu_cpu_kthread_setup(-1); for (;;) { WRITE_ONCE(rdp->nocb_gp_loops, rdp->nocb_gp_loops + 1); nocb_gp_wait(rdp); @@ -2298,6 +2299,7 @@ static int rcu_nocb_cb_kthread(void *arg) // Each pass through this loop does one callback batch, and, // if there are no more ready callbacks, waits for them. + rcu_cpu_kthread_setup(-1); for (;;) { nocb_cb_wait(rdp); cond_resched_tasks_rcu_qs(); -- 2.9.5
[PATCH tip/core/rcu 4/4] rcutorture: Fix testing of RCU priority boosting
From: "Paul E. McKenney" Currently, rcutorture refuses to test RCU priority boosting in CONFIG_HOTPLUG_CPU=y kernels, which are the only kind normally built on x86 these days. This commit therefore updates rcutorture's tests of RCU priority boosting to make them safe for CPU hotplug. However, these tests will fail unless TIMER_SOFTIRQ runs at realtime priority, which does not happen in current mainline. This commit therefore also refuses to test RCU priority boosting except in kernels built with CONFIG_PREEMPT_RT=y. While in the area, this commt adds some debug output at boost-fail time that helps diagnose the cause of the failure, for example, failing to run TIMER_SOFTIRQ at realtime priority. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 36 ++-- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 8e93f2e..2440f89 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -245,11 +245,11 @@ static const char *rcu_torture_writer_state_getname(void) return rcu_torture_writer_state_names[i]; } -#if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) -#define rcu_can_boost() 1 -#else /* #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */ -#define rcu_can_boost() 0 -#endif /* #else #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */ +#if defined(CONFIG_RCU_BOOST) && defined(CONFIG_PREEMPT_RT) +# define rcu_can_boost() 1 +#else +# define rcu_can_boost() 0 +#endif #ifdef CONFIG_RCU_TRACE static u64 notrace rcu_trace_clock_local(void) @@ -923,9 +923,13 @@ static void rcu_torture_enable_rt_throttle(void) static bool rcu_torture_boost_failed(unsigned long start, unsigned long end) { + static int dbg_done; + if (end - start > test_boost_duration * HZ - HZ / 2) { VERBOSE_TOROUT_STRING("rcu_torture_boost boosting failed"); n_rcu_torture_boost_failure++; + if (!xchg(&dbg_done, 1) && cur_ops->gp_kthread_dbg) + cur_ops->gp_kthread_dbg(); return true; /* failed */ } @@ -948,8 +952,8 @@ static int rcu_torture_boost(void *arg) init_rcu_head_on_stack(&rbi.rcu); /* Each pass through the following loop does one boost-test cycle. */ do { - /* Track if the test failed already in this test interval? */ - bool failed = false; + bool failed = false; // Test failed already in this test interval + bool firsttime = true; /* Increment n_rcu_torture_boosts once per boost-test */ while (!kthread_should_stop()) { @@ -975,18 +979,17 @@ static int rcu_torture_boost(void *arg) /* Do one boost-test interval. */ endtime = oldstarttime + test_boost_duration * HZ; - call_rcu_time = jiffies; while (time_before(jiffies, endtime)) { /* If we don't have a callback in flight, post one. */ if (!smp_load_acquire(&rbi.inflight)) { /* RCU core before ->inflight = 1. */ smp_store_release(&rbi.inflight, 1); - call_rcu(&rbi.rcu, rcu_torture_boost_cb); + cur_ops->call(&rbi.rcu, rcu_torture_boost_cb); /* Check if the boost test failed */ - failed = failed || -rcu_torture_boost_failed(call_rcu_time, -jiffies); + if (!firsttime && !failed) + failed = rcu_torture_boost_failed(call_rcu_time, jiffies); call_rcu_time = jiffies; + firsttime = false; } if (stutter_wait("rcu_torture_boost")) sched_set_fifo_low(current); @@ -999,7 +1002,7 @@ static int rcu_torture_boost(void *arg) * this case the boost check would never happen in the above * loop so do another one here. 
*/ - if (!failed && smp_load_acquire(&rbi.inflight)) + if (!firsttime && !failed && smp_load_acquire(&rbi.inflight)) rcu_torture_boost_failed(call_rcu_time, jiffies); /* @@ -1025,6 +1028,9 @@ checkwait:if (stutter_wait("rcu_torture_boost")) sched_set_fifo_low(current); } while (!torture_must_stop()); + while (smp_load_acquire(&rbi.inflight)) + schedule_timeout_uninterruptible(1); // rcu_barrier() deadlocks. + /* Clean up and exit. */ wh
[PATCH tip/core/rcu 1/4] rcu: Expedite deboost in case of deferred quiescent state
From: "Paul E. McKenney" Historically, a task that has been subjected to RCU priority boosting is deboosted at rcu_read_unlock() time. However, with the advent of deferred quiescent states, if the outermost rcu_read_unlock() was invoked with either bottom halves, interrupts, or preemption disabled, the deboosting will be delayed for some time. During this time, a low-priority process might be incorrectly running at a high real-time priority level. Fortunately, rcu_read_unlock_special() already provides mechanisms for forcing a minimal deferral of quiescent states, at least for kernels built with CONFIG_IRQ_WORK=y. These mechanisms are currently used when expedited grace periods are pending that might be blocked by the current task. This commit therefore causes those mechanisms to also be used in cases where the current task has been or might soon be subjected to RCU priority boosting. Note that this applies to all kernels built with CONFIG_RCU_BOOST=y, regardless of whether or not they are also built with CONFIG_PREEMPT_RT=y. This approach assumes that kernels build for use with aggressive real-time applications are built with CONFIG_IRQ_WORK=y. It is likely to be far simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting scheme that works correctly in its absence. While in the area, alphabetize the rcu_preempt_deferred_qs_handler() function's local variables. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Lai Jiangshan Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 8b0feb2..fca31c6 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -660,9 +660,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp) static void rcu_read_unlock_special(struct task_struct *t) { unsigned long flags; + bool irqs_were_disabled; bool preempt_bh_were_disabled = !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)); - bool irqs_were_disabled; /* NMI handlers cannot block and cannot safely manipulate state. */ if (in_nmi()) @@ -671,30 +671,32 @@ static void rcu_read_unlock_special(struct task_struct *t) local_irq_save(flags); irqs_were_disabled = irqs_disabled_flags(flags); if (preempt_bh_were_disabled || irqs_were_disabled) { - bool exp; + bool expboost; // Expedited GP in flight or possible boosting. struct rcu_data *rdp = this_cpu_ptr(&rcu_data); struct rcu_node *rnp = rdp->mynode; - exp = (t->rcu_blocked_node && - READ_ONCE(t->rcu_blocked_node->exp_tasks)) || - (rdp->grpmask & READ_ONCE(rnp->expmask)); + expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) || + (rdp->grpmask & READ_ONCE(rnp->expmask)) || + (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && + t->rcu_blocked_node); // Need to defer quiescent state until everything is enabled. - if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) { + if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) { // Using softirq, safe to awaken, and either the - // wakeup is free or there is an expedited GP. + // wakeup is free or there is either an expedited + // GP in flight or a potential need to deboost. raise_softirq_irqoff(RCU_SOFTIRQ); } else { // Enabling BH or preempt does reschedule, so... - // Also if no expediting, slow is OK. - // Plus nohz_full CPUs eventually get tick enabled. + // Also if no expediting and no possible deboosting, + // slow is OK. 
Plus nohz_full CPUs eventually get + // tick enabled. set_tsk_need_resched(current); set_preempt_need_resched(); if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && - !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) { + expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) { // Get scheduler to re-evaluate and call hooks. // If !IRQ_WORK, FQS scan will eventually IPI. - init_irq_work(&rdp->defer_qs_iw, - rcu_preempt_deferred_qs_handler); + init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
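A minimal sketch of the IRQ_WORK pattern this patch leans on, using only calls that appear in the diff above (init_irq_work(), irq_work_queue(), set_tsk_need_resched(), set_preempt_need_resched()); the demo_* names are invented and this is not the patch's code. The point is that the self-interrupt arrives as soon as interrupts are re-enabled, at which point the scheduler runs its hooks and can deboost the task.

#include <linux/irq_work.h>
#include <linux/sched.h>

static struct irq_work demo_deferred_work;
static bool demo_work_pending;

static void demo_deferred_handler(struct irq_work *iwp)
{
        /* Runs from a self-IPI shortly after interrupts are re-enabled. */
        demo_work_pending = false;
}

static void demo_request_deferred_reschedule(void)
{
        /* Ask for a reschedule at the next opportunity... */
        set_tsk_need_resched(current);
        set_preempt_need_resched();

        /* ...and use IRQ_WORK to make such an opportunity arrive promptly. */
        if (IS_ENABLED(CONFIG_IRQ_WORK) && !demo_work_pending) {
                demo_work_pending = true;
                init_irq_work(&demo_deferred_work, demo_deferred_handler);
                irq_work_queue(&demo_deferred_work);
        }
}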
[PATCH tip/core/rcu 2/4] rcutorture: Make TREE03 use real-time tree.use_softirq setting
From: "Paul E. McKenney" TREE03 tests RCU priority boosting, which is a real-time feature. It would also be good if it tested something closer to what is actually used by the real-time folks. This commit therefore adds tree.use_softirq=0 to the TREE03 kernel boot parameters in TREE03.boot. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot index 1c21894..64f864f1 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot @@ -4,3 +4,4 @@ rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3 rcutree.kthread_prio=2 threadirqs +tree.use_softirq=0 -- 2.9.5
[PATCH tip/core/rcu 04/10] rculist: Replace reference to atomic_ops.rst
From: Akira Yokosawa The hlist_nulls_for_each_entry_rcu() docbook header references the atomic_ops.rst file, which was removed in commit f0400a77ebdc ("atomic: Delete obsolete documentation"). This commit therefore substitutes a section in memory-barriers.txt discussing the use of barrier() in loops. Cc: Peter Zijlstra Signed-off-by: Akira Yokosawa Signed-off-by: Paul E. McKenney --- include/linux/rculist_nulls.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h index ff3e947..d8afdb8 100644 --- a/include/linux/rculist_nulls.h +++ b/include/linux/rculist_nulls.h @@ -161,7 +161,7 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n) * * The barrier() is needed to make sure compiler doesn't cache first element [1], * as this loop can be restarted [2] - * [1] Documentation/core-api/atomic_ops.rst around line 114 + * [1] Documentation/memory-barriers.txt around line 1533 * [2] Documentation/RCU/rculist_nulls.rst around line 146 */ #define hlist_nulls_for_each_entry_rcu(tpos, pos, head, member) \ -- 2.9.5
[PATCH tip/core/rcu 01/10] rcu: Remove superfluous rdp fetch
From: Frederic Weisbecker Cc: Rafael J. Wysocki Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f521..cdf091f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -648,7 +648,6 @@ static noinstr void rcu_eqs_enter(bool user) instrumentation_begin(); trace_rcu_dyntick(TPS("Start"), rdp->dynticks_nesting, 0, atomic_read(&rdp->dynticks)); WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current)); - rdp = this_cpu_ptr(&rcu_data); rcu_prepare_for_idle(); rcu_preempt_deferred_qs(current); -- 2.9.5
[PATCH tip/core/rcu 03/10] rcu: Remove spurious instrumentation_end() in rcu_nmi_enter()
From: Zhouyi Zhou In rcu_nmi_enter(), there is an erroneous instrumentation_end() in the second branch of the "if" statement. Oddly enough, "objtool check -f vmlinux.o" fails to complain because it is unable to correctly cover all cases. Instead, objtool visits the third branch first, which marks the following trace_rcu_dyntick() as visited. This commit therefore removes the spurious instrumentation_end(). Fixes: 04b25a495bd6 ("rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr") Reported-by: Neeraj Upadhyay Signed-off-by: Zhouyi Zhou Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 1 - 1 file changed, 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index e62c2de..4d90f20 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1076,7 +1076,6 @@ noinstr void rcu_nmi_enter(void) } else if (!in_nmi()) { instrumentation_begin(); rcu_irq_enter_check_tick(); - instrumentation_end(); } else { instrumentation_begin(); } -- 2.9.5
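For context, a generic and deliberately simplified illustration of the rule the removed line violated: inside a noinstr function, instrumentable work is bracketed by instrumentation_begin()/instrumentation_end(), and every path must keep the two calls balanced. This is only a sketch of the annotation pattern, not a reconstruction of rcu_nmi_enter().

#include <linux/instrumentation.h>

noinstr void demo_noinstr_entry(bool do_instrumented_work)
{
        /* Non-instrumentable entry work goes here. */

        if (do_instrumented_work) {
                instrumentation_begin();
                /* Tracing, lockdep, and other instrumentable calls. */
                instrumentation_end();
        }

        /* An unbalanced instrumentation_end() on any path is a bug,
         * even if objtool happens not to flag it. */
}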
[PATCH tip/core/rcu 02/10] rcu: Fix CPU-offline trace in rcutree_dying_cpu
From: Neeraj Upadhyay The condition in the trace_rcu_grace_period() in rcutree_dying_cpu() is backwards, so that it uses the string "cpuofl" when the offline CPU is blocking the current grace period and "cpuofl-bgp" otherwise. Given that the "-bgp" stands for "blocking grace period", this is at best misleading. This commit therefore switches these strings in order to correctly trace whether the outgoing cpu blocks the current grace period. Signed-off-by: Neeraj Upadhyay Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index cdf091f..e62c2de 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2413,7 +2413,7 @@ int rcutree_dying_cpu(unsigned int cpu) blkd = !!(rnp->qsmask & rdp->grpmask); trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq), - blkd ? TPS("cpuofl") : TPS("cpuofl-bgp")); + blkd ? TPS("cpuofl-bgp") : TPS("cpuofl")); return 0; } -- 2.9.5
[PATCH tip/core/rcu 05/10] rcu: Fix kfree_rcu() docbook errors
From: Mauro Carvalho Chehab After commit 5130b8fd0690 ("rcu: Introduce kfree_rcu() single-argument macro"), kernel-doc now emits two warnings: ./include/linux/rcupdate.h:884: warning: Excess function parameter 'ptr' description in 'kfree_rcu' ./include/linux/rcupdate.h:884: warning: Excess function parameter 'rhf' description in 'kfree_rcu' That commit added some macro magic in order to call two different versions of kfree_rcu(), the first having just one argument and the second having two arguments. That makes it difficult to document the kfree_rcu() arguments in the docbook header. In order to make it clearer that this macro accepts optional arguments, this commit uses macro concatenation so that this macro changes from: #define kfree_rcu kvfree_rcu to: #define kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf) That not only helps kernel-doc understand the macro arguments, but also provides a better C definition that makes clearer that the first argument is mandatory and the second one is optional. Fixes: 5130b8fd0690 ("rcu: Introduce kfree_rcu() single-argument macro") Tested-by: Uladzislau Rezki (Sony) Signed-off-by: Mauro Carvalho Chehab Signed-off-by: Paul E. McKenney --- include/linux/rcupdate.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index bd04f72..5cc6dea 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -881,7 +881,7 @@ static inline notrace void rcu_read_unlock_sched_notrace(void) * The BUILD_BUG_ON check must not involve any function calls, hence the * checks are done in macros here. */ -#define kfree_rcu kvfree_rcu +#define kfree_rcu(ptr, rhf...) kvfree_rcu(ptr, ## rhf) /** * kvfree_rcu() - kvfree an object after a grace period. -- 2.9.5
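A standalone, compile-and-run illustration (plain userspace C, not kernel code) of why the ", ## rhf" form is friendlier to tools: one macro accepts either one or two arguments and forwards to an argument-counting dispatcher, which is the same general trick kvfree_rcu() itself uses. All demo_* names are invented.

#include <stdio.h>

/* Single entry point: the second argument is optional, as with kfree_rcu(). */
#define demo_free_rcu(ptr, rhf...) demo_kvfree_rcu(ptr, ## rhf)

#define demo_kvfree_rcu_arg_1(ptr) \
        printf("one arg:  free %s after a grace period\n", #ptr)
#define demo_kvfree_rcu_arg_2(ptr, rhf) \
        printf("two args: free %s via rcu_head field %s\n", #ptr, #rhf)

/* Pick the implementation according to how many arguments were passed. */
#define demo_select(_1, _2, NAME, ...) NAME
#define demo_kvfree_rcu(...) \
        demo_select(__VA_ARGS__, demo_kvfree_rcu_arg_2, \
                    demo_kvfree_rcu_arg_1)(__VA_ARGS__)

int main(void)
{
        int obj;

        demo_free_rcu(&obj);            /* expands to the one-argument form */
        demo_free_rcu(&obj, rh);        /* expands to the two-argument form */
        return 0;
}

Built with gcc or clang (named variadic arguments are a GNU extension), this prints the one-argument expansion first and the two-argument expansion second.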
[PATCH tip/core/rcu 07/10] rcu: Prevent dyntick-idle until ksoftirqd has been spawned
From: "Paul E. McKenney" After interrupts have enabled at boot but before some random point in early_initcall() processing, softirq processing is unreliable. If softirq sees a need to push softirq-handler invocation to ksoftirqd during this time, then those handlers can be delayed until the ksoftirqd kthreads have been spawned, which happens at some random point in the early_initcall() processing. In many cases, this delay is just fine. However, if the boot sequence blocks waiting for a wakeup from a softirq handler, this delay will result in a silent-hang deadlock. This commit therefore prevents these hangs by ensuring that the tick stays active until after the ksoftirqd kthreads have been spawned. This change causes the tick to eventually drain the backlog of delayed softirq handlers, breaking this deadlock. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 11 +++ 1 file changed, 11 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2d60377..36212de 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1255,6 +1255,11 @@ static void rcu_prepare_kthreads(int cpu) */ int rcu_needs_cpu(u64 basemono, u64 *nextevt) { + /* Through early_initcall(), need tick for softirq handlers. */ + if (!IS_ENABLED(CONFIG_HZ_PERIODIC) && !this_cpu_ksoftirqd()) { + *nextevt = 1; + return 1; + } *nextevt = KTIME_MAX; return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) && !rcu_segcblist_is_offloaded(&this_cpu_ptr(&rcu_data)->cblist); @@ -1350,6 +1355,12 @@ int rcu_needs_cpu(u64 basemono, u64 *nextevt) lockdep_assert_irqs_disabled(); + /* Through early_initcall(), need tick for softirq handlers. */ + if (!IS_ENABLED(CONFIG_HZ_PERIODIC) && !this_cpu_ksoftirqd()) { + *nextevt = 1; + return 1; + } + /* If no non-offloaded callbacks, RCU doesn't need the CPU. */ if (rcu_segcblist_empty(&rdp->cblist) || rcu_segcblist_is_offloaded(&this_cpu_ptr(&rcu_data)->cblist)) { -- 2.9.5
[PATCH tip/core/rcu 10/10] rcu/tree: Add a trace event for RCU CPU stall warnings
From: Sangmoon Kim This commit adds a trace event which allows tracing the beginnings of RCU CPU stall warnings on systems where sysctl_panic_on_rcu_stall is disabled. The first parameter is the name of RCU flavor like other trace events. The second parameter indicates whether this is a stall of an expedited grace period, a self-detected stall of a normal grace period, or a stall of a normal grace period detected by some CPU other than the one that is stalled. RCU CPU stall warnings are often caused by external-to-RCU issues, for example, in interrupt handling or task scheduling. Therefore, this event uses TRACE_EVENT, not TRACE_EVENT_RCU, to avoid requiring those interested in tracing RCU CPU stalls to rebuild their kernels with CONFIG_RCU_TRACE=y. Reviewed-by: Uladzislau Rezki (Sony) Reviewed-by: Neeraj Upadhyay Signed-off-by: Sangmoon Kim Signed-off-by: Paul E. McKenney --- include/trace/events/rcu.h | 28 kernel/rcu/tree_exp.h | 1 + kernel/rcu/tree_stall.h| 2 ++ 3 files changed, 31 insertions(+) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 5fc2940..c7711e9 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -432,6 +432,34 @@ TRACE_EVENT_RCU(rcu_fqs, __entry->cpu, __entry->qsevent) ); +/* + * Tracepoint for RCU stall events. Takes a string identifying the RCU flavor + * and a string identifying which function detected the RCU stall as follows: + * + * "StallDetected": Scheduler-tick detects other CPU's stalls. + * "SelfDetected": Scheduler-tick detects a current CPU's stall. + * "ExpeditedStall": Expedited grace period detects stalls. + */ +TRACE_EVENT(rcu_stall_warning, + + TP_PROTO(const char *rcuname, const char *msg), + + TP_ARGS(rcuname, msg), + + TP_STRUCT__entry( + __field(const char *, rcuname) + __field(const char *, msg) + ), + + TP_fast_assign( + __entry->rcuname = rcuname; + __entry->msg = msg; + ), + + TP_printk("%s %s", + __entry->rcuname, __entry->msg) +); + #endif /* #if defined(CONFIG_TREE_RCU) */ /* diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h index 6c6ff06..2796084 100644 --- a/kernel/rcu/tree_exp.h +++ b/kernel/rcu/tree_exp.h @@ -521,6 +521,7 @@ static void synchronize_rcu_expedited_wait(void) if (rcu_stall_is_suppressed()) continue; panic_on_rcu_stall(); + trace_rcu_stall_warning(rcu_state.name, TPS("ExpeditedStall")); pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {", rcu_state.name); ndetected = 0; diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h index 475b261..59b95cc 100644 --- a/kernel/rcu/tree_stall.h +++ b/kernel/rcu/tree_stall.h @@ -536,6 +536,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps) * See Documentation/RCU/stallwarn.rst for info on how to debug * RCU CPU stall warnings. */ + trace_rcu_stall_warning(rcu_state.name, TPS("StallDetected")); pr_err("INFO: %s detected stalls on CPUs/tasks:\n", rcu_state.name); rcu_for_each_leaf_node(rnp) { raw_spin_lock_irqsave_rcu_node(rnp, flags); @@ -606,6 +607,7 @@ static void print_cpu_stall(unsigned long gps) * See Documentation/RCU/stallwarn.rst for info on how to debug * RCU CPU stall warnings. */ + trace_rcu_stall_warning(rcu_state.name, TPS("SelfDetected")); pr_err("INFO: %s self-detected stall on CPU\n", rcu_state.name); raw_spin_lock_irqsave_rcu_node(rdp->mynode, flags); print_cpu_stall_info(smp_processor_id()); -- 2.9.5
[PATCH tip/core/rcu 09/10] rcu: Add explicit barrier() to __rcu_read_unlock()
From: "Paul E. McKenney" Because preemptible RCU's __rcu_read_unlock() is an external function, the rough equivalent of an implicit barrier() is inserted by the compiler. Except that there is a direct call to __rcu_read_unlock() in that same file, and compilers are getting to the point where they might choose to inline the fastpath of the __rcu_read_unlock() function. This commit therefore adds an explicit barrier() to the very beginning of __rcu_read_unlock(). Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 36212de..d9495de 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -393,8 +393,9 @@ void __rcu_read_unlock(void) { struct task_struct *t = current; + barrier(); // critical section before exit code. if (rcu_preempt_read_exit() == 0) { - barrier(); /* critical section before exit code. */ + barrier(); // critical-section exit before .s check. if (unlikely(READ_ONCE(t->rcu_read_unlock_special.s))) rcu_read_unlock_special(t); } -- 2.9.5
[PATCH tip/core/rcu 08/10] docs: Correctly spell Stephen Hemminger's name
From: "Paul E. McKenney" This commit replaces "Steve" with the his real name, which is "Stephen". Reported-by: Stephen Hemminger Signed-off-by: Paul E. McKenney --- Documentation/RCU/RTFP.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt index 3b0876c..588d973 100644 --- a/Documentation/RCU/RTFP.txt +++ b/Documentation/RCU/RTFP.txt @@ -847,7 +847,7 @@ Symposium on Distributed Computing} 'It's entirely possible that the current user could be replaced by RCU and/or seqlocks, and we could get rid of brlocks entirely.' . - Steve Hemminger responds by replacing them with RCU. + Stephen Hemminger responds by replacing them with RCU. } } -- 2.9.5
[PATCH tip/core/rcu 06/10] softirq: Don't try waking ksoftirqd before it has been spawned
From: "Paul E. McKenney" If there is heavy softirq activity, the softirq system will attempt to awaken ksoftirqd and will stop the traditional back-of-interrupt softirq processing. This is all well and good, but only if the ksoftirqd kthreads already exist, which is not the case during early boot, in which case the system hangs. One reproducer is as follows: tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs "TREE03" --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y CONFIG_NO_HZ_IDLE=y CONFIG_HZ_PERIODIC=n" --bootargs "threadirqs=1" --trust-make This commit therefore adds a couple of existence checks for ksoftirqd and forces back-of-interrupt softirq processing when ksoftirqd does not yet exist. With this change, the above test passes. Reported-by: Sebastian Andrzej Siewior Reported-by: Uladzislau Rezki Cc: Peter Zijlstra Cc: Thomas Gleixner [ paulmck: Remove unneeded check per Sebastian Siewior feedback. ] Signed-off-by: Paul E. McKenney --- kernel/softirq.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/softirq.c b/kernel/softirq.c index 9908ec4a..bad14ca 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -211,7 +211,7 @@ static inline void invoke_softirq(void) if (ksoftirqd_running(local_softirq_pending())) return; - if (!force_irqthreads) { + if (!force_irqthreads || !__this_cpu_read(ksoftirqd)) { #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK /* * We can safely execute softirq on the current stack if -- 2.9.5
[PATCH tip/core/rcu 6/6] rcuscale: Add kfree_rcu() single-argument scale test
From: "Uladzislau Rezki (Sony)" The single-argument variant of kfree_rcu() is currently not tested by any member of the rcutoture test suite. This commit therefore adds rcuscale code to test it. This testing is controlled by two new boolean module parameters, kfree_rcu_test_single and kfree_rcu_test_double. If one is set and the other not, only the corresponding variant is tested, otherwise both are tested, with the variant to be tested determined randomly on each invocation. Both of these module parameters are initialized to false, so setting either to true will test only that variant. Suggested-by: Paul E. McKenney Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 12 kernel/rcu/rcuscale.c | 15 ++- 2 files changed, 26 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0454572..84fce41 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4259,6 +4259,18 @@ rcuscale.kfree_rcu_test= [KNL] Set to measure performance of kfree_rcu() flooding. + rcuscale.kfree_rcu_test_double= [KNL] + Test the double-argument variant of kfree_rcu(). + If this parameter has the same value as + rcuscale.kfree_rcu_test_single, both the single- + and double-argument variants are tested. + + rcuscale.kfree_rcu_test_single= [KNL] + Test the single-argument variant of kfree_rcu(). + If this parameter has the same value as + rcuscale.kfree_rcu_test_double, both the single- + and double-argument variants are tested. + rcuscale.kfree_nthreads= [KNL] The number of threads running loops of kfree_rcu(). diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c index 06491d5..dca51fe 100644 --- a/kernel/rcu/rcuscale.c +++ b/kernel/rcu/rcuscale.c @@ -625,6 +625,8 @@ rcu_scale_shutdown(void *arg) torture_param(int, kfree_nthreads, -1, "Number of threads running loops of kfree_rcu()."); torture_param(int, kfree_alloc_num, 8000, "Number of allocations and frees done in an iteration."); torture_param(int, kfree_loops, 10, "Number of loops doing kfree_alloc_num allocations and frees."); +torture_param(bool, kfree_rcu_test_double, false, "Do we run a kfree_rcu() double-argument scale test?"); +torture_param(bool, kfree_rcu_test_single, false, "Do we run a kfree_rcu() single-argument scale test?"); static struct task_struct **kfree_reader_tasks; static int kfree_nrealthreads; @@ -644,10 +646,13 @@ kfree_scale_thread(void *arg) struct kfree_obj *alloc_ptr; u64 start_time, end_time; long long mem_begin, mem_during = 0; + bool kfree_rcu_test_both; + DEFINE_TORTURE_RANDOM(tr); VERBOSE_SCALEOUT_STRING("kfree_scale_thread task started"); set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids)); set_user_nice(current, MAX_NICE); + kfree_rcu_test_both = (kfree_rcu_test_single == kfree_rcu_test_double); start_time = ktime_get_mono_fast_ns(); @@ -670,7 +675,15 @@ kfree_scale_thread(void *arg) if (!alloc_ptr) return -ENOMEM; - kfree_rcu(alloc_ptr, rh); + // By default kfree_rcu_test_single and kfree_rcu_test_double are + // initialized to false. If both have the same value (false or true) + // both are randomly tested, otherwise only the one with value true + // is tested. + if ((kfree_rcu_test_single && !kfree_rcu_test_double) || + (kfree_rcu_test_both && torture_random(&tr) & 0x800)) + kfree_rcu(alloc_ptr); + else + kfree_rcu(alloc_ptr, rh); } cond_resched(); -- 2.9.5
[PATCH tip/core/rcu 2/6] kvfree_rcu: Use __GFP_NOMEMALLOC for single-argument kvfree_rcu()
From: "Paul E. McKenney" This commit applies the __GFP_NOMEMALLOC gfp flag to memory allocations carried out by the single-argument variant of kvfree_rcu(), thus avoiding this can-sleep code path from dipping into the emergency reserves. Acked-by: Michal Hocko Suggested-by: Michal Hocko Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 1f8c980..08b5044 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3519,7 +3519,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp, if (!bnode && can_alloc) { krc_this_cpu_unlock(*krcp, *flags); bnode = (struct kvfree_rcu_bulk_data *) - __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN); + __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN); *krcp = krc_this_cpu_lock(flags); } -- 2.9.5
[PATCH tip/core/rcu 4/6] kvfree_rcu: Replace __GFP_RETRY_MAYFAIL by __GFP_NORETRY
From: "Uladzislau Rezki (Sony)" __GFP_RETRY_MAYFAIL can spend quite a bit of time reclaiming, and this can be wasted effort given that there is a fallback code path in case memory allocation fails. __GFP_NORETRY does perform some light-weight reclaim, but it will fail under OOM conditions, allowing the fallback to be taken as an alternative to hard-OOMing the system. There is a four-way tradeoff that must be balanced: 1) Minimize use of the fallback path; 2) Avoid full-up OOM; 3) Do a light-wait allocation request; 4) Avoid dipping into the emergency reserves. Signed-off-by: Uladzislau Rezki (Sony) Acked-by: Michal Hocko Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 7ee83f3..0ecc1fb 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3517,8 +3517,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp, bnode = get_cached_bnode(*krcp); if (!bnode && can_alloc) { krc_this_cpu_unlock(*krcp, *flags); + + // __GFP_NORETRY - allows a light-weight direct reclaim + // what is OK from minimizing of fallback hitting point of + // view. Apart of that it forbids any OOM invoking what is + // also beneficial since we are about to release memory soon. + // + // __GFP_NOMEMALLOC - prevents from consuming of all the + // memory reserves. Please note we have a fallback path. + // + // __GFP_NOWARN - it is supposed that an allocation can + // be failed under low memory or high memory pressure + // scenarios. bnode = (struct kvfree_rcu_bulk_data *) - __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOMEMALLOC | __GFP_NOWARN); + __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); *krcp = krc_this_cpu_lock(flags); } -- 2.9.5
[PATCH tip/core/rcu 5/6] kvfree_rcu: Use same set of GFP flags as does single-argument
From: "Uladzislau Rezki (Sony)" Running an rcuscale stress-suite can lead to "Out of memory" of a system. This can happen under high memory pressure with a small amount of physical memory. For example, a KVM test configuration with 64 CPUs and 512 megabytes can result in OOM when running rcuscale with below parameters: ../kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig CONFIG_NR_CPUS=64 \ --bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 \ rcuscale.kfree_loops=1 torture.disable_onoff_at_boot" --trust-make [ 12.054448] kworker/1:1H invoked oom-killer: gfp_mask=0x2cc0(GFP_KERNEL|__GFP_NOWARN), order=0, oom_score_adj=0 [ 12.055303] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510 [ 12.055416] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 [ 12.056485] Workqueue: events_highpri fill_page_cache_func [ 12.056485] Call Trace: [ 12.056485] dump_stack+0x57/0x6a [ 12.056485] dump_header+0x4c/0x30a [ 12.056485] ? del_timer_sync+0x20/0x30 [ 12.056485] out_of_memory.cold.47+0xa/0x7e [ 12.056485] __alloc_pages_slowpath.constprop.123+0x82f/0xc00 [ 12.056485] __alloc_pages_nodemask+0x289/0x2c0 [ 12.056485] __get_free_pages+0x8/0x30 [ 12.056485] fill_page_cache_func+0x39/0xb0 [ 12.056485] process_one_work+0x1ed/0x3b0 [ 12.056485] ? process_one_work+0x3b0/0x3b0 [ 12.060485] worker_thread+0x28/0x3c0 [ 12.060485] ? process_one_work+0x3b0/0x3b0 [ 12.060485] kthread+0x138/0x160 [ 12.060485] ? kthread_park+0x80/0x80 [ 12.060485] ret_from_fork+0x22/0x30 [ 12.062156] Mem-Info: [ 12.062350] active_anon:0 inactive_anon:0 isolated_anon:0 [ 12.062350] active_file:0 inactive_file:0 isolated_file:0 [ 12.062350] unevictable:0 dirty:0 writeback:0 [ 12.062350] slab_reclaimable:2797 slab_unreclaimable:80920 [ 12.062350] mapped:1 shmem:2 pagetables:8 bounce:0 [ 12.062350] free:10488 free_pcp:1227 free_cma:0 ... [ 12.101610] Out of memory and no killable processes... [ 12.102042] Kernel panic - not syncing: System is deadlocked on memory [ 12.102583] CPU: 1 PID: 377 Comm: kworker/1:1H Not tainted 5.11.0-rc3+ #510 [ 12.102600] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-1 04/01/2014 Because kvfree_rcu() has a fallback path, memory allocation failure is not the end of the world. Furthermore, the added overhead of aggressive GFP settings must be balanced against the overhead of the fallback path, which is a cache miss for double-argument kvfree_rcu() and a call to synchronize_rcu() for single-argument kvfree_rcu(). The current choice of GFP_KERNEL|__GFP_NOWARN can result in longer latencies than a call to synchronize_rcu(), so less-tenacious GFP flags would be helpful. Here is the tradeoff that must be balanced: a) Minimize use of the fallback path, b) Avoid pushing the system into OOM, c) Bound allocation latency to that of synchronize_rcu(), and d) Leave the emergency reserves to use cases lacking fallbacks. This commit therefore changes GFP flags from GFP_KERNEL|__GFP_NOWARN to GFP_KERNEL|__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_NOWARN. This combination leaves the emergency reserves alone and can initiate reclaim, but will not invoke the OOM killer. Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. 
McKenney --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 0ecc1fb..4120d4b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3463,7 +3463,7 @@ static void fill_page_cache_func(struct work_struct *work) for (i = 0; i < rcu_min_cached_objs; i++) { bnode = (struct kvfree_rcu_bulk_data *) - __get_free_page(GFP_KERNEL | __GFP_NOWARN); + __get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN); if (bnode) { raw_spin_lock_irqsave(&krcp->lock, flags); -- 2.9.5
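Taken together, the __GFP_NOMEMALLOC, __GFP_NORETRY, and fill_page_cache_func() patches in this series converge on a single allocation recipe. A hedged sketch of that recipe follows; demo_alloc_block() is an invented wrapper name, not kvfree_rcu() code.

#include <linux/gfp.h>

static void *demo_alloc_block(void)
{
        unsigned long page;

        /* May reclaim lightly, never OOM-kills, never taps the emergency reserves. */
        page = __get_free_page(GFP_KERNEL | __GFP_NORETRY |
                               __GFP_NOMEMALLOC | __GFP_NOWARN);
        if (!page)
                return NULL;    /* caller falls back, e.g. to synchronize_rcu() */

        return (void *)page;
}

The design choice is that a failed allocation here is cheap to tolerate because kvfree_rcu() always has a fallback, whereas an OOM kill or a raid on the reserves is not.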
[PATCH tip/core/rcu 3/6] kvfree_rcu: Make krc_this_cpu_unlock() use raw_spin_unlock_irqrestore()
From: "Paul E. McKenney" The krc_this_cpu_unlock() function does a raw_spin_unlock() immediately followed by a local_irq_restore(). This commit saves a line of code by merging them into a raw_spin_unlock_irqrestore(). This transformation also reduces scheduling latency because raw_spin_unlock_irqrestore() responds immediately to a reschedule request. In contrast, local_irq_restore() does a scheduling-oblivious enabling of interrupts. Reported-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 08b5044..7ee83f3 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3229,8 +3229,7 @@ krc_this_cpu_lock(unsigned long *flags) static inline void krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags) { - raw_spin_unlock(&krcp->lock); - local_irq_restore(flags); + raw_spin_unlock_irqrestore(&krcp->lock, flags); } static inline struct kvfree_rcu_bulk_data * -- 2.9.5
[PATCH tip/core/rcu 1/6] kvfree_rcu: Directly allocate page for single-argument case
From: "Uladzislau Rezki (Sony)" Single-argument kvfree_rcu() must be invoked from sleepable contexts, so we can directly allocate pages. Furthermmore, the fallback in case of page-allocation failure is the high-latency synchronize_rcu(), so it makes sense to do these page allocations from the fastpath, and even to permit limited sleeping within the allocator. This commit therefore allocates if needed on the fastpath using GFP_KERNEL|__GFP_RETRY_MAYFAIL. This also has the beneficial effect of leaving kvfree_rcu()'s per-CPU caches to the double-argument variant of kvfree_rcu(), given that the double-argument variant cannot directly invoke the allocator. [ paulmck: Add add_ptr_to_bulk_krc_lock header comment per Michal Hocko. ] Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 42 ++ 1 file changed, 26 insertions(+), 16 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f521..1f8c980 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3493,37 +3493,50 @@ run_page_cache_worker(struct kfree_rcu_cpu *krcp) } } +// Record ptr in a page managed by krcp, with the pre-krc_this_cpu_lock() +// state specified by flags. If can_alloc is true, the caller must +// be schedulable and not be holding any locks or mutexes that might be +// acquired by the memory allocator or anything that it might invoke. +// Returns true if ptr was successfully recorded, else the caller must +// use a fallback. static inline bool -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr) +add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp, + unsigned long *flags, void *ptr, bool can_alloc) { struct kvfree_rcu_bulk_data *bnode; int idx; - if (unlikely(!krcp->initialized)) + *krcp = krc_this_cpu_lock(flags); + if (unlikely(!(*krcp)->initialized)) return false; - lockdep_assert_held(&krcp->lock); idx = !!is_vmalloc_addr(ptr); /* Check if a new block is required. */ - if (!krcp->bkvhead[idx] || - krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) { - bnode = get_cached_bnode(krcp); - /* Switch to emergency path. */ + if (!(*krcp)->bkvhead[idx] || + (*krcp)->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) { + bnode = get_cached_bnode(*krcp); + if (!bnode && can_alloc) { + krc_this_cpu_unlock(*krcp, *flags); + bnode = (struct kvfree_rcu_bulk_data *) + __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN); + *krcp = krc_this_cpu_lock(flags); + } + if (!bnode) return false; /* Initialize the new block. */ bnode->nr_records = 0; - bnode->next = krcp->bkvhead[idx]; + bnode->next = (*krcp)->bkvhead[idx]; /* Attach it to the head. */ - krcp->bkvhead[idx] = bnode; + (*krcp)->bkvhead[idx] = bnode; } /* Finally insert. */ - krcp->bkvhead[idx]->records - [krcp->bkvhead[idx]->nr_records++] = ptr; + (*krcp)->bkvhead[idx]->records + [(*krcp)->bkvhead[idx]->nr_records++] = ptr; return true; } @@ -3561,8 +3574,6 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) ptr = (unsigned long *) func; } - krcp = krc_this_cpu_lock(&flags); - // Queue the object but don't yet schedule the batch. if (debug_rcu_head_queue(ptr)) { // Probable double kfree_rcu(), just leak. @@ -3570,12 +3581,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func) __func__, head); // Mark as success and leave. 
- success = true; - goto unlock_return; + return; } kasan_record_aux_stack(ptr); - success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr); + success = add_ptr_to_bulk_krc_lock(&krcp, &flags, ptr, !head); if (!success) { run_page_cache_worker(krcp); -- 2.9.5
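The heart of add_ptr_to_bulk_krc_lock() is the unlock/allocate/relock dance, which is easier to see in isolation. The sketch below is hypothetical (demo_ names, a single global lock instead of the per-CPU krcp, no record bookkeeping) and only illustrates the shape of the pattern; the real code additionally re-reads its per-CPU state after relocking because the task may have migrated to another CPU.

#include <linux/gfp.h>
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(demo_lock);
static void *demo_cached_block;         /* stand-in for get_cached_bnode() */

static bool demo_record(void *ptr, bool can_alloc)
{
        unsigned long flags;
        void *blk;

        raw_spin_lock_irqsave(&demo_lock, flags);
        blk = demo_cached_block;
        if (!blk && can_alloc) {
                raw_spin_unlock_irqrestore(&demo_lock, flags);  /* sleeping is legal again */
                blk = (void *)__get_free_page(GFP_KERNEL | __GFP_NOWARN);
                raw_spin_lock_irqsave(&demo_lock, flags);       /* relock and re-validate */
        }
        if (!blk) {
                raw_spin_unlock_irqrestore(&demo_lock, flags);
                return false;           /* caller must take its fallback path */
        }
        /* ... record ptr in blk while still holding the lock ... */
        raw_spin_unlock_irqrestore(&demo_lock, flags);
        return true;
}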
[PATCH lib/bitmap 3/9] lib: test_bitmap: add more start-end:offset/len tests
From: Paul Gortmaker There are inputs to bitmap_parselist() that would probably never be entered manually by a person, but might result from some kind of automated input generator. Things like ranges of length 1, or group lengths longer than nbits, overlaps, or offsets of zero. Adding these tests serves two purposes: 1) document what might seem odd but nonetheless valid input. 2) don't regress from what we currently accept as valid. Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/test_bitmap.c | 22 ++ 1 file changed, 22 insertions(+) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 0f2e91d..3c1c46d 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -34,6 +34,8 @@ static const unsigned long exp1[] __initconst = { BITMAP_FROM_U64(0xULL), BITMAP_FROM_U64(0xULL), BITMAP_FROM_U64(0), + BITMAP_FROM_U64(0x8000), + BITMAP_FROM_U64(0x80000000), }; static const unsigned long exp2[] __initconst = { @@ -334,6 +336,26 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = { {0, " , ,, , , ", &exp1[12 * step], 8, 0}, {0, " , ,, , , \n", &exp1[12 * step], 8, 0}, + {0, "0-0", &exp1[0], 32, 0}, + {0, "1-1", &exp1[1 * step], 32, 0}, + {0, "15-15",&exp1[13 * step], 32, 0}, + {0, "31-31",&exp1[14 * step], 32, 0}, + + {0, "0-0:0/1", &exp1[12 * step], 32, 0}, + {0, "0-0:1/1", &exp1[0], 32, 0}, + {0, "0-0:1/31", &exp1[0], 32, 0}, + {0, "0-0:31/31",&exp1[0], 32, 0}, + {0, "1-1:1/1", &exp1[1 * step], 32, 0}, + {0, "0-15:16/31", &exp1[2 * step], 32, 0}, + {0, "15-15:1/2",&exp1[13 * step], 32, 0}, + {0, "15-15:31/31", &exp1[13 * step], 32, 0}, + {0, "15-31:1/31", &exp1[13 * step], 32, 0}, + {0, "16-31:16/31", &exp1[3 * step], 32, 0}, + {0, "31-31:31/31", &exp1[14 * step], 32, 0}, + + {0, "0-31:1/3,1-31:1/3,2-31:1/3", &exp1[8 * step], 32, 0}, + {0, "1-10:8/12,8-31:24/29,0-31:0/3",&exp1[9 * step], 32, 0}, + {-EINVAL, "-1", NULL, 8, 0}, {-EINVAL, "-0", NULL, 8, 0}, {-EINVAL, "10-1", NULL, 8, 0}, -- 2.9.5
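To see what these region strings actually select, here is a small userspace program that mirrors the expansion loop used by bitmap_set_region() (set min(end - start + 1, used) bits every group_len bits). It is an illustration only, not the kernel parser, and it skips the validation that bitmap_check_region() performs.

#include <stdio.h>

static void expand_region(unsigned int s, unsigned int e,
                          unsigned int used, unsigned int grp)
{
        printf("%u-%u:%u/%u ->", s, e, used, grp);
        for (unsigned int start = s; start <= e; start += grp)
                for (unsigned int b = start; b <= e && b < start + used; b++)
                        printf(" %u", b);
        printf("\n");
}

int main(void)
{
        expand_region(0, 15, 16, 31);   /* bits 0..15, as in "0-15:16/31" */
        expand_region(15, 15, 1, 2);    /* just bit 15, as in "15-15:1/2" */
        expand_region(16, 31, 16, 31);  /* bits 16..31, as in "16-31:16/31" */
        return 0;
}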
[PATCH lib/bitmap 6/9] lib: bitmap: support "N" as an alias for size of bitmap
From: Paul Gortmaker While this is done for all bitmaps, the original use case in mind was for CPU masks and cpulist_parse() as described below. It seems that a common configuration is to use the 1st couple cores for housekeeping tasks. This tends to leave the remaining ones to form a pool of similarly configured cores to take on the real workload of interest to the user. So on machine A - with 32 cores, it could be 0-3 for "system" and then 4-31 being used in boot args like nohz_full=, or rcu_nocbs= as part of setting up the worker pool of CPUs. But then newer machine B is added, and it has 48 cores, and so while the 0-3 part remains unchanged, the pool setup cpu list becomes 4-47. Multiple deployment becomes easier when we can just simply replace 31 and 47 with "N" and let the system substitute in the actual number at boot; a number that it knows better than we do. Cc: Yury Norov Cc: Peter Zijlstra Cc: "Paul E. McKenney" Cc: Rasmus Villemoes Cc: Andy Shevchenko Suggested-by: Yury Norov # move it from CPU code Acked-by: Yury Norov Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.rst | 7 +++ lib/bitmap.c| 22 +- 2 files changed, 24 insertions(+), 5 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index 1132796..d6e3f67 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -68,6 +68,13 @@ For example one can add to the command line following parameter: where the final item represents CPUs 100,101,125,126,150,151,... +The value "N" can be used to represent the numerically last CPU on the system, +i.e "foo_cpus=16-N" would be equivalent to "16-31" on a 32 core system. + +Keep in mind that "N" is dynamic, so if system changes cause the bitmap width +to change, such as less cores in the CPU list, then N and any ranges using N +will also change. Use the same on a small 4 core system, and "16-N" becomes +"16-3" and now the same boot input will be flagged as invalid (start > end). This document may not be entirely up to date and comprehensive. 
The command diff --git a/lib/bitmap.c b/lib/bitmap.c index 833f152a..9f4626a 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -519,11 +519,17 @@ static int bitmap_check_region(const struct region *r) return 0; } -static const char *bitmap_getnum(const char *str, unsigned int *num) +static const char *bitmap_getnum(const char *str, unsigned int *num, +unsigned int lastbit) { unsigned long long n; unsigned int len; + if (str[0] == 'N') { + *num = lastbit; + return str + 1; + } + len = _parse_integer(str, 10, &n); if (!len) return ERR_PTR(-EINVAL); @@ -571,7 +577,9 @@ static const char *bitmap_find_region_reverse(const char *start, const char *end static const char *bitmap_parse_region(const char *str, struct region *r) { - str = bitmap_getnum(str, &r->start); + unsigned int lastbit = r->nbits - 1; + + str = bitmap_getnum(str, &r->start, lastbit); if (IS_ERR(str)) return str; @@ -581,7 +589,7 @@ static const char *bitmap_parse_region(const char *str, struct region *r) if (*str != '-') return ERR_PTR(-EINVAL); - str = bitmap_getnum(str + 1, &r->end); + str = bitmap_getnum(str + 1, &r->end, lastbit); if (IS_ERR(str)) return str; @@ -591,14 +599,14 @@ static const char *bitmap_parse_region(const char *str, struct region *r) if (*str != ':') return ERR_PTR(-EINVAL); - str = bitmap_getnum(str + 1, &r->off); + str = bitmap_getnum(str + 1, &r->off, lastbit); if (IS_ERR(str)) return str; if (*str != '/') return ERR_PTR(-EINVAL); - return bitmap_getnum(str + 1, &r->group_len); + return bitmap_getnum(str + 1, &r->group_len, lastbit); no_end: r->end = r->start; @@ -625,6 +633,10 @@ static const char *bitmap_parse_region(const char *str, struct region *r) * From each group will be used only defined amount of bits. * Syntax: range:used_size/group_size * Example: 0-1023:2/256 ==> 0,1,256,257,512,513,768,769 + * The value 'N' can be used as a dynamically substituted token for the + * maximum allowed value; i.e (nmaskbits - 1). Keep in mind that it is + * dynamic, so if system changes cause the bitmap width to change, such + * as more cores in a CPU list, then any ranges using N will also change. * * Returns: 0 on success, -errno on invalid input strings. Error values: * -- 2.9.5
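A userspace sketch of just the substitution step that the new lastbit argument enables: if the next token is the literal 'N', use nbits - 1 instead of parsing digits. The parsing below is simplified (strtoul rather than _parse_integer) and the function names are not the kernel's.

#include <stdio.h>
#include <stdlib.h>

static const char *demo_getnum(const char *str, unsigned int *num,
                               unsigned int lastbit)
{
        char *end;

        if (str[0] == 'N') {            /* alias for the last valid bit index */
                *num = lastbit;
                return str + 1;
        }
        *num = (unsigned int)strtoul(str, &end, 10);
        return end;
}

int main(void)
{
        unsigned int nbits = 32, start, stop;
        const char *s = "16-N";

        s = demo_getnum(s, &start, nbits - 1);
        s = demo_getnum(s + 1, &stop, nbits - 1);       /* skip the '-' */
        printf("\"16-N\" with %u bits -> %u-%u\n", nbits, start, stop);
        return 0;
}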
[PATCH lib/bitmap 8/9] rcu: deprecate "all" option to rcu_nocbs=
From: Paul Gortmaker With the core bitmap support now accepting "N" as a placeholder for the end of the bitmap, "all" can be represented as "0-N" and has the advantage of not being specific to RCU (or any other subsystem). So deprecate the use of "all" by removing documentation references to it. The support itself needs to remain for now, since we don't know how many people out there are using it currently, but since it is in an __init area anyway, it isn't worth losing sleep over. Cc: Yury Norov Cc: Peter Zijlstra Cc: "Paul E. McKenney" Cc: Josh Triplett Acked-by: Yury Norov Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- Documentation/admin-guide/kernel-parameters.txt | 4 +--- kernel/rcu/tree_plugin.h| 6 ++ 2 files changed, 3 insertions(+), 7 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0454572..83e2ef1 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4068,9 +4068,7 @@ see CONFIG_RAS_CEC help text. rcu_nocbs= [KNL] - The argument is a cpu list, as described above, - except that the string "all" can be used to - specify every CPU on the system. + The argument is a cpu list, as described above. In kernels built with CONFIG_RCU_NOCB_CPU=y, set the specified list of CPUs to be no-callback CPUs. diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2d60377..0b95562 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1464,14 +1464,12 @@ static void rcu_cleanup_after_idle(void) /* * Parse the boot-time rcu_nocb_mask CPU list from the kernel parameters. - * The string after the "rcu_nocbs=" is either "all" for all CPUs, or a - * comma-separated list of CPUs and/or CPU ranges. If an invalid list is - * given, a warning is emitted and all CPUs are offloaded. + * If the list is invalid, a warning is emitted and all CPUs are offloaded. */ static int __init rcu_nocb_setup(char *str) { alloc_bootmem_cpumask_var(&rcu_nocb_mask); - if (!strcasecmp(str, "all")) + if (!strcasecmp(str, "all"))/* legacy: use "0-N" instead */ cpumask_setall(rcu_nocb_mask); else if (cpulist_parse(str, rcu_nocb_mask)) { -- 2.9.5
[PATCH lib/bitmap 9/9] rcutorture: Use "all" and "N" in "nohz_full" and "rcu_nocbs"
From: "Paul E. McKenney" This commit uses the shiny new "all" and "N" cpumask options to decouple the "nohz_full" and "rcu_nocbs" kernel boot parameters in the TREE04.boot and TREE08.boot files from the CONFIG_NR_CPUS options in the TREE04 and TREE08 files. Reported-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot | 2 +- tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot index 5adc675..a8d94ca 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE04.boot @@ -1 +1 @@ -rcutree.rcu_fanout_leaf=4 nohz_full=1-7 +rcutree.rcu_fanout_leaf=4 nohz_full=1-N diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot index 22478fd..94d3844 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE08.boot @@ -1,3 +1,3 @@ rcupdate.rcu_self_test=1 rcutree.rcu_fanout_exact=1 -rcu_nocbs=0-7 +rcu_nocbs=all -- 2.9.5
[PATCH lib/bitmap 7/9] lib: test_bitmap: add tests for "N" alias
From: Paul Gortmaker These are copies of existing tests, with just 31 --> N. This ensures the recently added "N" alias transparently works in any normally numeric fields of a region specification. Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/test_bitmap.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 3c1c46d..9cd5755 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -353,6 +353,16 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = { {0, "16-31:16/31", &exp1[3 * step], 32, 0}, {0, "31-31:31/31", &exp1[14 * step], 32, 0}, + {0, "N-N", &exp1[14 * step], 32, 0}, + {0, "0-0:1/N", &exp1[0], 32, 0}, + {0, "0-0:N/N", &exp1[0], 32, 0}, + {0, "0-15:16/N",&exp1[2 * step], 32, 0}, + {0, "15-15:N/N",&exp1[13 * step], 32, 0}, + {0, "15-N:1/N", &exp1[13 * step], 32, 0}, + {0, "16-N:16/N",&exp1[3 * step], 32, 0}, + {0, "N-N:N/N", &exp1[14 * step], 32, 0}, + + {0, "0-N:1/3,1-N:1/3,2-N:1/3", &exp1[8 * step], 32, 0}, {0, "0-31:1/3,1-31:1/3,2-31:1/3", &exp1[8 * step], 32, 0}, {0, "1-10:8/12,8-31:24/29,0-31:0/3",&exp1[9 * step], 32, 0}, -- 2.9.5
[PATCH lib/bitmap 1/9] lib: test_bitmap: clearly separate ERANGE from EINVAL tests.
From: Paul Gortmaker This block of tests was meant to find/flag incorrect use of the ":" and "/" separators (syntax errors) and invalid (zero) group len. However they were specified with an 8 bit width and 32 bit operations, so they really contained two errors (EINVAL and ERANGE). Promote them to 32 bit so it is clear what they are meant to target. Then we can add tests specific for ERANGE (no syntax errors, just doing 32bit op on 8 bit width, plus a typical 9-on-8 fencepost error). Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/test_bitmap.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 0ea0e82..853a3a6 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -337,12 +337,12 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = { {-EINVAL, "-1", NULL, 8, 0}, {-EINVAL, "-0", NULL, 8, 0}, {-EINVAL, "10-1", NULL, 8, 0}, - {-EINVAL, "0-31:", NULL, 8, 0}, - {-EINVAL, "0-31:0", NULL, 8, 0}, - {-EINVAL, "0-31:0/", NULL, 8, 0}, - {-EINVAL, "0-31:0/0", NULL, 8, 0}, - {-EINVAL, "0-31:1/0", NULL, 8, 0}, - {-EINVAL, "0-31:10/1", NULL, 8, 0}, + {-EINVAL, "0-31:", NULL, 32, 0}, + {-EINVAL, "0-31:0", NULL, 32, 0}, + {-EINVAL, "0-31:0/", NULL, 32, 0}, + {-EINVAL, "0-31:0/0", NULL, 32, 0}, + {-EINVAL, "0-31:1/0", NULL, 32, 0}, + {-EINVAL, "0-31:10/1", NULL, 32, 0}, {-EOVERFLOW, "0-98765432123456789:10/1", NULL, 8, 0}, {-EINVAL, "a-31", NULL, 8, 0}, -- 2.9.5
[PATCH lib/bitmap 2/9] lib: test_bitmap: add tests to trigger ERANGE case.
From: Paul Gortmaker Add tests that specify a valid range, but one that is outside the width of the bitmap for which it is to be applied to. These should trigger an -ERANGE response from the code. Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/test_bitmap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 853a3a6..0f2e91d 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -337,6 +337,8 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = { {-EINVAL, "-1", NULL, 8, 0}, {-EINVAL, "-0", NULL, 8, 0}, {-EINVAL, "10-1", NULL, 8, 0}, + {-ERANGE, "8-8", NULL, 8, 0}, + {-ERANGE, "0-31", NULL, 8, 0}, {-EINVAL, "0-31:", NULL, 32, 0}, {-EINVAL, "0-31:0", NULL, 32, 0}, {-EINVAL, "0-31:0/", NULL, 32, 0}, -- 2.9.5
[PATCH lib/bitmap 4/9] lib: bitmap: fold nbits into region struct
From: Paul Gortmaker This will reduce parameter passing and enable using nbits as part of future dynamic region parameter parsing. Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Suggested-by: Yury Norov Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/bitmap.c | 19 ++- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/lib/bitmap.c b/lib/bitmap.c index 75006c4..162e285 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -487,24 +487,24 @@ EXPORT_SYMBOL(bitmap_print_to_pagebuf); /* * Region 9-38:4/10 describes the following bitmap structure: - * 0 9 1218 38 - * ....... - * ^ ^ ^ ^ - * start off group_lenend + * 0 9 1218 38 N + * ....... + * ^ ^ ^ ^ ^ + * start off group_lenend nbits */ struct region { unsigned int start; unsigned int off; unsigned int group_len; unsigned int end; + unsigned int nbits; }; -static int bitmap_set_region(const struct region *r, - unsigned long *bitmap, int nbits) +static int bitmap_set_region(const struct region *r, unsigned long *bitmap) { unsigned int start; - if (r->end >= nbits) + if (r->end >= r->nbits) return -ERANGE; for (start = r->start; start <= r->end; start += r->group_len) @@ -640,7 +640,8 @@ int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits) struct region r; long ret; - bitmap_zero(maskp, nmaskbits); + r.nbits = nmaskbits; + bitmap_zero(maskp, r.nbits); while (buf) { buf = bitmap_find_region(buf); @@ -655,7 +656,7 @@ int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits) if (ret) return ret; - ret = bitmap_set_region(&r, maskp, nmaskbits); + ret = bitmap_set_region(&r, maskp); if (ret) return ret; } -- 2.9.5
[PATCH lib/bitmap 5/9] lib: bitmap: move ERANGE check from set_region to check_region
From: Paul Gortmaker It makes sense to do all the checks in check_region() and not 1/2 in check_region and 1/2 in set_region. Since set_region is called immediately after check_region, the net effect on runtime is zero, but it gets rid of an if (...) return... Cc: Yury Norov Cc: Rasmus Villemoes Cc: Andy Shevchenko Acked-by: Yury Norov Reviewed-by: Andy Shevchenko Signed-off-by: Paul Gortmaker Signed-off-by: Paul E. McKenney --- lib/bitmap.c | 14 +- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/lib/bitmap.c b/lib/bitmap.c index 162e285..833f152a 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -500,17 +500,12 @@ struct region { unsigned int nbits; }; -static int bitmap_set_region(const struct region *r, unsigned long *bitmap) +static void bitmap_set_region(const struct region *r, unsigned long *bitmap) { unsigned int start; - if (r->end >= r->nbits) - return -ERANGE; - for (start = r->start; start <= r->end; start += r->group_len) bitmap_set(bitmap, start, min(r->end - start + 1, r->off)); - - return 0; } static int bitmap_check_region(const struct region *r) @@ -518,6 +513,9 @@ static int bitmap_check_region(const struct region *r) if (r->start > r->end || r->group_len == 0 || r->off > r->group_len) return -EINVAL; + if (r->end >= r->nbits) + return -ERANGE; + return 0; } @@ -656,9 +654,7 @@ int bitmap_parselist(const char *buf, unsigned long *maskp, int nmaskbits) if (ret) return ret; - ret = bitmap_set_region(&r, maskp); - if (ret) - return ret; + bitmap_set_region(&r, maskp); } return 0; -- 2.9.5
[PATCH tip/core/rcu 02/12] timer: Report ignored local enqueue in nohz mode
From: Frederic Weisbecker Enqueuing a local timer after the tick has been stopped will result in the timer being ignored until the next random interrupt. Perform sanity checks to report these situations. Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Rafael J. Wysocki Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/sched/core.c | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ca2bb62..4822371 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -674,6 +674,26 @@ int get_nohz_timer_target(void) return cpu; } +static void wake_idle_assert_possible(void) +{ +#ifdef CONFIG_SCHED_DEBUG + /* Timers are re-evaluated after idle IRQs */ + if (in_hardirq()) + return; + /* +* Same as hardirqs, assuming they are executing +* on IRQ tail. Ksoftirqd shouldn't reach here +* as the timer base wouldn't be idle. And inline +* softirq processing after a call to local_bh_enable() +* within idle loop sound too fun to be considered here. +*/ + if (in_serving_softirq()) + return; + + WARN_ON_ONCE("Late timer enqueue may be ignored\n"); +#endif +} + /* * When add_timer_on() enqueues a timer into the timer wheel of an * idle CPU then this timer might expire before the next timer event @@ -688,8 +708,10 @@ static void wake_up_idle_cpu(int cpu) { struct rq *rq = cpu_rq(cpu); - if (cpu == smp_processor_id()) + if (cpu == smp_processor_id()) { + wake_idle_assert_possible(); return; + } if (set_nr_and_not_polling(rq->idle)) smp_send_reschedule(cpu); -- 2.9.5
[PATCH tip/core/rcu 01/12] rcu/nocb: Detect unsafe checks for offloaded rdp
From: Frederic Weisbecker Provide CONFIG_PROVE_RCU sanity checks to ensure we are always reading the offloaded state of an rdp in a safe and stable way and prevent from its value to be changed under us. We must either hold the barrier mutex, the cpu-hotplug lock (read or write) or the nocb lock. Local non-preemptible reads are also safe. NOCB kthreads and timers have their own means of synchronization against the offloaded state updaters. Cc: Josh Triplett Cc: Steven Rostedt Cc: Mathieu Desnoyers Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Thomas Gleixner Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c| 21 +-- kernel/rcu/tree_plugin.h | 90 2 files changed, 87 insertions(+), 24 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f521..03503e2 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -156,6 +156,7 @@ static void invoke_rcu_core(void); static void rcu_report_exp_rdp(struct rcu_data *rdp); static void sync_sched_exp_online_cleanup(int cpu); static void check_cb_ovld_locked(struct rcu_data *rdp, struct rcu_node *rnp); +static bool rcu_rdp_is_offloaded(struct rcu_data *rdp); /* rcuc/rcub kthread realtime priority */ static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0; @@ -1672,7 +1673,7 @@ static bool __note_gp_changes(struct rcu_node *rnp, struct rcu_data *rdp) { bool ret = false; bool need_qs; - const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_rdp_is_offloaded(rdp); raw_lockdep_assert_held_rcu_node(rnp); @@ -2128,7 +2129,7 @@ static void rcu_gp_cleanup(void) needgp = true; } /* Advance CBs to reduce false positives below. */ - offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); + offloaded = rcu_rdp_is_offloaded(rdp); if ((offloaded || !rcu_accelerate_cbs(rnp, rdp)) && needgp) { WRITE_ONCE(rcu_state.gp_flags, RCU_GP_FLAG_INIT); WRITE_ONCE(rcu_state.gp_req_activity, jiffies); @@ -2327,7 +2328,7 @@ rcu_report_qs_rdp(struct rcu_data *rdp) unsigned long flags; unsigned long mask; bool needwake = false; - const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_rdp_is_offloaded(rdp); struct rcu_node *rnp; WARN_ON_ONCE(rdp->cpu != smp_processor_id()); @@ -2497,7 +2498,7 @@ static void rcu_do_batch(struct rcu_data *rdp) int div; bool __maybe_unused empty; unsigned long flags; - const bool offloaded = rcu_segcblist_is_offloaded(&rdp->cblist); + const bool offloaded = rcu_rdp_is_offloaded(rdp); struct rcu_head *rhp; struct rcu_cblist rcl = RCU_CBLIST_INITIALIZER(rcl); long bl, count = 0; @@ -3066,7 +3067,7 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func) trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCBQueued")); /* Go handle any RCU core processing required. */ - if (unlikely(rcu_segcblist_is_offloaded(&rdp->cblist))) { + if (unlikely(rcu_rdp_is_offloaded(rdp))) { __call_rcu_nocb_wake(rdp, was_alldone, flags); /* unlocks */ } else { __call_rcu_core(rdp, head, flags); @@ -3843,13 +3844,13 @@ static int rcu_pending(int user) return 1; /* Does this CPU have callbacks ready to invoke? */ - if (!rcu_segcblist_is_offloaded(&rdp->cblist) && + if (!rcu_rdp_is_offloaded(rdp) && rcu_segcblist_ready_cbs(&rdp->cblist)) return 1; /* Has RCU gone idle with this CPU needing another grace period? 
*/ if (!gp_in_progress && rcu_segcblist_is_enabled(&rdp->cblist) && - !rcu_segcblist_is_offloaded(&rdp->cblist) && + !rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_restempty(&rdp->cblist, RCU_NEXT_READY_TAIL)) return 1; @@ -3968,7 +3969,7 @@ void rcu_barrier(void) for_each_possible_cpu(cpu) { rdp = per_cpu_ptr(&rcu_data, cpu); if (cpu_is_offline(cpu) && - !rcu_segcblist_is_offloaded(&rdp->cblist)) + !rcu_rdp_is_offloaded(rdp)) continue; if (rcu_segcblist_n_cbs(&rdp->cblist) && cpu_online(cpu)) { rcu_barrier_trace(TPS("OnlineQ"), cpu, @@ -4291,7 +4292,7 @@ void rcutree_migrate_callbacks(int cpu) struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); bool needwake; - if (rcu_segcblist_is_offloaded(&rdp->cblist) || + if (rcu_rdp_is_offloaded(rdp) || rcu_segcblist_empty(&rdp->cblist)) return; /* No callbacks to migrate. */ @@ -4309,7 +4310,7 @@ void rcutree_migrate_callbacks(int cpu) rcu_segcblist_disable(&rdp->cbl
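A simplified sketch of the kind of wrapper this patch introduces around rcu_segcblist_is_offloaded(): warn under lockdep unless the caller holds one of the things that pin the offloaded state. The condition below approximates the rules stated in the commit message (barrier mutex, CPU-hotplug lock, nocb lock, or local non-preemptible access) and is meant to sit next to the code in kernel/rcu; the real rcu_rdp_is_offloaded() is more detailed and also accounts for the nocb kthreads and timers.

static bool demo_rdp_is_offloaded(struct rcu_data *rdp)
{
        RCU_LOCKDEP_WARN(!(lockdep_is_held(&rcu_state.barrier_mutex) ||
                           (IS_ENABLED(CONFIG_HOTPLUG_CPU) && lockdep_is_cpus_held()) ||
                           lockdep_is_held(&rdp->nocb_lock) ||
                           (rdp == this_cpu_ptr(&rcu_data) &&
                            !(IS_ENABLED(CONFIG_PREEMPT_COUNT) && preemptible()))),
                         "Unsafe read of RCU_NOCB offloaded state");

        return rcu_segcblist_is_offloaded(&rdp->cblist);
}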
[PATCH tip/core/rcu 05/12] rcu/nocb: Avoid confusing double write of rdp->nocb_cb_sleep
From: Frederic Weisbecker The nocb_cb_wait() function first sets the rdp->nocb_cb_sleep flag to true after invoking the callbacks, and then sets it back to false if it finds more callbacks that are ready to invoke. This is confusing and will become unsafe if this flag is ever read locklessly. This commit therefore writes it only once, based on the state after both callback invocation and checking. Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 9fd8588..6a7f77d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2230,6 +2230,7 @@ static void nocb_cb_wait(struct rcu_data *rdp) unsigned long flags; bool needwake_state = false; bool needwake_gp = false; + bool can_sleep = true; struct rcu_node *rnp = rdp->mynode; local_irq_save(flags); @@ -2253,8 +2254,6 @@ static void nocb_cb_wait(struct rcu_data *rdp) raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ } - WRITE_ONCE(rdp->nocb_cb_sleep, true); - if (rcu_segcblist_test_flags(cblist, SEGCBLIST_OFFLOADED)) { if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) { rcu_segcblist_set_flags(cblist, SEGCBLIST_KTHREAD_CB); @@ -2262,7 +2261,7 @@ static void nocb_cb_wait(struct rcu_data *rdp) needwake_state = true; } if (rcu_segcblist_ready_cbs(cblist)) - WRITE_ONCE(rdp->nocb_cb_sleep, false); + can_sleep = false; } else { /* * De-offloading. Clear our flag and notify the de-offload worker. */ @@ -2275,6 +2274,8 @@ static void nocb_cb_wait(struct rcu_data *rdp) needwake_state = true; } + WRITE_ONCE(rdp->nocb_cb_sleep, can_sleep); + if (rdp->nocb_cb_sleep) trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("CBSleep")); -- 2.9.5
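The idiom being adopted, in isolation: compute the final value into a local variable and publish it with a single WRITE_ONCE(), so a lockless READ_ONCE() reader can never observe the transient true-then-false sequence. The demo_ structure and helpers are invented for illustration.

#include <linux/compiler.h>
#include <linux/types.h>

struct demo_state {
        bool cb_sleep;
};

/* Writer: any number of local updates, but exactly one store at the end. */
static void demo_publish_sleep_state(struct demo_state *st, bool more_callbacks)
{
        bool can_sleep = true;

        if (more_callbacks)
                can_sleep = false;

        WRITE_ONCE(st->cb_sleep, can_sleep);
}

/* Lockless reader. */
static bool demo_sleep_state(struct demo_state *st)
{
        return READ_ONCE(st->cb_sleep);
}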
[PATCH tip/core/rcu 03/12] rcu/nocb: Comment the reason behind BH disablement on batch processing
From: Frederic Weisbecker This commit explains why softirqs need to be disabled while invoking callbacks, even when callback processing has been offloaded. After all, invoking callbacks concurrently is one thing, but concurrently invoking the same callback is quite another. Reported-by: Boqun Feng Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index cd513ea..013142d 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2235,6 +2235,12 @@ static void nocb_cb_wait(struct rcu_data *rdp) local_irq_save(flags); rcu_momentary_dyntick_idle(); local_irq_restore(flags); + /* +* Disable BH to provide the expected environment. Also, when +* transitioning to/from NOCB mode, a self-requeuing callback might +* be invoked from softirq. A short grace period could cause both +* instances of this callback to execute concurrently. +*/ local_bh_disable(); rcu_do_batch(rdp); local_bh_enable(); -- 2.9.5
[PATCH tip/core/rcu 06/12] rcu/nocb: Only (re-)initialize segcblist when needed on CPU up
From: Frederic Weisbecker At the start of a CPU-hotplug operation, the incoming CPU's callback list can be in a number of states: 1. Disabled and empty. This is the case when the boot CPU has not invoked call_rcu(), when a non-boot CPU first comes online, and when a non-offloaded CPU comes back online. In this case, it is both necessary and permissible to initialize ->cblist. Because either the CPU is currently running with interrupts disabled (boot CPU) or is not yet running at all (other CPUs), it is not necessary to acquire ->nocb_lock. In this case, initialization is required. 2. Disabled and non-empty. This cannot occur, because early boot call_rcu() invocations enable the callback list before enqueuing their callback. 3. Enabled, whether empty or not. In this case, the callback list has already been initialized. This case occurs when the boot CPU has executed an early boot call_rcu() and also when an offloaded CPU comes back online. In both cases, there is no need to initialize the callback list: In the boot-CPU case, the CPU has not (yet) gone offline, and in the offloaded case, the rcuo kthreads are taking care of business. Because it is not necessary to initialize the callback list, it is also not necessary to acquire ->nocb_lock. Therefore, checking if the segcblist is enabled suffices. This commit therefore initializes the callback list at rcutree_prepare_cpu() time only if that list is disabled. Signed-off-by: Frederic Weisbecker Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index ee77858..402ea36 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4084,14 +4084,13 @@ int rcutree_prepare_cpu(unsigned int cpu) rdp->dynticks_nesting = 1; /* CPU not up, no tearing. */ rcu_dynticks_eqs_online(); raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ + /* -* Lock in case the CB/GP kthreads are still around handling -* old callbacks. +* Only non-NOCB CPUs that didn't have early-boot callbacks need to be +* (re-)initialized. */ - rcu_nocb_lock(rdp); - if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */ + if (!rcu_segcblist_is_enabled(&rdp->cblist)) rcu_segcblist_init(&rdp->cblist); /* Re-enable callbacks. */ - rcu_nocb_unlock(rdp); /* * Add CPU to leaf rcu_node pending-online bitmask. Any needed -- 2.9.5
[PATCH tip/core/rcu 04/12] rcu/nocb: Forbid NOCB toggling on offline CPUs
From: Frederic Weisbecker It makes no sense to de-offload an offline CPU because that CPU will never invoke any remaining callbacks. It also makes little sense to offload an offline CPU because any pending RCU callbacks were migrated when that CPU went offline. Yes, it is in theory possible to use a number of tricks to permit offloading and deoffloading offline CPUs in certain cases, but in practice it is far better to have the simple and deterministic rule "Toggling the offload state of an offline CPU is forbidden". For but one example, consider that an offloaded offline CPU might have millions of callbacks queued. Best to just say "no". This commit therefore forbids toggling of the offloaded state of offline CPUs. Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree.c| 3 +-- kernel/rcu/tree_plugin.h | 57 ++-- 2 files changed, 22 insertions(+), 38 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 03503e2..ee77858 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -4086,8 +4086,7 @@ int rcutree_prepare_cpu(unsigned int cpu) raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled. */ /* * Lock in case the CB/GP kthreads are still around handling -* old callbacks (longer term we should flush all callbacks -* before completing CPU offline) +* old callbacks. */ rcu_nocb_lock(rdp); if (rcu_segcblist_empty(&rdp->cblist)) /* No early-boot CBs? */ diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 013142d..9fd8588 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2399,23 +2399,18 @@ static int rdp_offload_toggle(struct rcu_data *rdp, return 0; } -static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp) +static long rcu_nocb_rdp_deoffload(void *arg) { + struct rcu_data *rdp = arg; struct rcu_segcblist *cblist = &rdp->cblist; unsigned long flags; int ret; + WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id()); + pr_info("De-offloading %d\n", rdp->cpu); rcu_nocb_lock_irqsave(rdp, flags); - /* -* If there are still pending work offloaded, the offline -* CPU won't help much handling them. 
-*/ - if (cpu_is_offline(rdp->cpu) && !rcu_segcblist_empty(&rdp->cblist)) { - rcu_nocb_unlock_irqrestore(rdp, flags); - return -EBUSY; - } ret = rdp_offload_toggle(rdp, false, flags); swait_event_exclusive(rdp->nocb_state_wq, @@ -2446,14 +2441,6 @@ static int __rcu_nocb_rdp_deoffload(struct rcu_data *rdp) return ret; } -static long rcu_nocb_rdp_deoffload(void *arg) -{ - struct rcu_data *rdp = arg; - - WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id()); - return __rcu_nocb_rdp_deoffload(rdp); -} - int rcu_nocb_cpu_deoffload(int cpu) { struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); @@ -2466,12 +2453,14 @@ int rcu_nocb_cpu_deoffload(int cpu) mutex_lock(&rcu_state.barrier_mutex); cpus_read_lock(); if (rcu_rdp_is_offloaded(rdp)) { - if (cpu_online(cpu)) + if (cpu_online(cpu)) { ret = work_on_cpu(cpu, rcu_nocb_rdp_deoffload, rdp); - else - ret = __rcu_nocb_rdp_deoffload(rdp); - if (!ret) - cpumask_clear_cpu(cpu, rcu_nocb_mask); + if (!ret) + cpumask_clear_cpu(cpu, rcu_nocb_mask); + } else { + pr_info("NOCB: Can't CB-deoffload an offline CPU\n"); + ret = -EINVAL; + } } cpus_read_unlock(); mutex_unlock(&rcu_state.barrier_mutex); @@ -2480,12 +2469,14 @@ int rcu_nocb_cpu_deoffload(int cpu) } EXPORT_SYMBOL_GPL(rcu_nocb_cpu_deoffload); -static int __rcu_nocb_rdp_offload(struct rcu_data *rdp) +static long rcu_nocb_rdp_offload(void *arg) { + struct rcu_data *rdp = arg; struct rcu_segcblist *cblist = &rdp->cblist; unsigned long flags; int ret; + WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id()); /* * For now we only support re-offload, ie: the rdp must have been * offloaded on boot first. @@ -2525,14 +2516,6 @@ static int __rcu_nocb_rdp_offload(struct rcu_data *rdp) return ret; } -static long rcu_nocb_rdp_offload(void *arg) -{ - struct rcu_data *rdp = arg; - - WARN_ON_ONCE(rdp->cpu != raw_smp_processor_id()); - return __rcu_nocb_rdp_offload(rdp); -} - int rcu_nocb_cpu_offload(int cpu) { struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu); @@ -2541,12 +2524,14 @@ int rcu_nocb_cpu_offload(int c
[PATCH tip/core/rcu 07/12] rcu/nocb: Rename nocb_gp_update_state to nocb_gp_update_state_deoffloading
From: Frederic Weisbecker The name nocb_gp_update_state() is unenlightening, so this commit changes it to nocb_gp_update_state_deoffloading(). This function now does what its name says, updates state and returns true if the CPU corresponding to the specified rcu_data structure is in the process of being de-offloaded. Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 6a7f77d..93d3938 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -2016,7 +2016,8 @@ static inline bool nocb_gp_enabled_cb(struct rcu_data *rdp) return rcu_segcblist_test_flags(&rdp->cblist, flags); } -static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool *needwake_state) +static inline bool nocb_gp_update_state_deoffloading(struct rcu_data *rdp, +bool *needwake_state) { struct rcu_segcblist *cblist = &rdp->cblist; @@ -2026,7 +2027,7 @@ static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool *needwake_sta if (rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) *needwake_state = true; } - return true; + return false; } /* @@ -2037,7 +2038,7 @@ static inline bool nocb_gp_update_state(struct rcu_data *rdp, bool *needwake_sta rcu_segcblist_clear_flags(cblist, SEGCBLIST_KTHREAD_GP); if (!rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB)) *needwake_state = true; - return false; + return true; } @@ -2075,7 +2076,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp) continue; trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check")); rcu_nocb_lock_irqsave(rdp, flags); - if (!nocb_gp_update_state(rdp, &needwake_state)) { + if (nocb_gp_update_state_deoffloading(rdp, &needwake_state)) { rcu_nocb_unlock_irqrestore(rdp, flags); if (needwake_state) swake_up_one(&rdp->nocb_state_wq); -- 2.9.5
[PATCH tip/core/rcu 10/12] rcu/nocb: Disable bypass when CPU isn't completely offloaded
From: Frederic Weisbecker Currently, the bypass is flushed at the very last moment in the deoffloading procedure. However, this approach leads to a larger state space than would be preferred. This commit therefore disables the bypass at soon as the deoffloading procedure begins, then flushes it. This guarantees that the bypass remains empty and thus out of the way of the deoffloading procedure. Symmetrically, this commit waits to enable the bypass until the offloading procedure has completed. Reported-by: Paul E. McKenney Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- include/linux/rcu_segcblist.h | 7 --- kernel/rcu/tree_plugin.h | 38 +- 2 files changed, 33 insertions(+), 12 deletions(-) diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h index 8afe886..3db96c4 100644 --- a/include/linux/rcu_segcblist.h +++ b/include/linux/rcu_segcblist.h @@ -109,7 +109,7 @@ struct rcu_cblist { * | SEGCBLIST_KTHREAD_GP | * | | * | Kthreads handle callbacks holding nocb_lock, local rcu_core() stops | - * | handling callbacks. | + * | handling callbacks. Enable bypass queueing. | * */ @@ -125,7 +125,7 @@ struct rcu_cblist { * | SEGCBLIST_KTHREAD_GP | * | | * | CB/GP kthreads handle callbacks holding nocb_lock, local rcu_core() | - * | ignores callbacks. | + * | ignores callbacks. Bypass enqueue is enabled. | * * | * v @@ -134,7 +134,8 @@ struct rcu_cblist { * | SEGCBLIST_KTHREAD_GP | * | | * | CB/GP kthreads and local rcu_core() handle callbacks concurrently | - * | holding nocb_lock. Wake up CB and GP kthreads if necessary. | + * | holding nocb_lock. Wake up CB and GP kthreads if necessary. Disable | + * | bypass enqueue. | * * | * v diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index e392bd1..b08564b 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1830,11 +1830,22 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp, unsigned long j = jiffies; long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass); + lockdep_assert_irqs_disabled(); + + // Pure softirq/rcuc based processing: no bypassing, no + // locking. if (!rcu_rdp_is_offloaded(rdp)) { *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); + return false; + } + + // In the process of (de-)offloading: no bypassing, but + // locking. + if (!rcu_segcblist_completely_offloaded(&rdp->cblist)) { + rcu_nocb_lock(rdp); + *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist); return false; /* Not offloaded, no bypassing. */ } - lockdep_assert_irqs_disabled(); // Don't use ->nocb_bypass during early boot. if (rcu_scheduler_active != RCU_SCHEDULER_RUNNING) { @@ -2416,7 +2427,16 @@ static long rcu_nocb_rdp_deoffload(void *arg) pr_info("De-offloading %d\n", rdp->cpu); rcu_nocb_lock_irqsave(rdp, flags); - + /* +* Flush once and for all now. This suffices because we are +* running on the target CPU holding ->nocb_lock (thus having +* interrupts disabled), and because rdp_offload_toggle() +* invokes rcu_segcblist_offload(), which clears SEGCBLIST_OFFLOADED. +* Thus future calls to rcu_segcblist_completely_offloaded() will +* return false, which means that future calls to rcu_nocb_try_bypass() +* will refuse to put anything into the bypass. +*/ + WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies)); ret = rdp_offload_toggle(rdp, false, flags); swait_event_exclusive(rdp->nocb_state_wq, !rcu_segcblist_test_flags(cblist, SEGCBLIST_KTHREAD_CB | @@ -2428,21 +24
[PATCH tip/core/rcu 12/12] rcu/nocb: Move trace_rcu_nocb_wake() calls outside nocb_lock when possible
From: Frederic Weisbecker Those tracing calls don't need to be under ->nocb_lock. This commit therefore moves them outside of that lock. Signed-off-by: Frederic Weisbecker Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index b08564b..9846c8a 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1703,9 +1703,9 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force, lockdep_assert_held(&rdp->nocb_lock); if (!READ_ONCE(rdp_gp->nocb_gp_kthread)) { + rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("AlreadyAwake")); - rcu_nocb_unlock_irqrestore(rdp, flags); return false; } @@ -1955,9 +1955,9 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, // If we are being polled or there is no kthread, just leave. t = READ_ONCE(rdp->nocb_gp_kthread); if (rcu_nocb_poll || !t) { + rcu_nocb_unlock_irqrestore(rdp, flags); trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNotPoll")); - rcu_nocb_unlock_irqrestore(rdp, flags); return; } // Need to actually to a wakeup. @@ -1992,8 +1992,8 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone, TPS("WakeOvfIsDeferred")); rcu_nocb_unlock_irqrestore(rdp, flags); } else { - trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); rcu_nocb_unlock_irqrestore(rdp, flags); + trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("WakeNot")); } return; } -- 2.9.5
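The rule being applied here is general: if a diagnostic call needs only values the caller already has in hand, emit it after dropping the lock so the critical section stays as short as possible. A minimal userspace analogue of the pattern (illustrative only, not kernel code):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int state;

    static void update_and_trace(int new_state)
    {
        int traced;

        pthread_mutex_lock(&lock);
        state = new_state;
        traced = state;        /* capture what the trace needs */
        pthread_mutex_unlock(&lock);

        /* The trace uses only local copies, so it runs unlocked. */
        printf("state is now %d\n", traced);
    }

    int main(void)
    {
        update_and_trace(42);
        return 0;
    }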
[PATCH tip/core/rcu 08/12] rcu: Make nocb_nobypass_lim_per_jiffy static
From: Jiapeng Chong RCU triggers the following sparse warning: kernel/rcu/tree_plugin.h:1497:5: warning: symbol 'nocb_nobypass_lim_per_jiffy' was not declared. Should it be static? This commit therefore makes this variable static. Reported-by: Abaci Robot Frederic Weisbecker Signed-off-by: Jiapeng Chong Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 93d3938..a1a17ad 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1556,7 +1556,7 @@ early_param("rcu_nocb_poll", parse_rcu_nocb_poll); * After all, the main point of bypassing is to avoid lock contention * on ->nocb_lock, which only can happen at high call_rcu() rates. */ -int nocb_nobypass_lim_per_jiffy = 16 * 1000 / HZ; +static int nocb_nobypass_lim_per_jiffy = 16 * 1000 / HZ; module_param(nocb_nobypass_lim_per_jiffy, int, 0); /* -- 2.9.5
[PATCH tip/core/rcu 09/12] rcu/nocb: Fix missed nocb_timer requeue
From: Frederic Weisbecker This sequence of events can lead to a failure to requeue a CPU's ->nocb_timer: 1. There are no callbacks queued for any CPU covered by CPU 0-2's ->nocb_gp_kthread. Note that ->nocb_gp_kthread is associated with CPU 0. 2. CPU 1 enqueues its first callback with interrupts disabled, and thus must defer awakening its ->nocb_gp_kthread. It therefore queues its rcu_data structure's ->nocb_timer. At this point, CPU 1's rdp->nocb_defer_wakeup is RCU_NOCB_WAKE. 3. CPU 2, which shares the same ->nocb_gp_kthread, also enqueues a callback, but with interrupts enabled, allowing it to directly awaken the ->nocb_gp_kthread. 4. The newly awakened ->nocb_gp_kthread associates both CPU 1's and CPU 2's callbacks with a future grace period and arranges for that grace period to be started. 5. This ->nocb_gp_kthread goes to sleep waiting for the end of this future grace period. 6. This grace period elapses before CPU 1's timer fires. This is normally improbable given that the timer is set for only one jiffy, but timers can be delayed. Besides, it is possible that the kernel was built with CONFIG_RCU_STRICT_GRACE_PERIOD=y. 7. The grace period ends, so rcu_gp_kthread awakens the ->nocb_gp_kthread, which in turn awakens both CPU 1's and CPU 2's ->nocb_cb_kthread. Then ->nocb_gp_kthread sleeps waiting for more newly queued callbacks. 8. CPU 1's ->nocb_cb_kthread invokes its callback, then sleeps waiting for more invocable callbacks. 9. Note that neither kthread updated any ->nocb_timer state, so CPU 1's ->nocb_defer_wakeup is still set to RCU_NOCB_WAKE. 10. CPU 1 enqueues its second callback, this time with interrupts enabled so it can directly wake ->nocb_gp_kthread. It does so by calling wake_nocb_gp(), which also cancels the pending timer that got queued in step 2. But that doesn't reset CPU 1's ->nocb_defer_wakeup, which is still set to RCU_NOCB_WAKE. So CPU 1's ->nocb_defer_wakeup and its ->nocb_timer are now desynchronized. 11. ->nocb_gp_kthread associates the callback queued in step 10 with a new grace period, arranges for that grace period to start and sleeps waiting for it to complete. 12. The grace period ends, rcu_gp_kthread awakens ->nocb_gp_kthread, which in turn wakes up CPU 1's ->nocb_cb_kthread which then invokes the callback queued in step 10. 13. CPU 1 enqueues its third callback, this time with interrupts disabled so it must queue a timer for a deferred wakeup. However, the value of its ->nocb_defer_wakeup is RCU_NOCB_WAKE, which incorrectly indicates that a timer is already queued. Instead, CPU 1's ->nocb_timer was cancelled in step 10. CPU 1 therefore fails to queue the ->nocb_timer. 14. CPU 1's pending callback may go unnoticed until some other CPU eventually wakes up ->nocb_gp_kthread or until CPU 1 invokes an explicit deferred wakeup, for example, during idle entry. This commit fixes this bug by resetting rdp->nocb_defer_wakeup every time we delete the ->nocb_timer. It is quite possible that there is a similar scenario involving ->nocb_bypass_timer and ->nocb_defer_wakeup. However, despite some effort from several people, a failure scenario has not yet been located. But that by no means guarantees that no such scenario exists. Finding a failure scenario is left as an exercise for the reader, and the "Fixes:" tag below relates to ->nocb_bypass_timer instead of ->nocb_timer.
Fixes: d1b222c6be1f (rcu/nocb: Add bypass callback queueing) Cc: Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Boqun Feng Reviewed-by: Neeraj Upadhyay Signed-off-by: Frederic Weisbecker Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index a1a17ad..e392bd1 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -1708,7 +1708,11 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force, rcu_nocb_unlock_irqrestore(rdp, flags); return false; } - del_timer(&rdp->nocb_timer); + + if (READ_ONCE(rdp->nocb_defer_wakeup) > RCU_NOCB_WAKE_NOT) { + WRITE_ONCE(rdp->nocb_defer_wakeup, RCU_NOCB_WAKE_NOT); + del_timer(&rdp->nocb_timer); + } rcu_nocb_unlock_irqrestore(rdp, flags); raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags); if (force || READ_ONCE(rdp_gp->nocb_gp_sleep)) { @@ -2335,7 +2339,6 @@ static bool do_nocb_deferred_wakeup_common(struct rcu_data *rdp) return false; } ndw = READ_ONCE(rdp->nocb_defer_wakeup); - WRITE_ONCE(rdp->nocb_defer
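The invariant the fix restores is that ->nocb_defer_wakeup and the ->nocb_timer it describes always change together: cancel the timer and clear the flag as one step. A toy standalone model of that invariant (names illustrative, not kernel code):

    #include <stdbool.h>
    #include <stdio.h>

    enum defer { WAKE_NOT, WAKE };

    struct cpu_state {
        enum defer defer_wakeup;    /* models rdp->nocb_defer_wakeup */
        bool timer_queued;          /* models the ->nocb_timer being armed */
    };

    /* The fix: whenever the timer is cancelled, reset the flag with it, so
     * a later "is a timer already queued?" check sees consistent state. */
    static void cancel_deferred_wakeup(struct cpu_state *cs)
    {
        if (cs->defer_wakeup > WAKE_NOT) {
            cs->defer_wakeup = WAKE_NOT;
            cs->timer_queued = false;    /* del_timer() */
        }
    }

    static void queue_deferred_wakeup(struct cpu_state *cs)
    {
        if (cs->defer_wakeup == WAKE_NOT) {    /* not already pending? */
            cs->defer_wakeup = WAKE;
            cs->timer_queued = true;    /* mod_timer() */
        }
    }

    int main(void)
    {
        struct cpu_state cs = { WAKE_NOT, false };

        queue_deferred_wakeup(&cs);    /* step 2: deferred wakeup armed */
        cancel_deferred_wakeup(&cs);   /* step 10: direct wakeup instead */
        queue_deferred_wakeup(&cs);    /* step 13: now succeeds again */
        printf("timer queued: %d\n", cs.timer_queued);
        return 0;
    }

With the pre-fix behavior (cancel without clearing the flag), the third call would see a stale WAKE value and skip arming the timer, which is the lost wakeup described above.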
[PATCH tip/core/rcu 11/12] rcu/nocb: Remove stale comment above rcu_segcblist_offload()
From: Frederic Weisbecker This commit removes a stale comment claiming that the cblist must be empty before changing the offloading state. This claim was correct back when the offloaded state was defined exclusively at boot. Reported-by: Paul E. McKenney Signed-off-by: Frederic Weisbecker Cc: Josh Triplett Cc: Lai Jiangshan Cc: Joel Fernandes Cc: Neeraj Upadhyay Cc: Boqun Feng Signed-off-by: Paul E. McKenney --- kernel/rcu/rcu_segcblist.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c index 7f181c9..aaa1112 100644 --- a/kernel/rcu/rcu_segcblist.c +++ b/kernel/rcu/rcu_segcblist.c @@ -261,8 +261,7 @@ void rcu_segcblist_disable(struct rcu_segcblist *rsclp) } /* - * Mark the specified rcu_segcblist structure as offloaded. This - * structure must be empty. + * Mark the specified rcu_segcblist structure as offloaded. */ void rcu_segcblist_offload(struct rcu_segcblist *rsclp, bool offload) { -- 2.9.5
[PATCH tip/core/rcu 1/3] rcu: Provide polling interfaces for Tree RCU grace periods
From: "Paul E. McKenney" There is a need for a non-blocking polling interface for RCU grace periods, so this commit supplies start_poll_synchronize_rcu() and poll_state_synchronize_rcu() for this purpose. Note that the existing get_state_synchronize_rcu() may be used if future grace periods are inevitable (perhaps due to a later call_rcu() invocation). The new start_poll_synchronize_rcu() is to be used if future grace periods might not otherwise happen. Finally, poll_state_synchronize_rcu() provides a lockless check for a grace period having elapsed since the corresponding call to either of the get_state_synchronize_rcu() or start_poll_synchronize_rcu(). As with get_state_synchronize_rcu(), the return value from either get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in to a later call to either poll_state_synchronize_rcu() or the existing (might_sleep) cond_synchronize_rcu(). Signed-off-by: Paul E. McKenney --- include/linux/rcutree.h | 2 ++ kernel/rcu/tree.c | 70 + 2 files changed, 67 insertions(+), 5 deletions(-) diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h index df578b7..b89b541 100644 --- a/include/linux/rcutree.h +++ b/include/linux/rcutree.h @@ -41,6 +41,8 @@ void rcu_momentary_dyntick_idle(void); void kfree_rcu_scheduler_running(void); bool rcu_gp_might_be_stalled(void); unsigned long get_state_synchronize_rcu(void); +unsigned long start_poll_synchronize_rcu(void); +bool poll_state_synchronize_rcu(unsigned long oldstate); void cond_synchronize_rcu(unsigned long oldstate); void rcu_idle_enter(void); diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f521..8e0a140 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3774,8 +3774,8 @@ EXPORT_SYMBOL_GPL(synchronize_rcu); * get_state_synchronize_rcu - Snapshot current RCU state * * Returns a cookie that is used by a later call to cond_synchronize_rcu() - * to determine whether or not a full grace period has elapsed in the - * meantime. + * or poll_state_synchronize_rcu() to determine whether or not a full + * grace period has elapsed in the meantime. */ unsigned long get_state_synchronize_rcu(void) { @@ -3789,13 +3789,73 @@ unsigned long get_state_synchronize_rcu(void) EXPORT_SYMBOL_GPL(get_state_synchronize_rcu); /** + * start_poll_state_synchronize_rcu - Snapshot and start RCU grace period + * + * Returns a cookie that is used by a later call to cond_synchronize_rcu() + * or poll_state_synchronize_rcu() to determine whether or not a full + * grace period has elapsed in the meantime. If the needed grace period + * is not already slated to start, notifies RCU core of the need for that + * grace period. + * + * Interrupts must be enabled for the case where it is necessary to awaken + * the grace-period kthread. + */ +unsigned long start_poll_synchronize_rcu(void) +{ + unsigned long flags; + unsigned long gp_seq = get_state_synchronize_rcu(); + bool needwake; + struct rcu_data *rdp; + struct rcu_node *rnp; + + lockdep_assert_irqs_enabled(); + local_irq_save(flags); + rdp = this_cpu_ptr(&rcu_data); + rnp = rdp->mynode; + raw_spin_lock_rcu_node(rnp); // irqs already disabled. 
+ needwake = rcu_start_this_gp(rnp, rdp, gp_seq); + raw_spin_unlock_irqrestore_rcu_node(rnp, flags); + if (needwake) + rcu_gp_kthread_wake(); + return gp_seq; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); + +/** + * poll_state_synchronize_rcu - Conditionally wait for an RCU grace period + * + * @oldstate: return from call to get_state_synchronize_rcu() or start_poll_synchronize_rcu() + * + * If a full RCU grace period has elapsed since the earlier call from + * which oldstate was obtained, return @true, otherwise return @false. + * Otherwise, invoke synchronize_rcu() to wait for a full grace period. + * + * Yes, this function does not take counter wrap into account. + * But counter wrap is harmless. If the counter wraps, we have waited for + * more than 2 billion grace periods (and way more on a 64-bit system!). + * Those needing to keep oldstate values for very long time periods + * (many hours even on 32-bit systems) should check them occasionally + * and either refresh them or set a flag indicating that the grace period + * has completed. + */ +bool poll_state_synchronize_rcu(unsigned long oldstate) +{ + if (rcu_seq_done(&rcu_state.gp_seq, oldstate)) { + smp_mb(); /* Ensure GP ends before subsequent accesses. */ + return true; + } + return false; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); + +/** * cond_synchronize_rcu - Conditionally wait for an RCU grace period * * @oldstate: return value from earlier call to get_state_synchronize_rcu() * * If a full RCU grace period has elapsed since the earlier call to - * get_state_synchronize_rcu(), just return. Otherwise, invoke - * syn
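A typical caller pairs one of the snapshot functions with a later poll or conditional wait. The sketch below is kernel-style code that builds only against a tree containing this patch; the wrapper names and the cookie variable are hypothetical, and only the *_synchronize_rcu() calls come from the patch:

    static unsigned long gp_cookie;

    static void record_update(void)
    {
        /*
         * No call_rcu() is queued here, so a future grace period is not
         * guaranteed: use start_poll_synchronize_rcu() rather than
         * get_state_synchronize_rcu() to make sure one gets underway.
         */
        gp_cookie = start_poll_synchronize_rcu();
    }

    static bool update_visible_to_all(void)
    {
        /* Lockless and non-blocking, so usable from most contexts. */
        return poll_state_synchronize_rcu(gp_cookie);
    }

    static void wait_for_update(void)
    {
        /* Sleeps only if the grace period has not yet completed. */
        cond_synchronize_rcu(gp_cookie);
    }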
[PATCH tip/core/rcu 3/3] rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
From: "Paul E. McKenney" This commit causes rcutorture to test the new start_poll_synchronize_rcu() and poll_state_synchronize_rcu() functions. Because of the difficulty of determining the nature of a synchronous RCU grace (expedited or not), the test that insisted that poll_state_synchronize_rcu() detect an intervening synchronize_rcu() had to be dropped. Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 12 +++- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 99657ff..956e6bf 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -494,6 +494,8 @@ static struct rcu_torture_ops rcu_ops = { .sync = synchronize_rcu, .exp_sync = synchronize_rcu_expedited, .get_gp_state = get_state_synchronize_rcu, + .start_gp_poll = start_poll_synchronize_rcu, + .poll_gp_state = poll_state_synchronize_rcu, .cond_sync = cond_synchronize_rcu, .call = call_rcu, .cb_barrier = rcu_barrier, @@ -1223,14 +1225,6 @@ rcu_torture_writer(void *arg) WARN_ON_ONCE(1); break; } - if (cur_ops->get_gp_state && cur_ops->poll_gp_state) - WARN_ONCE(rcu_torture_writer_state != RTWS_DEF_FREE && - !cur_ops->poll_gp_state(cookie), - "%s: Cookie check 2 failed %s(%d) %lu->%lu\n", - __func__, - rcu_torture_writer_state_getname(), - rcu_torture_writer_state, - cookie, cur_ops->get_gp_state()); } WRITE_ONCE(rcu_torture_current_version, rcu_torture_current_version + 1); @@ -1589,7 +1583,7 @@ static bool rcu_torture_one_read(struct torture_random_state *trsp, long myid) preempt_enable(); if (cur_ops->get_gp_state && cur_ops->poll_gp_state) WARN_ONCE(cur_ops->poll_gp_state(cookie), - "%s: Cookie check 3 failed %s(%d) %lu->%lu\n", + "%s: Cookie check 2 failed %s(%d) %lu->%lu\n", __func__, rcu_torture_writer_state_getname(), rcu_torture_writer_state, -- 2.9.5
[PATCH tip/core/rcu 2/3] rcu: Provide polling interfaces for Tiny RCU grace periods
From: "Paul E. McKenney" There is a need for a non-blocking polling interface for RCU grace periods, so this commit supplies start_poll_synchronize_rcu() and poll_state_synchronize_rcu() for this purpose. Note that the existing get_state_synchronize_rcu() may be used if future grace periods are inevitable (perhaps due to a later call_rcu() invocation). The new start_poll_synchronize_rcu() is to be used if future grace periods might not otherwise happen. Finally, poll_state_synchronize_rcu() provides a lockless check for a grace period having elapsed since the corresponding call to either of the get_state_synchronize_rcu() or start_poll_synchronize_rcu(). As with get_state_synchronize_rcu(), the return value from either get_state_synchronize_rcu() or start_poll_synchronize_rcu() is passed in to a later call to either poll_state_synchronize_rcu() or the existing (might_sleep) cond_synchronize_rcu(). Signed-off-by: Paul E. McKenney --- include/linux/rcutiny.h | 11 ++- kernel/rcu/tiny.c | 40 2 files changed, 46 insertions(+), 5 deletions(-) diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h index 2a97334..69108cf4 100644 --- a/include/linux/rcutiny.h +++ b/include/linux/rcutiny.h @@ -17,14 +17,15 @@ /* Never flag non-existent other CPUs! */ static inline bool rcu_eqs_special_set(int cpu) { return false; } -static inline unsigned long get_state_synchronize_rcu(void) -{ - return 0; -} +unsigned long get_state_synchronize_rcu(void); +unsigned long start_poll_synchronize_rcu(void); +bool poll_state_synchronize_rcu(unsigned long oldstate); static inline void cond_synchronize_rcu(unsigned long oldstate) { - might_sleep(); + if (poll_state_synchronize_rcu(oldstate)) + return; + synchronize_rcu(); } extern void rcu_barrier(void); diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c index aa897c3..c8a029f 100644 --- a/kernel/rcu/tiny.c +++ b/kernel/rcu/tiny.c @@ -32,12 +32,14 @@ struct rcu_ctrlblk { struct rcu_head *rcucblist; /* List of pending callbacks (CBs). */ struct rcu_head **donetail; /* ->next pointer of last "done" CB. */ struct rcu_head **curtail; /* ->next pointer of last CB. */ + unsigned long gp_seq; /* Grace-period counter. */ }; /* Definition for rcupdate control block. */ static struct rcu_ctrlblk rcu_ctrlblk = { .donetail = &rcu_ctrlblk.rcucblist, .curtail= &rcu_ctrlblk.rcucblist, + .gp_seq = 0 - 300UL, }; void rcu_barrier(void) @@ -56,6 +58,7 @@ void rcu_qs(void) rcu_ctrlblk.donetail = rcu_ctrlblk.curtail; raise_softirq_irqoff(RCU_SOFTIRQ); } + WRITE_ONCE(rcu_ctrlblk.gp_seq, rcu_ctrlblk.gp_seq + 1); local_irq_restore(flags); } @@ -177,6 +180,43 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func) } EXPORT_SYMBOL_GPL(call_rcu); +/* + * Return a grace-period-counter "cookie". For more information, + * see the Tree RCU header comment. + */ +unsigned long get_state_synchronize_rcu(void) +{ + return READ_ONCE(rcu_ctrlblk.gp_seq); +} +EXPORT_SYMBOL_GPL(get_state_synchronize_rcu); + +/* + * Return a grace-period-counter "cookie" and ensure that a future grace + * period completes. For more information, see the Tree RCU header comment. + */ +unsigned long start_poll_synchronize_rcu(void) +{ + unsigned long gp_seq = get_state_synchronize_rcu(); + + if (unlikely(is_idle_task(current))) { + /* force scheduling for rcu_qs() */ + resched_cpu(0); + } + return gp_seq; +} +EXPORT_SYMBOL_GPL(start_poll_synchronize_rcu); + +/* + * Return true if the grace period corresponding to oldstate has completed + * and false otherwise. For more information, see the Tree RCU header + * comment. 
+ */ +bool poll_state_synchronize_rcu(unsigned long oldstate) +{ + return READ_ONCE(rcu_ctrlblk.gp_seq) != oldstate; +} +EXPORT_SYMBOL_GPL(poll_state_synchronize_rcu); + void __init rcu_init(void) { open_softirq(RCU_SOFTIRQ, rcu_process_callbacks); -- 2.9.5
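Because Tiny RCU is single-CPU, its polling state reduces to a bare sequence counter: the cookie is a snapshot, and the poll succeeds once the counter has advanced past it. A runnable userspace model of that semantic (ordinary C, no kernel dependencies; names are illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long gp_seq = 0 - 300UL;    /* start near wrap, as the patch does */

    /* A quiescent state ends the current grace period. */
    static void quiescent_state(void)
    {
        gp_seq++;
    }

    static unsigned long get_state(void)
    {
        return gp_seq;
    }

    static bool poll_state(unsigned long oldstate)
    {
        return gp_seq != oldstate;
    }

    int main(void)
    {
        unsigned long cookie = get_state();

        printf("done yet? %d\n", poll_state(cookie));    /* 0: no GP yet */
        quiescent_state();
        printf("done yet? %d\n", poll_state(cookie));    /* 1: GP elapsed */
        return 0;
    }

The inequality test keeps working across counter wrap, which is why the real code can start the counter a few hundred steps short of wrapping as a self-check.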
[PATCH tip/core/rcu 4/5] rcu: Make rcu_read_unlock_special() expedite strict grace periods
From: "Paul E. McKenney" In kernels built with CONFIG_RCU_STRICT_GRACE_PERIOD=y, every grace period is an expedited grace period. However, rcu_read_unlock_special() does not treat them that way, instead allowing the deferred quiescent state to be reported whenever. This commit therefore adds a check of this Kconfig option that causes rcu_read_unlock_special() to treat all grace periods as expedited for CONFIG_RCU_STRICT_GRACE_PERIOD=y kernels. Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index e17cb23..a21c41c 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -615,6 +615,7 @@ static void rcu_read_unlock_special(struct task_struct *t) expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) || (rdp->grpmask & READ_ONCE(rnp->expmask)) || + IS_ENABLED(CONFIG_RCU_STRICT_GRACE_PERIOD) || (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && t->rcu_blocked_node); // Need to defer quiescent state until everything is enabled. -- 2.9.5
[PATCH tip/core/rcu 1/5] rcu: Expedite deboost in case of deferred quiescent state
From: "Paul E. McKenney" Historically, a task that has been subjected to RCU priority boosting is deboosted at rcu_read_unlock() time. However, with the advent of deferred quiescent states, if the outermost rcu_read_unlock() was invoked with either bottom halves, interrupts, or preemption disabled, the deboosting will be delayed for some time. During this time, a low-priority process might be incorrectly running at a high real-time priority level. Fortunately, rcu_read_unlock_special() already provides mechanisms for forcing a minimal deferral of quiescent states, at least for kernels built with CONFIG_IRQ_WORK=y. These mechanisms are currently used when expedited grace periods are pending that might be blocked by the current task. This commit therefore causes those mechanisms to also be used in cases where the current task has been or might soon be subjected to RCU priority boosting. Note that this applies to all kernels built with CONFIG_RCU_BOOST=y, regardless of whether or not they are also built with CONFIG_PREEMPT_RT=y. This approach assumes that kernels build for use with aggressive real-time applications are built with CONFIG_IRQ_WORK=y. It is likely to be far simpler to enable CONFIG_IRQ_WORK=y than to implement a fast-deboosting scheme that works correctly in its absence. While in the area, alphabetize the rcu_preempt_deferred_qs_handler() function's local variables. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Lai Jiangshan Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/tree_plugin.h | 26 ++ 1 file changed, 14 insertions(+), 12 deletions(-) diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h index 2d60377..e17cb23 100644 --- a/kernel/rcu/tree_plugin.h +++ b/kernel/rcu/tree_plugin.h @@ -598,9 +598,9 @@ static void rcu_preempt_deferred_qs_handler(struct irq_work *iwp) static void rcu_read_unlock_special(struct task_struct *t) { unsigned long flags; + bool irqs_were_disabled; bool preempt_bh_were_disabled = !!(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)); - bool irqs_were_disabled; /* NMI handlers cannot block and cannot safely manipulate state. */ if (in_nmi()) @@ -609,30 +609,32 @@ static void rcu_read_unlock_special(struct task_struct *t) local_irq_save(flags); irqs_were_disabled = irqs_disabled_flags(flags); if (preempt_bh_were_disabled || irqs_were_disabled) { - bool exp; + bool expboost; // Expedited GP in flight or possible boosting. struct rcu_data *rdp = this_cpu_ptr(&rcu_data); struct rcu_node *rnp = rdp->mynode; - exp = (t->rcu_blocked_node && - READ_ONCE(t->rcu_blocked_node->exp_tasks)) || - (rdp->grpmask & READ_ONCE(rnp->expmask)); + expboost = (t->rcu_blocked_node && READ_ONCE(t->rcu_blocked_node->exp_tasks)) || + (rdp->grpmask & READ_ONCE(rnp->expmask)) || + (IS_ENABLED(CONFIG_RCU_BOOST) && irqs_were_disabled && + t->rcu_blocked_node); // Need to defer quiescent state until everything is enabled. - if (use_softirq && (in_irq() || (exp && !irqs_were_disabled))) { + if (use_softirq && (in_irq() || (expboost && !irqs_were_disabled))) { // Using softirq, safe to awaken, and either the - // wakeup is free or there is an expedited GP. + // wakeup is free or there is either an expedited + // GP in flight or a potential need to deboost. raise_softirq_irqoff(RCU_SOFTIRQ); } else { // Enabling BH or preempt does reschedule, so... - // Also if no expediting, slow is OK. - // Plus nohz_full CPUs eventually get tick enabled. + // Also if no expediting and no possible deboosting, + // slow is OK. 
Plus nohz_full CPUs eventually get + // tick enabled. set_tsk_need_resched(current); set_preempt_need_resched(); if (IS_ENABLED(CONFIG_IRQ_WORK) && irqs_were_disabled && - !rdp->defer_qs_iw_pending && exp && cpu_online(rdp->cpu)) { + expboost && !rdp->defer_qs_iw_pending && cpu_online(rdp->cpu)) { // Get scheduler to re-evaluate and call hooks. // If !IRQ_WORK, FQS scan will eventually IPI. - init_irq_work(&rdp->defer_qs_iw, - rcu_preempt_deferred_qs_handler); + init_irq_work(&rdp->defer_qs_iw, rcu_preempt_deferred_qs_handler);
[PATCH tip/core/rcu 3/5] rcutorture: Fix testing of RCU priority boosting
From: "Paul E. McKenney" Currently, rcutorture refuses to test RCU priority boosting in CONFIG_HOTPLUG_CPU=y kernels, which are the only kind normally built on x86 these days. This commit therefore updates rcutorture's tests of RCU priority boosting to make them safe for CPU hotplug. However, these tests will fail unless TIMER_SOFTIRQ runs at realtime priority, which does not happen in current mainline. This commit therefore also refuses to test RCU priority boosting except in kernels built with CONFIG_PREEMPT_RT=y. While in the area, this commt adds some debug output at boost-fail time that helps diagnose the cause of the failure, for example, failing to run TIMER_SOFTIRQ at realtime priority. Cc: Sebastian Andrzej Siewior Cc: Scott Wood Cc: Thomas Gleixner Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 36 ++-- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 99657ff..af64bd8 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -245,11 +245,11 @@ static const char *rcu_torture_writer_state_getname(void) return rcu_torture_writer_state_names[i]; } -#if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) -#define rcu_can_boost() 1 -#else /* #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */ -#define rcu_can_boost() 0 -#endif /* #else #if defined(CONFIG_RCU_BOOST) && !defined(CONFIG_HOTPLUG_CPU) */ +#if defined(CONFIG_RCU_BOOST) && defined(CONFIG_PREEMPT_RT) +# define rcu_can_boost() 1 +#else +# define rcu_can_boost() 0 +#endif #ifdef CONFIG_RCU_TRACE static u64 notrace rcu_trace_clock_local(void) @@ -923,9 +923,13 @@ static void rcu_torture_enable_rt_throttle(void) static bool rcu_torture_boost_failed(unsigned long start, unsigned long end) { + static int dbg_done; + if (end - start > test_boost_duration * HZ - HZ / 2) { VERBOSE_TOROUT_STRING("rcu_torture_boost boosting failed"); n_rcu_torture_boost_failure++; + if (!xchg(&dbg_done, 1) && cur_ops->gp_kthread_dbg) + cur_ops->gp_kthread_dbg(); return true; /* failed */ } @@ -948,8 +952,8 @@ static int rcu_torture_boost(void *arg) init_rcu_head_on_stack(&rbi.rcu); /* Each pass through the following loop does one boost-test cycle. */ do { - /* Track if the test failed already in this test interval? */ - bool failed = false; + bool failed = false; // Test failed already in this test interval + bool firsttime = true; /* Increment n_rcu_torture_boosts once per boost-test */ while (!kthread_should_stop()) { @@ -975,18 +979,17 @@ static int rcu_torture_boost(void *arg) /* Do one boost-test interval. */ endtime = oldstarttime + test_boost_duration * HZ; - call_rcu_time = jiffies; while (time_before(jiffies, endtime)) { /* If we don't have a callback in flight, post one. */ if (!smp_load_acquire(&rbi.inflight)) { /* RCU core before ->inflight = 1. */ smp_store_release(&rbi.inflight, 1); - call_rcu(&rbi.rcu, rcu_torture_boost_cb); + cur_ops->call(&rbi.rcu, rcu_torture_boost_cb); /* Check if the boost test failed */ - failed = failed || -rcu_torture_boost_failed(call_rcu_time, -jiffies); + if (!firsttime && !failed) + failed = rcu_torture_boost_failed(call_rcu_time, jiffies); call_rcu_time = jiffies; + firsttime = false; } if (stutter_wait("rcu_torture_boost")) sched_set_fifo_low(current); @@ -999,7 +1002,7 @@ static int rcu_torture_boost(void *arg) * this case the boost check would never happen in the above * loop so do another one here. 
*/ - if (!failed && smp_load_acquire(&rbi.inflight)) + if (!firsttime && !failed && smp_load_acquire(&rbi.inflight)) rcu_torture_boost_failed(call_rcu_time, jiffies); /* @@ -1025,6 +1028,9 @@ checkwait:if (stutter_wait("rcu_torture_boost")) sched_set_fifo_low(current); } while (!torture_must_stop()); + while (smp_load_acquire(&rbi.inflight)) + schedule_timeout_uninterruptible(1); // rcu_barrier() deadlocks. + /* Clean up and exit. */ wh
[PATCH tip/core/rcu 2/5] rcutorture: Make TREE03 use real-time tree.use_softirq setting
From: "Paul E. McKenney" TREE03 tests RCU priority boosting, which is a real-time feature. It would also be good if it tested something closer to what is actually used by the real-time folks. This commit therefore adds tree.use_softirq=0 to the TREE03 kernel boot parameters in TREE03.boot. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot | 1 + 1 file changed, 1 insertion(+) diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot index 1c21894..64f864f1 100644 --- a/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot +++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE03.boot @@ -4,3 +4,4 @@ rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3 rcutree.kthread_prio=2 threadirqs +tree.use_softirq=0 -- 2.9.5
[PATCH tip/core/rcu 1/2] rcu-tasks: Rectify kernel-doc for struct rcu_tasks
From: Lukas Bulwahn The command 'find ./kernel/rcu/ | xargs ./scripts/kernel-doc -none' reported an issue with the kernel-doc of struct rcu_tasks. This commit rectifies the kernel-doc, such that no issues remain for ./kernel/rcu/. Signed-off-by: Lukas Bulwahn Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index af7c194..17c8ebe 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -20,7 +20,7 @@ typedef void (*holdouts_func_t)(struct list_head *hop, bool ndrpt, bool *frptp); typedef void (*postgp_func_t)(struct rcu_tasks *rtp); /** - * Definition for a Tasks-RCU-like mechanism. + * struct rcu_tasks - Definition for a Tasks-RCU-like mechanism. * @cbs_head: Head of callback list. * @cbs_tail: Tail pointer for callback list. * @cbs_wq: Wait queue allowning new callback to get kthread's attention. @@ -38,7 +38,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp); * @pregp_func: This flavor's pre-grace-period function (optional). * @pertask_func: This flavor's per-task scan function (optional). * @postscan_func: This flavor's post-task scan function (optional). - * @holdout_func: This flavor's holdout-list scan function (optional). + * @holdouts_func: This flavor's holdout-list scan function (optional). * @postgp_func: This flavor's post-grace-period function (optional). * @call_func: This flavor's call_rcu()-equivalent function. * @name: This flavor's textual name. -- 2.9.5
[PATCH tip/core/rcu 2/2] rcu-tasks: Add block comment laying out RCU Tasks Trace design
From: "Paul E. McKenney" This commit adds a block comment that gives a high-level overview of how RCU tasks trace grace periods progress. It also adds a note about how exiting tasks are handled, plus it gives an overview of the memory ordering. Reported-by: Peter Zijlstra Reported-by: Mathieu Desnoyers [ paulmck: Fix commit log per Mathieu Desnoyers feedback. ] Signed-off-by: Paul E. McKenney --- kernel/rcu/tasks.h | 36 1 file changed, 36 insertions(+) diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index 17c8ebe..f818357 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -726,6 +726,42 @@ EXPORT_SYMBOL_GPL(show_rcu_tasks_rude_gp_kthread); // flavors, rcu_preempt and rcu_sched. The fact that RCU Tasks Trace // readers can operate from idle, offline, and exception entry/exit in no // way allows rcu_preempt and rcu_sched readers to also do so. +// +// The implementation uses rcu_tasks_wait_gp(), which relies on function +// pointers in the rcu_tasks structure. The rcu_spawn_tasks_trace_kthread() +// function sets these function pointers up so that rcu_tasks_wait_gp() +// invokes these functions in this order: +// +// rcu_tasks_trace_pregp_step(): +// Initialize the count of readers and block CPU-hotplug operations. +// rcu_tasks_trace_pertask(), invoked on every non-idle task: +// Initialize per-task state and attempt to identify an immediate +// quiescent state for that task, or, failing that, attempt to set +// that task's .need_qs flag so that that task's next outermost +// rcu_read_unlock_trace() will report the quiescent state (in which +// case the count of readers is incremented). If both attempts fail, +// the task is added to a "holdout" list. +// rcu_tasks_trace_postscan(): +// Initialize state and attempt to identify an immediate quiescent +// state as above (but only for idle tasks), unblock CPU-hotplug +// operations, and wait for an RCU grace period to avoid races with +// tasks that are in the process of exiting. +// check_all_holdout_tasks_trace(), repeatedly until holdout list is empty: +// Scans the holdout list, attempting to identify a quiescent state +// for each task on the list. If there is a quiescent state, the +// corresponding task is removed from the holdout list. +// rcu_tasks_trace_postgp(): +// Wait for the count of readers do drop to zero, reporting any stalls. +// Also execute full memory barriers to maintain ordering with code +// executing after the grace period. +// +// The exit_tasks_rcu_finish_trace() synchronizes with exiting tasks. +// +// Pre-grace-period update-side code is ordered before the grace +// period via the ->cbs_lock and barriers in rcu_tasks_kthread(). +// Pre-grace-period read-side code is ordered before the grace period by +// atomic_dec_and_test() of the count of readers (for IPIed readers) and by +// scheduler context-switch ordering (for locked-down non-running readers). // The lockdep state must be outside of #ifdef to be useful. #ifdef CONFIG_DEBUG_LOCK_ALLOC -- 2.9.5
[PATCH tip/core/rcu 2/2] rcutorture: Replace rcu_torture_stall string with %s
From: Stephen Zhang This commit replaces a hard-coded "rcu_torture_stall" string in a pr_alert() format with "%s" and __func__. Signed-off-by: Stephen Zhang Signed-off-by: Paul E. McKenney --- kernel/rcu/rcutorture.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c index 99657ff..271726e 100644 --- a/kernel/rcu/rcutorture.c +++ b/kernel/rcu/rcutorture.c @@ -1971,8 +1971,8 @@ static int rcu_torture_stall(void *args) local_irq_disable(); else if (!stall_cpu_block) preempt_disable(); - pr_alert("rcu_torture_stall start on CPU %d.\n", -raw_smp_processor_id()); + pr_alert("%s start on CPU %d.\n", + __func__, raw_smp_processor_id()); while (ULONG_CMP_LT((unsigned long)ktime_get_seconds(), stop_at)) if (stall_cpu_block) @@ -1983,7 +1983,7 @@ static int rcu_torture_stall(void *args) preempt_enable(); cur_ops->readunlock(idx); } - pr_alert("rcu_torture_stall end.\n"); + pr_alert("%s end.\n", __func__); torture_shutdown_absorb("rcu_torture_stall"); while (!kthread_should_stop()) schedule_timeout_interruptible(10 * HZ); -- 2.9.5
[PATCH tip/core/rcu 1/2] torture: Replace torture_init_begin string with %s
From: Stephen Zhang This commit replaces a hard-coded "torture_init_begin" string in a pr_alert() format with "%s" and __func__. Signed-off-by: Stephen Zhang Signed-off-by: Paul E. McKenney --- kernel/torture.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/kernel/torture.c b/kernel/torture.c index 01e336f..0a315c3 100644 --- a/kernel/torture.c +++ b/kernel/torture.c @@ -816,9 +816,9 @@ bool torture_init_begin(char *ttype, int v) { mutex_lock(&fullstop_mutex); if (torture_type != NULL) { - pr_alert("torture_init_begin: Refusing %s init: %s running.\n", -ttype, torture_type); - pr_alert("torture_init_begin: One torture test at a time!\n"); + pr_alert("%s: Refusing %s init: %s running.\n", + __func__, ttype, torture_type); + pr_alert("%s: One torture test at a time!\n", __func__); mutex_unlock(&fullstop_mutex); return false; } -- 2.9.5
[PATCH tip/core/rcu 01/28] torturescript: Don't rerun failed rcutorture builds
From: "Paul E. McKenney" If the build fails when running multiple instances of a given rcutorture scenario, for example, using the kvm.sh --configs "8*RUDE01" argument, the build will be rerun an additional seven times. This is in some sense correct, but it can waste significant time. This commit therefore checks for a prior failed build and simply copies over that build's output. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 536d103..9d8a82c 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -73,7 +73,7 @@ config_override_param "--kconfig argument" KcList "$TORTURE_KCONFIG_ARG" cp $T/KcList $resdir/ConfigFragment base_resdir=`echo $resdir | sed -e 's/\.[0-9]\+$//'` -if test "$base_resdir" != "$resdir" -a -f $base_resdir/bzImage -a -f $base_resdir/vmlinux +if test "$base_resdir" != "$resdir" && test -f $base_resdir/bzImage && test -f $base_resdir/vmlinux then # Rerunning previous test, so use that test's kernel. QEMU="`identify_qemu $base_resdir/vmlinux`" @@ -83,6 +83,17 @@ then ln -s $base_resdir/.config $resdir # for kvm-recheck.sh # Arch-independent indicator touch $resdir/builtkernel +elif test "$base_resdir" != "$resdir" +then + # Rerunning previous test for which build failed + ln -s $base_resdir/Make*.out $resdir # for kvm-recheck.sh + ln -s $base_resdir/.config $resdir # for kvm-recheck.sh + echo Initial build failed, not running KVM, see $resdir. + if test -f $builddir.wait + then + mv $builddir.wait $builddir.ready + fi + exit 1 elif kvm-build.sh $T/KcList $resdir then # Had to build a kernel for this test. -- 2.9.5
[PATCH tip/core/rcu 5/5] torture: Make jitter.sh handle large systems
From: "Paul E. McKenney" The current jitter.sh script expects cpumask bits to fit into whatever the awk interpreter uses for an integer, which clearly does not hold for even medium-sized systems these days. This means that on a large system, only the first 32 or 64 CPUs (depending) are subjected to jitter.sh CPU-time perturbations. This commit therefore computes a given CPU's cpumask using text manipulation rather than arithmetic shifts. Reported-by: Sebastian Andrzej Siewior Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/jitter.sh | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/jitter.sh b/tools/testing/selftests/rcutorture/bin/jitter.sh index 188b864..3a856ec 100755 --- a/tools/testing/selftests/rcutorture/bin/jitter.sh +++ b/tools/testing/selftests/rcutorture/bin/jitter.sh @@ -67,10 +67,10 @@ do srand(n + me + systime()); ncpus = split(cpus, ca); curcpu = ca[int(rand() * ncpus + 1)]; - mask = lshift(1, curcpu); - if (mask + 0 <= 0) - mask = 1; - printf("%#x\n", mask); + z = ""; + for (i = 1; 4 * i <= curcpu; i++) + z = z "0"; + print "0x" 2 ^ (curcpu % 4) z; }' < /dev/null` n=$(($n+1)) if ! taskset -p $cpumask $$ > /dev/null 2>&1 -- 2.9.5
[PATCH tip/core/rcu 02/28] torture: Allow 1G of memory for torture.sh kvfree testing
From: "Paul E. McKenney" Yes, I do recall a time when 512MB of memory was a lot of mass storage, much less main memory, but the rcuscale kvfree_rcu() testing invoked by torture.sh can sometimes exceed it on large systems, resulting in OOM. This commit therefore causes torture.sh to pase the "--memory 1G" argument to kvm.sh to reserve a full gigabyte for this purpose. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/torture.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/torture.sh b/tools/testing/selftests/rcutorture/bin/torture.sh index ad7525b..56e2e1a 100755 --- a/tools/testing/selftests/rcutorture/bin/torture.sh +++ b/tools/testing/selftests/rcutorture/bin/torture.sh @@ -374,7 +374,7 @@ done if test "$do_kvfree" = "yes" then torture_bootargs="rcuscale.kfree_rcu_test=1 rcuscale.kfree_nthreads=16 rcuscale.holdoff=20 rcuscale.kfree_loops=1 torture.disable_onoff_at_boot" - torture_set "rcuscale-kvfree" tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig "CONFIG_NR_CPUS=$HALF_ALLOTED_CPUS" --trust-make + torture_set "rcuscale-kvfree" tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale --allcpus --duration 10 --kconfig "CONFIG_NR_CPUS=$HALF_ALLOTED_CPUS" --memory 1G --trust-make fi echo " --- " $scriptname $args -- 2.9.5
[PATCH tip/core/rcu 03/28] torture: Provide bare-metal modprobe-based advice
From: "Paul E. McKenney" In some environments, the torture-testing use of virtualization is inconvenient. In such cases, the modprobe and rmmod commands may be used to do torture testing, but significant setup is required to build, boot, and modprobe a kernel so as to match a given torture-test scenario. This commit therefore creates a "bare-metal" file in each results directory containing steps to run the corresponding scenario using the modprobe command on bare metal. For example, the contents of this file after using kvm.sh to build an rcutorture TREE01 kernel, perhaps with the --buildonly argument, is as follows: To run this scenario on bare metal: 1. Set your bare-metal build tree to the state shown in this file: /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19/testid.txt 2. Update your bare-metal build tree's .config based on this file: /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19/TREE01/ConfigFragment 3. Make the bare-metal kernel's build system aware of your .config updates: $ yes "" | make oldconfig 4. Build your bare-metal kernel. 5. Boot your bare-metal kernel with the following parameters: maxcpus=8 nr_cpus=43 rcutree.gp_preinit_delay=3 rcutree.gp_init_delay=3 rcutree.gp_cleanup_delay=3 rcu_nocbs=0-1,3-7 6. Start the test with the following command: $ modprobe rcutorture nocbs_nthreads=8 nocbs_toggle=1000 fwd_progress=0 onoff_interval=1000 onoff_holdoff=30 n_barrier_cbs=4 stat_interval=15 shutdown_secs=120 test_no_idle_hz=1 verbose=1 7. After some time, end the test with the following command: $ rmmod rcutorture 8. Copy your bare-metal kernel's .config file, overwriting this file: /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19/TREE01/.config 9. Copy the console output from just before the modprobe to just after the rmmod into this file: /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19/TREE01/console.log 10. Check for runtime errors using the following command: $ tools/testing/selftests/rcutorture/bin/kvm-recheck.sh /home/git/linux-rcu/tools/testing/selftests/rcutorture/res/2021.02.04-17.10.19 Signed-off-by: Paul E. McKenney --- .../selftests/rcutorture/bin/kvm-test-1-run.sh | 44 +++--- tools/testing/selftests/rcutorture/bin/kvm.sh | 4 ++ 2 files changed, 42 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 9d8a82c..03c0410 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -7,15 +7,15 @@ # Execute this in the source tree. Do not run it as a background task # because qemu does not seem to like that much. # -# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args +# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args_in # # qemu-args defaults to "-enable-kvm -nographic", along with arguments # specifying the number of CPUs and other options # generated from the underlying CPU architecture. -# boot_args defaults to value returned by the per_version_boot_params +# boot_args_in defaults to value returned by the per_version_boot_params # shell function. # -# Anything you specify for either qemu-args or boot_args is appended to +# Anything you specify for either qemu-args or boot_args_in is appended to # the default values. The "-smp" value is deduced from the contents of # the config fragment. 
# @@ -134,7 +134,7 @@ do done seconds=$4 qemu_args=$5 -boot_args=$6 +boot_args_in=$6 if test -z "$TORTURE_BUILDONLY" then @@ -144,7 +144,7 @@ fi # Generate -smp qemu argument. qemu_args="-enable-kvm -nographic $qemu_args" cpu_count=`configNR_CPUS.sh $resdir/ConfigFragment` -cpu_count=`configfrag_boot_cpus "$boot_args" "$config_template" "$cpu_count"` +cpu_count=`configfrag_boot_cpus "$boot_args_in" "$config_template" "$cpu_count"` if test "$cpu_count" -gt "$TORTURE_ALLOTED_CPUS" then echo CPU count limited from $cpu_count to $TORTURE_ALLOTED_CPUS | tee -a $resdir/Warnings @@ -160,13 +160,45 @@ qemu_args="$qemu_args `identify_qemu_args "$QEMU" "$resdir/console.log"`" qemu_append="`identify_qemu_append "$QEMU"`" # Pull in Kconfig-fragment boot parameters -boot_args="`configfrag_boot_params "$boot_args" "$config_template"`" +boot_args="`configfrag_boot_params "$boot_args_in" "$config_template"`" # Generate kernel-version-specific boot parameters boot_args="`per_version_boot_params "$boot_args" $resdir/.config $seconds`" if test -n "$TORTURE_BOOT_GDB_ARG" then boot_args="$boot_args $TORTURE_BOOT_GDB_ARG" fi + +# Give bare-metal advice +modprobe_args="`echo $boot_args | tr -s ' ' '\012' | grep "^$TORTURE_MOD\." | sed -e "s/$TORTURE_MOD\.//g"
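The heart of the bare-metal advice is splitting the accumulated boot parameters into those that belong to the torture module (handed to modprobe with their "rcutorture." prefix stripped) and those that must stay on the bare-metal kernel command line. A minimal sketch of that split, with illustrative parameter values and a hypothetical kernel-side filter rather than the exact script text:

  # Split boot parameters: "$TORTURE_MOD."-prefixed ones go to modprobe
  # (prefix stripped); everything else remains a kernel boot parameter.
  TORTURE_MOD=rcutorture
  boot_args="maxcpus=8 rcutree.gp_init_delay=3 rcutorture.stat_interval=15 rcutorture.verbose=1"
  modprobe_args="`echo $boot_args | tr -s ' ' '\012' | grep "^$TORTURE_MOD\." | sed -e "s/$TORTURE_MOD\.//g" | tr '\012' ' '`"
  kboot_args="`echo $boot_args | tr -s ' ' '\012' | grep -v "^$TORTURE_MOD\." | tr '\012' ' '`"
  echo "Boot parameters:  $kboot_args"
  echo "modprobe command: modprobe $TORTURE_MOD $modprobe_args"

The trailing tr that rejoins the filtered words and the grep -v pass that forms the kernel-side list are assumptions in this sketch; the in-tree script may assemble these lists differently.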
[PATCH tip/core/rcu 05/28] rcuscale: Disable verbose torture-test output
From: "Paul E. McKenney" Given large numbers of threads, the quantity of torture-test output is sufficient to sometimes result in RCU CPU stall warnings. The probability of these stall warnings was greatly reduced by batching the output, but the warnings were not eliminated. However, the actual test only depends on console output that is printed even when rcuscale.verbose=0. This commit therefore causes this test to run with rcuscale.verbose=0. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh index 0333e9b..ffbe151 100644 --- a/tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh +++ b/tools/testing/selftests/rcutorture/configs/rcuscale/ver_functions.sh @@ -12,5 +12,5 @@ # Adds per-version torture-module parameters to kernels supporting them. per_version_boot_params () { echo $1 rcuscale.shutdown=1 \ - rcuscale.verbose=1 + rcuscale.verbose=0 } -- 2.9.5
[PATCH tip/core/rcu 04/28] torture: Improve readability of the testid.txt file
From: "Paul E. McKenney" The testid.txt file was intended for occasional in extremis use, but now that the new "bare-metal" file references it, it might see more use. This commit therefore labels sections of output and adds spacing to make it easier to see what needs to be done to make a bare-metal build tree match an rcutorture build tree. Of course, you can avoid this whole issue by building your bare-metal kernel in the same directory in which you ran rcutorture, but that might not always be an option. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm.sh | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index 35a2132..1de198d 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -404,11 +404,16 @@ echo $scriptname $args touch $resdir/$ds/log echo $scriptname $args >> $resdir/$ds/log echo ${TORTURE_SUITE} > $resdir/$ds/TORTURE_SUITE -pwd > $resdir/$ds/testid.txt +echo Build directory: `pwd` > $resdir/$ds/testid.txt if test -d .git then + echo Current commit: `git rev-parse HEAD` >> $resdir/$ds/testid.txt + echo >> $resdir/$ds/testid.txt + echo ' ---' Output of "'"git status"'": >> $resdir/$ds/testid.txt git status >> $resdir/$ds/testid.txt - git rev-parse HEAD >> $resdir/$ds/testid.txt + echo >> $resdir/$ds/testid.txt + echo >> $resdir/$ds/testid.txt + echo ' ---' Output of "'"git diff HEAD"'": >> $resdir/$ds/testid.txt git diff HEAD >> $resdir/$ds/testid.txt fi ___EOF___ -- 2.9.5
[PATCH tip/core/rcu 16/28] torture: Record kvm-test-1-run.sh and kvm-test-1-run-qemu.sh PIDs
From: "Paul E. McKenney" This commit records the process IDs of the kvm-test-1-run.sh and kvm-test-1-run-qemu.sh scripts to ease monitoring of remotely running instances of these scripts. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh | 2 ++ tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh index 6b0d71b..576a9b7 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run-qemu.sh @@ -33,6 +33,8 @@ then exit 1 fi +echo ' ---' `date`: Starting kernel, PID $$ + # Obtain settings from the qemu-cmd file. grep '^#' $resdir/qemu-cmd | sed -e 's/^# //' > $T/qemu-cmd-settings . $T/qemu-cmd-settings diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index a69f8ae..a386ca8d 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -41,7 +41,7 @@ then echo "kvm-test-1-run.sh :$resdir: Not a writable directory, cannot store results into it" exit 1 fi -echo ' ---' `date`: Starting build +echo ' ---' `date`: Starting build, PID $$ echo ' ---' Kconfig fragment at: $config_template >> $resdir/log touch $resdir/ConfigFragment.input -- 2.9.5
[PATCH tip/core/rcu 06/28] refscale: Disable verbose torture-test output
From: "Paul E. McKenney" Given large numbers of threads, the quantity of torture-test output is sufficient to sometimes result in RCU CPU stall warnings. The probability of these stall warnings was greatly reduced by batching the output, but the warnings were not eliminated. However, the actual test only depends on console output that is printed even when refscale.verbose=0. This commit therefore causes this test to run with refscale.verbose=0. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh b/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh index 321e826..f81fa2c 100644 --- a/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh +++ b/tools/testing/selftests/rcutorture/configs/refscale/ver_functions.sh @@ -12,5 +12,5 @@ # Adds per-version torture-module parameters to kernels supporting them. per_version_boot_params () { echo $1 refscale.shutdown=1 \ - refscale.verbose=1 + refscale.verbose=0 } -- 2.9.5
[PATCH tip/core/rcu 07/28] torture: Move build/run synchronization files into scenario directories
From: "Paul E. McKenney" Currently the bN.ready and bN.wait files are placed in the rcutorture directory, which really is not at all a good place for run-specific files. This commit therefore renames these files to build.ready and build.wait and then moves them into the scenario directories within the "res" directory, for example, into tools/testing/selftests/rcutorture/res/2021.02.10-15.08.23/TINY01. Signed-off-by: Paul E. McKenney --- .../selftests/rcutorture/bin/kvm-test-1-run.sh | 25 +++--- tools/testing/selftests/rcutorture/bin/kvm.sh | 10 - 2 files changed, 16 insertions(+), 19 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh index 03c0410..91578d3 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh @@ -7,7 +7,7 @@ # Execute this in the source tree. Do not run it as a background task # because qemu does not seem to like that much. # -# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args_in +# Usage: kvm-test-1-run.sh config resdir seconds qemu-args boot_args_in # # qemu-args defaults to "-enable-kvm -nographic", along with arguments # specifying the number of CPUs and other options @@ -35,8 +35,7 @@ mkdir $T config_template=${1} config_dir=`echo $config_template | sed -e 's,/[^/]*$,,'` title=`echo $config_template | sed -e 's/^.*\///'` -builddir=${2} -resdir=${3} +resdir=${2} if test -z "$resdir" -o ! -d "$resdir" -o ! -w "$resdir" then echo "kvm-test-1-run.sh :$resdir: Not a writable directory, cannot store results into it" @@ -89,9 +88,9 @@ then ln -s $base_resdir/Make*.out $resdir # for kvm-recheck.sh ln -s $base_resdir/.config $resdir # for kvm-recheck.sh echo Initial build failed, not running KVM, see $resdir. - if test -f $builddir.wait + if test -f $resdir/build.wait then - mv $builddir.wait $builddir.ready + mv $resdir/build.wait $resdir/build.ready fi exit 1 elif kvm-build.sh $T/KcList $resdir @@ -118,23 +117,23 @@ else # Build failed. cp .config $resdir || : echo Build failed, not running KVM, see $resdir. - if test -f $builddir.wait + if test -f $resdir/build.wait then - mv $builddir.wait $builddir.ready + mv $resdir/build.wait $resdir/build.ready fi exit 1 fi -if test -f $builddir.wait +if test -f $resdir/build.wait then - mv $builddir.wait $builddir.ready + mv $resdir/build.wait $resdir/build.ready fi -while test -f $builddir.ready +while test -f $resdir/build.ready do sleep 1 done -seconds=$4 -qemu_args=$5 -boot_args_in=$6 +seconds=$3 +qemu_args=$4 +boot_args_in=$5 if test -z "$TORTURE_BUILDONLY" then diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index 1de198d..7944510 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -444,7 +444,6 @@ function dump(first, pastlast, batchnum) print "needqemurun=" jn=1 for (j = first; j < pastlast; j++) { - builddir=KVM "/b" j - first + 1 cpusr[jn] = cpus[j]; if (cfrep[cf[j]] == "") { cfr[jn] = cf[j]; @@ -453,15 +452,15 @@ function dump(first, pastlast, batchnum) cfrep[cf[j]]++; cfr[jn] = cf[j] "." cfrep[cf[j]]; } + builddir=rd cfr[jn] "/build"; if (cpusr[jn] > ncpus && ncpus != 0) ovf = "-ovf"; else ovf = ""; print "echo ", cfr[jn], cpusr[jn] ovf ": Starting build. 
`date` | tee -a " rd "log"; - print "rm -f " builddir ".*"; - print "touch " builddir ".wait"; print "mkdir " rd cfr[jn] " || :"; - print "kvm-test-1-run.sh " CONFIGDIR cf[j], builddir, rd cfr[jn], dur " \"" TORTURE_QEMU_ARG "\" \"" TORTURE_BOOTARGS "\" > " rd cfr[jn] "/kvm-test-1-run.sh.out 2>&1 &" + print "touch " builddir ".wait"; + print "kvm-test-1-run.sh " CONFIGDIR cf[j], rd cfr[jn], dur " \"" TORTURE_QEMU_ARG "\" \"" TORTURE_BOOTARGS "\" > " rd cfr[jn] "/kvm-test-1-run.sh.out 2>&1 &" print "echo ", cfr[jn], cpusr[jn] ovf ": Waiting for build to complete. `date` | tee -a " rd "log"; print "while test -f " builddir ".wait" print "do" @@ -471,7 +470,7 @@ function dump(first, pastlast, batchnum) jn++; } for (j = 1; j < jn; j++) { - builddir=KVM "/b" j + builddir=rd cfr[j] "/build"; print "rm -f " builddir "
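Taken together, the synchronization now uses two files in each scenario directory, with the generated batch script on one side and kvm-test-1-run.sh on the other. A sketch of the handshake with simplified paths (CONFIGS/TINY01 and the target of the final rm are illustrative assumptions, not literal script text):

  # Generated batch script, one scenario (TINY01) shown:
  mkdir $resdir/TINY01 || :
  touch $resdir/TINY01/build.wait
  kvm-test-1-run.sh CONFIGS/TINY01 $resdir/TINY01 $dur "$qemu_args" "$boot_args" \
          > $resdir/TINY01/kvm-test-1-run.sh.out 2>&1 &
  while test -f $resdir/TINY01/build.wait    # build still in progress
  do
          sleep 1
  done
  # ... and once every build in the batch has completed (the exact file
  # removed here is an assumption):
  rm -f $resdir/TINY01/build.ready           # let the kernel run proceed

  # Inside kvm-test-1-run.sh, $resdir is the scenario directory passed above:
  if test -f $resdir/build.wait              # signal that the build is done
  then
          mv $resdir/build.wait $resdir/build.ready
  fi
  while test -f $resdir/build.ready          # wait for the rest of the batch
  do
          sleep 1
  done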
[PATCH tip/core/rcu 11/28] torture: Reverse jittering and duration parameters for jitter.sh
From: "Paul E. McKenney" Remote rcutorture testing requires that jitter.sh continue to be invoked from the generated script for local runs, but that it instead be invoked on the remote system for distributed runs. This argues for common jitterstart and jitterstop scripts. But it would be good for jitterstart and jitterstop to control the name and location of the "jittering" file, while continuing to have the duration controlled by the caller of these new scripts. This commit therefore reverses the order of the jittering and duration parameters for jitter.sh, so that the jittering parameter precedes the duration parameter. Signed-off-by: Paul E. McKenney --- tools/testing/selftests/rcutorture/bin/jitter.sh | 6 +++--- tools/testing/selftests/rcutorture/bin/kvm.sh| 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/tools/testing/selftests/rcutorture/bin/jitter.sh b/tools/testing/selftests/rcutorture/bin/jitter.sh index ed0ea86..ff1d3e4 100755 --- a/tools/testing/selftests/rcutorture/bin/jitter.sh +++ b/tools/testing/selftests/rcutorture/bin/jitter.sh @@ -5,7 +5,7 @@ # of this script is to inflict random OS jitter on a concurrently running # test. # -# Usage: jitter.sh me duration jittering-path [ sleepmax [ spinmax ] ] +# Usage: jitter.sh me jittering-path duration [ sleepmax [ spinmax ] ] # # me: Random-number-generator seed salt. # duration: Time to run in seconds. @@ -18,8 +18,8 @@ # Authors: Paul E. McKenney me=$(($1 * 1000)) -duration=$2 -jittering=$3 +jittering=$2 +duration=$3 sleepmax=${4-100} spinmax=${5-1000} diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh index de93802..a2ee3f2 100755 --- a/tools/testing/selftests/rcutorture/bin/kvm.sh +++ b/tools/testing/selftests/rcutorture/bin/kvm.sh @@ -504,7 +504,7 @@ function dump(first, pastlast, batchnum) print "\techo Starting kernels. `date` | tee -a " rd "log"; print "\ttouch " rd "jittering" for (j = 0; j < njitter; j++) - print "\tjitter.sh " j " " dur " " rd "jittering " ja[2] " " ja[3] "&" + print "\tjitter.sh " j " " rd "jittering " dur " " ja[2] " " ja[3] "&" print "\twhile ls $runfiles > /dev/null 2>&1" print "\tdo" print "\t\t:" -- 2.9.5