On 18.02.21 06:21, Roman Shaposhnik wrote:
On Wed, Feb 17, 2021 at 12:29 AM Jürgen Groß <jgr...@suse.com> wrote:

    On 17.02.21 09:12, Roman Shaposhnik wrote:
     > Hi Jürgen, thanks for taking a look at this. A few comments below:
     >
     > On Tue, Feb 16, 2021 at 10:47 PM Jürgen Groß <jgr...@suse.com> wrote:
     >>
     >> On 16.02.21 21:34, Stefano Stabellini wrote:
     >>> + x86 maintainers
     >>>
     >>> It looks like the tlbflush is getting stuck?
     >>
     >> I have seen this case multiple times on customer systems now, but
     >> reproducing it reliably seems to be very hard.
     >
     > It is reliably reproducible under my workload but it takes a long time
     > (~3 days of the workload running in the lab).

    This is by far the best reproduction rate I have seen up to now.

    The next best reproducer seems to be a huge installation with several
    hundred hosts and thousands of VMs with about 1 crash each week.

     >
     >> I suspected fifo events to be blamed, but just yesterday I've been
     >> informed of another case with fifo events disabled in the guest.
     >>
     >> One common pattern seems to be that up to now I have seen this effect
     >> only on systems with Intel Gold cpus. Can it be confirmed to be true
     >> in this case, too?
     >
     > I am pretty sure mine isn't -- I can get you full CPU specs if that's useful.

    Just the output of "grep model /proc/cpuinfo" should be enough.


processor: 3
vendor_id: GenuineIntel
cpu family: 6
model: 77
model name: Intel(R) Atom(TM) CPU  C2550  @ 2.40GHz
stepping: 8
microcode: 0x12d
cpu MHz: 1200.070
cache size: 1024 KB
physical id: 0
siblings: 4
core id: 3
cpu cores: 4
apicid: 6
initial apicid: 6
fpu: yes
fpu_exception: yes
cpuid level: 11
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 movbe popcnt tsc_deadline_timer aes rdrand lahf_lm 3dnowprefetch cpuid_fault epb pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid tsc_adjust smep erms dtherm ida arat md_clear
vmx flags: vnmi preemption_timer invvpid ept_x_only flexpriority tsc_offset vtpr mtf vapic ept vpid unrestricted_guest
bugs: cpu_meltdown spectre_v1 spectre_v2 mds msbds_only
bogomips: 4800.19
clflush size: 64
cache_alignment: 64
address sizes: 36 bits physical, 48 bits virtual
power management:

     >
     >> In case anybody has a reproducer (either in a guest or dom0) with a
     >> setup where a diagnostic kernel can be used, I'd be _very_ interested!
     >
     > I can easily add things to Dom0 and DomU. Whether that will disrupt the
     > experiment is, of course, another matter. Still please let me know what
     > would be helpful to do.

    Is there a chance to switch to an upstream kernel in the guest? I'd like
    to add some diagnostic code to the kernel and creating the patches will
    be easier this way.


That's a bit tough -- the VM is based on stock Ubuntu, and if I upgrade the kernel I'll have to fiddle with a lot of things to make the workload functional again.

However, I can install a debug kernel (from Ubuntu, etc.).

Of course, if patching the kernel is the only way to make progress -- let's try that -- please let me know.

I have found a nice upstream patch, which - with some modifications - I
plan to give our customer as a workaround.

The patch is for kernel 4.12, but chances are good it will apply to a
4.15 kernel, too.

Are you able to give it a try? I hope it will fix the hangs, and in case a stuck situation gets fixed this way there should be a message on the console.
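For illustration, a stall report from this patch would show up on the console roughly like the lines below (the CPU numbers, the timing value and the function/pointer shown are made up here; the format follows the pr_alert() strings in the patch, and a backtrace of the stuck CPU is dumped as well):

  csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000012345 ns for CPU#01 flush_tlb_func_remote+0x0/0x140(0xffff8880367404c0).
  	csd: CSD lock (#1) unresponsive.
  csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#01
  csd: CSD lock (#1) got unstuck on CPU#03, CPU#01 released the lock.

The last line is the one to watch for: it means the stall resolved itself, here after the IPI was re-sent.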


Juergen
From: "Juergen Gross" <jgr...@suse.com>
Date: Thu, 18 Feb 2021 09:22:54 +0100
Subject: [PATCH] kernel/smp: Provide CSD lock timeout diagnostics

This commit causes csd_lock_wait() to emit diagnostics when a CPU
fails to respond quickly enough to one of the smp_call_function()
family of function calls.

If such a stall is detected, the CPU which ought to execute the function
will be pinged again, in case the IPI somehow got lost.

This commit is based on an upstream patch by Paul E. McKenney.

Signed-off-by: Juergen Gross <jgr...@suse.com>
---
diff --git a/kernel/smp.c b/kernel/smp.c
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,6 +19,9 @@
 #include <linux/sched.h>
 #include <linux/sched/idle.h>
 #include <linux/hypervisor.h>
+#include <linux/sched/clock.h>
+#include <linux/nmi.h>
+#include <linux/sched/debug.h>
 
 #include "smpboot.h"
 
@@ -96,6 +99,79 @@ void __init call_function_init(void)
 	smpcfd_prepare_cpu(smp_processor_id());
 }
 
+static DEFINE_PER_CPU(call_single_data_t *, cur_csd);
+static DEFINE_PER_CPU(smp_call_func_t, cur_csd_func);
+static DEFINE_PER_CPU(void *, cur_csd_info);
+
+#define CSD_LOCK_TIMEOUT (5ULL * NSEC_PER_SEC)
+atomic_t csd_bug_count = ATOMIC_INIT(0);
+
+/* Record current CSD work for current CPU, NULL to erase. */
+static void csd_lock_record(call_single_data_t *csd)
+{
+	if (!csd) {
+		smp_mb(); /* NULL cur_csd after unlock. */
+		__this_cpu_write(cur_csd, NULL);
+		return;
+	}
+	__this_cpu_write(cur_csd_func, csd->func);
+	__this_cpu_write(cur_csd_info, csd->info);
+	smp_wmb(); /* func and info before csd. */
+	__this_cpu_write(cur_csd, csd);
+	smp_mb(); /* Update cur_csd before function call. */
+		  /* Or before unlock, as the case may be. */
+}
+
+/* Complain if too much time spent waiting. */
+static __always_inline bool csd_lock_wait_toolong(call_single_data_t *csd, u64 ts0, u64 *ts1, int *bug_id, unsigned int cpu)
+{
+	bool firsttime;
+	u64 ts2, ts_delta;
+	call_single_data_t *cpu_cur_csd;
+	unsigned int flags = READ_ONCE(csd->flags);
+
+	if (!(flags & CSD_FLAG_LOCK)) {
+		if (!unlikely(*bug_id))
+			return true;
+		pr_alert("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock.\n",
+			 *bug_id, raw_smp_processor_id(), cpu);
+		return true;
+	}
+
+	ts2 = sched_clock();
+	ts_delta = ts2 - *ts1;
+	if (likely(ts_delta <= CSD_LOCK_TIMEOUT))
+		return false;
+
+	firsttime = !*bug_id;
+	if (firsttime)
+		*bug_id = atomic_inc_return(&csd_bug_count);
+	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpu)); /* Before func and info. */
+	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 cpu, csd->func, csd->info);
+	if (cpu_cur_csd && csd != cpu_cur_csd) {
+		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
+			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpu)),
+			 READ_ONCE(per_cpu(cur_csd_info, cpu)));
+	} else {
+		pr_alert("\tcsd: CSD lock (#%d) %s.\n",
+			 *bug_id, !cpu_cur_csd ? "unresponsive" : "handling this request");
+	}
+	if (cpu >= 0) {
+		if (!trigger_single_cpu_backtrace(cpu))
+			dump_cpu_task(cpu);
+		if (!cpu_cur_csd) {
+			pr_alert("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", *bug_id, raw_smp_processor_id(), cpu);
+			arch_send_call_function_single_ipi(cpu);
+		}
+	}
+	dump_stack();
+	*ts1 = ts2;
+
+	return false;
+}
+
 /*
  * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
  *
@@ -103,14 +179,23 @@ void __init call_function_init(void)
  * previous function call. For multi-cpu calls its even more interesting
  * as we'll have to ensure no other cpu is observing our csd.
  */
-static __always_inline void csd_lock_wait(call_single_data_t *csd)
+static __always_inline void csd_lock_wait(call_single_data_t *csd, unsigned int cpu)
 {
-	smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
+	int bug_id = 0;
+	u64 ts0, ts1;
+
+	ts1 = ts0 = sched_clock();
+	for (;;) {
+		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, cpu))
+			break;
+		cpu_relax();
+	}
+	smp_acquire__after_ctrl_dep();
 }
 
-static __always_inline void csd_lock(call_single_data_t *csd)
+static __always_inline void csd_lock(call_single_data_t *csd, unsigned int cpu)
 {
-	csd_lock_wait(csd);
+	csd_lock_wait(csd, cpu);
 	csd->flags |= CSD_FLAG_LOCK;
 
 	/*
@@ -148,9 +233,11 @@ static int generic_exec_single(int cpu,
 		 * We can unlock early even for the synchronous on-stack case,
 		 * since we're doing this from the same CPU..
 		 */
+		csd_lock_record(csd);
 		csd_unlock(csd);
 		local_irq_save(flags);
 		func(info);
+		csd_lock_record(NULL);
 		local_irq_restore(flags);
 		return 0;
 	}
@@ -238,6 +325,7 @@ static void flush_smp_call_function_queu
 		smp_call_func_t func = csd->func;
 		void *info = csd->info;
 
+		csd_lock_record(csd);
 		/* Do we wait until *after* callback? */
 		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
 			func(info);
@@ -246,6 +334,7 @@ static void flush_smp_call_function_queu
 			csd_unlock(csd);
 			func(info);
 		}
+		csd_lock_record(NULL);
 	}
 
 	/*
@@ -293,13 +382,13 @@ int smp_call_function_single(int cpu, sm
 	csd = &csd_stack;
 	if (!wait) {
 		csd = this_cpu_ptr(&csd_data);
-		csd_lock(csd);
+		csd_lock(csd, cpu);
 	}
 
 	err = generic_exec_single(cpu, csd, func, info);
 
 	if (wait)
-		csd_lock_wait(csd);
+		csd_lock_wait(csd, cpu);
 
 	put_cpu();
 
@@ -331,7 +420,7 @@ int smp_call_function_single_async(int c
 
 	/* We could deadlock if we have to wait here with interrupts disabled! */
 	if (WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK))
-		csd_lock_wait(csd);
+		csd_lock_wait(csd, cpu);
 
 	csd->flags = CSD_FLAG_LOCK;
 	smp_wmb();
@@ -448,7 +537,7 @@ void smp_call_function_many(const struct
 	for_each_cpu(cpu, cfd->cpumask) {
 		call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
 
-		csd_lock(csd);
+		csd_lock(csd, cpu);
 		if (wait)
 			csd->flags |= CSD_FLAG_SYNCHRONOUS;
 		csd->func = func;
@@ -465,7 +554,7 @@ void smp_call_function_many(const struct
 			call_single_data_t *csd;
 
 			csd = per_cpu_ptr(cfd->csd, cpu);
-			csd_lock_wait(csd);
+			csd_lock_wait(csd, cpu);
 		}
 	}
 }
