Re: [PATCH] smp/call: Detect stuck CSD locks

2015-05-20 Thread Chris J Arges
On Mon, May 11, 2015 at 04:00:03PM +0200, Ingo Molnar wrote:
> > So potentially, CPU0 generated an interrupt that caused
> > vcpu_enter_guest to be called on CPU1. However, when
> > vmx_handle_external_intr was called, it didn't progress any further.
> > So the IPI does look to be lost in t[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-05-11 Thread Ingo Molnar
* Chris J Arges wrote:
> Later in the trace we see the same call followed by
> vmx_handle_external_intr() ignoring the call:
>
> [ 603.248016] 2452.083823 | 0) | ptep_clear_flush() {
> [ 603.248016] 2452.083824 | 0) |   flush_tlb_page() {
> [ 603.2480[...]
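For context, this trace matters because the ptep_clear_flush() -> flush_tlb_page() path is what takes the CSD lock in the first place: the remote TLB flush goes through the smp_call_function machinery, and the sender spins in csd_lock_wait() until the target CPU runs the callback. A minimal illustration of that shape, using the generic API rather than the arch-internal flush code (the callback and function names below are made up for illustration):

	/* Illustration only: a synchronous cross-CPU call of the kind the
	 * TLB-flush path relies on.  The sender queues a call_single_data
	 * entry, sends an IPI, and spins in csd_lock_wait() until the
	 * target's IPI handler has run the callback and released the lock
	 * -- the spin this thread keeps finding stuck.
	 */
	static void remote_flush_cb(void *info)
	{
		/* runs on the target CPU, in IPI context */
	}

	static void do_remote_flush(int target_cpu)
	{
		/* wait=1: spin until remote_flush_cb() completes remotely */
		smp_call_function_single(target_cpu, remote_flush_cb, NULL, 1);
	}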

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-29 Thread Chris J Arges
On Thu, Apr 16, 2015 at 06:31:40PM +0200, Ingo Molnar wrote:
> Now what would be nice is to observe whether the CPU that is not
> doing the CSD wait is truly locked up.
>
> It might be executing random KVM-ish workloads and the various
> backtraces we've seen so far are just a random sampl[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-16 Thread Ingo Molnar
* Chris J Arges wrote:
> A previous backtrace of a 3.19 series kernel is here, showing interrupts
> enabled on both CPUs on L1:
> https://lkml.org/lkml/2015/2/23/234
> http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt
>
> [...]
>
> Yes, I think at this point I'll go through the var[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-16 Thread Chris J Arges
On Thu, Apr 16, 2015 at 01:04:23PM +0200, Ingo Molnar wrote:
>
> * Chris J Arges wrote:
>
> > Ingo,
> >
> > Below are the patches and data I've gathered from the reproducer. My
> > methodology was as described previously; however I used gdb on the
> > qemu process in order to breakpoint L1 on[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-16 Thread Ingo Molnar
* Chris J Arges wrote:
> Ingo,
>
> Below are the patches and data I've gathered from the reproducer. My
> methodology was as described previously; however I used gdb on the
> qemu process in order to breakpoint L1 once we've detected the hang.
> This made dumping the kvm_lapic structures on[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-15 Thread Chris J Arges
On Mon, Apr 13, 2015 at 08:14:51AM +0200, Ingo Molnar wrote:
> > FYI, I'm working on getting better data at the moment and here is my
> > approach:
> > * For the L0 kernel:
> >   - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output
> >     (and print the addresses of various u[...]
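For reference, 'apic_debug' in arch/x86/kvm/lapic.c of that era is a no-op macro with a printk variant commented out next to it, so enabling it is a one-line change along these lines (a sketch, assuming the 3.19-era definition):

	/* arch/x86/kvm/lapic.c -- the stock definition is a no-op:
	 *     #define apic_debug(fmt, arg...)
	 * Redefining it to print gives the extra output referred to above:
	 */
	#define apic_debug(fmt, arg...) printk(KERN_WARNING fmt, ##arg)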

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-12 Thread Ingo Molnar
* Chris J Arges wrote:
> /sys/module/kvm_intel/parameters/enable_apicv on the affected
> hardware is not enabled, and unfortunately my hardware doesn't have
> the necessary features to enable it. So we are dealing with KVM's
> lapic implementation only.

That's actually pretty fortunate, as w[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-12 Thread Chris J Arges
> So it would be really important to see the stack dump of CPU#0 at this
> point, and also do an APIC state dump of it.
>
> Because from previous dumps it appeared that the 'stuck' CPU was just
> in its idle loop - which is 'impossible' as the idle loop should still
> allow APIC irqs arriving[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-07 Thread Ingo Molnar
* Chris J Arges wrote:
> Ingo,
>
> Looks like sched_clock() works in this case.
>
> Adding the dump_stack() line caused various issues such as the VM
> oopsing on boot or the softlockup never being detected properly (and
> thus not crashing). So the below is running with your patch and
> 'd[...]
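Since sched_clock() is reported to work here where the jiffies-based timing did not, a minimal sketch of the wait loop timed with sched_clock() instead (the helper name and the 5 s threshold are illustrative, borrowed from the timeout used elsewhere in this thread):

	/* Sketch, not the actual patch: sched_clock() returns nanoseconds
	 * and keeps advancing even when the tick that updates jiffies is
	 * stalled, so the timeout below can actually fire.
	 */
	static void csd_lock_wait_ns(struct call_single_data *csd, int cpu)
	{
		u64 ts0 = sched_clock();

		while (csd->flags & CSD_FLAG_LOCK) {
			if (sched_clock() - ts0 > 5ULL * NSEC_PER_SEC) {
				pr_warn("csd: CSD lock stuck on CPU#%d\n", cpu);
				ts0 = sched_clock();	/* warn periodically */
			}
			cpu_relax();
		}
	}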

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-07 Thread Linus Torvalds
On Tue, Apr 7, 2015 at 1:59 PM, Chris J Arges wrote:
>
> Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT
> to 5s):
> [ 22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x[...]
> [ 38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-07 Thread Chris J Arges
On Tue, Apr 07, 2015 at 11:21:21AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds wrote:
>
> > On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges wrote:
> > >
> > > I noticed that with this patch it never reached 'csd: Detected
> > > non-responsive CSD lock...' because it seems that ts_delta nev[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-07 Thread Ingo Molnar
* Linus Torvalds wrote:
> On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges wrote:
> >
> > I noticed that with this patch it never reached 'csd: Detected
> > non-responsive CSD lock...' because it seems that ts_delta never reached
> > CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and stil[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-06 Thread Linus Torvalds
On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges wrote:
>
> I noticed that with this patch it never reached 'csd: Detected
> non-responsive CSD lock...' because it seems that ts_delta never reached
> CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a
> hang without reaching this st[...]

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-06 Thread Chris J Arges
On 04/03/2015 12:43 AM, Ingo Molnar wrote:
>
> * Chris J Arges wrote:
>
>> Ingo,
>>
>> I think tracking IPI calls from 'generic_exec_single' would make a lot
>> of sense. When you say poll for completion do you mean a loop after
>> 'arch_send_call_function_single_ipi' in kernel/smp.c? My main co[...]
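The idea under discussion is, roughly: after generic_exec_single() fires the IPI, spin on the csd until the target unlocks it, and resend the IPI if the wait drags on. A minimal sketch against the 3.19-era kernel/smp.c types (the resend policy and the helper name are illustrative, not the final patch):

	/* Sketch: poll for completion after sending the IPI, resending if
	 * the target does not respond in time.  Timeout value is arbitrary.
	 */
	static void send_ipi_and_poll(int cpu, struct call_single_data *csd)
	{
		unsigned long deadline = jiffies + 5 * HZ;	/* ~5s */

		arch_send_call_function_single_ipi(cpu);

		while (csd->flags & CSD_FLAG_LOCK) {
			if (time_after(jiffies, deadline)) {
				pr_warn("csd: CPU#%d not responding, resending IPI\n", cpu);
				arch_send_call_function_single_ipi(cpu);
				deadline = jiffies + 5 * HZ;
			}
			cpu_relax();
		}
	}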

Re: [PATCH] smp/call: Detect stuck CSD locks

2015-04-02 Thread Ingo Molnar
* Ingo Molnar wrote:
> +static void csd_lock_wait(struct call_single_data *csd, int cpu)
>  {
> -	while (csd->flags & CSD_FLAG_LOCK)
> +	int bug_id = 0;
> +	u64 ts0, ts1, ts_delta;
> +
> +	ts0 = jiffies_to_msecs(jiffies);

Note that while 'jiffies' is a global variable, it's read[...]
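To make the shape of the detection loop concrete, here is a self-contained sketch of where that diff is headed, reconstructed from the fragments quoted in this thread -- the CSD_LOCK_TIMEOUT value, the bug_id counter, and the warning text are taken from the excerpts above, but this is a reconstruction, not the code as posted:

	#define CSD_LOCK_TIMEOUT	5000ULL		/* ms; the thread used 5s */

	/* Sketch of the instrumented wait loop this thread iterates on. */
	static void csd_lock_wait(struct call_single_data *csd, int cpu)
	{
		int bug_id = 0;
		u64 ts0, ts1, ts_delta;

		ts0 = jiffies_to_msecs(jiffies);

		while (csd->flags & CSD_FLAG_LOCK) {
			ts1 = jiffies_to_msecs(jiffies);
			ts_delta = ts1 - ts0;

			if (unlikely(ts_delta >= CSD_LOCK_TIMEOUT)) {
				bug_id++;
				pr_warn("csd: Detected non-responsive CSD lock (#%d) on CPU#%02d\n",
					bug_id, cpu);
				ts0 = ts1;	/* rearm, so we warn periodically */
			}
			cpu_relax();
		}
	}

The caveat, as the 2015-04-06 messages above suggest, is that if jiffies itself stops advancing during the spin, ts_delta never reaches the threshold; the later report that sched_clock() works in this case points at exactly that weakness.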