On Tue, Mar 31, 2015 at 12:56:56PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <torva...@linux-foundation.org> wrote:
> 
> > Ok, interesting. So the whole "we try to do an APIC ACK with the ISR
> > bit clear" seems to be a real issue.
> 
> It's interesting in particular when it happens with an edge-triggered
> interrupt source: it's much harder to miss level triggered IRQs, which
> stay around until actively handled. Edge triggered irqs are more
> fragile to loss of event processing.
> 
> > > Anyway, maybe this sheds some more light on this issue. I can
> > > reproduce this at will, so let me know of other experiments to do.
> 
> Btw., could you please describe (again) what your current best method
> for reproduction is? It's been a long discussion ...
> 
Ingo,

To set this up, I've done the following on a Xeon E5620 / Xeon E312xx machine
(although I've heard from others who have reproduced this on other machines):

1) Create an L1 KVM VM with 2 vCPUs (the single-vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Add the following to the L1 cmdline:
   nmi_watchdog=panic hung_task_panic=1 softlockup_panic=1 unknown_nmi_panic
4) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue. I've observed that
running KSM in the L1 VM can aggravate this issue (it calls
native_flush_tlb_others). If this doesn't reproduce, then you can do the
following:

5) Migrate the L2 vCPU randomly (via 'virsh vcpupin --live' or taskset)
   between L1 vCPUs until the hang occurs.

I attempted to write a module that used smp_call_function_single calls to
trigger IPIs, but have been unable to create a simpler reproducer (a rough
sketch of what I tried is included below, after the log output).

> > Somebody else who knows the apic needs to also take a look, but I'd
> > love to hear what the actual hardware interrupt is (from that
> > "vector_irq[vector]" thing above.
> >
> > I'm not recognizing 0xe1 as any of the hardcoded SMP vectors (they
> > are 0xf0-0xff), so it sounds like an external one. But that then
> > requires the whole mapping table thing.
> >
> > Ingo/Peter/Jiang - is there anything else useful we could print out?
> > I worry about the irq movement code. Can we add printk's to when an
> > irq is chasing from one CPU to another and doing that
> > "move_in_progress" thing? I've always been scared of that code.
> 
> 1)
> 
> So the irq_cfg::move_in_progress field originates from:
> 
>   610142927b5b ("[PATCH] x86_64 irq: Safely cleanup an irq after moving it.")
> 
> and I agree that it looks fragile, so it would be nice to see how much
> (if at all?) it gets used, by sticking a few pr_info()s in it.
> 
> Debug patch for the vector movement state machine attached below.
> Untested.
> 
> 2)
> 
> But ... having taken a quick look at the vector movement handling, I
> am scared more than ever, and I cannot convince myself that it's race
> free. It's possibly harmless, but still, famous last words ...
> 
> For example, most ->move_in_progress transitions happen with the
> vector_lock held - with the exception of send_cleanup_vector(): there
> we are only holding the desc->lock, but that's not enough! For
> example, this codepath:
> 
> static void __irq_complete_move(struct irq_cfg *cfg, unsigned vector)
> {
>         unsigned me;
> 
>         if (likely(!cfg->move_in_progress))
>                 return;
> 
>         me = smp_processor_id();
> 
>         if (vector == cfg->vector && cpumask_test_cpu(me, cfg->domain))
>                 send_cleanup_vector(cfg);
> 
> ...
> 
> void send_cleanup_vector(struct irq_cfg *cfg)
> {
> ...
> 
>         cfg->move_in_progress = 0;
> }
> 
> is blatantly racy, unless I missed something obvious?
> 
> 2b)
> 
> Initially I thought this only affects slowpaths like driver unregister
> or CPU hotplug, but I think this would be relevant for the fastpath of
> edge triggered irqs as well:
> 
> void apic_ack_edge(struct irq_data *data)
> {
>         irq_complete_move(irqd_cfg(data));
>         irq_move_irq(data);
>         ack_APIC_irq();
> }
> 
> and irq_complete_move() checks and clears cfg->move_in_progress as
> listed above.
> 
> So in most scenarios this is probably harmless, because it can in the
> worst case result in the missing of a ->move_in_progress flag setting.
> 
> But it does not look harmless in the apic_set_affinity() code path:
> there we call into assign_irq_vector() without the desc->lock held,
> due to this gem in the IRQ core:
> 
> int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
> {
>         unsigned long flags;
>         struct irq_desc *desc = irq_get_desc_lock(irq, &flags,
>                                                   IRQ_GET_DESC_CHECK_GLOBAL);
> 
>         if (!desc)
>                 return -EINVAL;
>         desc->affinity_hint = m;
>         irq_put_desc_unlock(desc, flags);
>         /* set the initial affinity to prevent every interrupt being on CPU0 */
>         if (m)
>                 __irq_set_affinity(irq, m, false);
>         return 0;
> }
> 
> argh!
> 
> Now if this ever crosses paths with an assign_irq_vector() call on
> another CPU, then I really cannot see what would save us from deep
> trouble: we use cfg->vector while that is possibly being modified,
> right?
> 
> So I'd suggest to put a printk into irq_set_affinity_hint() as well,
> to make sure it's not used during this test. In particular:
> 
> drivers/virtio/virtio_pci_common.c:     irq_set_affinity_hint(irq, NULL);
> drivers/virtio/virtio_pci_common.c:     irq_set_affinity_hint(irq, mask);
> 
> might be triggering it in the virtualization code...
> 
> The second patch below adds the relevant pr_info(). (untested)
> 
> Now this too might be unrelated, because the affinity hint was added
> ages ago, in:
> 
>   e7a297b0d7d6 ("genirq: Add CPU mask affinity hint")
> 
> and the potentially racy nature of calling into set_affinity without
> the desc lock held was probably not realized back then.
> 
> VirtIO's use of the affinity hint was added a while back as well, four
> years ago, in:
> 
>   75a0a52be3c2 ("virtio: introduce an API to set affinity for a virtqueue")
> 
> 2c)
> 
> The other thing that worries me here is that we apparently send an
> IRQ-move IPI while having an un-acked local APIC (!). I have vague
> memories that this was problematic and fragile before.
> 
> To ease my worries in this area I've attached a third patch below that
> changes the order around.
> 
> This too is ancient behavior - but might be more prominent on certain
> hardware (and virtualization) variants, so it has a chance to be
> relevant.
> 
> 3)
> 
> Furthermore, I also noticed that the whole vector_lock()/unlock()
> business is actively misleading IMHO: the vector_lock looks local to
> vector.c, while we have these two calls that essentially export it ...
> 
> Some of that damage was inflicted in this ancient commit:
> 
>   d388e5fdc461 ("x86: Restore proper vector locking during cpu hotplug")
> 
> The proper approach would be to protect against hotplug events in
> assign_irq_vector(), not the other way around!
> 
> Probably not a functional problem, yet somewhat disgusting.
> 
> -------------------------------
> 
> So ... all in all, these various problems might be related to the
> regression or not, but they sure need addressing.
> 
> Thanks,
> 
>         Ingo
> 

I've merged the below patch with Linus' patch here:
https://lkml.org/lkml/2015/3/31/955

This was only tested on the L1 so far; I can put this on the L0 host and run
it there as well.
The results:

[ 124.897002] apic: vector c1, new-domain move in progress
[ 124.954827] apic: vector d1, sent cleanup vector, move completed
[ 163.477270] apic: vector d1, new-domain move in progress
[ 164.041938] apic: vector e1, sent cleanup vector, move completed
[ 213.466971] apic: vector e1, new-domain move in progress
[ 213.775639] apic: vector 22, sent cleanup vector, move completed
[ 365.996747] apic: vector 22, new-domain move in progress
[ 366.011136] apic: vector 42, sent cleanup vector, move completed
[ 393.836032] apic: vector 42, new-domain move in progress
[ 393.837727] apic: vector 52, sent cleanup vector, move completed
[ 454.977514] apic: vector 52, new-domain move in progress
[ 454.978880] apic: vector 62, sent cleanup vector, move completed
[ 467.055730] apic: vector 62, new-domain move in progress
[ 467.058129] apic: vector 72, sent cleanup vector, move completed
[ 545.280125] apic: vector 72, new-domain move in progress
[ 545.282801] apic: vector 82, sent cleanup vector, move completed
[ 567.631652] apic: vector 82, new-domain move in progress
[ 567.632207] apic: vector 92, sent cleanup vector, move completed
[ 628.940638] apic: vector 92, new-domain move in progress
[ 628.965274] apic: vector a2, sent cleanup vector, move completed
[ 635.187433] apic: vector a2, new-domain move in progress
[ 635.191643] apic: vector b2, sent cleanup vector, move completed
[ 673.548020] apic: vector b2, new-domain move in progress
[ 673.553843] apic: vector c2, sent cleanup vector, move completed
[ 688.221906] apic: vector c2, new-domain move in progress
[ 688.229487] apic: vector d2, sent cleanup vector, move completed
[ 723.818916] apic: vector d2, new-domain move in progress
[ 723.828970] apic: vector e2, sent cleanup vector, move completed
[ 733.485435] apic: vector e2, new-domain move in progress
[ 733.615007] apic: vector 23, sent cleanup vector, move completed
[ 824.092036] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]
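
For reference, this is roughly the kind of module I was experimenting with:
it just bounces a no-op smp_call_function_single() IPI across the online CPUs
from a kthread. Treat it as an untested sketch (the names are placeholders);
loading it in the L1 guest alongside the L2 stress run never got me the hang
on its own.

/*
 * ipi_stress.c - sketch of the IPI test module mentioned above.
 * Untested as posted; module and function names are just placeholders.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/smp.h>
#include <linux/err.h>

static struct task_struct *ipi_task;

/* Runs on the target CPU in IPI context; the IPI traffic itself is the point. */
static void ipi_noop(void *info)
{
}

static int ipi_stress_thread(void *data)
{
        int cpu = -1;

        while (!kthread_should_stop()) {
                /* Round-robin over the online CPUs, waiting for each IPI. */
                cpu = cpumask_next(cpu, cpu_online_mask);
                if (cpu >= nr_cpu_ids) {
                        cpu = -1;
                        continue;
                }
                smp_call_function_single(cpu, ipi_noop, NULL, 1);
                cond_resched();
        }
        return 0;
}

static int __init ipi_stress_init(void)
{
        ipi_task = kthread_run(ipi_stress_thread, NULL, "ipi_stress");
        return PTR_ERR_OR_ZERO(ipi_task);
}

static void __exit ipi_stress_exit(void)
{
        kthread_stop(ipi_task);
}

module_init(ipi_stress_init);
module_exit(ipi_stress_exit);
MODULE_LICENSE("GPL");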
Thanks,
--chris j arges

> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 6cedd7914581..1dd1de9dd690 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -143,6 +143,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
>                  cpumask_andnot(cfg->old_domain, cfg->domain, tmp_mask);
>                  cfg->move_in_progress =
>                          cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +                if (cfg->move_in_progress)
> +                        pr_info("apic: vector %02x, same-domain move in progress\n", cfg->vector);
>                  cpumask_and(cfg->domain, cfg->domain, tmp_mask);
>                  break;
>          }
> @@ -178,6 +180,8 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)
>                  cpumask_copy(cfg->old_domain, cfg->domain);
>                  cfg->move_in_progress =
>                          cpumask_intersects(cfg->old_domain, cpu_online_mask);
> +                if (cfg->move_in_progress)
> +                        pr_info("apic: vector %02x, new-domain move in progress\n", cfg->vector);
>          }
>          for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
>                  per_cpu(vector_irq, new_cpu)[vector] = irq;
> @@ -232,6 +236,7 @@ void clear_irq_vector(int irq, struct irq_cfg *cfg)
>                  }
>          }
>          cfg->move_in_progress = 0;
> +        pr_info("apic: vector %02x, vector cleared, move completed\n", cfg->vector);
>          raw_spin_unlock_irqrestore(&vector_lock, flags);
>  }
> 
> @@ -391,6 +396,7 @@ void send_cleanup_vector(struct irq_cfg *cfg)
>                  free_cpumask_var(cleanup_mask);
>          }
>          cfg->move_in_progress = 0;
> +        pr_info("apic: vector %02x, sent cleanup vector, move completed\n", cfg->vector);
>  }
> 
>  asmlinkage __visible void smp_irq_move_cleanup_interrupt(void)
> 
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index c0a1100d911f..18300a7b8ed2 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -250,8 +250,10 @@ int irq_set_affinity_hint(unsigned int irq, const struct cpumask *m)
>          desc->affinity_hint = m;
>          irq_put_desc_unlock(desc, flags);
>          /* set the initial affinity to prevent every interrupt being on CPU0 */
> -        if (m)
> +        if (m) {
> +                pr_info("irq/core: Called irq_set_affinity_hint(%d)\n", irq);
>                  __irq_set_affinity(irq, m, false);
> +        }
>          return 0;
>  }
>  EXPORT_SYMBOL_GPL(irq_set_affinity_hint);
> 
> diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
> index 6cedd7914581..833a981c5420 100644
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)
> 
>  void apic_ack_edge(struct irq_data *data)
>  {
> +        ack_APIC_irq();
> +
> +        /* Might generate IPIs, so do this after having ACKed the APIC: */
>          irq_complete_move(irqd_cfg(data));
>          irq_move_irq(data);
> -        ack_APIC_irq();
>  }
> 
>  /*