On 10/31/20 5:08 AM, marma...@invisiblethingslab.com wrote:
On Sat, Oct 31, 2020 at 04:27:58AM +0100, Dario Faggioli wrote:
> On Sat, 2020-10-31 at 03:54 +0100, marma...@invisiblethingslab.com wrote:
> > On Sat, Oct 31, 2020 at 02:34:32AM +0000, Dario Faggioli wrote:
> > > (XEN) *** Dumping CPU7 host state: ***
> > > (XEN) Xen call trace:
> > > (XEN)   [<ffff82d040223625>] R _spin_lock+0x35/0x40
> > > (XEN)   [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
> > > (XEN)   [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
> > > (XEN)   [<ffff82d0402df160>] S context_switch+0x110/0xa60
> > > (XEN)   [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
> > > (XEN)   [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
> > > (XEN)   [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30
> >
> > And so on, for (almost?) all CPUs.
>
> Right. So, it seems like a live (I would say) lock. It might happen on
> some resource which is shared among domains. And introduced (the
> livelock, not the resource or the sharing) in 4.14.
>
> Just giving a quick look, I see that vmx_do_resume() calls
> vmx_clear_vmcs(), which calls on_selected_cpus(), which takes the
> call_lock spinlock. And none of these seems to have received much
> attention recently. But this is just a really basic analysis!

I've looked at on_selected_cpus() and my understanding is this:

1. take the call_lock spinlock
2. set the function + args + which CPUs are to be called in a global
   "call_data" variable
3. ask those CPUs to execute that function (the
   smp_send_call_function_mask() call)
4. wait for all requested CPUs to execute the function, still holding the
   spinlock
5. only then - release the spinlock

So, if any CPU does not execute the requested function for any reason, it
will keep the call_lock locked forever.

I don't see any CPU waiting on step 4, but also I don't see call traces
from CPU3 and CPU8 in the log - that's because they are in guest (dom0
here) context, right? I do see "guest state" dumps from them.

The only three CPUs that do have logged Xen call traces and are not
waiting on that spin lock are:

CPU0:
(XEN) Xen call trace:
(XEN)   [<ffff82d040240f89>] R vcpu_unblock+0x9/0x50
(XEN)   [<ffff82d0402e0171>] S vcpu_kick+0x11/0x60
(XEN)   [<ffff82d0402259c8>] S tasklet.c#do_tasklet_work+0x68/0xc0
(XEN)   [<ffff82d040225a59>] S tasklet.c#tasklet_softirq_action+0x39/0x60
(XEN)   [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)   [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

CPU4:
(XEN) Xen call trace:
(XEN)   [<ffff82d040227043>] R set_timer+0x133/0x220
(XEN)   [<ffff82d040234e90>] S credit.c#csched_tick+0/0x3a0
(XEN)   [<ffff82d04022660f>] S timer.c#timer_softirq_action+0x9f/0x300
(XEN)   [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)   [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

CPU14:
(XEN) Xen call trace:
(XEN)   [<ffff82d040222dc0>] R do_softirq+0/0x10
(XEN)   [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

I'm not sure if any of those is related to that spin lock, the
on_selected_cpus() call, or anything like that...
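As an aside, the five-step flow described above can be sketched in
stand-alone C with pthreads. This is only an illustration, not the actual
Xen code: call_lock, call_data and the step numbering come from the
description above, while on_selected_cpus_demo, cpu_thread, say_hello and
the kick[] array (standing in for the IPI) are invented for the demo.

/*
 * Stand-alone illustration (pthreads, NOT Xen code) of the pattern above:
 *   1. take a global lock
 *   2. publish function + args in a shared "call_data"
 *   3. kick the target CPUs
 *   4. spin until every target has run the function - still holding the lock
 *   5. only then release the lock
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4

static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;

static struct {
    void (*func)(void *);
    void *info;
    atomic_int pending;        /* targets that still have to run func */
    atomic_int kick[NR_CPUS];  /* stands in for the IPI */
} call_data;

static void on_selected_cpus_demo(int nr_targets, void (*func)(void *),
                                  void *info)
{
    pthread_mutex_lock(&call_lock);                 /* step 1 */

    call_data.func = func;                          /* step 2 */
    call_data.info = info;
    atomic_store(&call_data.pending, nr_targets);

    for (int cpu = 0; cpu < nr_targets; cpu++)      /* step 3 */
        atomic_store(&call_data.kick[cpu], 1);

    while (atomic_load(&call_data.pending) > 0)     /* step 4, lock held */
        ;                                           /* cpu_relax() in Xen */

    pthread_mutex_unlock(&call_lock);               /* step 5 */
}

/* Each "CPU": wait for a kick, run the published function, acknowledge.
 * A CPU that never gets here (stuck elsewhere, not taking the "IPI", ...)
 * leaves the initiator spinning in step 4 forever, with call_lock held. */
static void *cpu_thread(void *arg)
{
    int cpu = (int)(long)arg;

    for (;;) {
        if (atomic_exchange(&call_data.kick[cpu], 0)) {
            call_data.func(call_data.info);
            atomic_fetch_sub(&call_data.pending, 1);
        }
    }
    return NULL;
}

static void say_hello(void *info)
{
    printf("hello from a target CPU (%s)\n", (const char *)info);
}

int main(void)
{
    pthread_t t[NR_CPUS];
    static char msg[] = "demo";

    for (long i = 0; i < NR_CPUS; i++)
        pthread_create(&t[i], NULL, cpu_thread, (void *)i);

    on_selected_cpus_demo(NR_CPUS, say_hello, msg);
    printf("all targets answered, call_lock released\n");
    return 0;   /* worker threads are just left running; it's only a demo */
}

Built with cc -pthread, it prints once per target and then releases the
lock. If any one worker never runs the published function, the initiator
keeps spinning in step 4 with call_lock held, and every other caller of the
same path then piles up on the lock - which is what the dumped traces
above look like.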
Hi,

Some newer logs are here: https://gist.github.com/fepitre/5b2da8cf2ef976c0b885ce7bcfbf7313

They contain a piece of the serial console output at the moment of the
hang/freeze, followed by the output of debug keys 'd' and '0', which got
blocked at one VCPU.

I hope that will help.

Regards,
Frédéric