On 10/31/20 5:08 AM, marma...@invisiblethingslab.com wrote:
On Sat, Oct 31, 2020 at 04:27:58AM +0100, Dario Faggioli wrote:
> On Sat, 2020-10-31 at 03:54 +0100, marma...@invisiblethingslab.com wrote:
> > On Sat, Oct 31, 2020 at 02:34:32AM +0000, Dario Faggioli wrote:
> > > (XEN) *** Dumping CPU7 host state: ***
> > > (XEN) Xen call trace:
> > > (XEN)   [<ffff82d040223625>] R _spin_lock+0x35/0x40
> > > (XEN)   [<ffff82d0402233cd>] S on_selected_cpus+0x1d/0xc0
> > > (XEN)   [<ffff82d040284aba>] S vmx_do_resume+0xba/0x1b0
> > > (XEN)   [<ffff82d0402df160>] S context_switch+0x110/0xa60
> > > (XEN)   [<ffff82d04024310a>] S core.c#schedule+0x1aa/0x250
> > > (XEN)   [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
> > > (XEN)   [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30
> >
> > And so on, for (almost?) all CPUs.
>
> Right. So, it seems like a live (I would say) lock. It might happen on
> some resource which is shared among domains. And introduced (the
> livelock, not the resource or the sharing) in 4.14.
>
> Just giving a quick look, I see that vmx_do_resume() calls
> vmx_clear_vmcs(), which calls on_selected_cpus(), which takes the
> call_lock spinlock. And none of these seems to have received much
> attention recently. But this is just a really basic analysis!

I've looked at on_selected_cpus() and my understanding is this:

1. take the call_lock spinlock
2. set the function + args + which CPUs are to be called in a global
   "call_data" variable
3. ask those CPUs to execute that function (the
   smp_send_call_function_mask() call)
4. wait for all requested CPUs to execute the function, still holding the
   spinlock
5. only then - release the spinlock

So, if any CPU does not execute the requested function for any reason, it
will keep the call_lock locked forever.

I don't see any CPU waiting on step 4, but also I don't see call traces
from CPU3 and CPU8 in the log - that's because they are in guest (dom0
here) context, right? I do see "guest state" dumps from them.

The only three CPUs that do have logged Xen call traces and are not
waiting on that spin lock are:

CPU0:
(XEN) Xen call trace:
(XEN)   [<ffff82d040240f89>] R vcpu_unblock+0x9/0x50
(XEN)   [<ffff82d0402e0171>] S vcpu_kick+0x11/0x60
(XEN)   [<ffff82d0402259c8>] S tasklet.c#do_tasklet_work+0x68/0xc0
(XEN)   [<ffff82d040225a59>] S tasklet.c#tasklet_softirq_action+0x39/0x60
(XEN)   [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)   [<ffff82d040291b6b>] S vmx_asm_do_vmentry+0x2b/0x30

CPU4:
(XEN) Xen call trace:
(XEN)   [<ffff82d040227043>] R set_timer+0x133/0x220
(XEN)   [<ffff82d040234e90>] S credit.c#csched_tick+0/0x3a0
(XEN)   [<ffff82d04022660f>] S timer.c#timer_softirq_action+0x9f/0x300
(XEN)   [<ffff82d040222d4a>] S softirq.c#__do_softirq+0x5a/0xa0
(XEN)   [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

CPU14:
(XEN) Xen call trace:
(XEN)   [<ffff82d040222dc0>] R do_softirq+0/0x10
(XEN)   [<ffff82d0402d64e6>] S x86_64/entry.S#process_softirqs+0x6/0x20

I'm not sure if any of those is related to that spin lock, the
on_selected_cpus() call, or anything like that...
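As an aside, the five-step flow described above can be sketched in
stand-alone C with pthreads. This is only an illustration, not the actual
Xen code: call_lock, call_data and the step numbering come from the
description above, while on_selected_cpus_demo, cpu_thread, say_hello and
the kick[] array (standing in for the IPI) are invented for the demo.

/*
 * Stand-alone illustration (pthreads, NOT Xen code) of the pattern above:
 *   1. take a global lock
 *   2. publish function + args in a shared "call_data"
 *   3. kick the target CPUs
 *   4. spin until every target has run the function - still holding the lock
 *   5. only then release the lock
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4

static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;

static struct {
    void (*func)(void *);
    void *info;
    atomic_int pending;        /* targets that still have to run func */
    atomic_int kick[NR_CPUS];  /* stands in for the IPI */
} call_data;

static void on_selected_cpus_demo(int nr_targets, void (*func)(void *),
                                  void *info)
{
    pthread_mutex_lock(&call_lock);                 /* step 1 */

    call_data.func = func;                          /* step 2 */
    call_data.info = info;
    atomic_store(&call_data.pending, nr_targets);

    for (int cpu = 0; cpu < nr_targets; cpu++)      /* step 3 */
        atomic_store(&call_data.kick[cpu], 1);

    while (atomic_load(&call_data.pending) > 0)     /* step 4, lock held */
        ;                                           /* cpu_relax() in Xen */

    pthread_mutex_unlock(&call_lock);               /* step 5 */
}

/* Each "CPU": wait for a kick, run the published function, acknowledge.
 * A CPU that never gets here (stuck elsewhere, not taking the "IPI", ...)
 * leaves the initiator spinning in step 4 forever, with call_lock held. */
static void *cpu_thread(void *arg)
{
    int cpu = (int)(long)arg;

    for (;;) {
        if (atomic_exchange(&call_data.kick[cpu], 0)) {
            call_data.func(call_data.info);
            atomic_fetch_sub(&call_data.pending, 1);
        }
    }
    return NULL;
}

static void say_hello(void *info)
{
    printf("hello from a target CPU (%s)\n", (const char *)info);
}

int main(void)
{
    pthread_t t[NR_CPUS];
    static char msg[] = "demo";

    for (long i = 0; i < NR_CPUS; i++)
        pthread_create(&t[i], NULL, cpu_thread, (void *)i);

    on_selected_cpus_demo(NR_CPUS, say_hello, msg);
    printf("all targets answered, call_lock released\n");
    return 0;   /* worker threads are just left running; it's only a demo */
}

Built with cc -pthread, it prints once per target and then releases the
lock. If any one worker never runs the published function, the initiator
keeps spinning in step 4 with call_lock held, and every other caller of the
same path then piles up on the lock - which is what the dumped traces
above look like.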
Hi,

Some newer logs are here: https://gist.github.com/fepitre/5b2da8cf2ef976c0b885ce7bcfbf7313

They contain a piece of the serial console output at the moment of the
hang/freeze, followed by the output of debug keys 'd' and '0', which got
blocked at one VCPU.

I hope that will help.

Regards,
Frédéric