On 22.07.19 16:22, Sergey Dyasli wrote:
On 19/07/2019 14:57, Juergen Gross wrote:
I now have a git branch available with the two problems corrected and
rebased to current staging:
github.com/jgross1/xen.git sched-v1b
Many thanks for the branch! As for the crashes, the vcpu_sleep_sync() one
seems to be fixed now, but I can still reproduce the shutdown one.
Interestingly, it now happens only if the host has running VMs (which
are automatically powered off via PV tools):
(XEN) [ 332.981355] Preparing system for ACPI S5 state.
(XEN) [ 332.981419] Disabling non-boot CPUs ...
(XEN) [ 337.703896] Watchdog timer detects that CPU1 is stuck!
(XEN) [ 337.709532] ----[ Xen-4.13.0-8.0.6-d x86_64 debug=y Not tainted ]----
(XEN) [ 337.716808] CPU: 1
(XEN) [ 337.719582] RIP: e008:[<ffff82d08024041c>] sched_context_switched+0xaf/0x101
(XEN) [ 337.727384] RFLAGS: 0000000000000202 CONTEXT: hypervisor
(XEN) [ 337.733364] rax: 0000000000000002 rbx: ffff83081cc615b0 rcx: 0000000000000001
(XEN) [ 337.741338] rdx: ffff83081cc61634 rsi: ffff83081cc72000 rdi: ffff83081cc72000
(XEN) [ 337.749312] rbp: ffff83081cc8fdc0 rsp: ffff83081cc8fda0 r8: 0000000000000000
(XEN) [ 337.757284] r9: 0000000000000000 r10: 0000004d88fc535e r11: 0000004df8675ce7
(XEN) [ 337.765256] r12: ffff83081cc72000 r13: ffff83081cc72000 r14: ffff83081ccb0e80
(XEN) [ 337.773232] r15: ffff83081cc615b0 cr0: 000000008005003b cr4: 00000000001526e0
(XEN) [ 337.781206] cr3: 00000000dd2a1000 cr2: ffff88809ed1fb80
(XEN) [ 337.787100] fsb: 0000000000000000 gsb: ffff8880a38c0000 gss: 0000000000000000
(XEN) [ 337.795072] ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
(XEN) [ 337.802525] Xen code around <ffff82d08024041c> (sched_context_switched+0xaf/0x101):
(XEN) [ 337.810672] 00 00 eb 18 f3 90 8b 02 <85> c0 75 f8 eb 0e 49 8b 7e 30 48 85 ff 74 05 e8
(XEN) [ 337.819080] Xen stack trace from rsp=ffff83081cc8fda0:
(XEN) [ 337.824713] ffff83081cc72000 ffff83081cc72000 0000000000000000 ffff83081cc615b0
(XEN) [ 337.832772] ffff83081cc8fe00 ffff82d0802404e0 0000000000000082 ffff83081ccb0e98
(XEN) [ 337.840832] 0000000000000001 ffff83081ccb0e98 0000000000000001 ffff82d080602628
(XEN) [ 337.848895] ffff83081cc8fe60 ffff82d080240aca 0000004d873bd669 0000000000000001
(XEN) [ 337.856952] ffff83081cc72000 0000004d873bdc1c ffff8308000000ff ffff82d0805bba00
(XEN) [ 337.865012] ffff82d0805bb980 ffffffffffffffff ffff83081cc8ffff 0000000000000001
(XEN) [ 337.873072] ffff83081cc8fe90 ffff82d080242315 0000000000000080 ffff82d0805bb980
(XEN) [ 337.881132] 0000000000000001 ffff82d0806026f0 ffff83081cc8fea0 ffff82d08024236a
(XEN) [ 337.889196] ffff83081cc8fef0 ffff82d08027a151 ffff82d080242315 000000010665f000
(XEN) [ 337.897256] ffff83081cc72000 ffff83081cc72000 ffff83080665f000 ffff83081cc63000
(XEN) [ 337.905313] 0000000000000001 ffff830806684000 ffff83081cc8fd78 ffff88809ee08000
(XEN) [ 337.913373] ffff88809ee08000 0000000000000000 0000000000000000 0000000000000003
(XEN) [ 337.921434] ffff88809ee08000 0000000000000246 aaaaaaaaaaaaaaaa 0000000000000000
(XEN) [ 337.929497] 0000000096968abe 0000000000000000 ffffffff810013aa ffffffff8203c190
(XEN) [ 337.937554] deadbeefdeadf00d deadbeefdeadf00d 0000010000000000 ffffffff810013aa
(XEN) [ 337.945615] 000000000000e033 0000000000000246 ffffc900400afeb0 000000000000e02b
(XEN) [ 337.953674] 000000000000beef 000000000000beef 000000000000beef 000000000000beef
(XEN) [ 337.961736] 0000e01000000001 ffff83081cc72000 000000379c66db80 00000000001526e0
(XEN) [ 337.969797] 0000000000000000 0000000000000000 0000060000000000 0000000000000000
(XEN) [ 337.977856] Xen call trace:
(XEN) [ 337.981152] [<ffff82d08024041c>] sched_context_switched+0xaf/0x101
(XEN) [ 337.988083] [<ffff82d0802404e0>] schedule.c#sched_context_switch+0x72/0x151
(XEN) [ 337.995796] [<ffff82d080240aca>] schedule.c#sched_slave+0x2a3/0x2b2
(XEN) [ 338.002817] [<ffff82d080242315>] softirq.c#__do_softirq+0x85/0x90
(XEN) [ 338.009664] [<ffff82d08024236a>] do_softirq+0x13/0x15
(XEN) [ 338.015471] [<ffff82d08027a151>] domain.c#idle_loop+0xb2/0xc9
(XEN) [ 338.021970]
(XEN) [ 338.023965] CPU7 @ e008:ffff82d080242f94 (stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [ 338.032372] CPU5 @ e008:ffff82d080242f94 (stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [ 338.040776] CPU4 @ e008:ffff82d080242f94 (stop_machine.c#stopmachine_action+0x30/0xa0)
(XEN) [ 338.049182] CPU2 @ e008:ffff82d080242f9a (stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [ 338.057591] CPU6 @ e008:ffff82d080242f9a (stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [ 338.065999] CPU3 @ e008:ffff82d080242f9a (stop_machine.c#stopmachine_action+0x36/0xa0)
(XEN) [ 338.074406] CPU0 @ e008:ffff82d0802532d1 (ns16550.c#ns_read_reg+0x21/0x42)
(XEN) [ 338.081773]
(XEN) [ 338.083764] ****************************************
(XEN) [ 338.089226] Panic on CPU 1:
(XEN) [ 338.092521] FATAL TRAP: vector = 2 (nmi)
(XEN) [ 338.096940] [error_code=0000]
(XEN) [ 338.100491] ****************************************
(XEN) [ 338.105951]
(XEN) [ 338.107946] Reboot in five seconds...
(XEN) [ 338.112105] Executing kexec image on cpu1
(XEN) [ 338.117383] Shot down all CPUs
And since Igor managed to fix kdump, I can now post backtraces from
all CPUs as well: https://paste.debian.net/1092609/
Thanks for the test (and report).
The fix is a one-liner. :-)
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index f0bc5b3161..da9efb147f 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -2207,6 +2207,7 @@ static struct sched_unit *sched_wait_rendezvous_in(struct sched_unit *prev,
         if ( unlikely(!scheduler_active) )
         {
             ASSERT(is_idle_unit(prev));
+            atomic_set(&prev->next_task->rendezvous_out_cnt, 0);
             prev->rendezvous_in_cnt = 0;
         }
     }
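
For anyone following along, here is a minimal user-space sketch of why a
leftover non-zero counter would presumably keep a CPU spinning until the
watchdog fires, and why zeroing it lets the CPU proceed. It is not the Xen
code: struct unit, wait_rendezvous_out(), reset_on_shutdown(), demo() and
the max_spins bound are all hypothetical names made up for illustration,
and C11 atomics stand in for Xen's atomic_t helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a scheduling unit; NOT Xen's struct sched_unit. */
struct unit {
    atomic_int rendezvous_out_cnt;  /* CPUs still expected to leave the rendezvous */
};

/*
 * Spin until the counter reaches zero. max_spins is an assumption added so
 * that the sketch terminates; an unbounded loop of this shape is what a
 * watchdog would report as a stuck CPU.
 */
static bool wait_rendezvous_out(struct unit *u, unsigned long max_spins)
{
    while ( atomic_load(&u->rendezvous_out_cnt) != 0 )
    {
        if ( max_spins-- == 0 )
            return false;           /* gave up: the counter never reached zero */
    }
    return true;
}

/* Same idea as the one-line fix: drop the stale count when shutting down. */
static void reset_on_shutdown(struct unit *u)
{
    atomic_store(&u->rendezvous_out_cnt, 0);
}

/* Small demonstration; returns 0 if the sketch behaves as described. */
static int demo(void)
{
    struct unit next;

    atomic_init(&next.rendezvous_out_cnt, 2);    /* stale count, nobody will decrement it */
    if ( wait_rendezvous_out(&next, 100000) )
        return 1;                                /* unexpected: the wait should give up */

    reset_on_shutdown(&next);
    assert(atomic_load(&next.rendezvous_out_cnt) == 0);
    return wait_rendezvous_out(&next, 100000) ? 0 : 1;
}
```

Calling demo() from a main() should return 0: the first wait gives up
because nothing ever decrements the counter, and the second returns
immediately once the counter has been cleared.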
Juergen
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel