** Changed in: linux (Ubuntu Bionic)
       Status: Fix Committed => Fix Released
** Changed in: ubuntu-power-systems
       Status: Triaged => Incomplete

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1757402

Title:
  Ubuntu18.04:pKVM - Host in hung state and out of network after few
  hours of stress run on all guests

Status in The Ubuntu-power-systems project:
  Incomplete
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Artful:
  Incomplete
Status in linux source package in Bionic:
  Fix Released

Bug description:

== Comment: #0 - INDIRA P. JOGA <indira.pr...@in.ibm.com> - 2018-02-11 12:37:25 ==

Problem Description:
===================
After a few hours of the run, the host is in a hung state, with
"rcu_sched detected stalls on CPUs/tasks" messages on the host IPMI
console, and the host is off the network.

Steps to re-create:
==================
> Installed Ubuntu 18.04 on the boslcp3 host.

root@boslcp3:~# uname -a
Linux boslcp3 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:15:55 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# uname -r
4.13.0-25-generic

> Confirmed SMT is off:

root@boslcp3:~# ppc64_cpu --smt
SMT is off

> Hugepage set up:

echo 8500 > /proc/sys/vm/nr_hugepages

> Defined the guests from the host machine:

 Id    Name                           State
----------------------------------------------------
 2     boslcp3g2                      shut off
 3     boslcp3g3                      shut off
 4     boslcp3g4                      shut off
 6     boslcp3g1                      shut off
 7     boslcp3g5                      shut off

> Started the guests and installed the Ubuntu 18.04 daily build on all of them.

root@boslcp3:~# virsh list --all
 Id    Name                           State
----------------------------------------------------
 2     boslcp3g2                      running
 3     boslcp3g3                      running
 4     boslcp3g4                      running
 6     boslcp3g1                      running
 7     boslcp3g5                      running

> Started the regression run (IO_BASE_TCP_NFS) tests on all 5 guests.
  NOTE: Removed the madvise test case from the BASE focus areas.

> The run went fine for a few hours on all guests.

> After a few hours of the run, the host is in a hung state and the host
  console dumps CPU stall messages as below:

[SOL Session operational.  Use ~? for help]
[250867.133429] INFO: rcu_sched detected stalls on CPUs/tasks:
[250867.133499] (detected by 86, t=62711832 jiffies, g=497, c=496, q=31987857)
[250867.133554] All QSes seen, last rcu_sched kthread activity 62711828 (4357609080-4294897252), jiffies_till_next_fqs=1, root ->qsmask 0x0
[250867.133690] rcu_sched kthread starved for 62711828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100
[250931.133433] INFO: rcu_sched detected stalls on CPUs/tasks:
[250931.133494] (detected by 3, t=62727832 jiffies, g=497, c=496, q=31995625)
[250931.133572] All QSes seen, last rcu_sched kthread activity 62727828 (4357625080-4294897252), jiffies_till_next_fqs=1, root ->qsmask 0x0
[250931.133741] rcu_sched kthread starved for 62727828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100
[250995.133432] INFO: rcu_sched detected stalls on CPUs/tasks:
[250995.133480] (detected by 54, t=62743832 jiffies, g=497, c=496, q=32004479)
[250995.133526] All QSes seen, last rcu_sched kthread activity 62743828 (4357641080-4294897252), jiffies_till_next_fqs=1, root ->qsmask 0x0
[250995.133645] rcu_sched kthread starved for 62743828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100

> Not able to get a prompt.

> Ping/ssh to boslcp3 also fail:

[ipjoga@kte ~]$ ping boslcp3
PING boslcp3.isst.aus.stglabs.ibm.com (10.33.0.157) 56(84) bytes of data.
From kte.isst.aus.stglabs.ibm.com (10.33.11.31) icmp_seq=1 Destination Host Unreachable
From kte.isst.aus.stglabs.ibm.com (10.33.11.31) icmp_seq=2 Destination Host Unreachable
From kte.isst.aus.stglabs.ibm.com (10.33.11.31) icmp_seq=3 Destination Host Unreachable

[ipjoga@kte ~]$ ssh root@boslcp3
ssh: connect to host boslcp3 port 22: No route to host

> boslcp3 is not reachable.

> Attached the boslcp3 host console logs.

== Comment: #1 - INDIRA P. JOGA <indira.pr...@in.ibm.com> - 2018-02-11 12:39:29 ==
Added host console logs.

== Comment: #24 - VIPIN K. PARASHAR <vipar...@in.ibm.com> - 2018-02-16 05:46:13 ==
From Linux logs:
===========
[72072.290071] watchdog: BUG: soft lockup - CPU#132 stuck for 22s! [CPU 12/KVM:15579]
[72072.290218] CPU: 132 PID: 15579 Comm: CPU 12/KVM Tainted: G W L 4.13.0-32-generic #35-Ubuntu
[72072.290220] task: c000200debf82e00 task.stack: c000200e140f8000
[72072.290221] NIP: c000000000c779e0 LR: c0080000166893a0 CTR: c000000000c77980
[72072.290223] REGS: c000200e140fb790 TRAP: 0901 Tainted: G W L (4.13.0-32-generic)
[72072.290224] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[72072.290235] CR: 28024224 XER: 00000000
[72072.290236] CFAR: c000000000c779fc SOFTE: 1
[72072.290256] NIP [c000000000c779e0] _raw_spin_lock+0x60/0xe0
[72072.290261] LR [c0080000166893a0] kvmppc_pseries_do_hcall+0x548/0x8d0 [kvm_hv]
[72072.290262] Call Trace:
[72072.290263] [c000200e140fba10] [00000000ffffffff] 0xffffffff (unreliable)
[72072.290270] [c000200e140fba40] [c008000016689334] kvmppc_pseries_do_hcall+0x4dc/0x8d0 [kvm_hv]
[72072.290276] [c000200e140fbaa0] [c00800001668ab0c] kvmppc_vcpu_run_hv+0x1d4/0x470 [kvm_hv]
[72072.290287] [c000200e140fbb10] [c0080000162559cc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[72072.290299] [c000200e140fbb30] [c008000016251a80] kvm_arch_vcpu_ioctl_run+0x108/0x320 [kvm]
[72072.290309] [c000200e140fbbd0] [c008000016245018] kvm_vcpu_ioctl+0x400/0x7c8 [kvm]
[72072.290313] [c000200e140fbd40] [c0000000003c1a64] do_vfs_ioctl+0xd4/0xa00
[72072.290316] [c000200e140fbde0] [c0000000003c2454] SyS_ioctl+0xc4/0x130
[72072.290320] [c000200e140fbe30] [c00000000000b184] system_call+0x58/0x6c
[72072.290321] Instruction dump:
[72072.290323] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 409e001c 38210030 ebe1fff8
[72072.290331] 4e800020 60000000 60000000 60420000 <7c210b78> e92d0000 89290009 71290002
[72084.110070] watchdog: BUG: soft lockup - CPU#80 stuck for 23s! [CPU 11/KVM:15578]
[72084.110223] CPU: 80 PID: 15578 Comm: CPU 11/KVM Tainted: G W L 4.13.0-32-generic #35-Ubuntu
[72084.110225] task: c000200debfa3f00 task.stack: c000200e14080000
[72084.110226] NIP: c000000000c779e4 LR: c0080000166893a0 CTR: c000000000c77980
[72084.110229] REGS: c000200e14083790 TRAP: 0901 Tainted: G W L (4.13.0-32-generic)
[72084.110230] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[72084.110240] CR: 28024824 XER: 00000000
[72084.110241] CFAR: c000000000c779fc SOFTE: 1
[72084.110260] NIP [c000000000c779e4] _raw_spin_lock+0x64/0xe0
[72084.110266] LR [c0080000166893a0] kvmppc_pseries_do_hcall+0x548/0x8d0 [kvm_hv]
[72084.110267] Call Trace:
[72084.110269] [c000200e14083a10] [00000000ffffffff] 0xffffffff (unreliable)
[72084.110275] [c000200e14083a40] [c008000016689334] kvmppc_pseries_do_hcall+0x4dc/0x8d0 [kvm_hv]
[72084.110280] [c000200e14083aa0] [c00800001668ab0c] kvmppc_vcpu_run_hv+0x1d4/0x470 [kvm_hv]
[72084.110292] [c000200e14083b10] [c0080000162559cc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[72084.110303] [c000200e14083b30] [c008000016251a80] kvm_arch_vcpu_ioctl_run+0x108/0x320 [kvm]
[72084.110314] [c000200e14083bd0] [c008000016245018] kvm_vcpu_ioctl+0x400/0x7c8 [kvm]
[72084.110318] [c000200e14083d40] [c0000000003c1a64] do_vfs_ioctl+0xd4/0xa00
[72084.110321] [c000200e14083de0] [c0000000003c2454] SyS_ioctl+0xc4/0x130
[72084.110325] [c000200e14083e30] [c00000000000b184] system_call+0x58/0x6c
[72084.110326] Instruction dump:
[72084.110328] 7d40192d 40c2fff0 7c2004ac 2fa90000 409e001c 38210030 ebe1fff8 4e800020
[72084.110335] 60000000 60000000 60420000 7c210b78 <e92d0000> 89290009 71290002 40820050

From the kernel logs, I see that most of the CPUs reporting soft lockups are busy executing kvmppc_pseries_do_hcall() and are waiting to acquire a spin lock. From xmon, I also examined all CPUs, and most of them were busy executing either kvmppc_pseries_do_hcall() or kvmppc_run_core(). I tried triggering a crash dump via sysrq keys, but it did not complete. A soft lockup was logged again upon reboot after the crash, and the OS never made it back to the login prompt. Leaving the machine in the same state.

== Comment: #38 - Paul Mackerras <p...@au1.ibm.com> - 2018-03-01 00:11:04 ==
I looked at this system today. It was in a state where characters typed on the console would echo, but it did not otherwise respond. Fortunately, ^O x was able to get me into xmon.

CPU 0x54 is in _raw_spin_lock called from kvmppc_pseries_do_hcall, spinning on a lock held by CPU 0x60. CPU 0x60 is also in _raw_spin_lock called from kvmppc_pseries_do_hcall, spinning on the same lock. In other words, CPU 0x60 is trying to acquire a lock it already holds.

I will need to disassemble the kvmppc_pseries_do_hcall function to work out where it is in the source code and try to work out how it got to be trying to acquire a lock it already holds. I will need the exact kernel binary image for this. I will try to find it, but if anyone could send it to me or point me to it, that would be helpful. I have left the system in xmon for now.

== Comment: #40 - Paul Mackerras <p...@au1.ibm.com> - 2018-03-02 05:39:03 ==
CPU 0x60 was stuck in the spin_lock call inside kvm_arch_vcpu_yield_to(). Somehow it had previously taken the vcore lock and failed to release it. I could add some debugging code to try to find out where it was last taken. I don't need the machine kept in the failed state any more.

Did you have the indep_threads_mode parameter set to N on this machine when the problem occurred?
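[Illustration of the symptom described in comments #38 and #40: a Linux spinlock is not recursive, so a CPU that takes it and then re-enters a path that takes it again spins forever, which is what the CPUs stuck in _raw_spin_lock above look like. The following is a minimal user-space sketch of that pattern; the toy lock and the names (toy_spinlock, vcore_lock) are illustrative assumptions, not the kernel implementation. The second toy_spin_lock() call deliberately never returns.]

/* Sketch of a non-recursive test-and-set spinlock being re-acquired
 * by the thread that already holds it. */
#include <stdatomic.h>
#include <stdio.h>

struct toy_spinlock { atomic_flag locked; };

static void toy_spin_lock(struct toy_spinlock *lock)
{
        /* Busy-wait until the flag is clear, then claim it. */
        while (atomic_flag_test_and_set_explicit(&lock->locked,
                                                 memory_order_acquire))
                ;       /* spin, much like _raw_spin_lock */
}

static void toy_spin_unlock(struct toy_spinlock *lock)
{
        atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}

int main(void)
{
        struct toy_spinlock vcore_lock = { ATOMIC_FLAG_INIT };

        toy_spin_lock(&vcore_lock);
        toy_spin_unlock(&vcore_lock);   /* normal take/release works fine */

        toy_spin_lock(&vcore_lock);     /* taken and never released... */
        printf("lock held; re-entering the locked path...\n");
        toy_spin_lock(&vcore_lock);     /* ...so this spins forever, with the
                                           lock still held by the same CPU */
        printf("never reached\n");
        return 0;
}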
== Comment: #45 - INDIRA P. JOGA <indira.pr...@in.ibm.com> - 2018-03-06 01:33:26 ==
From the host machine, while the tests are running on the guests:

root@boslcp3:~# cat /sys/module/kvm_hv/parameters/indep_threads_mode
Y
root@boslcp3:

== Comment: #47 - Paul Mackerras <p...@au1.ibm.com> - 2018-03-07 01:10:42 ==
This patch fixes a bug in the assembly code in the KVM guest exit path which could cause the trap number in r12 to get overwritten. The effect of this would be that if the exit is due to a doorbell interrupt from another CPU, we would effectively discard the doorbell interrupt and not handle it after returning to host context. I think this is why we are seeing CPUs locked up in smp_call_function_many() - their IPI to a CPU got ignored if the CPU was in a KVM guest, hence that other CPU never ran the requested function, hence the CPU doing smp_call_function_many() would wait forever. In many cases that CPU was holding spinlocks, and then other CPUs would wait forever in raw_spin_lock().

== Comment: #116 - INDIRA P. JOGA <indira.pr...@in.ibm.com> - 2018-03-21 04:42:09 ==
(In reply to comment #115)
> (In reply to comment #113)
> > After 2.2 upgrade, run completed 62 hours with 2 guests running. It looks
> > the system is stable now.
>
> Hi Indira/Chethan,
>
> Can you please let know test run final status ?

Hi Vipin,

After the 2.2 upgrade, 2 guests completed 72 hours of run. The host system is doing fine along with the guests, without any crash or hang.

Regards,
Indira

== Comment: #119 - VIPIN K. PARASHAR <vipar...@in.ibm.com> - 2018-03-21 05:33:19 ==
(In reply to comment #47)
> Created attachment 125084 [details]
> Patch to return trap number correctly from __kvmppc_vcore_entry
>
> This patch fixes a bug in the assembly code in the KVM guest exit path which
> could cause the trap number in r12 to get overwritten. The effect of this
> would be that if the exit is due to a doorbell interrupt from another CPU,
> we would effectively discard the doorbell interrupt and not handle it after
> returning to host context. I think this is why we are seeing CPUs locked up
> in smp_call_function_many() - their IPI to a CPU got ignored if the CPU was
> in a KVM guest, hence that other CPU never ran the requested function, hence
> the CPU doing smp_call_function_many() would wait forever. In many cases
> that CPU was holding spinlocks and then other CPUs would wait forever in
> raw_spin_lock().

Commit a8b48a4dcc contains the fix in the upstream kernel. It reads:

    KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entry

break-fixes: fd7bacbca47a86a6f418440d8a5d7b7edbb2f8f9 a8b48a4dccea77e29462e59f1dbf0d5aa1ff167c

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1757402/+subscriptions
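[Illustration of the failure chain described in comment #47: a cross-CPU function call posts a request, kicks the target CPU with an interrupt, and then waits for an acknowledgement; if the kick is lost (as the overwritten trap number in r12 caused doorbell interrupts to be), the caller waits forever while still holding its locks. The following is a small user-space model; the names (target_cpu, ipi_pending, csd_done, doorbell_lost) are illustrative assumptions, and this is not the kernel's smp_call_function_many() implementation. With doorbell_lost set to true the program spins forever; flip it to false to see the good case complete.]

/* Toy model of a cross-CPU call whose "IPI" is dropped. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool ipi_pending;   /* the doorbell/IPI, modelled as a flag */
static atomic_bool csd_done;      /* completion flag set by the target    */

static void *target_cpu(void *arg)
{
        (void)arg;
        for (;;) {
                /* The target only runs the requested function when it sees
                 * the interrupt.  A discarded doorbell means this never
                 * becomes true, so csd_done is never set. */
                if (atomic_load(&ipi_pending)) {
                        atomic_store(&ipi_pending, false);
                        /* ... run the requested function here ... */
                        atomic_store(&csd_done, true);
                }
                usleep(1000);
        }
        return NULL;
}

int main(void)
{
        pthread_t t;
        bool doorbell_lost = true;      /* model the bug: the kick is dropped */

        pthread_create(&t, NULL, target_cpu, NULL);

        /* Caller side: post the request, send the "IPI", then spin until
         * the target acknowledges. */
        if (!doorbell_lost)
                atomic_store(&ipi_pending, true);

        while (!atomic_load(&csd_done))
                ;       /* with the IPI lost this spins forever; in the real
                           bug the caller held spinlocks here, so every other
                           CPU piled up in _raw_spin_lock as seen above */

        printf("remote function completed\n");
        return 0;
}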