** Changed in: linux (Ubuntu Bionic)
       Status: Fix Committed => Fix Released
** Changed in: ubuntu-power-systems
       Status: Triaged => Incomplete

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1757402

Title:
  Ubuntu18.04:pKVM - Host in hung state and out of network after few
  hours of stress run on all guests

Status in The Ubuntu-power-systems project:
  Incomplete
Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Artful:
  Incomplete
Status in linux source package in Bionic:
  Fix Released

Bug description:

== Comment: #0 - INDIRA P. JOGA <indira.pr...@in.ibm.com> - 2018-02-11 12:37:25 ==

Problem Description:
===================
After a few hours of the run, the host is in a hung state, with
"rcu_sched detected stalls on CPUs/tasks" messages on the host IPMI
console, and the host is off the network.

Steps to re-create:
==================
> Installed Ubuntu 18.04 on the boslcp3 host.

root@boslcp3:~# uname -a
Linux boslcp3 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:15:55 UTC 2018 ppc64le ppc64le ppc64le GNU/Linux
root@boslcp3:~# uname -r
4.13.0-25-generic

> Confirmed SMT is off:

root@boslcp3:~# ppc64_cpu --smt
SMT is off

> Hugepage set up:

echo 8500 > /proc/sys/vm/nr_hugepages

> Defined the guests from the host machine:

 Id    Name                           State
----------------------------------------------------
 2     boslcp3g2                      shut off
 3     boslcp3g3                      shut off
 4     boslcp3g4                      shut off
 6     boslcp3g1                      shut off
 7     boslcp3g5                      shut off

> Started the guests and installed the Ubuntu 18.04 daily build on all of them.

root@boslcp3:~# virsh list --all
 Id    Name                           State
----------------------------------------------------
 2     boslcp3g2                      running
 3     boslcp3g3                      running
 4     boslcp3g4                      running
 6     boslcp3g1                      running
 7     boslcp3g5                      running

> Started the regression run (IO_BASE_TCP_NFS) tests on all 5 guests.
  NOTE: Removed the madvise test case from the BASE focus areas.

> The run went fine for a few hours on all guests.

> After a few hours of the run, the host is in a hung state and the host
  console dumps CPU stall messages as below:

[SOL Session operational.  Use ~? for help]
[250867.133429] INFO: rcu_sched detected stalls on CPUs/tasks:
[250867.133499] (detected by 86, t=62711832 jiffies, g=497, c=496, q=31987857)
[250867.133554] All QSes seen, last rcu_sched kthread activity 62711828 (4357609080-4294897252), jiffies_till_next_fqs=1, root ->qsmask 0x0
[250867.133690] rcu_sched kthread starved for 62711828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100
[250931.133433] INFO: rcu_sched detected stalls on CPUs/tasks:
[250931.133494] (detected by 3, t=62727832 jiffies, g=497, c=496, q=31995625)
[250931.133572] All QSes seen, last rcu_sched kthread activity 62727828 (4357625080-4294897252), jiffies_till_next_fqs=1, root ->qsmask 0x0
[250931.133741] rcu_sched kthread starved for 62727828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100
[250995.133432] INFO: rcu_sched detected stalls on CPUs/tasks:
[250995.133480] (detected by 54, t=62743832 jiffies, g=497, c=496, q=32004479)
[250995.133526] All QSes seen, last rcu_sched kthread activity 62743828 (4357641080-4294897252), jiffies_till_next_fqs=1, root ->qsmask 0x0
[250995.133645] rcu_sched kthread starved for 62743828 jiffies! g497 c496 f0x2 RCU_GP_WAIT_FQS(3) ->state=0x100

> Not able to get a prompt.

> Ping/ssh to boslcp3 also fail:

[ipjoga@kte ~]$ ping boslcp3
PING boslcp3.isst.aus.stglabs.ibm.com (10.33.0.157) 56(84) bytes of data.
From kte.isst.aus.stglabs.ibm.com (10.33.11.31) icmp_seq=1 Destination Host Unreachable
From kte.isst.aus.stglabs.ibm.com (10.33.11.31) icmp_seq=2 Destination Host Unreachable
From kte.isst.aus.stglabs.ibm.com (10.33.11.31) icmp_seq=3 Destination Host Unreachable

[ipjoga@kte ~]$ ssh root@boslcp3
ssh: connect to host boslcp3 port 22: No route to host

> boslcp3 is not reachable.

> Attached the boslcp3 host console logs.

== Comment: #1 - INDIRA P. JOGA <indira.pr...@in.ibm.com> - 2018-02-11 12:39:29 ==
Added host console logs.

== Comment: #24 - VIPIN K. PARASHAR <vipar...@in.ibm.com> - 2018-02-16 05:46:13 ==
From Linux logs:
===========
[72072.290071] watchdog: BUG: soft lockup - CPU#132 stuck for 22s! [CPU 12/KVM:15579]
[72072.290218] CPU: 132 PID: 15579 Comm: CPU 12/KVM Tainted: G W L 4.13.0-32-generic #35-Ubuntu
[72072.290220] task: c000200debf82e00 task.stack: c000200e140f8000
[72072.290221] NIP: c000000000c779e0 LR: c0080000166893a0 CTR: c000000000c77980
[72072.290223] REGS: c000200e140fb790 TRAP: 0901 Tainted: G W L (4.13.0-32-generic)
[72072.290224] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[72072.290235] CR: 28024224 XER: 00000000
[72072.290236] CFAR: c000000000c779fc SOFTE: 1
[72072.290256] NIP [c000000000c779e0] _raw_spin_lock+0x60/0xe0
[72072.290261] LR [c0080000166893a0] kvmppc_pseries_do_hcall+0x548/0x8d0 [kvm_hv]
[72072.290262] Call Trace:
[72072.290263] [c000200e140fba10] [00000000ffffffff] 0xffffffff (unreliable)
[72072.290270] [c000200e140fba40] [c008000016689334] kvmppc_pseries_do_hcall+0x4dc/0x8d0 [kvm_hv]
[72072.290276] [c000200e140fbaa0] [c00800001668ab0c] kvmppc_vcpu_run_hv+0x1d4/0x470 [kvm_hv]
[72072.290287] [c000200e140fbb10] [c0080000162559cc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[72072.290299] [c000200e140fbb30] [c008000016251a80] kvm_arch_vcpu_ioctl_run+0x108/0x320 [kvm]
[72072.290309] [c000200e140fbbd0] [c008000016245018] kvm_vcpu_ioctl+0x400/0x7c8 [kvm]
[72072.290313] [c000200e140fbd40] [c0000000003c1a64] do_vfs_ioctl+0xd4/0xa00
[72072.290316] [c000200e140fbde0] [c0000000003c2454] SyS_ioctl+0xc4/0x130
[72072.290320] [c000200e140fbe30] [c00000000000b184] system_call+0x58/0x6c
[72072.290321] Instruction dump:
[72072.290323] 40c20010 7d40192d 40c2fff0 7c2004ac 2fa90000 409e001c 38210030 ebe1fff8
[72072.290331] 4e800020 60000000 60000000 60420000 <7c210b78> e92d0000 89290009 71290002
[72084.110070] watchdog: BUG: soft lockup - CPU#80 stuck for 23s! [CPU 11/KVM:15578]
[72084.110223] CPU: 80 PID: 15578 Comm: CPU 11/KVM Tainted: G W L 4.13.0-32-generic #35-Ubuntu
[72084.110225] task: c000200debfa3f00 task.stack: c000200e14080000
[72084.110226] NIP: c000000000c779e4 LR: c0080000166893a0 CTR: c000000000c77980
[72084.110229] REGS: c000200e14083790 TRAP: 0901 Tainted: G W L (4.13.0-32-generic)
[72084.110230] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>
[72084.110240] CR: 28024824 XER: 00000000
[72084.110241] CFAR: c000000000c779fc SOFTE: 1
[72084.110260] NIP [c000000000c779e4] _raw_spin_lock+0x64/0xe0
[72084.110266] LR [c0080000166893a0] kvmppc_pseries_do_hcall+0x548/0x8d0 [kvm_hv]
[72084.110267] Call Trace:
[72084.110269] [c000200e14083a10] [00000000ffffffff] 0xffffffff (unreliable)
[72084.110275] [c000200e14083a40] [c008000016689334] kvmppc_pseries_do_hcall+0x4dc/0x8d0 [kvm_hv]
[72084.110280] [c000200e14083aa0] [c00800001668ab0c] kvmppc_vcpu_run_hv+0x1d4/0x470 [kvm_hv]
[72084.110292] [c000200e14083b10] [c0080000162559cc] kvmppc_vcpu_run+0x34/0x48 [kvm]
[72084.110303] [c000200e14083b30] [c008000016251a80] kvm_arch_vcpu_ioctl_run+0x108/0x320 [kvm]
[72084.110314] [c000200e14083bd0] [c008000016245018] kvm_vcpu_ioctl+0x400/0x7c8 [kvm]
[72084.110318] [c000200e14083d40] [c0000000003c1a64] do_vfs_ioctl+0xd4/0xa00
[72084.110321] [c000200e14083de0] [c0000000003c2454] SyS_ioctl+0xc4/0x130
[72084.110325] [c000200e14083e30] [c00000000000b184] system_call+0x58/0x6c
[72084.110326] Instruction dump:
[72084.110328] 7d40192d 40c2fff0 7c2004ac 2fa90000 409e001c 38210030 ebe1fff8 4e800020
[72084.110335] 60000000 60000000 60420000 7c210b78 <e92d0000> 89290009 71290002 40820050

From the kernel logs, I see that most of the CPUs reporting soft lockups are busy executing kvmppc_pseries_do_hcall() and are waiting to acquire a spin lock. From xmon, I also examined all CPUs, and most of them were busy executing either kvmppc_pseries_do_hcall() or kvmppc_run_core(). I tried triggering a crash dump via sysrq keys, but it did not complete. A soft lockup was logged again upon reboot after the crash, and the OS never made it back to the login prompt. Leaving the machine in the same state.

== Comment: #38 - Paul Mackerras <p...@au1.ibm.com> - 2018-03-01 00:11:04 ==
I looked at this system today. It was in a state where characters typed on the console would echo, but it did not otherwise respond. Fortunately, ^O x was able to get me into xmon.

CPU 0x54 is in _raw_spin_lock called from kvmppc_pseries_do_hcall, spinning on a lock held by CPU 0x60. CPU 0x60 is also in _raw_spin_lock called from kvmppc_pseries_do_hcall, spinning on the same lock. In other words, CPU 0x60 is trying to acquire a lock it already holds.

I will need to disassemble the kvmppc_pseries_do_hcall function to work out where it is in the source code and try to work out how it got to be trying to acquire a lock it already holds. I will need the exact kernel binary image for this. I will try to find it, but if anyone could send it to me or point me to it, that would be helpful. I have left the system in xmon for now.

== Comment: #40 - Paul Mackerras <p...@au1.ibm.com> - 2018-03-02 05:39:03 ==
CPU 0x60 was stuck in the spin_lock call inside kvm_arch_vcpu_yield_to(). Somehow it had previously taken the vcore lock and failed to release it. I could add some debugging code to try to find out where it was last taken. I don't need the machine kept in the failed state any more.

Did you have the indep_threads_mode parameter set to N on this machine when the problem occurred?
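[Illustration of the symptom described in comments #38 and #40: a Linux spinlock is not recursive, so a CPU that takes it and then re-enters a path that takes it again spins forever, which is what the CPUs stuck in _raw_spin_lock above look like. The following is a minimal user-space sketch of that pattern; the toy lock and the names (toy_spinlock, vcore_lock) are illustrative assumptions, not the kernel implementation. The second toy_spin_lock() call deliberately never returns.]

/* Sketch of a non-recursive test-and-set spinlock being re-acquired
 * by the thread that already holds it. */
#include <stdatomic.h>
#include <stdio.h>

struct toy_spinlock { atomic_flag locked; };

static void toy_spin_lock(struct toy_spinlock *lock)
{
        /* Busy-wait until the flag is clear, then claim it. */
        while (atomic_flag_test_and_set_explicit(&lock->locked,
                                                 memory_order_acquire))
                ;       /* spin, much like _raw_spin_lock */
}

static void toy_spin_unlock(struct toy_spinlock *lock)
{
        atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}

int main(void)
{
        struct toy_spinlock vcore_lock = { ATOMIC_FLAG_INIT };

        toy_spin_lock(&vcore_lock);
        toy_spin_unlock(&vcore_lock);   /* normal take/release works fine */

        toy_spin_lock(&vcore_lock);     /* taken and never released... */
        printf("lock held; re-entering the locked path...\n");
        toy_spin_lock(&vcore_lock);     /* ...so this spins forever, with the
                                           lock still held by the same CPU */
        printf("never reached\n");
        return 0;
}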
== Comment: #45 - INDIRA P. JOGA <indira.pr...@in.ibm.com> - 2018-03-06 01:33:26 ==
From the host machine, while the tests are running on the guests:

root@boslcp3:~# cat /sys/module/kvm_hv/parameters/indep_threads_mode
Y
root@boslcp3:

== Comment: #47 - Paul Mackerras <p...@au1.ibm.com> - 2018-03-07 01:10:42 ==
This patch fixes a bug in the assembly code in the KVM guest exit path which could cause the trap number in r12 to get overwritten. The effect of this would be that if the exit is due to a doorbell interrupt from another CPU, we would effectively discard the doorbell interrupt and not handle it after returning to host context. I think this is why we are seeing CPUs locked up in smp_call_function_many() - their IPI to a CPU got ignored if the CPU was in a KVM guest, hence that other CPU never ran the requested function, hence the CPU doing smp_call_function_many() would wait forever. In many cases that CPU was holding spinlocks, and then other CPUs would wait forever in raw_spin_lock().

== Comment: #116 - INDIRA P. JOGA <indira.pr...@in.ibm.com> - 2018-03-21 04:42:09 ==
(In reply to comment #115)
> (In reply to comment #113)
> > After 2.2 upgrade, run completed 62 hours with 2 guests running. It looks
> > the system is stable now.
>
> Hi Indira/Chethan,
>
> Can you please let know test run final status ?

Hi Vipin,

After the 2.2 upgrade, 2 guests completed 72 hours of run. The host system is doing fine along with the guests, without any crash or hang.

Regards,
Indira

== Comment: #119 - VIPIN K. PARASHAR <vipar...@in.ibm.com> - 2018-03-21 05:33:19 ==
(In reply to comment #47)
> Created attachment 125084 [details]
> Patch to return trap number correctly from __kvmppc_vcore_entry
>
> This patch fixes a bug in the assembly code in the KVM guest exit path which
> could cause the trap number in r12 to get overwritten. The effect of this
> would be that if the exit is due to a doorbell interrupt from another CPU,
> we would effectively discard the doorbell interrupt and not handle it after
> returning to host context. I think this is why we are seeing CPUs locked up
> in smp_call_function_many() - their IPI to a CPU got ignored if the CPU was
> in a KVM guest, hence that other CPU never ran the requested function, hence
> the CPU doing smp_call_function_many() would wait forever. In many cases
> that CPU was holding spinlocks and then other CPUs would wait forever in
> raw_spin_lock().

Commit a8b48a4dcc contains the fix in the upstream kernel. It reads:

    KVM: PPC: Book3S HV: Fix trap number return from __kvmppc_vcore_entry

break-fixes: fd7bacbca47a86a6f418440d8a5d7b7edbb2f8f9 a8b48a4dccea77e29462e59f1dbf0d5aa1ff167c

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1757402/+subscriptions
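[Illustration of the failure chain described in comment #47: a cross-CPU function call posts a request, kicks the target CPU with an interrupt, and then waits for an acknowledgement; if the kick is lost (as the overwritten trap number in r12 caused doorbell interrupts to be), the caller waits forever while still holding its locks. The following is a small user-space model; the names (target_cpu, ipi_pending, csd_done, doorbell_lost) are illustrative assumptions, and this is not the kernel's smp_call_function_many() implementation. With doorbell_lost set to true the program spins forever; flip it to false to see the good case complete.]

/* Toy model of a cross-CPU call whose "IPI" is dropped. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool ipi_pending;   /* the doorbell/IPI, modelled as a flag */
static atomic_bool csd_done;      /* completion flag set by the target    */

static void *target_cpu(void *arg)
{
        (void)arg;
        for (;;) {
                /* The target only runs the requested function when it sees
                 * the interrupt.  A discarded doorbell means this never
                 * becomes true, so csd_done is never set. */
                if (atomic_load(&ipi_pending)) {
                        atomic_store(&ipi_pending, false);
                        /* ... run the requested function here ... */
                        atomic_store(&csd_done, true);
                }
                usleep(1000);
        }
        return NULL;
}

int main(void)
{
        pthread_t t;
        bool doorbell_lost = true;      /* model the bug: the kick is dropped */

        pthread_create(&t, NULL, target_cpu, NULL);

        /* Caller side: post the request, send the "IPI", then spin until
         * the target acknowledges. */
        if (!doorbell_lost)
                atomic_store(&ipi_pending, true);

        while (!atomic_load(&csd_done))
                ;       /* with the IPI lost this spins forever; in the real
                           bug the caller held spinlocks here, so every other
                           CPU piled up in _raw_spin_lock as seen above */

        printf("remote function completed\n");
        return 0;
}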