Public bug reported:

Issue:
--------------
An NMI watchdog BUG (soft lockup) occurs when the htx memory test is run on Ubuntu 16.10.

Environment:
--------------------------
Arch     : ppc64le
Platform : Ubuntu KVM Guest
Host     : ubuntu 16.10 [4.8.0-17 - kernel]
Guest    : ubuntu 16.10 [4.8.0-17 - kernel]

Steps To Reproduce:
-----------------------------------
1 - Install an Ubuntu KVM guest and install the htx package in the guest, obtained from the link
    http://ausgsa.ibm.com/projects/h/htx/public_html/htxonly/htxubuntu-413.deb
2 - Run the htx mdt.mem test.
3 - The system hits the soft lockup issue shown below.

dmesg o/p:

[60287.590335] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 1141s! [hxemem64:23468]
[60287.590572] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[60287.590585] CPU: 3 PID: 23468 Comm: hxemem64 Tainted: G L 4.8.0-17-generic #19-Ubuntu
[60287.590587] task: c0000012a0971e00 task.stack: c0000012a2d40000
[60287.590589] NIP: c000000000015004 LR: c000000000015004 CTR: c000000000165e90
[60287.590591] REGS: c0000012a2d439a0 TRAP: 0901 Tainted: G L (4.8.0-17-generic)
[60287.590592] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 48004244 XER: 00000000
[60287.590603] CFAR: c000000000165890 SOFTE: 1
GPR00: c000000000165f9c c0000012a2d43c20 c0000000014e5e00 0000000000000900
GPR04: 0000000000000000 0000000000000008 0000000100e4d61a 0000000000000000
GPR08: 0000000000000000 0000000000000006 0000000100e4d619 c0000012bfee3130
GPR12: 00003fffae6cdc70 00003fffae436900
[60287.590627] NIP [c000000000015004] arch_local_irq_restore+0x74/0x90
[60287.590630] LR [c000000000015004] arch_local_irq_restore+0x74/0x90
[60287.590631] Call Trace:
[60287.590634] [c0000012a2d43c20] [c0000012bfeccd80] 0xc0000012bfeccd80 (unreliable)
[60287.590639] [c0000012a2d43c40] [c000000000165f9c] run_timer_softirq+0x10c/0x230
[60287.590644] [c0000012a2d43ce0] [c000000000b94adc] __do_softirq+0x18c/0x3fc
[60287.590648] [c0000012a2d43de0] [c0000000000d5828] irq_exit+0xc8/0x100
[60287.590653] [c0000012a2d43e00] [c000000000024810] timer_interrupt+0xa0/0xe0
[60287.590657] [c0000012a2d43e30] [c000000000002814] decrementer_common+0x114/0x180
[60287.590659] Instruction dump:
[60287.590662] 994d023a 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[60287.590670] 7c0803a6 4e800020 60420000 4bfed259 <60000000> 4bffffe4 60420000 e92d0020
[63127.581494] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 339s! [hxemem64:23467]
[63127.629682] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[63127.629699] CPU: 2 PID: 23467 Comm: hxemem64 Tainted: G L 4.8.0-17-generic #19-Ubuntu
[63127.629701] task: c0000012a0965800 task.stack: c0000012a2d58000
[63127.629703] NIP: 0000000010011e60 LR: 000000001000ec6c CTR: 0000000000f33196
[63127.629706] REGS: c0000012a2d5bea0 TRAP: 0901 Tainted: G L (4.8.0-17-generic)
[63127.629707] MSR: 800000010000d033 <SF,EE,PR,ME,IR,DR,RI,LE,TM[E]> CR: 42004482 XER: 00000000
[63127.629719] CFAR: 0000000010011e68 SOFTE: 1
GPR00: 000000001000e854 00003fffadc2e540 0000000010047f00 000000000000000d
GPR04: 0000000002000000 00003ff5a8000000 5a5a5a5a5a5a5a5a 00003ff5b0667348
GPR08: 0000000000000000 000000001006c8e0 000000001006ca04 fffffffffffff001
GPR12: 00003fffae6cdc70 00003fffadc36900
[63127.629740] NIP [0000000010011e60] 0x10011e60
[63127.629742] LR [000000001000ec6c] 0x1000ec6c
[63127.629743] Call Trace:

== Comment: #3 - Santhosh G <santh...@in.ibm.com> - 2016-09-28 02:17:29 ==
Memory Info :

root@ubuntu:~# cat /proc/meminfo
MemTotal:       78539776 kB
MemFree:        72219392 kB
MemAvailable:   77217088 kB
Buffers:          212544 kB
Cached:          5249088 kB
SwapCached:            0 kB
Active:          1440832 kB
Inactive:        4107264 kB
Active(anon):      93888 kB
Inactive(anon):     8640 kB
Active(file):    1346944 kB
Inactive(file):  4098624 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       3443648 kB
SwapFree:        3443648 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         87296 kB
Mapped:            30400 kB
Shmem:             16128 kB
Slab:             381440 kB
SReclaimable:     295872 kB
SUnreclaim:        85568 kB
KernelStack:        2176 kB
PageTables:         2048 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    42639808 kB
Committed_AS:     224768 kB
VmallocTotal:   8589934592 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       9
HugePages_Free:        9
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

free -h :
              total        used        free      shared  buff/cache   available
Mem:            74G        545M         68G         15M        5.5G         73G
Swap:          3.3G          0B        3.3G

== Comment: #5 - Santhosh G <santh...@in.ibm.com> - 2016-09-29 02:49:49 ==
(In reply to comment #4)
> Hi Santhosh,
>
> After how long are you seeing this error ?
> Can you share the output by:
> 1) start the mdt.mem tests.
> 2) While the tests are running what is the output of 'free -h' ?
> 3) Attach /tmp/htxerr
>
> Thank you.

Hi Vaishnavi,

I have run the test for more than 12 hours and am not sure exactly when the lockup occurs.

Before starting the tests, free -h :
              total        used        free      shared  buff/cache   available
Mem:            74G        528M         68G         15M        5.5G         73G
Swap:          3.3G          0B        3.3G

After running the tests for more than 10 min :
              total        used        free      shared  buff/cache   available
Mem:            74G        570M         20G         48G         53G         25G
Swap:          3.3G          0B        3.3G

The memory usage gradually increases. I am not sure exactly at which point the lockup occurs.

And /tmp/htxerror is empty.

== Comment: #7 - Vaishnavi Bhat <vaish...@in.ibm.com> - 2016-09-30 04:03:23 ==
Hi Santhosh,

While running mdt.mem, we see that about 60% of memory is used and free swap is reduced to 0B.

              total        used        free      shared  buff/cache   available
Mem:            74G        570M         20G         48G         53G         25G
Swap:          3.3G          0B        3.3G

Top output:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1860 root      38  18 48.484g 0.046t 0.046t S 318.1 63.5   4865:53 hxemem64

Also, dmesg shows traces of OOM and soft lockup with hxemem.

Can you please try increasing the vm.min_free_kbytes value and see if it shows any improvement? I would suggest starting with double the current value.

Current value :
$ sysctl -n vm.min_free_kbytes
180224

New value:
$ sysctl -w vm.min_free_kbytes=<new value>

Thank you.

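The doubling step suggested in comment #7 can be sketched in Python as below. This is only an illustration: the helper names doubled_min_free_kbytes and sysctl_command are hypothetical and not part of htx or of this bug report, and the script prints the sysctl command rather than applying it. It assumes the tunable is readable at the usual procfs path and falls back to the value quoted in comment #7 otherwise.

```python
from pathlib import Path

# Usual procfs location of the tunable on Linux (assumed readable here).
MIN_FREE_PATH = Path("/proc/sys/vm/min_free_kbytes")


def doubled_min_free_kbytes(current_kb: int) -> int:
    """Return twice the current vm.min_free_kbytes value (hypothetical helper)."""
    if current_kb <= 0:
        raise ValueError("vm.min_free_kbytes must be positive")
    return current_kb * 2


def sysctl_command(new_kb: int) -> str:
    """Build the sysctl invocation from comment #7 with a concrete value."""
    return f"sysctl -w vm.min_free_kbytes={new_kb}"


if __name__ == "__main__":
    # Fall back to the value reported in comment #7 if procfs is unavailable.
    try:
        current = int(MIN_FREE_PATH.read_text().strip())
    except (OSError, ValueError):
        current = 180224
    # Print the command only; applying it needs root and is left to the tester.
    print(sysctl_command(doubled_min_free_kbytes(current)))
```

With the current value of 180224 from comment #7, doubling gives 360448.
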
== Comment: #10 - Vaishnavi Bhat <vaish...@in.ibm.com> - 2016-10-20 04:06:20 ==
(In reply to comment #9)
> Hi Vaishnavi,
>
> I am able to reproduce this issue even in 4.8.0-22-generic
>
> o/p:
> sysctl -n vm.min_free_kbytes
> 360448
>
> Please, take a look in to the issue.
>
> Thanks.

Thanks for the confirmation, the issue is being reproduced with
sysctl -n vm.min_free_kbytes
360448

Thank you.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New

** Tags: architecture-ppc64le bugnameltc-146905 severity-high targetmilestone-inin---

** Tags added: architecture-ppc64le bugnameltc-146905 severity-high targetmilestone-inin---

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1649513

Title:
  [Ubuntu 16.10] NMI watchdog and soft lockup while running htx memory tests in kernel 4.8.0-17-generic

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1649513/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs