Public bug reported:

Issue:
--------------
An NMI watchdog "BUG: soft lockup" occurs when the HTX memory test is run on
Ubuntu 16.10.

Environment:
--------------------------
Arch : ppc64le
Platform : Ubuntu KVM Guest
Host : Ubuntu 16.10 [kernel 4.8.0-17]
Guest : Ubuntu 16.10 [kernel 4.8.0-17]

Steps To Reproduce:
-----------------------------------

1 - Install an Ubuntu KVM guest, then install the htx package in the guest from
http://ausgsa.ibm.com/projects/h/htx/public_html/htxonly/htxubuntu-413.deb

2 - Run the HTX mdt.mem test.

3 - The system hits a soft lockup, as shown below:

dmesg output:
[60287.590335] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 1141s! [hxemem64:23468]
[60287.590572] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[60287.590585] CPU: 3 PID: 23468 Comm: hxemem64 Tainted: G             L  4.8.0-17-generic #19-Ubuntu
[60287.590587] task: c0000012a0971e00 task.stack: c0000012a2d40000
[60287.590589] NIP: c000000000015004 LR: c000000000015004 CTR: c000000000165e90
[60287.590591] REGS: c0000012a2d439a0 TRAP: 0901   Tainted: G             L   (4.8.0-17-generic)
[60287.590592] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48004244  XER: 00000000
[60287.590603] CFAR: c000000000165890 SOFTE: 1
               GPR00: c000000000165f9c c0000012a2d43c20 c0000000014e5e00 0000000000000900
               GPR04: 0000000000000000 0000000000000008 0000000100e4d61a 0000000000000000
               GPR08: 0000000000000000 0000000000000006 0000000100e4d619 c0000012bfee3130
               GPR12: 00003fffae6cdc70 00003fffae436900
[60287.590627] NIP [c000000000015004] arch_local_irq_restore+0x74/0x90
[60287.590630] LR [c000000000015004] arch_local_irq_restore+0x74/0x90
[60287.590631] Call Trace:
[60287.590634] [c0000012a2d43c20] [c0000012bfeccd80] 0xc0000012bfeccd80 (unreliable)
[60287.590639] [c0000012a2d43c40] [c000000000165f9c] run_timer_softirq+0x10c/0x230
[60287.590644] [c0000012a2d43ce0] [c000000000b94adc] __do_softirq+0x18c/0x3fc
[60287.590648] [c0000012a2d43de0] [c0000000000d5828] irq_exit+0xc8/0x100
[60287.590653] [c0000012a2d43e00] [c000000000024810] timer_interrupt+0xa0/0xe0
[60287.590657] [c0000012a2d43e30] [c000000000002814] decrementer_common+0x114/0x180
[60287.590659] Instruction dump:
[60287.590662] 994d023a 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[60287.590670] 7c0803a6 4e800020 60420000 4bfed259 <60000000> 4bffffe4 60420000 e92d0020
[63127.581494] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 339s! [hxemem64:23467]
[63127.629682] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[63127.629699] CPU: 2 PID: 23467 Comm: hxemem64 Tainted: G             L  4.8.0-17-generic #19-Ubuntu
[63127.629701] task: c0000012a0965800 task.stack: c0000012a2d58000
[63127.629703] NIP: 0000000010011e60 LR: 000000001000ec6c CTR: 0000000000f33196
[63127.629706] REGS: c0000012a2d5bea0 TRAP: 0901   Tainted: G             L   (4.8.0-17-generic)
[63127.629707] MSR: 800000010000d033 <SF,EE,PR,ME,IR,DR,RI,LE,TM[E]>  CR: 42004482  XER: 00000000
[63127.629719] CFAR: 0000000010011e68 SOFTE: 1
               GPR00: 000000001000e854 00003fffadc2e540 0000000010047f00 000000000000000d
               GPR04: 0000000002000000 00003ff5a8000000 5a5a5a5a5a5a5a5a 00003ff5b0667348
               GPR08: 0000000000000000 000000001006c8e0 000000001006ca04 fffffffffffff001
               GPR12: 00003fffae6cdc70 00003fffadc36900
[63127.629740] NIP [0000000010011e60] 0x10011e60
[63127.629742] LR [000000001000ec6c] 0x1000ec6c
[63127.629743] Call Trace:
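
For triage it can help to pull just the soft-lockup banners out of a long
dmesg capture. The following is a minimal sketch, not part of HTX or the
kernel tooling; the function name parse_soft_lockups and the returned field
names are made up for illustration, and the pattern only assumes the banner
format seen in the traces above.

```python
import re

# Pattern for the kernel's soft-lockup banner, for example:
# "[60287.590335] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 1141s! [hxemem64:23468]"
# The \s+ before "[comm:pid]" also tolerates line wrapping added by mail clients.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s+"
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(dmesg_text):
    """Return one dict per soft-lockup banner found in dmesg_text.

    Hypothetical helper for illustration only.
    """
    reports = []
    for match in SOFT_LOCKUP.finditer(dmesg_text):
        reports.append({
            "timestamp": float(match.group("ts")),   # seconds since boot
            "cpu": int(match.group("cpu")),
            "stuck_seconds": int(match.group("secs")),
            "comm": match.group("comm"),
            "pid": int(match.group("pid")),
        })
    return reports
```

Fed the two banners from this report, the sketch yields CPU 3 stuck for 1141 s
(hxemem64, PID 23468) and CPU 2 stuck for 339 s (hxemem64, PID 23467).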

== Comment: #3 - Santhosh G <santh...@in.ibm.com> - 2016-09-28 02:17:29 ==
Memory Info :

root@ubuntu:~# cat /proc/meminfo 
MemTotal:       78539776 kB
MemFree:        72219392 kB
MemAvailable:   77217088 kB
Buffers:          212544 kB
Cached:          5249088 kB
SwapCached:            0 kB
Active:          1440832 kB
Inactive:        4107264 kB
Active(anon):      93888 kB
Inactive(anon):     8640 kB
Active(file):    1346944 kB
Inactive(file):  4098624 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       3443648 kB
SwapFree:        3443648 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:         87296 kB
Mapped:            30400 kB
Shmem:             16128 kB
Slab:             381440 kB
SReclaimable:     295872 kB
SUnreclaim:        85568 kB
KernelStack:        2176 kB
PageTables:         2048 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    42639808 kB
Committed_AS:     224768 kB
VmallocTotal:   8589934592 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       9
HugePages_Free:        9
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

free -h :
              total        used        free      shared  buff/cache   available
Mem:            74G        545M         68G         15M        5.5G         73G
Swap:          3.3G          0B        3.3G
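
The two outputs above can be cross-checked against each other. The sketch
below parses /proc/meminfo text and converts a few fields; the helper names
(parse_meminfo, kib_to_gib, hugepage_pool_kib) are hypothetical, and it
assumes the "kB" values in /proc/meminfo are units of 1024 bytes.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value.

    Values are in kB for sized fields and plain counts for HugePages_* fields.
    Lines that do not look like "Name:   123 kB" are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        try:
            fields[name.strip()] = int(parts[0])
        except ValueError:
            # For example a shell prompt line such as "root@ubuntu:~# cat ..."
            continue
    return fields


def kib_to_gib(kib):
    """Convert a kB (1024-byte) value to GiB."""
    return kib / (1024 * 1024)


def hugepage_pool_kib(fields):
    """Memory set aside for the hugepage pool, in kB."""
    return fields["HugePages_Total"] * fields["Hugepagesize"]
```

With the values above, MemTotal of 78539776 kB is about 74.9 GiB, which is
consistent with the 74G shown by free -h, and the hugepage pool is
9 x 16384 kB = 147456 kB (144 MiB).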

== Comment: #5 - Santhosh G <santh...@in.ibm.com> - 2016-09-29 02:49:49 ==
(In reply to comment #4)
> Hi Santhosh, 
> After how long are you seeing this error ?
> Can you share the output by:
> 1) start the mdt.mem tests.
> 2) While the tests are running what is the output of 'free -h' ?
> 3) Attach /tmp/htxerr 
> 
> Thank you.

Hi Vaishnavi,

I have run the test for more than 12 hours, but I am not sure exactly when the
lockup occurs.

Before starting the tests,

free -h :
              total        used        free      shared  buff/cache   available
Mem:            74G        528M         68G         15M        5.5G         73G
Swap:          3.3G          0B        3.3G

After running the tests for more than 10 minutes:

              total        used        free      shared  buff/cache   available
Mem:            74G        570M         20G         48G         53G         25G
Swap:          3.3G          0B        3.3G

The memory usage gradually increases.

I am not sure exactly at which point the lockup occurs.

Also, /tmp/htxerr is empty.
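
To make the before/after comparison concrete, the sketch below converts the
"Mem:" rows of free -h into bytes and subtracts them. It is only an
illustration: the helper names are hypothetical, the values are approximate
because free -h rounds them, and the suffixes are assumed to be powers of 1024.

```python
# Multipliers for the single-letter suffixes printed by free -h (assumed 1024-based).
UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

COLUMNS = ["total", "used", "free", "shared", "buff/cache", "available"]


def parse_size(token):
    """Convert a free -h token such as '5.5G' or '0B' to bytes (approximate)."""
    return int(float(token[:-1]) * UNITS[token[-1]])


def parse_mem_row(line):
    """Parse the 'Mem:' row of free -h into a dict keyed by column name."""
    tokens = line.split()
    # tokens[0] is the 'Mem:' label; the rest follow the column order above.
    return dict(zip(COLUMNS, (parse_size(t) for t in tokens[1:])))


def growth(before_row, after_row):
    """Per-column difference in bytes between two 'Mem:' rows."""
    before = parse_mem_row(before_row)
    after = parse_mem_row(after_row)
    return {name: after[name] - before[name] for name in COLUMNS}
```

Applied to the two rows above, "used" barely moves (528M to 570M) while
"shared" and "buff/cache" each grow by roughly 48G and "free" drops by 48G,
which suggests the test's memory is being accounted as shared memory rather
than as "used".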

== Comment: #7 - Vaishnavi Bhat <vaish...@in.ibm.com> - 2016-09-30 04:03:23 ==
Hi Santhosh ,

While running mdt.mem, we see that about 60% of memory is used and free
swap is reduced to 0B.
              total        used        free      shared  buff/cache   available
Mem:            74G        570M         20G         48G         53G         25G
Swap:          3.3G          0B        3.3G

Top output:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1860 root      38  18 48.484g 0.046t 0.046t S 318.1 63.5   4865:53 hxemem64

The dmesg output also shows traces of OOM and soft lockups involving hxemem.

Can you please try increasing the vm.min_free_kbytes value and see if it shows
any improvement? I would suggest starting with double the current value.
Current value:
$ sysctl -n vm.min_free_kbytes
180224
New value:
$ sysctl -w vm.min_free_kbytes=<new value>

Thank you.
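
The suggestion above (start with double the current value) amounts to the
following. This is a minimal sketch with hypothetical helper names; it only
builds the sysctl command string and does not run it, since changing the value
needs root. It assumes the current value can be read from
/proc/sys/vm/min_free_kbytes.

```python
def read_min_free_kbytes(path="/proc/sys/vm/min_free_kbytes"):
    """Read the current vm.min_free_kbytes value from procfs."""
    with open(path) as handle:
        return int(handle.read().strip())


def doubled_min_free_kbytes(current_kb):
    """Suggested starting point: twice the current value."""
    return current_kb * 2


def sysctl_command(new_kb):
    """Build (but do not execute) the command that applies the new value."""
    return f"sysctl -w vm.min_free_kbytes={new_kb}"
```

For the current value of 180224 reported above, this gives 360448 and the
command "sysctl -w vm.min_free_kbytes=360448".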

== Comment: #10 - Vaishnavi Bhat <vaish...@in.ibm.com> - 2016-10-20 04:06:20 ==
(In reply to comment #9)
> Hi Vaishnavi,
> 
> I am able to reproduce this issue even in 4.8.0-22-generic
> 
> o/p:
> sysctl -n vm.min_free_kbytes
> 360448
> 
> Please, take a look in to the issue.
> 
> Thanks.

Thanks for the confirmation. The issue still reproduces with
vm.min_free_kbytes doubled:
sysctl -n vm.min_free_kbytes
360448

Thank you.

** Affects: linux (Ubuntu)
     Importance: Undecided
     Assignee: Taco Screen team (taco-screen-team)
         Status: New


** Tags: architecture-ppc64le bugnameltc-146905 severity-high 
targetmilestone-inin---

https://bugs.launchpad.net/bugs/1649513

Title:
  [Ubuntu 16.10] NMI watchdog and soft lockup while running htx memory
  tests in kernel 4.8.0-17-generic