Before trying the upstream kernel, I tried to replicate the issue. After noticing that it happens whenever there is heavy file I/O, I was able to reproduce it at will by running applications that do a lot of file I/O. I was also monitoring free memory every second to understand why the kernel was invoking oom-killer to kill applications at random. When oom-killer started killing applications, the memory looked like this:

Every 1.0s: free -h                          gorilla: Sat Jan 14 09:52:01 2017

              total        used        free      shared  buff/cache   available
Mem:           5.9G        755M        127M         17M        5.1G        4.6G
Swap:          2.0G          0B        2.0G

As you can see, there is a lot of available memory (mostly in cache, and I am very sure most of it is clean cache), but for some reason it was not being reclaimed by the kernel (kswapd0?). So I decided to run "echo 3 > /proc/sys/vm/drop_caches" frequently to force the cache to be dropped, and sure enough everything worked fine. I have not seen this problem in the last 2+ days:

root@gorilla:~# cat /var/log/syslog|egrep "NMI watchdog: BUG: soft lockup|oom-killer"
root@gorilla:~# uptime
 07:37:29 up 2 days, 19:34,  1 user,  load average: 1.63, 0.77, 0.29

Now that I suspect this may be a bug in kswapd0, I searched here for similar kswapd0 issues and found one (see below). I am not sure it is the same problem, though the symptoms and the workaround are the same.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1518457

The end of that report (comment #142) says there is no problem with the 4.4.0-45 kernel, but that the Yakkety-based 4.8+ kernel has this problem. Assuming this is the same issue, I can confirm the same, as I never had this problem before upgrading to Yakkety. I am wondering if the bug made its way back in since the fix.

Since I have a workaround, I am going to continue with it; it is not ideal but seems to hold. The last note on the above report says the bug is fixed and any new problem should be opened as a new bug. Can this report be treated as a new bug to address this problem?

Thanks

** Changed in: linux (Ubuntu)
       Status: Incomplete => Confirmed

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1655356

Title:
  NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kswapd0:50]; oom-killer; and eventual kernel panic on 16.10 (upgrade from 16.04)
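
For anyone who wants to try the same workaround, here is a minimal sketch of how the periodic "echo 3 > /proc/sys/vm/drop_caches" could be scripted. It is not the exact procedure used in this report: the MIN_FREE_KB threshold, the 60-second interval, and the function names are all assumptions made for illustration. Instead of dropping caches unconditionally, it only does so when MemFree in /proc/meminfo falls below the threshold.

```shell
#!/bin/sh
# Sketch of the workaround: drop caches when free memory runs low.
# MIN_FREE_KB and the 60 s interval are assumed values, not from the report.

MIN_FREE_KB=${MIN_FREE_KB:-262144}   # hypothetical threshold: 256 MiB

# Print the MemFree value (in kB) from a meminfo-style file.
mem_free_kb() {
    awk '$1 == "MemFree:" { print $2 }' "${1:-/proc/meminfo}"
}

# Succeed (exit status 0) when free memory is below the threshold.
needs_drop() {
    [ "$(mem_free_kb "$1")" -lt "$MIN_FREE_KB" ]
}

# Flush dirty pages, then ask the kernel to drop page cache,
# dentries and inodes. Requires root.
drop_caches() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

# The loop only runs when the script is invoked with the argument "run".
if [ "${1:-}" = "run" ]; then
    while :; do
        if needs_drop; then
            drop_caches
        fi
        sleep 60
    done
fi
```

Saved under a name of your choice (for example dropcache.sh, a hypothetical name), it would be started as root with "sh dropcache.sh run". Like the manual command, this only hides the symptom; it does not address whatever keeps kswapd0 from reclaiming the cache.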