Hello Linux-Kernel Team,

I have a unique situation and would appreciate some assistance or guidance.
We run a software solution on Gentoo with kernel version 3.18.34. One of our customers is seeing frequent hangs of the VM, which runs in a VMware environment; we have no control over the customer's VMware infrastructure.

We have enabled kdump with the debug kernel for the customer, and I have set up the same on our local test environment. kdump is configured and the required sysctl settings are in place so that the system generates a crashdump when a panic is forced via sysrq.

On my test environment, these exact settings work fine: when I send the sysrq trigger 'Alt + SysRq + c', a panic is triggered and the system reboots automatically. On the customer's environment, however, the system does not reboot after the panic and simply remains hung. Dumping the task list with 'Alt + SysRq + t' does work on the customer setup, which suggests that sysrq is reaching the kernel; but when I attempt the crash, the system hangs instead of rebooting, so we never get a valid crashdump.

The settings below are identical on my environment and on the customer's. This is an appliance-based solution, so we ship the OS together with our software. The only difference is the VMware environment in the customer's setup.

$ sysctl -a | egrep 'panic|sysctl'
error: "Invalid argument" reading key "fs.binfmt_misc.register"
fs.xfs.panic_mask = 0
kernel.hung_task_panic = 0
kernel.panic = 0
kernel.panic_on_io_nmi = 0
kernel.panic_on_oops = 1
kernel.panic_on_unrecovered_nmi = 0
kernel.softlockup_panic = 0
kernel.sysctl_writes_strict = 0
kernel.unknown_nmi_panic = 1
error: permission denied on key 'net.ipv4.route.flush'
error: permission denied on key 'net.ipv6.route.flush'
error: permission denied on key 'vm.compact_memory'
vm.panic_on_oom = 0

I have also tried sending an NMI from the VMware hypervisor, with the same result: panic and reboot on my test environment, but on the customer's environment a hung OS that does not reboot.
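
To compare the kdump state of the two systems beyond the sysctl values, a minimal sketch like the following could be run on each side. The helper names read_or_na and kdump_state are hypothetical, chosen only for illustration, and I am assuming the usual procfs/sysfs entries are present on this 3.18 kernel; any entry that is missing is reported as "n/a" rather than treated as an error.

```shell
#!/bin/sh
# Hypothetical helper: print a file's contents, or "n/a" if it is
# missing or unreadable on this system.
read_or_na() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "n/a"
    fi
}

# Hypothetical helper: summarize the state relevant to a sysrq-forced
# crashdump. The paths are assumed to exist on the running kernel.
kdump_state() {
    # 1 means a crash (capture) kernel is currently loaded via kexec
    echo "kexec_crash_loaded: $(read_or_na /sys/kernel/kexec_crash_loaded)"
    # bytes of memory reserved for the crash kernel
    echo "kexec_crash_size: $(read_or_na /sys/kernel/kexec_crash_size)"
    # which sysrq functions are enabled
    echo "sysrq: $(read_or_na /proc/sys/kernel/sysrq)"
    # seconds to wait before rebooting after a panic (0 = no reboot)
    echo "panic: $(read_or_na /proc/sys/kernel/panic)"
    # boot parameters, to confirm that crashkernel= was passed
    echo "cmdline: $(read_or_na /proc/cmdline)"
}

kdump_state
```

If kexec_crash_loaded or kexec_crash_size differs between the two systems, that would point at the capture kernel not being loaded or reserved the same way, rather than at the sysctl settings.
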

After reading the kernel documentation for kernel.panic, I also set the value to 10 on the customer setup, but it made no difference. On my test setup I get a valid crashdump with either 0 or 10.

https://www.kernel.org/doc/Documentation/sysctl/kernel.txt

Does anyone have suggestions or pointers on what could cause the panic to trigger successfully but leave the OS hung? The only option then is to reset the VM from the hypervisor; otherwise it just sits there forever. As a result, I cannot get a valid crashdump to investigate the original issue, namely why our software has a problem in the customer environment that leads to an infrequently hung VM, while other VMs on the same host are all fine and no hardware issues or errors are seen.

Thank you very much!

Regards,
Jimmy