Hello Junien,

(recommendations are marked with *)

I'm replying to you and to the LP bug so that this gets properly documented.

Under comment #91:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1505564/comments/91

you can see my kernel dump analysis, which shows that the OS is stuck in a "migration thread", possibly because of a lack of IPI synchronisation (maybe even an IPI being lost). We have already seen cases like this, especially in nested virtualisation environments, and this has been discussed on LKML.

Before we move further, I need you to follow some "best practices" for Proliant servers:

1 - NMIs caused during the MWAIT instruction (caused by the intel_idle module)
    & HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded

(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580)
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1432837)

* Firmware: Configure C3 as the maximum c-state for CPU savings (CPU C-STATES)
* Firmware: Disable packed CPU c-state
* Firmware: Disable Cooperative Power Management
* Make sure NOT TO LOAD the HPWDT kernel module (LP: #1432837 Fix Released 3.13.0-49.81)

2 - Recently discovered NMIs caused by a BUG in Intel microcode

(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1416414)

* If you have Intel-based Proliant servers, use at least 3.13.0-35.61 because of the Intel microcode issue.

3 - X2APIC support for HP Proliant Servers

(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398497)

* For Proliant prior to G8 (<= G7), add "nox2apic intremap=off" to the grub cmdline
* For Proliant G8, add "intremap=no_x2apic_optout" to the grub cmdline

4 - HP Proliant Latest Firmware (MOST IMPORTANT)

Upgrade the server firmware to the latest version. There were numerous firmware fixes from HP.

---> If we are facing a firmware problem related to IPIs (inter-processor interrupts) being missed, we have to make sure it is reproducible with the latest firmware in order to work together with the HP ROM engineering team.
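If it helps with checking many servers, the OS-side items above (HPWDT not loaded, minimum kernel version, grub cmdline parameters) could be verified with something like the rough Python sketch below. This is only an illustration under assumptions: the function names are made up for this example, the version comparison is a plain numeric comparison meant for the 3.13 series, and the firmware settings still have to be checked in the BIOS/RBSU by hand.

```python
import re

# Minimum version from item 2 above (3.13.0-35.61).
MIN_KERNEL = (3, 13, 0, 35, 61)

def parse_version(text):
    """Turn a string like '3.13.0-35.61' into a tuple of up to five ints."""
    return tuple(int(n) for n in re.findall(r"\d+", text)[:5])

def kernel_ok(version, minimum=MIN_KERNEL):
    """True if the given version string is at least the minimum."""
    return parse_version(version) >= minimum

def module_loaded(name, proc_modules_text):
    """True if `name` is listed in text formatted like /proc/modules."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False

def missing_params(cmdline, wanted):
    """Return the entries of `wanted` that are absent from the cmdline."""
    present = cmdline.split()
    return [p for p in wanted if p not in present]

# Parameters from item 3 above, keyed by a hypothetical generation label.
X2APIC_PARAMS = {
    "pre-g8": ["nox2apic", "intremap=off"],
    "g8": ["intremap=no_x2apic_optout"],
}
```

On a real host the inputs would come from /proc/modules and /proc/cmdline; the full 3.13.0-NN.MM version is not what "uname -r" prints, so it has to be taken from the installed package version (or from /proc/version_signature, if your kernel provides it). For example, module_loaded("hpwdt", open("/proc/modules").read()) should be False on a correctly configured server.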
Summary:

Could you follow all these steps and provide feedback?

I understand this might take a while if you have a large number of servers. If so, I would take a statistical approach here: change only half of the servers and keep the other half unchanged as the "control group" for future comparisons.

Is this feasible?

Looking forward to hearing your feedback.

Best Regards

Rafael Tinoco
Sustaining Engineering
-- 
https://bugs.launchpad.net/bugs/1505564

Title:
  Soft lockup with "block nbdX: Attempted send on closed socket" spam