I'm attaching the crash tool output from the 3.13 kernel dump.

Most likely related to the situation already found in the following case:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540

Handled by Chris Arges and me in LKML discussions with Ingo and Linus:
-> http://www.kernelhub.org/?p=2&msg=683682

FOR NOW, it is LIKELY that I'll rely on the already known recommendations
for ProLiant (including the ones related to X2APIC mode):
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580

So we can TRY TO GUARANTEE that there are no LOST IRQs (IPIs) with the
firmware you're using. Hopefully, with the proper APIC mode set as HP
recommends, we will not have those IPI problems.

OBS: Whenever IPIs are lost (we've seen this on some nested KVMs and some
buggy HW) we can get locked up in the SMP callback state machine. The
state machine loses IPI ACKs and loops forever trying to shut down the
CPU so that the SMP task queue can continue.
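
To make the lost-IPI case concrete, here is a minimal user-space sketch of
the lock/ACK handshake. It is not kernel code: csd_model,
model_ipi_handler, model_call_single and the max_spins bound are all
hypothetical names made up for illustration. Only the CSD_FLAG_LOCK bit
comes from the dump below. The real wait loop has no bound, which is why a
lost ACK shows up as a CPU spinning forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CSD_FLAG_LOCK 0x01  /* same bit the testb $0x1 in the dump checks */

/* Hypothetical user-space stand-in for the kernel's call_single_data. */
struct csd_model {
    unsigned short flags;
};

/* Target-CPU side: runs only if the IPI arrives, then releases the lock. */
static void model_ipi_handler(struct csd_model *csd)
{
    csd->flags &= ~CSD_FLAG_LOCK;
}

/*
 * Sender side: take the lock, "send" the IPI, then wait for the lock to
 * clear. max_spins exists only so this model terminates (assumption; the
 * kernel loop is unbounded).
 * Returns true if the ACK arrived, false if the sender would still spin.
 */
static bool model_call_single(struct csd_model *csd, bool ipi_delivered,
                              unsigned long max_spins)
{
    assert(csd != NULL);
    csd->flags |= CSD_FLAG_LOCK;

    if (ipi_delivered)
        model_ipi_handler(csd);   /* a lost IPI never reaches the handler */

    for (unsigned long i = 0; i < max_spins; i++) {
        if (!(csd->flags & CSD_FLAG_LOCK))
            return true;
    }
    return false;
}
```

Calling model_call_single() with ipi_delivered set to false leaves the
lock bit set, which is the state the three qemu-system-x86 tasks below
were caught in.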

I'll SOON provide a comment with SUGGESTIONS and a request for FEEDBACK.

################

For now, from the 3.13 kernel dump, the most interesting part:

We had 7 CPUs executing the migration kernel thread (for the SMP
callback state machine execution):

#### migration tasks (state machine loop)

>    93      2   4  ffff8808147b47d0  RU   0.0       0      0  [migration/4]
>   118      2   9  ffff881814a2c7d0  RU   0.0       0      0  [migration/9]
>   123      2  10  ffff88081404c7d0  RU   0.0       0      0  [migration/10]
>   128      2  11  ffff881814a4c7d0  RU   0.0       0      0  [migration/11]
>   138      2  13  ffff881814a647d0  RU   0.0       0      0  [migration/13]
>   165      2  18  ffff8810149ec7d0  RU   0.0       0      0  [migration/18]
>   195      2  24  ffff881014a647d0  RU   0.0       0      0  [migration/24]

This logic tries to migrate tasks from one CPU to another. For that to
happen, it has to rely on the state machine logic of shutting CPUs down
before migrating the tasks (turning off IRQs, etc). The state machine,
which shuts down the CPUs in phases, relies on the SMP callbacks below.
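
As a rough illustration of why one unresponsive CPU stalls every
migration thread, here is a simplified lockstep model. The phase names,
run_stop_model and the cpu_acks array are hypothetical; the kernel's own
enum and data structures may differ. The only property taken from the
description above is that a phase is left only after every participating
CPU has acknowledged it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phase names; the kernel's own enum may differ. */
enum stop_phase { PHASE_PREPARE, PHASE_DISABLE_IRQ, PHASE_RUN, PHASE_EXIT };

/*
 * Walk the phases in lockstep. A phase is left only when every CPU has
 * acknowledged it. Returns the phase the machine ends in: PHASE_EXIT on
 * success, or the phase where it stalls because some CPU never acks.
 */
static enum stop_phase run_stop_model(const bool *cpu_acks, size_t ncpus)
{
    assert(cpu_acks != NULL && ncpus > 0);
    enum stop_phase phase = PHASE_PREPARE;

    while (phase != PHASE_EXIT) {
        size_t pending = ncpus;          /* acks still outstanding */
        for (size_t cpu = 0; cpu < ncpus; cpu++) {
            if (cpu_acks[cpu])
                pending--;
        }
        if (pending != 0)
            return phase;                /* kernel threads would spin here */
        phase = (enum stop_phase)(phase + 1);
    }
    return phase;
}
```

With a single false entry in cpu_acks the model never leaves its first
phase, which matches the picture of several migration threads all sitting
in RU state waiting on each other.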

We had 3 CPUs in a part of the kernel that we have already identified to
be problematic under certain conditions and/or HW.

** > 17247      1  23  ffff881007055fc0  RU   1.6 7358428 2192548  qemu-system-x86

PID: 17247  TASK: ffff881007055fc0  CPU: 23  COMMAND: "qemu-system-x86"
 #0 [ffff88203eac6e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203eac6e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203eac6ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203eac6ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff8810ea7c96e0  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff8810ea7c96e0  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff8810ea7c96e0  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000206  R14: 000000007bc87bc6  R15: ffff8814959f76c0
    ORIG_RAX: ffff8814959f76c0  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff8810ea7c96e0] generic_exec_single at ffffffff810db712

!!!! CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710 <generic_exec_single+128>

109                     cpu_relax();
110     }
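
To read the two instructions above: the testb checks bit 0 (CSD_FLAG_LOCK)
of the byte at offset 0x20 from the csd pointer, and the jne jumps back
while it is still set. The sketch below mirrors that in user space. The
struct layout is an assumption inferred from the 0x20 offset in the
disassembly (two list pointers, a function pointer, an info pointer, then
the flags); csd_layout_model, list_head_model and csd_is_locked are
hypothetical names, and the byte access assumes a little-endian 64-bit
target such as x86-64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CSD_FLAG_LOCK 0x01

/* User-space mirror; field names and order are assumed, not from the dump. */
struct list_head_model { void *next, *prev; };

struct csd_layout_model {
    struct list_head_model list;   /* 16 bytes on x86-64 */
    void (*func)(void *info);      /* 8 bytes */
    void *info;                    /* 8 bytes */
    uint16_t flags;                /* lands at offset 0x20 */
};

/* Same check the testb performs: bit 0 of the byte at base + 0x20. */
static int csd_is_locked(const void *base)
{
    const unsigned char *p = base;
    return (p[0x20] & CSD_FLAG_LOCK) != 0;
}
```

On such a target, offsetof(struct csd_layout_model, flags) evaluates to
0x20, matching the displacement in the testb operand.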

** > 21036      1  27  ffff8810b69947d0  RU   1.0 7484828 1401940  qemu-system-x86

PID: 21036  TASK: ffff8810b69947d0  CPU: 27  COMMAND: "qemu-system-x86"
 #0 [ffff88203eb46e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203eb46e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203eb46ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203eb46ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff8814959f7670  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff8814959f7670  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff8814959f7670  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000282  R14: 0000000000000000  R15: 0000000000000100
    ORIG_RAX: 0000000000000100  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff8814959f7670] generic_exec_single at ffffffff810db712

!!!! CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710 <generic_exec_single+128>

109                     cpu_relax();
110     }

** > 18516      1  31  ffff881dd54a2fe0  RU   1.6 7358428 2192548  qemu-system-x86

PID: 18516  TASK: ffff881dd54a2fe0  CPU: 31  COMMAND: "qemu-system-x86"
 #0 [ffff88203ebc6e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203ebc6e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203ebc6ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203ebc6ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712  RSP: ffff881dd55597a0  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff881dd55597a0  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff810db712   R8: ffffffff810db712   R9: 0000000000000018
    R10: ffff881dd55597a0  R11: 0000000000000202  R12: ffffffffffffffff
    R13: 0000000000000206  R14: 000000007bca7bc8  R15: ffff8814959f76c0
    ORIG_RAX: ffff8814959f76c0  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [ffff881dd55597a0] generic_exec_single at ffffffff810db712

!!!! CSD_FLAG logic discussed with Linus

108             while (csd->flags & CSD_FLAG_LOCK)
   0xffffffff810db712 <+130>:   testb  $0x1,0x20(%rbx)
   0xffffffff810db716 <+134>:   jne    0xffffffff810db710 <generic_exec_single+128>

109                     cpu_relax();
110     }

** Attachment added: "lp1505564-3.13-kdump-crash-output.txt"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1505564/+attachment/4509470/+files/lp1505564-3.13-kdump-crash-output.txt

Title:
  Soft lockup with "block nbdX: Attempted send on closed socket" spam
