-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1821259

Title:
  Hard lockup in 2 CPUs due to deadlock in cpu_stoppers

Status in linux package in Ubuntu:
  Incomplete

Bug description:
  [Impact]

   * This problem hard locks up 2 CPUs in a deadlock, and this
     soft locks up other CPUs as an effect; the system becomes
     unusable.

   * This is relatively rare / difficult to hit because it's a
     corner case in scheduling/load balancing that needs timing
     with CPU stopper code. And it needs SMP plus _NUMA_ system.
     (but it can be hit with synthetic test case attached in LP.)

   * Since SMP plus NUMA usually equals _servers_ it looks like
     a good idea to prevent this bug / hard lockups / rebooting.

   * The fix resolves the potential deadlock by removing one of
     the calls required to deadlock from under the locked code.

  [Test Case]

   * There's a synthetic test case to reproduce this problem
     (although without the stack traces - just a system hang)
     attached to this LP bug.

   * It uses kprobes/mdelay/cpu stopper calls to force the code
     to execute and force the timing/locking condition to occur.

   * $ sudo insmod kmod-stopper.ko

     Some dmesg logging occurs, and systems either hangs or not.
     See examples in comments.
     
  [Regression Potential] 

   * These are patches to the cpu stop_machine.c code, and they
     change a bit how it works;  however, there are no upstream
     fixes for these patches anymore and they are still the top
     of the 'git log --oneline -- kernel/stop_machine.c' output.

   * These patches have been verified with the synthetic test case
     and 'stress-ng --class scheduler --sequential 0' (no regressions)
     on guest with 2 CPUs and one physical system with 24 CPUs.

  [Other Info]
   
   * The patches are required on Xenial and later.
   * There are 4 patches for Xenial, and 2 patches pending for Bionic.
   * All patches are applied from Cosmic onwards.

  [Original Description]

  These 2 hard lockups happened all of a sudden in the logs, and many
  soft lockups occur after them as a fallout.

      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.477086] NMI watchdog: 
Watchdog detected hard LOCKUP on cpu 10
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.483800] Modules linked in: 
<...>
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484066] CPU: 10 PID: 58 
Comm: migration/10 Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484068] Hardware name: HP 
ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484070] task: 
ffff883ff2a76200 ti: ffff883ff2110000 task.ti: ffff883ff2110000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484071] RIP: 
0010:[<ffffffff810c8cb0>]  [<ffffffff810c8cb0>] 
native_queued_spin_lock_slowpath+0x160/0x170
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484079] RSP: 
0000:ffff883ff2113c58  EFLAGS: 00000002
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484080] RAX: 
0000000000000101 RBX: 0000000000000086 RCX: 0000000000000001
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484081] RDX: 
0000000000000101 RSI: 0000000000000001 RDI: ffff881fff991ba8
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484083] RBP: 
ffff883ff2113c58 R08: 0000000000000101 R09: ffff883ff082e200
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484084] R10: 
0000000000002e04 R11: 0000000000002e04 R12: ffff881fff997c60
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484085] R13: 
ffff881fff991ba8 R14: 0000000000000000 R15: ffff881fff997300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484087] FS:  
0000000000000000(0000) GS:ffff883fff000000(0000) knlGS:0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484088] CS:  0010 DS: 0000 
ES: 0000 CR0: 0000000080050033
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484090] CR2: 
00007f7caaa23020 CR3: 0000001f46740000 CR4: 0000000000160670
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484091] Stack:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484092]  ffff883ff2113c68 
ffffffff811870eb ffff883ff2113c80 ffffffff81819907
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484094]  ffff881fff991ba0 
ffff883ff2113cb0 ffffffff8111c600 ffff881fff997300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484096]  ffff881fff997c90 
ffff881ff03dd400 0000000000000000 ffff883ff2113cc0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484098] Call Trace:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484105]  
[<ffffffff811870eb>] queued_spin_lock_slowpath+0xb/0xf
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484109]  
[<ffffffff81819907>] _raw_spin_lock_irqsave+0x37/0x40
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484113]  
[<ffffffff8111c600>] cpu_stop_queue_work+0x30/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484116]  
[<ffffffff8111ccd0>] stop_one_cpu_nowait+0x30/0x40
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484119]  
[<ffffffff810bbb5b>] load_balance+0x71b/0x940
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484122]  
[<ffffffff810bbff5>] pick_next_task_fair+0x275/0x4b0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484126]  
[<ffffffff81816166>] __schedule+0x6c6/0x7f0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484132]  
[<ffffffff810a2560>] ? sort_range+0x30/0x30
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484134]  
[<ffffffff818162c5>] schedule+0x35/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484136]  
[<ffffffff810a262d>] smpboot_thread_fn+0xcd/0x180
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484139]  
[<ffffffff8109f138>] kthread+0xd8/0xf0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484141]  
[<ffffffff8109f060>] ? kthread_park+0x60/0x60
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484143]  
[<ffffffff81819ff5>] ret_from_fork+0x55/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603802.484144]  
[<ffffffff8109f060>] ? kthread_park+0x60/0x60

      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.644471] NMI watchdog: 
Watchdog detected hard LOCKUP on cpu 6
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651086] Modules linked in: 
<...>
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651342] CPU: 6 PID: 204932 
Comm: ceph-osd Not tainted 4.4.0-116-generic #140~14.04.1-Ubuntu
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651344] Hardware name: HP 
ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 02/17/2017
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651345] task: 
ffff881ff03dd400 ti: ffff883cda77c000 task.ti: ffff883cda77c000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651347] RIP: 
0010:[<ffffffff810aacb6>]  [<ffffffff810aacb6>] try_to_wake_up+0x86/0x3f0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651353] RSP: 
0000:ffff883cda77fa78  EFLAGS: 00000002
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651354] RAX: 
0000000000000001 RBX: ffff883ff2a76200 RCX: 0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651355] RDX: 
0000000000000001 RSI: 0000000000000003 RDI: ffff883ff2a768d4
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651356] RBP: 
ffff883cda77fab8 R08: 000000000000000a R09: ffff881ff03dd400
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651357] R10: 
0000000000000001 R11: 0000000000000000 R12: 0000000000017300
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651359] R13: 
ffff883ff2a768d4 R14: 0000000000000046 R15: 0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651360] FS:  
00007ff8ecbc9700(0000) GS:ffff881fff980000(0000) knlGS:0000000000000000
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651362] CS:  0010 DS: 0000 
ES: 0000 CR0: 0000000080050033
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651363] CR2: 
0000000014583550 CR3: 0000003d4ac96000 CR4: 0000000000160670
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651364] Stack:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651365]  0000000000000202 
ffff883cda77fa98 0000000000000003 0000000000000006
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651368]  000000000000000a 
ffff883cda77fb70 ffff883fff011ba0 ffff881fff991ba0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651370]  ffff883cda77fac8 
ffffffff810ab035 ffff883cda77fbc8 ffffffff8111cc22
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651372] Call Trace:
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651375]  
[<ffffffff810ab035>] wake_up_process+0x15/0x20
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651379]  
[<ffffffff8111cc22>] stop_two_cpus+0x1b2/0x230
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651382]  
[<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651384]  
[<ffffffff810b5d15>] ? dequeue_entity+0x455/0x8a0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651386]  
[<ffffffff8111c650>] ? cpu_stop_queue_work+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651388]  
[<ffffffff810aaa70>] ? __migrate_swap_task.part.83+0x80/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651390]  
[<ffffffff810ab18e>] migrate_swap+0xae/0x130
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651392]  
[<ffffffff810b4e44>] task_numa_migrate+0x504/0x930
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651394]  
[<ffffffff810b52e9>] numa_migrate_preferred+0x79/0x80
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651396]  
[<ffffffff810b9373>] task_numa_fault+0x923/0xcd0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651400]  
[<ffffffff8175e407>] ? tcp_recvmsg+0x6b7/0xbd0
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651404]  
[<ffffffff811da9be>] ? mpol_misplaced+0x14e/0x190
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651408]  
[<ffffffff811b7836>] handle_pte_fault+0x5a6/0x1440
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651411]  
[<ffffffff816f6693>] ? sock_recvmsg+0x43/0x50
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651413]  
[<ffffffff811b9540>] handle_mm_fault+0x250/0x540
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651417]  
[<ffffffff81069e1a>] __do_page_fault+0x19a/0x430
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651419]  
[<ffffffff8106a0d2>] do_page_fault+0x22/0x30
      Nov 23 15:48:33 SYSTEM_NAME kernel: [4603805.651423]  
[<ffffffff8181c5a8>] page_fault+0x28/0x30

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1821259/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to