Verification done on Bionic (no errors seen on dmesg).
Still waiting on verification by the reporter (different hardware), but this 
verification shows no regression in dmesg.

Same steps as described in the previous comment.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1810998

Title:
  CPU hard lockup with rigorous writes to NVMe drive

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  [Impact]

   * Users may experience cpu hard lockups when performing
     rigorous writes to NVMe drives.

   * The fix addresses an scheduling issue in the original
     implementation of wbt/writeback throttling

   * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
     contention and thundering herd issue in wbt_wait"),
     plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
     waking of tasks").

   * Plus a few dependency commits for each fix.

   * Backports are trivial: mainly replace rq_wait_inc_below()
     with the equivalent atomic_inc_below(), and maintain the
     __wbt_done() signature, both due to the lack of commit
     a79050434b45 ("blk-rq-qos: refactor out common elements
     of blk-wbt"), that changes lots of other/unrelated code.

  [Test Case]

   * This command has been reported to reproduce the problem:

     $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t
  128

   * It generates stack traces as below in the original kernel,
     and does not generate them in the modified/patched kernel.

   * The user/reporter verified the test kernel with these patches
     resolved the problem.

   * The developer verified in 2 systems (4-core and 24-core but
     no NVMe) for regressions, and no error messages were logged
     to dmesg.

  [Regression Potential]

   * The regression potential is contained within writeback
     throttling mechanism (block/blk-wbt.*).

   * The commits have been verified for fixes in later commits in
     linux-next as of 2019-01-08 and all known fix commits are in.

  [Other Info]

   * The problem has been introduced with the blk-wbt mechanism,
     in v4.10-rc1, and the fix commits in v4.19-rc1 and -rc2,
     so only Bionic and Cosmic needs this.

  [Stack Traces]

  [ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
  ...
  [ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic 
#21-Ubuntu
  ...
  [ 393.628720] Call Trace:
  [ 393.628721] <IRQ>
  [ 393.628724] enqueue_task_fair+0x6c/0x7f0
  [ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
  [ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
  [ 393.628731] activate_task+0x57/0xc0
  [ 393.628735] ? sched_clock+0x9/0x10
  [ 393.628736] ? sched_clock+0x9/0x10
  [ 393.628738] ttwu_do_activate+0x49/0x90
  [ 393.628739] try_to_wake_up+0x1df/0x490
  [ 393.628741] default_wake_function+0x12/0x20
  [ 393.628743] autoremove_wake_function+0x12/0x40
  [ 393.628744] __wake_up_common+0x73/0x130
  [ 393.628745] __wake_up_common_lock+0x80/0xc0
  [ 393.628746] __wake_up+0x13/0x20
  [ 393.628749] __wbt_done.part.21+0xa4/0xb0
  [ 393.628749] wbt_done+0x72/0xa0
  [ 393.628753] blk_mq_free_request+0xca/0x1a0
  [ 393.628755] blk_mq_end_request+0x48/0x90
  [ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
  [ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
  [ 393.628764] __blk_mq_complete_request+0xd2/0x140
  [ 393.628766] blk_mq_complete_request+0x18/0x20
  [ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
  [ 393.628768] nvme_irq+0x23/0x50 [nvme]
  [ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
  [ 393.628773] handle_irq_event_percpu+0x32/0x80
  [ 393.628774] handle_irq_event+0x3b/0x60
  [ 393.628778] handle_edge_irq+0x7c/0x190
  [ 393.628779] handle_irq+0x20/0x30
  [ 393.628783] do_IRQ+0x46/0xd0
  [ 393.628784] common_interrupt+0x84/0x84
  [ 393.628785] </IRQ>
  ...
  [ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
  [ 393.628796] cpuidle_enter+0x17/0x20
  [ 393.628797] call_cpuidle+0x23/0x40
  [ 393.628798] do_idle+0x18c/0x1f0
  [ 393.628799] cpu_startup_entry+0x73/0x80
  [ 393.628802] start_secondary+0x1a6/0x200
  [ 393.628804] secondary_startup_64+0xa5/0xb0
  [ 393.628805] Code: ...

  [ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled

  [ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
  [ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
  [ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
  [ 435.617025] Sending NMI from CPU 8 to CPUs 30:
  [ 435.617029] NMI backtrace for cpu 30
  [ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic 
#21-Ubuntu
  ...
  [ 435.617047] Call Trace:
  [ 435.617048] <IRQ>
  [ 435.617051] enqueue_entity+0x9f/0x6b0
  [ 435.617053] enqueue_task_fair+0x6c/0x7f0
  [ 435.617056] activate_task+0x57/0xc0
  [ 435.617059] ? sched_clock+0x9/0x10
  [ 435.617060] ? sched_clock+0x9/0x10
  [ 435.617061] ttwu_do_activate+0x49/0x90
  [ 435.617063] try_to_wake_up+0x1df/0x490
  [ 435.617065] default_wake_function+0x12/0x20
  [ 435.617067] autoremove_wake_function+0x12/0x40
  [ 435.617068] __wake_up_common+0x73/0x130
  [ 435.617069] __wake_up_common_lock+0x80/0xc0
  [ 435.617070] __wake_up+0x13/0x20
  [ 435.617073] __wbt_done.part.21+0xa4/0xb0
  [ 435.617074] wbt_done+0x72/0xa0
  [ 435.617077] blk_mq_free_request+0xca/0x1a0
  [ 435.617079] blk_mq_end_request+0x48/0x90
  [ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
  [ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
  [ 435.617088] __blk_mq_complete_request+0xd2/0x140
  [ 435.617090] blk_mq_complete_request+0x18/0x20
  [ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
  [ 435.617093] nvme_irq+0x23/0x50 [nvme]
  [ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
  [ 435.617097] handle_irq_event_percpu+0x32/0x80
  [ 435.617098] handle_irq_event+0x3b/0x60
  [ 435.617101] handle_edge_irq+0x7c/0x190
  [ 435.617102] handle_irq+0x20/0x30
  [ 435.617106] do_IRQ+0x46/0xd0
  [ 435.617107] common_interrupt+0x84/0x84
  [ 435.617108] </IRQ>
  ...
  [ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
  [ 435.617118] cpuidle_enter+0x17/0x20
  [ 435.617119] call_cpuidle+0x23/0x40
  [ 435.617121] do_idle+0x18c/0x1f0
  [ 435.617122] cpu_startup_entry+0x73/0x80
  [ 435.617125] start_secondary+0x1a6/0x200
  [ 435.617127] secondary_startup_64+0xa5/0xb0
  [ 435.617128] Code: ...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1810998/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to