Verification done on Bionic (no errors seen in dmesg). Verification by the reporter (on different hardware) is still pending, but this verification shows no regression in dmesg.
Same steps as described in the previous comment.

** Tags removed: verification-needed-bionic
** Tags added: verification-done-bionic

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1810998

Title:
  CPU hard lockup with rigorous writes to NVMe drive

Status in linux package in Ubuntu:
  Confirmed
Status in linux source package in Bionic:
  Fix Committed
Status in linux source package in Cosmic:
  Fix Committed

Bug description:
  [Impact]

   * Users may experience CPU hard lockups when performing
     rigorous writes to NVMe drives.

   * The fix addresses a scheduling issue in the original
     implementation of wbt/writeback throttling.

   * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
     contention and thundering herd issue in wbt_wait"),
     plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
     waking of tasks").

   * Plus a few dependency commits for each fix.

   * Backports are trivial: mainly replace rq_wait_inc_below()
     with the equivalent atomic_inc_below(), and maintain the
     __wbt_done() signature, both due to the lack of commit
     a79050434b45 ("blk-rq-qos: refactor out common elements
     of blk-wbt"), which changes a lot of other, unrelated code.

  [Test Case]

   * This command has been reported to reproduce the problem:

     $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128

   * It generates stack traces as below in the original kernel,
     and does not generate them in the modified/patched kernel.

   * The user/reporter verified that the test kernel with these
     patches resolved the problem.

   * The developer checked for regressions on 2 systems (4-core and
     24-core, but no NVMe), and no error messages were logged
     to dmesg.

  [Regression Potential]

   * The regression potential is contained within the writeback
     throttling mechanism (block/blk-wbt.*).

   * The commits have been checked against later commits in
     linux-next as of 2019-01-08 for follow-up fixes, and all
     known fix commits are included.
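The regression check described in the test case (no error messages logged to dmesg) could be scripted along the lines below. This is only a minimal sketch, not what the developer or reporter actually ran. The function names and the choice of patterns are assumptions for illustration. The patterns are taken from the stack traces quoted in this bug description.

```python
import re
import subprocess

# Signatures copied from the messages quoted in this bug report.
# Which messages to treat as failures is an assumption of this sketch.
LOCKUP_PATTERNS = [
    re.compile(r"NMI watchdog: Watchdog detected hard LOCKUP on cpu \d+"),
    re.compile(r"rcu_sched detected stalls on CPUs/tasks"),
    re.compile(r"nvme \S+: I/O \d+ QID \d+ timeout"),
]


def find_lockup_lines(dmesg_text):
    """Return the dmesg lines that match any of the lockup signatures."""
    hits = []
    for line in dmesg_text.splitlines():
        if any(pattern.search(line) for pattern in LOCKUP_PATTERNS):
            hits.append(line.strip())
    return hits


def scan_live_dmesg():
    """Hypothetical helper: scan the running kernel's log.

    Reading dmesg may require root, depending on kernel.dmesg_restrict.
    """
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return find_lockup_lines(result.stdout)
```

Calling scan_live_dmesg() after the iozone run and getting an empty list would correspond to the "no errors seen in dmesg" result. An empty list only means that none of these three signatures appeared, not that the log is free of other errors.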
  [Other Info]

   * The problem was introduced with the blk-wbt mechanism,
     in v4.10-rc1, and the fix commits landed in v4.19-rc1 and -rc2,
     so only Bionic and Cosmic need this.

  [Stack Traces]

  [ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
  ...
  [ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
  ...
  [ 393.628720] Call Trace:
  [ 393.628721] <IRQ>
  [ 393.628724] enqueue_task_fair+0x6c/0x7f0
  [ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
  [ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
  [ 393.628731] activate_task+0x57/0xc0
  [ 393.628735] ? sched_clock+0x9/0x10
  [ 393.628736] ? sched_clock+0x9/0x10
  [ 393.628738] ttwu_do_activate+0x49/0x90
  [ 393.628739] try_to_wake_up+0x1df/0x490
  [ 393.628741] default_wake_function+0x12/0x20
  [ 393.628743] autoremove_wake_function+0x12/0x40
  [ 393.628744] __wake_up_common+0x73/0x130
  [ 393.628745] __wake_up_common_lock+0x80/0xc0
  [ 393.628746] __wake_up+0x13/0x20
  [ 393.628749] __wbt_done.part.21+0xa4/0xb0
  [ 393.628749] wbt_done+0x72/0xa0
  [ 393.628753] blk_mq_free_request+0xca/0x1a0
  [ 393.628755] blk_mq_end_request+0x48/0x90
  [ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
  [ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
  [ 393.628764] __blk_mq_complete_request+0xd2/0x140
  [ 393.628766] blk_mq_complete_request+0x18/0x20
  [ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
  [ 393.628768] nvme_irq+0x23/0x50 [nvme]
  [ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
  [ 393.628773] handle_irq_event_percpu+0x32/0x80
  [ 393.628774] handle_irq_event+0x3b/0x60
  [ 393.628778] handle_edge_irq+0x7c/0x190
  [ 393.628779] handle_irq+0x20/0x30
  [ 393.628783] do_IRQ+0x46/0xd0
  [ 393.628784] common_interrupt+0x84/0x84
  [ 393.628785] </IRQ>
  ...
  [ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
  [ 393.628796] cpuidle_enter+0x17/0x20
  [ 393.628797] call_cpuidle+0x23/0x40
  [ 393.628798] do_idle+0x18c/0x1f0
  [ 393.628799] cpu_startup_entry+0x73/0x80
  [ 393.628802] start_secondary+0x1a6/0x200
  [ 393.628804] secondary_startup_64+0xa5/0xb0
  [ 393.628805] Code: ...
  [ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled
  [ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
  [ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
  [ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
  [ 435.617025] Sending NMI from CPU 8 to CPUs 30:
  [ 435.617029] NMI backtrace for cpu 30
  [ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
  ...
  [ 435.617047] Call Trace:
  [ 435.617048] <IRQ>
  [ 435.617051] enqueue_entity+0x9f/0x6b0
  [ 435.617053] enqueue_task_fair+0x6c/0x7f0
  [ 435.617056] activate_task+0x57/0xc0
  [ 435.617059] ? sched_clock+0x9/0x10
  [ 435.617060] ? sched_clock+0x9/0x10
  [ 435.617061] ttwu_do_activate+0x49/0x90
  [ 435.617063] try_to_wake_up+0x1df/0x490
  [ 435.617065] default_wake_function+0x12/0x20
  [ 435.617067] autoremove_wake_function+0x12/0x40
  [ 435.617068] __wake_up_common+0x73/0x130
  [ 435.617069] __wake_up_common_lock+0x80/0xc0
  [ 435.617070] __wake_up+0x13/0x20
  [ 435.617073] __wbt_done.part.21+0xa4/0xb0
  [ 435.617074] wbt_done+0x72/0xa0
  [ 435.617077] blk_mq_free_request+0xca/0x1a0
  [ 435.617079] blk_mq_end_request+0x48/0x90
  [ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
  [ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
  [ 435.617088] __blk_mq_complete_request+0xd2/0x140
  [ 435.617090] blk_mq_complete_request+0x18/0x20
  [ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
  [ 435.617093] nvme_irq+0x23/0x50 [nvme]
  [ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
  [ 435.617097] handle_irq_event_percpu+0x32/0x80
  [ 435.617098] handle_irq_event+0x3b/0x60
  [ 435.617101] handle_edge_irq+0x7c/0x190
  [ 435.617102] handle_irq+0x20/0x30
  [ 435.617106] do_IRQ+0x46/0xd0
  [ 435.617107] common_interrupt+0x84/0x84
  [ 435.617108] </IRQ>
  ...
  [ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
  [ 435.617118] cpuidle_enter+0x17/0x20
  [ 435.617119] call_cpuidle+0x23/0x40
  [ 435.617121] do_idle+0x18c/0x1f0
  [ 435.617122] cpu_startup_entry+0x73/0x80
  [ 435.617125] start_secondary+0x1a6/0x200
  [ 435.617127] secondary_startup_64+0xa5/0xb0
  [ 435.617128] Code: ...

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1810998/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp