On 7/8/20 5:44 PM, Parav Pandit wrote:

... snip ..

>>
>
> It is likely because events_cleanup() freed the memory using kvfree() that
> health recovery context is trying to access in notifier chain.
>
> While reviewing I see few more errors as below.
> (a) mlx5_pci_err_detected() invokes mlx5_drain_health_wq() outside of
>     intf_state_mutex.
> (b) mlx5_enter_error_state() in commit b6e0b6bebe0 read and updates dev state
>     outside of intf_state_mutex.
> (c) due to drain_health_wq() introduction in mlx5_pci_close() in commit
>     42ea9f1b5c625 health_wq is flushed twice.
> (d) priv->events freed is kvfree() but allocated using kzalloc().
>
> This code certainly needs rework.
>
> In meantime to avoid this regression, I believe below hunk eliminates error
> introduced in patch 41798df9bfc.
> Will you please help test it?
>
> Shay and I did the review of below patch.
> If it works I will get it through Saeed's tree and internal reviews.
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> index ebec2318dbc4..529df5703737 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
> @@ -785,11 +785,6 @@ static int mlx5_pci_init(struct mlx5_core_dev *dev, struct pci_dev *pdev,
>
>  static void mlx5_pci_close(struct mlx5_core_dev *dev)
>  {
> -	/* health work might still be active, and it needs pci bar in
> -	 * order to know the NIC state. Therefore, drain the health WQ
> -	 * before removing the pci bars
> -	 */
> -	mlx5_drain_health_wq(dev);
>  	iounmap(dev->iseg);
>  	pci_clear_master(dev->pdev);
>  	release_bar(dev->pdev);
> @@ -1235,6 +1230,7 @@ void mlx5_unload_one(struct mlx5_core_dev *dev, bool cleanup)
>  	if (cleanup) {
>  		mlx5_unregister_device(dev);
>  		mlx5_devlink_unregister(priv_to_devlink(dev));
> +		mlx5_drain_health_wq(dev);

I think with the above you can remove the mlx5_drain_health_wq(dev) call in
remove_one(), as that calls mlx5_unload_one() with cleanup == true.
I also wonder whether it is a problem that the order of
mlx5_devlink_unregister(priv_to_devlink(dev)) and mlx5_unregister_device(dev)
is switched compared to the 5.7 code. That said, changing both seems to
result in a deadlock, though not the "leak of a command resource":
[   48.082222] mlx5_core 0000:00:00.0: poll_health:702:(pid 0): Fatal error 1 detected
[   48.082296] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[0] 0xffffffff
[   48.082303] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[1] 0xffffffff
[   48.082309] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[2] 0xffffffff
[   48.082316] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[3] 0xffffffff
[   48.082322] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[4] 0xffffffff
[   48.082329] mlx5_core 0000:00:00.0: print_health_info:383:(pid 0): assert_exit_ptr 0xffffffff
[   48.082336] mlx5_core 0000:00:00.0: print_health_info:385:(pid 0): assert_callra 0xffffffff
[   48.082345] mlx5_core 0000:00:00.0: print_health_info:388:(pid 0): fw_ver 65535.65535.65535
[   48.082353] mlx5_core 0000:00:00.0: print_health_info:389:(pid 0): hw_id 0xffffffff
[   48.082360] mlx5_core 0000:00:00.0: print_health_info:390:(pid 0): irisc_index 255
[   48.082374] mlx5_core 0000:00:00.0: print_health_info:391:(pid 0): synd 0xff: unrecognized error
[   48.082390] mlx5_core 0000:00:00.0: print_health_info:393:(pid 0): ext_synd 0xffff
[   48.082396] mlx5_core 0000:00:00.0: print_health_info:395:(pid 0): raw fw_ver 0xffffffff
[   48.082824] crw_info : CRW reports slct=0, oflw=0, chn=0, rsc=B, anc=0, erc=0, rsid=0
[   48.082852] mlx5_core 0000:00:00.0: PME# disabled
[  244.103844] INFO: task kworker/u128:1:63 blocked for more than 122 seconds.
[  244.104076]       Not tainted 5.8.0-rc4-dirty #2
[  244.104081] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  244.104086] kworker/u128:1  D13264    63      2 0x00000000
[  244.104158] Workqueue: mlx5_health0000:00:00.0 mlx5_fw_fatal_reporter_err_work [mlx5_core]
[  244.104164] Call Trace:
[  244.104174]  [<0000000031cf1c46>] __schedule+0x2d6/0x5a8
[  244.104180]  [<0000000031cf1f72>] schedule+0x5a/0x130
[  244.104186]  [<0000000031cf2704>] schedule_preempt_disabled+0x2c/0x48
[  244.104192]  [<0000000031cf49c2>] __mutex_lock+0x372/0x960
[  244.104198]  [<0000000031cf4fe2>] mutex_lock_nested+0x32/0x40
[  244.104219]  [<000003ff801e4978>] mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[  244.104238]  [<000003ff801e4a0c>] mlx5_fw_fatal_reporter_err_work+0x34/0xb8 [mlx5_core]
[  244.104249]  [<000000003117f84c>] process_one_work+0x27c/0x478
[  244.104255]  [<000000003117faae>] worker_thread+0x66/0x368
[  244.104275]  [<0000000031188a96>] kthread+0x176/0x1a0
[  244.104280]  [<0000000031cfa820>] ret_from_fork+0x24/0x2c
[  244.104285] 3 locks held by kworker/u128:1/63:
[  244.104289]  #0: 00000000d0392948 ((wq_completion)mlx5_health0000:00:00.0){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[  244.104298]  #1: 000003e00295fe18 ((work_completion)(&health->fatal_report_work)){+.+.}-{0:0}, at: process_one_work+0x1dc/0x478
[  244.104305]  #2: 00000000d1b2cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_enter_error_state+0xa0/0x100 [mlx5_core]
[  244.104328] INFO: task kmcheck:73 blocked for more than 122 seconds.
[  244.104333]       Not tainted 5.8.0-rc4-dirty #2
[  244.104337] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  244.104341] kmcheck         D10888    73      2 0x00000000
[  244.104350] Call Trace:
[  244.104355]  [<0000000031cf1c46>] __schedule+0x2d6/0x5a8
[  244.104360]  [<0000000031cf1f72>] schedule+0x5a/0x130
[  244.104365]  [<0000000031cf8182>] schedule_timeout+0xfa/0x138
[  244.104370]  [<0000000031cf37e8>] wait_for_completion+0xc0/0x110
[  244.104374]  [<0000000031180798>] __flush_work+0xd8/0x120
[  244.104380]  [<0000000031180f88>] __cancel_work_timer+0x150/0x208
[  244.104398]  [<000003ff801d9c3a>] mlx5_unload_one+0x52/0x148 [mlx5_core]
[  244.104416]  [<000003ff801d9dc2>] remove_one+0x52/0xd8 [mlx5_core]
[  244.104422]  [<0000000031982500>] pci_device_remove+0x40/0x90
[  244.104429]  [<00000000319f0012>] __device_release_driver+0x17a/0x220
[  244.104433]  [<00000000319f00f8>] device_release_driver+0x40/0x50
[  244.104439]  [<0000000031977dcc>] pci_stop_bus_device+0x94/0xc0
[  244.104444]  [<00000000319780c0>] pci_stop_and_remove_bus_device_locked+0x30/0x48
[  244.104449]  [<000000003114efd6>] __zpci_event_availability+0x12e/0x318
[  244.104456]  [<0000000031c566fc>] chsc_process_event_information.constprop.0+0x214/0x228
[  244.104461]  [<0000000031c5daa8>] crw_collect_info+0x280/0x350
[  244.104466]  [<0000000031188a96>] kthread+0x176/0x1a0
[  244.104470]  [<0000000031cfa820>] ret_from_fork+0x24/0x2c
[  244.104475] 4 locks held by kmcheck/73:
[  244.104479]  #0: 0000000032205e48 (crw_handler_mutex){+.+.}-{3:3}, at: crw_collect_info+0x250/0x350
[  244.104486]  #1: 00000000321e5668 (pci_rescan_remove_lock){+.+.}-{3:3}, at: pci_stop_and_remove_bus_device_locked+0x26/0x48
[  244.104493]  #2: 00000000d7fd1238 (&dev->mutex){....}-{3:3}, at: device_release_driver+0x32/0x50
[  244.104553]  #3: 00000000d1b2cbb8 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_unload_one+0x38/0x148 [mlx5_core]

>  	} else {
>  		mlx5_detach_device(dev);
>  	}
> @@ -1366,6 +1362,11 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *id)
>  	return 0;
>
>  err_load_one:
> +	/* health work might still be active, and it needs pci bar in
> +	 * order to know the NIC state. Therefore, drain the health WQ
> +	 * before removing the pci bars
> +	 */
> +	mlx5_drain_health_wq(dev);
>  	mlx5_pci_close(dev);
> pci_init_err:
>  	mlx5_mdev_uninit(dev);
>

As is, the patch above fixes the dereference but results in the same
completion error as current 5.8-rc4:

[  103.257202] mlx5_core 0000:00:00.0: poll_health:702:(pid 930): Fatal error 1 detected
[  103.257405] mlx5_core 0000:00:00.0: print_health_info:380:(pid 930): assert_var[0] 0xffffffff
...
[  114.268505] mlx5_core 0000:00:00.0: poll_health:717:(pid 0): device's health compromised - reached miss count
[  114.268524] mlx5_core 0000:00:00.0: print_health_info:380:(pid 0): assert_var[0] 0xffffffff
...
[  167.308576] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): 2RST_QP(0x50a) timeout. Will cause a leak of a command resource
[  167.308799] infiniband mlx5_0: destroy_qp_common:2367:(pid 73): mlx5_ib: modify QP 0x000724 to RESET failed
[  198.028576] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_VPORT_COUNTER(0x770) timeout. Will cause a leak of a command resource
[  228.748700] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource
[  259.468867] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_Q_COUNTER(0x773) timeout. Will cause a leak of a command resource
[  290.189284] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_CQ(0x401) timeout. Will cause a leak of a command resource
[  320.909972] mlx5_core 0000:00:00.0: wait_func:1008:(pid 407): QUERY_Q_COUNTER(0x773) timeout. Will cause a leak of a command resource
[  351.628945] mlx5_core 0000:00:00.0: wait_func:1008:(pid 73): DESTROY_CQ(0x401) timeout. Will cause a leak of a command resource

>
>
... snip ...