The workaround identified in comment #9 can be used to bypass this. It
delays further services from starting up and attempting to interact with
the mlnx cards, an interaction that appears to leave kernel tasks blocked
beyond the 120-second hung task timeout. I'm not convinced at this
moment in time that managing the systemd service files from the charm is
the correct thing to do here. Notably, this would likely be a general
problem on Ubuntu with VFs etc. It may end up being that increasing the
timeout is a longer-term solution rather than a workaround; however, we
need to understand the problem better in order to address it in the
right place.
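
For anyone wanting to confirm which timeout is actually in effect on an
affected host, here is a minimal sketch in Python. It only reads the
sysctl file named in the kernel message quoted in the bug description.
The function names are made up for illustration, and this is not the
workaround from comment #9.

```python
from pathlib import Path

# Path taken from the kernel message quoted in the bug description.
HUNG_TASK_TIMEOUT = Path("/proc/sys/kernel/hung_task_timeout_secs")


def read_hung_task_timeout(path: Path = HUNG_TASK_TIMEOUT) -> int:
    """Return the hung-task timeout in seconds (0 disables the check)."""
    return int(path.read_text().strip())


def describe_timeout(seconds: int) -> str:
    """Render the timeout value in a human-readable form."""
    if seconds == 0:
        return "hung-task detection disabled"
    return f"hung-task warning after {seconds} seconds"


if __name__ == "__main__":
    print(describe_timeout(read_hung_task_timeout()))
```

The path is a parameter so the helper can be pointed at a copy of the
file when checking it away from the affected machine.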

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2009594

Title:
  Mlx5 kworker blocked Kernel 5.19 (Jammy HWE)

Status in charm-ovn-chassis:
  Triaged
Status in linux package in Ubuntu:
  Confirmed

Bug description:
  This is seen in particular with:
  * Charmed OpenStack with Jammy Yoga
  * 5.19.0-35-generic (linux-generic-hwe-22.04/jammy-updates)
  * Mellanox ConnectX-6 card using the mlx5_core module
  * SR-IOV with VF-LAG, used for OVN hardware offloading

  The servers quickly reach a very high load (around 75~100) during boot,
  with all processes that rely on network communication through the
  Mellanox network card being stuck or extremely slow.
  Kernel logs report kworkers being blocked for more than 120 seconds.

  The number of SR-IOV devices configured, both in the firmware and in the
  kernel, seems to correlate strongly with the likelihood of this bug
  occurring.
  Enabling more VFs seems to greatly increase the risk of hitting this bug.

  This does not happen at every boot, but with 32 VFs on each PF it occurs
  about 40% of the time.
  To recover the server, a cold reboot is required.
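
  Since the VF count seems to matter, it may help to record how many VFs
  are enabled per interface when comparing good and bad boots. A rough
  sketch in Python, assuming the standard sriov_numvfs attribute is
  exposed under /sys/class/net/<iface>/device/ (count_vfs is a
  hypothetical helper name, not something taken from the charm):

```python
from pathlib import Path


def count_vfs(sysfs_net: Path = Path("/sys/class/net")) -> dict[str, int]:
    """Map each SR-IOV capable interface to its currently enabled VF count."""
    counts = {}
    for attr in sorted(sysfs_net.glob("*/device/sriov_numvfs")):
        # attr is <sysfs_net>/<iface>/device/sriov_numvfs
        iface = attr.parent.parent.name
        counts[iface] = int(attr.read_text().strip())
    return counts


if __name__ == "__main__":
    for iface, vfs in count_vfs().items():
        print(f"{iface}: {vfs} VFs enabled")
```

  Interfaces without SR-IOV support do not have the attribute and are
  simply skipped.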

  Looking at a quick sample of the trace, this seems to directly involve
  the mlx5 driver within the kernel:

  Mar 07 05:24:56 nova-1 kernel: INFO: task kworker/0:1:19 blocked for more than 120 seconds.
  Mar 07 05:24:56 nova-1 kernel:       Tainted: P           OE     5.19.0-35-generic #36~22.04.1-Ubuntu
  Mar 07 05:24:56 nova-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  Mar 07 05:24:56 nova-1 kernel: task:kworker/0:1     state:D stack:    0 pid:   19 ppid:     2 flags:0x00004000
  Mar 07 05:24:56 nova-1 kernel: Workqueue: events work_for_cpu_fn
  Mar 07 05:24:56 nova-1 kernel: Call Trace:
  Mar 07 05:24:56 nova-1 kernel:  <TASK>
  Mar 07 05:24:56 nova-1 kernel:  __schedule+0x257/0x5d0
  Mar 07 05:24:56 nova-1 kernel:  schedule+0x68/0x110
  Mar 07 05:24:56 nova-1 kernel:  schedule_preempt_disabled+0x15/0x30
  Mar 07 05:24:56 nova-1 kernel:  __mutex_lock.constprop.0+0x4f1/0x750
  Mar 07 05:24:56 nova-1 kernel:  __mutex_lock_slowpath+0x13/0x20
  Mar 07 05:24:56 nova-1 kernel:  mutex_lock+0x3e/0x50
  Mar 07 05:24:56 nova-1 kernel:  mlx5_register_device+0x1c/0xb0 [mlx5_core]
  Mar 07 05:24:56 nova-1 kernel:  mlx5_init_one+0xe4/0x110 [mlx5_core]
  Mar 07 05:24:56 nova-1 kernel:  probe_one+0xcb/0x120 [mlx5_core]
  Mar 07 05:24:56 nova-1 kernel:  local_pci_probe+0x4b/0x90
  Mar 07 05:24:56 nova-1 kernel:  work_for_cpu_fn+0x1a/0x30
  Mar 07 05:24:56 nova-1 kernel:  process_one_work+0x21f/0x400
  Mar 07 05:24:56 nova-1 kernel:  worker_thread+0x200/0x3f0
  Mar 07 05:24:56 nova-1 kernel:  ? rescuer_thread+0x3a0/0x3a0
  Mar 07 05:24:56 nova-1 kernel:  kthread+0xee/0x120
  Mar 07 05:24:56 nova-1 kernel:  ? kthread_complete_and_exit+0x20/0x20
  Mar 07 05:24:56 nova-1 kernel:  ret_from_fork+0x22/0x30
  Mar 07 05:24:56 nova-1 kernel:  </TASK>
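
  To compare how many tasks hang across boots, the report lines can be
  pulled out of a saved log. A small sketch in Python follows; the regular
  expression is based only on the message format quoted above, and
  find_hung_tasks is a hypothetical helper name:

```python
import re

# Matches the hung-task report line format shown in the trace above.
# \s+ between words tolerates lines wrapped by a mail client.
HUNG_TASK_RE = re.compile(
    r"INFO:\s+task\s+(?P<comm>\S+):(?P<pid>\d+)\s+blocked\s+for\s+more\s+"
    r"than\s+(?P<secs>\d+)\s+seconds"
)


def find_hung_tasks(log_text: str) -> list[tuple[str, int, int]]:
    """Return (task name, pid, seconds) for each hung-task report found."""
    return [
        (m["comm"], int(m["pid"]), int(m["secs"]))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


if __name__ == "__main__":
    import sys

    for comm, pid, secs in find_hung_tasks(sys.stdin.read()):
        print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

  The script reads the log from standard input, so saved kernel log
  output can be piped straight into it.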

To manage notifications about this bug go to:
https://bugs.launchpad.net/charm-ovn-chassis/+bug/2009594/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
