I gave this another spin today with 6.5.0-17-generic #17~22.04.1 and the
LRM modules of the 535 driver (6.5.0-17.17~22.04.1+1 of
linux-modules-nvidia-535-server-generic-hwe-22.04) on our Altra system
with 2x L4 GPUs, and the same problem exists as with the DKMS modules:

[   39.437849] watchdog: BUG: soft lockup - CPU#62 stuck for 26s! [systemd-udevd:850]
[   39.445411] Modules linked in: nvidia(POE+) crct10dif_ce polyval_ce polyval_generic ghash_ce ast mlx5_core video drm_shmem_helper sm4 mlxfw sha2_ce drm_kms_helper nvme psample sha256_arm64 sha1_ce nvme_core igb drm tls xhci_pci nvme_common pci_hyperv_intf xhci_pci_renesas i2c_algo_bit aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[   39.474949] CPU: 62 PID: 850 Comm: systemd-udevd Tainted: P           OE      6.5.0-17-generic #17~22.04.1-Ubuntu
[   39.485196] Hardware name: GIGABYTE G242-P30-JG/MP32-AR0-JG, BIOS F07 03/22/2021
[   39.492578] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   39.499526] pc : smp_call_function_many_cond+0x19c/0x720
[   39.504830] lr : smp_call_function_many_cond+0x1b8/0x720
[   39.510130] sp : ffff80008934b920
[   39.513431] x29: ffff80008934b920 x28: ffffaef99146dd10 x27: 0000000000000000
[   39.520554] x26: 000000000000004f x25: ffff085dcfffbb80 x24: 0000000000000026
[   39.527677] x23: 0000000000000001 x22: ffff085dcfdd6708 x21: ffffaef9914726e0
[   39.534799] x20: ffff085dcfadbb80 x19: ffff085dcfdd6700 x18: ffff800089341060
[   39.541921] x17: 0000000000000000 x16: 0000000000000000 x15: 43535f5f00656c75
[   39.549044] x14: 0c030b111b111303 x13: 0000000000000006 x12: 3931413337353339
[   39.556166] x11: 0101010101010101 x10: 000000000000004f x9 : ffffaef98ee015b8
[   39.563289] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 000000000000003e
[   39.570411] x5 : ffffaef99146d000 x4 : 0000000000000000 x3 : ffff085dcfadbb88
[   39.577533] x2 : 0000000000000026 x1 : 0000000000000011 x0 : 0000000000000000
[   39.584656] Call trace:
[   39.587090]  smp_call_function_many_cond+0x19c/0x720
[   39.592043]  kick_all_cpus_sync+0x50/0xa8
[   39.596040]  flush_module_icache+0x94/0xf8
[   39.600125]  load_module+0x448/0x8e0
[   39.603688]  init_module_from_file+0x94/0x110
[   39.608033]  idempotent_init_module+0x194/0x2b0
[   39.612551]  __arm64_sys_finit_module+0x74/0x100
[   39.617155]  invoke_syscall+0x7c/0x130
[   39.620892]  el0_svc_common.constprop.0+0x5c/0x170
[   39.625670]  do_el0_svc+0x38/0x68
[   39.628972]  el0_svc+0x30/0xe0
[   39.632016]  el0t_64_sync_handler+0x128/0x158
[   39.636360]  el0t_64_sync+0x1b0/0x1b8
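For anyone triaging similar captures, the stuck CPU and task/PID can be pulled out of a console log with a quick grep/sed pass. This is just a sketch: the sample line below is the watchdog message from the trace above, and in practice you would point the pipeline at your own console capture file instead.

```shell
# Sketch: extract the stuck CPU and task/PID from a watchdog soft-lockup line.
# 'line' holds the message from the trace above; replace the printf with
# e.g. 'grep "soft lockup" console.log' for a real capture.
line='[   39.437849] watchdog: BUG: soft lockup - CPU#62 stuck for 26s! [systemd-udevd:850]'

# Which CPU locked up:
printf '%s\n' "$line" | grep -o 'CPU#[0-9]*'

# Which task (name:pid) was running on it:
printf '%s\n' "$line" | sed -n 's/.*\[\([a-z0-9_-]*\):\([0-9]*\)\]$/task=\1 pid=\2/p'
```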

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-hwe-6.5 in Ubuntu.
https://bugs.launchpad.net/bugs/2029934

Title:
  arm64 AWS host hangs during modprobe nvidia on lunar and mantic

Status in linux-aws package in Ubuntu:
  Incomplete
Status in linux-hwe-6.5 package in Ubuntu:
  New
Status in nvidia-graphics-drivers-525 package in Ubuntu:
  Incomplete
Status in nvidia-graphics-drivers-525-server package in Ubuntu:
  Incomplete
Status in nvidia-graphics-drivers-535 package in Ubuntu:
  Confirmed
Status in nvidia-graphics-drivers-535-server package in Ubuntu:
  Confirmed

Bug description:
  Loading the nvidia driver DKMS modules with "modprobe nvidia" will
  result in the host hanging and being completely unusable. This was
  reproduced using both the linux generic and linux-aws kernels on lunar
  and mantic using an AWS g5g.xlarge instance.

  To reproduce using the generic kernel:
  # Deploy an arm64 host with an NVIDIA GPU, such as an AWS g5g.xlarge.

  # Install the linux generic kernel from lunar-updates:
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y -o DPkg::Options::=--force-confold linux-generic

  # Boot to the linux-generic kernel (this can be accomplished by removing the existing kernel; in this case it was the linux-aws 6.2.0-1008-aws kernel)
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get purge -y -o DPkg::Options::=--force-confold linux-aws linux-aws-headers-6.2.0-1008 linux-headers-6.2.0-1008-aws linux-headers-aws linux-image-6.2.0-1008-aws linux-image-aws linux-modules-6.2.0-1008-aws
  $ reboot

  # Install the Nvidia 535-server driver DKMS package:
  $ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-driver-535-server

  # Enable the driver
  $ sudo modprobe nvidia

  # At this point the system will hang and never return.
  # A reboot instead of a modprobe will result in a system that never boots up all the way. I was able to recover the console logs from such a system and found (the full captured log is attached):

  [    1.964942] nvidia: loading out-of-tree module taints kernel.
  [    1.965475] nvidia: module license 'NVIDIA' taints kernel.
  [    1.965905] Disabling lock debugging due to kernel taint
  [    1.980905] nvidia: module verification failed: signature and/or required key missing - tainting kernel
  [    2.012067] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
  [    2.012715] 
  [   62.025143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [   62.025807] rcu:   3-...0: (14 ticks this GP) idle=c04c/1/0x4000000000000000 softirq=653/654 fqs=3301
  [   62.026516]        (detected by 0, t=15003 jiffies, g=-699, q=216 ncpus=4)
  [   62.027018] Task dump for CPU 3:
  [   62.027290] task:systemd-udevd   state:R  running task     stack:0     pid:164   ppid:144    flags:0x0000000e
  [   62.028066] Call trace:
  [   62.028273]  __switch_to+0xbc/0x100
  [   62.028567]  0x228
  Timed out for waiting the udev queue being empty.
  Timed out for waiting the udev queue being empty.
  [  242.045143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [  242.045655] rcu:   3-...0: (14 ticks this GP) idle=c04c/1/0x4000000000000000 softirq=653/654 fqs=12303
  [  242.046373]        (detected by 1, t=60008 jiffies, g=-699, q=937 ncpus=4)
  [  242.046874] Task dump for CPU 3:
  [  242.047146] task:systemd-udevd   state:R  running task     stack:0     pid:164   ppid:144    flags:0x0000000f
  [  242.047922] Call trace:
  [  242.048128]  __switch_to+0xbc/0x100
  [  242.048417]  0x228
  Timed out for waiting the udev queue being empty.
  Begin: Loading essential drivers ... [  384.001142] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [modprobe:215]
  [  384.001738] Modules linked in: nvidia(POE+) crct10dif_ce video polyval_ce polyval_generic drm_kms_helper ghash_ce syscopyarea sm4 sysfillrect sha2_ce sysimgblt sha256_arm64 sha1_ce drm nvme nvme_core ena nvme_common aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
  [  384.003513] CPU: 2 PID: 215 Comm: modprobe Tainted: P           OE      6.2.0-26-generic #26-Ubuntu
  [  384.004210] Hardware name: Amazon EC2 g5g.xlarge/, BIOS 1.0 11/1/2018
  [  384.004715] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
  [  384.005259] pc : smp_call_function_many_cond+0x1b4/0x4b4
  [  384.005683] lr : smp_call_function_many_cond+0x1d0/0x4b4
  [  384.006108] sp : ffff8000089a3a70
  [  384.006381] x29: ffff8000089a3a70 x28: 0000000000000003 x27: ffff00056d1fafa0
  [  384.006954] x26: ffff00056d1d76c8 x25: ffffc87cf18bdd10 x24: 0000000000000003
  [  384.007527] x23: 0000000000000001 x22: ffff00056d1d76c8 x21: ffffc87cf18c2690
  [  384.008086] x20: ffff00056d1fafa0 x19: ffff00056d1d76c0 x18: ffff80000896d058
  [  384.008645] x17: 0000000000000000 x16: 0000000000000000 x15: 617362755f5f0073
  [  384.009209] x14: 0000000000000001 x13: 0000000000000006 x12: 4630354535323145
  [  384.009779] x11: 0101010101010101 x10: ffffb78318e9c0e0 x9 : ffffc87ceeac7da4
  [  384.010339] x8 : ffff00056d1d76f0 x7 : 0000000000000000 x6 : 0000000000000000
  [  384.010894] x5 : 0000000000000004 x4 : 0000000000000000 x3 : ffff00056d1fafa8
  [  384.011464] x2 : 0000000000000003 x1 : 0000000000000011 x0 : 0000000000000000
  [  384.012030] Call trace:
  [  384.012241]  smp_call_function_many_cond+0x1b4/0x4b4
  [  384.012635]  kick_all_cpus_sync+0x50/0xa0
  [  384.012961]  flush_module_icache+0x64/0xd0
  [  384.013294]  load_module+0x4ec/0xb54
  [  384.013588]  __do_sys_finit_module+0xb0/0x150
  [  384.013944]  __arm64_sys_finit_module+0x2c/0x50
  [  384.014306]  invoke_syscall+0x7c/0x124
  [  384.014613]  el0_svc_common.constprop.0+0x5c/0x1cc
  [  384.015000]  do_el0_svc+0x38/0x60
  [  384.015280]  el0_svc+0x30/0xe0
  [  384.015540]  el0t_64_sync_handler+0x11c/0x150
  [  384.015896]  el0t_64_sync+0x1a8/0x1ac

  This same procedure impacts the 525, 525-server, 535 and 535-server
  drivers. It does *not* hang a similarly configured host running focal
  or jammy.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-aws/+bug/2029934/+subscriptions

