Status changed to 'Confirmed' because the bug affects multiple users.
** Changed in: linux-aws (Ubuntu)
Status: New => Confirmed
https://bugs.launchpad.net/bugs/2029934
Title:
arm64 AWS host hangs during modprobe nvidia on lunar and mantic
Status in linux-aws package in Ubuntu:
Confirmed
Status in nvidia-graphics-drivers-525 package in Ubuntu:
Confirmed
Status in nvidia-graphics-drivers-525-server package in Ubuntu:
Confirmed
Status in nvidia-graphics-drivers-535 package in Ubuntu:
Confirmed
Status in nvidia-graphics-drivers-535-server package in Ubuntu:
Confirmed
Bug description:
Loading the NVIDIA driver DKMS modules with "modprobe nvidia" hangs the
host and leaves it completely unusable. This was reproduced with both
the linux-generic and linux-aws kernels on lunar and mantic, using an
AWS g5g.xlarge instance.
To reproduce using the generic kernel:
# Deploy an arm64 host with an NVIDIA GPU, such as an AWS g5g.xlarge.
# Install the linux generic kernel from lunar-updates:
$ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
    -o DPkg::Options::=--force-confold linux-generic
# Boot to the linux-generic kernel (this can be accomplished by removing the
# existing kernel; in this case it was the linux-aws 6.2.0-1008-aws kernel):
$ sudo DEBIAN_FRONTEND=noninteractive apt-get purge -y \
    -o DPkg::Options::=--force-confold linux-aws linux-aws-headers-6.2.0-1008 \
    linux-headers-6.2.0-1008-aws linux-headers-aws linux-image-6.2.0-1008-aws \
    linux-image-aws linux-modules-6.2.0-1008-aws
$ sudo reboot
# Install the Nvidia 535-server driver DKMS package:
$ sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
    nvidia-driver-535-server
# Enable the driver
$ sudo modprobe nvidia
# At this point the system will hang and never return.
# Rebooting instead of running modprobe results in a system that never
# finishes booting. I was able to recover the console logs from such a
# system and found the following (the full captured log is attached):
[ 1.964942] nvidia: loading out-of-tree module taints kernel.
[ 1.965475] nvidia: module license 'NVIDIA' taints kernel.
[ 1.965905] Disabling lock debugging due to kernel taint
[ 1.980905] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.012067] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 2.012715]
[ 62.025143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 62.025807] rcu: 3-...0: (14 ticks this GP) idle=c04c/1/0x4000000000000000 softirq=653/654 fqs=3301
[ 62.026516] (detected by 0, t=15003 jiffies, g=-699, q=216 ncpus=4)
[ 62.027018] Task dump for CPU 3:
[ 62.027290] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x0000000e
[ 62.028066] Call trace:
[ 62.028273] __switch_to+0xbc/0x100
[ 62.028567] 0x228
Timed out for waiting the udev queue being empty.
Timed out for waiting the udev queue being empty.
[ 242.045143] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 242.045655] rcu: 3-...0: (14 ticks this GP) idle=c04c/1/0x4000000000000000 softirq=653/654 fqs=12303
[ 242.046373] (detected by 1, t=60008 jiffies, g=-699, q=937 ncpus=4)
[ 242.046874] Task dump for CPU 3:
[ 242.047146] task:systemd-udevd state:R running task stack:0 pid:164 ppid:144 flags:0x0000000f
[ 242.047922] Call trace:
[ 242.048128] __switch_to+0xbc/0x100
[ 242.048417] 0x228
Timed out for waiting the udev queue being empty.
Begin: Loading essential drivers ... [ 384.001142] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [modprobe:215]
[ 384.001738] Modules linked in: nvidia(POE+) crct10dif_ce video polyval_ce polyval_generic drm_kms_helper ghash_ce syscopyarea sm4 sysfillrect sha2_ce sysimgblt sha256_arm64 sha1_ce drm nvme nvme_core ena nvme_common aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 384.003513] CPU: 2 PID: 215 Comm: modprobe Tainted: P OE 6.2.0-26-generic #26-Ubuntu
[ 384.004210] Hardware name: Amazon EC2 g5g.xlarge/, BIOS 1.0 11/1/2018
[ 384.004715] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 384.005259] pc : smp_call_function_many_cond+0x1b4/0x4b4
[ 384.005683] lr : smp_call_function_many_cond+0x1d0/0x4b4
[ 384.006108] sp : ffff8000089a3a70
[ 384.006381] x29: ffff8000089a3a70 x28: 0000000000000003 x27: ffff00056d1fafa0
[ 384.006954] x26: ffff00056d1d76c8 x25: ffffc87cf18bdd10 x24: 0000000000000003
[ 384.007527] x23: 0000000000000001 x22: ffff00056d1d76c8 x21: ffffc87cf18c2690
[ 384.008086] x20: ffff00056d1fafa0 x19: ffff00056d1d76c0 x18: ffff80000896d058
[ 384.008645] x17: 0000000000000000 x16: 0000000000000000 x15: 617362755f5f0073
[ 384.009209] x14: 0000000000000001 x13: 0000000000000006 x12: 4630354535323145
[ 384.009779] x11: 0101010101010101 x10: ffffb78318e9c0e0 x9 : ffffc87ceeac7da4
[ 384.010339] x8 : ffff00056d1d76f0 x7 : 0000000000000000 x6 : 0000000000000000
[ 384.010894] x5 : 0000000000000004 x4 : 0000000000000000 x3 : ffff00056d1fafa8
[ 384.011464] x2 : 0000000000000003 x1 : 0000000000000011 x0 : 0000000000000000
[ 384.012030] Call trace:
[ 384.012241] smp_call_function_many_cond+0x1b4/0x4b4
[ 384.012635] kick_all_cpus_sync+0x50/0xa0
[ 384.012961] flush_module_icache+0x64/0xd0
[ 384.013294] load_module+0x4ec/0xb54
[ 384.013588] __do_sys_finit_module+0xb0/0x150
[ 384.013944] __arm64_sys_finit_module+0x2c/0x50
[ 384.014306] invoke_syscall+0x7c/0x124
[ 384.014613] el0_svc_common.constprop.0+0x5c/0x1cc
[ 384.015000] do_el0_svc+0x38/0x60
[ 384.015280] el0_svc+0x30/0xe0
[ 384.015540] el0t_64_sync_handler+0x11c/0x150
[ 384.015896] el0t_64_sync+0x1a8/0x1ac
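For anyone triaging a similar capture, the two failure signatures in the log above (the RCU stall reports and the soft-lockup watchdog message) and the symbols in the call traces can be pulled out of a saved console log with a small shell sketch like the one below. The function names are hypothetical and not part of this report, and the sketch assumes a raw console log in which each kernel message sits on a single line, not an email-wrapped copy.

```shell
#!/bin/sh
# Hypothetical triage helpers (not part of the bug report).
# The grep patterns are copied from the kernel messages quoted in the log.

# Print how many RCU stall reports and soft-lockup reports a log contains.
count_hang_markers() {
    log=$1
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    stalls=$(grep -c 'rcu_preempt detected stalls' "$log" || true)
    lockups=$(grep -c 'watchdog: BUG: soft lockup' "$log" || true)
    printf 'rcu_stalls=%s soft_lockups=%s\n' "$stalls" "$lockups"
}

# Print the symbol name from every "symbol+0xOFFSET/0xSIZE" trace line,
# dropping the "[ timestamp]" prefix and the offset/size suffix.
trace_symbols() {
    sed -n 's|^\[ *[0-9.]*] *\([A-Za-z_][A-Za-z0-9_.]*\)+0x[0-9a-f]*/0x[0-9a-f]*.*$|\1|p' "$1"
}
```

After sourcing the functions, `count_hang_markers console.log` prints the two counts on one line and `trace_symbols console.log` prints one symbol per line; `console.log` is a placeholder file name. Register lines such as `pc :` and `lr :` are skipped on purpose, since only lines that start with a symbol right after the timestamp are matched.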
This same procedure impacts the 525, 525-server, 535 and 535-server
drivers. It does *not* hang a similarly configured host running focal
or jammy.
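The affected and unaffected combinations described above can be written down as a small test matrix, sketched here in shell. This is only an illustration: the function names are hypothetical, package names other than nvidia-driver-535-server are assumed to follow the same nvidia-driver-<flavour> pattern, and the report does not say that every driver flavour was individually tested on focal and jammy, so the "ok" rows are an extrapolation from "a similarly configured host".

```shell
#!/bin/sh
# Hypothetical regression matrix derived from the bug description.

# Outcome the report gives for "modprobe nvidia" on each Ubuntu release.
expected_result() {
    case $1 in
        lunar|mantic) echo hang ;;     # reported as hanging
        focal|jammy)  echo ok ;;       # reported as not hanging
        *)            echo unknown ;;  # not mentioned in the report
    esac
}

# Print one "release package expected" row per combination.
print_matrix() {
    for release in focal jammy lunar mantic; do
        for flavour in 525 525-server 535 535-server; do
            # Assumption: every flavour is packaged as nvidia-driver-<flavour>.
            printf '%s nvidia-driver-%s %s\n' \
                "$release" "$flavour" "$(expected_result "$release")"
        done
    done
}
```

Calling `print_matrix` lists the combinations a regression run would need to cover when verifying a fix.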