** Changed in: fabric-manager-535 (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: fabric-manager-535 (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2052663

Title:
  fabric-manager-535 setup fails during install on Grace/Hopper arm64
  system running noble

Status in fabric-manager-535 package in Ubuntu:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in nvidia-graphics-drivers-535-server package in Ubuntu:
  Fix Released

Bug description:
  This error occurs on both the standard and largemem variants of the
  latest Noble server build of Ubuntu:

  Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic-64k aarch64)
  (iso link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207.1/noble-live-server-arm64+largemem.iso)

  Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic aarch64)
  (iso link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207/noble-live-server-arm64.iso)

  CPU/GPU: Nvidia Grace/Hopper

  lsb_release -rd:
  No LSB modules are available.
  Description: Ubuntu Noble Numbat (development branch)
  Release: 24.04

  Kernel versions affected:
  GNU/Linux 6.6.0-14-generic-64k aarch64
  GNU/Linux 6.6.0-14-generic aarch64

  Package version:
  nvidia-fabricmanager-535 (535.154.05-0ubuntu1 arm64)

  Expected behavior:
  Package starts as expected during post-install setup steps

  Actual behavior:
  On our grace/hopper system running noble, when installing
  nvidia-fabricmanager-535, the installation froze at 60% twice, along
  with all ssh processes. I am also unable to ssh back into the system
  after this happens.

  This is the last output I see from my installer shell:

  + apt install -y nvidia-fabricmanager-535
  Reading package lists... Done
  Building dependency tree... Done
  Reading state information... Done
  The following NEW packages will be installed:
    nvidia-fabricmanager-535
  0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
  Need to get 1795 kB of archives.
  After this operation, 8679 kB of additional disk space will be used.
  Get:1 http://ports.ubuntu.com/ubuntu-ports noble/multiverse arm64 nvidia-fabricmanager-535 arm64 535.154.05-0ubuntu1 [1795 kB]
  Fetched 1795 kB in 1s (2439 kB/s)
  Selecting previously unselected package nvidia-fabricmanager-535.
  (Reading database ... 103745 files and directories currently installed.)
  Preparing to unpack .../nvidia-fabricmanager-535_535.154.05-0ubuntu1_arm64.deb ...
  Unpacking nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
  Setting up nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
  Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /lib/systemd/system/nvidia-fabricmanager.service.
  Progress: [ 60%] [#################################################################################.......................................................]

  This does not appear to cause a panic/reboot, as I can still interact
  with the console, and it even appears that the apt process is still
  running in ps aux (although it doesn't seem to progress). However, I
  observe the following output in the console that I believe may be
  related:

  [ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
  [ 1477.814602] watchdog: BUG: soft lockup - CPU#16 stuck for 693s! [(udev-worker):33269]
  [ 1501.814606] watchdog: BUG: soft lockup - CPU#16 stuck for 715s! [(udev-worker):33269]
  [ 1525.814611] watchdog: BUG: soft lockup - CPU#16 stuck for 738s! [(udev-worker):33269]
  [ 1579.666718] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 17-...D } 240893 jiffies s: 653 root: 0x2/.
  [ 1579.678114] rcu: blocking rcu_node structures (internal RCU debug): l=1:15-29:0x4/.
  [ 1597.814625] watchdog: BUG: soft lockup - CPU#16 stuck for 805s! [(udev-worker):33269]
  [ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-worker):33269]
  [ 1630.562655] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [ 1630.568973] rcu: 17-...0: (1 GPs behind) idle=2444/1/0x4000000000000000 softirq=13696/13700 fqs=126842
  [ 1630.578665] rcu:          hardirqs   softirqs   csw/system
  [ 1630.584381] rcu:  number:        0          0            0
  [ 1630.590109] rcu: cputime:        0          0            0   ==> 1110384(ms)
  [ 1630.597458] rcu: (detected by 20, t=285099 jiffies, g=74061, q=113266 ncpus=72)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/fabric-manager-535/+bug/2052663/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp