** Changed in: fabric-manager-535 (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
     Assignee: (unassigned) => Mitchell Augustin (mitchellaugustin)

** Changed in: fabric-manager-535 (Ubuntu)
       Status: New => Fix Released

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Changed in: nvidia-graphics-drivers-535-server (Ubuntu)
       Status: New => Fix Released

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2052663

Title:
  fabric-manager-535 setup fails during install on Grace/Hopper arm64
  system running noble

Status in fabric-manager-535 package in Ubuntu:
  Fix Released
Status in linux package in Ubuntu:
  Fix Released
Status in nvidia-graphics-drivers-535-server package in Ubuntu:
  Fix Released

Bug description:
  This error occurs on both the standard and largemem variants of the latest Noble server build of Ubuntu:
  Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic-64k aarch64) (iso link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207.1/noble-live-server-arm64+largemem.iso)
  Ubuntu Noble Numbat (development branch) (GNU/Linux 6.6.0-14-generic aarch64) (iso link: https://cdimage.ubuntu.com/ubuntu-server/daily-live/20240207/noble-live-server-arm64.iso)
  CPU/GPU: Nvidia Grace/Hopper

  lsb_release -rd: 
  No LSB modules are available.
  Description:  Ubuntu Noble Numbat (development branch)
  Release:      24.04

  Kernel versions affected:
  GNU/Linux 6.6.0-14-generic-64k aarch64
  GNU/Linux 6.6.0-14-generic aarch64

  Package version: nvidia-fabricmanager-535 (535.154.05-0ubuntu1 arm64)

  Expected behavior: Package starts as expected during post-install
  setup steps

  Actual behavior:
  On our Grace/Hopper system running Noble, installing
  nvidia-fabricmanager-535 froze at 60% on two separate attempts, and all
  SSH processes froze with it. After this happens, I am also unable to
  SSH back into the system.

  This is the last output I see from my installer shell:
  + apt install -y nvidia-fabricmanager-535
  Reading package lists... Done
  Building dependency tree... Done
  Reading state information... Done
  The following NEW packages will be installed:
    nvidia-fabricmanager-535
  0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
  Need to get 1795 kB of archives.
  After this operation, 8679 kB of additional disk space will be used.
  Get:1 http://ports.ubuntu.com/ubuntu-ports noble/multiverse arm64 nvidia-fabricmanager-535 arm64 535.154.05-0ubuntu1 [1795 kB]
  Fetched 1795 kB in 1s (2439 kB/s)                
  Selecting previously unselected package nvidia-fabricmanager-535.
  (Reading database ... 103745 files and directories currently installed.)
  Preparing to unpack .../nvidia-fabricmanager-535_535.154.05-0ubuntu1_arm64.deb ...
  Unpacking nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
  Setting up nvidia-fabricmanager-535 (535.154.05-0ubuntu1) ...
  Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-fabricmanager.service → /lib/systemd/system/nvidia-fabricmanager.service.

  Progress: [ 60%]
  
[#################################################################################.......................................................]

  
  This does not appear to cause a panic or reboot: I can still interact
  with the console, and the apt process still shows up in ps aux,
  although it does not seem to make progress. However, I observe the
  following output in the console, which I believe may be related:
  [ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
  [ 1477.814602] watchdog: BUG: soft lockup - CPU#16 stuck for 693s! [(udev-worker):33269]
  [ 1501.814606] watchdog: BUG: soft lockup - CPU#16 stuck for 715s! [(udev-worker):33269]
  [ 1525.814611] watchdog: BUG: soft lockup - CPU#16 stuck for 738s! [(udev-worker):33269]
  [ 1579.666718] rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks: { 17-...D } 240893 jiffies s: 653 root: 0x2/.
  [ 1579.678114] rcu: blocking rcu_node structures (internal RCU debug): l=1:15-29:0x4/.
  [ 1597.814625] watchdog: BUG: soft lockup - CPU#16 stuck for 805s! [(udev-worker):33269]
  [ 1621.814630] watchdog: BUG: soft lockup - CPU#16 stuck for 827s! [(udev-worker):33269]
  [ 1630.562655] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  [ 1630.568973] rcu:   17-...0: (1 GPs behind) idle=2444/1/0x4000000000000000 softirq=13696/13700 fqs=126842
  [ 1630.578665] rcu:            hardirqs   softirqs   csw/system
  [ 1630.584381] rcu:    number:        0          0            0
  [ 1630.590109] rcu:   cputime:        0          0            0   ==> 1110384(ms)
  [ 1630.597458] rcu:   (detected by 20, t=285099 jiffies, g=74061, q=113266 ncpus=72)
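
  For anyone triaging a similar hang, the watchdog lines above can be
  condensed into one entry per stuck CPU/task. The following is only a
  rough sketch and not part of the original report: the function name
  summarize_soft_lockups is hypothetical, and it assumes each kernel
  message sits on a single line (for example as printed by dmesg),
  rather than wrapped by a narrow console.

```python
import re

# Matches kernel watchdog lines of the form seen in this report, e.g.
# [ 1453.814597] watchdog: BUG: soft lockup - CPU#16 stuck for 670s! [(udev-worker):33269]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!\s+"
    r"\[(?P<task>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Return {(cpu, task, pid): longest reported stuck time in seconds}.

    Hypothetical helper: assumes one kernel message per line.
    Lines that are not soft-lockup reports are ignored.
    """
    worst = {}
    for line in log_text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m is None:
            continue
        key = (int(m["cpu"]), m["task"], int(m["pid"]))
        worst[key] = max(worst.get(key, 0), int(m["secs"]))
    return worst
```

  Fed the console output from this report, such a summary would show a
  single entry for CPU#16 and the udev-worker task, which makes it
  easier to see that one worker is stuck rather than several.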

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/fabric-manager-535/+bug/2052663/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
