You have been subscribed to a public bug:
An AMD Milan Delta system with an HGX A100 8-GPU board fails to
detect all 8 GPUs because the fabric manager cannot be enabled on
either Ubuntu 18.04 or 20.04. Other Linux distributions, such as
CentOS and RHEL, detect all 8 GPUs without problems.
From a clean Ubuntu 18.04 install
A100 Delta board
Output from systemctl status nvidia-fabricmanager (the process terminated
due to an NVSwitch driver failure)
------------
Feb 06 04:44:11 milan-delta systemd[1]: Starting NVIDIA fabric manager
service...
Feb 06 04:44:12 milan-delta nv-fabricmanager[64822]: request to query NVSwitch
device information from NVSw>
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-fabricmanager.service: Control
process exited, code=exited, >
Feb 06 04:44:12 milan-delta systemd[1]: nvidia-fabricmanager.service: Failed
with result 'exit-code'.
Feb 06 04:44:12 milan-delta systemd[1]: Failed to start NVIDIA fabric manager
service.
------------
Syslog output
-----------
Feb 6 04:44:14 milan-delta kernel: [ 1185.231538] NVRM: GPU 0000:85:00.0:
RmInitAdapter failed! (0x23:0xffff:624)
Feb 6 04:44:14 milan-delta kernel: [ 1185.231895] NVRM: GPU 0000:85:00.0:
rm_init_adapter failed, device minor number 2
Feb 6 04:44:14 milan-delta nvidia-persistenced: device 0000:85:00.0 - failed to
open.
Feb 6 04:44:14 milan-delta nvidia-persistenced: device 0000:8b:00.0 - registered
-----------
dmesg output
-----------
[ 1170.435712] NVRM: This PCI I/O region assigned to your NVIDIA device is
invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:45:00.0)
[ 1170.435714] NVRM: The system BIOS may have misconfigured your GPU.
[ 1170.435725] nvidia: probe of 0000:45:00.0 failed with error -1
[ 1182.379923] nvidia: loading out-of-tree module taints kernel.
[ 1182.379936] nvidia: module license 'NVIDIA' taints kernel.
[ 1182.379937] Disabling lock debugging due to kernel taint
[ 1182.389651] nvidia: module verification failed: signature and/or required
key missing - tainting kernel
[ 1182.406795] nvidia-nvlink: Nvlink Core is being initialized, major device
number 235
[ 1182.406939] nvidia-nvswitch: Probing device 0000:d4:00.0, Vendor Id =
0x10de, Device Id = 0x1af1, Class = 0x68000
[ 1182.407252] nvidia-nvswitch0: Failed to map BAR0 region : -12
-----------
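The BAR0 failures in the dmesg output above can be pulled out mechanically when comparing a failing Ubuntu boot against a working CentOS or RHEL boot. The following is only a minimal sketch, not an NVIDIA or Ubuntu tool: the function name find_bar0_failures and the "kind" labels are hypothetical, the patterns are derived solely from the log lines quoted in this report, and pairing the "Failed to map BAR0 region" message with the most recently probed NVSwitch device is an assumption, since that message does not carry a PCI address itself.

```python
import re

# Sample lines copied from the dmesg output in this report
# (mail-wrapped lines rejoined).
SAMPLE = """\
[ 1170.435712] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
NVRM: BAR0 is 0M @ 0x0 (PCI:0000:45:00.0)
[ 1170.435725] nvidia: probe of 0000:45:00.0 failed with error -1
[ 1182.406939] nvidia-nvswitch: Probing device 0000:d4:00.0, Vendor Id = 0x10de, Device Id = 0x1af1, Class = 0x68000
[ 1182.407252] nvidia-nvswitch0: Failed to map BAR0 region : -12
"""

# PCI address in domain:bus:device.function form, e.g. 0000:45:00.0
PCI = r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]"

INVALID_BAR0 = re.compile(rf"BAR0 is 0M @ 0x0 \(PCI:({PCI})\)")
PROBE_FAILED = re.compile(rf"probe of ({PCI}) failed with error (-?\d+)")
SWITCH_PROBE = re.compile(rf"nvidia-nvswitch: Probing device ({PCI})")
SWITCH_BAR0_FAIL = re.compile(
    r"nvidia-nvswitch\d+: Failed to map BAR0 region : (-?\d+)"
)


def find_bar0_failures(text):
    """Return a list of (pci_address, kind, detail) tuples (hypothetical labels)."""
    findings = []
    last_switch = None  # assumption: a map failure belongs to the last probed switch
    for line in text.splitlines():
        m = INVALID_BAR0.search(line)
        if m:
            findings.append((m.group(1), "gpu-invalid-bar0", "BAR0 is 0M @ 0x0"))
            continue
        m = PROBE_FAILED.search(line)
        if m:
            findings.append((m.group(1), "gpu-probe-failed", "error " + m.group(2)))
            continue
        m = SWITCH_PROBE.search(line)
        if m:
            last_switch = m.group(1)
            continue
        m = SWITCH_BAR0_FAIL.search(line)
        if m:
            findings.append(
                (last_switch, "nvswitch-bar0-map-failed", "error " + m.group(1))
            )
    return findings


if __name__ == "__main__":
    for address, kind, detail in find_bar0_failures(SAMPLE):
        print(address, kind, detail)
```

Running the same function over the full dmesg text from both a failing and a working boot would show which PCI devices lose their BAR0 assignment only on Ubuntu.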
** Affects: ubuntu
Importance: Undecided
Status: New
--
Milan Delta A100 GPU fails to detect on Ubuntu 18.04 and 20.04
https://bugs.launchpad.net/bugs/1915413
You received this bug notification because you are a member of Ubuntu Bugs,
which is subscribed to Ubuntu.