I think this is the strongest hint: the host dmesg is flooded with "BAR 1: can't reserve" messages.
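One way to see what the host believes owns that window is to look the range up in /proc/iomem, which lists every claimed physical address range and its owner, nested by indentation. The sketch below is only an illustration under stated assumptions: `overlapping_regions` is a hypothetical helper name, and it assumes the usual `start-end : owner` line format with two spaces of indentation per nesting level. It takes the file contents as a string so it can also be tried on a saved copy.

```python
import re

# One /proc/iomem entry looks like "  20000000000-2000fffffff : 0000:81:00.0".
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def overlapping_regions(iomem_text, start, end):
    """Return (depth, lo, hi, owner) for each entry overlapping [start, end].

    Hypothetical helper for illustration; depth is the nesting level,
    assuming two spaces of indentation per level.
    """
    hits = []
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if not match:
            continue  # skip anything that is not a "start-end : owner" line
        depth = len(match.group(1)) // 2
        lo = int(match.group(2), 16)
        hi = int(match.group(3), 16)
        # Two inclusive ranges overlap when each starts before the other ends.
        if lo <= end and hi >= start:
            hits.append((depth, lo, hi, match.group(4).strip()))
    return hits
```

On the host this would be run as root (unprivileged reads of /proc/iomem show zeroed addresses) with the window from the kernel message in the log excerpt that follows, for example `overlapping_regions(open("/proc/iomem").read(), 0x20000000000, 0x2000FFFFFFF)`. An owner other than 0000:81:00.0 nested inside that window would be worth a closer look. Whether that is the actual cause here is an open question.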
2021-12-30T10:01:59.456992+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: vfio_ecap_init: hiding ecap 0x1e@0x258
2021-12-30T10:01:59.457413+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
2021-12-30T10:01:59.457675+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
2021-12-30T10:01:59.486586+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.1: enabling device (0000 -> 0002)
2021-12-30T10:01:59.546592+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.3: enabling device (0000 -> 0002)
. . .
2021-12-30T10:09:58.164738+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]
2021-12-30T10:09:58.164811+0000 gauss.lan.incorrekt.net kernel: vfio-pci 0000:81:00.0: BAR 1: can't reserve [mem 0x20000000000-0x2000fffffff 64bit pref]

I will try adjusting host BIOS options.

B.

On Thu, 30 Dec 2021, at 10:25 AM, Bronek Kozicki wrote:
> Some more information:
>
> 1.
> driver seems to be loading fine in the guest
>
> bronekk@euclid:~$ sudo dmesg | grep -E "nvidia|0d:00"
> [ 0.810066] pci 0000:0d:00.0: [10de:1eb1] type 00 class 0x030000
> [ 0.814518] pci 0000:0d:00.0: reg 0x10: [mem 0xc0000000-0xc0ffffff]
> [ 0.818518] pci 0000:0d:00.0: reg 0x14: [mem 0x1000000000-0x100fffffff 64bit pref]
> [ 0.825110] pci 0000:0d:00.0: reg 0x1c: [mem 0x1010000000-0x1011ffffff 64bit pref]
> [ 0.829048] pci 0000:0d:00.0: reg 0x24: [io 0x9000-0x907f]
> [ 0.834899] pci 0000:0d:00.0: PME# supported from D0 D3hot D3cold
> [ 0.836042] pci 0000:0d:00.1: [10de:10f8] type 00 class 0x040300
> [ 0.837841] pci 0000:0d:00.1: reg 0x10: [mem 0xc1000000-0xc1003fff]
> [ 0.845020] pci 0000:0d:00.2: [10de:1ad8] type 00 class 0x0c0330
> [ 0.847351] pci 0000:0d:00.2: reg 0x10: [mem 0x1012000000-0x101203ffff 64bit pref]
> [ 0.854518] pci 0000:0d:00.2: reg 0x1c: [mem 0x1012040000-0x101204ffff 64bit pref]
> [ 0.858820] pci 0000:0d:00.2: PME# supported from D0 D3hot D3cold
> [ 0.862836] pci 0000:0d:00.3: [10de:1ad9] type 00 class 0x0c8000
> [ 0.864838] pci 0000:0d:00.3: reg 0x10: [mem 0xc1004000-0xc1004fff]
> [ 0.873964] pci 0000:0d:00.3: PME# supported from D0 D3hot D3cold
> [ 0.932598] pci 0000:0d:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
> [ 0.934523] pci 0000:0d:00.0: vgaarb: bridge control possible
> [ 0.936134] pci 0000:0d:00.0: vgaarb: setting as boot device (VGA legacy resources not available)
> [ 1.440190] pci 0000:0d:00.1: D0 power state depends on 0000:0d:00.0
> [ 1.441170] pci 0000:0d:00.2: D0 power state depends on 0000:0d:00.0
> [ 1.443582] pci 0000:0d:00.3: D0 power state depends on 0000:0d:00.0
> [ 2.619525] xhci_hcd 0000:0d:00.2: xHCI Host Controller
> [ 2.620624] xhci_hcd 0000:0d:00.2: new USB bus registered, assigned bus number 11
> [ 2.622792] xhci_hcd 0000:0d:00.2: hcc params 0x0180ff05 hci version 0x110 quirks 0x0000000000000010
> [ 2.672211] usb usb11: SerialNumber: 0000:0d:00.2
> [ 2.676422] xhci_hcd 0000:0d:00.2: xHCI Host Controller
> [ 2.677944] xhci_hcd 0000:0d:00.2: new USB bus registered, assigned bus number 12
> [ 2.681209] xhci_hcd 0000:0d:00.2: Host supports USB 3.1 Enhanced SuperSpeed
> [ 2.705956] usb usb12: SerialNumber: 0000:0d:00.2
> [ 3.926249] nvidia: loading out-of-tree module taints kernel.
> [ 3.927118] nvidia: module license 'NVIDIA' taints kernel.
> [ 3.938804] nvidia: module verification failed: signature and/or required key missing - tainting kernel
> [ 3.966693] nvidia-nvlink: Nvlink Core is being initialized, major device number 249
> [ 3.971181] nvidia 0000:0d:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
> [ 4.070078] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 460.91.03 Fri Jul 2 05:43:38 UTC 2021
> [ 4.349705] [drm] [nvidia-drm] [GPU ID 0x00000d00] Loading driver
> [ 4.352647] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0d:00.0 on minor 0
> [ 4.527067] audit: type=1400 audit(1640858541.112:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=650 comm="apparmor_parser"
> [ 4.527073] audit: type=1400 audit(1640858541.112:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=650 comm="apparmor_parser"
> [ 4.915963] snd_hda_intel 0000:0d:00.1: Disabling MSI
> [ 4.954737] snd_hda_intel 0000:0d:00.1: Handle vga_switcheroo audio client
> [ 5.244486] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input6
> [ 5.247732] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input7
> [ 5.250636] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input8
> [ 5.253520] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input9
> [ 5.256445] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input10
> [ 5.259401] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input11
> [ 5.262271] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:00/0000:00:03.4/0000:0d:00.1/sound/card0/input12
>
> bronekk@euclid:~$ sudo nvidia-smi
> Thu Dec 30 10:04:48 2021
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  Quadro RTX 4000     On   | 00000000:0D:00.0 Off |                  N/A |
> | 30%   39C    P8     3W / 125W |      1MiB /  7982MiB |      0%      Default |
> |                               |                      |                  N/A |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |  No running processes found                                                 |
> +-----------------------------------------------------------------------------+
>
> bronekk@euclid:~$ sudo lspci -vnn -s 0d:00.0
> 0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro RTX 4000] [10de:1eb1] (rev a1) (prog-if 00 [VGA controller])
>     Subsystem: Dell TU104GL [Quadro RTX 4000] [1028:12a0]
>     Physical Slot: 0-12
>     Flags: bus master, fast devsel, latency 0, IRQ 116
>     Memory at c0000000 (32-bit, non-prefetchable) [size=16M]
>     Memory at 1000000000 (64-bit, prefetchable) [size=256M]
>     Memory at 1010000000 (64-bit, prefetchable) [size=32M]
>     I/O ports at 9000 [size=128]
>     Capabilities: [60] Power Management version 3
>     Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
>     Capabilities: [78] Express Legacy Endpoint, MSI 00
>     Capabilities: [100] Virtual Channel
>     Capabilities: [250] Latency Tolerance Reporting
>     Capabilities: [128] Power Budgeting <?>
>     Capabilities: [420] Advanced Error Reporting
>     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>     Kernel driver in use: nvidia
>     Kernel modules: nvidia
>
>
> 2. host should not be trying to access the card:
>
> bronekk@gauss ~ % sudo lspci -vnn -s 81:00.0
> 81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro RTX 4000] [10de:1eb1] (rev a1) (prog-if 00 [VGA controller])
>     Subsystem: Dell Device [1028:12a0]
>     Flags: bus master, fast devsel, latency 0, IRQ 381, IOMMU group 31
>     Memory at bc000000 (32-bit, non-prefetchable) [size=16M]
>     Memory at 20000000000 (64-bit, prefetchable) [size=256M]
>     Memory at 20010000000 (64-bit, prefetchable) [size=32M]
>     I/O ports at b000 [size=128]
>     Expansion ROM at bd000000 [disabled] [size=512K]
>     Capabilities: [60] Power Management version 3
>     Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
>     Capabilities: [78] Express Legacy Endpoint, MSI 00
>     Capabilities: [100] Virtual Channel
>     Capabilities: [258] L1 PM Substates
>     Capabilities: [128] Power Budgeting <?>
>     Capabilities: [420] Advanced Error Reporting
>     Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
>     Capabilities: [900] Secondary PCI Express
>     Capabilities: [bb0] Physical Resizable BAR
>     Kernel driver in use: vfio-pci
>     Kernel modules: nouveau
>
> bronekk@gauss ~ % sudo cat /etc/modprobe.d/40-blacklist.conf
> # This host is headless, prevent any modules from attaching to video hardware
>
> # NVIDIA
> blacklist nouveau
> blacklist nvidia
>
> # AMD
> blacklist radeon
> blacklist amdgpu
> blacklist amdkfd
> blacklist fglrx
>
> # HDMI sound on a GPU
> blacklist snd_hda_intel
>
> # Framebuffers (ALL of them)
> blacklist vesafb
> blacklist aty128fb
> blacklist atyfb
> blacklist radeonfb
> blacklist cirrusfb
> blacklist cyber2000fb
> blacklist cyblafb
> blacklist gx1fb
> blacklist hgafb
> blacklist i810fb
> blacklist intelfb
> blacklist kyrofb
> blacklist lxfb
> blacklist matroxfb_base
> blacklist neofb
> blacklist nvidiafb
> blacklist pm2fb
> blacklist rivafb
> blacklist s1d13xxxfb
> blacklist savagefb
> blacklist sisfb
> blacklist sstfb
> blacklist tdfxfb
> blacklist tridentfb
> blacklist vfb
> blacklist viafb
> blacklist vt8623fb
> blacklist udlfb
>
> bronekk@gauss ~ % sudo cat /etc/modprobe.d/30-vfio.conf
> # 10de:* are NVIDIA
> # 1912:0015 is Renesas Technology Corp. uPD720202 USB 3.0 Host Controller
> options vfio-pci ids=10de:1eb1,10de:10f8,10de:1ad8,10de:1ad9,1912:0015
> options vfio-pci disable_vga=1
>
> bronekk@gauss ~ % sudo lspci -nn | grep -F "10de:"
> 81:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104GL [Quadro RTX 4000] [10de:1eb1] (rev a1)
> 81:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio Controller [10de:10f8] (rev a1)
> 81:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host Controller [10de:1ad8] (rev a1)
> 81:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller [10de:1ad9] (rev a1)
>
> 3.
> device mapping in libvirt:
>
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <driver name='vfio'/>
>   <source>
>     <address domain='0x0000' bus='0x81' slot='0x00' function='0x0'/>
>   </source>
>   <rom bar='off'/>
>   <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x0' multifunction='on'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <driver name='vfio'/>
>   <source>
>     <address domain='0x0000' bus='0x81' slot='0x00' function='0x1'/>
>   </source>
>   <rom bar='off'/>
>   <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x1'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <driver name='vfio'/>
>   <source>
>     <address domain='0x0000' bus='0x81' slot='0x00' function='0x2'/>
>   </source>
>   <rom bar='off'/>
>   <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x2'/>
> </hostdev>
> <hostdev mode='subsystem' type='pci' managed='yes'>
>   <driver name='vfio'/>
>   <source>
>     <address domain='0x0000' bus='0x81' slot='0x00' function='0x3'/>
>   </source>
>   <rom bar='off'/>
>   <address type='pci' domain='0x0000' bus='0x0d' slot='0x00' function='0x3'/>
> </hostdev>
>
>
> 4. something is definitely wrong inside the guest, since I am getting these:
>
> [ 1236.179163] watchdog: BUG: soft lockup - CPU#12 stuck for 23s!
> [Xorg:2982]
> [ 1236.179961] Modules linked in: hid_generic usbhid hid rfkill snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation snd_soc_core ghash_clmulni_intel snd_compress soundwire_cadence nls_ascii snd_hda_codec nls_cp437 vfat fat aesni_intel snd_hda_core libaes snd_hwdep crypto_simd soundwire_bus cryptd nvidia_drm(POE) snd_pcm glue_helper snd_timer drm_kms_helper snd iTCO_wdt intel_pmc_bxt joydev iTCO_vendor_support sg serio_raw cec watchdog soundcore virtio_console virtio_balloon pcspkr evdev efi_pstore qemu_fw_cfg nvidia_modeset(POE) nvidia(POE) drm fuse configfs efivarfs virtio_rng rng_core ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sd_mod t10_pi sr_mod crc_t10dif cdrom crct10dif_generic ahci libahci xhci_pci libata xhci_hcd virtio_scsi virtio_net net_failover failover scsi_mod usbcore crct10dif_pclmul psmouse crct10dif_common crc32_pclmul crc32c_intel i2c_i801 virtio_pci lpc_ich i2c_smbus virtio_ring usb_common virtio button
> [ 1236.189681] CPU: 12 PID: 2982 Comm: Xorg Tainted: P OEL 5.10.0-10-amd64 #1 Debian 5.10.84-1
> [ 1236.190725] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [ 1236.191711] RIP: 0010:_nv032887rm+0x12/0x40 [nvidia]
> [ 1236.192286] Code: d2 0e 31 c0 e8 af 7d 78 ff e8 ca 3c eb ff 31 c0 48 83 c4 08 c3 0f 1f 00 48 83 ec 08 39 4a 10 76 17 48 8b 02 c1 e9 02 8b 04 88 <48> 83 c4 08 c3 66 0f 1f 84 00 00 00 00 00 be 00 00 d5 09 bf 0a ad
> [ 1236.194379] RSP: 0018:ffffa9b840f6ba98 EFLAGS: 00000256
> [ 1236.194977] RAX: 00000000164000a1 RBX: 0000000000000020 RCX: 0000000000000000
> [ 1236.195804] RDX: ffff9995889fd0a0 RSI: ffff9995889fc008 RDI: ffff99958b67d008
> [ 1236.196617] RBP: ffff999586b02a00 R08: 0000000000000020 R09: 0000000000000000
> [ 1236.197425] R10: ffff9995889fc008 R11: ffff9995889fd0a0 R12: 0000000000000000
> [ 1236.198217] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9995889fc008
> [ 1236.199011] FS: 00007f7bcbbd6a40(0000) GS:ffff999cdfb00000(0000) knlGS:0000000000000000
> [ 1236.199931] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1236.200587] CR2: 0000564dd482a3a8 CR3: 0000000102d80005 CR4: 0000000000770ee0
> [ 1236.201392] PKRU: 55555554
> [ 1236.201699] Call Trace:
> [ 1236.202118] ? _nv009235rm+0x1f1/0x230 [nvidia]
> [ 1236.202763] ? _nv036126rm+0x62/0x70 [nvidia]
> [ 1236.203393] ? _nv028825rm+0x46/0x4a0 [nvidia]
> [ 1236.204041] ? _nv009323rm+0x7b/0x90 [nvidia]
> [ 1236.204667] ? _nv009319rm+0xfb/0x4f0 [nvidia]
> [ 1236.205302] ? _nv037231rm+0xfd/0x180 [nvidia]
> [ 1236.205939] ? _nv034489rm+0x248/0x370 [nvidia]
> [ 1236.206528] ? _nv009448rm+0x3d/0x90 [nvidia]
> [ 1236.207153] ? _nv029075rm+0x14c/0x670 [nvidia]
> [ 1236.207759] ? _nv028910rm+0x520/0x900 [nvidia]
> [ 1236.208378] ? _nv002525rm+0x9/0x20 [nvidia]
> [ 1236.208966] ? _nv003517rm+0x1b/0x80 [nvidia]
> [ 1236.209551] ? _nv013021rm+0x6fe/0x770 [nvidia]
> [ 1236.210149] ? _nv038021rm+0xb3/0x150 [nvidia]
> [ 1236.210736] ? _nv038020rm+0x388/0x4e0 [nvidia]
> [ 1236.211336] ? _nv036312rm+0xbe/0x140 [nvidia]
> [ 1236.211939] ? _nv036313rm+0x42/0x70 [nvidia]
> [ 1236.212525] ? _nv008273rm+0x4b/0x90 [nvidia]
> [ 1236.213117] ? _nv000709rm+0x4ef/0x880 [nvidia]
> [ 1236.213709] ? rm_ioctl+0x54/0xb0 [nvidia]
> [ 1236.214228] ? nvidia_ioctl+0x66c/0x880 [nvidia]
> [ 1236.214816] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
> [ 1236.215516] ? __x64_sys_ioctl+0x83/0xb0
> [ 1236.215972] ? do_syscall_64+0x33/0x80
> [ 1236.216405] ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> On Wed, 29 Dec 2021, at 11:16 PM, Bronek Kozicki wrote:
>> Hi
>>
>> Hoping someone solved this one before.
>>
>> My host is Epyc Milan, running on Asrock ROMED8-2T, GPU is NVIDIA
>> Quadro RTX 4000, running on fresh Arch Linux install. The guest is
>> Debian 11, with NVIDIA-460 drivers.
>> I can see the drivers are correctly loaded in the guest (with
>> nvidia-smi), but Xorg fails to initialize. The /var/log/Xorg.0.log
>> tail is:
>>
>> [ 254.714] (II) NVIDIA: Using 24576.00 MB of virtual memory for indirect memory
>> [ 254.714] (II) NVIDIA: access.
>> [ 257.719] (EE) NVIDIA(GPU-0): Failed to initialize DMA.
>> [ 257.720] (EE) NVIDIA(0): Failed to allocate push buffer
>> [ 257.829] (EE)
>> Fatal server error:
>> [ 257.829] (EE) AddScreen/ScreenInit failed for driver 0
>> [ 257.829] (EE)
>> [ 257.829] (EE)
>> Please consult the The X.Org Foundation support
>> at http://wiki.x.org
>> for help.
>> [ 257.829] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
>> [ 257.829] (EE)
>> [ 257.829] (EE) Server terminated with error (1). Closing log file.
>>
>> I am running a similar configuration (same card, also Debian 11 and
>> nvidia-460 drivers) on a different host, with an older Intel Xeon CPU.
>> No problems there.
>>
>> Any hints?
>>
>>
>> B.
>>
>> --
>> Bronek Kozicki
>> b...@incorrekt.com
>>
>> _______________________________________________
>> vfio-users mailing list
>> vfio-users@redhat.com
>> https://listman.redhat.com/mailman/listinfo/vfio-users