** Description changed:

[ Impact ]

  * Microsoft Azure NV-series instances with NVIDIA GRID drivers started to experience xserver crashes while following Microsoft's official guide to installing NVIDIA drivers [1].
  * Root cause analysis showed that the crash is triggered by a device with BusID "PCI:0@<domain_id>:0:0", where the domain ID is >= 32767, while the hyperv_drm kernel module is loaded.
  * Either removing the BusID specification or unloading the hyperv_drm kernel module seems to fix the crash.
  * The crash happens while the X server is enumerating PCI devices: it dereferences a NULL pointer while trying to access the PCI device info.
  * It only happens while the hyperv_drm kernel module is loaded because hyperv_drm is a virtual device and does not expose PCI hardware information.
  * The upstream patch [2] addresses the issue, and it is confirmed that an xserver with the patch does not crash.
  * The Ubuntu Focal `xorg-server` package does not include the patch [2] at the moment (xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6).

[1]: https://learn.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup#install-grid-drivers-on-nv-or-nvv3-series-vms
[2]: https://github.com/freedesktop/xorg-xserver/commit/0d93bbfa2cfacbb73741f8bed0e32fa1a656b928

[ Test Plan ]

Part (a) is quoted from Microsoft's official guide [1].

Part (a):

  * Spawn a Microsoft Azure NV-series instance with an NVIDIA GRID-supported GPU
    - e.g. `NV36adms A10`
  * Install updates, required tooling, and the desktop environment:
    - sudo apt-get update
    - sudo apt-get upgrade -y
    - sudo apt-get dist-upgrade -y
    - sudo apt-get install build-essential ubuntu-desktop -y
    - sudo apt-get install linux-azure -y
  * Disable the nouveau kernel driver:
    # Create a blacklist file /etc/modprobe.d/nouveau.conf with the following contents:
      blacklist nouveau
      blacklist lbm-nouveau
  * Reboot the VM, reconnect, and then stop the X server:
    - sudo reboot
    # Wait for the reboot, reconnect, and continue:
    - sudo systemctl stop lightdm.service
  * Download and install the NVIDIA GRID driver:
    - wget -O NVIDIA-Linux-x86_64-grid.run https://go.microsoft.com/fwlink/?linkid=874272
    - chmod +x NVIDIA-Linux-x86_64-grid.run
    - sudo ./NVIDIA-Linux-x86_64-grid.run
    # When the setup asks whether you want to run the nvidia-xconfig utility to update your X configuration file, select Yes.
  * Copy /etc/nvidia/gridd.conf.template to /etc/nvidia/gridd.conf:
    - sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
  * Edit /etc/nvidia/gridd.conf:
    - sudo nano /etc/nvidia/gridd.conf
    # Append the following lines:
      IgnoreSP=FALSE
      EnableUI=FALSE
    # Remove this line if present:
      FeatureType=0
    # And save.
  * Reboot the VM

Part (b):

  * Ensure that the hyperv_drm kernel module is loaded:
    - sudo modprobe hyperv_drm
  * Use the attached xorg.conf file to override the /etc/X11/xorg.conf file
  * Try to start the `xserver`:
    - sudo startx
  * `xserver` should crash with output similar to the following:

    X.Org X Server 1.20.13
    X Protocol Version 11, Revision 0
    Build Operating System: linux Ubuntu
    Current Operating System: Linux a10test 5.15.0-1031-azure #38~20.04.1-Ubuntu SMP Mon Jan 9 18:23:48 UTC 2023 x86_64
    Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-1031-azure root=PARTUUID=4cac852b-afba-447b-b3e7-c002155c1305 ro console=tty1 console=ttyS0 earlyprintk=ttyS0 panic=-1
    Build Date: 07 February 2023 12:48:13PM
    xorg-server 2:1.20.13-1ubuntu1~20.04.6 (For technical support please see http://www.ubuntu.com/support)
    Current version of pixman: 0.38.4
        Before reporting problems, check http://wiki.x.org
        to make sure that you have the latest version.
    Markers: (--) probed, (**) from config file, (==) default setting,
        (++) from command line, (!!) notice, (II) informational,
        (WW) warning, (EE) error, (NI) not implemented, (??) unknown.
    (==) Log file: "/var/log/Xorg.1.log", Time: Sat Feb 18 10:54:26 2023
    (==) Using config file: "/etc/X11/xorg.conf"
    (==) Using system config directory "/usr/share/X11/xorg.conf.d"
    (EE)
    (EE) Backtrace:
    (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x13c) [0x55e7787c5ecc]
    (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x60) [0x7f9576cac420]
    (EE) 2: /usr/lib/xorg/Xorg (xf86PlatformDeviceCheckBusID+0xa7) [0x55e7786c4db7]
    (EE) 3: /usr/lib/xorg/Xorg (xf86PlatformMatchDriver+0x700) [0x55e7786bf1b0]
    (EE) 4: /usr/lib/xorg/Xorg (xf86CallDriverProbe+0x5c) [0x55e7786971dc]
    (EE) 5: /usr/lib/xorg/Xorg (xf86BusConfig+0x43) [0x55e778697b23]
    (EE) 6: /usr/lib/xorg/Xorg (InitOutput+0x90b) [0x55e7786a59eb]
    (EE) 7: /usr/lib/xorg/Xorg (InitFonts+0x1d4) [0x55e778667ea4]
    (EE) 8: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xf3) [0x7f9576ac8083]
    (EE) 9: /usr/lib/xorg/Xorg (_start+0x2e) [0x55e778651ace]
    (EE)
    (EE) Segmentation fault at address 0x124
    (EE)
    Fatal server error:
    (EE) Caught signal 11 (Segmentation fault). Server aborting
    (EE)
    (EE)
    Please consult the The X.Org Foundation support
         at http://wiki.x.org
     for help.
    (EE) Please also check the log file at "/var/log/Xorg.1.log" for additional information.
    (EE)
    (EE) Server terminated with error (1). Closing log file.
    ^Cxinit: giving up
    xinit: unable to connect to X server: Connection refused
    xinit: unexpected signal 2

# To verify that the patch fixes the issue:

  * Enable the following PPA, which includes the fix:
    - sudo add-apt-repository ppa:mustafakemalgilor/lp2007746
    - sudo apt update
  * Install the package:
    - sudo apt install xserver-xorg-core=2:1.20.13-1ubuntu1~20.04.6ubuntu1
  * Try to start xserver:
    - sudo startx
  * xserver should not crash.
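
The outcome of the test plan can also be checked from a shell instead of by reading the console output. The following is only a minimal sketch and not part of the original test plan: the helper names `xorg_log_has_segfault` and `installed_xorg_version` are hypothetical, and the default log path is taken from the crash output quoted in Part (b) (the display number, and therefore the file name, may differ between runs).

```shell
#!/bin/sh
# Hypothetical helpers for checking the result of the test plan.

# Succeeds (exit status 0) if the given Xorg log contains the crash
# signature from the failing run; fails if the signature is absent
# or the log cannot be read.
xorg_log_has_segfault() {
    log_file=${1:-/var/log/Xorg.1.log}
    grep -q 'Caught signal 11 (Segmentation fault)' "$log_file" 2>/dev/null
}

# Prints the installed version of xserver-xorg-core so it can be
# compared with the version from the PPA.
installed_xorg_version() {
    dpkg-query -W -f='${Version}\n' xserver-xorg-core
}
```

With the unpatched package, `xorg_log_has_segfault` would be expected to succeed after the failed `sudo startx`. With the package from the PPA installed, it would be expected to fail, assuming the log file was rewritten by the new run.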

[ Where problems could occur ]

  * The regression risk is low, given that the patch is well isolated and essentially adds a NULL check that was already assumed to be there in the first place.

[ Other Info ]

  * Workaround #1: unload the hyperv_drm kernel module:
    - sudo modprobe -r hyperv_drm
  * Workaround #2: comment out the BusID line in the "Device" section of /etc/X11/xorg.conf:

    Section "Device"
        Identifier     "Device0"
        Driver         "nvidia"
        VendorName     "NVIDIA Corporation"
        # BusID          "PCI:0@32828:0:0"
        Option         "HardDPMS" "false"
        Option         "CustomEDID" "DFP-0:/etc/X11/vdisplay.edid"
    EndSection
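
Before applying either workaround, it may help to confirm that a given BusID falls into the range described in the root cause analysis. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the BusID layout `PCI:<bus>@<domain>:<device>:<function>` is inferred from the example in this report, and the threshold 32767 is the one quoted in the Impact section.

```shell
#!/bin/sh
# Hypothetical helpers; the BusID layout is assumed from the example
# "PCI:0@32828:0:0" in this report.

# Prints the domain part of a BusID, or nothing if it has no "@<domain>".
busid_domain() {
    printf '%s\n' "$1" | sed -n 's/^PCI:[0-9]*@\([0-9]*\):.*$/\1/p'
}

# Succeeds if the BusID carries a domain at or above the threshold
# named in the root cause analysis (32767).
busid_in_affected_range() {
    domain=$(busid_domain "$1")
    [ -n "$domain" ] && [ "$domain" -ge 32767 ]
}
```

For example, `busid_in_affected_range 'PCI:0@32828:0:0'` succeeds for the BusID shown in workaround #2. Whether hyperv_drm is loaded can be checked separately, for instance with `lsmod | grep hyperv_drm`.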

-- 
https://bugs.launchpad.net/bugs/2007746

Title:
  [SRU] xserver crashes when hyperv_drm kernel module is loaded on azure NV series instances w/ nvidia grid driver

