** Description changed:

  [Description]
+ 
+ On an Azure VM with a Tesla GPU, it was noticed that removing the GPU
+ from the PCI bus crashes the VM. If the NVIDIA drivers are loaded, the
+ machine does not crash immediately. Instead, the removal hangs and the
+ machine crashes on reboot.
+ 
+ This is related to bug [1].
+ The bug reported in [1] concerns another driver, but the root cause is the same.
+ It is still being investigated whether this is a bug in PCI or a bug in how various drivers use PCI.
+ 
+ For this case we have identified that reverting commit [2] prevents the
+ kernel crashes.
+ 
+ Azure has requested that this commit be reverted, at least for the time being.
+ This commit is not in upstream, so it only needs to be reverted from the Ubuntu kernels.
  
  [Test Case]
+ On an Azure VM with a GPU:
+ 
+ # echo '1' > /sys/bus/pci/devices/0001:00:00.0/remove
+ 
+ where '0001:00:00.0' is the PCI address of the GPU.
+ The VM will crash.
  
  [Where things could go wrong]
  
+ [Other]
  
- [Other]
+ [1] https://bugzilla.kernel.org/show_bug.cgi?id=215515
+ [2] https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/jammy/commit/?h=Ubuntu-azure-5.15.0-1043.50&id=75af0c10b3703400890d314d1d91d25294234a81
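
The test case uses a fixed PCI address, which may differ between VMs. As a rough sketch only, the address could be looked up by matching the NVIDIA PCI vendor ID (0x10de) in sysfs. The helper name `find_nvidia_gpus` and its optional root-directory argument are hypothetical and not part of the bug report; the sketch also assumes the GPU is the only NVIDIA PCI device on the VM.

```shell
# Print the PCI addresses of NVIDIA devices (vendor ID 0x10de).
# The optional first argument (an assumption, added for testing) overrides
# the sysfs directory to scan; it defaults to the real PCI device list.
find_nvidia_gpus() {
    root="${1:-/sys/bus/pci/devices}"
    for dev in "$root"/*; do
        # Skip entries that do not expose a readable vendor file.
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "0x10de" ]; then
            basename "$dev"
        fi
    done
}

# Usage (as root, on a disposable VM; expected to crash an affected kernel):
#   for addr in $(find_nvidia_gpus); do
#       echo 1 > "/sys/bus/pci/devices/$addr/remove"
#   done
```

The removal loop is left commented out on purpose, since running it reproduces the crash described above.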
-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/2042568

Title:
  Azure - Kernel crashes when removing gpu from pci

Status in linux-azure package in Ubuntu:
  New
Status in linux-azure source package in Jammy:
  New
Status in linux-azure source package in Lunar:
  New

Bug description:
  [Description]

  On an Azure VM with a Tesla GPU, it was noticed that removing the GPU
  from the PCI bus crashes the VM. If the NVIDIA drivers are loaded, the
  machine does not crash immediately. Instead, the removal hangs and the
  machine crashes on reboot.

  This is related to bug [1].
  The bug reported in [1] concerns another driver, but the root cause is the same.
  It is still being investigated whether this is a bug in PCI or a bug in how various drivers use PCI.

  For this case we have identified that reverting commit [2] prevents
  the kernel crashes.

  Azure has requested that this commit be reverted, at least for the time being.
  This commit is not in upstream, so it only needs to be reverted from the Ubuntu kernels.

  [Test Case]
  On an Azure VM with a GPU:

  # echo '1' > /sys/bus/pci/devices/0001:00:00.0/remove

  where '0001:00:00.0' is the PCI address of the GPU.
  The VM will crash.

  [Where things could go wrong]

  [Other]

  [1] https://bugzilla.kernel.org/show_bug.cgi?id=215515
  [2] https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/jammy/commit/?h=Ubuntu-azure-5.15.0-1043.50&id=75af0c10b3703400890d314d1d91d25294234a81

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/2042568/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp