Public bug reported:

[Impact]

Description:
Failed to install Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P and a GPU.
A random write to the VF BAR0 memory region causes the kernel to hit an MCE error.

Version-Release number:
Ubuntu 24.04

How reproducible:
Each time

Steps to reproduce:
- Enable PCI segment, Intel VT-d, and SR-IOV in the BIOS.
- Run a fresh install on a DL380a server with 2P and a GPU in slot17.

Expected results:
No MCE; the installation completes without problems.

Actual results:
The kernel reports MCE errors.

Additional info:
We tracked this issue on RHEL9.4; it is caused by the following patches.

cb4a6ccf3583 perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge (v6.8-rc1)
388d76175bd9 perf/x86/intel/uncore: Support IIO free-running counters on GNR (v6.8-rc1)
632c4bf6d007 perf/x86/intel/uncore: Support Granite Rapids (v6.8-rc1)
b560e0cd882b perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array (v6.8-rc1)
cf35791476fc perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR (v6.8-rc1)

[Fix]

Intel gave us a patch set that resolves the issue.
https://lore.kernel.org/lkml/20240614134631.1092359-1-kan.li...@linux.intel.com/#r

The following patches are required.

f8a86a9bb5f7 perf/x86/intel/uncore: Support HBM and CXL PMON counters (v6.11-rc1)
15a4bd51853b perf/x86/uncore: Cleanup unused unit structure (v6.11-rc1)
f76a8420444b perf/x86/uncore: Apply the unit control RB tree to PCI uncore units (v6.11-rc1)
b1d9ea2e1ca4 perf/x86/uncore: Apply the unit control RB tree to MSR uncore units (v6.11-rc1)
80580dae65b9 perf/x86/uncore: Apply the unit control RB tree to MMIO uncore units (v6.11-rc1)
585463fee642 perf/x86/uncore: Retrieve the unit ID from the unit control RB tree (v6.11-rc1)
c74443d92f68 perf/x86/uncore: Support per PMU cpumask (v6.11-rc1)
0007f3932592 perf/x86/uncore: Save the unit control address of all units (v6.11-rc1)

[Where problems could occur]

[Other Info]

** Affects: linux (Ubuntu)
     Importance: Medium
     Assignee: Michael Reed (mreed8855)
         Status: Fix Released

** Affects: linux (Ubuntu Noble)
     Importance: Undecided
         Status: In Progress

** Affects: linux (Ubuntu Oracular)
     Importance: Medium
     Assignee: Michael Reed (mreed8855)
         Status: Fix Released

** Changed in: linux (Ubuntu)
       Status: New => In Progress

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Michael Reed (mreed8855)

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: Medium
     Assignee: Michael Reed (mreed8855)
       Status: In Progress

** Changed in: linux (Ubuntu Noble)
       Status: New => In Progress

** Changed in: linux (Ubuntu Oracular)
       Status: In Progress => Fix Released

** Description changed:

+ [Impact]
  Description:
  Failed to install Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P + NVidia L40 GPU in slot17.

  cf35791476fc perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR (v6.8-rc1)
+ 
+ [Fix]
+ 
+ [Where problems could occur]
+ 
+ [Other Info]

** Description changed:

  [Fix]
+ Intel gave us a patch set that resolves the issue.
+ https://lore.kernel.org/lkml/20240614134631.1092359-1-kan.li...@linux.intel.com/#r
+ 
+ The following patches are required.
+ 
+ f8a86a9bb5f7 perf/x86/intel/uncore: Support HBM and CXL PMON counters (v6.11-rc1)
+ 15a4bd51853b perf/x86/uncore: Cleanup unused unit structure (v6.11-rc1)
+ f76a8420444b perf/x86/uncore: Apply the unit control RB tree to PCI uncore units (v6.11-rc1)
+ b1d9ea2e1ca4 perf/x86/uncore: Apply the unit control RB tree to MSR uncore units (v6.11-rc1)
+ 80580dae65b9 perf/x86/uncore: Apply the unit control RB tree to MMIO uncore units (v6.11-rc1)
+ 585463fee642 perf/x86/uncore: Retrieve the unit ID from the unit control RB tree (v6.11-rc1)
+ c74443d92f68 perf/x86/uncore: Support per PMU cpumask (v6.11-rc1)
+ 0007f3932592 perf/x86/uncore: Save the unit control address of all units (v6.11-rc1)

  [Where problems could occur]

** Description changed:

  [Impact]
  Description:
- Failed to install Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P + NVidia L40 GPU in slot17.
+ Failed to install Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P GPU

  Steps to reproduce
  - PCI segment, Intel VT-d and SR-IOV , all enabled in the BIOS
- - Run a fresh install on a DL380a server with 2P with GPU (NVidia L40) in slot17
+ - Run a fresh install on a DL380a server with 2P with GPU in slot17

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2081079

Title:
  [SRU]Ubuntu 24.04 - It cannot be installed with DL380a Gen12 (2P, SRF-SP)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Released
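
A possible way to check which of the required patches from the [Fix] section a kernel tree still lacks is sketched below. It is only an illustration: the names `Patch`, `parse_patches`, `in_mainline_history`, and `missing_patches` are hypothetical and do not come from the bug report. The ancestor check assumes a mainline-based git tree; in a distribution tree, backported commits normally carry different SHAs, so a different presence test would have to be passed in.

```python
import re
import subprocess
from typing import Callable, NamedTuple


class Patch(NamedTuple):
    sha: str            # abbreviated mainline commit ID
    subject: str        # commit subject line
    first_release: str  # first mainline tag listed in the bug report


# Required patches, copied from the [Fix] section of the bug description.
FIX_LIST = """\
f8a86a9bb5f7 perf/x86/intel/uncore: Support HBM and CXL PMON counters (v6.11-rc1)
15a4bd51853b perf/x86/uncore: Cleanup unused unit structure (v6.11-rc1)
f76a8420444b perf/x86/uncore: Apply the unit control RB tree to PCI uncore units (v6.11-rc1)
b1d9ea2e1ca4 perf/x86/uncore: Apply the unit control RB tree to MSR uncore units (v6.11-rc1)
80580dae65b9 perf/x86/uncore: Apply the unit control RB tree to MMIO uncore units (v6.11-rc1)
585463fee642 perf/x86/uncore: Retrieve the unit ID from the unit control RB tree (v6.11-rc1)
c74443d92f68 perf/x86/uncore: Support per PMU cpumask (v6.11-rc1)
0007f3932592 perf/x86/uncore: Save the unit control address of all units (v6.11-rc1)
"""

# One entry per line: "<12-hex sha> <subject> (<tag>)".
LINE_RE = re.compile(
    r"^(?P<sha>[0-9a-f]{12})\s+(?P<subject>.+?)\s+\((?P<rel>v[\w.\-]+)\)$"
)


def parse_patches(text: str) -> list[Patch]:
    """Turn the plain-text patch list into Patch records, skipping other lines."""
    patches = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            patches.append(Patch(match["sha"], match["subject"], match["rel"]))
    return patches


def in_mainline_history(sha: str, repo: str = ".") -> bool:
    """Return True if `sha` is an ancestor of HEAD in the git tree at `repo`.

    Assumption: the tree shares history with mainline, so SHAs match.
    """
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", sha, "HEAD"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def missing_patches(
    patches: list[Patch],
    is_present: Callable[[str], bool] = in_mainline_history,
) -> list[Patch]:
    """Return the patches for which `is_present(sha)` is False, in list order."""
    return [p for p in patches if not is_present(p.sha)]
```

Calling `missing_patches(parse_patches(FIX_LIST))` from inside a mainline checkout would list the required commits that are not yet in its history. The presence test is passed in as a parameter so that a tree with backported commits can supply its own check instead of the ancestor lookup.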