Public bug reported:

[Impact]
Description:
Ubuntu 24.04 fails to install on a DL380a Gen12 with Intel Sierra Forest (2P) and a GPU.

A random write to the VF BAR0 memory region causes the kernel to hit an
MCE (machine check exception) error.

Version-Release number:
Ubuntu 24.04

How reproducible:
Each time

Steps to reproduce:
- Enable PCI segment, Intel VT-d and SR-IOV in the BIOS
- Run a fresh install on a 2P DL380a server with a GPU in slot 17

Expected results:
No MCE; the installation completes without problems.

Actual results:
The kernel reports MCE errors.
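
To confirm the failure from a captured console or dmesg log, the MCE lines can be filtered out with a small script. The following is only a sketch: the function name find_mce_lines is hypothetical, and it assumes the kernel tags machine check reports with "[Hardware Error]" or the words "Machine Check", which may not cover every message on this platform.

```python
def find_mce_lines(lines):
    """Return the log lines that look like machine check reports.

    Assumption: MCE reports contain "[Hardware Error]" or "Machine Check"
    (matched case-insensitively). Other kernel messages are ignored.
    """
    markers = ("[hardware error]", "machine check")
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in markers):
            hits.append(line.rstrip("\n"))
    return hits
```

For example, the function can be called on the lines of a saved installer console log; an empty result suggests that no MCE was logged in that capture.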

Additional info:

We tracked this issue on RHEL 9.4; it is caused by the following
patches.

cb4a6ccf3583 perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge (v6.8-rc1)
388d76175bd9 perf/x86/intel/uncore: Support IIO free-running counters on GNR (v6.8-rc1)
632c4bf6d007 perf/x86/intel/uncore: Support Granite Rapids (v6.8-rc1)
b560e0cd882b perf/x86/uncore: Use u64 to replace unsigned for the uncore offsets array (v6.8-rc1)
cf35791476fc perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO format of SPR (v6.8-rc1)

[Fix]
Intel gave us a patch set that resolves the issue.
https://lore.kernel.org/lkml/20240614134631.1092359-1-kan.li...@linux.intel.com/#r

The following patches are required.

f8a86a9bb5f7 perf/x86/intel/uncore: Support HBM and CXL PMON counters (v6.11-rc1)
15a4bd51853b perf/x86/uncore: Cleanup unused unit structure (v6.11-rc1)
f76a8420444b perf/x86/uncore: Apply the unit control RB tree to PCI uncore units (v6.11-rc1)
b1d9ea2e1ca4 perf/x86/uncore: Apply the unit control RB tree to MSR uncore units (v6.11-rc1)
80580dae65b9 perf/x86/uncore: Apply the unit control RB tree to MMIO uncore units (v6.11-rc1)
585463fee642 perf/x86/uncore: Retrieve the unit ID from the unit control RB tree (v6.11-rc1)
c74443d92f68 perf/x86/uncore: Support per PMU cpumask (v6.11-rc1)
0007f3932592 perf/x86/uncore: Save the unit control address of all units (v6.11-rc1)
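
One way to check whether a given kernel tree already contains these fixes is sketched below. The helper names (parse_commits, is_applied) are hypothetical. The check relies on `git merge-base --is-ancestor`, so it assumes a mainline-based tree in which the commit IDs above exist unchanged; in a distribution tree where the patches were cherry-picked, the IDs differ and the commits would have to be matched by subject instead.

```python
import re
import subprocess

# Commit list copied from the [Fix] section of this report.
REQUIRED_FIXES = """\
f8a86a9bb5f7 perf/x86/intel/uncore: Support HBM and CXL PMON counters (v6.11-rc1)
15a4bd51853b perf/x86/uncore: Cleanup unused unit structure (v6.11-rc1)
f76a8420444b perf/x86/uncore: Apply the unit control RB tree to PCI uncore units (v6.11-rc1)
b1d9ea2e1ca4 perf/x86/uncore: Apply the unit control RB tree to MSR uncore units (v6.11-rc1)
80580dae65b9 perf/x86/uncore: Apply the unit control RB tree to MMIO uncore units (v6.11-rc1)
585463fee642 perf/x86/uncore: Retrieve the unit ID from the unit control RB tree (v6.11-rc1)
c74443d92f68 perf/x86/uncore: Support per PMU cpumask (v6.11-rc1)
0007f3932592 perf/x86/uncore: Save the unit control address of all units (v6.11-rc1)
"""

# One entry per line: 12-hex-digit ID, subject, then the release tag in parentheses.
LINE_RE = re.compile(
    r"^(?P<sha>[0-9a-f]{12})\s+(?P<subject>.+?)\s+\((?P<tag>v[\w.\-]+)\)$"
)


def parse_commits(text):
    """Parse the list into (sha, subject, tag) tuples, skipping other lines."""
    commits = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            commits.append((match["sha"], match["subject"], match["tag"]))
    return commits


def is_applied(sha, repo=".", ref="HEAD"):
    """Return True if `sha` is an ancestor of `ref` in the git tree at `repo`.

    Assumption: the tree shares history with mainline, so the ID is valid there.
    """
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", sha, ref],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling is_applied() for each ID returned by parse_commits(REQUIRED_FIXES) reports which of the eight fixes are still missing from the checked-out tree.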

[Where problems could occur]

[Other Info]

** Affects: linux (Ubuntu)
     Importance: Medium
     Assignee: Michael Reed (mreed8855)
         Status: Fix Released

** Affects: linux (Ubuntu Noble)
     Importance: Undecided
         Status: In Progress

** Affects: linux (Ubuntu Oracular)
     Importance: Medium
     Assignee: Michael Reed (mreed8855)
         Status: Fix Released

** Changed in: linux (Ubuntu)
       Status: New => In Progress

** Changed in: linux (Ubuntu)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu)
     Assignee: (unassigned) => Michael Reed (mreed8855)

** Also affects: linux (Ubuntu Noble)
   Importance: Undecided
       Status: New

** Also affects: linux (Ubuntu Oracular)
   Importance: Medium
     Assignee: Michael Reed (mreed8855)
       Status: In Progress

** Changed in: linux (Ubuntu Noble)
       Status: New => In Progress

** Changed in: linux (Ubuntu Oracular)
       Status: In Progress => Fix Released

** Description changed:

+ [Impact]
  Description:
  Failed to install Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P 
+ NVidia L40 GPU in slot17.
  
  A random write to the VF BAR0 memory region causes the kernel to hit an
  MCE (machine check exception) error.
  
  Version-Release number :
  Ubuntu 24.04
  
  How reproducible:
  Each time
  
  Steps to reproduce
  - PCI segment, Intel VT-d and SR-IOV , all enabled in the BIOS
  - Run a fresh install on a DL380a server with 2P with GPU (NVidia L40) in 
slot17
  
  Expected results
  No MCE and run installation w/o problem
  
  Actual results
  The kernel got MCE errors.
  
  Additional info:
  
  We tracked this issue on RHEL 9.4; it is caused by the following
  patches.
  
  cb4a6ccf3583 perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge 
(v6.8-rc1)
  388d76175bd9 perf/x86/intel/uncore: Support IIO free-running counters on GNR 
(v6.8-rc1)
  632c4bf6d007 perf/x86/intel/uncore: Support Granite Rapids (v6.8-rc1)
  b560e0cd882b perf/x86/uncore: Use u64 to replace unsigned for the uncore 
offsets array (v6.8-rc1)
  cf35791476fc perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO 
format of SPR (v6.8-rc1)
+ 
+ [Fix]
+ 
+ [Where problems could occur]
+ 
+ [Other Info]

** Description changed:

  [Impact]
  Description:
  Failed to install Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P 
+ NVidia L40 GPU in slot17.
  
  A random write to the VF BAR0 memory region causes the kernel to hit an
  MCE (machine check exception) error.
  
  Version-Release number :
  Ubuntu 24.04
  
  How reproducible:
  Each time
  
  Steps to reproduce
  - PCI segment, Intel VT-d and SR-IOV , all enabled in the BIOS
  - Run a fresh install on a DL380a server with 2P with GPU (NVidia L40) in 
slot17
  
  Expected results
  No MCE and run installation w/o problem
  
  Actual results
  The kernel got MCE errors.
  
  Additional info:
  
  We tracked this issue on RHEL 9.4; it is caused by the following
  patches.
  
  cb4a6ccf3583 perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge 
(v6.8-rc1)
  388d76175bd9 perf/x86/intel/uncore: Support IIO free-running counters on GNR 
(v6.8-rc1)
  632c4bf6d007 perf/x86/intel/uncore: Support Granite Rapids (v6.8-rc1)
  b560e0cd882b perf/x86/uncore: Use u64 to replace unsigned for the uncore 
offsets array (v6.8-rc1)
  cf35791476fc perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO 
format of SPR (v6.8-rc1)
  
  [Fix]
+ Intel gave us a patch set that resolves the issue.
+ 
https://lore.kernel.org/lkml/20240614134631.1092359-1-kan.li...@linux.intel.com/#r
+ 
+ The following patches are required.
+ 
+ f8a86a9bb5f7 perf/x86/intel/uncore: Support HBM and CXL PMON counters 
(v6.11-rc1)
+ 15a4bd51853b perf/x86/uncore: Cleanup unused unit structure (v6.11-rc1)
+ f76a8420444b perf/x86/uncore: Apply the unit control RB tree to PCI uncore 
units (v6.11-rc1)
+ b1d9ea2e1ca4 perf/x86/uncore: Apply the unit control RB tree to MSR uncore 
units (v6.11-rc1)
+ 80580dae65b9 perf/x86/uncore: Apply the unit control RB tree to MMIO uncore 
units (v6.11-rc1)
+ 585463fee642 perf/x86/uncore: Retrieve the unit ID from the unit control RB 
tree (v6.11-rc1)
+ c74443d92f68 perf/x86/uncore: Support per PMU cpumask (v6.11-rc1)
+ 0007f3932592 perf/x86/uncore: Save the unit control address of all units 
(v6.11-rc1)
  
  [Where problems could occur]
  
  [Other Info]

** Description changed:

  [Impact]
  Description:
- Failed to install Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P 
+ NVidia L40 GPU in slot17.
+ Failed to install Ubuntu 24.04 on a DL380a Gen12 with Intel Sierra Forest 2P 
GPU
  
  A random write to the VF BAR0 memory region causes the kernel to hit an
  MCE (machine check exception) error.
  
  Version-Release number :
  Ubuntu 24.04
  
  How reproducible:
  Each time
  
  Steps to reproduce
  - PCI segment, Intel VT-d and SR-IOV , all enabled in the BIOS
- - Run a fresh install on a DL380a server with 2P with GPU (NVidia L40) in 
slot17
+ - Run a fresh install on a DL380a server with 2P with GPU  in slot17
  
  Expected results
  No MCE and run installation w/o problem
  
  Actual results
  The kernel got MCE errors.
  
  Additional info:
  
  We tracked this issue on RHEL 9.4; it is caused by the following
  patches.
  
  cb4a6ccf3583 perf/x86/intel/uncore: Support Sierra Forest and Grand Ridge 
(v6.8-rc1)
  388d76175bd9 perf/x86/intel/uncore: Support IIO free-running counters on GNR 
(v6.8-rc1)
  632c4bf6d007 perf/x86/intel/uncore: Support Granite Rapids (v6.8-rc1)
  b560e0cd882b perf/x86/uncore: Use u64 to replace unsigned for the uncore 
offsets array (v6.8-rc1)
  cf35791476fc perf/x86/intel/uncore: Generic uncore_get_uncores and MMIO 
format of SPR (v6.8-rc1)
  
  [Fix]
  Intel gave us a patch set that resolves the issue.
  
https://lore.kernel.org/lkml/20240614134631.1092359-1-kan.li...@linux.intel.com/#r
  
  The following patches are required.
  
  f8a86a9bb5f7 perf/x86/intel/uncore: Support HBM and CXL PMON counters 
(v6.11-rc1)
  15a4bd51853b perf/x86/uncore: Cleanup unused unit structure (v6.11-rc1)
  f76a8420444b perf/x86/uncore: Apply the unit control RB tree to PCI uncore 
units (v6.11-rc1)
  b1d9ea2e1ca4 perf/x86/uncore: Apply the unit control RB tree to MSR uncore 
units (v6.11-rc1)
  80580dae65b9 perf/x86/uncore: Apply the unit control RB tree to MMIO uncore 
units (v6.11-rc1)
  585463fee642 perf/x86/uncore: Retrieve the unit ID from the unit control RB 
tree (v6.11-rc1)
  c74443d92f68 perf/x86/uncore: Support per PMU cpumask (v6.11-rc1)
  0007f3932592 perf/x86/uncore: Save the unit control address of all units 
(v6.11-rc1)
  
  [Where problems could occur]
  
  [Other Info]

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2081079

Title:
  [SRU]Ubuntu 24.04 - It cannot be installed with DL380a Gen12 (2P, SRF-
  SP)

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  In Progress
Status in linux source package in Oracular:
  Fix Released

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2081079/+subscriptions


-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
