From: Yang Wang <[email protected]>

[ Upstream commit a6571045cf06c4aa749b4801382ae96650e2f0e1 ]

Older versions of the MES firmware may cause abnormal GPU power consumption.
When performing inference tasks on the GPU (e.g., with Ollama using ROCm),
the GPU may show abnormal power consumption in idle state and incorrect GPU 
load information.
This issue has been fixed in firmware version 0x8b and newer.

Closes: https://github.com/ROCm/ROCm/issues/5706
Signed-off-by: Yang Wang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 4e22a5fe6ea6e0b057e7f246df4ac3ff8bfbc46a)
Signed-off-by: Sasha Levin <[email protected]>
---

LLM Generated explanations, may be completely bogus:

## Analysis

### What the commit fixes

This commit fixes abnormal GPU power consumption in idle state for AMD
gfx v12 hardware when running with MES firmware versions older than
0x8b. Users running GPU inference workloads (e.g., Ollama with ROCm)
see the GPU stay in a high power state even when idle, with incorrect
GPU load reporting. The issue is tracked in a public bug report:
ROCm/ROCm#5706.

### Code change analysis

The change is minimal and surgical:

1. **Adds firmware version detection** (3 lines): Creates a `mes_rev`
   variable that extracts the MES firmware revision from either
   `sched_version` or `kiq_version` depending on the pipe type, masked
   with `AMDGPU_MES_VERSION_MASK` (0x00000fff).

2. **Conditionally sets oversubscription timer** (1 line changed):
   Changes `oversubscription_timer = 50` to `oversubscription_timer =
   mes_rev < 0x8b ? 0 : 50`. For firmware older than 0x8b, the timer is
   set to 0, which is taken here to disable it (the `0: disabled` note
   in the adjacent comment describes the handling mode, i.e.
   `unmapped_doorbell_handling`, not the timer). For newer firmware
   (>= 0x8b, where the bug is fixed), behavior is unchanged.

This follows an established pattern already present in the same function
at line 782, which checks `sched_version >= 0x82` for the LR compute
workaround.
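
The version gate described above can be sketched in isolation. This is a
minimal userspace illustration, not the driver code: `MES_VERSION_MASK`
mirrors the `AMDGPU_MES_VERSION_MASK` value quoted above, while
`MES_FIXED_REV`, `mes_rev_from_version()`,
`oversubscription_timer_for()`, `check_examples()`, and the raw version
words are hypothetical names and inputs chosen only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors AMDGPU_MES_VERSION_MASK as quoted in the analysis (0x00000fff). */
#define MES_VERSION_MASK 0x00000fffu

/* First MES firmware revision containing the fix, per the commit message. */
#define MES_FIXED_REV 0x8bu

/* Hypothetical helper: keep only the revision bits of a raw version word. */
static uint32_t mes_rev_from_version(uint32_t raw_version)
{
	return raw_version & MES_VERSION_MASK;
}

/* Hypothetical helper: timer value the patched code would program. */
static uint32_t oversubscription_timer_for(uint32_t raw_version)
{
	return mes_rev_from_version(raw_version) < MES_FIXED_REV ? 0 : 50;
}

/* Hypothetical self-check; the raw version words are made-up inputs. */
static void check_examples(void)
{
	assert(oversubscription_timer_for(0x0000008a) == 0);  /* older firmware */
	assert(oversubscription_timer_for(0x0000008b) == 50); /* fixed firmware */
	assert(oversubscription_timer_for(0x0001208a) == 0);  /* upper bits ignored */
}
```

Calling `check_examples()` from a test harness exercises the boundary at
0x8b. The third case shows why the mask matters: without it, the upper
bits of the version word would make the comparison against 0x8b pass for
firmware that still has the bug.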

### Stable kernel criteria assessment

- **Fixes a real bug**: Yes - abnormal idle power consumption is a real
  user-facing issue
- **Obviously correct**: Yes - the pattern is well-established in this
  file
- **Small and contained**: Yes - 3 lines added, 1 line modified (4
  insertions, 1 deletion), single file
- **No new features**: Correct - this is a firmware workaround/quirk
- **Risk assessment**: Very low - newer firmware behavior is unchanged;
  only disables the oversubscription timer for older firmware that can't
  handle it properly

### Classification

This is a **firmware quirk/workaround**, which falls under the "QUIRKS
and WORKAROUNDS" exception category for stable trees. It's analogous to
USB quirks or PCI quirks - working around buggy firmware behavior in a
targeted way.

### Applicability

The file `mes_v12_0.c` was introduced in v6.11-rc1, so this fix is
applicable to stable trees 6.11.y and later that support gfx v12
hardware.

### Verification

- Verified `AMDGPU_MES_VERSION_MASK` is defined as `0x00000fff` in
  `amdgpu_mes.h:40`
- Verified `sched_version` and `kiq_version` fields exist in the
  `amdgpu_mes` structure (`amdgpu_mes.h:78-79`)
- Verified the same firmware-version-check pattern already exists at
  `mes_v12_0.c:782` (checks `>= 0x82` for LR compute workaround)
- Verified `mes_v12_0.c` was first added in commit `785f0f9fe7420`
  ("drm/amdgpu: Add mes v12_0 ip block support (v4)"), first present in
  v6.11-rc1
- Verified the current code at line 793 still has the unconditional
  `oversubscription_timer = 50` (the fix is not yet applied on this
  branch)
- Verified the commit was acked by Alex Deucher (AMD GPU maintainer)
- Verified the upstream commit `4e22a5fe6ea6e0b` exists and is authored
  by Yang Wang
- Could NOT directly verify the ROCm issue #5706 content (would require
  web fetch, but the commit message description is clear)

### Conclusion

This is a small, well-scoped firmware workaround that fixes a real
user-facing power consumption bug on AMD gfx v12 hardware. It follows
established patterns in the codebase, carries minimal regression risk,
and is acked by the subsystem maintainer. It meets all stable kernel
criteria as a hardware/firmware quirk.

**YES**

 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
index 744e95d3984ad..0d7e2dc414a81 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
@@ -731,6 +731,9 @@ static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe)
        int i;
        struct amdgpu_device *adev = mes->adev;
        union MESAPI_SET_HW_RESOURCES mes_set_hw_res_pkt;
+       uint32_t mes_rev = (pipe == AMDGPU_MES_SCHED_PIPE) ?
+               (mes->sched_version & AMDGPU_MES_VERSION_MASK) :
+               (mes->kiq_version & AMDGPU_MES_VERSION_MASK);
 
        memset(&mes_set_hw_res_pkt, 0, sizeof(mes_set_hw_res_pkt));
 
@@ -790,7 +793,7 @@ static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe)
         * handling support, other queue will not use the oversubscribe timer.
         * handling  mode - 0: disabled; 1: basic version; 2: basic+ version
         */
-       mes_set_hw_res_pkt.oversubscription_timer = 50;
+       mes_set_hw_res_pkt.oversubscription_timer = mes_rev < 0x8b ? 0 : 50;
        mes_set_hw_res_pkt.unmapped_doorbell_handling = 1;
 
        if (amdgpu_mes_log_enable) {
-- 
2.51.0
