From: Yang Wang <[email protected]>

[ Upstream commit a6571045cf06c4aa749b4801382ae96650e2f0e1 ]

Older versions of the MES firmware may cause abnormal GPU power
consumption. When performing inference tasks on the GPU (e.g., with
Ollama using ROCm), the GPU may show abnormal power consumption in idle
state and incorrect GPU load information. This issue has been fixed in
firmware version 0x8b and newer.

Closes: https://github.com/ROCm/ROCm/issues/5706
Signed-off-by: Yang Wang <[email protected]>
Acked-by: Alex Deucher <[email protected]>
Signed-off-by: Alex Deucher <[email protected]>
(cherry picked from commit 4e22a5fe6ea6e0b057e7f246df4ac3ff8bfbc46a)
Signed-off-by: Sasha Levin <[email protected]>
---

LLM Generated explanations, may be completely bogus:

## Analysis

### What the commit fixes

This commit fixes abnormal GPU power consumption in idle state for AMD gfx v12 hardware when running with MES firmware versions older than 0x8b. Users running GPU inference workloads (e.g., Ollama with ROCm) experience the GPU staying in high power state even when idle, with incorrect GPU load reporting. The fix is tracked in a real bug report: ROCm/ROCm#5706.

### Code change analysis

The change is minimal and surgical:

1. **Adds firmware version detection** (3 lines): Creates a `mes_rev` variable that extracts the MES firmware revision from either `sched_version` or `kiq_version` depending on the pipe type, masked with `AMDGPU_MES_VERSION_MASK` (0x00000fff).

2. **Conditionally sets oversubscription timer** (1 line changed): Changes `oversubscription_timer = 50` to `oversubscription_timer = mes_rev < 0x8b ? 0 : 50`. For older firmware, the timer is disabled (0 = disabled per the comment). For newer firmware (>= 0x8b where the bug is fixed), behavior is unchanged.

This follows an established pattern already present in the same function at line 782, which checks `sched_version >= 0x82` for the LR compute workaround.

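The two steps in the code change analysis above can be sketched in isolation as plain C. This is only an illustrative sketch, not the driver code: `struct mes_versions`, `enum mes_pipe`, `mes_revision()` and `pick_oversubscription_timer()` are hypothetical stand-ins invented here, and the sample version words are made up. Only the mask value, the 0x8b threshold and the 0/50 timer values are taken from the patch and the analysis.

```c
#include <stdint.h>

/* Mask value as quoted in the analysis above (amdgpu_mes.h). */
#define AMDGPU_MES_VERSION_MASK 0x00000fff

/* Hypothetical stand-ins for the driver's pipe identifiers. */
enum mes_pipe { MES_SCHED_PIPE = 0, MES_KIQ_PIPE = 1 };

/* Hypothetical, trimmed-down stand-in for struct amdgpu_mes. */
struct mes_versions {
	uint32_t sched_version;
	uint32_t kiq_version;
};

/* Step 1: pick the version word for the pipe and keep only the revision bits. */
static uint32_t mes_revision(const struct mes_versions *mes, enum mes_pipe pipe)
{
	uint32_t raw = (pipe == MES_SCHED_PIPE) ? mes->sched_version
						: mes->kiq_version;

	return raw & AMDGPU_MES_VERSION_MASK;
}

/* Step 2: firmware older than 0x8b gets the timer disabled (0), otherwise 50. */
static uint32_t pick_oversubscription_timer(uint32_t mes_rev)
{
	return mes_rev < 0x8b ? 0 : 50;
}
```

With a made-up scheduler version word of `0x0000108a`, `mes_revision()` yields `0x08a` and the timer is disabled; a word whose low 12 bits are `0x08b` or higher keeps the previous value of 50.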
### Stable kernel criteria assessment

- **Fixes a real bug**: Yes - abnormal idle power consumption is a real user-facing issue
- **Obviously correct**: Yes - the pattern is well-established in this file
- **Small and contained**: Yes - 3 lines added, 1 line modified, single file
- **No new features**: Correct - this is a firmware workaround/quirk
- **Risk assessment**: Very low - newer firmware behavior is unchanged; only disables the oversubscription timer for older firmware that can't handle it properly

### Classification

This is a **firmware quirk/workaround**, which falls under the "QUIRKS and WORKAROUNDS" exception category for stable trees. It's analogous to USB quirks or PCI quirks - working around buggy firmware behavior in a targeted way.

### Applicability

The file `mes_v12_0.c` was introduced in v6.11-rc1, so this fix is applicable to stable trees 6.11.y and later that support gfx v12 hardware.

### Verification

- Verified `AMDGPU_MES_VERSION_MASK` is defined as `0x00000fff` in `amdgpu_mes.h:40`
- Verified `sched_version` and `kiq_version` fields exist in the `amdgpu_mes` structure (`amdgpu_mes.h:78-79`)
- Verified the same firmware-version-check pattern already exists at `mes_v12_0.c:782` (checks `>= 0x82` for LR compute workaround)
- Verified `mes_v12_0.c` was first added in commit `785f0f9fe7420` ("drm/amdgpu: Add mes v12_0 ip block support (v4)"), first present in v6.11-rc1
- Verified the current code at line 793 still has the unconditional `oversubscription_timer = 50` (the fix is not yet applied on this branch)
- Verified the commit was acked by Alex Deucher (AMD GPU maintainer)
- Verified the upstream commit `4e22a5fe6ea6e0b` exists and is authored by Yang Wang
- Could NOT directly verify the ROCm issue #5706 content (would require web fetch, but the commit message description is clear)

### Conclusion

This is a small, well-scoped firmware workaround that fixes a real user-facing power consumption bug on AMD gfx v12 hardware.
It follows established patterns in the codebase, carries minimal regression risk, and is acked by the subsystem maintainer. It meets all stable kernel criteria as a hardware/firmware quirk.

**YES**

 drivers/gpu/drm/amd/amdgpu/mes_v12_0.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
index 744e95d3984ad..0d7e2dc414a81 100644
--- a/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/mes_v12_0.c
@@ -731,6 +731,9 @@ static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe)
 	int i;
 	struct amdgpu_device *adev = mes->adev;
 	union MESAPI_SET_HW_RESOURCES mes_set_hw_res_pkt;
+	uint32_t mes_rev = (pipe == AMDGPU_MES_SCHED_PIPE) ?
+		(mes->sched_version & AMDGPU_MES_VERSION_MASK) :
+		(mes->kiq_version & AMDGPU_MES_VERSION_MASK);
 
 	memset(&mes_set_hw_res_pkt, 0, sizeof(mes_set_hw_res_pkt));
 
@@ -790,7 +793,7 @@ static int mes_v12_0_set_hw_resources(struct amdgpu_mes *mes, int pipe)
 	 * handling support, other queue will not use the oversubscribe timer.
 	 * handling mode - 0: disabled; 1: basic version; 2: basic+ version
 	 */
-	mes_set_hw_res_pkt.oversubscription_timer = 50;
+	mes_set_hw_res_pkt.oversubscription_timer = mes_rev < 0x8b ? 0 : 50;
 	mes_set_hw_res_pkt.unmapped_doorbell_handling = 1;
 
 	if (amdgpu_mes_log_enable) {
-- 
2.51.0
