amdgpu: Fix GPU reset handling after WGR is triggered

Nikola Petrovic Fri, 25 Apr 2025 08:39:50 -0700

The existing GPU reset mechanism has a critical issue that causes a BSOD on
the Windows Hyper-V platform. The function

    static void xgpu_nv_mailbox_flr_work(struct work_struct *work)

sets the AMDGPU_HOST_FLR flag whenever any engine is hanging. This wrongly
assumes that the host PF is always the one that triggers FLR. A VF (VM guest)
can also cause a GPU hang, and in that case the VM reset does not succeed.
The host side then hits a FULL_ACCESS_TIMEOUT and makes six attempts to
retrigger a Whole Guest Reset (WGR); the BSOD follows after five to six
failed restarts.

Additionally, the current sequence sends a READY_TO_RESTART event via
amdgpu_virt_ready_to_reset(), waits for the reset, and only then requests
FULL_ACCESS via amdgpu_virt_request_full_gpu(), which seems incorrect.

Fix this by calling amdgpu_virt_wait_reset() first and returning on failure,
then using REQ_GPU_RESET (amdgpu_virt_reset_gpu()) to initiate the necessary
restart. The explicit amdgpu_virt_request_full_gpu() call in the
AMDGPU_HOST_FLR branch is dropped. The change passed 100 loop tests.

Signed-off-by: Nikola Petrovic <nipet...@amd.com>
---
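The reordering described in the commit message can be sketched outside the
kernel as a small mock. This is only an illustration of the call order in the
AMDGPU_HOST_FLR branch before and after the hunk below. Every mock_* name, the
call log, and the wait_result parameter are hypothetical and do not exist in
the driver; the mock does not model mailbox traffic or host behavior.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the calls in the hunk; not kernel code. */
enum mock_call {
	MOCK_READY_TO_RESET,   /* stands for amdgpu_virt_ready_to_reset() */
	MOCK_WAIT_RESET,       /* stands for amdgpu_virt_wait_reset() */
	MOCK_REQUEST_FULL_GPU, /* stands for amdgpu_virt_request_full_gpu() */
	MOCK_RESET_GPU,        /* stands for amdgpu_virt_reset_gpu() */
};

#define MOCK_MAX_CALLS 8

/* Records the order in which the mocked calls are issued. */
struct mock_log {
	enum mock_call call[MOCK_MAX_CALLS];
	size_t n;
};

static void mock_record(struct mock_log *log, enum mock_call c)
{
	if (log->n < MOCK_MAX_CALLS)
		log->call[log->n++] = c;
}

/* AMDGPU_HOST_FLR branch before the patch. */
static int mock_host_flr_old(struct mock_log *log, int fed_status)
{
	if (!fed_status)
		mock_record(log, MOCK_READY_TO_RESET);
	mock_record(log, MOCK_WAIT_RESET);        /* result was ignored */
	mock_record(log, MOCK_REQUEST_FULL_GPU);
	return 0; /* assumption: the full-access request succeeds */
}

/*
 * AMDGPU_HOST_FLR branch after the patch. wait_result is an assumed input
 * that models the return value of the wait step.
 */
static int mock_host_flr_new(struct mock_log *log, int fed_status,
			     int wait_result)
{
	mock_record(log, MOCK_WAIT_RESET);
	if (wait_result)
		return wait_result;               /* bail out before any request */
	if (!fed_status)
		mock_record(log, MOCK_RESET_GPU); /* result ignored, as in the hunk */
	return 0;
}
```

With fed_status == 0, the old ordering logs READY_TO_RESET, WAIT_RESET,
REQUEST_FULL_GPU, while the new ordering logs WAIT_RESET, RESET_GPU and stops
after WAIT_RESET when the wait fails. The mock also mirrors the fact that the
hunk does not check the return value of amdgpu_virt_reset_gpu() in this
branch.
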
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 7f354cd532dc..a2a436707200 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -5314,11 +5314,12 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
        struct amdgpu_hive_info *hive = NULL;
 
        if (test_bit(AMDGPU_HOST_FLR, &reset_context->flags)) {
+               r = amdgpu_virt_wait_reset(adev);
+               if (r)
+                       return r;
                if (!amdgpu_ras_get_fed_status(adev))
-                       amdgpu_virt_ready_to_reset(adev);
-               amdgpu_virt_wait_reset(adev);
+                       amdgpu_virt_reset_gpu(adev);
                clear_bit(AMDGPU_HOST_FLR, &reset_context->flags);
-               r = amdgpu_virt_request_full_gpu(adev, true);
        } else {
                r = amdgpu_virt_reset_gpu(adev);
        }
-- 
2.43.0
