amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks

Zhang, Jesse(Jie) Wed, 12 Mar 2025 07:11:26 -0700

[AMD Official Use Only - AMD Internal Distribution Only]

From: Koenig, Christian <christian.koe...@amd.com>
Sent: Wednesday, March 12, 2025 4:39 PM
To: Zhang, Jesse(Jie) <jesse.zh...@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander <alexander.deuc...@amd.com>; Kim, Jonathan 
<jonathan....@amd.com>; Zhu, Jiadong <jiadong....@amd.com>
Subject: Re: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by 
removing dynamic callbacks

Am 12.03.25 um 09:15 schrieb Zhang, Jesse(Jie):

[SNIP9

-

+     gfx_ring->funcs->stop_queue(adev, instance_id);

Yeah that starts to look good. Question here is who is calling 
amdgpu_sdma_reset_engine()?

If this call comes from engine specific code we might not need the 
start/stop_queue callbacks all together.

    Kfd and sdma v4/v5/v5_2 will call amdgpu_sdma_reset_engine, and 
start/stop_queue callbacks are only implemented in sdmav4/sdmav5/sdma5_2.

Why would the KFD call this as well? Because it detects an issue with a SDMA 
user queue  If yes I would rather suggest that the KFD calls the reset function 
of the paging queue.

Since this reset function is specific to the SDMA HW generation anyway you 
don't need those extra functions to abstract starting and stopping of the queue 
for each HW generation.

kfd can't call reset function directly, unless we add a parameter src  to 
distinguish kfd and kgd in reset function, like this:
int (*reset)(struct amdgpu_ring *ring, unsigned int vmid, int src );
As Alex said in another thread,
We need to distinguish  kfd and kgd  in reset.
If kfd triggers a reset, kgd must save healthy jobs and recover jobs after 
reset.
If kgd triggers a reset, kgd must abandon bad jobs after reset.(and perhaps kfd 
needs to save its healthy jobs for recovery).

If we can add a parameter, I am ok for that solution too.

Additionally:
For sdma6/7, when a queue reset fails, we may need to fall back to an engine 
reset for a attempt.

Thanks
Jesse

Regards,
Christian.

Thanks

Jesse

Regards,

Christian.

      /* Perform the SDMA reset for the specified instance */

      ret = amdgpu_dpm_reset_sdma(adev, 1 << instance_id);

      if (ret) {

@@ -591,18 +573,7 @@ int amdgpu_sdma_reset_engine(struct amdgpu_device *adev, 
uint32_t instance_id, b

              goto exit;

      }

-     /* Invoke all registered post_reset callbacks */

-     list_for_each_entry(funcs, &adev->sdma.reset_callback_list, list) {

-             if (funcs->post_reset) {

-                     ret = funcs->post_reset(adev, instance_id);

-                     if (ret) {

-                             dev_err(adev->dev,

-                             "afterReset callback failed for instance %u: 
%d\n",

-                                     instance_id, ret);

-                             goto exit;

-                     }

-             }

-     }

+     gfx_ring->funcs->start_queue(adev, instance_id);

 exit:

      /* Restart the scheduler's work queue for the GFX and page rings

diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c

b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c

index fd34dc138081..c1f7ccff9c4e 100644

--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c

+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c

@@ -2132,6 +2132,8 @@ static const struct amdgpu_ring_funcs 
sdma_v4_4_2_ring_funcs = {

      .emit_reg_wait = sdma_v4_4_2_ring_emit_reg_wait,

      .emit_reg_write_reg_wait = amdgpu_ring_emit_reg_write_reg_wait_helper,

      .reset = sdma_v4_4_2_reset_queue,

+     .stop_queue = sdma_v4_4_2_stop_queue,

+     .start_queue = sdma_v4_4_2_restore_queue,

      .is_guilty = sdma_v4_4_2_ring_is_guilty,  };

RE: [PATCH 1/7] drm/amd/amdgpu: Simplify SDMA reset mechanism by removing dynamic callbacks

Reply via email to