On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a failed
> ib test before mode 1 reset during asic reset. The fences in this case
> are never signaled and next time when we try to free the sa_bo, kernel
> will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be
> released in job_free_cb, and for non-sched jobs like ib_test, it's meant
> to be released in sa_bo_free but only when the fences are signaled. So
> we have to force signal the non_sched bad job's fence during
> pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang <yubiao.w...@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>  {
>       int i;
>       struct dma_fence *old, **ptr;
> +     struct amdgpu_job *job;
>  
>       for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>               ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)
>               if (old && old->ops == &amdgpu_job_fence_ops) {
>                       RCU_INIT_POINTER(*ptr, NULL);
>                       dma_fence_put(old);
> +                     /* For non-sched bad job, i.e. failed ib test, we need 
> to force
> +                      * signal it right here or we won't be able to track 
> them in fence drv
> +                      * and they will remain unsignaled during sa_bo free.
> +                      */
> +                     job = container_of(old, struct amdgpu_job, hw_fence);
> +                     if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +                             dma_fence_signal(old);

Hi YuBiao,

Thanks for adding the clarifying comments and sending a v2 of this patch.

Perhaps move the chunk you're adding, to sit before, rather than after,
the statements of the if-conditional. Also move the "job" variable
declaration to be inside the if-conditional, since it is not used
by anything outside it. Something like this, (note a few small fixes to the 
comment),
        if (old && old->ops == &amdgpu_job_fence_ops) {
                struct amdgpu_job *job;

                /* For non-scheduler bad job, i.e. failed IB test, we need to 
signal
                 * it right here or we won't be able to track them in fence_drv
                 * and they will remain unsignaled during sa_bo free.
                 */
                job = container_of(old, struct amdgpu_job, hw_fence);
                if (!job->base.s_fence && !dma_fence_is_signaled(old))
                        dma_fence_signal(old);
                RCU_INIT_POINTER(*ptr, NULL);
                dma_fence_put(old);
        }
Then, give it a test.
With that change, and upon successful test results, this patch is,
Acked-by: Luben Tuikov <luben.tui...@amd.com>
-- 
Regards,
Luben

Reply via email to