Hi Maira,

Looks good to me, but don't we need to do the same in
v3d_csd_job_timedout?

Iago

El dom, 27-04-2025 a las 17:28 -0300, Maíra Canal escribió:
> When a CL/CSD job times out, we check if the GPU has made any progress
> since the last timeout. If so, instead of resetting the hardware, we skip
> the reset and let the timer get rearmed. This gives long-running jobs a
> chance to complete.
> 
> However, when `timedout_job()` is called, the job in question is removed
> from the pending list, which means it won't be automatically freed through
> `free_job()`. Consequently, when we skip the reset and keep the job
> running, the job won't be freed when it finally completes.
> 
> This situation leads to a memory leak, as exposed in [1].
> 
> Similarly to commit 704d3d60fec4 ("drm/etnaviv: don't block scheduler when
> GPU is still active"), this patch ensures the job is put back on the
> pending list when extending the timeout.
> 
> Cc: sta...@vger.kernel.org # 6.0
> Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227 [1]
> Reported-by: Daivik Bhatia <dtgs1...@gmail.com>
> Signed-off-by: Maíra Canal <mca...@igalia.com>
> ---
>  drivers/gpu/drm/v3d/v3d_sched.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index b3be08b0ca91..a98dcf9d75b1 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -734,11 +734,6 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>       return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>  
> -/* If the current address or return address have changed, then the GPU
> - * has probably made progress and we should delay the reset.  This
> - * could fail if the GPU got in an infinite loop in the CL, but that
> - * is pretty unlikely outside of an i-g-t testcase.
> - */
>  static enum drm_gpu_sched_stat
>  v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
>                   u32 *timedout_ctca, u32 *timedout_ctra)
> @@ -748,9 +743,16 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
>       u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(q));
>       u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(q));
>  
> +     /* If the current address or return address have changed, then the GPU
> +      * has probably made progress and we should delay the reset.  This
> +      * could fail if the GPU got in an infinite loop in the CL, but that
> +      * is pretty unlikely outside of an i-g-t testcase.
> +      */
>       if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
>               *timedout_ctca = ctca;
>               *timedout_ctra = ctra;
> +
> +             list_add(&sched_job->list, &sched_job->sched->pending_list);
>               return DRM_GPU_SCHED_STAT_NOMINAL;
>       }
> 
> @@ -790,11 +792,13 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
>       struct v3d_dev *v3d = job->base.v3d;
>       u32 batches = V3D_CORE_READ(0, V3D_CSD_CURRENT_CFG4(v3d->ver));
>  
> -     /* If we've made progress, skip reset and let the timer get
> -      * rearmed.
> +     /* If we've made progress, skip reset, add the job to the pending
> +      * list, and let the timer get rearmed.
>        */
>       if (job->timedout_batches != batches) {
>               job->timedout_batches = batches;
> +
> +             list_add(&sched_job->list, &sched_job->sched->pending_list);
>               return DRM_GPU_SCHED_STAT_NOMINAL;
>       }
> 
