Hi Maíra,

Looks good to me, but don't we need to do the same in v3d_csd_job_timedout?
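To spell out my reasoning, here is a rough sketch of the timeout path as I
understand drm_sched_job_timedout() to behave in recent kernels. The function
name below is made up and locking/TDR details are omitted, so please read it
as my assumption about the scheduler rather than the actual scheduler code:

#include <linux/list.h>
#include <drm/gpu_scheduler.h>

/* Simplified sketch, not the real drm_sched_job_timedout(). */
static void timedout_path_sketch(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job;

	/* The scheduler picks the oldest job on the pending list... */
	job = list_first_entry_or_null(&sched->pending_list,
				       struct drm_sched_job, list);
	if (!job)
		return;

	/* ...and unlinks it *before* calling the driver hook. */
	list_del_init(&job->list);

	/*
	 * If the hook skips the reset and keeps the job running, the job
	 * is no longer on pending_list, and completed jobs are only handed
	 * to free_job() from that list. Hence the list_add() back onto
	 * sched->pending_list in this patch.
	 */
	sched->ops->timedout_job(job);
}

If that reading is correct, every early return that keeps a timed-out job
alive needs the list_add() back onto the pending list.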
Iago

On Sun, 27-04-2025 at 17:28 -0300, Maíra Canal wrote:
> When a CL/CSD job times out, we check if the GPU has made any progress
> since the last timeout. If so, instead of resetting the hardware, we skip
> the reset and let the timer get rearmed. This gives long-running jobs a
> chance to complete.
>
> However, when `timedout_job()` is called, the job in question is removed
> from the pending list, which means it won't be automatically freed through
> `free_job()`. Consequently, when we skip the reset and keep the job
> running, the job won't be freed when it finally completes.
>
> This situation leads to a memory leak, as exposed in [1].
>
> Similarly to commit 704d3d60fec4 ("drm/etnaviv: don't block scheduler when
> GPU is still active"), this patch ensures the job is put back on the
> pending list when extending the timeout.
>
> Cc: sta...@vger.kernel.org # 6.0
> Link: https://gitlab.freedesktop.org/mesa/mesa/-/issues/12227 [1]
> Reported-by: Daivik Bhatia <dtgs1...@gmail.com>
> Signed-off-by: Maíra Canal <mca...@igalia.com>
> ---
>  drivers/gpu/drm/v3d/v3d_sched.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index b3be08b0ca91..a98dcf9d75b1 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -734,11 +734,6 @@ v3d_gpu_reset_for_timeout(struct v3d_dev *v3d, struct drm_sched_job *sched_job)
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
>
> -/* If the current address or return address have changed, then the GPU
> - * has probably made progress and we should delay the reset. This
> - * could fail if the GPU got in an infinite loop in the CL, but that
> - * is pretty unlikely outside of an i-g-t testcase.
> - */
>  static enum drm_gpu_sched_stat
>  v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
>  		    u32 *timedout_ctca, u32 *timedout_ctra)
> @@ -748,9 +743,16 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
>  	u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(q));
>  	u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(q));
>
> +	/* If the current address or return address have changed, then the GPU
> +	 * has probably made progress and we should delay the reset. This
> +	 * could fail if the GPU got in an infinite loop in the CL, but that
> +	 * is pretty unlikely outside of an i-g-t testcase.
> +	 */
>  	if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
>  		*timedout_ctca = ctca;
>  		*timedout_ctra = ctra;
> +
> +		list_add(&sched_job->list, &sched_job->sched->pending_list);
>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>  	}
>
> @@ -790,11 +792,13 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
>  	struct v3d_dev *v3d = job->base.v3d;
>  	u32 batches = V3D_CORE_READ(0, V3D_CSD_CURRENT_CFG4(v3d->ver));
>
> -	/* If we've made progress, skip reset and let the timer get
> -	 * rearmed.
> +	/* If we've made progress, skip reset, add the job to the pending
> +	 * list, and let the timer get rearmed.
>  	 */
>  	if (job->timedout_batches != batches) {
>  		job->timedout_batches = batches;
> +
> +		list_add(&sched_job->list, &sched_job->sched->pending_list);
>  		return DRM_GPU_SCHED_STAT_NOMINAL;
>  	}
>