Hi Ashley, On 10/04/2025 13:57, Ashley Smith wrote: > The timeout logic provided by drm_sched leads to races when we try > to suspend it while the drm_sched workqueue queues more jobs. Let's > overhaul the timeout handling in panthor to have our own delayed work > that's resumed/suspended when a group is resumed/suspended. When an > actual timeout occurs, we call drm_sched_fault() to report it > through drm_sched, still. But otherwise, the drm_sched timeout is > disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of > how we protect modifications on the timer. > > One issue seems to be when we call drm_sched_suspend_timeout() from > both queue_run_job() and tick_work() which could lead to races due to > drm_sched_suspend_timeout() not having a lock. Another issue seems to > be in queue_run_job() if the group is not scheduled, we suspend the > timeout again which undoes what drm_sched_job_begin() did when calling > drm_sched_start_timeout(). So the timeout does not reset when a job > is finished. > > Co-developed-by: Boris Brezillon <boris.brezil...@collabora.com> > Signed-off-by: Boris Brezillon <boris.brezil...@collabora.com> > Tested-by: Daniel Stone <dani...@collabora.com> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block") > Signed-off-by: Ashley Smith <ashley.sm...@collabora.com> > --- > drivers/gpu/drm/panthor/panthor_sched.c | 244 +++++++++++++++++------- > 1 file changed, 177 insertions(+), 67 deletions(-) > > diff --git a/drivers/gpu/drm/panthor/panthor_sched.c > b/drivers/gpu/drm/panthor/panthor_sched.c > index 446ec780eb4a..32f5a75bc4f6 100644 > --- a/drivers/gpu/drm/panthor/panthor_sched.c > +++ b/drivers/gpu/drm/panthor/panthor_sched.c
[...] > @@ -2727,8 +2784,17 @@ void panthor_sched_suspend(struct panthor_device > *ptdev) > * automatically terminate all active groups, so let's > * force the state to halted here. > */ > - if (csg_slot->group->state != > PANTHOR_CS_GROUP_TERMINATED) > + if (csg_slot->group->state != > PANTHOR_CS_GROUP_TERMINATED) { > csg_slot->group->state = > PANTHOR_CS_GROUP_TERMINATED; > + > + /* Reset the queue slots manually if the > termination > + * request failed. > + */ > + for (i = 0; i < group->queue_count; i++) { > + if (group->queues[i]) > + cs_slot_reset_locked(ptdev, > csg_id, i); > + } > + } > slot_mask &= ~BIT(csg_id); > } > } So this seems to be the only change from v2 (a changelog can be helpful!). And I'm not convinced it belongs in this patch? It's not just "[making] the timeout per-queue instead of per-job". I haven't dug through the details, but I think this belongs in a separate patch. Thanks, Steve