On 14.07.25 12:16, Philipp Stanner wrote:
> On Mon, 2025-07-14 at 11:23 +0200, Christian König wrote:
>> On 13.07.25 21:03, Maíra Canal wrote:
>>> Hi Christian,
>>>
>>> On 11/07/25 12:20, Christian König wrote:
>>>> On 11.07.25 15:37, Philipp Stanner wrote:
>>>>> On Fri, 2025-07-11 at 15:22 +0200, Christian König wrote:
>>>>>>
>>>>>>
>>>>>> On 08.07.25 15:25, Maíra Canal wrote:
>>>>>>> When the DRM scheduler times out, it's possible that the GPU isn't
>>>>>>> hung; instead, a job just took unusually long (longer than the
>>>>>>> timeout) but is still running, and there is, thus, no reason to
>>>>>>> reset the hardware. This can occur in two scenarios:
>>>>>>>
>>>>>>> 1. The job is taking longer than the timeout, but the driver
>>>>>>>    determined through a GPU-specific mechanism that the hardware
>>>>>>>    is still making progress. Hence, the driver would like the
>>>>>>>    scheduler to skip the timeout and treat the job as still
>>>>>>>    pending from then onward. This happens in v3d, Etnaviv, and Xe.
>>>>>>> 2. The timeout has fired before the free-job worker.
>>>>>>>    Consequently, the scheduler calls `sched->ops->timedout_job()`
>>>>>>>    for a job that isn't timed out.
>>>>>>>
>>>>>>> These two scenarios are problematic because the job was removed
>>>>>>> from the `sched->pending_list` before calling
>>>>>>> `sched->ops->timedout_job()`, which means that when the job
>>>>>>> finishes, it won't be freed by the scheduler through
>>>>>>> `sched->ops->free_job()` - leading to a memory leak.
>>>>>>
>>>>>> Yeah, that is unfortunately intentional.
>>>>>>
>>>>>>> To solve these problems, create a new `drm_gpu_sched_stat`,
>>>>>>> called DRM_GPU_SCHED_STAT_NO_HANG, which allows a driver to skip
>>>>>>> the reset. The new status will indicate that the job must be
>>>>>>> reinserted into `sched->pending_list`, and the hardware/driver
>>>>>>> will still complete that job.
>>>>>>
>>>>>> Well, long story short: we have already tried this, and the whole
>>>>>> approach doesn't work correctly in all cases. See the git history
>>>>>> around how we used to destroy the jobs.
>>>>>>
>>>>>> The basic problem is that you can always race between timing out
>>>>>> and signaling/destroying the job. This is the long-lasting job
>>>>>> lifetime issue we have already discussed more than once.
>>>>>
>>>>> The scheduler destroys the job, through free_job().
>>>>> I think we have agreed that for now the scheduler is the party
>>>>> responsible for the job lifetime.
>>>>
>>>> That's what I strongly disagree on. The job is just a state bag
>>>> between the submission and scheduling state of a submission.
>>>>
>>>> For the scheduler, control starts when the job is pushed into the
>>>> entity and ends when run_job is called.
>>>>
>>>> The real representation of the submission is the scheduler fence,
>>>> and that object has a perfectly defined lifetime, state, and error
>>>> handling.
>>>>
>>>>>>
>>>>>> If you want to fix this, I think the correct approach is to
>>>>>> completely drop tracking jobs in the scheduler at all.
>>>>>
>>>>> I don't see how this series introduces a problem?
>>>>>
>>>>> The fact is that drivers are abusing the API by just firing jobs
>>>>> back into the scheduler's job list. This series legalizes the
>>>>> abuse by providing scheduler functionality for that.
>>>>>
>>>>> IOW, the series improves the situation but does not add a *new*
>>>>> problem. Even less so as drivers aren't forced to use the new
>>>>> status code, but can continue having job completion race with
>>>>> timeout handlers.
>>>>
>>>> Maybe yes, but I'm really not sure about it.
>>>>
>>>> Take a look at the git history of job destruction; we already had
>>>> exactly that approach, removed it, and said that leaking memory is
>>>> at least better than a use-after-free issue.
>>>>
>>>
>>> If the job was removed from the pending list at the beginning of the
>>> timeout, and drm_sched_get_finished_job() fetches jobs from the
>>> pending list, how can we end up with a use-after-free issue?
>>
>> By adding it back into the list after the timeout handling completed.
>>
>> We had exactly that before we came up with the horrible design of
>> adding it back to the pending list in drm_sched_stop() and the
>> free_guilty flag.
>>
>> Could be that your approach now works better, but I'm really not sure
>> about that.
>
> This isn't Maíra's approach; this is an approach that is already
> established practice in DRM. Have you looked at the code? There isn't
> that much magic happening there.
>
> The job gets added back to pending_list. For the scheduler, this is
> like a completely new job, with the timeout handler running again and
> the other work items picking the job up when appropriate.
>
> Maíra ports several drivers which have been (illegally, outside our
> API) doing that, and it worked for them.
Oh, good point! I completely missed that.

> Now there's a centralized solution; one which is only active for the
> drivers which choose to use it. Other drivers aren't affected.
>
> IOW, my main argument still stands: this series doesn't add a new bug,
> but improves the overall situation in DRM.

Yeah, if it's just a cleanup then please go ahead with it.

Certainly better to have it centralized in the scheduler than every
driver doing its own thing.

Regards,
Christian.

>
>
> P.
>
>>
>> Regards,
>> Christian.
>>
>>>
>>> Best Regards,
>>> - Maíra
>>>
>>
>