On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote: > On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote: > > On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote: > > > > Sorry I don't see the argument for the claim it is relying on the > > > internals > > > with the re-positioned drm_sched_fini call. In that case it is > > > fully > > > compliant with: > > > > > > /** > > > * drm_sched_fini - Destroy a gpu scheduler > > > * > > > * @sched: scheduler instance > > > * > > > * Tears down and cleans up the scheduler. > > > * > > > * This stops submission of new jobs to the hardware through > > > * drm_sched_backend_ops.run_job(). Consequently, > > > drm_sched_backend_ops.free_job() > > > * will not be called for all jobs still in > > > drm_gpu_scheduler.pending_list. > > > * There is no solution for this currently. Thus, it is up to the > > > driver to > > > make > > > * sure that: > > > * > > > * a) drm_sched_fini() is only called after for all submitted jobs > > > * drm_sched_backend_ops.free_job() has been called or that > > > * b) the jobs for which drm_sched_backend_ops.free_job() has not > > > been > > > called > > > * > > > * FIXME: Take care of the above problem and prevent this function > > > from > > > leaking > > > * the jobs in drm_gpu_scheduler.pending_list under any > > > circumstances. > > > > > > ^^^ recommended solution b). > > > > This has been introduced recently with commit baf4afc58314 > > ("drm/sched: Improve > > teardown documentation") and I do not agree with this. The scheduler > > should > > *not* make any promises about implementation details to enable > > drivers to abuse > > their knowledge about component internals. > > > > This makes the problem *worse* as it encourages drivers to rely on > > implementation details, making maintainability of the scheduler even > > worse. > > > > For instance, what if I change the scheduler implementation, such > > that for every > > entry in the pending_list the scheduler allocates another internal > > object for > > ${something}? Then drivers would already fall apart leaking those > > internal > > objects. > > > > Now, obviously that's pretty unlikely, but I assume you get the idea. > > > > The b) paragraph in drm_sched_fini() should be removed for the given > > reasons. > > > > AFAICS, since the introduction of this commit, driver implementations > > haven't > > changed in this regard, hence we should be good. > > > > So, for me this doesn't change the fact that every driver > > implementation that > > just stops the scheduler at an arbitrary point of time and tries to > > clean things > > up manually relying on knowledge about component internals is broken. > > To elaborate on that, this documentation has been written so that we at > least have *some* documentation about the problem, instead of just > letting new drivers run into the knife. > > The commit explicitly introduced the FIXME, marking those two hacky > workarounds as undesirable. > > But back then we couldn't fix the problem quickly, so it was either > document the issue at least a bit, or leave it completely undocumented.
Agreed, but b) really sounds like an invitation (or even justification) for doing the wrong thing, let's removed it.