On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote:
> On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
> 
> > > Sorry I don't see the argument for the claim it is relying on the
> > > internals
> > > with the re-positioned drm_sched_fini call. In that case it is
> > > fully
> > > compliant with:
> > > 
> > > /**
> > >  * drm_sched_fini - Destroy a gpu scheduler
> > >  *
> > >  * @sched: scheduler instance
> > >  *
> > >  * Tears down and cleans up the scheduler.
> > >  *
> > >  * This stops submission of new jobs to the hardware through
> > >  * drm_sched_backend_ops.run_job(). Consequently,
> > > drm_sched_backend_ops.free_job()
> > >  * will not be called for all jobs still in
> > > drm_gpu_scheduler.pending_list.
> > >  * There is no solution for this currently. Thus, it is up to the
> > > driver to
> > > make
> > >  * sure that:
> > >  *
> > >  *  a) drm_sched_fini() is only called after for all submitted jobs
> > >  *     drm_sched_backend_ops.free_job() has been called or that
> > >  *  b) the jobs for which drm_sched_backend_ops.free_job() has not
> > > been
> > > called
> > >  *
> > >  * FIXME: Take care of the above problem and prevent this function
> > > from
> > > leaking
> > >  * the jobs in drm_gpu_scheduler.pending_list under any
> > > circumstances.
> > > 
> > > ^^^ recommended solution b).
> > 
> > This has been introduced recently with commit baf4afc58314
> > ("drm/sched: Improve
> > teardown documentation") and I do not agree with this. The scheduler
> > should
> > *not* make any promises about implementation details to enable
> > drivers to abuse
> > their knowledge about component internals.
> > 
> > This makes the problem *worse* as it encourages drivers to rely on
> > implementation details, making maintainability of the scheduler even
> > worse.
> > 
> > For instance, what if I change the scheduler implementation, such
> > that for every
> > entry in the pending_list the scheduler allocates another internal
> > object for
> > ${something}? Then drivers would already fall apart leaking those
> > internal
> > objects.
> > 
> > Now, obviously that's pretty unlikely, but I assume you get the idea.
> > 
> > The b) paragraph in drm_sched_fini() should be removed for the given
> > reasons.
> > 
> > AFAICS, since the introduction of this commit, driver implementations
> > haven't
> > changed in this regard, hence we should be good.
> > 
> > So, for me this doesn't change the fact that every driver
> > implementation that
> > just stops the scheduler at an arbitrary point of time and tries to
> > clean things
> > up manually relying on knowledge about component internals is broken.
> 
> To elaborate on that, this documentation has been written so that we at
> least have *some* documentation about the problem, instead of just
> letting new drivers run into the knife.
> 
> The commit explicitly introduced the FIXME, marking those two hacky
> workarounds as undesirable.
> 
> But back then we couldn't fix the problem quickly, so it was either
> document the issue at least a bit, or leave it completely undocumented.

Agreed, but b) really sounds like an invitation (or even justification) for
doing the wrong thing, let's removed it.

Reply via email to