On 22/04/2025 15:52, Danilo Krummrich wrote:
On Tue, Apr 22, 2025 at 04:16:48PM +0200, Philipp Stanner wrote:
On Tue, 2025-04-22 at 16:08 +0200, Danilo Krummrich wrote:
On Tue, Apr 22, 2025 at 02:39:21PM +0100, Tvrtko Ursulin wrote:
Sorry I don't see the argument for the claim it is relying on the
internals
with the re-positioned drm_sched_fini call. In that case it is
fully
compliant with:
/**
* drm_sched_fini - Destroy a gpu scheduler
*
* @sched: scheduler instance
*
* Tears down and cleans up the scheduler.
*
* This stops submission of new jobs to the hardware through
* drm_sched_backend_ops.run_job(). Consequently,
drm_sched_backend_ops.free_job()
* will not be called for all jobs still in
drm_gpu_scheduler.pending_list.
* There is no solution for this currently. Thus, it is up to the
driver to
make
* sure that:
*
* a) drm_sched_fini() is only called after for all submitted jobs
* drm_sched_backend_ops.free_job() has been called or that
* b) the jobs for which drm_sched_backend_ops.free_job() has not
been
called
*
* FIXME: Take care of the above problem and prevent this function
from
leaking
* the jobs in drm_gpu_scheduler.pending_list under any
circumstances.
^^^ recommended solution b).
This has been introduced recently with commit baf4afc58314
("drm/sched: Improve
teardown documentation") and I do not agree with this. The scheduler
should
*not* make any promises about implementation details to enable
drivers to abuse
their knowledge about component internals.
This makes the problem *worse* as it encourages drivers to rely on
implementation details, making maintainability of the scheduler even
worse.
For instance, what if I change the scheduler implementation, such
that for every
entry in the pending_list the scheduler allocates another internal
object for
${something}? Then drivers would already fall apart leaking those
internal
objects.
Now, obviously that's pretty unlikely, but I assume you get the idea.
The b) paragraph in drm_sched_fini() should be removed for the given
reasons.
AFAICS, since the introduction of this commit, driver implementations
haven't
changed in this regard, hence we should be good.
So, for me this doesn't change the fact that every driver
implementation that
just stops the scheduler at an arbitrary point of time and tries to
clean things
up manually relying on knowledge about component internals is broken.
To elaborate on that, this documentation has been written so that we at
least have *some* documentation about the problem, instead of just
letting new drivers run into the knife.
The commit explicitly introduced the FIXME, marking those two hacky
workarounds as undesirable.
But back then we couldn't fix the problem quickly, so it was either
document the issue at least a bit, or leave it completely undocumented.
Agreed, but b) really sounds like an invitation (or even justification) for
doing the wrong thing, let's removed it.
IMO it is better to leave it. Regardless of whether it was added because
some driver is actually operating like that, it does describe a
_currently_ workable option to avoid memory leaks. Once a better method
is there, ie. FIXME is addressed, then it can be removed or replaced.
Regards,
Tvrtko