On 22/04/2025 07:06, Philipp Stanner wrote:
On Thu, 2025-04-17 at 17:08 +0100, Tvrtko Ursulin wrote:
On 17/04/2025 15:48, Danilo Krummrich wrote:
On Thu, Apr 17, 2025 at 03:20:44PM +0100, Tvrtko Ursulin wrote:
On 17/04/2025 13:11, Danilo Krummrich wrote:
On Thu, Apr 17, 2025 at 12:27:29PM +0100, Tvrtko Ursulin wrote:
On 17/04/2025 08:45, Philipp Stanner wrote:
On Mon, 2025-04-07 at 17:22 +0200, Philipp Stanner wrote:
The problem is exactly that jobs can outlive the entities and the
scheduler, while some userspace may have a dma fence reference to the
job via a sync file. This new callback would not solve it for xe, but
if everything required were reference counted it would.
I think you're mixing up the job and the dma_fence here. If a job
outlives the scheduler, it clearly is a bug; it always has been.
AFAIK, Xe reference counts its driver-specific job structures *and*
the driver-specific scheduler structure, such that drm_sched_fini()
won't be called before all jobs have finished.
Yes, sorry, dma fence. But it is not enough to postpone
drm_sched_fini() until the jobs have finished. The problem is that an
exported dma fence holds a pointer to the drm_sched_fence (and so
oopses in drm_sched_fence_get_timeline_name on fence->sched->name)
*after* the job has finished and the driver was free to tear
everything down.
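
For reference, the dereference in question is roughly this
(paraphrased sketch of what sched_fence.c does, not an exact copy):

static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
{
	struct drm_sched_fence *fence = to_drm_sched_fence(f);

	/* fence->sched may point at a scheduler the driver has already
	 * torn down by the time userspace queries the exported fence */
	return fence->sched->name;
}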
Well, that's a bug in drm_sched_fence then and independent from the
other topic.
Once the finished fence in a struct drm_sched_fence has been
signaled it must
live independent of the scheduler.
The lifetime of the drm_sched_fence is entirely independent from
the scheduler
itself, as you correctly point out.
The connection (re. independent or not) I made was: *if* drm_sched
were reference counted, would that satisfy both the requirement to
keep drm_sched_fence_get_timeline_name working and allow a different
flavour of the memory leak fix?
I agree drm_sched_fence_get_timeline_name can also be fixed by
removing the fence->sched dereference and losing the (pretty) name.
Historically there has been a lot of trouble with those names so maybe
that would be acceptable.
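
Something along these lines, as a rough sketch (the constant string is
just an example, not a proposal for the actual name):

static const char *drm_sched_fence_get_timeline_name(struct dma_fence *f)
{
	/* No fence->sched dereference, so this stays safe after the
	 * scheduler has been torn down, at the cost of the pretty name */
	return "drm_sched";
}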
Revoking s_fence->sched on job completion as an alternative does not
sound feasible.
To further complicate matters, I suspect rmmod gpu-sched.ko is also
something which would break exported fences since that would remove
the
fence ops. But that is solvable by module_get/put().
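
Roughly something like the following around fence creation/release, as
a sketch only (the helper names are made up for illustration):

#include <linux/errno.h>
#include <linux/module.h>

/* Hypothetical helpers: pin gpu-sched.ko for as long as an exported
 * fence may still call back into its dma_fence_ops, and drop the pin
 * again when the fence is released. */
static int sched_fence_pin_module(void)
{
	return try_module_get(THIS_MODULE) ? 0 : -ENODEV;
}

static void sched_fence_unpin_module(void)
{
	module_put(THIS_MODULE);
}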
Is it a common kernel policy to be immune against crazy people just
calling rmmod on central, shared kernel infrastructure?
We cannot protect people from everything, especially from themselves.
I would say if DRM supports driver unbind, and it does, then we need to
decide on the module unload. Otherwise it is possible and allows for bad
things to happen.
Even if the solution might be to just perma-raise the module refcount,
or something. It does not feel valid to dismiss the discussion
straight away. Hence I have sent
https://lore.kernel.org/dri-devel/20250418164246.72426-1-tvrtko.ursu...@igalia.com/
to discuss this topic separately.
Starting to reference count things to keep the whole scheduler etc.
alive as
long as the drm_sched_fence lives is not the correct solution.
To catch up on the why, it would be helpful if you could dig out the
links to the past discussions.
I repeat that there is a lot of attractiveness to reference counting.
The already mentioned memory leak, the s_fence oops, and also not
having to clear job->entity could be useful for things like tracking
per-entity submission stats (imagine CFS-like scheduling, a generic
scheduling DRM cgroup controller). So it would be good for me to hear
what pitfalls were identified in this space.
Reference counting does _appear_ attractive, but our (mostly internal)
attempts and discussions revealed that it's not feasible.
There was an RFC by me about it last year, but it was incomplete and
few people reacted:
https://lore.kernel.org/dri-devel/20240903094531.29893-2-pstan...@redhat.com/
When you try to implement refcounting, you quickly discover that it's
not enough if submitted jobs refcount the scheduler.
Right, I did imply in my initial reply that "everything" would need to
be reference counted.
If the scheduler is refcounted, this means that it can now outlive the
driver. This means that it can continue calling into the driver with
run_job(), free_job(), timedout_job() and so on.
Since this would fire 500 million UAFs, you would hence now have to
guard *all* drivers' callbacks in some way against the driver being
unloaded. You might even end up having to refcount things like entire
modules, depending on the driver.
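
To illustrate the point, per-job refcounting would look roughly like
this hypothetical sketch (names made up), and the catch is spelled out
in the last comment:

#include <drm/gpu_scheduler.h>
#include <linux/kref.h>
#include <linux/slab.h>

/* Hypothetical refcounted scheduler wrapper; every pending job (or
 * exported fence) would hold a reference. */
struct refcounted_sched {
	struct drm_gpu_scheduler base;
	struct kref refcount;
};

static void refcounted_sched_release(struct kref *kref)
{
	struct refcounted_sched *s =
		container_of(kref, struct refcounted_sched, refcount);

	drm_sched_fini(&s->base);
	kfree(s);
}

static void refcounted_sched_put(struct refcounted_sched *s)
{
	kref_put(&s->refcount, refcounted_sched_release);
}

/* The catch: until the last reference is dropped, the scheduler keeps
 * invoking s->base.ops->run_job()/free_job()/timedout_job(), which
 * live in a driver module that may already be unbound or unloaded. */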
So effectively, with refcounting, you would fix a problem in central
infrastructure (the scheduler) in all users (the driver) – therefore,
you would not fix the problem at all, but just work around the
scheduler being broken in the drivers. IOW, everything would be exactly
as it is now, where everyone works around the problem in their own way.
Nouveau uses its redundant pending list, Xe refcounts and amdgpu does……
IDK, things.
Xe _fails_ to handle it, as demonstrated in the above linked RFC. I
don't think any driver can actually handle it on their own. Hence what
I am discussing is the complete opposite of "not fixing the problem at
all" - it is about what, if anything, we can do to fix it robustly.
Another problem is the kernel workflow and kernel politics. The
solution I propose here allows for phasing the problem out, first in
Nouveau, then step by step others. It's the only minimalist backward-
compatible solution I can see.
These are two separate things: the memory leaks in this series, and
the ref counting angle as a potential fix for _both_ memory leaks
_and_ use-after-free problems.
That said, if you have the capacity to look deeper into a refcount
solution, feel free to go for it. But you'd have to find a way to solve
it in a centralized manner, otherwise we could just leave everything
be.
I will leave discussion for my linked RFC.
Multiple solutions have been discussed already, e.g. just wait for the
pending list to be empty, or reference count the scheduler for every
pending job. Those all had significant downsides, which I don't see
with this proposal.

I'm all for better ideas though -- what do you propose?
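
(For reference, the "wait for the pending list to be empty" variant
would boil down to drm_sched_fini() blocking until something like this
sketch returns true, which is exactly where the downsides come from;
the helper is illustrative only, it does not exist:)

static bool drm_sched_pending_list_empty(struct drm_gpu_scheduler *sched)
{
	bool empty;

	/* drm_sched_fini() would have to block until this returns true,
	 * i.e. until free_job() has run for every in-flight job. */
	spin_lock(&sched->job_list_lock);
	empty = list_empty(&sched->pending_list);
	spin_unlock(&sched->job_list_lock);

	return empty;
}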
I think we need to brainstorm both issues and see if there is a
solution
which solves them both, with bonus points for being elegant.
The problems are not related. As mentioned above, once signaled a
drm_sched_fence must not depend on the scheduler any longer.
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index 6b72278c4b72..ae3152beca14 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -1465,6 +1465,10 @@ void drm_sched_fini(struct drm_gpu_scheduler *sched)
 	sched->ready = false;
 	kfree(sched->sched_rq);
 	sched->sched_rq = NULL;
+
+	if (!list_empty(&sched->pending_list))
+		dev_err(sched->dev, "%s: Tearing down scheduler while jobs are pending!\n",
+			__func__);
It isn't fair to add this error since it would out of the blue start
firing for everyone except nouveau, no? Regardless of whether there is
a leak or not.
I think it is pretty fair to warn when detecting a guaranteed
bug, no?
If drm_sched_fini() is called while jobs are still on the
pending_list, they won't ever be freed, because all workqueues are
stopped.
Is it a guaranteed bug for drivers which are aware of the
drm_sched_fini() limitation and clean up after themselves?
How could a driver clean up after itself (unless the driver holds its
own list of pending jobs)?
Once a job is in flight (i.e. it's on the pending_list) we must
guarantee that
free_job() is called by the scheduler, which it can't do if we call
drm_sched_fini() before the pending_list is empty.
In other words if you apply the series up to here would it
trigger for
nouveau?
No, because nouveau does something very stupid, i.e. replicate the
pending_list.
Ah okay I see it now, it waits for all jobs to finish before calling
drm_sched_fini(). For some reason I did not think it was doing that
given the cover letter starts with how that is a big no-no.
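
Conceptually something like this on the driver side (simplified
sketch, not nouveau's actual code; the structure and field names are
made up):

#include <drm/gpu_scheduler.h>
#include <linux/list.h>
#include <linux/wait.h>

struct my_driver_sched {			/* hypothetical driver structure */
	struct drm_gpu_scheduler sched;
	struct list_head job_list;		/* driver's own copy of the pending jobs */
	wait_queue_head_t jobs_drained;		/* woken from the driver's job-free path */
};

static void my_driver_sched_teardown(struct my_driver_sched *s)
{
	/* Only tear the scheduler down once the driver's own bookkeeping
	 * says every job has been freed, so pending_list is empty. */
	wait_event(s->jobs_drained, list_empty_careful(&s->job_list));

	drm_sched_fini(&s->sched);
}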
Reportedly it triggers for the mock scheduler which also has no
leak.
That sounds impossible. How do you ensure you do *not* leak memory
when you tear
down the scheduler while it still has pending jobs? Or in other
words, who calls
free_job() if not the scheduler itself?
Well the cover letter says it triggers so it is possible. :)
Where does it say that?
"""
Tvrtko's unit tests also run as expected (except for the new warning
print in patch 3), which is not surprising since they don't provide the
callback.
"""
I mean, the warning would trigger if a driver is leaking jobs, which is
clearly a bug (and since a list is leaked, it might then even be a
false-negative in kmemleak).
It's actually quite simple:
1. jobs are being freed by free_job()
2. If you call drm_sched_fini(), the work item for free_job() gets
deactivated.
3. Depending on the load (race), you leak the jobs.
That's not that much of a problem for hardware schedulers, but
firmware schedulers, which tear down scheduler instances all the time,
could leak quite some amount of memory over time.
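
Step 2 is the crux: the teardown path stops the free_job worker,
roughly like this (paraphrased sketch of recent kernels, details may
differ):

/* Paraphrased sketch of the worker shutdown drm_sched_fini() performs;
 * any job still on pending_list afterwards never sees free_job() again. */
static void sketch_wqueue_stop(struct drm_gpu_scheduler *sched)
{
	WRITE_ONCE(sched->pause_submit, true);
	cancel_work_sync(&sched->work_run_job);
	cancel_work_sync(&sched->work_free_job);
}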
I can't see why a warning print would ever be bad there. Even Christian
wants it.
The question I raised is whether there are other drivers which manage
to clean up everything correctly (like the mock scheduler does), but
would trigger that warning. Maybe there are not, and maybe the mock
scheduler is the only false positive.
Regards,
Tvrtko
The mock scheduler also tracks the pending jobs itself, but, unlike
nouveau, it does not wait for the jobs to finish and for the free
worker to process them all; instead, having stopped the "hw" backend,
it cancels them and calls the free_job vfunc directly.
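
I.e. roughly this pattern (simplified sketch with made-up structure
names, not the actual mock scheduler code):

#include <drm/gpu_scheduler.h>
#include <linux/dma-fence.h>
#include <linux/list.h>

struct mock_job {				/* hypothetical driver-side job */
	struct drm_sched_job base;
	struct list_head link;			/* entry on the driver's own pending list */
	struct dma_fence hw_fence;
};

static void mock_sched_teardown(struct drm_gpu_scheduler *sched,
				struct list_head *driver_pending)
{
	struct mock_job *job, *next;

	/* The "hw" is already stopped, so cancel every tracked job and
	 * free it directly via the free_job vfunc instead of relying on
	 * the (stopped) free worker. */
	list_for_each_entry_safe(job, next, driver_pending, link) {
		list_del(&job->link);
		dma_fence_set_error(&job->hw_fence, -ECANCELED);
		dma_fence_signal(&job->hw_fence);
		sched->ops->free_job(&job->base);
	}

	drm_sched_fini(sched);
}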
Going back to the topic of this series: if we go with a solution along
the lines of the proposed one, I wonder if it would be doable without
mandating that drivers keep a list parallel to pending_list. Instead,
have a vfunc the DRM scheduler would call to cancel one job at a time
from *its* pending list. It would go nicely together with
prepare/run/timedout/free.

Would it allow getting rid of the new state machinery and just
cancelling and freeing in one go directly from drm_sched_fini()?
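
For illustration, drm_sched_fini() could then drain its own list with
something like this (sketch only; "cancel_job" is a made-up callback
that does not exist today):

static void drm_sched_cancel_pending_jobs(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *job, *next;

	/* At this point the submit and free workers are already stopped,
	 * so nothing else walks pending_list concurrently. */
	list_for_each_entry_safe(job, next, &sched->pending_list, list) {
		list_del_init(&job->list);
		if (sched->ops->cancel_job)	/* hypothetical new vfunc */
			sched->ops->cancel_job(job);
		sched->ops->free_job(job);
	}
}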
Regards,
Tvrtko
Also, I asked in my initial reply whether we have a list of which of
the current drivers suffer from memory leaks. Is it all, or only some,
etc.?
Not all, but quite some I think. The last time I looked (which is
about a year
ago) amdgpu for instance could leak memory when you unbind the
driver while
enough jobs are in flight.