On 16/05/2025 13:00, Danilo Krummrich wrote:
On Fri, May 16, 2025 at 12:35:52PM +0100, Tvrtko Ursulin wrote:

On 16/05/2025 11:54, Danilo Krummrich wrote:
On Fri, May 16, 2025 at 11:19:50AM +0100, Tvrtko Ursulin wrote:

On 16/05/2025 10:53, Danilo Krummrich wrote:
On Fri, May 16, 2025 at 10:33:30AM +0100, Tvrtko Ursulin wrote:
On 24/04/2025 10:55, Philipp Stanner wrote:
+        * @kill_fence_context: kill the fence context belonging to this scheduler

Which fence context would that be? ;)

There's one per ring, and a scheduler instance represents a single ring. So,
what should be specified here?

I was pointing out the fact that not all drivers are 1:1 sched:entity.

I'm well aware, but how is that relevant? Entities don't have an associated
fence context, but a GPU ring (either hardware or software) does, and that is
what a scheduler instance represents.

Aha! Well.. how is it relevant, and do entities really not have an associated
fence context? Well, entity->fence_context.. that was my first association this
whole time. It never crossed my mind this is talking about the hardware fence
context. Proof is in the pudding that the naming should be improved.

It says "fence context belonging to this scheduler", which should be
unambiguous, however I agree that we could mark out the difference even more.

Cool, I had tunnel vision due to working with entity->fence_context a lot, and
this just had the misfortune to re-use the same name.
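
For anyone else with the same tunnel vision, here are the two contexts being
conflated, roughly (drm_sched_entity is the upstream struct; my_ring is a
made-up driver example):

	/* 1) The software fence context of the entity, used for the
	 *    scheduled/finished fences:
	 *
	 *	struct drm_sched_entity {
	 *		...
	 *		u64 fence_context;
	 *	};
	 *
	 * 2) The hardware fence context of the ring this scheduler
	 *    instance represents, allocated once per ring:
	 */
	struct my_ring {
		spinlock_t fence_lock;
		u64 hw_fence_context;	/* context for fences from ->run_job() */
		atomic64_t hw_fence_seqno;
	};

	static void my_ring_fence_init(struct my_ring *ring)
	{
		spin_lock_init(&ring->fence_lock);
		ring->hw_fence_context = dma_fence_context_alloc(1);
		atomic64_set(&ring->hw_fence_seqno, 0);
	}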

But I also don't think there is a requirement for fences returned from
->run_job() to have a single context. Which again makes it not the best
naming.

It's implied by the fact that a scheduler instance represents a ring. Having
multiple fence contexts per ring doesn't make any sense.

But it's indeed not written down -- we should do that then.

Would you do it in code or just in docs? I don't see a real benefit to it to
be honest, since nothing depends on anything apart from it being a fence which
will signal when the job is done. But I am not aware of anything where it would
be a problem either. One to run past the driver maintainers, I guess.
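
For the record, my understanding of the convention in code form, a single
per-ring context for everything ->run_job() returns (my_* names, to_my_ring(),
my_fence_ops and my_ring_submit() are all made up):

	static struct dma_fence *my_sched_run_job(struct drm_sched_job *job)
	{
		struct my_ring *ring = to_my_ring(job->sched);
		struct my_fence *fence;

		fence = kzalloc(sizeof(*fence), GFP_KERNEL);
		if (!fence)
			return ERR_PTR(-ENOMEM);

		/*
		 * All hardware fences of this ring share one context, so
		 * everything this scheduler instance returns is totally
		 * ordered.
		 */
		dma_fence_init(&fence->base, &my_fence_ops, &ring->fence_lock,
			       ring->hw_fence_context,
			       atomic64_inc_return(&ring->hw_fence_seqno));

		my_ring_submit(ring, job, fence);

		return &fence->base;
	}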

In the callback the driver should neither tear down an entity, nor the whole
scheduler, hence we shouldn't call it like that. sched_kill() is therefore
misleading as well.

  ->sched_exit()? ->sched_stop()? ->sched_cleanup()?

I think this all would throw up questions like "What does {exit,stop,cleanup}
mean in this context?". And the answer would be "kill the fence context of the
ring represented by the scheduler".

I think we want a name that represents that without an indirection that we have
to define.

Well, kill_fence_context wasn't self-descriptive to me either, so there is that
too. In other words, some kerneldoc will be needed anyway, and a callback to
call while tearing something down is pretty standard stuff. So I don't think it
is a big deal to explain what that callback should do.

It should be named after what it actually does (or should do). Feel free to
propose a different name that conforms with that.
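
FWIW, whatever name it ends up being, something along these lines in the
kerneldoc would have avoided my confusion (just a sketch, signature guessed):

	/**
	 * @kill_fence_context: kill the hardware fence context of the ring
	 * this scheduler instance represents
	 *
	 * This is about the fence context backing the fences returned from
	 * &struct drm_sched_backend_ops.run_job, not about
	 * &struct drm_sched_entity.fence_context. The callback must
	 * force-signal all unsignalled fences of that context, e.g. with an
	 * error, and must not block.
	 */
	void (*kill_fence_context)(struct drm_gpu_scheduler *sched);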

We also probably want some commentary on the topic of indefinitely (or at
least for a very long time) blocking thread exit / SIGINT/TERM/KILL handling.

You mean in case the driver does implement the callback, but does *not* properly
tear down the fence context? So, you ask for describing potential consequences
of drivers having bugs in the implementation of the callback? Or something else?

I was proposing that the kerneldoc for the vfunc should document that the
callback must not block, or, if blocking is unavoidable, document a guideline
on how long is acceptable. Maybe even enforce a limit in the scheduler core
itself.

Killing the fence context shouldn't block.
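
Right, something as simple as erroring out and signalling whatever is still
pending would do (assuming the driver keeps its unsignalled fences on a
ring->pending list; all names made up):

	static void my_sched_kill_fence_context(struct drm_gpu_scheduler *sched)
	{
		struct my_ring *ring = to_my_ring(sched);
		struct my_fence *fence, *tmp;
		unsigned long flags;

		spin_lock_irqsave(&ring->fence_lock, flags);
		list_for_each_entry_safe(fence, tmp, &ring->pending, node) {
			list_del_init(&fence->node);
			dma_fence_set_error(&fence->base, -ECANCELED);
			/* fence->base.lock is ring->fence_lock, so the
			 * locked variant is correct here. */
			dma_fence_signal_locked(&fence->base);
		}
		spin_unlock_irqrestore(&ring->fence_lock, flags);
	}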

Cool. And maybe convert the wait_event to wait_event_timeout with a warning
to be robust.
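
I.e. something along the lines of (drm_sched_no_jobs_pending() is a made-up
condition helper for illustration):

	/* Instead of a potentially unbounded wait_event(): */
	if (!wait_event_timeout(sched->job_scheduled,
				drm_sched_no_jobs_pending(sched),
				msecs_to_jiffies(10000)))
		WARN(1, "%s: fence context was not killed/signalled in time\n",
		     sched->name);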

That would make sense if it could deadlock, but even if the driver does nothing
it should terminate eventually. The rule that we always rely on is that we
guarantee throughout the kernel that fences are signalled eventually.

Given it is opt-in, fair enough. (Some drivers don't have automatic fence
expiration, but then again they don't have this callback either. And once they
start adding it, there will be kerneldoc to say the callback must not block.)

Regards,

Tvrtko
