sched: Add fair scheduling policy

Danilo Krummrich Thu, 15 Jan 2026 15:39:48 -0800

On Thu Jan 15, 2026 at 2:00 PM CET, Tvrtko Ursulin wrote:
> On 15/01/2026 11:44, Danilo Krummrich wrote:
>> On Thu Jan 15, 2026 at 9:28 AM CET, Tvrtko Ursulin wrote:
>>>
>>> On 14/01/2026 22:13, Danilo Krummrich wrote:
>>>> On Fri Dec 19, 2025 at 2:53 PM CET, Tvrtko Ursulin wrote:
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_rq.c 
>>>>> b/drivers/gpu/drm/scheduler/sched_rq.c
>>>>> index 2d1f579d8352..2fde309d02a6 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_rq.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_rq.c
>>>>> @@ -16,6 +16,35 @@ drm_sched_entity_compare_before(struct rb_node *a, 
>>>>> const struct rb_node *b)
>>>>>           return ktime_before(ea->oldest_job_waiting, 
>>>>> eb->oldest_job_waiting);
>>>>>    }
>>>>>    
>>>>> +static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
>>>>> +{
>>>>> + enum drm_sched_priority prio = DRM_SCHED_PRIORITY_INVALID;
>>>>> + struct rb_node *rb;
>>>>> +
>>>>> + lockdep_assert_held(&rq->lock);
>>>>> +
>>>>> + rb = rb_first_cached(&rq->rb_tree_root);
>>>>> + if (rb) {
>>>>> +         struct drm_sched_entity *entity =
>>>>> +                 rb_entry(rb, typeof(*entity), rb_tree_node);
>>>>> +
>>>>> +         /*
>>>>> +          * The normal locking order is entity then run-queue so taking
>>>>> +          * the entity lock here would be a locking inversion for the
>>>>> +          * case when the current head of the run-queue is different from
>>>>> +          * the one we already have locked. The unlocked read is fine
>>>>> +          * though, because if the priority had just changed it is no big
>>>>> +          * deal for our algorithm, but just a transient reachable only
>>>>> +          * by drivers with userspace dynamic priority changes API. Equal
>>>>> +          * in effect to the priority change becoming visible a few
>>>>> +          * instructions later.
>>>>> +          */
>>>>> +         prio = READ_ONCE(entity->priority);
>>>>
>>>> I still think that we should address the root cause of the lock inversion
>>>> problem instead.
>>>>
>>>> I previously mentioned that I can take a look at this beginning of this 
>>>> year,
>>>> which I can do soon.
>>>>
>>>> In the meantime, can you please explain what's the problem with this 
>>>> specific
>>>> case? This function is only ever called from 
>>>> drm_sched_rq_remove_fifo_locked()
>>>> and drm_sched_rq_update_fifo_locked(), which already seem to hold both 
>>>> locks.
>>>
>>> The entity which is locked is likely not the same as entity at the head
>>> of the run-queue from either call chains.
>>>
>>> In one case we have just removed the locked entity from the run-queue,
>>> while in the other tree has been re-balanced so a different entity may
>>> have taken the head position.
>> 
>> Ick! That makes it even worse because this would mean that even if we would 
>> be
>> able to take the entity lock here, this is also prone to lock inversion 
>> between
>> entities.
>
> Not sure what you mean by also an even worse? It is the same entity->rq 
> vs rq->entity lock order inversion. What other inversion is there?


drm_sched_rq_update_prio() is only ever called in call paths where the rq lock
and *some* entity lock is already taken. Now, you are saying that the entity we
update the priority for might be a different one -- *might* be. This has two
implications, if this entity would not have a lock inversion with the rq and we
could take the lock:

  (1) It could be that we try to take the same entity lock again.

  (2) In another call path it could happen that the order of the two entity
      locks is inverted.

Maybe there is some subtlety why this could actually not happen. But even if,
building on top of more and more subtleties doesn't sound like a great
maintainance approach. :)

>> I.e. that is a huge indicator that it is even more necessary to revisit 
>> locking
>> design in general.
>> 
>>> Also to note is 99% of cases entity->priority is invariant. Only amdgpu
>>> allows for change of priority post entity creation. So for the rest
>>> locking would not gain anything.
>>>
>>> Even for amdgpu the unlocked read is not very relevant, since the only
>>> thing this is used is to determine the run-queue insertion position of a
>>> re-joining entity. So worst thing that could happen, if userspace thread
>>> would race set priority with the job worker picking the next job, is to
>>> *one time* pick a different job.
>> 
>> I get that; it is less that dealing with the priority field by itself is a 
>> huge
>> issue we can't handle, it is more that the above workaround clearly points 
>> out a
>> (locking) design issue, which we should not ignore. It's not only about code
>> the code working or being correct, it's also about maintainability.
>> 
>> (Even though I'm aware that DRM scheduler maintainability is a bit the DRM
>> equivalent of the infamous 3x+1 problem. :)
>> 
>>> Also to address the root cause of the lock inversion would IMHO be to
>>> re-design the whole scheduler and this specific function here does not
>>> seem should be the trigger for that.
>> 
>> I'm not sure it is that bad, let me take a look in the next days and see what
>> options we have.
>
> Okay but I am sure you know there are memory barriers, RCU, SPSC queue, 
> completions, worker management, and at least two locks, and everything 
> is interdependent.

I am, and I couldn't describe the maintainance burden of the scheduler any
better. :)

> This series at least removes a bunch of code without making things more
> complicated and so can be a good base for further rework. If you suggest to
> hold it until all of the above is resolved it will be a few more years easily.

Let me explain a bit where I'm coming from:

>From a maintainer perspective - without saying that things are either black or
white - I'm assessing contributors and contributions on whether the intention is
to improve the infrastructure in terms of design and maintainability or whether
the main intention is to "just" get a certain feature merged.

While both are valuable contributions that are appreciated, contributions that
have a tendency into the latter direction, have to be seen with more sceptisism.

Especially when the code base already has known design issues, bulding features
on top is very dangerous. Not necessarily because the resulting code might be
wrong or because of regressions, etc. But because it most likely increases the
difficulty resolving the existing issues -- in the worst case leading to a dead
end.

Having that said, I am not saying that you "just" try to get your feature merged
no matter what. Quite the opposite, I very much appreciate your contributions to
the scheduler and recognize the cleanup work you are doing.

However, for this series I require you to acknowledge that, even if correct,
some of this code introduces additional workarounds due to existing design
issues, locking and synchronization subtleties that should be solved in better
ways.

I have no objections going ahead with this series if we are on the same page
regarding this and you are willing to also help out fixing those underlying
design issues, locking and synchronization quirks, etc. subsequently.

But if you are more leaning in the direction of "I don't see an issue overall,
the code is correct!" I do have concerns.

Improving the situation with the scheduler is the fundamental reason why Philipp
and me were stepping up as maintainers, Philipp being more of the active part
(doing a great job) and me working more in the background, helping and mentoring
him.

Believe me when I say that we want this to move forward, but we also have to
ensure that we are not making a step into the direction of increasing the
difficulty to solve the underlying issues.

So, again, if we are on the same page with this, no objections from my side.

- Danilo

Re: [PATCH v5 09/28] drm/sched: Add fair scheduling policy

Reply via email to