On 28/07/2025 10:28, Pierre-Eric Pelloux-Prayer wrote:
On 24/07/2025 at 16:19, Tvrtko Ursulin wrote:
GPUs generally don't implement preemption and the DRM scheduler definitely
does not support it at the front end scheduling level. This means
execution quanta can be quite long and are controlled by userspace, a
consequence of which is that picking the "wrong" entity to run can have a
larger negative effect than it would have with a virtual runtime based CPU
scheduler.

Another important consideration is that rendering clients often have
shallow submission queues, meaning they will be entering and exiting the
scheduler's runnable queue often.

The relevant scenario here is what happens when an entity re-joins the
runnable queue with other entities already present. One cornerstone of the
virtual runtime algorithm is to let it re-join at the head and depend on
the virtual runtime accounting to sort out the order after an execution
quantum or two.

However, as explained above, this may not work fully reliably in the GPU
world. An entity could always overtake the existing entities, or never,
depending on the submission order and the rbtree's equal-key insertion
behaviour.

We can break this latching by adding some randomness for this specific
corner case.

If an entity is re-joining the runnable queue, was at the head of the
queue the last time it was picked, and a different entity of equal
scheduling priority is already queued, we can break the tie by randomly
choosing the execution order between the two.

For randomness we implement a simple driver global boolean which selects
whether the new entity will go first or not. Because the boolean is global
and shared between all run queues and entities, its actual effect can be
loosely called random, under the assumption that it will not always be the
same entity re-joining the queue under these circumstances.

Another way to look at this is that it is adding a little bit of limited
random round-robin behaviour to the fair scheduling algorithm.

The net effect is a significant improvement in the scheduling unit tests
which check the scheduling quality for an interactive client running in
parallel with GPU hogs.

Signed-off-by: Tvrtko Ursulin <tvrtko.ursu...@igalia.com>
Cc: Christian König <christian.koe...@amd.com>
Cc: Danilo Krummrich <d...@kernel.org>
Cc: Matthew Brost <matthew.br...@intel.com>
Cc: Philipp Stanner <pha...@kernel.org>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-pra...@amd.com>
---
  drivers/gpu/drm/scheduler/sched_rq.c | 10 ++++++++++
  1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
index d16ee3ee3653..087a6bdbb824 100644
--- a/drivers/gpu/drm/scheduler/sched_rq.c
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -147,6 +147,16 @@ drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
               * Higher priority can go first.
               */
              vruntime = -us_to_ktime(rq_prio - prio);
+        } else {
+            static const int shuffle[2] = { -100, 100 };
+            static bool r = 0;
+
+            /*
+             * For equal priority apply some randomness to break
+             * latching caused by submission patterns.
+             */
+            vruntime = shuffle[r];
+            r ^= 1;

I don't understand why this is needed at all?

I suppose this is related to how drm_sched_entity_save_vruntime saves a relative vruntime (otherwise an entity re-joining with a vruntime of 0 would be impossible), but I don't understand that either.

Two things (and a bit more) to explain here for the record. And, as agreed off-line, I need to add some more code comments for this area in the next respin.

First, the saving of "vruntime - min_vruntime" when an entity exits the run-queue.

That is a core CFS concept AFAIU which enables the relative position of the entity to be restored once it re-enters the rq.

It only applies in the scenario where the picked entity was not at the head of the queue, because the actual head was not runnable due to a dependency.

If the picked entity then leaves the queue and re-joins, this relative vruntime is used to put it back where it was relative to the unready entity (which may have become ready by now, in which case it needs to be picked next and not be overtaken so easily).

It has to be the relative vruntime that is preserved, i.e. an entity which re-enters cannot simply keep its previous absolute vruntime, since by then that could lag significantly behind the vruntime of other active entities, which in turn would mean the re-joining entity could stay at the head of the queue for a long time.
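A rough sketch of the idea in pseudo-kernel-C (not the exact code from the series; the entity->stats->vruntime field and the exact signatures are my shorthand here):

	/* On leaving the rq: keep only the offset from min_vruntime. */
	static void drm_sched_entity_save_vruntime(struct drm_sched_entity *entity,
						   ktime_t min_vruntime)
	{
		entity->stats->vruntime = ktime_sub(entity->stats->vruntime,
						    min_vruntime);
	}

	/* On re-joining: re-base the saved offset onto the current minimum. */
	static ktime_t
	drm_sched_entity_restore_vruntime(struct drm_sched_entity *entity,
					  ktime_t min_vruntime)
	{
		ktime_t vruntime = ktime_add(min_vruntime,
					     entity->stats->vruntime);

		entity->stats->vruntime = vruntime;

		return vruntime;
	}

So an entity which was, say, 2ms behind the head when it left comes back 2ms behind whatever the head is now, instead of dragging in a stale absolute vruntime.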

The second part is the special case from the quoted patch, which only applies to entities re-joining the queue after having been picked from the head _and_ when there is another entity already in the rq.

By the nature of the CFS algorithm the re-joining entity continues with the vruntime assigned from the current rq min_vruntime. That puts two entities with exactly the same vruntime at the head of the queue, with the actual picking order influenced by the submit order (FIFO) and the rbtree sort order for equal keys (I did not check which). In any case it is not desirable, for all the GPU scheduling weaknesses described in the commit text (this patch).

For this special case there are three sub-paths:

1. Re-joining entity is higher scheduling prio -> we pull its vruntime a tiny bit ahead of the min_vruntime so it runs first.

2. Lower re-joining prio -> the opposite of the above: we explicitly prevent it from overtaking the higher priority head.

3. Equal prio -> apply some randomness as to which one runs first.

The idea is to avoid any "latching" of the execution order based on submission patterns, which in effect applies a little bit of round-robin/random-robin behaviour to this very specific case of an equal priority entity re-joining at the top of the queue.
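Putting the three sub-paths together, the restore path ends up looking roughly like this. Branches 1 and 3 match the quoted hunk; branch 2 is my paraphrase of the opposite case, and I am assuming the inverted priority enum where a numerically smaller prio means a higher scheduling priority, which is what the quoted hunk implies:

	if (prio < rq_prio) {
		/*
		 * 1. Re-joiner has higher priority: a small negative
		 *    offset puts it just ahead of min_vruntime.
		 */
		vruntime = -us_to_ktime(rq_prio - prio);
	} else if (prio > rq_prio) {
		/*
		 * 2. Re-joiner has lower priority: a small positive
		 *    offset keeps it behind the head.
		 */
		vruntime = us_to_ktime(prio - rq_prio);
	} else {
		/*
		 * 3. Equal priority: alternate who goes first to break
		 *    the latching described above.
		 */
		static const int shuffle[2] = { -100, 100 };
		static bool r;

		vruntime = shuffle[r];
		r ^= 1;
	}

The +-100ns shuffle values are tiny on purpose: just enough to give the rbtree two distinct keys so the pick order alternates, without meaningfully perturbing the vruntime accounting.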

Regards,

Tvrtko
