On 02/04/2025 12:53, Christian König wrote:
On 02.04.25 at 10:26, Tvrtko Ursulin wrote:
On 02/04/2025 07:49, Christian König wrote:

First of all, impressive piece of work.

Thank you!

I am not super happy though, since what would be much better is some sort of a
CFS. But to do that would require cracking the entity GPU time tracking
problem. I have tried that twice so far and failed to find a generic, elegant
and not too intrusive solution.

If the numbers you posted hold true I think this solution here is perfectly
sufficient. Keep in mind that GPU submission scheduling is a bit more
complicated than classic I/O scheduling.

E.g. you have no idea how much work a GPU submission was until it has
completed, whereas for an I/O transfer you know beforehand how many bytes will
be transferred. That makes tracking things far more complicated.


Lets look at the results:

1. Two normal priority deep queue clients.

These submit one second worth of 8ms jobs, as fast as they can, with no
dependencies etc. There is no difference in total runtime between FIFO and
qddl, but the latter allows both clients to progress with their work more
evenly:

https://people.igalia.com/tursulin/drm-sched-qddl/normal-normal.png

(X axis is time, Y is submitted queue depth, hence a lowering of qd corresponds
with work progress for both clients; tested with both schedulers separately.)

This was basically the killer argument for why we implemented FIFO in the first
place. RR completely sucked at fairness when you have many clients submitting
many small jobs.

It looks like the deadline scheduler is even better than FIFO in that regard,
but I would also add a test with (for example) 100 clients doing submissions at
the same time.

I can try that. So 100 clients with very deep submission queues? How deep? 
Fully async? Or some synchronicity and what kind?

Not deep queues, more like 4-8 jobs maximum for each. Send all submissions 
roughly at the same time and with the same priority.

When you have 100 entities each submission from each entity should have ~99 
other submissions in between them.

Record the minimum and maximum of that value and you should have a good
indicator of how well the algorithm performs.
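
For illustration, here is a rough standalone sketch of that min/max interleave
metric, computed from the entity id of each job in global completion order
(this is not code from the series and the example log is made up; with N
entities scheduled perfectly fairly both values come out as N - 1):

#include <limits.h>
#include <stdio.h>

#define NR_ENTITIES 4

int main(void)
{
        /* Example log: owning entity of each job, in completion order. */
        static const int completion_log[] = { 0, 1, 2, 3, 0, 2, 1, 3, 0, 1, 2, 3 };
        const int nr_jobs = sizeof(completion_log) / sizeof(completion_log[0]);
        int last_seen[NR_ENTITIES];
        int min_gap = INT_MAX, max_gap = 0;
        int i;

        for (i = 0; i < NR_ENTITIES; i++)
                last_seen[i] = -1;

        for (i = 0; i < nr_jobs; i++) {
                int e = completion_log[i];

                if (last_seen[e] >= 0) {
                        /* Jobs from other entities since this entity's previous job. */
                        int gap = i - last_seen[e] - 1;

                        if (gap < min_gap)
                                min_gap = gap;
                        if (gap > max_gap)
                                max_gap = gap;
                }
                last_seen[e] = i;
        }

        printf("interleave: min %d, max %d\n", min_gap, max_gap);

        return 0;
}

The closer min and max stay together (and to N - 1), the fairer the
interleaving.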

You can then of course start to make it more complicated, e.g. 50 entities
which have 8 submissions, each taking 4ms, and 50 other entities which have 4
submissions, each taking 8ms.

I started working on this "many clients" test and so far have not found a definitive pattern that shows a significant difference between the schedulers.

2. Same two clients but one is now low priority.

https://people.igalia.com/tursulin/drm-sched-qddl/normal-low.png

Normal priority client is a solid line, low priority dotted. We can see how FIFO
completely starves the low priority client until the normal priority client is
fully done. Only then does the low priority client get any GPU time.

In contrast, qddl allows some GPU time to the low priority client.

3. Same clients but now high versus normal priority.

Similar behaviour as in the previous scenario, except that normal is a bit less
de-prioritised relative to high than low was relative to normal.

https://people.igalia.com/tursulin/drm-sched-qddl/high-normal.png

4. Heavy load vs interactive client.

Heavy client emits a 75% GPU load in the form of 3x 2.5ms jobs followed by a
2.5ms wait.

Interactive client emits a 10% GPU load in the form of 1x 1ms job followed by
a 9ms wait.

This simulates an interactive graphical client used on top of a relatively heavy
background load but no GPU oversubscription.
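
As an aside, the two clients above can be modelled in plain userspace code
along these lines (only an illustrative model, not the unit test code from the
series; the hypothetical submit_and_wait_us() helper simply sleeps for the
nominal job duration as a stand-in for submitting a job and waiting on its
fence):

#include <time.h>

static void sleep_us(long us)
{
        struct timespec ts = {
                .tv_sec = us / 1000000,
                .tv_nsec = (us % 1000000) * 1000L,
        };

        nanosleep(&ts, NULL);
}

/* Stand-in for submitting one job of the given length and waiting on it. */
static void submit_and_wait_us(long us)
{
        sleep_us(us);
}

/*
 * One "frame" of a duty cycle client: nr_jobs jobs of job_us each, followed
 * by a wait_us pause. Heavy client: 3x 2500us jobs + 2500us wait (~75% load).
 * Interactive client: 1x 1000us job + 9000us wait (~10% load).
 */
static void duty_cycle_frame(int nr_jobs, long job_us, long wait_us)
{
        while (nr_jobs--)
                submit_and_wait_us(job_us);

        sleep_us(wait_us);
}

int main(void)
{
        int i;

        for (i = 0; i < 100; i++)       /* roughly one second of the interactive pattern */
                duty_cycle_frame(1, 1000, 9000);

        return 0;
}

In the actual test both clients of course run concurrently, which is where the
scheduling policy starts to matter.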

Graphs show the interactive client only and from now on, instead of looking at
the client's queue depth, we look at its "fps".

https://people.igalia.com/tursulin/drm-sched-qddl/heavy-interactive.png

We can see that qddl allows a slightly higher fps for the interactive client,
which is good.

The most interesting question here is: what is the maximum frame time?

E.g. how long does the user need to wait, at most, for a response from the
interactive client?

I did a quick measurement of those metrics, for this workload only.

I measured the time from submission of the first job in a group (i.e. a frame)
to the time the last job in the group finished, and then subtracted the
expected job durations to get just the wait plus overhead latency.
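
In code terms, per frame it boils down to this (a made-up helper with made-up
example numbers, purely to pin down the formula):

#include <stdio.h>

/* Wait plus overhead latency of one frame, as described above. */
static double frame_latency_ms(double submit_first_ms, double finish_last_ms,
                               double expected_gpu_ms)
{
        return finish_last_ms - submit_first_ms - expected_gpu_ms;
}

int main(void)
{
        /* Made-up example: a 1ms frame submitted at t=10ms, finished at t=15ms. */
        printf("latency %.1f ms\n", frame_latency_ms(10.0, 15.0, 1.0));

        return 0;
}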

Five averaged runs:

        min      avg      max    [ms]
FIFO    2.5     13.14     18.3
qddl    3.2      9.9      16.6

So it is a bit better on max, and even more so on average latency. The question
is how representative this synthetic workload is of the real world.

Well, if I'm not completely mistaken that is 9.2% better on max and nearly
24.6% better on average; the min time difference is negligible as far as I can
see.

That is more than a bit better. Keep in mind that we usually deal with
interactive GUI plus background worker use cases, which benefit a lot from that
kind of improvement.

The improvement on that one and some of the other scenarios is nice indeed. I still haven't found a regressing workload myself, but Pierre-Eric has something involving viewperf. We tried to simulate that via a unit test but so far did not get it to behave as badly with qddl as the real thing does.

I am parking this work for a week or so now but will come back to it.

If in the meantime people want to experiment, my latest code is here:

https://gitlab.freedesktop.org/tursulin/kernel/-/commits/drm-sched-deadline?ref_type=heads

There is also a CFS like very early experiment at:

https://gitlab.freedesktop.org/tursulin/kernel/-/commits/drm-sched-cfs?ref_type=heads

If that one works out it could perhaps be even better but it is far from finished.

Regards,

Tvrtko

5. Low priority GPU hog versus heavy-interactive.

Low priority client: 3x 2.5ms jobs followed by a 0.5ms wait.
Interactive client: 1x 0.5ms job followed by a 10ms wait.

https://people.igalia.com/tursulin/drm-sched-qddl/lowhog-interactive.png

No difference between the schedulers.

6. Last set of test scenarios will have three subgroups.

In all cases we have two interactive (synchronous, single job at a time) clients
with a 50% "duty cycle" GPU time usage.

Client 1: 1.5ms job + 1.5ms wait (aka short bursty)
Client 2: 2.5ms job + 2.5ms wait (aka long bursty)

a) Both normal priority.

https://people.igalia.com/tursulin/drm-sched-qddl/5050-short.png
https://people.igalia.com/tursulin/drm-sched-qddl/5050-long.png

Both schedulers favour the higher frequency duty cycle, with qddl giving it a
little bit more, which should be good for interactivity.

b) Normal vs low priority.

https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-normal.png
https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-low.png

Qddl gives a bit more to the normal priority client than to the low priority
one.

c) High vs normal priority.

https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-high.png
https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-normal.png

Again, qddl gives a bit more share to the higher priority client.

Overall, qddl looks like a potential improvement in terms of fairness,
especially in avoiding priority starvation. There do not appear to be any
regressions with the tested workloads.

As before, I am looking for feedback and ideas on what kinds of submission
scenarios to test. Testers on different GPUs would be very welcome too.

And I should probably test round-robin at some point, to see whether we are
maybe okay to drop it unconditionally, or whether further work on improving
qddl would be needed.

v2:
   * Fixed many rebase errors.
   * Added some new patches.
   * Dropped single shot dependency handling.

v3:
   * Added scheduling quality unit tests.
   * Refined a tiny bit by adding some fairness.
   * Dropped a few patches for now.

Cc: Christian König <christian.koe...@amd.com>
Cc: Danilo Krummrich <d...@redhat.com>
Cc: Matthew Brost <matthew.br...@intel.com>
Cc: Philipp Stanner <pstan...@redhat.com>
Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-pra...@amd.com>
Cc: Michel Dänzer <michel.daen...@mailbox.org>

Tvrtko Ursulin (14):
    drm/sched: Add some scheduling quality unit tests
    drm/sched: Avoid double re-lock on the job free path
    drm/sched: Consolidate drm_sched_job_timedout
    drm/sched: Clarify locked section in drm_sched_rq_select_entity_fifo
    drm/sched: Consolidate drm_sched_rq_select_entity_rr
    drm/sched: Implement RR via FIFO
    drm/sched: Consolidate entity run queue management
    drm/sched: Move run queue related code into a separate file
    drm/sched: Add deadline policy
    drm/sched: Remove FIFO and RR and simplify to a single run queue
    drm/sched: Queue all free credits in one worker invocation
    drm/sched: Embed run queue singleton into the scheduler
    drm/sched: De-clutter drm_sched_init
    drm/sched: Scale deadlines depending on queue depth

   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |   6 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  27 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |   5 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h     |   8 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c   |   8 +-
   drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c       |   8 +-
   drivers/gpu/drm/scheduler/Makefile            |   2 +-
   drivers/gpu/drm/scheduler/sched_entity.c      | 121 ++--
   drivers/gpu/drm/scheduler/sched_fence.c       |   2 +-
   drivers/gpu/drm/scheduler/sched_internal.h    |  17 +-
   drivers/gpu/drm/scheduler/sched_main.c        | 581 ++++--------------
   drivers/gpu/drm/scheduler/sched_rq.c          | 188 ++++++
   drivers/gpu/drm/scheduler/tests/Makefile      |   3 +-
   .../gpu/drm/scheduler/tests/tests_scheduler.c | 548 +++++++++++++++++
   include/drm/gpu_scheduler.h                   |  17 +-
   15 files changed, 962 insertions(+), 579 deletions(-)
   create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c
   create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c




