On 02.04.25 at 10:26, Tvrtko Ursulin wrote:
> On 02/04/2025 07:49, Christian König wrote:
>>
>> First of all, impressive piece of work.
>
> Thank you!
>
> I am not super happy though, since some sort of CFS would be much better.
> But that would require cracking the entity GPU time tracking problem, which
> I have tried twice so far without finding a generic, elegant and not too
> intrusive solution.

If the numbers you posted hold true, I think this solution here is perfectly 
sufficient. Keep in mind that GPU submission scheduling is a bit more 
complicated than classic I/O scheduling.

E.g. you have no idea how much work a GPU submission was until it has completed, 
whereas for a classic I/O transfer you know beforehand how many bytes will be 
transferred. That makes tracking things far more complicated.
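
To illustrate the difference, here is a minimal sketch in Python. The class and method names are made up for this mail and have nothing to do with the actual drm/sched code; it only shows that an I/O request can be charged at submit time while GPU time can only be charged once the job has finished.

```python
from dataclasses import dataclass, field


@dataclass
class IoAccount:
    """Hypothetical I/O accounting: the cost (bytes) is known at submit time."""
    charged_bytes: int = 0

    def submit(self, nbytes: int) -> None:
        # The size of the transfer is known up front, so charge immediately.
        self.charged_bytes += nbytes


@dataclass
class GpuAccount:
    """Hypothetical GPU accounting: the cost is only known after completion."""
    charged_ns: int = 0
    running: dict = field(default_factory=dict)  # job id -> start timestamp

    def start(self, job_id: int, now_ns: int) -> None:
        # Nothing can be charged yet, only remember when the job started.
        self.running[job_id] = now_ns

    def complete(self, job_id: int, now_ns: int) -> None:
        # Only now do we know how much GPU time the job consumed.
        self.charged_ns += now_ns - self.running.pop(job_id)
```

Any scheduling decision made while a job is still in flight has to work without that job's cost, which is what makes a CFS-like approach harder here.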

>
>>> Let's look at the results:
>>>
>>> 1. Two normal priority deep queue clients.
>>>
>>> These ones submit one second worth of 8ms jobs. As fast as they can, no
>>> dependencies etc. There is no difference in runtime between FIFO and qddl,
>>> but the latter allows both clients to progress with work more evenly:
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/normal-normal.png
>>>
>>> (X axis is time, Y is submitted queue-depth, hence lowering of qd
>>> corresponds with work progress for both clients, tested with both
>>> schedulers separately.)
>>
>> This was basically the killer argument why we implemented FIFO in the first 
>> place. RR completely sucked on fairness when you have many clients 
>> submitting many small jobs.
>>
>> It looks like the deadline scheduler is even better than FIFO in that 
>> regard, but I would also add a test with (for example) 100 clients doing 
>> submissions at the same time.
>
> I can try that. So 100 clients with very deep submission queues? How deep? 
> Fully async? Or some synchronicity and what kind?

Not deep queues, more like 4-8 jobs maximum for each. Send all submissions 
roughly at the same time and with the same priority.

With 100 entities, two consecutive submissions from the same entity should 
ideally have ~99 submissions from other entities in between them.

Record the minimum and maximum of that value and you should have a good 
indicator of how well the algorithm performs.
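
A rough sketch of how that indicator could be computed, in Python for brevity. It assumes the test can record the order in which jobs were picked as a list of entity ids; the function names and the two example orderings are only illustrative.

```python
def interleave_gaps(order):
    """Return (min_gap, max_gap), where a gap is the number of submissions
    from other entities between two consecutive submissions of the same
    entity in the recorded execution order."""
    last_pos = {}
    gaps = []
    for pos, entity in enumerate(order):
        if entity in last_pos:
            gaps.append(pos - last_pos[entity] - 1)
        last_pos[entity] = pos
    if not gaps:
        raise ValueError("need at least one entity with two submissions")
    return min(gaps), max(gaps)


def perfectly_interleaved(num_entities, jobs_per_entity):
    # Ideal case: one job from every entity per round.
    return [e for _ in range(jobs_per_entity) for e in range(num_entities)]


def back_to_back(num_entities, jobs_per_entity):
    # Worst case: each entity drains its whole queue before the next one runs.
    return [e for e in range(num_entities) for _ in range(jobs_per_entity)]


print(interleave_gaps(perfectly_interleaved(100, 4)))
print(interleave_gaps(back_to_back(100, 4)))
```

The closer both values are to the number of entities minus one, the more evenly the scheduler interleaves the clients.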

You can then of course start to make it more complicated, e.g. 50 entities 
with 8 submissions each, every submission taking 4ms, and 50 other entities 
with 4 submissions each, every submission taking 8ms.
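
As a sketch of what such a mixed test could look like, here is a small Python simulation. The picker used is a made-up "least GPU time used so far goes first" reference policy, not qddl and not anything in drm/sched, and it charges the job duration only after the job has run. Entity ids and the workload layout are assumptions for illustration.

```python
import heapq


def simulate_least_runtime_first(entities):
    """entities maps an entity id to its list of job durations in ms.
    Jobs run one at a time on a single GPU. Returns a dict mapping each
    entity id to the time (ms) at which its last job finished."""
    pending = {e: list(jobs) for e, jobs in entities.items()}
    # Heap of (GPU time used so far, entity id) for entities with work left.
    heap = [(0, e) for e in sorted(pending) if pending[e]]
    heapq.heapify(heap)
    now = 0
    finish = {}
    while heap:
        used, e = heapq.heappop(heap)
        duration = pending[e].pop(0)
        now += duration  # the job runs to completion
        if pending[e]:
            heapq.heappush(heap, (used + duration, e))
        else:
            finish[e] = now
    return finish


# 50 entities with 8 jobs of 4ms and 50 entities with 4 jobs of 8ms.
workload = {e: [4] * 8 for e in range(50)}
workload.update({e: [8] * 4 for e in range(50, 100)})

finish = simulate_least_runtime_first(workload)
print(min(finish.values()), max(finish.values()))
```

Since every entity has the same total amount of work in this setup, a fair scheduler should finish all of them close to the end of the run; a large spread between the first and the last finish time indicates that one group is being favoured.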

>
>>> 2. Same two clients but one is now low priority.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/normal-low.png
>>>
>>> Normal priority client is a solid line, low priority dotted. We can see how
>>> FIFO completely starves the low priority client until the normal priority
>>> is fully done. Only then the low priority client gets any GPU time.
>>>
>>> In contrast, qddl allows some GPU time to the low priority client.
>>>
>>> 3. Same clients but now high versus normal priority.
>>>
>>> Similar behaviour as in the previous one, with normal a bit less
>>> de-prioritised relative to high than low was against normal.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/high-normal.png
>>>
>>> 4. Heavy load vs interactive client.
>>>
>>> Heavy client emits a 75% GPU load in the format of 3x 2.5ms jobs followed
>>> by a 2.5ms wait.
>>>
>>> Interactive client emits a 10% GPU load in the format of 1x 1ms job
>>> followed by a 9ms wait.
>>>
>>> This simulates an interactive graphical client used on top of a relatively
>>> heavy background load but no GPU oversubscription.
>>>
>>> Graphs show the interactive client only and from now on, instead of looking
>>> at the client's queue depth, we look at its "fps".
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/heavy-interactive.png
>>>
>>> We can see that qddl allows a slightly higher fps for the interactive client
>>> which is good.
>>
>> The most interesting question here is: what is the maximum frame time?
>>
>> E.g. how long does the user have to wait at most for a response from the 
>> interactive client?
>
> I did a quick measurement of those metrics, for this workload only.
>
> Measured the time from submit of the first job in the group (so frame) to the 
> time the last job in the group finished, and then subtracted the expected 
> job duration to get just the wait plus overheads latency.
>
> Five averaged runs:
>
>         min    avg      max    [ms]
> FIFO    2.5    13.14    18.3
> qddl    3.2    9.9      16.6
>
> So it is a bit better in max, more so in avg latencies. The question is how 
> representative this synthetic workload is of the real world.

Well, if I'm not completely mistaken, that is roughly 9.3% better on max and 
nearly 24.7% better on average; the difference in min time is negligible as far 
as I can see.

That is more than a bit better. Keep in mind that we usually deal with 
interactive GUIs and background worker use cases, which benefit a lot from 
that kind of improvement.
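
For reference, the relative differences can be recomputed from the latency table quoted above with a few lines of Python (the helper name is just for illustration):

```python
def improvement_pct(baseline, candidate):
    """Relative reduction of candidate versus baseline, in percent.
    Positive means the candidate has lower (better) latency."""
    return (baseline - candidate) / baseline * 100.0


# Latencies in ms, taken from the table quoted above.
latency_ms = {
    "FIFO": {"min": 2.5, "avg": 13.14, "max": 18.3},
    "qddl": {"min": 3.2, "avg": 9.9, "max": 16.6},
}

for metric in ("min", "avg", "max"):
    pct = improvement_pct(latency_ms["FIFO"][metric], latency_ms["qddl"][metric])
    print(f"{metric}: {pct:+.1f}%")
```

Note that min comes out negative, i.e. qddl is 0.7ms slower there, but in absolute terms that is small compared to the gains on avg and max.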

Regards,
Christian.

>
> Regards,
>
> Tvrtko
>
>>> 5. Low priority GPU hog versus heavy-interactive.
>>>
>>> Low priority client: 3x 2.5ms jobs followed by a 0.5ms wait.
>>> Interactive client: 1x 0.5ms job followed by a 10ms wait.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/lowhog-interactive.png
>>>
>>> No difference between the schedulers.
>>>
>>> 6. Last set of test scenarios will have three subgroups.
>>>
>>> In all cases we have two interactive (synchronous, single job at a time)
>>> clients with a 50% "duty cycle" GPU time usage.
>>>
>>> Client 1: 1.5ms job + 1.5ms wait (aka short bursty)
>>> Client 2: 2.5ms job + 2.5ms wait (aka long bursty)
>>>
>>> a) Both normal priority.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-short.png
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-long.png
>>>
>>> Both schedulers favour the higher frequency duty cycle with qddl giving it a
>>> little bit more which should be good for interactivity.
>>>
>>> b) Normal vs low priority.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-normal.png
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-normal-low-low.png
>>>
>>> Qddl gives a bit more to the normal priority client than to the low one.
>>>
>>> c) High vs normal priority.
>>>
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-high.png
>>> https://people.igalia.com/tursulin/drm-sched-qddl/5050-high-normal-normal.png
>>>
>>> Again, qddl gives a bit more share to the higher priority client.
>>>
>>> Overall, qddl looks like a potential improvement in terms of fairness,
>>> especially avoiding priority starvation. There do not appear to be any
>>> regressions with the tested workloads.
>>>
>>> As before, I am looking for feedback, ideas for what kind of submission
>>> scenarios to test. Testers on different GPUs would be very welcome too.
>>>
>>> And I should probably test round-robin at some point, to see whether we are
>>> okay to drop it unconditionally, or whether further work improving qddl
>>> would be needed.
>>>
>>> v2:
>>>   * Fixed many rebase errors.
>>>   * Added some new patches.
>>>   * Dropped single shot dependency handling.
>>>
>>> v3:
>>>   * Added scheduling quality unit tests.
>>>   * Refined a tiny bit by adding some fairness.
>>>   * Dropped a few patches for now.
>>>
>>> Cc: Christian König <christian.koe...@amd.com>
>>> Cc: Danilo Krummrich <d...@redhat.com>
>>> Cc: Matthew Brost <matthew.br...@intel.com>
>>> Cc: Philipp Stanner <pstan...@redhat.com>
>>> Cc: Pierre-Eric Pelloux-Prayer <pierre-eric.pelloux-pra...@amd.com>
>>> Cc: Michel Dänzer <michel.daen...@mailbox.org>
>>>
>>> Tvrtko Ursulin (14):
>>>    drm/sched: Add some scheduling quality unit tests
>>>    drm/sched: Avoid double re-lock on the job free path
>>>    drm/sched: Consolidate drm_sched_job_timedout
>>>    drm/sched: Clarify locked section in drm_sched_rq_select_entity_fifo
>>>    drm/sched: Consolidate drm_sched_rq_select_entity_rr
>>>    drm/sched: Implement RR via FIFO
>>>    drm/sched: Consolidate entity run queue management
>>>    drm/sched: Move run queue related code into a separate file
>>>    drm/sched: Add deadline policy
>>>    drm/sched: Remove FIFO and RR and simplify to a single run queue
>>>    drm/sched: Queue all free credits in one worker invocation
>>>    drm/sched: Embed run queue singleton into the scheduler
>>>    drm/sched: De-clutter drm_sched_init
>>>    drm/sched: Scale deadlines depending on queue depth
>>>
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |   6 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  27 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h       |   5 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_trace.h     |   8 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm_sdma.c   |   8 +-
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xcp.c       |   8 +-
>>>   drivers/gpu/drm/scheduler/Makefile            |   2 +-
>>>   drivers/gpu/drm/scheduler/sched_entity.c      | 121 ++--
>>>   drivers/gpu/drm/scheduler/sched_fence.c       |   2 +-
>>>   drivers/gpu/drm/scheduler/sched_internal.h    |  17 +-
>>>   drivers/gpu/drm/scheduler/sched_main.c        | 581 ++++--------------
>>>   drivers/gpu/drm/scheduler/sched_rq.c          | 188 ++++++
>>>   drivers/gpu/drm/scheduler/tests/Makefile      |   3 +-
>>>   .../gpu/drm/scheduler/tests/tests_scheduler.c | 548 +++++++++++++++++
>>>   include/drm/gpu_scheduler.h                   |  17 +-
>>>   15 files changed, 962 insertions(+), 579 deletions(-)
>>>   create mode 100644 drivers/gpu/drm/scheduler/sched_rq.c
>>>   create mode 100644 drivers/gpu/drm/scheduler/tests/tests_scheduler.c
>>>
>>
>
