On 02/04/2025 14:37, Christian König wrote:
Am 02.04.25 um 15:22 schrieb Michel Dänzer:
On 2025-04-02 14:00, Philipp Stanner wrote:
On Wed, 2025-04-02 at 12:58 +0200, Michel Dänzer wrote:
On 2025-04-02 12:46, Philipp Stanner wrote:
On Mon, 2025-03-31 at 21:16 +0100, Tvrtko Ursulin wrote:
Round-robin being the non-default policy and unclear how much it
is
used,
we can notice that it can be implemented using the FIFO data
structures if
we only invent a fake submit timestamp which is monotonically
increasing
inside drm_sched_rq instances.
So instead of remembering which was the last entity the scheduler
worker
picked, we can bump the picked one to the bottom of the tree,
achieving
the same round-robin behaviour.
Advantage is that we can consolidate to a single code path and
remove
a
bunch of code. Downside is round-robin mode now needs to lock on
the
job
pop path but that should not be visible.
Why did you decide to do it that way and then later remove RR &
FIFO
alltogether in patch 10, basically?
I think the far cleaner way for our development-process would be a
separate patch(-series) that *removes* RR completely. Advantages
are:
1. It should be relatively easy to do
2. It would simplify the existing code base independently of
what
happens with your RFC series here
3. Before changing everyone's scheduling policy to a completely
new,
deadline-based one, we could first be sure for a few release
cycles that everyone is now on FIFO, establishing common
ground.
4. We could CC every- and anyone who might use RR or might know
someone who does
5. If it turns out we screwed up and someone really relies on
RR, it
would be easy to revert.
I am not aware of any RR users and have, in past discussions, never
heard of any. So removing it is more tempting for the above
reasons.
https://gitlab.freedesktop.org/drm/amd/-/issues/2516 has a bunch of
RR users...
Right, there's a number of people complaining about the regression. But
what I'm interested in is: how did it evolve since then. Are there
distributions who set the module parameter? Does Steam do it? Or is it
individual users who work around the problem that way?
I know only of the latter.
https://gitlab.freedesktop.org/drm/amd/-/issues/2516#note_2679509
Tvrtko, you should probably prepare a branch on fdo and let the people
on that bug report test it.
If people are interested to test I have two branches. This qddl approach:
https://gitlab.freedesktop.org/tursulin/kernel/-/commits/drm-sched-deadline?ref_type=heads
And a very early CFS-like experiment:
https://gitlab.freedesktop.org/tursulin/kernel/-/commits/drm-sched-cfs?ref_type=heads
I would certainly be very interested to get feedback.
^ this comment for example seems to indicate that on newer Wayland
versions part of the problem has vanished?
That's about using the Wine wayland driver (which uses the Wayland protocol
directly) instead of the x11 driver (which uses the X11 protocol via Xwayland).
Xwayland not being involved can avoid at least some of the issues (in
particular, the scenario I described
inhttps://gitlab.freedesktop.org/drm/amd/-/issues/2516#note_2119750 can't
happen then). That doesn't solve the issues when Xwayland is involved though,
just avoids them.
From what I remember the bug condition is that the display server
submits works after the client has started rendering the next frame.
With RR that resulted in the display servers work to be picked first
while with FIFO the clients works would be picked first.
Arguable that RR picked the display server was just coincident while
with Tvrtko's work the fair queuing should make that perfectly intentional.
So this patch set here should (in theory) help a lot with that bug
report and finally make RR superfluous.
Disclaimer that it is way too generous to call what I made fair. It may
help with the above described case *if* the client has queue depth > 1,
which would then push out its deadline a bit further out so display
server could beat it even if submitted later. But it also may need
refining to account for currently executing jobs as the queue-depth too
(currently it does not). And also to perhaps base the deadline buckets
on some dynamically adjusting runtime measure rather than hardcoding them.
Regards,
Tvrtko