On Thu, 2025-12-11 at 12:55 +0100, Christian König wrote:
> On 12/11/25 12:32, Philipp Stanner wrote:
> > On Wed, 2025-12-10 at 09:08 -0800, Matthew Brost wrote:
> > > On Wed, Dec 10, 2025 at 02:06:15PM +0100, Philipp Stanner wrote:
> > > > On Wed, 2025-12-10 at 13:47 +0100, Christian König wrote:
> > > > > On 12/10/25 10:58, Philipp Stanner wrote:
> > > > > > On Tue, 2025-12-09 at 17:58 -0800, Matthew Brost wrote:
> > > > > > > On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:
> ...
> > > > > > 
> > > > > > I can tell you how I design it in our Rust jobqueue:
> > > > > > Drivers create jobs, and in submit_job() they pass ownership of the
> > > > > > job to the jobqueue – IOW, after pushing a job, a driver can't access
> > > > > > it anymore. In the run_job() callback, the jobqueue either passes the
> > > > > > job back by value (ownership) or lends the job to the driver so that
> > > > > > it can be copied (this is done so that the JQ can hypothetically do
> > > > > > resubmits).
> > > > > > 
> > > > > > This way there is no need for refcounting (in Rust / jobqueue).
> > > > > > 
> > > 
> > > See below. If you need to resubmit for any reason, where will the
> > > information for resubmission be stored? Likewise, if you want to drop
> > > additional references on fence signaling, how are you going to retrieve
> > > that?
> > 
> > Well yes, as I just stated, it is, unfortunately, always necessary to
> > have a list of running jobs. The jobs inside of it don't need to be
> > shared with the driver, though.
> Actually the initial implementation of the scheduler didn't have a list of
> running jobs.
> 
> The original idea was that you just call run_job(), which returns a reference
> to the HW fence, and then the scheduler installs a callback on this HW fence
> which wakes the scheduler up again to push the next job.
> 
> The idea of the scheduler being responsible for tracking what's in flight on
> the HW came much later and, to be honest, is not functionality the scheduler
> actually needs.

As far as I understand GPU driver design, it's unavoidable to have at the
very least a list of finished fences in the scheduler, so that you can
signal them once the hardware fences signal.

And at the very least once you have to do resubmissions you need to
remember which jobs / fences are in flight.

Unless it were possible to use the hardware_fence for everything.
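
To make that concrete, here is roughly how I picture it in the Rust
jobqueue – just a sketch with made-up, simplified names, not the actual
API:

// Sketch only; all names are invented to illustrate the ownership model.
struct Job {
    ring_offset: u64, // whatever the driver needs to ring the doorbell
}

#[derive(Default)]
struct DoneFence {
    signaled: bool, // stand-in; the real thing is a shared, refcounted fence
}

struct InFlight {
    job: Job,        // the queue owns the job after submit_job()
    done: DoneFence, // signaled once the corresponding HW fence signals
}

#[derive(Default)]
struct JobQueue {
    in_flight: Vec<InFlight>,
}

impl JobQueue {
    // Ownership of the job moves into the queue; the driver can't touch it
    // afterwards, so no refcounting of jobs is needed.
    fn submit_job(&mut self, job: Job) {
        // run_job() would lend &job to the driver here and get the HW fence
        // back.
        self.in_flight.push(InFlight { job, done: DoneFence::default() });
    }

    // Called when the HW fence of the oldest job signals: signal the done
    // fence and drop the job – the queue was its only owner.
    fn hw_fence_signaled(&mut self) {
        if self.in_flight.is_empty() {
            return;
        }
        let mut entry = self.in_flight.remove(0);
        entry.done.signaled = true;
        // `entry.job` is dropped here; nobody else holds it.
    }
}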


> 
> > Jobqueue could do resubmits with the help of that list, by running the
> > desired jobs again. While run_job() executes, jobs are loaned to the
> > driver, which only needs them temporarily, not permanently.
> > 
> > One could have that loaning in C, too, but would not be able to enforce
> > it.
> > 
> > > 
> > > > > > Maybe the core of the problem is not so much the lack of
> > > > > > refcounting, but the lack of ownership rules. Why would the driver
> > > > > > even still need the job after it got pushed? It should be
> > > > > > fire-and-forget.
> > > > > 
> > > > > Yeah, that sounds sane to me and is exactly how it was initially
> > > > > designed in the drm_scheduler as well.
> > > > > 
> > > > > The job is basically just the information the driver needs for the
> > > > > submission, which it gives to the scheduler on push and which the
> > > > > scheduler gives back to the driver on pop.
> > > > > 
> > > > > The full execution time is represented by the scheduler fence, not
> > > > > the job. And the scheduler fence is reference counted exactly because
> > > > > of the reasons Matthew brought up here.
> > > > 
> > > > Would be interesting to hear where Xe would still need the job. If only
> > > > the backend_ops give a driver access to a job again after it got
> > > > pushed, then it should be safe.
> > > > 
> > > 
> > > Xe needs a subset of the job after submission to handle tasks like
> > > resubmission after a device reset.
> > > 
> > 
> > the job or the jobS?
> > 
> > Because you get the job that caused the timeout from the scheduler,
> > through timedout_job().
> > 
> > And the rest you need will soonish be obtainable through the new
> > iterator. So what else do you need?
> > 
> > 
> > >  It’s questionable whether we need
> > > this, as it shouldn’t happen in practice—only individual queues should
> > > fail with a working KMD and hardware. It likely doesn’t work anyway if
> > > queues have interdependencies. This is really an opportunistic approach.
> > > 
> > > However, we absolutely need this for VF migration resubmission. Again,
> > > this requires only a very small subset of driver job information. I
> > > believe it’s just the starting point in the ring, batch address(es), and
> > > a pointer to the driver-side queue object.
> > 
> > In Rust, I lend the job to the driver. So if it really needs something
> > about it permanently, it can copy it into some object with a decoupled
> > lifetime.
> > Or maybe have the job-struct's generic data-field contain something
> > refcounted, IDK.
> > 
> > 
> > 
> > > 
> > > We also build a reference-counting model around jobs, where the final
> > > put releases other objects and runtime power management references. This
> > > assumes that the job’s final put means the scheduler fence is signaled.
> > > Again, this is really just a small subset of information we need here.
> > > 
> > > So if we add hooks to store the subset of information Xe needs for
> > > everything above in the scheduler fence and a non-IRQ, pausable callback
> > > (i.e., won’t execute when the scheduler is stopped, like free_job), this
> > > could be made to work. We really don’t need about 90% of the information
> > > in the job and certainly nothing in the base object.
> > > 
> > > This would be major surgery, though. I suspect most drivers have a
> > > subset of information in the job that needs to stick around until it
> > > signals, so this means surgery across 11 drivers.
> > 
> > Not sure if that's worth it. My hope would rather be that interested
> > users with firmware scheduling can switch to jobqueue and start over
> > with a fresh, clean design with proper memory lifetimes.
> 
> That sounds reasonable to me. I was more than once at the point of wanting
> to nuke the scheduler and start from scratch again.
> 

Actually, ever since the 1950s the lesson has been that you don't want to
rewrite software at all. Yet we end up at that point again and again.

That only works if you *really* get it right. Even with jobqueue it's
already becoming more complicated, because we might have to support
resubmits, which I had hoped we could 100% avoid.

drm_sched's primary design mistake is having two job lists with different
lifetimes AND, on top of that, resubmits.
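
If we do end up having to support resubmits, my rough idea is that the one
in-flight list is enough for that, too – again only a sketch, reusing the
made-up types from the sketch further up:

// Sketch, made-up names: resubmission from the one and only pending list.
// The jobs never leave the queue; the driver only gets a borrow per job.
fn resubmit_all(queue: &JobQueue, run_job: impl Fn(&Job)) {
    for entry in &queue.in_flight {
        // Lend the job to the driver so it can feed the hardware again.
        run_job(&entry.job);
    }
}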

> > > > > I'm absolutely not against reference counting; what I'm pushing back
> > > > > on is abusing the job object for something it was never designed for,
> > > > > while we already have an object which implements exactly the needed
> > > > > functionality.
> > > 
> > > Well, oops. Having free_job called after the fence is signaled is how I
> > > arrived at the implementation in Xe. I agree that if we can move driver
> > > info into the scheduler fence, this could likely work for everyone.
> 
> Ok in that case I'm going to give that a try.
> 
> > > > > > > > > And my uneducated guess is that it's happening in amdgpu. It 
> > > > > > > > > seems a
> > > > > > > > > sched_job lives inside an amdgpu_job. Can the latter be freed 
> > > > > > > > > at other
> > > > > > > > > places than free_job()?
> > > > > > > > 
> > > > > > > > > > 
> > > > > > 
> > > > > > […]
> > > > > > 
> > > > > > > > > > It basically says to the driver that the job lifetime
> > > > > > > > > > problems created by the scheduler are the driver's problem
> > > > > > > > > > and need to be worked around there.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > My POV still mostly is that (with the current design) the 
> > > > > > > > > driver must
> > > > > > > > > not use jobs after free_job() was invoked. And when that 
> > > > > > > > > happens is
> > > > > > > > > unpredictable.
> > > > > > > > > 
> > > > > > > 
> > > > > > > This is somewhat of an absurd statement from my point of view. I 
> > > > > > > have a
> > > > > > > valid job pointer, then I call another function (see above for an
> > > > > > > example of how drm_sched_start/stop is unsafe) and it disappears 
> > > > > > > behind
> > > > > > > my back.
> > > > > > > 
> > > > > > 
> > > > > > The statement is absurd because reality (the code) is absurd. We all
> > > > > > are basically Alice in Wonderland, running as fast as we can just to
> > > > > > remain on the same spot ^_^
> > > > > > 
> > > > > > What I am stating is not that this is *good*, but this is what it
> > > > > > currently is like. Whether we like it or not.
> > > > > > 
> > > > > > The misunderstanding you and I might have is that for me jobs 
> > > > > > having to
> > > > > > be refcounted is not a reality until it's reflected in code,
> > > > > > documentation and, ideally, drivers.
> > > > > > 
> > > 
> > > I agree this part is badly misdesigned. In the timedout job callback,
> > > you’re handed a job, and if you perform certain actions, it can just
> > > disappear— even all the way back to the caller of timedout_job. That’s
> > > not great. Then we have this free_guilty mechanism to avoid it
> > > disappearing, but sometimes it still does, which is horrible.
> > 
> > Who makes it disappear, the driver callback? Because that free_guilty
> > mechanism is what frees jobs in the first place.
> 
> I think so yes.

The driver must not free the job in timedout_job. That's completely
against the design, be it bad or good. free_job() is the point where
you can free. That's why it's called that.
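
For contrast, that's exactly the kind of rule that ownership makes
impossible to violate by construction. In the Rust jobqueue the timeout
handler would only ever see a borrow – once more just a sketch with the
made-up types from above:

// Sketch, made-up names: the timeout handler only gets a shared borrow of
// the job, so the driver cannot free it or keep it, even if it wanted to.
trait DriverOps {
    fn timed_out(&mut self, job: &Job);
}

fn handle_timeout(queue: &JobQueue, ops: &mut impl DriverOps) {
    if let Some(entry) = queue.in_flight.first() {
        ops.timed_out(&entry.job);
        // The job is still owned by the queue; it is only freed when the
        // queue drops it, analogous to free_job().
    }
}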

> 
> > The more you think about it, the more astonished you become at how this
> > could ever have been designed and merged that way. There was no clean
> > design anywhere – not in the APIs, not in the lifetimes, not in the locking.
> 
> Yeah, which is exactly the reason why I said I'm not going to maintain that 
> stuff.
> 
> I mean I still feel guilty that it came this far, but yeah.

You've said so frequently; don't worry too much. I just bring it up out of
surprise because, you know, some design mistakes are unavoidable. It's all
just very difficult: GPU resubmits, hangs and all that. But other things and
certain hacks (drm_sched_fini()) are so *obviously* broken that you wonder...


P.

> 
> > > > > > >  The safe way to handle this is to take a local reference before
> > > > > > > doing anything that could make it disappear. That is much more
> > > > > > > reasonable than saying, “we have five things you can do in the
> > > > > > > scheduler, and if you do any of them it isn’t safe to touch the 
> > > > > > > job
> > > > > > > afterward.”
> > > > > > 
> > > > > > Yeah, but that's drm_sched being drm_scheddy. Before I documented it
> > > > > > there were also these implicit refcounting rules in run_job(), where
> > > > > > the driver needs to take the reference for the scheduler for it to 
> > > > > > be
> > > > > > non-racy.
> > > > > > 
> > > 
> > > Yes, agreed. This is my fault for not being more responsible in fixing
> > > issues rather than just backing away from these really scary parts of
> > > the code (e.g., drm_sched_stop, drm_sched_start,
> > > drm_sched_resubmit_jobs, etc.) and doing something sane in Xe by using
> > > only a subset of the scheduler.
> > 
> > It's a bit like writing C++: no one can ever agree which feature subset
> > is safe to use.
> > 
> > That's why we want to make a fresh start for firmware-schedulers, since
> > they allow us to drastically simplify things. Timeout? Close the ring.
> > Job-pushing is fire and forget. Resets? Rings aren't shared.
> 
> +1
> 
> > > > > > It also wasn't documented for a long time that drm_sched (through
> > > > > > spsc_queue) will explode if you don't use entities with a single
> > > > > > producer thread.
> > > > > 
> > > > > That is actually documented, but not on the scheduler, rather on the
> > > > > dma_fence.
> > > > > 
> > > 
> > > spsc - "Single producer, Single consumer". So it is in the name.
> > 
> > Ah, NTOTM.
> > 
> > What's obvious to one party is a mystery to someone with a different
> > mind. I only recognized the meaning after months, after work one day.
> > 
> > But don't get me started on that queue……
> > 
> > > 
> > > > > And that you can only have a single producer is a requirement 
> > > > > inherited from the dma_fence and not scheduler specific at all.
> > > > 
> > > > What does dma_fence have to do with it? It's about the spsc_queue being
> > > > racy like mad. You can access and modify dma_fences in parallel however
> > > > you want – they are refcounted and locked.
> > > > 
> > > > 
> > > > P.
> > > > 
> > > > > 
> > > > > > drm_sched got here because of gross design mistakes, lots of hacks
> > > > > > for a few drivers, and, particularly, a strange aversion¹ to writing
> > > > > > documentation. If Xe came, back in the day, to the conclusion that
> > > > > > job lifetimes are fundamentally broken and that the objectively
> > > > > > correct way to solve this is refcounting, then why wasn't that
> > > > > > pushed into drm_sched back then?
> > > > > > 
> > > > > > > 
> > > > > > > > > To be unfair, we already have strange refcount expectations.
> > > > > > > > > But I sort of agree that there is no objectively good
> > > > > > > > > solution in sight.
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Let me respin my documentation series and upstream
> > > > > > > > > > > that soonish, then we can build on top of that.
> > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > P.
> > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 2. Move reference counting to the base DRM scheduler 
> > > > > > > > > > > > job object, provide a
> > > > > > > > > > > >    vfunc for the final job put, and document usage. 
> > > > > > > > > > > > Medium invasive.
> > > > > > > > > > 
> > > > > > > > > > I strongly think that reference counting the job object 
> > > > > > > > > > just because the scheduler needs it is a bad idea.
> > > > > > > > > > 
> > > > > > > > > > With that we are just moving the hot potato from one side 
> > > > > > > > > > of our mouth to the other without really solving the issue.
> > > > > > > > > > 
> > > > > > > 
> > > > > > > See above—I can’t say I agree with this assessment. I think the 
> > > > > > > lack of
> > > > > > > reference counting is exactly the problem. I don’t really 
> > > > > > > understand the
> > > > > > > pushback on a very well-understood concept (reference counting) in
> > > > > > > Linux. I would sign up to fix the entire subsystem if we go this 
> > > > > > > route.
> > > > > > > 
> > > > > > > > > > If a driver like XE needs that for some reason then that is
> > > > > > > > > > perfectly fine.
> > > > > > > > > 
> > > > > > > > > Nouveau doesn't need it either.
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > > 3. Move job (driver) side tracking to the scheduler 
> > > > > > > > > > > > fence and let it
> > > > > > > > > > > >    control the lifetime. Very invasive.
> > > > > > > > > > 
> > > > > > > > > > Thinking about it more that is actually not so much of a 
> > > > > > > > > > problem.
> > > > > > > > > > 
> > > > > > > > > > Let me try to code something together by the end of next 
> > > > > > > > > > week or so.
> > > > > > > > > 
> > > > > > > > > The hero Gotham needs :)
> > > > > > > > > 
> > > > > > > 
> > > > > > > Are you sure about this one? My feeling is that unless the
> > > > > > > problems around drm_sched_start/stop and free_guilty are fixed,
> > > > > > > this entire thing is still badly broken for anyone who wants to
> > > > > > > use those.
> > > > > > > 
> > > > > > > To sum up this whole email: I strongly disagree with option #3, 
> > > > > > > but if
> > > > > > > that is the consensus, I will, of course, support the effort.
> > > > > > 
> > > > > > 
> > > > > > I would like to discuss those topics with Danilo, too, who returns 
> > > > > > from
> > > > > > LPC soonish. Also to get some more insights into Nouveau and our 
> > > > > > use-
> > > > > > cases.
> > > > > > 
> > > > > > My suggestion is that we pick the conversation up again soonish.
> > > > > > Christmas is around the corner, and I suppose we can't fix this all 
> > > > > > up
> > > > > > in 2025 anyways, so we might want to give it a fresh re-spin in '26.
> > > > > 
> > > > > Since we finally found the root cause, I'm all for postponing that
> > > > > till next year.
> > > > > 
> > > 
> > > Ok, glad we found the root cause. I’m not as opposed to option #3 as
> > > stated—this was a bit of angry typing—but if we go in that direction, we
> > > really need clear rules, for example:
> > > 
> > > - A job cannot be referenced driver-side after the initial
> > >   drm_sched_entity_push_job call, aside from a single run_job callback.
> > 
> > That's what the current code and documentation demand, yes.
> > 
> > >   Maybe run_job is actually replaced by a scheduler fence input?
> > 
> > fence input?
> 
> ops->schedule(job) ?
> 
> > > - Drivers can attach information to the scheduler fence and control its
> > >   lifetime.
> > > - Drivers can iterate over pending scheduler fences when the scheduler
> > >   is stopped.
> > 
> > That sounds as if we're about to make a mistake with the job-iterator.
> > 
> > > - Drivers receive a pausable callback in a non-IRQ context when the
> > >   scheduler fence signals.
> > > 
> > > etc...
> > > 
> > > Again, this is a pretty major change. I personally wouldn’t feel
> > > comfortable hacking 11 drivers—10 of which aren’t mine—to do something
> > > like this. Refcounting the job would be less invasive and would make the
> > > existing hairball of code safe.
> > 
> > See my firmware-scheduler comment above. The issue is that getting
> > systems with lax rules back under control in hindsight is 10x as
> > expensive as carefully designing strict rules from the get-go.
> > 
> > My experience so far is that a maintainer's primary job is actually
> > keeping APIs consistent and forcing people to document everything
> > properly.
> 
> I would rather say that a maintainer's job is to reject bad ideas and push
> for good ones.
> 
> Christian.
> 
> > 
> > 
> > P.
