On Wed, 2025-12-10 at 09:08 -0800, Matthew Brost wrote:
> On Wed, Dec 10, 2025 at 02:06:15PM +0100, Philipp Stanner wrote:
> > On Wed, 2025-12-10 at 13:47 +0100, Christian König wrote:
> > > On 12/10/25 10:58, Philipp Stanner wrote:
> > > > On Tue, 2025-12-09 at 17:58 -0800, Matthew Brost wrote:
> > > > > On Tue, Dec 09, 2025 at 03:19:40PM +0100, Christian König wrote:
> > > ..
> > > > > > > > My educated guess is that drm_sched_stop() inserted the job back into the pending list, but I still have no idea how it is possible that free_job is running after the scheduler is stopped.
> > > > >
> > > > > I believe I found your problem, referencing amdgpu/amdgpu_device.c here.
> > > > >
> > > > > 6718         if (job)
> > > > > 6719                 ti = amdgpu_vm_get_task_info_pasid(adev, job->pasid);
> > >
> > > WTF! There it is! Thanks a lot for pointing that out!
>
> No problem.
>
> > scripts/decode_stacktrace.sh should be able to find the exact location of a UAF. It requires manual debugging with the kernel build tree at hand, though, so that's difficult in CIs.
>
> I wasn't aware of this. Usually I do the below after rebuilding the kernel with debug symbols.
>
> gdb <offending_file.o>
> list *(<last stack trace line>)
>
> Here I just saw that the 'last stack trace line' was at the very end of amdgpu_device_gpu_recover and happened to spot this.
[…]

> > > > I can tell you how I design it in our Rust jobqueue:
> > > > Drivers create jobs, and in submit_job() they pass ownership over the job to the jobqueue – IOW after pushing a job, a driver can't access it anymore. In the run_job() callback, the jobqueue either passes the job back by value (ownership) or borrows the job to the driver so that it can be copied (this is done so that the JQ can hypothetically do resubmits).
> > > >
> > > > This way there is no need for refcounting (in Rust / jobqueue).
>
> See below. If you need to resubmit for any reason, where will the information for resubmission be stored? Likewise, if you want to drop additional references on fence signaling, how are you going to retrieve that?

Well yes, as I just stated, it is, unfortunately, always necessary to have a list of running jobs. The jobs inside of it don't need to be shared with the driver, though. The jobqueue could do resubmits with the help of that list, by simply running the desired jobs again. While run_job() executes, jobs are loaned to the driver, which only needs them temporarily, not permanently. One could have that loaning in C, too, but would not be able to enforce it.

> > > > Maybe the core of the problem is not so much the lack of refcounting, but the lack of ownership rules. Why even would the driver need the job still after it got pushed? It should be fire-and-forget.
> > >
> > > Yeah, that sounds sane to me as well and is exactly how it was initially designed in the drm_scheduler as well.
> > >
> > > The job is basically just the information the driver needs for the submission, which it gives to the scheduler on push, and the scheduler gives back to the driver on pop.
> > >
> > > The full execution time is represented by the scheduler fence and not the job. And the scheduler fence is reference counted exactly because of the reasons Matthew brought up here.
> >
> > Would be interesting to hear where Xe would still need the job. If only the backend_ops give a driver access to a job again after it got pushed, then it should be safe.
>
> Xe needs a subset of the job after submission to handle tasks like resubmission after a device reset.

The job or the jobS? Because you get the job that caused the timeout from the scheduler, through timedout_job(). And the rest you need will soonish be obtainable through the new iterator. So what else do you need?

> It's questionable whether we need this, as it shouldn't happen in practice – only individual queues should fail with a working KMD and hardware. It likely doesn't work anyway if queues have interdependencies. This is really an opportunistic approach.
>
> However, we absolutely need this for VF migration resubmission. Again, this requires only a very small subset of driver job information. I believe it's just the starting point in the ring, batch address(es), and a pointer to the driver-side queue object.

In Rust, I borrow the job to the driver. So if it really needs something about it permanently, it can copy it into some object with a decoupled lifetime. Or maybe have the job-struct's generic data-field contain something refcounted, IDK; see the rough sketch further below.

> We also build a reference-counting model around jobs, where the final put releases other objects and runtime power management references. This assumes that the job's final put means the scheduler fence is signaled.
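To make the ownership/loaning idea above a bit more concrete, here's a very rough, self-contained sketch. To be clear: this is neither the actual jobqueue nor anything resembling the drm_sched API; every name is invented, fences and dependencies are left out, and the toy queue runs the job right at submission just to keep it short.

use std::collections::VecDeque;
use std::sync::Arc;

/// What the driver implements. run_job() only *borrows* the job; if the
/// driver needs anything past this call, it has to copy it out.
trait RunJob<J> {
    fn run_job(&mut self, job: &J);
}

/// Toy job queue: it owns every submitted job until the job is retired.
struct JobQueue<J> {
    pending: VecDeque<J>,
}

impl<J> JobQueue<J> {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    /// submit_job() takes the job *by value*: after this call the driver
    /// has no way to touch it anymore (fire and forget). For brevity the
    /// toy queue runs the job immediately instead of waiting for deps.
    fn submit_job(&mut self, runner: &mut impl RunJob<J>, job: J) {
        runner.run_job(&job);        // the job is only loaned to the driver
        self.pending.push_back(job); // the queue keeps it for resubmit/timeout
    }

    /// Hypothetical resubmit path: the queue can run the pending jobs again
    /// without the driver ever having kept a pointer to them.
    fn resubmit_all(&mut self, runner: &mut impl RunJob<J>) {
        for job in &self.pending {
            runner.run_job(job);
        }
    }

    /// Called once the hardware is done with the oldest job; the only place
    /// where a job is dropped, so no refcounting is needed.
    fn retire_oldest(&mut self) -> Option<J> {
        self.pending.pop_front()
    }
}

/// Driver side: the bits an Xe-style resubmission would need (ring start,
/// batch addresses, ...) are copied out of the borrowed job into an Arc
/// with its own, decoupled lifetime.
struct MyJob {
    ring_start: u64,
    batch_addrs: Vec<u64>,
}

#[derive(Debug)]
struct ResubmitInfo {
    ring_start: u64,
    batch_addrs: Vec<u64>,
}

struct MyDriver {
    // Lives as long as the driver wants, independently of the job.
    last_submitted: Option<Arc<ResubmitInfo>>,
}

impl RunJob<MyJob> for MyDriver {
    fn run_job(&mut self, job: &MyJob) {
        // Copy the small subset we care about; the job itself stays owned
        // by the queue and may go away at any later point.
        self.last_submitted = Some(Arc::new(ResubmitInfo {
            ring_start: job.ring_start,
            batch_addrs: job.batch_addrs.clone(),
        }));
        // ... write ring_start / batch_addrs to the ring here ...
    }
}

fn main() {
    let mut q = JobQueue::new();
    let mut drv = MyDriver { last_submitted: None };

    q.submit_job(&mut drv, MyJob { ring_start: 0x1000, batch_addrs: vec![0xdead_beef] });
    // drv cannot reach the MyJob instance anymore, only its own copy:
    println!("{:?}", drv.last_submitted);

    q.resubmit_all(&mut drv);  // e.g. after a reset
    let _ = q.retire_oldest(); // the job is dropped exactly once, here
}

The point is only that the queue is the sole owner, the driver just gets a borrow during run_job(), and whatever it needs for longer (the Xe-style resubmission subset) is copied into an object with its own, decoupled lifetime.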
> Again, this is really just a small subset of information we need here.
>
> So if we add hooks to store the subset of information Xe needs for everything above in the scheduler fence, plus a non-IRQ, pausable callback (i.e., one that won't execute when the scheduler is stopped, like free_job), this could be made to work. We really don't need about 90% of the information in the job, and certainly nothing in the base object.
>
> This would be major surgery, though. I suspect most drivers have a subset of information in the job that needs to stick around until it signals, so this means surgery across 11 drivers.

I'm not sure that's worth it. My hope is rather that interested users with firmware scheduling can switch to the jobqueue and start over with a fresh, clean design with proper memory lifetimes.

> > > I'm absolutely not against reference counting, what I'm pushing back on is abusing the job object as something it was never designed for while we already have an object which implements exactly the needed functionality.
>
> Well, oops. Having free_job called after the fence is signaled is how I arrived at the implementation in Xe. I agree that if we can move driver info into the scheduler fence, this could work for likely everyone.

> > > > > > > And my uneducated guess is that it's happening in amdgpu. It seems a sched_job lives inside an amdgpu_job. Can the latter be freed at other places than free_job()?
> > > > > > >
> > > > > > > > […]
> > > > > > > > It basically says to the driver that the job lifetime problems created by the scheduler are the driver's problem and need to be worked around there.
> > > > > > >
> > > > > > > My POV still mostly is that (with the current design) the driver must not use jobs after free_job() was invoked. And when that happens is unpredictable.
> > > > >
> > > > > This is somewhat of an absurd statement from my point of view. I have a valid job pointer, then I call another function (see above for an example of how drm_sched_start/stop is unsafe) and it disappears behind my back.
> > > >
> > > > The statement is absurd because reality (the code) is absurd. We all are basically Alice in Wonderland, running as fast as we can just to remain on the same spot ^_^
> > > >
> > > > What I am stating is not that this is *good*, but this is what it currently is like. Whether we like it or not.
> > > >
> > > > The misunderstanding you and I might have is that for me jobs having to be refcounted is not a reality until it's reflected in code, documentation and, ideally, drivers.
>
> I agree this part is badly misdesigned. In the timedout job callback, you're handed a job, and if you perform certain actions, it can just disappear – even all the way back to the caller of timedout_job. That's not great. Then we have this free_guilty mechanism to avoid it disappearing, but sometimes it still does, which is horrible.

Who makes it disappear, the driver callback? Because that free_guilty mechanism is what frees jobs in the first place. The more you think about it, the more astonished you become at how this could ever have been designed and merged that way. There was no clean design anywhere, neither with APIs, nor lifetimes, nor locking.
> > > > > The safe way to handle this is to take a local reference before doing anything that could make it disappear. That is much more reasonable than saying, "we have five things you can do in the scheduler, and if you do any of them it isn't safe to touch the job afterward."
> > > >
> > > > Yeah, but that's drm_sched being drm_scheddy. Before I documented it there were also these implicit refcounting rules in run_job(), where the driver needs to take the reference for the scheduler for it to be non-racy.
>
> Yes, agreed. This is my fault for not being more responsible in fixing issues rather than just backing away from these really scary parts of the code (e.g., drm_sched_stop, drm_sched_start, drm_sched_resubmit_jobs, etc.) and doing something sane in Xe by using only a subset of the scheduler.

It's a bit like writing C++: no one can ever agree which feature subset is safe to use. That's why we want to make a fresh start for firmware-schedulers, since they allow us to drastically simplify things. Timeout? Close the ring. Job-pushing is fire and forget. Resets? Rings aren't shared.

> > > > It also wasn't documented for a long time that drm_sched (through spsc_queue) will explode if you don't use entities with a single producer thread.
> > >
> > > That is actually documented, but not on the scheduler but rather the dma_fence.
>
> spsc - "Single producer, Single consumer". So it is in the name.

Ah, NTOTM. What's obvious for one party is a mystery to someone with a different mind. I only recognized the meaning months later, one day after work. But don't get me started on that queue…

> > > And that you can only have a single producer is a requirement inherited from the dma_fence and not scheduler specific at all.
> >
> > What does dma_fence have to do with it? It's about the spsc_queue being racy like mad. You can access and modify dma_fences in parallel however you want – they are refcounted and locked.
> >
> > P.

> > > > drm_sched got here because of gross design mistakes, lots of hacks for few drivers, and, particularly, a strange aversion¹ against writing documentation. If Xe came, back in the day, to the conclusion that job lifetimes are fundamentally broken and that the objectively correct way to solve this is refcounting, then why wasn't that pushed into drm_sched back then?
> > > > > > >
> > > > > > > To be unfair, we already have strange refcount expectations. But I sort of agree that there is no objectively good solution in sight.
> > > > > > > > >
> > > > > > > > > Let me respin my documentation series and upstream that soonish, then we can build on top of that.
> > > > > > > > >
> > > > > > > > > P.
> > > > > > > > > >
> > > > > > > > > > 2. Move reference counting to the base DRM scheduler job object, provide a vfunc for the final job put, and document usage. Medium invasive.
> > > > > > > >
> > > > > > > > I strongly think that reference counting the job object just because the scheduler needs it is a bad idea.
> > > > > > > > With that we are just moving the hot potato from one side of our mouth to the other without really solving the issue.
> > > > >
> > > > > See above – I can't say I agree with this assessment. I think the lack of reference counting is exactly the problem. I don't really understand the pushback on a very well-understood concept (reference counting) in Linux. I would sign up to fix the entire subsystem if we go this route.
> > > > > > > >
> > > > > > > > If a driver like XE needs that for some reason then that is perfectly fine.
> > > > > > >
> > > > > > > Nouveau doesn't need it either.
> > > > > > > > > >
> > > > > > > > > > 3. Move job (driver) side tracking to the scheduler fence and let it control the lifetime. Very invasive.
> > > > > > > >
> > > > > > > > Thinking about it more, that is actually not so much of a problem.
> > > > > > > >
> > > > > > > > Let me try to code something together by the end of next week or so.
> > > > > > >
> > > > > > > The hero Gotham needs :)
> > > > >
> > > > > Are you sure about this one? I think unless the problems around drm_sched_start/stop and free_guilty are fixed, my feeling is this entire thing is still badly broken for anyone who wants to use those.
> > > > >
> > > > > To sum up this whole email: I strongly disagree with option #3, but if that is the consensus, I will, of course, support the effort.
> > > >
> > > > I would like to discuss those topics with Danilo, too, who returns from LPC soonish. Also to get some more insights into Nouveau and our use-cases.
> > > >
> > > > My suggestion is that we pick the conversation up again soonish. Christmas is around the corner, and I suppose we can't fix this all up in 2025 anyways, so we might want to give it a fresh re-spin in '26.
> > >
> > > Since we finally found the root cause I'm all in for postponing that till next year.
>
> Ok, glad we found the root cause. I'm not as opposed to option #3 as stated – this was a bit of angry typing – but if we go in that direction, we really need clear rules, for example:
>
> - A job cannot be referenced driver-side after the initial drm_sched_entity_push_job call, aside from a single run_job callback.

That's what the current code and documentation demand, yes.

> Maybe run_job is actually replaced by a scheduler fence input?

Fence input?

> - Drivers can attach information to the scheduler fence and control its lifetime.
> - Drivers can iterate over pending scheduler fences when the scheduler is stopped.

That sounds as if we're about to make a mistake with the job-iterator.

> - Drivers receive a pausable callback in a non-IRQ context when the scheduler fence signals.
>
> etc...
>
> Again, this is a pretty major change. I personally wouldn't feel comfortable hacking 11 drivers – 10 of which aren't mine – to do something like this. Refcounting the job would be less invasive and would make the existing hairball of code safe.

See my firmware-scheduler comment above. The issue is that getting systems with lax rules back under control after the fact is 10x as expensive as carefully designing strict rules from the get-go. A rough sketch of what such a strict contract could look like follows below.
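To be clear, the sketch is made up from start to finish: it is neither drm_sched nor the actual jobqueue API, locking and error handling are hand-waved, and "payload" stands in for whatever small subset Xe would attach to the fence.

use std::sync::{Arc, Mutex};

struct Job { ring_start: u64 }

/// Driver payload lives in the scheduler fence, not in the job.
struct SchedFence<P> { seqno: u64, payload: Arc<P> }

/// The rule set above, written down as a driver-facing trait.
trait Driver<P> {
    /// Rule 1: the only time the driver sees the job after pushing it;
    /// it hands back the payload to attach to the fence (rule 2).
    fn run_job(&mut self, job: &Job) -> Arc<P>;
    /// Rule 4: process-context callback on fence signaling; never called
    /// while the scheduler is stopped.
    fn fence_signaled(&mut self, fence: &SchedFence<P>);
}

struct Scheduler<P> {
    pending: Mutex<Vec<SchedFence<P>>>,
    stopped: bool,
    seqno: u64,
}

impl<P> Scheduler<P> {
    fn new() -> Self {
        Self { pending: Mutex::new(Vec::new()), stopped: false, seqno: 0 }
    }

    /// Fire and forget: the job is consumed here and never handed back.
    fn push_job(&mut self, drv: &mut impl Driver<P>, job: Job) -> u64 {
        let payload = drv.run_job(&job); // single run_job borrow (rule 1)
        self.seqno += 1;
        self.pending.lock().unwrap().push(SchedFence { seqno: self.seqno, payload });
        self.seqno // the job itself is dropped right here
    }

    fn stop(&mut self) { self.stopped = true; }
    fn start(&mut self) { self.stopped = false; }

    /// Rule 3: while stopped, the driver may walk the pending fences,
    /// e.g. to fish resubmission info out of the payloads.
    fn for_each_pending(&self, mut f: impl FnMut(&SchedFence<P>)) {
        assert!(self.stopped, "only iterate while the scheduler is stopped");
        self.pending.lock().unwrap().iter().for_each(|fence| f(fence));
    }

    /// Signal the oldest fence: the driver callback runs in plain process
    /// context, and the payload is dropped afterwards unless the driver
    /// still holds its own Arc reference to it.
    fn signal_oldest(&mut self, drv: &mut impl Driver<P>) {
        assert!(!self.stopped);
        let fence = { self.pending.lock().unwrap().remove(0) }; // lock dropped here
        drv.fence_signaled(&fence);
    }
}

/// Toy driver whose payload is just the ring offset it might need for a
/// later resubmission.
struct MyDriver;

impl Driver<u64> for MyDriver {
    fn run_job(&mut self, job: &Job) -> Arc<u64> { Arc::new(job.ring_start) }
    fn fence_signaled(&mut self, fence: &SchedFence<u64>) {
        println!("seqno {} signaled, ring_start was {:#x}", fence.seqno, *fence.payload);
    }
}

fn main() {
    let mut sched = Scheduler::new();
    let mut drv = MyDriver;
    sched.push_job(&mut drv, Job { ring_start: 0x1000 });
    sched.stop();
    sched.for_each_pending(|fence| println!("pending seqno {}", fence.seqno)); // rule 3
    sched.start();
    sched.signal_oldest(&mut drv); // rule 4
}

The job only exists inside push_job(); everything with a longer lifetime hangs off the refcounted fence payload, which is also what the stopped-scheduler iteration and the signaling callback hand to the driver.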
My experience so far is that a maintainer's primary job is actually keeping APIs consistent and forcing people to document everything properly.

P.

> Matt

> > > Christian.

> > > > Greetings,
> > > > P.

> > > > [¹] The strangest aversion in drm_sched, however, is the one against locking. Locks were only taken when *absolutely* necessary. It's as if the entire code was designed by some gamers who want to show their youtube subscribers that they can get 1fps more by changing RAM timings in the bios.
> > > > drm_sched might be the component in the kernel using the most synchronization mechanisms: spinlocks, RCU, atomic integers, atomic instructions, refcounting, READ_ONCE(), just accessing locklessly… even Paul McKenney would get wide eyes here. The only thing we're still missing is seqlocks and rw_locks, but maybe we can add those /s
> > > > That's likely something we can't get repaired at all anymore.
