On Wed, Feb 11, 2026 at 04:00:59PM +0100, Boris Brezillon wrote:

Jumping in here as I was tagged in this thread... a lot gets through.
Randomly picking a point to reply.

> On Wed, 11 Feb 2026 15:38:32 +0100
> "Danilo Krummrich" <[email protected]> wrote:
>
> > On Wed Feb 11, 2026 at 12:12 PM CET, Boris Brezillon wrote:
> > > On Wed, 11 Feb 2026 12:00:30 +0100
> > > "Danilo Krummrich" <[email protected]> wrote:
> > >> I.e. sharing a workqueue between JobQs is fine, but we have to
> > >> ensure they can't be used for anything else.
> > >
> > > Totally agree with that, and that's where I was going with this
> > > special DmaFenceWorkqueue wrapper/abstraction, that would only
> > > accept scheduling MaySignalDmaFencesWorkItem objects.
> >
> > Not sure if it has to be that complicated (for a first shot). At
> > least for the JobQ it would probably be enough to have a helper to
> > create a new, let's say, struct JobQueueWorker that encapsulates a
> > (reference counted) workqueue, but does not give access to it
> > outside of jobq.rs.
>
> Except we need to schedule some work items that are in the
> DMA-signaling path but not directly controlled by the jobq.rs
> implementation (see [1] for the post-execution work we schedule in
> panthor).
>
> The two options I can think of are:
>
> 1. Add an unsafe interface to schedule work items on the wq attached
>    to the JobQ. Safety requirements in that case being compliance
>    with the DMA-fence signalling rules.

For (1), use lockdep to enforce these rules; I have a patch for this [1].
Something like this is probably what everyone needs: jobqueue can either
create a workqueue with this annotation or enforce that the one being
passed in already has it. I turned this on for all Xe workqueues in the
signalling path and immediately found a few bugs, and I know the
dma-fence rules pretty well, so this is clearly useful.

I think users scheduling work on the submit workqueue is valid. The
primary case in Xe is control-plane messages (e.g., queue
suspend/resume, teardown, toggling queue priority in firmware, etc.).
You don't want to race with submission while manipulating queue state,
so you order this work on the workqueue. Could you do this with a lock?
Probably, but then you'd have to audit every point that issues a
control-plane message to make sure you can take that lock. There's also
the hazard where a control message is issued in IRQ context but you
need a mutex to manipulate the queue (in Xe this is the mutex used to
send firmware commands). For example, I've implemented fence deadlines
in Xe [2], which fire control-plane messages in IRQ context. Another
example is a job dropping a ref to the queue in IRQ context, and that
being the final reference that triggers teardown. I don't do the latter
yet in Xe, but it should be possible to drop your last queue ref when a
dma-fence signals (i.e., no free_job work, just a put in the dma-fence
signalling IRQ handler) if jobqueue is designed correctly.

I'm also not sure how timeouts are supposed to work in jobqueue, but if
you need to stop/start the jobqueue to ensure you have full control
over your queue (e.g., new submissions aren't racing), then you likely
need a second workqueue so you can stop the submit one, or you might be
able to get away with a mutex. This also applies to users scheduling
workqueue operations here, such as global resets or migrating a VF,
which stop all jobqueue instances to perform fixups. These global
events can't race with jobs timing out either, since multiple entities
can't be stopping/starting jobqueue instances at the same time without
breaking things. This is why, in Xe, all job timeouts and all global
events are scheduled on a single workqueue instance shared among all
DRM sched instances. This has worked quite well, so I'd strongly
recommend carrying this part of DRM sched forward into whatever
succeeds it.
I have a fairly detailed write-up of the Xe scheduler design [3]. It's
a little stale, but it should describe how a subset of DRM sched works
very well for implementing complex driver-side scheduling requirements.
A whole other subset of DRM sched is horrid, so I'd recommend taking
the good ideas from DRM sched (queue stop/start, workqueue-based
ordering, finished fences, job tracking to completion) and using those
in jobqueue, while dropping the bad ones (no real object-lifetime
rules, no ownership rules, no refcounting, wild teardown flows, wild
dma-fence callback manipulation, etc.) and not carrying those forward.
Some of DRM sched's very bad ideas appear to be in jobqueue as well.
I'd reconsider those, but I won't harp on the design at this point.

Matt

[1] https://patchwork.freedesktop.org/patch/682491/?series=156283&rev=1
[2] https://patchwork.freedesktop.org/patch/696820/?series=159479&rev=2
[3] https://patchwork.freedesktop.org/patch/669007/?series=153000&rev=3

> 2. The thing I was describing before, where we add the concept of
>    DmaFenceWorkqueue that can only take MaySignalDmaFencesWorkItem.
>    We can then have a DmaFenceWorkqueue that's global, and pass it to
>    the JobQueue so it can use it for its own work items.
>
> We could start with option 1, sure, but since we're going to need to
> schedule post-execution work items that have to be considered part of
> the DMA-signalling path, I'd rather have these concepts clearly
> defined from the start.
>
> Mind if I give this DmaFenceWorkqueue/MaySignalDmaFencesWorkItem a
> try to see what it looks like and get the discussion going from there
> (hopefully it's just a thin wrapper around a regular
> Workqueue/WorkItem, with an extra dma_fence_signalling annotation in
> the WorkItem::run() path), or are you completely against the idea?
>
> [1] https://elixir.bootlin.com/linux/v6.19-rc5/source/drivers/gpu/drm/panthor/panthor_sched.c#L1913
