On Wed, 3 Jun 2026 21:43:05 -0300
Daniel Almeida <[email protected]> wrote:

> > On 3 Jun 2026, at 14:14, Boris Brezillon <[email protected]> 
> > wrote:
> > 
> > On Wed, 3 Jun 2026 13:41:02 -0300
> > Daniel Almeida <[email protected]> wrote:
> >   
> >>> +    /// Called when the fence is signaled.
> >>> +    ///
> >>> +    /// This is called from the fence signaling path, which may be in 
> >>> interrupt
> >>> +    /// context or with locks held, which is why `self` is only 
> >>> borrowed, so that
> >>> +    /// it cannot drop. Implementations must not sleep or perform
> >>> +    /// long-running operations.
> >>> +    ///
> >>> +    /// An implementation likely wants to inform itself (e.g., through a 
> >>> work item)
> >>> +    /// within this callback that the associated [`FenceCbRegistration`] 
> >>> can now be
> >>> +    /// dropped.
> >>> +    fn called(&mut self);    
> >> 
> >> This is a central point. We ideally would want this to consume self, 
> >> because we
> >> may want to move things out of the callback.    
> > 
> > This one comes from me. The rationale being that ::called() is called
> > from an atomic context, and the resources attached to the callback data
> > might require acquiring other sleeping locks to be released, and
> > sometimes you don't even notice immediately because said resources are
> > refcounted, and the lock is only acquired when you happen to be the
> > last owner. Yes, those can be caught at runtime if the C side is
> > properly annotated with might_sleep(), but that's not always the case.
> > 
> > If we defer the drop of the data only when the FenceCb is
> > dropped/recycled, we're at least not constrained by this "runs in
> > atomic context" thing.
> >   
> 
> This design does not solve it, because one can quite trivially get around this
> restriction using Option<T> as I said. If your point is “don’t run any drop() 
> here”,
> then &mut self doesn’t do it.

My bad, I thought you were talking about some Option<T> in
FenceCbRegistration<T> (there was one at some point, but it's gone now),
but you're talking about having an Option<X> inside the T. Yes, there's
indeed nothing preventing a drop on X in that path, and it's just as
bad as passing the fence back as value to the callback in that case.

> 
> >> 
> >> Consider a fence design where signal() consumes self. Now consider this:
> >> 
> >> ```
> >> impl FenceCb for MyCallback {
> >> fn called(&mut self) {
> >>   // Can't move the fence out, so we have to put an Option<T> just to be 
> >> able
> >>   // to move.
> >>   if let Some(f) = self.some_fence.take() {
> >>     f.signal();
> >>   }
> >> }
> >> ```
> >> 
> >> This used to be the case when our version of the job queue used the "proxy
> >> fence" design:
> >> 
> >> 
> >> ```
> >> // Callback on the hw fence
> >> impl FenceCb for MyCallback {
> >> fn called(&mut self) {
> >>   if let Some(f) = self.submit_fence.take() {
> >>     f.signal();
> >>   }  
> > 
> > I'm pretty sure lockdep won't like it anyway, because this is nested
> > locking of the same lock class. For such proxies, we'll need to teach
> > lockdep about the nesting like has been recently done on
> > dma_fence_array & co. But I'm digressing.  
> 
> Yeah, but this is more about resource transfer in general, not
> this pattern specifically.
> 
> I agree that this has issues, and yes, lockdep complained back
> then :)

The thing is, there's so many aspects that could go wrong because of the
context this callback is called in. Nested locking is one of them,
the fact we can't sleep is another. And with rust it's even worse,
because of the implicit drops that will happen when you take ownership
of resources (taking sleeping locks to remove resources from a dataset
for instance).

So, by passing self by value to the ::callback(), you're basically
telling users "hey, BTW, don't forget to defer the drop to some
workqueue if you think it's not atomic-safe". And how can users know
that the thing they're about to drop can be dropped in atomic context?
They basically have to audit the ::drop() of all the resources they
embed in their type implementing FenceCb. Not only that, but they also
have to design the thing so the deferral of this ::drop() doesn't
allocate, because, obviously, allocating in atomic context is
tricky/fallible. AFAIK, none of this can be spot at compile-time (I
remember Gary/Danilo mentioning that we could teach the klint about
some of these rules). This would leave us with runtime checks like
might_sleep(), but most of the C putters (xxx_put(object)) don't have
might_sleep() in the path where the decref doesn't lead to a refcnt=0
situation.

TLDR; Call this PTSD if you want, but this is the sort of bugs I
struggled with on the C side, and I can predict that the exact same
will happen in rust drivers if we expose the FenceCb as it is designed
here and we don't have a way to check the soundness of the FenceCb
implementations at compile time.

The other option (the one I've been advocating for from the start), is
to not let drivers implement FenceCb (make it private), but instead
have a bunch of implementations that we know are safe. Here's a list of
implementations that I think would unblock most of the drivers use
cases:

- wakeup a thread
- complete a completion object
- schedule a WorkItem
- schedule a kthread_worker (once we get a proper rust abstraction for
  that)

It doesn't mean we can't have optimized FenceCb implementations that do
a lot more in the callback() path instead of deferring to a
workqueue/thread, but at least those would have to be implemented in
dma_fence.rs, and the dma_fence.rs maintainers can then carefully audit
the code as part of the review process, which we know is not really the
case when changes touch drivers code only.

FWIW, I think the FenceProxy design you were describing falls into
this "must be carefully audited" bucket, and should be implemented in
dma_fence.rs.

> 
> >   
> >> }
> >> ```
> >> 
> >> Although this is not the case anymore, since we phased out this design 
> >> given
> >> Christian's recent work. Still, we should ideally not require Option<T> 
> >> here in
> >> general just to make resource transfer possible.  
> > 
> > I see. OTOH, don't we need to make this inner data movable if we want
> > to cancel the FenceCb before the fence is signaled anyway? And that's
> > most certainly a case we have in the teardown path.  
> 
> Can you expand a bit on what you mean here?

Never mind, I was confusing two different iterations of the code here.
I thought the Option<T> you were mentioning was in
FenceCbRegistration<T>, with some explicit ::cancel() function that
would return Option<T> so the user can get its resources back when it
cancels the registration, and also know whether the callback was called
or not. But this is all gone now, and all we can do is drop the
registration, which will automatically drop the inner T.

Reply via email to