On Thu, 2026-06-04 at 10:15 +0200, Boris Brezillon wrote:
> On Wed, 3 Jun 2026 21:43:05 -0300
> Daniel Almeida <[email protected]> wrote:
> 
> > > On 3 Jun 2026, at 14:14, Boris Brezillon <[email protected]> 
> > > wrote:
> > > 
> > > On Wed, 3 Jun 2026 13:41:02 -0300
> > > Daniel Almeida <[email protected]> wrote:
> > >   
> > > > > +    /// Called when the fence is signaled.
> > > > > +    ///
> > > > > +    /// This is called from the fence signaling path, which may be 
> > > > > in interrupt
> > > > > +    /// context or with locks held, which is why `self` is only 
> > > > > borrowed, so that
> > > > > +    /// it cannot drop. Implementations must not sleep or perform
> > > > > +    /// long-running operations.
> > > > > +    ///
> > > > > +    /// An implementation likely wants to inform itself (e.g., 
> > > > > through a work item)
> > > > > +    /// within this callback that the associated 
> > > > > [`FenceCbRegistration`] can now be
> > > > > +    /// dropped.
> > > > > +    fn called(&mut self);    
> > > > 
> > > > This is a central point. We ideally would want this to consume self, 
> > > > because we
> > > > may want to move things out of the callback.    
> > > 
> > > This one comes from me. The rationale being that ::called() is called
> > > from an atomic context, and the resources attached to the callback data
> > > might require acquiring other sleeping locks to be released, and
> > > sometimes you don't even notice immediately because said resources are
> > > refcounted, and the lock is only acquired when you happen to be the
> > > last owner. Yes, those can be caught at runtime if the C side is
> > > properly annotated with might_sleep(), but that's not always the case.
> > > 
> > > If we defer the drop of the data only when the FenceCb is
> > > dropped/recycled, we're at least not constrained by this "runs in
> > > atomic context" thing.
> > >   
> > 
> > This design does not solve it, because one can quite trivially get around 
> > this
> > restriction using Option<T> as I said. If your point is “don’t run any 
> > drop() here”,
> > then &mut self doesn’t do it.
> 
> My bad, I thought you were talking about some Option<T> in
> FenceCbRegistration<T> (there was one at some point, but it's gone now),
> but you're talking about having an Option<X> inside the T. Yes, there's
> indeed nothing preventing a drop on X in that path, and it's just as
> bad as passing the fence back as value to the callback in that case.

Then maybe we should just pass it by value and require implementation
of an unsafe trait on `T`, whose safety-requirements demand that this
must be save to drop from atomic context.

> 
> > 
> > > > 
> > > > Consider a fence design where signal() consumes self. Now consider this:
> > > > 
> > > > ```
> > > > impl FenceCb for MyCallback {
> > > > fn called(&mut self) {
> > > >   // Can't move the fence out, so we have to put an Option<T> just to 
> > > > be able
> > > >   // to move.
> > > >   if let Some(f) = self.some_fence.take() {
> > > >     f.signal();
> > > >   }
> > > > }
> > > > ```
> > > > 
> > > > This used to be the case when our version of the job queue used the 
> > > > "proxy
> > > > fence" design:
> > > > 
> > > > 
> > > > ```
> > > > // Callback on the hw fence
> > > > impl FenceCb for MyCallback {
> > > > fn called(&mut self) {
> > > >   if let Some(f) = self.submit_fence.take() {
> > > >     f.signal();
> > > >   }  
> > > 
> > > I'm pretty sure lockdep won't like it anyway, because this is nested
> > > locking of the same lock class. For such proxies, we'll need to teach
> > > lockdep about the nesting like has been recently done on
> > > dma_fence_array & co. But I'm digressing.  
> > 
> > Yeah, but this is more about resource transfer in general, not
> > this pattern specifically.
> > 
> > I agree that this has issues, and yes, lockdep complained back
> > then :)
> 
> The thing is, there's so many aspects that could go wrong because of the
> context this callback is called in. Nested locking is one of them,
> the fact we can't sleep is another. And with rust it's even worse,
> because of the implicit drops that will happen when you take ownership
> of resources (taking sleeping locks to remove resources from a dataset
> for instance).

Doesn't that have to be a problem for much of Rust infrastructure? How
do other parties solve that?

> 
> So, by passing self by value to the ::callback(), you're basically
> telling users "hey, BTW, don't forget to defer the drop to some
> workqueue if you think it's not atomic-safe". And how can users know
> that the thing they're about to drop can be dropped in atomic context?
> They basically have to audit the ::drop() of all the resources they
> embed in their type implementing FenceCb. Not only that, but they also
> have to design the thing so the deferral of this ::drop() doesn't
> allocate, because, obviously, allocating in atomic context is
> tricky/fallible. AFAIK, none of this can be spot at compile-time (I
> remember Gary/Danilo mentioning that we could teach the klint about
> some of these rules). This would leave us with runtime checks like
> might_sleep(), but most of the C putters (xxx_put(object)) don't have
> might_sleep() in the path where the decref doesn't lead to a refcnt=0
> situation.
> 
> TLDR; Call this PTSD if you want, but this is the sort of bugs I
> struggled with on the C side, and I can predict that the exact same
> will happen in rust drivers if we expose the FenceCb as it is designed
> here and we don't have a way to check the soundness of the FenceCb
> implementations at compile time.

My guess would be that the existence of unsafe-traits is the admission
of Rust that this just cannot be guaranteed by design.

If a driver cannot know whether this or that is safe to drop, then it
would have to defer it's dropping. Or would there be cases where this
also doesn't work?

> 
> The other option (the one I've been advocating for from the start), is
> to not let drivers implement FenceCb (make it private), but instead
> have a bunch of implementations that we know are safe. Here's a list of
> implementations that I think would unblock most of the drivers use
> cases:
> 
> - wakeup a thread
> - complete a completion object
> - schedule a WorkItem
> - schedule a kthread_worker (once we get a proper rust abstraction for
>   that)
> 
> It doesn't mean we can't have optimized FenceCb implementations that do
> a lot more in the callback() path instead of deferring to a
> workqueue/thread, but at least those would have to be implemented in
> dma_fence.rs, and the dma_fence.rs maintainers can then carefully audit
> the code as part of the review process, which we know is not really the
> case when changes touch drivers code only.

Pragmatically speaking, if the common cases are trivial, then the
drivers will get them right, because those critical primitives are
already atomic-safe.

And a non-common case will have to be implemented in the driver
anyways, so we'd have to allow for that.



P.

Reply via email to