On 3/17/26 16:21, Boris Brezillon wrote:
> On Tue, 17 Mar 2026 15:48:25 +0100
> "Christian König" <[email protected]> wrote:
> 
>> In case of a refcounting bug dma_fence_release() can be called before the
>> fence was even signaled.
>>
>> Previously the dma_fence framework then force signaled the fence to make
>> sure to unblock waiters, but that can potentially lead to random memory
>> corruption when the DMA operation continues. So be more defensive here and
>> pick the lesser evil.
>>
>> Instead of force signaling the fence set an error code on the fence,
>> re-initialize the refcount to something large and taint the kernel.
>>
>> This will leak memory and eventually can cause a deadlock when the fence
>> is never signaled, but at least we won't run into an use after free or
>> random memory corruption.
>>
>> Signed-off-by: Christian König <[email protected]>
>> ---
>>  drivers/dma-buf/dma-fence.c | 18 ++++++++++++++----
>>  1 file changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/dma-buf/dma-fence.c b/drivers/dma-buf/dma-fence.c
>> index 1826ba73094c..8bf07685a053 100644
>> --- a/drivers/dma-buf/dma-fence.c
>> +++ b/drivers/dma-buf/dma-fence.c
>> @@ -593,14 +593,24 @@ void dma_fence_release(struct kref *kref)
>>              /*
>>               * Failed to signal before release, likely a refcounting issue.
>>               *
>> -             * This should never happen, but if it does make sure that we
>> -             * don't leave chains dangling. We set the error flag first
>> -             * so that the callbacks know this signal is due to an error.
>> +             * This should never happen, but if try to be defensive and take
>> +             * the lesser evil. Initialize the refcount to something large,
>> +             * but not so large that it can overflow.
>> +             *
>> +             * That will leak memory and could deadlock if the fence never
>> +             * signals, but at least it doesn't cause an use after free or
>> +             * random memory corruption.
>> +             *
>> +             * Also taint the kernel to note that it is rather unreliable to
>> +             * continue.
>>               */
>>              dma_fence_lock_irqsave(fence, flags);
>>              fence->error = -EDEADLK;
>> -            dma_fence_signal_locked(fence);
>> +            refcount_set(&fence->refcount.refcount, INT_MAX);
> 
> I'm not convinced this is useful. If we leak the object, no one should
> have a ref to release anyway. This does raise a question though. The
> case we're trying to protect against is fence_callback being registered
> to this fence and waiting for an event to signal another proxy fence.

Not quite. The real problematic case is that it is necessary to wait for a 
fence to signal with tons of memory management locks held.

So it can be that a simple memory allocation cycles back and depends on the 
fence to signal.

> How can the refcnt drop to zero in that case? Isn't the proxy supposed
> to own a ref on the fence. Before we go further, I'd like to understand
> what we're trying to do.

Well we are in C here, so its simply coding errors. An unecessary 
dma_fence_put() in an error path is enough to trigger this.

> The original discussion that led you to write this patch was about
> detecting when a fence emitter/producer would leave unsignalled fences
> behind, and the problem we have is when such unsignalled fences have
> observers waiting for a "signalled" event. If the refcnt drops to zero
> and the fence is released, we're already passed that point,
> unfortunately.

Well that is not quite correct.

The most common problem is that we have unbalanced 
dma_fence_get()/dma_fence_put() and we end up in dma_fence_release() before the 
issuer of the dma_fence has a chance to signal it.

See the main purpose of DMA fences is to prevent releasing memory back into the 
core memory management before the DMA operation is completed.

So when a DMA fence signals to early it means that the HW is still writing to 
that memory but we already potentially re-using the memory ending in random 
memory corruption.

UAF issues are harmless compared to that.

Regards,
Christian.

> It can be that:
> 
> - the fence was never exposed -> this is fine
> - the fence was exposed but never observed -> this is broken, because if
>   it had been observed it would have led to a deadlock
> - the fence was exposed, observed for some time, but the observer got
>   bored, stopped waiting and:
>   * decided to go and execute its stuff anyway -> use-before-ready
>     situation
>   * gave up -> kinda okay, but we should still consider the fence
>     emitter broken
> - the fence observer registered a callback but didn't take a ref on the
>   object -> this is potential UAF on the dma_fence, which can also lead
>   to a VRAM/system-mem UAF if the emitter drops the dma_fence without
>   signalling, because of the auto-signal you're getting rid of in this
>   patch.  But the latter is just a side effect of the dma_fence UAF,
>   which I'm not convinced we should try to protect against.
> 
>>              dma_fence_unlock_irqrestore(fence, flags);
>> +            rcu_read_unlock();
>> +            add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
>> +            return;
>>      }
>>  
>>      ops = rcu_dereference(fence->ops);
> 

Reply via email to