> BTW, is there a piece of doc explaining the rationale behind this
> dma_fence contract, or is it just the usual informal knowledge shared
> among DRM devs over IRC/email threads :-) ?
> 
> To be honest, I'm a bit unhappy with this "it's part of the dma_fence
> contract" explanation, because I have a hard time remembering all the
> details that led to this set of rules myself, so I suspect it's even
> harder for newcomers to reason about this. To me, it's one of the
> reasons people fail to understand/tend to forget what the
> problems/limitations are, and end up ignoring them (intentionally or
> not).
> 
> FWIW, this is what I remember, but I'm sure there's more:
> 
> 1. dma_fence must signal in finite time, so unbounded waits in the
>    fence signalling path are not good, and that's what happens with
>    GFP_KERNEL allocations
> 2. if you're blocked in your GPU fault handler, that means you can't
>    process further faults happening on other contexts
> 3. GPU drivers are actively participating in the memory reclaim
>    process, which leads to deadlocks if the memory allocation in the
>    fault handler is waiting on the very same GPU job fence that's
>    waiting for its memory allocation to be satisfied
> 
> I'd really love it if someone (Sima, Alyssa and/or Christian?) could sum it
> up, so I can put the outcome of this discussion in some kernel doc
> entry (or maybe it'd be better if this was one of you submitting a
> patch for that ;-)). If it's already documented somewhere, I'll just
> have to eat my hat and accept your RTFM answer :-).

https://www.kernel.org/doc/html/next/driver-api/dma-buf.html#dma-fence-cross-driver-contract

Specifically:

  Drivers are allowed to call dma_fence_wait() from their shrinker
  callbacks. This means any code required for fence completion cannot
  allocate memory with GFP_KERNEL.

Concretely:

* The job must allocate memory before it can signal its fence
* We're in a low memory situation, so the allocation enters reclaim and
  the shrinker is invoked
* The shrinker waits on the job's fence, so it can't free memory until
  the job finishes
* Deadlock!
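
To make the cycle above easier to see, here is a rough userspace toy
model of it as a wait-for graph. This is not kernel code: the actor
names, struct wait_graph and all the helpers are made up for
illustration, and the "non-blocking" variant only assumes an allocation
that fails instead of entering reclaim (in the kernel, something like
GFP_NOWAIT).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors in the scenario; names are illustrative only. */
enum actor { JOB, ALLOCATION, SHRINKER, FENCE, ACTOR_COUNT };

/* waits_on[a] is the one actor that a is blocked on, or -1 if none. */
struct wait_graph {
	int waits_on[ACTOR_COUNT];
};

static void graph_init(struct wait_graph *g)
{
	for (int i = 0; i < ACTOR_COUNT; i++)
		g->waits_on[i] = -1;
}

static void add_wait(struct wait_graph *g, enum actor waiter,
		     enum actor target)
{
	assert(waiter < ACTOR_COUNT && target < ACTOR_COUNT);
	g->waits_on[waiter] = target;
}

/* True if following the wait chain from start leads back to start. */
static bool deadlocks(const struct wait_graph *g, enum actor start)
{
	int cur = g->waits_on[start];

	for (int steps = 0; cur != -1 && steps < ACTOR_COUNT; steps++) {
		if (cur == (int)start)
			return true;
		cur = g->waits_on[cur];
	}
	return false;
}

/* Blocking allocation in the fence signalling path (the GFP_KERNEL case). */
static void build_blocking_scenario(struct wait_graph *g)
{
	graph_init(g);
	add_wait(g, JOB, ALLOCATION);      /* job needs memory to signal */
	add_wait(g, ALLOCATION, SHRINKER); /* low memory: enters reclaim */
	add_wait(g, SHRINKER, FENCE);      /* shrinker waits on the fence */
	add_wait(g, FENCE, JOB);           /* fence signals when job is done */
}

/* Assumed alternative: the allocation fails rather than waiting on reclaim. */
static void build_nonblocking_scenario(struct wait_graph *g)
{
	graph_init(g);
	add_wait(g, JOB, ALLOCATION);
	add_wait(g, SHRINKER, FENCE);
	add_wait(g, FENCE, JOB);
	/* No ALLOCATION -> SHRINKER edge, so the chain ends here. */
}
```

In this model deadlocks() is true for the blocking scenario and false
for the non-blocking one, because dropping the allocation's wait on
reclaim is what breaks the loop. The job then has to cope with the
allocation failing, which the model doesn't try to cover.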

Possibly we could relax the contract to let us reclaim non-graphics
memory, but that's not my department.
