On Fri, 2025-04-11 at 16:10 +0200, Philipp Stanner wrote: > On Fri, 2025-04-11 at 15:06 +0200, Christian König wrote: > > Am 11.04.25 um 14:44 schrieb Philipp Stanner: > > > On Fri, 2025-04-11 at 13:05 +0200, Christian König wrote: > > > > Am 11.04.25 um 11:29 schrieb Philipp Stanner: > > > > > > > > > [SNIP] > > > > > > > > > > It could be, however, that at the same moment > > > > > nouveau_fence_signal() is > > > > > removing that entry, holding the appropriate lock. > > > > > > > > > > So we have a race. Again. > > > > > > > > > > > > > Ah, yes of course. If signaled is called with or without the > > > > lock is > > > > actually undetermined. > > > > > > > > > > > > > > > > > > You see, fixing things in Nouveau is difficult :) > > > > > It gets more difficult if you want to clean it up "properly", > > > > > so it > > > > > conforms to rules such as those from dma_fence. > > > > > > > > > > I have now provided two fixes that both work, but you are not > > > > > satisfied > > > > > with from the dma_fence-maintainer's perspective. I > > > > > understand > > > > > that, > > > > > but please also understand that it's actually not my primary > > > > > task > > > > > to > > > > > work on Nouveau. I just have to fix this bug to move on with > > > > > my > > > > > scheduler work. > > > > > > > > > > > > > Well I'm happy with whatever solution as long as it works, but > > > > as > > > > far as I can see the approach with the callback simply doesn't. > > > > > > > > You just can't drop the fence reference for the list from the > > > > callback. > > > > > > > > > > > > > > > > > > So if you have another idea, feel free to share it. But I'd > > > > > like to > > > > > know how we can go on here. > > > > > > > > > > > > > Well the fence code actually works, doesn't it? The problem is > > > > rather that setting the error throws a warning because it > > > > doesn't > > > > expect signaled fences on the pending list. > > > > > > > > Maybe we should fix that instead. > > > The fence code works as the author intended, but I would be happy > > > if it > > > were more explicitly documented. > > > > > > Regarding the WARN_ON: It occurs in dma_fence_set_error() because > > > there > > > is an attempt to set an error code on a signaled fence. I don't > > > think > > > that should be "fixed", it works as intended: You must not set an > > > error > > > code of a fence that was already signaled. > > > > > > The reason seems to be that once a fence is signaled, a third > > > party > > > might evaluate the error code. > > > > Yeah, more or less correct. The idea is you can't declare an > > operation as having an error after the operation has already > > completed. > > > > Because everyone will just wait for the completion and nobody > > checks > > the status again after that. > > > > > > > > But I think this wasn't wat you meant with "fix". > > > > The idea was to avoid calling dma_fence_set_error() on already > > signaled fences. Something like this: > > > > @@ -90,7 +90,7 @@ nouveau_fence_context_kill(struct > > nouveau_fence_chan *fctx, int error) > > while (!list_empty(&fctx->pending)) { > > fence = list_entry(fctx->pending.next, > > typeof(*fence), head); > > > > - if (error) > > + if (error & !dma_fence_is_signaled_locked(&fence- > > > base)) > > dma_fence_set_error(&fence->base, error); > > > > if (nouveau_fence_signal(fence)) > > > > That would also improve the handling quite a bit since we now don't > > set errors on fences which are already completed even if we haven't > > realized that they are already completed yet. > > > > > In any case, there must not be signaled fences in nouveau's > > > pending- > > > list. They must be removed immediately once they signal, and this > > > must > > > not race. > > > > Why actually? As far as I can see the pending list is not for the > > unsignaled fences, but rather the pending interrupt processing. > > That's a list of fences that are "in the air", i.e., whose jobs are > currently being processed by the hardware. Once a job is done, its > fence must be removed. > > > > > So having signaled fences on the pending list is perfectly > > possible. > > It is possible, and that is a bug. The list is used by > nouveau_fence_context_kill() to kill still pending jobs. It shall not > try to kill and set error codes for fences that are already signaled.
@Danilo: We have now 2 possible solutions for the firing WARN_ON floating. Version A (Christian) Check in nouveau_fence_context_kill() whether a fence is already signaled before setting an error. Version B (Me) This patch series here. Make sure that in Nouveau, only nouveau_fence_signal() signals fences. Both should do the trick. Please share a maintainer-preference so I can move on here. Thx P. > > > > Anyways, forget about the "remove callbacks solution" it actually > causes a MASSIVE performance regression. No idea why, AFAICS the fast > path is only ever evaluated in nouveau_fence_done(), but maybe I > missed > something. > > Will re-iterate next week… > > > P. > > > > > > Regards, > > Christian. > > > > > > > > > > > > > > > > > > > > > > > I'm running out of ideas. What I'm wondering if we couldn't > > > > > just > > > > > remove > > > > > performance hacky fastpath functions such as > > > > > nouveau_fence_is_signaled() completely. It seems redundant to > > > > > me. > > > > > > > > > > > > > That would work for me as well. > > > I'll test this approach. Seems a bit like the nuclear approach, > > > but > > > if > > > it works we'd at least clean up a lot of this mess. > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Or we might add locking to it, but IDK what was achieved with > > > > > RCU > > > > > here. > > > > > In any case it's definitely bad that Nouveau has so many > > > > > redundant > > > > > and > > > > > half-redundant mechanisms. > > > > > > > > > > > > > Yeah, agree messing with the locks even more won't help us > > > > here. > > > > > > > > Regards, > > > > Christian. > > > > > > > > > > > > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > P. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > Christian. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Replace the call to dma_fence_is_signaled() with > > > > > > > > > > nouveau_fence_base_is_signaled(). > > > > > > > > > > > > > > > > > > > > Cc: <sta...@vger.kernel.org> # 4.10+, precise > > > > > > > > > > commit > > > > > > > > > > not > > > > > > > > > > to > > > > > > > > > > be > > > > > > > > > > determined > > > > > > > > > > Signed-off-by: Philipp Stanner <pha...@kernel.org> > > > > > > > > > > --- > > > > > > > > > > drivers/gpu/drm/nouveau/nouveau_fence.c | 2 +- > > > > > > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > diff --git > > > > > > > > > > a/drivers/gpu/drm/nouveau/nouveau_fence.c > > > > > > > > > > b/drivers/gpu/drm/nouveau/nouveau_fence.c > > > > > > > > > > index 7cc84472cece..33535987d8ed 100644 > > > > > > > > > > --- a/drivers/gpu/drm/nouveau/nouveau_fence.c > > > > > > > > > > +++ b/drivers/gpu/drm/nouveau/nouveau_fence.c > > > > > > > > > > @@ -274,7 +274,7 @@ nouveau_fence_done(struct > > > > > > > > > > nouveau_fence > > > > > > > > > > *fence) > > > > > > > > > > nvif_event_block(&fctx- > > > > > > > > > > > event); > > > > > > > > > > spin_unlock_irqrestore(&fctx- > > > > > > > > > > >lock, > > > > > > > > > > flags); > > > > > > > > > > } > > > > > > > > > > - return dma_fence_is_signaled(&fence- > > > > > > > > > > >base); > > > > > > > > > > + return > > > > > > > > > > test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, > > > > > > > > > > &fence- > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > base.flags); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > static long > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >