Hi

On Thu, 2025-08-21 at 16:59 +0200, Christian König wrote:
> On 21.08.25 16:06, Thomas Hellström wrote:
> > > What are you referring to?
> >
> > https://lore.kernel.org/intel-xe/a004736315d77837172418eb196d5b5f80b74e6c.ca...@linux.intel.com/
>
> Thanks, that one never made it into my inbox as far as I can see.
>
> > A couple of questions on the design direction here:
> >
> > IIRC both xe and i915 has checks to consider objects with a 0 gem
> > refcount as zombies requiring special treatment or skipping, when
> > encountered in TTM callbacks. We need to double-check that.
>
> I think I've found all of those. The one in i915 were actually not
> TTM specific but try to catch the same problem on the GEM refcount.
>
> > But I wonder,
> > first this practice of resurrecting refcounts seem a bit unusual, I
> > wonder if we can get rid of that somehow?
>
> I was also going back on forth if that is a good idea or not as well.
>
> The usual solution to such kinds of issues is to use two reference
> counts, so that you got a multi stage cleanup approach. E.g. backing
> store and object, like what mm_struct is using as well.
>
> The problem was simply that TTM/GEM ended up having *four* reference
> counts for the same object, each was doing something different and
> they didn't worked well together at all.
>
> > Furthermore, it seems the problem with drm_exec is related only to the
> > LRU walk. What about adding a struct completion to the object, that is
> > signaled when the object has freed its final backing-store. The LRU
> > walk would then check if the object is a zombie, and if so just wait on
> > the struct completion. (Need of course to carefully set up locking
> > orders). Then we wouldn't need to resurrect the gem refcount, nor use
> > drm_exec locking for zombies.
>
> I had a similar idea, waiting is already possible by waiting for the
> BOs work item.
>
> But I abandoned that idea because I couldn't see how we could solve
> the locking.
>
> > We would still need some form of refcounting while waiting on the
> > struct completion, but if we restricted the TTM refcount to *only* be
> > used internally for that sole purpose, and also replaced the final
> > ttm_bo_put() with the ttm_bo_finalize() that you suggest we wouldn't
> > need to resurrect that refcount since it wouldn't drop to zero until
> > the object is ready for final free.
> >
> > Ideas, comments?
>
> Ideally I think we would use the handle_count as backing store the
> drm_gem_object->refcount as structure reference.
>
> But that means a massive rework of the GEM handling/drivers/TTM.
>
> Alternative we could just grab a reference to a unsignaled fence when
> we encounter a dead BO on the LRU.
>
> What do you think of that idea?

I think to be able to *guarantee* exhaustive eviction, we need

1) all unfreed resources to sit on an LRU, and
2) everything on the LRU to have something to wait for.

A fence can't really guarantee 2), but it's close. There is a time
interval between the point where the last fence signals and the point
where we take the resource off the LRU and free it. A struct completion
can be made to signal when the resource is freed.

I think the locking restriction in the struct completion case (the
struct completion is likely waited for under a dma-resv) is that
nothing except the object destructor may take an individualized resv of
a zombie gem object whose refcount has gone to zero. The destructor
should use an asserted trylock only to make lockdep happy.

The struct completion also needs a refcount to avoid destroying it
while there are waiters.

So what do you think about starting out with a fence, and if / when
that appears not to be sufficient, we have a backup plan to move to a
struct completion?

Thomas

>
> Regards,
> Christian.
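
As a rough illustration of the refcounted struct completion idea above,
a userspace sketch might look like the following. It uses pthreads as a
stand-in for kernel primitives, and every name in it (zombie_done,
zombie_obj, zombie_done_freed and the helper functions) is hypothetical;
none of this is existing TTM or GEM API, and the real locking-order
questions under dma-resv are not modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted struct completion. */
struct zombie_done {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	int refcount;	/* protected by lock */
	bool done;	/* set once the backing store is freed */
};

/* Counts frees so the lifetime is observable (illustration only). */
int zombie_done_freed;

struct zombie_done *zombie_done_create(void)
{
	struct zombie_done *zd = malloc(sizeof(*zd));

	if (!zd)
		return NULL;
	pthread_mutex_init(&zd->lock, NULL);
	pthread_cond_init(&zd->cond, NULL);
	zd->refcount = 1;	/* reference owned by the object */
	zd->done = false;
	return zd;
}

/* A waiter (e.g. an LRU walker) takes its own reference before waiting. */
struct zombie_done *zombie_done_get(struct zombie_done *zd)
{
	pthread_mutex_lock(&zd->lock);
	zd->refcount++;
	pthread_mutex_unlock(&zd->lock);
	return zd;
}

/* Drop a reference; the last one frees the completion itself. */
void zombie_done_put(struct zombie_done *zd)
{
	bool last;

	pthread_mutex_lock(&zd->lock);
	last = (--zd->refcount == 0);
	pthread_mutex_unlock(&zd->lock);
	if (last) {
		pthread_cond_destroy(&zd->cond);
		pthread_mutex_destroy(&zd->lock);
		free(zd);
		zombie_done_freed++;
	}
}

/* Block until the destructor has signalled that the resource is freed. */
void zombie_done_wait(struct zombie_done *zd)
{
	pthread_mutex_lock(&zd->lock);
	while (!zd->done)
		pthread_cond_wait(&zd->cond, &zd->lock);
	pthread_mutex_unlock(&zd->lock);
}

/* Hypothetical zombie object: a private lock plus its completion. */
struct zombie_obj {
	pthread_mutex_t resv;	/* stands in for the individualized resv */
	struct zombie_done *done;
};

/* Destructor: the only path allowed to take the zombie's private lock. */
void zombie_obj_destroy(struct zombie_obj *obj)
{
	int ret = pthread_mutex_trylock(&obj->resv);

	assert(ret == 0);	/* asserted trylock: nobody else may hold it */
	(void)ret;
	/* ... the backing store would be freed here ... */
	pthread_mutex_unlock(&obj->resv);

	/* Signal waiters, then drop the object's reference. */
	pthread_mutex_lock(&obj->done->lock);
	obj->done->done = true;
	pthread_cond_broadcast(&obj->done->cond);
	pthread_mutex_unlock(&obj->done->lock);
	zombie_done_put(obj->done);
}
```

The point of the separate refcount is that a waiter which took a
reference before the destructor ran can still safely call
zombie_done_wait() and zombie_done_put() afterwards; the completion is
only freed when the last such reference is dropped.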