On 5/29/25 01:20, Dave Chinner wrote:
> On Thu, May 29, 2025 at 07:53:55AM +1000, Dave Airlie wrote:
>> On Wed, 28 May 2025 at 17:20, Christian König <christian.koe...@amd.com> wrote:
>>>
>>> Hi guys,
>>>
>>> On 5/27/25 01:49, Dave Chinner wrote:
>>>> I disagree - specifically ordered memcg traversal is not something
>>>> that the list_lru implementation is currently doing, nor should it
>>>> be doing.
>>>
>>> I realized overnight that I hadn't fully explored a way of getting
>>> both advantages. And we actually don't need list_lru for that.
>>>
>>> So here is a side question:
>>>
>>> Is it possible to just have a per-cgroup counter of how many pages a
>>> cgroup released back to a particular pool? E.g. something which is
>>> added to the same counter on the parent when a cgroup is released.
>>>
>>> Background is that the pages are not distinguishable from each other,
>>> e.g. they are not cache hot or cold or anything like this. So it
>>> doesn't matter which pages a cgroup has released, only how many.
>>>
>>> If such a counter were possible, it would take just a few lines of
>>> code to add the isolation and still keep the advantage of sharing
>>> released pages between different cgroups.
>>
>> I think NUMA is the only possible distinction I can see between pages
>> here; even uncached GPU access will be slower to further-away NUMA
>> nodes,
Yeah, we have gone back and forth a bit in the past about which
priority things should have internally, and settled on this:

1. Uncached and WC requests *must* be fulfilled. This is a technical
   necessity.

2. Allocating from the requested NUMA node should be fulfilled as much
   as possible. Performance really goes south if it isn't.

3. Allocating memory in large chunks is really nice to have. It gives
   up to 30% performance improvement in some use cases, but it's still
   better to use smaller pages from the right NUMA node than larger
   pages from the wrong one.

>> But indeed this might be a workable idea: just make something that
>> does what list_lru does but only for the counters, and keep the
>> pages in a single pool.
>
> If you only want NUMA-aware LRU + reclaim/reuse without memcg
> awareness, list_lru supports that configuration. Use list_lru_init()
> for NUMA-aware LRU infrastructure; list_lru_init_memcg() should only
> be used if you need memcg awareness in the LRU.

Oh, that would be really useful! Currently our NUMA support in the
ttm_pool is basically just a hack which relies on intimate knowledge
of the only device using it.

> There are various caches that use this config, e.g. the XFS buffer
> cache and dquot caches, because they are global caches whose contents
> are shared across all cgroups. The shrinker associated with them is
> configured only as SHRINKER_NUMA_AWARE so that reclaim is done
> per-node rather than as a single global LRU....

Yeah, that is pretty much exactly what we need as far as I can see.

Thanks,
Christian.

>
> -Dave.
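P.S.: For my own notes, a rough, untested sketch of what the NUMA-aware-only configuration Dave describes might look like on the TTM side. list_lru_init(), list_lru_shrink_count()/list_lru_shrink_walk(), shrinker_alloc()/shrinker_register() and SHRINKER_NUMA_AWARE are the real kernel APIs; all ttm_pool_* names here are hypothetical, and this is pseudocode, not a patch:

```c
/* Hypothetical sketch: a per-node LRU for the TTM pool without memcg
 * awareness, modeled on how global caches like the XFS buffer cache
 * set this up. ttm_pool_lru_isolate() is assumed to exist and free
 * pages back to the system.
 */
static struct list_lru ttm_pool_lru;
static struct shrinker *ttm_pool_shrinker;

static unsigned long
ttm_pool_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
	/* sc->nid selects the NUMA node; no memcg is consulted. */
	return list_lru_shrink_count(&ttm_pool_lru, sc);
}

static unsigned long
ttm_pool_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
	/* Walk only the requested node's list and release its pages. */
	return list_lru_shrink_walk(&ttm_pool_lru, sc,
				    ttm_pool_lru_isolate, NULL);
}

static int ttm_pool_lru_setup(void)
{
	int ret = list_lru_init(&ttm_pool_lru); /* NUMA-aware, no memcg */

	if (ret)
		return ret;

	ttm_pool_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE, "ttm-pool");
	if (!ttm_pool_shrinker) {
		list_lru_destroy(&ttm_pool_lru);
		return -ENOMEM;
	}
	ttm_pool_shrinker->count_objects = ttm_pool_shrink_count;
	ttm_pool_shrinker->scan_objects = ttm_pool_shrink_scan;
	shrinker_register(ttm_pool_shrinker);
	return 0;
}
```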