On 6/24/25 17:30, Christoph Berg wrote:
> Re: Tomas Vondra
>> If it's a reliable fix, then I guess we can do it like this. But won't
>> that be a performance penalty on everyone? Or does the system split the
>> array into 16-element chunks anyway, so this makes no difference?
>
> There's still the overhead of the syscall itself. But no idea how
> costly it is to have this 16-step loop in user or kernel space.
>
> We could claim that on 32-bit systems, shared_buffers would be smaller
> anyway, so there the overhead isn't that big. And the step size should
> be larger (if at all) on 64-bit.
>
>> Anyway, maybe we should start by reporting this to the kernel people. Do
>> you want me to do that, or shall one of you take care of that? I suppose
>> that'd be better, as you already wrote a fix / know the code better.
>
> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>
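FWIW, to make the chunking we're discussing concrete, the workaround would
look roughly like this (a sketch only - the function name, the Min macro
and the 16-page step size are illustrative here, not the actual patch):

#include <numaif.h>				/* move_pages(2), link with -lnuma */

#define Min(x, y)	((x) < (y) ? (x) : (y))
#define NUMA_QUERY_CHUNK_SIZE	16	/* illustrative step size */

/*
 * Query the NUMA node of each page by calling move_pages(2) in small
 * chunks instead of a single big call. With nodes == NULL the syscall
 * moves nothing and only fills *status with the node of each page.
 */
static long
pg_numa_query_pages_chunked(int pid, unsigned long count,
							void **pages, int *status)
{
	unsigned long next = 0;

	while (next < count)
	{
		unsigned long chunk = Min(count - next, NUMA_QUERY_CHUNK_SIZE);
		long		ret;

		ret = move_pages(pid, chunk, &pages[next], NULL, &status[next], 0);
		if (ret < 0)
			return ret;			/* errno has the details */

		next += chunk;
	}

	return 0;
}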
Thanks! Now we wait ...

Attached is a minor tweak of the valgrind suppression rules, adding rules
for the two places touching the memory.

I was hoping I could add a single rule for pg_numa_touch_mem_if_required,
but that does not work - it's a macro, not a function. So I had to add
one rule for each of the two functions querying the NUMA status. That's
a bit disappointing, because it means it'll hide all other failures (of
the Memcheck:Addr8 type) in those functions.

Perhaps it'd be better to turn pg_numa_touch_mem_if_required into a
proper (inlined) function, at least with USE_VALGRIND defined. Something
like the v2 patch - it needs more testing to ensure the inlined function
doesn't break the touching or something silly like that.

regards

--
Tomas Vondra
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index 7ea464c8094..36bf3253f76 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -180,3 +180,22 @@
 	Memcheck:Cond
 	fun:PyObject_Realloc
 }
+
+# Querying the NUMA node for shared memory requires touching the memory
+# first, so that it gets allocated in the process. But the memory backing
+# shared buffers may be marked as noaccess for buffers that are not
+# pinned, so just ignore that - we're not really accessing the buffers.
+# This covers both places querying the NUMA status.
+{
+	pg_buffercache_numa_pages
+	Memcheck:Addr8
+	fun:pg_buffercache_numa_pages
+	fun:ExecMakeTableFunctionResult
+}
+
+{
+	pg_get_shmem_allocations_numa
+	Memcheck:Addr8
+	fun:pg_get_shmem_allocations_numa
+	fun:ExecMakeTableFunctionResult
+}
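For context on why these fire at all: unpinned shared buffers are marked
noaccess via Valgrind client requests, so the deliberate touch that forces
the page fault looks like an invalid 8-byte read. A condensed illustration
(not the actual bufmgr/pg_buffercache code):

#include <stdint.h>
#include <stddef.h>
#include <valgrind/memcheck.h>

/*
 * Condensed illustration of the false positive: buf_ptr points into a
 * shared buffer that was marked noaccess when it got unpinned.
 */
static void
touch_unpinned_buffer(char *buf_ptr, size_t buf_size)
{
	volatile uint64_t ro_volatile_var;

	/* roughly what bufmgr does when the buffer is unpinned */
	VALGRIND_MAKE_MEM_NOACCESS(buf_ptr, buf_size);

	/* and this is the touch before move_pages() => Memcheck:Addr8 */
	ro_volatile_var = *(volatile uint64_t *) buf_ptr;
	(void) ro_volatile_var;
}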
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
index 40f1d324dcf..3b9a5b42898 100644
--- a/src/include/port/pg_numa.h
+++ b/src/include/port/pg_numa.h
@@ -24,9 +24,22 @@ extern PGDLLIMPORT int pg_numa_get_max_node(void);
  * This is required on Linux, before pg_numa_query_pages() as we
  * need to page-fault before move_pages(2) syscall returns valid results.
  */
+#ifdef USE_VALGRIND
+
+static inline void
+pg_numa_touch_mem_if_required(uint64 tmp, char *ptr)
+{
+	volatile uint64 ro_volatile_var pg_attribute_unused();
+	ro_volatile_var = *(volatile uint64 *) ptr;
+}
+
+#else
+
 #define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
 	ro_volatile_var = *(volatile uint64 *) ptr
 
+#endif
+
 #else
 
 #define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index 7ea464c8094..6b9a8998f82 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -180,3 +180,14 @@
 	Memcheck:Cond
 	fun:PyObject_Realloc
 }
+
+# Querying the NUMA node for shared memory requires touching the memory
+# first, so that it gets allocated in the process. But the memory backing
+# shared buffers may be marked as noaccess for buffers that are not
+# pinned, so just ignore that - we're not really accessing the buffers.
+# This covers all places querying the NUMA status.
+{
+	pg_numa_touch_mem_if_required
+	Memcheck:Addr8
+	fun:pg_numa_touch_mem_if_required
+}
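One nice property of the v2 approach: the inline function keeps the same
two-argument shape as the macro, so the existing call sites shouldn't need
to change. Roughly (hypothetical call-site sketch, not a specific caller):

	volatile uint64 touch pg_attribute_unused();

	/* compiles the same whether this expands the macro or calls the function */
	pg_numa_touch_mem_if_required(touch, ptr);

And once it's a real function, the single suppression rule can match the
dedicated fun: frame instead of hiding every Addr8 in the callers. One
thing to check during testing: when the compiler does inline it, Valgrind
only sees that frame via inline debug info (--read-inline-info, which I
believe defaults to yes on Linux), so the rule should still match, but
that's worth confirming.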