Hi Jakub
On 7/24/25 10:01 AM, Jakub Wartak wrote:
On Tue, Jul 22, 2025 at 11:30 AM Patrick Stählin <m...@packi.ch> wrote:
Hi!
On 4/7/25 11:27 PM, Tomas Vondra wrote:
I've pushed all three parts of v29, with some additional corrections
(picked lower OIDs, bumped catversion, fixed commit messages).
While building the PG18 beta1/2 packages I noticed that in our build
containers the selftests for pg_buffercache_numa and numa failed. Even
though libnuma was available and pg_numa_init/numa_available returned
no errors, we still fail in pg_numa_query_pages/move_pages with EPERM,
yielding the following error when accessing
pg_buffercache_numa/pg_shmem_allocations_numa:
ERROR: failed NUMA pages inquiry: Operation not permitted
The man page of move_pages led me to believe that this is because of
the missing capability CAP_SYS_NICE on the process, but I couldn't prove
that theory with the attached patch.
The patch did make the tests pass but also disabled NUMA permanently on
a vanilla Debian VM, which is certainly not wanted. It may well be
that my understanding of checking capabilities and how they work is
incomplete. I also think that adding a new dependency just to check the
capability is a bit of an overkill; maybe we can call move_pages once
as a probe and only treat later failures as errors if that probe
succeeded?
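A minimal sketch of that probe idea, assuming Linux and calling the raw syscall so that no extra library is needed. The names NumaProbeResult and numa_probe_move_pages are hypothetical and not part of PostgreSQL, and mapping only EPERM to "denied" is an assumption about the desired policy:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical result type for illustration; not a PostgreSQL type. */
typedef enum
{
    NUMA_PROBE_OK,      /* the kernel answered the page-location query */
    NUMA_PROBE_DENIED,  /* EPERM: caller may treat NUMA as unavailable */
    NUMA_PROBE_ERROR    /* any other failure, or no move_pages syscall */
} NumaProbeResult;

/*
 * Ask the kernel which node one of our own pages lives on. With a NULL
 * "nodes" array, move_pages() moves nothing and only fills in "status".
 */
static NumaProbeResult
numa_probe_move_pages(void)
{
#ifdef SYS_move_pages
    static volatile char probe_byte;
    long        pagesize = sysconf(_SC_PAGESIZE);
    void       *pages[1];
    int         status[1] = {0};

    assert(pagesize > 0);

    probe_byte = 1;     /* touch the page so that it is resident */

    /* round the address down to the start of its page */
    pages[0] = (void *) ((uintptr_t) &probe_byte &
                         ~((uintptr_t) pagesize - 1));

    /* pid 0 means "the calling process" */
    if (syscall(SYS_move_pages, 0, 1UL, pages, NULL, status, 0) == 0)
        return NUMA_PROBE_OK;
    if (errno == EPERM)
        return NUMA_PROBE_DENIED;
    return NUMA_PROBE_ERROR;
#else
    return NUMA_PROBE_ERROR;
#endif
}
```

A caller could run this once at startup and report NUMA as unavailable on NUMA_PROBE_DENIED instead of raising an error later; whether that is the right behaviour is exactly the open question here.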
I'd be happy to debug this further, but I have limited access to our
build infra. I should be able to sneak in commands during the build, though.
Hi Patrick,
So is it because the container was started without CAP_SYS_NICE, so
that even after root -> postgres the process does not have this cap? In
my book a container would be rather small, and a single container
certainly wouldn't span multiple CPU sockets, so I would just disable
libnuma. Anyway, if I do this on a regular VM:
> [...]
This is just the build env, but it runs the selftest, which then fails.
The containers this runs in in prod are a totally different setup, and
there the NUMA calls actually work. Disabling it may be an option, but
it would be nice to detect at runtime that we can't access it.
Can you provide exact details about this container technology?
We use podman to set everything up.
Can you provide /usr/sbin/capsh --print just before starting PG there?
Maybe this is more cgroup/cpuset somehow related too?
Here is the output, it seems that cap_sys_nice is missing from the
bounding set:
+ /usr/sbin/capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
Ambient set =
Current IAB: !cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_mknod,!cap_lease,!cap_audit_write,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0 (no-new-privs=0)
secure-noroot: no (unlocked)
secure-no-suid-fixup: no (unlocked)
secure-keep-caps: no (unlocked)
secure-no-ambient-raise: no (unlocked)
uid=2000(buildkite-agent) euid=2000(buildkite-agent)
gid=2000(buildkite-agent)
groups=2000(buildkite-agent)
Guessed mode: HYBRID (4)
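For reference, the same two facts that capsh prints above (bounding set and current/effective set) can be read from C without pulling in libcap. This is only a sketch, assuming Linux with /proc mounted; the helper names cap_in_bounding_set and cap_is_effective are made up for illustration:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/prctl.h>
#include <linux/capability.h>

/*
 * Returns 1 if "cap" is in the calling thread's bounding set, 0 if it is
 * not, and -1 if prctl() rejects the request.
 */
static int
cap_in_bounding_set(int cap)
{
    return prctl(PR_CAPBSET_READ, (unsigned long) cap, 0UL, 0UL, 0UL);
}

/*
 * Returns 1 if "cap" is in the effective set according to the CapEff line
 * of /proc/self/status, 0 if it is not, and -1 if the line was not found.
 */
static int
cap_is_effective(int cap)
{
    FILE       *f;
    char        line[256];
    unsigned long long mask;
    int         result = -1;

    assert(cap >= 0 && cap < 64);

    f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof(line), f) != NULL)
    {
        /* the line looks like "CapEff:\t0000000000000000" (hex bitmask) */
        if (sscanf(line, "CapEff: %llx", &mask) == 1)
        {
            result = (int) ((mask >> cap) & 1ULL);
            break;
        }
    }
    fclose(f);
    return result;
}
```

Calling both helpers with CAP_SYS_NICE in the environment above would be expected to report the capability as absent, matching the capsh output. Note that this only tells us about the capability itself; it does not prove that the missing capability is what makes move_pages return EPERM.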
Anyway, there is a simpler way to make the tests pass, if that's what
you are after. We do have
contrib/pg_buffercache/sql/pg_buffercache_numa.sql, whose output is
expected to match pg_buffercache_numa.out OR (!)
pg_buffercache_numa_1.out. We could probably handle this edge case by
adding pg_buffercache_numa_2.out too (which would just contain the
semi-valid scenario for "ERROR: failed NUMA pages inquiry:
Operation not permitted")
Ah, I didn't know that was a possibility. Until this sees more usage than
just querying the state, this may be a nice workaround. If this turns out
to be more widespread, we probably need something a bit more robust for
the detection. I already patch out the tests for our build env, so for me
it's "solved", but that is certainly not a proper solution.
Just FYI, I'll be on PTO, so I won't have access to the build env for the
next two weeks.
Thanks,
Patrick