As I promised Andres on the Discord hacking server some time ago, I'm attaching a very brief (and potentially way too rushed) draft of a first step towards NUMA observability in PostgreSQL, based on his presentation [0]. It is rough, but it should get us started. The patches have barely been tested; they are meant as input for discussion rather than solid code, to shake out what the proper form of this should be.
Right now it gives:

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 16127
            6 |   256
            1 |     1

Changes since the version posted on Discord:

1. libnuma is used to centralize the dependency in the build process (to be future-proof; it gives us the opportunity to use e.g. numa_set_localalloc()). BTW: why is a specific autoconf version (2.69) required?
2. The per-page get_mempolicy(2) syscall was changed to a single call of migrate_pages(2) by Bertrand.
3. Enhancement to support huge pages (with the above), plus code to reduce the number of pages to query by doing the DB block <-> OS memory page mapping. This is a bit hard for me and I'm pretty sure it could be done somewhat better.

Some other points:

a. There are plenty of FIXMEs inside and I bet I could have screwed up the void *ptr calculations, but we somehow need to support scenarios like BLCKSZ=2kB..32kB with page sizes of 4kB, 2MB and 16MB.
b. I don't think it makes sense to expose users to bitmaps or int[] arrays, so there's no support for showing that 1 DB block may span 2 OS memory pages (I think it should be rare!).
c. We probably should switch to numa_move_pages(3) from libnuma, right?
d. Earlier Andres wrote:

> IME using pg_buffercache_pages() is often too expensive due to the per-row
> overhead. I think we'd probably want a number-of-pages-per-numa-node function
> that does the grouping in C. Compare how fast pg_buffercache_summary() is to
> doing the grouping in SQL when using larger shared_buffers settings.

I think it doesn't make a lot of sense to introduce a *new* pg_buffercache_numa_usage_summary() for this if we can go straight for a pg_shmallocations_numa view instead, shouldn't we? It would give a much better picture of everything else for free.

Patches and co-authors are more than welcome!

-J.

[0] - https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf
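To make point 3 (and the concerns in a. and b.) a bit more concrete, here is a rough, standalone sketch of the DB block <-> OS memory page arithmetic. It is not the code from the attached patches: OsPageRange, block_to_os_pages() and os_pages_spanned() are hypothetical names made up for illustration, and base_offset is an assumed byte offset of the first buffer from an OS-page-aligned start of the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: inclusive range of OS page indexes covered by one DB block. */
typedef struct
{
    size_t first_page;
    size_t last_page;
} OsPageRange;

/*
 * Map buffer number buf_id to the OS pages it overlaps.
 *
 * base_offset  - assumed offset (bytes) of buffer 0 from a page-aligned base
 * blcksz       - DB block size, e.g. 2048 .. 32768
 * os_page_size - OS page size, e.g. 4096, 2MB or 16MB
 */
static OsPageRange
block_to_os_pages(uintptr_t base_offset, size_t blcksz,
                  size_t os_page_size, size_t buf_id)
{
    uintptr_t   start;
    uintptr_t   end;
    OsPageRange r;

    assert(blcksz > 0 && os_page_size > 0);

    /* first and last byte of the block, relative to the aligned base */
    start = base_offset + (uintptr_t) buf_id * blcksz;
    end = start + blcksz - 1;

    r.first_page = (size_t) (start / os_page_size);
    r.last_page = (size_t) (end / os_page_size);
    return r;
}

/* Number of OS pages a block touches; more than 1 means it straddles pages. */
static size_t
os_pages_spanned(OsPageRange r)
{
    return r.last_page - r.first_page + 1;
}
```

With BLCKSZ=8192 and 4kB pages, block_to_os_pages(0, 8192, 4096, 3) covers OS pages 6..7, i.e. two pages per block, which is consistent with pages_per_blk=2 in the NOTICE above. With 2MB huge pages and the buffer array assumed to start 4096 bytes into a huge page, buffer 255 ends up straddling pages 0 and 1, which is the "1 DB block spans 2 OS memory pages" case from point b.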
0001-Extend-pg_buffercache-to-also-show-NUMA-zone-id-allo.patch
0001-Add-optional-dependency-to-libnuma-for-basic-NUMA-aw.patch