Draft for basic NUMA observability

Jakub Wartak Fri, 07 Feb 2025 06:33:23 -0800

As I promised Andres on the Discord hacking server some time ago, I'm
attaching a very brief (and possibly too rushed) draft of a first step
towards NUMA observability in PostgreSQL, based on his presentation
[0]. It is rough, but it should get us started. The patches have had
almost no testing; treat them as input for discussion rather than
solid code, to help shake out what the proper form of this should be.

Right now it gives:

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 16127
            6 |   256
            1 |     1
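
The figures in the NOTICE line can be reproduced with simple arithmetic. Below is a minimal sketch, assuming the values the output implies (16384 buffers of 8192 bytes, i.e. 128MB of shared_buffers, on 4kB OS pages). The function names are illustrative and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Number of OS pages needed to back nbuffers blocks of blcksz bytes. */
size_t calc_os_page_count(size_t nbuffers, size_t blcksz, size_t os_page_size)
{
    assert(os_page_size > 0);
    size_t total_bytes = nbuffers * blcksz;

    /* Round up so that a partially used last page is still counted. */
    return (total_bytes + os_page_size - 1) / os_page_size;
}

/* OS pages per DB block; below 1.0 when one OS page holds several blocks. */
double calc_pages_per_blk(size_t blcksz, size_t os_page_size)
{
    assert(os_page_size > 0);
    return (double) blcksz / (double) os_page_size;
}
```

With 2MB huge pages the same 8kB block would give pages_per_blk well below 1, which is the case where many blocks share one OS page and fewer pages need to be queried.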

Changes since the version posted on Discord:

1. libnuma is now used, to centralize the dependency in the build
process (future-proofing; it gives us the opportunity to use e.g.
numa_set_localalloc()). BTW: why is a specific autoconf version (2.69)
required?
2. the per-page get_mempolicy(2) syscall was replaced with a single
call to move_pages(2) by Bertrand
3. added support for huge pages (building on the above), plus code to
reduce the number of pages queried by mapping DB blocks <-> OS memory
pages. This part is a bit hard for me and I'm pretty sure it could be
done better.
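
Querying page placement in one call can be sketched with move_pages(2), which only reports the node of each page when its nodes argument is NULL. This is a hedged, Linux-only illustration that uses the raw syscall to avoid a link-time dependency; fill_page_ptrs() and query_page_nodes() are hypothetical names, not functions from the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fill ptrs[i] with the address of the i-th OS page starting at base. */
void fill_page_ptrs(void *base, size_t os_page_size, size_t npages, void **ptrs)
{
    assert(os_page_size > 0);
    for (size_t i = 0; i < npages; i++)
        ptrs[i] = (char *) base + i * os_page_size;
}

/*
 * Ask the kernel which NUMA node backs each page, using one syscall.
 * With nodes == NULL nothing is moved: status[i] receives the node id,
 * or a negative errno (e.g. -ENOENT for a page that is not present).
 * Returns 0 on success, -1 on failure with errno set.
 */
int query_page_nodes(void **ptrs, size_t npages, int *status)
{
    long rc = syscall(SYS_move_pages,
                      0,                      /* pid 0: the calling process */
                      (unsigned long) npages,
                      ptrs,
                      NULL,                   /* no target nodes: query only */
                      status,
                      0);                     /* no flags */

    return rc < 0 ? -1 : 0;
}
```

The call can fail outright in restricted environments (for example under a seccomp profile that filters the syscall), so callers should treat -1 as "placement unknown" rather than as a fatal error.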

Some other points:
a. there are plenty of FIXMEs inside, and I bet I may have screwed up
the void *ptr calculations, but we somehow need to support scenarios
like BLCKSZ from 2kB to 32kB with OS page sizes of 4kB, 2MB and 16MB
b. I don't think it makes sense to expose users to bitmaps or int[]
arrays, so there's no support for showing that 1 DB block potentially
spans 2 OS memory pages (I think it should be rare!)
c. we probably should switch to numa_move_pages(3) from libnuma, right?
d. earlier Andres wrote:
> IME using pg_buffercache_pages() is often too expensive due to the per-row 
> overhead. I think we'd probably want a number-of-pages-per-numa-node function
> that does the grouping in C. Compare how fast pg_buffercache_summary() is to 
> doing the grouping in SQL when using larger shared_buffers settings.
I don't think it makes much sense to introduce a *new*
pg_buffercache_numa_usage_summary() for this if we can go straight
for a pg_shmallocations_numa view instead, does it? That would give a
much better picture of everything else for free.
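
Points a, b and d can be made concrete with a small sketch: one helper maps a DB block to the range of OS pages it touches for any BLCKSZ and page size combination, and another does the per-node grouping in C over a move_pages(2)-style status array. All names here (BlockPages, block_to_os_pages, count_pages_per_node, pool_offset) are hypothetical and assume the buffer pool is one contiguous region; this is not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

/* Range of OS pages touched by one DB block. */
typedef struct
{
    size_t first_page;  /* index of first OS page, counted from the pool's first page */
    size_t npages;      /* number of OS pages the block touches */
} BlockPages;

/*
 * pool_offset is the byte offset of the buffer pool's start within its
 * first OS page (0 if the pool is aligned to the OS page size).
 */
BlockPages block_to_os_pages(size_t blkno, size_t blcksz,
                             size_t os_page_size, size_t pool_offset)
{
    assert(blcksz > 0 && os_page_size > 0);

    size_t start = pool_offset + blkno * blcksz;  /* first byte of the block */
    size_t end = start + blcksz - 1;              /* last byte of the block */
    BlockPages r;

    r.first_page = start / os_page_size;
    r.npages = end / os_page_size - r.first_page + 1;
    return r;
}

/*
 * Tally pages per NUMA node. Entries that are negative (an errno from
 * the kernel) or not below max_nodes are counted in *unmapped instead.
 */
void count_pages_per_node(const int *status, size_t npages,
                          size_t *per_node, size_t max_nodes,
                          size_t *unmapped)
{
    for (size_t n = 0; n < max_nodes; n++)
        per_node[n] = 0;
    *unmapped = 0;

    for (size_t i = 0; i < npages; i++)
    {
        if (status[i] >= 0 && (size_t) status[i] < max_nodes)
            per_node[status[i]]++;
        else
            (*unmapped)++;
    }
}
```

A summary function or view built this way would return one row per node rather than one per buffer, which is the per-row overhead Andres was pointing at.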

Patches and co-authors are more than welcome!

-J.

[0] - https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf

0001-Extend-pg_buffercache-to-also-show-NUMA-zone-id-allo.patch

0001-Add-optional-dependency-to-libnuma-for-basic-NUMA-aw.patch
