On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot <bertranddrouvot...@gmail.com> wrote:
Hi Bertrand,

Thanks for playing with this!

> Which makes me wonder if using numa_move_pages()/move_pages is the right
> approach. Would be curious to know if you observe the same behavior though.

You are correct, I'm observing identical behaviour; please see attached.

> Forcing the allocation to happen inside a monitoring function is decidedly
> not great.

We would probably need to split it into a separate, new view within the pg_buffercache extension; that is going to be slow, yet still provide valid results. In the previous approach get_mempolicy() was allocating on first access, but it was slow not only because it was allocating, but also because it was issuing just one syscall per address (yikes!). I somehow struggle to imagine how e.g. scanning (really: allocating) a 128GB buffer cache in the future won't cause issues - that's something like 16-17 million (* 2) syscalls to be issued when not using move_pages(2). Another thing is that numa_maps(5) won't help us much either (not enough granularity).

> But maybe we could use get_mempolicy() only on "valid" buffers i.e
> ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

A different perspective: I wanted to use the same approach in the new pg_shmemallocations_numa, but that won't cut it there. The other idea that came to my mind is to issue move_pages() from a backend that has already used all of those pages. That literally means one of the ideas below:

1. do it from somewhere like the checkpointer / bgwriter?
2. always touch the memory on backend startup (sic!)
3. or just attempt to read/touch the memory address right before calling move_pages().

E.g. this last option is just two lines:

    if (os_page_ptrs[blk2page + j] == 0)
    {
+       volatile uint64 touch pg_attribute_unused();
        os_page_ptrs[blk2page + j] = (char *) BufHdrGetBlock(bufHdr) + (os_page_size * j);
+       touch = *(uint64 *) os_page_ptrs[blk2page + j];
    }

and it seems to work while still issuing far fewer syscalls with move_pages() across backends - well, at least here.
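For reference, the touch-then-batch-query idea can be sketched standalone like this (a minimal sketch, assuming Linux; the 4-page buffer, the raw syscall(SYS_move_pages, ...) call, and all names here are mine for illustration, not the patch's code - the patch works against shared buffers, not malloc'd memory):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define NPAGES 4                /* tiny stand-in for shared buffers */

    int
    main(void)
    {
        long        page_size = sysconf(_SC_PAGESIZE);
        char       *buf;
        void       *pages[NPAGES];
        int         status[NPAGES];
        long        rc;

        if (posix_memalign((void **) &buf, page_size, NPAGES * page_size) != 0)
            return 1;

        for (int i = 0; i < NPAGES; i++)
        {
            pages[i] = buf + (size_t) i * page_size;

            /*
             * Touch the page first: a page with no physical backing would
             * come back from the query below with a negative status
             * (-ENOENT) instead of a node number.  Writing one byte forces
             * a real allocation (a pure read of fresh anonymous memory may
             * only map the shared zero page).
             */
            memset(pages[i], 0, 1);
        }

        /*
         * One move_pages(2) call in query mode: passing nodes == NULL means
         * "do not move anything, just fill status[] with the NUMA node of
         * each page" - one syscall for the whole batch instead of one
         * get_mempolicy() call per address.
         */
        rc = syscall(SYS_move_pages, 0 /* self */, (unsigned long) NPAGES,
                     pages, NULL /* query only */, status, 0);
        if (rc < 0)
        {
            /* e.g. kernel built without CONFIG_NUMA */
            printf("move_pages unavailable: %s\n", strerror(errno));
            free(buf);
            return 0;
        }

        for (int i = 0; i < NPAGES; i++)
            printf("page %d -> status %d\n", i, status[i]);

        free(buf);
        return 0;
    }

On a single-node machine every status comes back as 0; on an interleaved NUMA box the nodes vary per page, which is exactly the per-page granularity that numa_maps cannot give us.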
Frankly speaking, I do not know which path to take with this - maybe that's good enough?

-J.
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 16149
            4 |    59
            0 |    59
            6 |    58
            2 |    59
(5 rows)

postgres=# create table xx as select generate_series(1, 1000000);
SELECT 1000000
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 14095
            4 |   572
            0 |   572
            6 |   571
            2 |   572
           -2 |     2
(6 rows)

postgres=# show shared_buffers ;
 shared_buffers
----------------
 128MB
(1 row)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          14121
(1 row)

## and now from 14121:
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 13439
            4 |    46
            0 |    48
            6 |    47
            2 |    47
           -2 |  2757

## also, this won't give us detailed addr <-> specific NUMA node info:
postgres@jw-test3:~$ grep --color /dev/zero /proc/14121/numa_maps
7f5dd6004000 interleave:0-7 file=/dev/zero\040(deleted) dirty=6829 mapmax=7 active=0 N0=853 N1=853 N2=855 N3=853 N4=852 N5=854 N6=854 N7=855 kernelpagesize_kB=4