On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot <bertranddrouvot...@gmail.com> wrote:
Hi Bertrand,

Thanks for playing with this!

> Which makes me wonder if using numa_move_pages()/move_pages is the right
> approach. Would be curious to know if you observe the same behavior though.

You are correct, I'm observing identical behaviour; please see attached.

> Forcing the allocation to happen inside a monitoring function is decidedly
> not great.

We would probably need to split it into a separate, new view within the pg_buffercache extension; that is going to be slow, yet still provide valid results. In the previous approach get_mempolicy() was allocating on first access, but it was slow not only because it was allocating, but also because it was issuing just one syscall per address (yikes!). I somehow struggle to imagine how e.g. scanning (really: allocating) a 128GB buffer cache in the future won't cause issues - that's something like 16-17 million (* 2) syscalls to be issued when not using move_pages(2). Another thing is that numa_maps(5) won't help us much either (not enough granularity).

> But maybe we could use get_mempolicy() only on "valid" buffers i.e
> ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

A different perspective: I wanted to use the same approach in the new pg_shmemallocations_numa, but that won't cut it there. The other idea that came to my mind is to issue move_pages() from a backend that has already used all of those pages. That literally means one of the ideas below:

1. do it from somewhere like the checkpointer / bgwriter?
2. always touch the memory on backend startup (sic!)
3. or just attempt to read/touch the memory address right before calling move_pages().

E.g. this last option is just two lines:

    if (os_page_ptrs[blk2page + j] == 0)
    {
+       volatile uint64 touch pg_attribute_unused();
        os_page_ptrs[blk2page + j] = (char *) BufHdrGetBlock(bufHdr) + (os_page_size * j);
+       touch = *(uint64 *) os_page_ptrs[blk2page + j];
    }

and it seems to work while still issuing far fewer syscalls with move_pages() across backends - well, at least here.
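For reference, the touch-then-batch-query idea can be sketched standalone like this (a minimal sketch, assuming Linux; the 4-page buffer, the raw syscall(SYS_move_pages, ...) call, and all names here are mine for illustration, not the patch's code - the patch works against shared buffers, not malloc'd memory):

    #define _GNU_SOURCE
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define NPAGES 4                /* tiny stand-in for shared buffers */

    int
    main(void)
    {
        long        page_size = sysconf(_SC_PAGESIZE);
        char       *buf;
        void       *pages[NPAGES];
        int         status[NPAGES];
        long        rc;

        if (posix_memalign((void **) &buf, page_size, NPAGES * page_size) != 0)
            return 1;

        for (int i = 0; i < NPAGES; i++)
        {
            pages[i] = buf + (size_t) i * page_size;

            /*
             * Touch the page first: a page with no physical backing would
             * come back from the query below with a negative status
             * (-ENOENT) instead of a node number.  Writing one byte forces
             * a real allocation (a pure read of fresh anonymous memory may
             * only map the shared zero page).
             */
            memset(pages[i], 0, 1);
        }

        /*
         * One move_pages(2) call in query mode: passing nodes == NULL means
         * "do not move anything, just fill status[] with the NUMA node of
         * each page" - one syscall for the whole batch instead of one
         * get_mempolicy() call per address.
         */
        rc = syscall(SYS_move_pages, 0 /* self */, (unsigned long) NPAGES,
                     pages, NULL /* query only */, status, 0);
        if (rc < 0)
        {
            /* e.g. kernel built without CONFIG_NUMA */
            printf("move_pages unavailable: %s\n", strerror(errno));
            free(buf);
            return 0;
        }

        for (int i = 0; i < NPAGES; i++)
            printf("page %d -> status %d\n", i, status[i]);

        free(buf);
        return 0;
    }

On a single-node machine every status comes back as 0; on an interleaved NUMA box the nodes vary per page, which is exactly the per-page granularity that numa_maps cannot give us.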
Frankly speaking, I do not know which path to take with this - maybe that's good enough?

-J.
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 16149
            4 |    59
            0 |    59
            6 |    58
            2 |    59
(5 rows)

postgres=# create table xx as select generate_series(1, 1000000);
SELECT 1000000
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 14095
            4 |   572
            0 |   572
            6 |   571
            2 |   572
           -2 |     2
(6 rows)

postgres=# show shared_buffers ;
 shared_buffers
----------------
 128MB
(1 row)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
          14121
(1 row)

## and now from 14121:
postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 13439
            4 |    46
            0 |    48
            6 |    47
            2 |    47
           -2 |  2757

## also, this won't give us detailed addr <-> specific NUMA node info:
postgres@jw-test3:~$ grep --color /dev/zero /proc/14121/numa_maps
7f5dd6004000 interleave:0-7 file=/dev/zero\040(deleted) dirty=6829 mapmax=7 active=0 N0=853 N1=853 N2=855 N3=853 N4=852 N5=854 N6=854 N7=855 kernelpagesize_kB=4