On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <and...@anarazel.de> wrote:

> On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

Hi Andres, thanks for your review!

OK, first sane version attached, with the new src/port/pg_numa.c
boilerplate in 0001. Fixed some bugs too; there is one remaining
optimization to be done (see the `static` question later). Docs/tests
are still missing.

QQ: I'm still wondering whether there's a better way of exposing the
fact that multiple pg shmem entries point to the same page (think of
something hot: PROCLOCK or ProcArray). Wouldn't it make sense (in some
future thread/patch) to expose this info via an additional column
(pg_get_shmem_numa_allocations.shared_pages bool?)? I'm thinking of an
easy way of showing that potential NUMA auto-balancing could lead to
NUMA TLB shootdowns (not that it is happening or being counted, just
identifying it as a potential problem in the allocation layout). Or
does that not make sense because we already have
pg_shmem_allocations.{off|size} and we could derive such info from it
(after all, it is for devs?)?

postgres@postgres:1234 : 18843 # select name, off, off+allocated_size,
allocated_size from pg_shmem_allocations order by off;
                      name                      |    off    | ?column?  | allocated_size
------------------------------------------------+-----------+-----------+----------------
[..]
 Proc Header                                    | 147114112 | 147114240 |            128
 Proc Array                                     | 147274752 | 147275392 |            640
 KnownAssignedXids                              | 147275392 | 147310848 |          35456
 KnownAssignedXidsValid                         | 147310848 | 147319808 |           8960
 Backend Status Array                           | 147319808 | 147381248 |          61440

postgres@postgres:1234 : 18924 # select * from
pg_shmem_numa_allocations where name IN ('Proc Header', 'Proc Array',
'KnownAssignedXids', '..') order by name;
       name        | numa_zone_id | numa_size
-------------------+--------------+-----------
 KnownAssignedXids |            0 |   2097152
 Proc Array        |            0 |   2097152
 Proc Header       |            0 |   2097152

I.e. Proc Array ends and KnownAssignedXids starts right afterwards;
both are hot, but they land on the same huge page and hence the same
NUMA node.
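
To make that shared_pages idea concrete, here's a hypothetical sketch
(names made up, nothing like this is in the patch yet) of the check
such a column could be derived from:

    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Do two consecutive pg_shmem_allocations entries share an OS page?
     * "off" and "size" as in the view; os_page_size would be the huge
     * page size when huge pages are in use, which is why with 2MB pages
     * the small hot allocations above all collapse onto one page.
     */
    static bool
    shmem_allocs_share_page(size_t off_a, size_t size_a, size_t off_b,
                            size_t os_page_size)
    {
        /* last OS page of A == first OS page of B */
        return (off_a + size_a - 1) / os_page_size == off_b / os_page_size;
    }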

> > If there would be agreement that this is the way we want to have it
> > (from the backend and not from checkpointer), here's what's left on
> > the table to be done here:
>
> > a. isn't there something quicker for touching / page-faulting memory ?
>
> If you actually fault in a page the kernel actually has to allocate memory and
> then zero it out. That rather severely limits the throughput...

OK, no comments on that madvise(MADV_POPULATE_READ) idea so far, so I'm
sticking to touching pointers.

> > If not then maybe add CHECKS_FOR_INTERRUPTS() there?
>
> Should definitely be there.

Added.

> > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> > help (it probably only works for parent//postmaster).
>
> Yes, needs to be in postmaster.
>
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?

Please see the attached file for more verbose results; in short it is
like below:

patch(-touchpages)               hugepages=off   INVALID RESULTS (-2)
patch(-touchpages)               hugepages=on    INVALID RESULTS (-2)
patch(touchpages)                hugepages=off   CORRECT RESULT
patch(touchpages)                hugepages=on    CORRECT RESULT
patch(-touchpages)+MAP_POPULATE  hugepages=off   INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE  hugepages=on    INVALID RESULTS (-2)

IMHO, the only other thing that could work here (while still
page-faulting) is the 5.14+ madvise(MADV_POPULATE_READ). Tests are
welcome; it might be kernel-version dependent.
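
For reference, an untested sketch of that fallback (MADV_POPULATE_READ
only exists in kernel/libc headers from around 5.14, hence the guard;
the helper name is made up):

    #include <errno.h>
    #include <stddef.h>
    #include <sys/mman.h>

    /*
     * Ask the kernel to fault in a whole range at once instead of
     * dereferencing every page.  Returns -1 with errno set on older
     * kernels or on failure, so callers can fall back to the explicit
     * touch loop.
     */
    static int
    pg_numa_populate_read(void *addr, size_t len)
    {
    #ifdef MADV_POPULATE_READ
        return madvise(addr, len, MADV_POPULATE_READ);
    #else
        errno = EINVAL;
        return -1;
    #endif
    }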

BTW: yes, you can "feel" the timing impact of MAP_SHARED|MAP_POPULATE
during startup, but it doesn't help our case: child backends apparently
don't come up with pre-faulted page-table entries across fork().

> FWIW, what you posted fails on CI:
> https://cirrus-ci.com/task/5114213770723328
>
> Probably some ifdefs are missing. The sanity-check task configures with
> minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed; we'll see what cfbot says. I'm flying blind with all
of this CI stuff...

> > b. refactor shared code so that it goes into src/port (but with
> > Linux-only support so far)

Done.

> > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
>
> You mean a specific context instead of CurrentMemoryContext?

Yes, I had doubts earlier, but for now I'm going to leave it as it is.

> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 91b51142d2e..e3b7554d9e8 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -436,6 +436,10 @@ task:
> >          SANITIZER_FLAGS: -fsanitize=address
> >          PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
> >
> > +
> > +      # FIXME: use or not the libnuma?
> > +      #      --with-libnuma \
> > +      #
> >        # Normally, the "relation segment" code basically has no coverage in our
> >        # tests, because we (quite reasonably) don't generate tables large
> >        # enough in tests. We've had plenty bugs that we didn't notice due the
>
> I don't see why not.

Fixed.

> > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > new file mode 100644
> > index 00000000000..e5b3d1f7dd2
> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,30 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the function.
> > +DROP FUNCTION pg_buffercache_pages() CASCADE;
>
> Why? I think that's going to cause problems, as the pg_buffercache view
> depends on it, and user views might in turn depend on pg_buffercache.  I
> think CASCADE is rarely, if ever, OK to use in an extension script.

... it's just me cutting corners :^), fixed now.

[..]
> > +-- Don't want these to be available to public.
> > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
>
> We grant pg_monitor SELECT ON pg_buffercache; I think we should do the
> same for _numa?

Yup, fixed.

> > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
> >                       else
> >                               fctx->record[i].isvalid = false;
> >
> > +#ifdef USE_LIBNUMA
> > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> > +#define BufHdrGetBlock(bufHdr)	((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > +			blk2page = (int) i * pages_per_blk;
>
> BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

Fixed, thanks, I was looking for something like this! Is that +1 in v4 good?
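
For the archives, my understanding of why the +1 should be right (worth
double-checking): Buffer numbers are 1-based while the loop index i is
the 0-based buf_id, so per bufmgr.h the two expressions below should
address the same block:

    /*
     * BufHdrGetBlock(GetBufferDescriptor(i))  -- bufmgr.c-private macro
     * BufferGetBlock((Buffer) (i + 1))        -- public, from bufmgr.h:
     *   (Block) (BufferBlocks + ((Size) (buffer - 1)) * BLCKSZ)
     */
    os_page_ptrs[blk2page + j] = (char *) BufferGetBlock((Buffer) (i + 1))
        + os_page_size * j;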

> > +			j = 0;
> > +			do {
> > +				/*
> > +				 * Many buffers can point to the same page, but we want
> > +				 * to query just the first address.
> > +				 *
> > +				 * In order to get reliable results we also need to
> > +				 * touch memory pages, so that inquiry about NUMA zone
> > +				 * doesn't return -2.
> > +				 */
> > +				if(os_page_ptrs[blk2page+j] == 0) {
> > +					volatile uint64 touch pg_attribute_unused();
> > +					os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> > +					touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > +				}
> > +				j++;
> > +			} while(j < (int)pages_per_blk);
> > +#endif
> > +
> > +
>
> Why is this done before we even have gotten -2 back? Even if we need it, it
> seems like we ought to defer this until necessary.

Not fixed yet. Maybe we could even use a `static` has_this_run_earlier
flag and perform this work only once, the first time through? Something
like the sketch below:
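
(Untested fragment, reusing the loop from above; the flag name is made
up. The pointer array still has to be filled on every call for
numa_move_pages(), only the faulting dereference can be skipped.)

    static bool firstNumaTouch = true;	/* per-backend */

    ...
    for (j = 0; j < (int) pages_per_blk; j++)
    {
        os_page_ptrs[blk2page + j] = (char *) BufferGetBlock((Buffer) (i + 1))
            + os_page_size * j;

        /* fault pages in only on the first call in this backend */
        if (firstNumaTouch)
        {
            volatile uint64 touch pg_attribute_unused();

            touch = *(uint64 *) os_page_ptrs[blk2page + j];
        }
    }
    ...
    firstNumaTouch = false;	/* once the scan over all buffers is done */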

> > +#ifdef USE_LIBNUMA
> > +		if(query_numa) {
> > +			/* According to numa(3) it is required to initialize the library even if that's a no-op. */
> > +			if(numa_available() == -1) {
> > +				pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> > +				elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");
> > +			} else {
> > +				/* Amortize the number of pages we need to query about */
> > +				if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> > +					elog(ERROR, "failed NUMA pages inquiry status");
> > +				}
>
> I wonder if we ought to override numa_error() so we can display more useful
> errors.

Another question without an easy answer, as I've never hit this error
path from numa_move_pages(): one gets invalid entries in
*os_pages_status instead. BUT: most of this patch only uses libnuma
calls that cannot fail. One way to trigger libnuma warnings is e.g.
`chmod 700 /sys` (because it's hard to unmount it); most numactl stuff
still works with euid != 0, but `numactl --hardware` at least gets
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" (same story with `numactl -C 678 date`). So unless
we start much heavier use of libnuma (not just for observability),
there's probably no point in it right now(?). Contrary to that: we
could just do a variadic elog() for it; I've put in some code (see
below), but no idea if it works fine...
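
For the record, roughly the override I've put in (barely tested;
numa(3) documents numa_error() and numa_warn() as overridable by the
application):

    #include "postgres.h"

    #include <numa.h>
    #include <stdarg.h>
    #include <stdio.h>

    void
    numa_warn(int num, char *fmt,...)
    {
        char		buf[1024];
        va_list		ap;

        va_start(ap, fmt);
        vsnprintf(buf, sizeof(buf), fmt, ap);
        va_end(ap);

        elog(WARNING, "libnuma: %s", buf);
    }

    void
    numa_error(char *where)
    {
        /* WARNING rather than ERROR: longjmp-ing out of libnuma seems unwise */
        elog(WARNING, "libnuma failure: %s", where);
    }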

[..]
> Doing multiple memory allocations while holding an lwlock is probably not a
> great idea, even if the lock normally isn't contended.
[..]
> Why do this in very loop iteration?

Both fixed.
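
For the archives, the shape of both fixes (sketch only; variable names
illustrative):

    /* allocate everything up front, before taking the lock ... */
    os_page_ptrs = palloc0(shm_total_page_count * sizeof(void *));
    os_pages_status = palloc(shm_total_page_count * sizeof(int));

    LWLockAcquire(ShmemIndexLock, LW_SHARED);
    /* ... and do no palloc() while we hold it */
    LWLockRelease(ShmemIndexLock);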

-J.
IMPACT OF pg_numa_touch_mem_if_required() from pg_numa.h: we are touching
~16k * 8kB buffers =~ 128MB (close to pgbench_accounts size):

        postgres=# \dt+ pgbench_accounts
                                               List of tables
         Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size  | Description
        --------+------------------+-------+----------+-------------+---------------+--------+-------------
         public | pgbench_accounts | table | postgres | permanent   | heap          | 128 MB |

        pgbench -i -s 10 ~~ 128MB size, and "select relfilenode, count(*)
        from pg_buffercache where relfilenode is not null group by
        relfilenode order by 2 desc limit 3;" shows:

         relfilenode | count
        -------------+-------
               28345 | 16402   <-- pgbench_accounts
                1249 |    46
                2619 |    44
EXPECTED RESULT: at least 16k+ buffers split into multiple NUMA zones


patch(-touchpages)               hugepages=off   INVALID RESULTS (-2)

        postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;'
        [..]
        done in 2.26 s (drop tables 0.12 s, create tables 0.01 s, client-side generate 1.40 s, vacuum 0.18 s, primary keys 0.54 s).
         huge_pages_status
        -------------------
         off
        (1 row)

         numa_zone_id | count
        --------------+--------
                      | 507568
                    0 |     98
                    2 |     98
                   -2 |  16524


patch(-touchpages)               hugepages=on    INVALID RESULTS (-2)

        postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;'
        [..]
        done in 2.54 s (drop tables 0.20 s, create tables 0.02 s, client-side generate 1.57 s, vacuum 0.21 s, primary keys 0.55 s).
         huge_pages_status
        -------------------
         on
        (1 row)

         numa_zone_id | count
        --------------+--------
                      | 507490
                    0 |    256
                    2 |    256
                   -2 |  15774
                    1 |    512

patch(touchpages)                hugepages=off   CORRECT RESULT

        postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;'
        [..]
        done in 2.42 s (drop tables 0.12 s, create tables 0.01 s, client-side generate 1.55 s, vacuum 0.19 s, primary keys 0.54 s).
         huge_pages_status
        -------------------
         off
        (1 row)

         numa_zone_id | count
        --------------+--------
                      | 507589
                    0 |   8349
                    2 |   8350
        ^^ interesting: different split without vs. with huge pages

patch(touchpages)                hugepages=on    CORRECT RESULT

        postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;' -c '\l+'
        [..]
        done in 2.21 s (drop tables 0.12 s, create tables 0.01 s, client-side generate 1.41 s, vacuum 0.19 s, primary keys 0.47 s).
         huge_pages_status
        -------------------
         on
        (1 row)

         numa_zone_id | count
        --------------+--------
                      | 507602
                    3 |   4096
                    0 |   4352
                    2 |   4096
                    1 |   4142
        ^^ interesting: different split without vs. with huge pages
 

patch(-touchpages)+MAP_POPULATE  hugepages=off   INVALID RESULTS (-2)

        postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;' -c '\l+'
        [..]
        done in 2.25 s (drop tables 0.13 s, create tables 0.01 s, client-side generate 1.45 s, vacuum 0.19 s, primary keys 0.48 s).
         huge_pages_status
        -------------------
         off
        (1 row)

         numa_zone_id | count
        --------------+--------
                      | 507594
                    3 |     98
                   -2 |  16498
                    1 |     98
        (4 rows)


patch(-touchpages)+MAP_POPULATE  hugepages=on    INVALID RESULTS (-2)

        postgres@jw-test3:~$ numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart ; /usr/pgsql18numa/bin/pgbench -i -s 10; /usr/pgsql18numa/bin/psql -c 'show huge_pages_status;' -c 'select numa_zone_id, count(*) from pg_buffercache_numa group by numa_zone_id;' -c '\l+'
        [..]
        done in 4.89 s (drop tables 0.12 s, create tables 0.01 s, client-side generate 1.50 s, vacuum 1.32 s, primary keys 1.93 s).
         huge_pages_status
        -------------------
         on
        (1 row)

         numa_zone_id | count
        --------------+--------
                      | 507596
                    0 |    256
                    2 |    256
                   -2 |  15616
                    1 |    564
        (5 rows)


        cross-check that MAP_POPULATE was used above:

                postgres@jw-test3:~$ strace -ffe mmap numactl --interleave=all /usr/pgsql18numa/bin/pg_ctl -c -D /tmp/numa -l logfile restart 2>&1 | grep MAP_SHARED
                mmap(NULL, 27028, PROT_READ, MAP_SHARED, 3, 0) = 0x7f954aaa1000
                [pid 15686] mmap(NULL, 27028, PROT_READ, MAP_SHARED, 3, 0) = 0x7fecd3d46000
                [pid 15686] mmap(NULL, 4454350848, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE|MAP_HUGETLB, -1, 0) = 0x7febc7200000

                postgres@jw-test3:~$ uname -r
                6.10.11+bpo-cloud-amd64

Attachment: v4-0003-Add-pg_shmem_numa_allocations.patch
Description: Binary data

Attachment: v4-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch
Description: Binary data

Attachment: v4-0001-Add-optional-dependency-to-libnuma-for-basic-NUMA.patch
Description: Binary data
