Re: Draft for basic NUMA observability

Tomas Vondra Thu, 03 Apr 2025 04:52:38 -0700

On 4/3/25 09:01, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <[email protected]> wrote:
> 
> Hi Tomas,
> 
>> OK, so you agree the commit messages are complete / correct?
> 
> Yes.
> 
>> OK. FWIW if you disagree with some of my proposed changes, feel free to
>> push back. I'm sure some may be more a matter of personal preference.
> 
> No, it's all fine. I will probably have lots of questions about
> setting proper env for development that cares itself about style, but
> that's for another day.
> 
>> [..floats..]
>> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
>> confusing and difficult to reason about ...
>>
>> I agree we don't want special cases for each possible combination of
>> page sizes (I'm not sure we even know all the combinations). What I was
>> thinking about is two branches, one for (block >= page) and another for
>> (block < page). AFAICK both values have to be 2^k, so this would
>> guarantee we have either (block/page) or (page/block) as integer.
>>
>> I wonder if you could even just calculate both, and have one loop that
>> deals with both.
>>
> [..]
>> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
>> or any other data type. I mean that it works with non-floating point
>> values. It could be int64, Size or whatever is large enough to not
>> overflow. I really don't see how changing stuff to double makes this
>> easier to understand.
> 
> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values. It should be more readable too with
> those comments. I have not put it into its own function, because now
> it fits the whole screen, so hopefully one can follow visually. Please
> let me know if that code solves the doubts or feel free to reformat
> it. That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.
>


IMHO the code in v21 is much easier to understand. It's not quite clear
to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though.
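
For illustration only, here is a minimal standalone sketch of the kind of integer-only calculation discussed above. It is not the code from v21. The names PageRange and buffer_page_range are hypothetical, and the sketch assumes the buffer array starts at an OS-page-aligned address, so offsets can be measured from zero. Because both sizes are powers of two, plain integer division handles both the (block >= page) and the (block < page) case without separate branches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type, not part of the patch: OS pages backing one buffer. */
typedef struct PageRange
{
	uint64_t	first_page;		/* index of the first OS page touched */
	uint64_t	npages;			/* number of OS pages touched */
} PageRange;

/*
 * Map a buffer index to the range of OS pages it occupies, using only
 * integer arithmetic.  Assumes (for this sketch) that buffer 0 starts at
 * a page-aligned address.
 */
static PageRange
buffer_page_range(uint64_t buffer_idx, uint64_t block_size, uint64_t page_size)
{
	PageRange	r;
	uint64_t	start;			/* byte offset of the first byte of the buffer */
	uint64_t	end;			/* byte offset of the last byte of the buffer */

	/* both sizes have to be 2^k, as noted in the discussion */
	assert(block_size > 0 && (block_size & (block_size - 1)) == 0);
	assert(page_size > 0 && (page_size & (page_size - 1)) == 0);

	start = buffer_idx * block_size;
	end = start + block_size - 1;

	r.first_page = start / page_size;
	r.npages = end / page_size - r.first_page + 1;

	return r;
}
```

With 8K blocks and 4K pages every buffer maps to two pages; with 8K blocks and 2MB huge pages many buffers share one page, and the same expression covers both.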

>>> 12) You have also raised "why not pg_shm_allocations_numa" instead of
>>> "pg_shm_numa_allocations"
>>>
>>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
>>> naming things in general), I can change if you want.
>>>
>>
>> Me neither. I wonder if there's some precedent when adding similar
>> variants for other catalogs ... can you check? I've been thinking about
>> pg_stats and pg_stats_ext, but maybe there's a better example?
> 
> Hm, it seems we always go with suffix "_somethingnew":
> 
> * pg_stat_database -> pg_stat_database_conflicts
> * pg_stat_subscription -> pg_stat_subscription_stats
> * even here: pg_buffercache -> pg_buffercache_numa
> 
> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...
> 

+1 to pg_shmem_allocations_numa

>>> 13) In the patch: "review: What if we get multiple pages per buffer
>>> (the default). Could we get multiple nodes per buffer?"
>>>
>>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
>>> output multiple rows per single buffer (with "page_no") then we could
>>> get this:
>>> buffer1:..:page0:numanodeID1
>>> buffer1:..:page1:numanodeID2
>>> buffer2:..:page0:numanodeID1
>>>
>>> Should we add such functionality?
>>
>> When you say "today no" does that mean we know all pages will be on the
>> same node, or that there may be pages from different nodes and we can't
>> display that? That'd not be great, IMHO.
>>
>> I'm not a huge fan of returning multiple rows per buffer, with one row
>> per page. So for 8K blocks and 4K pages we'd have 2 rows per buffer. The
>> rest of the fields is for the whole buffer, it'd be wrong to duplicate
>> that for each page.
> 
> OPEN_QUESTION: With v21 we have all the information available, we are
> just unable to display this in pg_buffercache_numa right now. We could
> trim the view so that it has 3 columns (and user needs to JOIN to
> pg_buffercache for more details like relationoid), but then what the
> whole refactor (0002) was for if we would just return bufferId like
> below:
> 
> buffer1:page0:numanodeID1
> buffer1:page1:numanodeID2
> buffer2:page0:numanodeID1
> buffer2:page1:numanodeID1
> 
> There's also the problem that reading/joining could be inconsistent
> and even slower.
> 

I think a view with just 3 columns would be a good solution. It's what
pg_shmem_allocations_numa already does, so it'd be consistent with that
part too.

I'm not too worried about the cost of the extra join - it's going to be
a couple dozen milliseconds at worst, I guess, and that's negligible in
the bigger scheme of things (e.g. compared to how long the move_pages is
expected to take). Also, it's not like having everything in the same
view is free - people would have to do some sort of post-processing, and
that has a cost too.

So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.

>> I wonder if we should have a bitmap of nodes for the buffer (but then
>> what if there are multiple pages from the same node?), or maybe just an
>> array of nodes, with one element per page.
> 
> AFAIR this has been discussed back in end of January, and the
> conclusion was more or less - on Discord - that everything sucks
> (bitmaps, BIT() datatype, arrays,...) either from implementation or
> user side, but apparently arrays [] would suck the least from
> implementation side. So we could probably do something like up to
> node_max_nodes():
> buffer1:..:{0, 2, 0, 0}
> buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
> buffer3:..:{0, 0, 0, 2}
> 
> Other idea is JSON or even simple string with numa_node_id<->count:
> buffer1:..:"1=2"
> buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
> buffer3:..:"3=2"
> 
> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...

I'm -1 on JSON. I don't see how that would solve anything better than
e.g. a regular array, and it's going to be harder to work with. So if we
don't want to go with the 3-column view proposed earlier, I'd stick to a
simple array. I don't think there's a huge difference between the two
approaches; it should be easy to convert between them using
unnest() and array_agg().

Attached is v22, with some minor review comments:

1) I suggested we should just use "libnuma support" in configure,
instead of talking about "NUMA awareness support", and AFAICS you
agreed. But I still see the old text in configure ... is that
intentional or a bit of forgotten text?

2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
and simplified a couple lines (but that's a matter of preference).

3) I don't think it's correct for pg_get_shmem_numa_allocations to just
silently ignore nodes outside the valid range. I suggest we simply do
elog(ERROR), as it's an internal error we don't expect to happen.
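
To make point 3 concrete, here is a small standalone sketch of the check I have in mind. It is not code from the patch: tally_pages_per_node is a hypothetical helper, and the -1 return merely stands in for what would be elog(ERROR) in the backend. It also assumes the pages were already touched, so negative status values (which move_pages(2) uses to report per-page errors) are not expected either.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: count pages per NUMA node from a status array as
 * filled in by a move_pages(2)-style query.
 *
 * counts must have room for max_node + 1 entries.  Returns 0 on success,
 * or -1 as soon as a status value falls outside [0, max_node].  In the
 * backend the -1 path would be an elog(ERROR) instead of being ignored.
 */
static int
tally_pages_per_node(const int *status, size_t npages, int max_node,
					 uint64_t *counts)
{
	for (int n = 0; n <= max_node; n++)
		counts[n] = 0;

	for (size_t i = 0; i < npages; i++)
	{
		/* do not silently skip unexpected values, report them */
		if (status[i] < 0 || status[i] > max_node)
			return -1;

		counts[status[i]]++;
	}

	return 0;
}
```

The point is only that an out-of-range node is reported to the caller rather than dropped from the totals.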


regards

-- 
Tomas Vondra

From 22b6296ee914f8445be5eebf53b994196064d0d3 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Wed, 2 Apr 2025 12:29:22 +0200
Subject: [PATCH v22 1/7] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.

A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.

The libnuma library is not available on 32-bit builds (there's no shared
object for i386), so we disable it in that case. The i386 is very memory
limited anyway, even with PAE, so NUMA is mostly irrelevant.

On Linux we use move_pages(2) syscall for speed instead of
get_mempolicy(2).

Author: Jakub Wartak <[email protected]>
Co-authored-by: Bertrand Drouvot <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Álvaro Herrera <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
 .cirrus.tasks.yml                   |   2 +
 configure                           | 187 ++++++++++++++++++++++++++++
 configure.ac                        |  14 +++
 doc/src/sgml/func.sgml              |  13 ++
 doc/src/sgml/installation.sgml      |  21 ++++
 meson.build                         |  23 ++++
 meson_options.txt                   |   3 +
 src/Makefile.global.in              |   6 +-
 src/backend/utils/misc/guc_tables.c |   2 +-
 src/include/catalog/pg_proc.dat     |   4 +
 src/include/pg_config.h.in          |   3 +
 src/include/port/pg_numa.h          |  40 ++++++
 src/include/storage/pg_shmem.h      |   1 +
 src/makefiles/meson.build           |   3 +
 src/port/Makefile                   |   1 +
 src/port/meson.build                |   1 +
 src/port/pg_numa.c                  | 120 ++++++++++++++++++
 17 files changed, 442 insertions(+), 2 deletions(-)
 create mode 100644 src/include/port/pg_numa.h
 create mode 100644 src/port/pg_numa.c

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
             --enable-cassert --enable-injection-points --enable-debug \
             --enable-tap-tests --enable-nls \
             --with-segsize-blocks=6 \
+            --with-libnuma \
             --with-liburing \
             \
             ${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
             -Dllvm=disabled \
             --pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
             -DPERL=perl5.36-i386-linux-gnu \
+            -Dlibnuma=disabled \
             build-32
         EOF
 
diff --git a/configure b/configure
index 3c19e7e60ec..bea359812d4 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
 XML2_CFLAGS
 XML2_CONFIG
 with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
 LIBCURL_LIBS
 LIBCURL_CFLAGS
 with_libcurl
@@ -872,6 +875,7 @@ with_liburing
 with_uuid
 with_ossp_uuid
 with_libcurl
+with_libnuma
 with_libxml
 with_libxslt
 with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
 LIBURING_LIBS
 LIBCURL_CFLAGS
 LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
 XML2_CONFIG
 XML2_CFLAGS
 XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
   --with-uuid=LIB         build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libcurl          build with libcurl support
+  --with-libnuma          build with libnuma for NUMA awareness
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
   --with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
               C compiler flags for LIBCURL, overriding pkg-config
   LIBCURL_LIBS
               linker flags for LIBCURL, overriding pkg-config
+  LIBNUMA_CFLAGS
+              C compiler flags for LIBNUMA, overriding pkg-config
+  LIBNUMA_LIBS
+              linker flags for LIBNUMA, overriding pkg-config
   XML2_CONFIG path to xml2-config utility
   XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
   XML2_LIBS   linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
 fi
 
 
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+  withval=$with_libnuma;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_numa_numa_available=yes
+else
+  ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+  LIBS="-lnuma $LIBS"
+
+else
+  as_fn_error $? "library 'libnuma' is required for NUMA awareness" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+    pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+    if test -n "$PKG_CONFIG" && \
+    { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+  ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; then
+  pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+		      test "x$?" != "x0" && pkg_failed=yes
+else
+  pkg_failed=yes
+fi
+ else
+    pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+    pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+    if test -n "$PKG_CONFIG" && \
+    { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+  ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; then
+  pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+		      test "x$?" != "x0" && pkg_failed=yes
+else
+  pkg_failed=yes
+fi
+ else
+    pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+        _pkg_short_errors_supported=yes
+else
+        _pkg_short_errors_supported=no
+fi
+        if test $_pkg_short_errors_supported = yes; then
+	        LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+        else
+	        LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+        fi
+	# Put the nasty error message in config.log where it belongs
+	echo "$LIBNUMA_PKG_ERRORS" >&5
+
+	as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+	{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old.  Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+	LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+	LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
 #
 # XML
 #
diff --git a/configure.ac b/configure.ac
index 65db0673f8a..fc8dfa87567 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
 fi
 
 
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma for NUMA awareness],
+              [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA awareness support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+  AC_CHECK_LIB(numa,    numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA awareness])])
+  PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
 #
 # XML
 #
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 2488e9ba998..4bb60e9e080 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_numa_available</primary>
+        </indexterm>
+        <function>pg_numa_available</function> ()
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..5f0486bb335 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,16 @@ build-postgresql:
        </listitem>
       </varlistentry>
 
+      <varlistentry id="configure-option-with-libnuma">
+       <term><option>--with-libnuma</option></term>
+       <listitem>
+        <para>
+         Build with libnuma support for basic NUMA support.
+         Only supported on platforms for which the libnuma library is implemented.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="configure-option-with-liburing">
        <term><option>--with-liburing</option></term>
        <listitem>
@@ -2645,6 +2655,17 @@ ninja install
       </listitem>
      </varlistentry>
 
+     <varlistentry id="configure-with-libnuma-meson">
+      <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+      <listitem>
+       <para>
+        Build with libnuma support for basic NUMA support.
+        Only supported on platforms for which the libnuma library is implemented.
+        The default for this option is auto.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="configure-with-libxml-meson">
       <term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
       <listitem>
diff --git a/meson.build b/meson.build
index e8b872d29ad..e3f3ab0f335 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
 endif
 
 
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+  # via pkg-config
+  libnuma = dependency('numa', required: libnumaopt)
+  if not libnuma.found()
+    libnuma = cc.find_library('numa', required: libnumaopt)
+  endif
+  if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+    libnuma = not_found_dep
+  endif
+  if libnuma.found()
+    cdata.set('USE_LIBNUMA', 1)
+  endif
+else
+  libnuma = not_found_dep
+endif
+
 
 ###############################################################
 # Library: liburing
@@ -3242,6 +3263,7 @@ backend_both_deps += [
   icu_i18n,
   ldap,
   libintl,
+  libnuma,
   liburing,
   libxml,
   lz4,
@@ -3898,6 +3920,7 @@ if meson.version().version_compare('>=0.57')
       'icu': icu,
       'ldap': ldap,
       'libcurl': libcurl,
+      'libnuma': libnuma,
       'liburing': liburing,
       'libxml': libxml,
       'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..8675e1b5d87 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
 option('libedit_preferred', type: 'boolean', value: false,
   description: 'Prefer BSD Libedit over GNU Readline')
 
+option('libnuma', type: 'feature', value: 'auto',
+  description: 'NUMA awareness support')
+
 option('liburing', type : 'feature', value: 'auto',
   description: 'io_uring support, for asynchronous I/O')
 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 265fd1b2cfe..92bd85cbed2 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi	= @with_gssapi@
 with_krb_srvnam	= @with_krb_srvnam@
 with_ldap	= @with_ldap@
 with_libcurl	= @with_libcurl@
+with_libnuma	= @with_libnuma@
 with_liburing	= @with_liburing@
 with_libxml	= @with_libxml@
 with_libxslt	= @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
 ICU_CFLAGS		= @ICU_CFLAGS@
 ICU_LIBS		= @ICU_LIBS@
 
+LIBNUMA_CFLAGS		= @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS		= @LIBNUMA_LIBS@
+
 LIBURING_CFLAGS		= @LIBURING_CFLAGS@
 LIBURING_LIBS		= @LIBURING_LIBS@
 
@@ -250,7 +254,7 @@ CPP = @CPP@
 CPPFLAGS = @CPPFLAGS@
 PG_SYSROOT = @PG_SYSROOT@
 
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
 
 ifdef PGXS
 override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int	ssl_renegotiation_limit;
  */
 int			huge_pages = HUGE_PAGES_TRY;
 int			huge_page_size;
-static int	huge_pages_status = HUGE_PAGES_UNKNOWN;
+int			huge_pages_status = HUGE_PAGES_UNKNOWN;
 
 /*
  * These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index a28a15993a2..63859661951 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8518,6 +8518,10 @@
   proargnames => '{name,off,size,allocated_size}',
   prosrc => 'pg_get_shmem_allocations' },
 
+{ oid => '9685', descr => 'Is NUMA compilation available?',
+  proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+  proargtypes => '', prosrc => 'pg_numa_available' },
+
 # memory context of local backend
 { oid => '2282',
   descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 2ac61575883..b7144cbf32f 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -676,6 +676,9 @@
 /* Define to 1 to build with libcurl support. (--with-libcurl) */
 #undef USE_LIBCURL
 
+/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
 /* Define to build with io_uring support. (--with-liburing) */
 #undef USE_LIBURING
 
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..314cff94dbc
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ *	  Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * 	src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+	ro_volatile_var = *(uint64 *)ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+	do {} while(0)
+
+#endif
+
+#endif							/* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
 
   'ICU_LIBS',
 
+  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
   'LIBURING_CFLAGS', 'LIBURING_LIBS',
 ]
 
@@ -232,6 +234,7 @@ pgxs_deps = {
   'icu': icu,
   'ldap': ldap,
   'libcurl': libcurl,
+  'libnuma': libnuma,
   'liburing': liburing,
   'libxml': libxml,
   'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_localeconv_r.o \
+	pg_numa.o \
 	pg_popcount_aarch64.o \
 	pg_popcount_avx512.o \
 	pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 51041e75609..228888b2f66 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_localeconv_r.c',
+  'pg_numa.c',
   'pg_popcount_aarch64.c',
   'pg_popcount_avx512.c',
   'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * 		Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum		pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+	int			r = numa_available();
+
+	return r;
+}
+
+/*
+ * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
+ * first one allows us to batch and query about many memory pages in one single
+ * giant system call that is way faster.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+	return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+	return numa_max_node();
+}
+
+#else
+
+Datum		pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+	/* We state that NUMA is not available */
+	return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+	return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+	return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+	Size		os_page_size;
+#ifdef WIN32
+	SYSTEM_INFO sysinfo;
+
+	GetSystemInfo(&sysinfo);
+	os_page_size = sysinfo.dwPageSize;
+#else
+	os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+	Assert(IsUnderPostmaster);
+	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+	if (huge_pages_status == HUGE_PAGES_ON)
+		GetHugePageSize(&os_page_size, NULL);
+
+	return os_page_size;
+}
-- 
2.49.0

From 818d26ab70af3f7d1a653fae64c02b3d140ae880 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 3 Apr 2025 11:58:34 +0200
Subject: [PATCH v22 2/7] review

---
 configure                      | 2 +-
 doc/src/sgml/installation.sgml | 2 +-
 src/include/pg_config.h.in     | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/configure b/configure
index bea359812d4..969de1bbeb2 100755
--- a/configure
+++ b/configure
@@ -1594,7 +1594,7 @@ Optional Packages:
   --with-uuid=LIB         build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libcurl          build with libcurl support
-  --with-libnuma          build with libnuma for NUMA awareness
+  --with-libnuma          build with libnuma support
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
   --with-system-tzdata=DIR
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index 5f0486bb335..ca6fbab065a 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1161,7 +1161,7 @@ build-postgresql:
        <listitem>
         <para>
          Build with libnuma support for basic NUMA support.
-         Only supported on platforms for which the libnuma library is implemented.
+         Only supported on platforms for which the <productname>libnuma</productname> library is implemented.
         </para>
        </listitem>
       </varlistentry>
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index b7144cbf32f..a31289cd1da 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -676,7 +676,7 @@
 /* Define to 1 to build with libcurl support. (--with-libcurl) */
 #undef USE_LIBCURL
 
-/* Define to 1 to build with NUMA awareness support. (--with-libnuma) */
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
 #undef USE_LIBNUMA
 
 /* Define to build with io_uring support. (--with-liburing) */
-- 
2.49.0

From 89461994d2fb48dabfe38f3690e33f22e760e3ed Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v22 3/7] pg_buffercache: split pg_buffercache_pages into parts

Split pg_buffercache_pages() into multiple smaller functions, to allow
reuse in future patches. This introduces three new functions:

- pg_buffercache_init_entries
- pg_buffercache_save_tuple
- get_buffercache_tuple

that help adding entries into a tuplestore, describing the contents of
the buffercache.

This is a preparation for future patches extending pg_buffercache, e.g.
to add NUMA observability.

Author: Jakub Wartak <[email protected]>
Reviewed-by: Bertrand Drouvot <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
 contrib/pg_buffercache/pg_buffercache_pages.c | 293 +++++++++---------
 1 file changed, 155 insertions(+), 138 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..ced4ec777a1 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -68,80 +68,171 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
 PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
 PG_FUNCTION_INFO_V1(pg_buffercache_evict);
 
-Datum
-pg_buffercache_pages(PG_FUNCTION_ARGS)
+/*
+ * Helper routine for pg_buffercache_pages().
+ */
+static BufferCachePagesContext *
+pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
 {
-	FuncCallContext *funcctx;
-	Datum		result;
-	MemoryContext oldcontext;
 	BufferCachePagesContext *fctx;	/* User function context. */
+	MemoryContext oldcontext;
 	TupleDesc	tupledesc;
 	TupleDesc	expected_tupledesc;
-	HeapTuple	tuple;
 
-	if (SRF_IS_FIRSTCALL())
-	{
-		int			i;
+	/* Switch context when allocating stuff to be used in later calls */
+	oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
 
-		funcctx = SRF_FIRSTCALL_INIT();
+	/* Create a user function context for cross-call persistence */
+	fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+
+	/*
+	 * To smoothly support upgrades from version 1.0 of this extension
+	 * transparently handle the (non-)existence of the pinning_backends
+	 * column. We unfortunately have to get the result type for that... - we
+	 * can't use the result type determined by the function definition without
+	 * potentially crashing when somebody uses the old (or even wrong)
+	 * function definition though.
+	 */
+	if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
+		expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
+		elog(ERROR, "incorrect number of output arguments");
+
+	/* Construct a tuple descriptor for the result rows. */
+	tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+	TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+					   INT4OID, -1, 0);
+	TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
+					   OIDOID, -1, 0);
+	TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
+					   OIDOID, -1, 0);
+	TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
+					   OIDOID, -1, 0);
+	TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
+					   INT2OID, -1, 0);
+	TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
+					   BOOLOID, -1, 0);
+	TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
+					   INT2OID, -1, 0);
+
+	if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+		TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
+						   INT4OID, -1, 0);
 
-		/* Switch context when allocating stuff to be used in later calls */
-		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+	fctx->tupdesc = BlessTupleDesc(tupledesc);
 
-		/* Create a user function context for cross-call persistence */
-		fctx = (BufferCachePagesContext *) palloc(sizeof(BufferCachePagesContext));
+	/* Allocate NBuffers worth of BufferCachePagesRec records. */
+	fctx->record = (BufferCachePagesRec *)
+		MemoryContextAllocHuge(CurrentMemoryContext,
+							   sizeof(BufferCachePagesRec) * NBuffers);
+
+	/* Set max calls and remember the user function context. */
+	funcctx->max_calls = NBuffers;
+	funcctx->user_fctx = fctx;
+
+	/* Return to original context when allocating transient memory */
+	MemoryContextSwitchTo(oldcontext);
+	return fctx;
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Save buffer cache information for a single buffer.
+ */
+static void
+pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+	BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+	bufHdr = GetBufferDescriptor(record_id);
+	/* Lock each buffer header before inspecting. */
+	buf_state = LockBufHdr(bufHdr);
+
+	bufRecord->bufferid = BufferDescriptorGetBuffer(bufHdr);
+	bufRecord->relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
+	bufRecord->reltablespace = bufHdr->tag.spcOid;
+	bufRecord->reldatabase = bufHdr->tag.dbOid;
+	bufRecord->forknum = BufTagGetForkNum(&bufHdr->tag);
+	bufRecord->blocknum = bufHdr->tag.blockNum;
+	bufRecord->usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
+	bufRecord->pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
+
+	if (buf_state & BM_DIRTY)
+		bufRecord->isdirty = true;
+	else
+		bufRecord->isdirty = false;
+
+	/* Note if the buffer is valid, and has storage created */
+	if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+		bufRecord->isvalid = true;
+	else
+		bufRecord->isvalid = false;
+
+	UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
+ * Helper routine for pg_buffercache_pages().
+ *
+ * Format and return a tuple for a single buffer cache entry.
+ */
+static Datum
+get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
+{
+	Datum		values[NUM_BUFFERCACHE_PAGES_ELEM];
+	bool		nulls[NUM_BUFFERCACHE_PAGES_ELEM];
+	HeapTuple	tuple;
+	BufferCachePagesRec *bufRecord = &(fctx->record[record_id]);
+
+	values[0] = Int32GetDatum(bufRecord->bufferid);
+	memset(nulls, false, NUM_BUFFERCACHE_PAGES_ELEM);
+
+	/*
+	 * Set all fields except the bufferid to null if the buffer is unused or
+	 * not valid.
+	 */
+	if (bufRecord->blocknum == InvalidBlockNumber || bufRecord->isvalid == false)
+		memset(&nulls[1], true, (NUM_BUFFERCACHE_PAGES_ELEM - 1) * sizeof(bool));
+	else
+	{
+		values[1] = ObjectIdGetDatum(bufRecord->relfilenumber);
+		values[2] = ObjectIdGetDatum(bufRecord->reltablespace);
+		values[3] = ObjectIdGetDatum(bufRecord->reldatabase);
+		values[4] = ObjectIdGetDatum(bufRecord->forknum);
+		values[5] = Int64GetDatum((int64) bufRecord->blocknum);
+		values[6] = BoolGetDatum(bufRecord->isdirty);
+		values[7] = Int16GetDatum(bufRecord->usagecount);
 
 		/*
-		 * To smoothly support upgrades from version 1.0 of this extension
-		 * transparently handle the (non-)existence of the pinning_backends
-		 * column. We unfortunately have to get the result type for that... -
-		 * we can't use the result type determined by the function definition
-		 * without potentially crashing when somebody uses the old (or even
-		 * wrong) function definition though.
+		 * unused for v1.0 callers, but the array is always long enough
 		 */
-		if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
-			elog(ERROR, "return type must be a row type");
+		values[8] = Int32GetDatum(bufRecord->pinning_backends);
+	}
 
-		if (expected_tupledesc->natts < NUM_BUFFERCACHE_PAGES_MIN_ELEM ||
-			expected_tupledesc->natts > NUM_BUFFERCACHE_PAGES_ELEM)
-			elog(ERROR, "incorrect number of output arguments");
+	/* Build and return the tuple. */
+	tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+	return HeapTupleGetDatum(tuple);
+}
 
-		/* Construct a tuple descriptor for the result rows. */
-		tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
-						   INT4OID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 2, "relfilenode",
-						   OIDOID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 3, "reltablespace",
-						   OIDOID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 4, "reldatabase",
-						   OIDOID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 5, "relforknumber",
-						   INT2OID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 6, "relblocknumber",
-						   INT8OID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 7, "isdirty",
-						   BOOLOID, -1, 0);
-		TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
-						   INT2OID, -1, 0);
-
-		if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
-			TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
-							   INT4OID, -1, 0);
-
-		fctx->tupdesc = BlessTupleDesc(tupledesc);
-
-		/* Allocate NBuffers worth of BufferCachePagesRec records. */
-		fctx->record = (BufferCachePagesRec *)
-			MemoryContextAllocHuge(CurrentMemoryContext,
-								   sizeof(BufferCachePagesRec) * NBuffers);
-
-		/* Set max calls and remember the user function context. */
-		funcctx->max_calls = NBuffers;
-		funcctx->user_fctx = fctx;
-
-		/* Return to original context when allocating transient memory */
-		MemoryContextSwitchTo(oldcontext);
+Datum
+pg_buffercache_pages(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	BufferCachePagesContext *fctx;	/* User function context. */
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		int			i;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+		fctx = pg_buffercache_init_entries(funcctx, fcinfo);
 
 		/*
 		 * Scan through all the buffers, saving the relevant fields in the
@@ -152,36 +243,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 		 * locks, so the information of each buffer is self-consistent.
 		 */
 		for (i = 0; i < NBuffers; i++)
-		{
-			BufferDesc *bufHdr;
-			uint32		buf_state;
-
-			bufHdr = GetBufferDescriptor(i);
-			/* Lock each buffer header before inspecting. */
-			buf_state = LockBufHdr(bufHdr);
-
-			fctx->record[i].bufferid = BufferDescriptorGetBuffer(bufHdr);
-			fctx->record[i].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
-			fctx->record[i].reltablespace = bufHdr->tag.spcOid;
-			fctx->record[i].reldatabase = bufHdr->tag.dbOid;
-			fctx->record[i].forknum = BufTagGetForkNum(&bufHdr->tag);
-			fctx->record[i].blocknum = bufHdr->tag.blockNum;
-			fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
-			fctx->record[i].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
-
-			if (buf_state & BM_DIRTY)
-				fctx->record[i].isdirty = true;
-			else
-				fctx->record[i].isdirty = false;
-
-			/* Note if the buffer is valid, and has storage created */
-			if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
-				fctx->record[i].isvalid = true;
-			else
-				fctx->record[i].isvalid = false;
-
-			UnlockBufHdr(bufHdr, buf_state);
-		}
+			pg_buffercache_save_tuple(i, fctx);
 	}
 
 	funcctx = SRF_PERCALL_SETUP();
@@ -191,55 +253,10 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 
 	if (funcctx->call_cntr < funcctx->max_calls)
 	{
+		Datum		result;
 		uint32		i = funcctx->call_cntr;
-		Datum		values[NUM_BUFFERCACHE_PAGES_ELEM];
-		bool		nulls[NUM_BUFFERCACHE_PAGES_ELEM];
-
-		values[0] = Int32GetDatum(fctx->record[i].bufferid);
-		nulls[0] = false;
-
-		/*
-		 * Set all fields except the bufferid to null if the buffer is unused
-		 * or not valid.
-		 */
-		if (fctx->record[i].blocknum == InvalidBlockNumber ||
-			fctx->record[i].isvalid == false)
-		{
-			nulls[1] = true;
-			nulls[2] = true;
-			nulls[3] = true;
-			nulls[4] = true;
-			nulls[5] = true;
-			nulls[6] = true;
-			nulls[7] = true;
-			/* unused for v1.0 callers, but the array is always long enough */
-			nulls[8] = true;
-		}
-		else
-		{
-			values[1] = ObjectIdGetDatum(fctx->record[i].relfilenumber);
-			nulls[1] = false;
-			values[2] = ObjectIdGetDatum(fctx->record[i].reltablespace);
-			nulls[2] = false;
-			values[3] = ObjectIdGetDatum(fctx->record[i].reldatabase);
-			nulls[3] = false;
-			values[4] = ObjectIdGetDatum(fctx->record[i].forknum);
-			nulls[4] = false;
-			values[5] = Int64GetDatum((int64) fctx->record[i].blocknum);
-			nulls[5] = false;
-			values[6] = BoolGetDatum(fctx->record[i].isdirty);
-			nulls[6] = false;
-			values[7] = Int16GetDatum(fctx->record[i].usagecount);
-			nulls[7] = false;
-			/* unused for v1.0 callers, but the array is always long enough */
-			values[8] = Int32GetDatum(fctx->record[i].pinning_backends);
-			nulls[8] = false;
-		}
-
-		/* Build and return the tuple. */
-		tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
-		result = HeapTupleGetDatum(tuple);
 
+		result = get_buffercache_tuple(i, fctx);
 		SRF_RETURN_NEXT(funcctx, result);
 	}
 	else
-- 
2.49.0

From 76b537112dc94c3077a3058b0ff8361cdda1ec71 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Wed, 19 Mar 2025 09:34:56 +0100
Subject: [PATCH v22 4/7] Add pg_buffercache_numa view with NUMA node info

Introduces a new view pg_buffercache_numa, showing a NUMA memory node
for each individual buffer.

To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.

The size of a database block and OS memory page may differ. For example
the default block size (BLCKSZ) is 8KB, while the memory page is 4KB,
but it's also possible to make the block size smaller (e.g. 1KB).

XXX: Right now we just report NUMA node of the first page when dealing with
multiple pages per single buffer.

Author: Jakub Wartak <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Bertrand Drouvot <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
 contrib/pg_buffercache/Makefile               |   3 +-
 .../expected/pg_buffercache_numa.out          |  28 +++
 .../expected/pg_buffercache_numa_1.out        |   3 +
 contrib/pg_buffercache/meson.build            |   2 +
 .../pg_buffercache--1.5--1.6.sql              |  24 ++
 contrib/pg_buffercache/pg_buffercache.control |   2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c | 225 +++++++++++++++++-
 .../sql/pg_buffercache_numa.sql               |  20 ++
 doc/src/sgml/pgbuffercache.sgml               |  61 ++++-
 9 files changed, 360 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
 create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
 create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
 create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql

diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..2a33602537e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,7 +8,8 @@ OBJS = \
 EXTENSION = pg_buffercache
 DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
 	pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
-	pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+	pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+	pg_buffercache--1.5--1.6.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
 REGRESS = pg_buffercache
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..d4de5ea52fc
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,28 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+select count(*) = (select setting::bigint
+                   from pg_settings
+                   where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR:  permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column? 
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
   'pg_buffercache--1.2.sql',
   'pg_buffercache--1.3--1.4.sql',
   'pg_buffercache--1.4--1.5.sql',
+  'pg_buffercache--1.5--1.6.sql',
   'pg_buffercache.control',
   kwargs: contrib_data_args,
 )
@@ -34,6 +35,7 @@ tests += {
   'regress': {
     'sql': [
       'pg_buffercache',
+      'pg_buffercache_numa',
     ],
   },
 }
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..8c1e891eab2
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,24 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE OR REPLACE VIEW pg_buffercache_numa AS
+	SELECT P.* FROM pg_buffercache_numa_pages() AS P
+	(bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid,
+	 relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2,
+	 pinning_backends int4, node_id int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
 # pg_buffercache extension
 comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
 module_pathname = '$libdir/pg_buffercache'
 relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index ced4ec777a1..3460cf579f7 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,12 +11,13 @@
 #include "access/htup_details.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
+#include "port/pg_numa.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 
 
 #define NUM_BUFFERCACHE_PAGES_MIN_ELEM	8
-#define NUM_BUFFERCACHE_PAGES_ELEM	9
+#define NUM_BUFFERCACHE_PAGES_ELEM	10
 #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
 #define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
 
@@ -46,6 +47,7 @@ typedef struct
 	 * because of bufmgr.c's PrivateRefCount infrastructure.
 	 */
 	int32		pinning_backends;
+	int32		numa_node_id;
 } BufferCachePagesRec;
 
 
@@ -64,12 +66,41 @@ typedef struct
  * relation node/tablespace/database/blocknum and dirty indicator.
  */
 PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
 PG_FUNCTION_INFO_V1(pg_buffercache_summary);
 PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
 PG_FUNCTION_INFO_V1(pg_buffercache_evict);
 
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+/*
+ * Helper routine to map Buffers into addresses that are used by
+ * pg_numa_query_pages(). Please see its comment for an explanation of why we
+ * need to prepare pointers like this.
+ *
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ *
+ */
+#if 0
+static inline void
+pg_buffercache_numa_prepare_ptrs(int buffer_id, double pages_per_blk,
+								 Size os_page_size,
+								 void **os_page_ptrs)
+{
+
+	/* XXX: move it here? */
+}
+#endif
+
 /*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
+ *
+ * Allocates and returns a new user function context based on the SRF context
+ * (requires funcctx to be initialized by SRF_FIRSTCALL_INIT()) and the
+ * standard function call info.
  */
 static BufferCachePagesContext *
 pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
@@ -119,9 +150,12 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
 	TupleDescInitEntry(tupledesc, (AttrNumber) 8, "usage_count",
 					   INT2OID, -1, 0);
 
-	if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+	if (expected_tupledesc->natts >= NUM_BUFFERCACHE_PAGES_ELEM - 1)
 		TupleDescInitEntry(tupledesc, (AttrNumber) 9, "pinning_backends",
 						   INT4OID, -1, 0);
+	if (expected_tupledesc->natts == NUM_BUFFERCACHE_PAGES_ELEM)
+		TupleDescInitEntry(tupledesc, (AttrNumber) 10, "node_id",
+						   INT4OID, -1, 0);
 
 	fctx->tupdesc = BlessTupleDesc(tupledesc);
 
@@ -140,7 +174,7 @@ pg_buffercache_init_entries(FuncCallContext *funcctx, FunctionCallInfo fcinfo)
 }
 
 /*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
  *
  * Save buffer cache information for a single buffer.
  */
@@ -175,11 +209,13 @@ pg_buffercache_save_tuple(int record_id, BufferCachePagesContext *fctx)
 	else
 		bufRecord->isvalid = false;
 
+	bufRecord->numa_node_id = -1;
+
 	UnlockBufHdr(bufHdr, buf_state);
 }
 
 /*
- * Helper routine for pg_buffercache_pages().
+ * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
  *
  * Format and return a tuple for a single buffer cache entry.
  */
@@ -214,6 +250,7 @@ get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
 		 * unused for v1.0 callers, but the array is always long enough
 		 */
 		values[8] = Int32GetDatum(bufRecord->pinning_backends);
+		values[9] = Int32GetDatum(bufRecord->numa_node_id);
 	}
 
 	/* Build and return the tuple. */
@@ -263,6 +300,184 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 		SRF_RETURN_DONE(funcctx);
 }
 
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inquiry about memory mappings.
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	BufferCachePagesContext *fctx;	/* User function context. */
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		int			i;
+		Size		os_page_size = 0;
+		void	  **os_page_ptrs = NULL;
+		int		   *os_pages_status = NULL;
+		uint64		os_page_query_count = 0;
+		int			pages_per_buffer = 0;
+		int			buffers_per_page = 0;
+
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		if (pg_numa_init() == -1)
+			elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+		fctx = pg_buffercache_init_entries(funcctx, fcinfo);
+
+		/*
+		 * Different database block sizes (1kB, 2kB, ..., 32kB) can be used,
+		 * while the OS may have different memory page sizes.
+		 *
+		 * To correctly map between them, we need to: (1) determine the OS
+		 * memory page size, (2) calculate how many OS pages are used by all
+		 * buffer blocks, and (3) calculate how many OS pages are contained
+		 * within each database block.
+		 *
+		 * This information is needed before calling move_pages() for NUMA
+		 * node id inquiry.
+		 */
+		os_page_size = pg_numa_get_pagesize();
+		buffers_per_page = os_page_size / BLCKSZ;
+		pages_per_buffer = BLCKSZ / os_page_size;
+
+		/*
+		 * How many addresses we are going to query (store) depends on the
+		 * relation between BLCKSZ : PAGESIZE.
+		 */
+		if (buffers_per_page > 1)
+			os_page_query_count = NBuffers;
+		else
+			os_page_query_count = NBuffers * pages_per_buffer;
+
+		elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
+			 NBuffers, os_page_query_count, os_page_size, buffers_per_page, pages_per_buffer);
+
+		os_page_ptrs = palloc0(sizeof(void *) * os_page_query_count);
+		os_pages_status = palloc(sizeof(int) * os_page_query_count);
+
+		/*
+		 * If we ever get 0xff back from the kernel inquiry, we probably have a
+		 * bug in our buffer to OS page mapping code here.
+		 *
+		 */
+		memset(os_pages_status, 0xff, sizeof(int) * os_page_query_count);
+
+		if (firstNumaTouch)
+			elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+		/*
+		 * Scan through all the buffers, saving the relevant fields in the
+		 * fctx->record structure.
+		 *
+		 * We don't hold the partition locks, so we don't get a consistent
+		 * snapshot across all buffers, but we do grab the buffer header
+		 * locks, so the information of each buffer is self-consistent.
+		 *
+		 * This loop touches and stores addresses into os_page_ptrs[] as input
+		 * to one big move_pages(2) inquiry system call. Basically we ask
+		 * for all memory pages for NBuffers.
+		 */
+		for (i = 0; i < NBuffers; i++)
+		{
+			int			j;
+			volatile uint64 touch pg_attribute_unused();
+
+			pg_buffercache_save_tuple(i, fctx);
+
+			/*
+			 * BLCKSZ >= PAGESIZE: If a buffer spans one or more OS pages, we
+			 * query all of them for NUMA information. This loop won't run for
+			 * BLCKSZ < PAGESIZE.
+			 */
+			for (j = 0; j < pages_per_buffer; j++)
+			{
+				size_t		idx = (size_t) (i * pages_per_buffer) + j;
+
+				/* Buffer numbers start from 1 */
+				os_page_ptrs[idx] = (char *) BufferGetBlock(i + 1) + (os_page_size * j);
+
+				/* Only need to touch memory once per backend process lifetime */
+				if (firstNumaTouch)
+					pg_numa_touch_mem_if_required(touch, os_page_ptrs[idx]);
+			}
+
+			/* otherwise BLCKSZ < PAGESIZE: one page hosts many Buffers */
+			if (buffers_per_page > 1)
+			{
+				/*
+				 * Although we could query just once per OS page, we do it
+				 * repeatedly for each buffer and hit the same address, as
+				 * move_pages(2) requires page alignment. This also
+				 * simplifies the retrieval code later on.
+				 */
+				os_page_ptrs[i] = (char *) TYPEALIGN_DOWN(os_page_size,
+														  (char *) BufferGetBlock(i + 1));
+
+				/* Only need to touch memory once per backend process lifetime */
+				if (firstNumaTouch)
+					pg_numa_touch_mem_if_required(touch, os_page_ptrs[i]);
+			}
+
+			CHECK_FOR_INTERRUPTS();
+		}
+
+		if (pg_numa_query_pages(0, os_page_query_count, os_page_ptrs, os_pages_status) == -1)
+			elog(ERROR, "failed NUMA pages inquiry: %m");
+
+		/*
+		 * Once we have our NUMA information we resolve memory pointers back
+		 * to Buffers
+		 */
+		for (i = 0; i < NBuffers; i++)
+		{
+			size_t		idx;
+
+			/*
+			 * Note: We could check for errors in os_pages_status and report
+			 * them. Again, a single DB block might span multiple NUMA nodes
+			 * if it crosses OS pages on node boundaries, but we only record
+			 * the node of the first page. This is a simplification but should
+			 * be sufficient for most analyses.
+			 */
+
+			if (buffers_per_page > 1)
+				idx = i;
+			else
+			{
+				/*
+				 * XXX: BLCKSZ >= PAGESIZE: return the node id for this buffer
+				 * based only on the FIRST OS page. We could do something
+				 * else with this.
+				 */
+				idx = i * pages_per_buffer;
+			}
+			fctx->record[i].numa_node_id = os_pages_status[idx];
+		}
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+
+	/* Get the saved state */
+	fctx = funcctx->user_fctx;
+
+	if (funcctx->call_cntr < funcctx->max_calls)
+	{
+		Datum		result;
+		uint32		i = funcctx->call_cntr;
+
+		result = get_buffercache_tuple(i, fctx);
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+	else
+	{
+		firstNumaTouch = false;
+		SRF_RETURN_DONE(funcctx);
+	}
+}
+
 Datum
 pg_buffercache_summary(PG_FUNCTION_ARGS)
 {
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..2225b879f58
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,20 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+select count(*) = (select setting::bigint
+                   from pg_settings
+                   where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..315227bf0ce 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
  <para>
   This module provides the <function>pg_buffercache_pages()</function>
   function (wrapped in the <structname>pg_buffercache</structname> view),
-  the <function>pg_buffercache_summary()</function> function, the
+  <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+  <structname>pg_buffercache_numa</structname> view), the
+  <function>pg_buffercache_summary()</function> function, the
   <function>pg_buffercache_usage_counts()</function> function and
   the <function>pg_buffercache_evict()</function> function.
  </para>
@@ -42,6 +44,14 @@
   convenient use.
  </para>
 
+ <para>
+  The <function>pg_buffercache_numa_pages()</function> function provides the same
+  information as <function>pg_buffercache_pages()</function>, but is slower because
+  it also provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
+  The <structname>pg_buffercache_numa</structname> view wraps the function for
+  convenient use.
+ </para>
+
  <para>
   The <function>pg_buffercache_summary()</function> function returns a single
   row summarizing the state of the shared buffer cache.
@@ -200,6 +210,55 @@
   </para>
  </sect2>
 
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+  <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+  <para>
+   The definitions of the columns exposed are identical to the
+   <structname>pg_buffercache</structname> view, except that this one includes
+   one additional <structfield>node_id</structfield> column as defined in
+   <xref linkend="pgbuffercache-numa-columns"/>.
+  </para>
+
+  <table id="pgbuffercache-numa-columns">
+   <title><structname>pg_buffercache_numa</structname> Extra column</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>node_id</structfield> <type>integer</type>
+      </para>
+      <para>
+       <acronym>NUMA</acronym> node ID. NULL if the shared buffer
+       has not been used yet. On systems without <acronym>NUMA</acronym> support
+       this returns 0.
+      </para></entry>
+     </row>
+
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   Because determining the <acronym>NUMA</acronym> node ID for each page requires
+   the memory pages to be paged in, the first execution of this function can take
+   a noticeable amount of time. In all cases (first execution or not), retrieving
+   this information is costly, and querying the view at a high frequency is not recommended.
+  </para>
+
+ </sect2>
+
  <sect2 id="pgbuffercache-summary">
   <title>The <function>pg_buffercache_summary()</function> Function</title>
 
-- 
2.49.0

From 2df0a06b206dedfebd6f4ef7f00eed15edbdee53 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 3 Apr 2025 12:43:21 +0200
Subject: [PATCH v22 5/7] review

---
 contrib/pg_buffercache/pg_buffercache_pages.c | 31 +++++++++++++------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 3460cf579f7..dc200204478 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -90,7 +90,7 @@ pg_buffercache_numa_prepare_ptrs(int buffer_id, double pages_per_blk,
 								 Size os_page_size,
 								 void **os_page_ptrs)
 {
-
+	/* ??? */
 	/* XXX: move it here? */
 }
 #endif
@@ -343,14 +343,28 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 		buffers_per_page = os_page_size / BLCKSZ;
 		pages_per_buffer = BLCKSZ / os_page_size;
 
+		/*
+		 * The page and block sizes are expected to be 2^k, so one divides the
+		 * other (we don't know in which direction).
+		 */
+		Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+		/*
+		 * Either both counts are 1 (when the pages have the same size), or
+		 * exactly one of them is zero. Both can't be zero at the same time.
+		 */
+		Assert((buffers_per_page > 0) || (pages_per_buffer > 0));
+		Assert(((buffers_per_page == 1) && (pages_per_buffer == 1)) ||
+			   ((buffers_per_page == 0) || (pages_per_buffer == 0)));
+
 		/*
 		 * How many addresses we are going to query (store) depends on the
-		 * relation between BLCKSZ : PAGESIZE.
+		 * relation between BLCKSZ : PAGESIZE. We need at least one status
+		 * per buffer - if the memory page is larger than buffer, we still
+		 * query it for each buffer. With multiple memory pages per buffer,
+		 * we need that many entries.
 		 */
-		if (buffers_per_page > 1)
-			os_page_query_count = NBuffers;
-		else
-			os_page_query_count = NBuffers * pages_per_buffer;
+		os_page_query_count = Max(NBuffers, NBuffers * pages_per_buffer);
 
 		elog(DEBUG1, "NUMA: NBuffers=%d os_page_query_count=" UINT64_FORMAT " os_page_size=%zu buffers_per_page=%d pages_per_buffer=%d",
 			 NBuffers, os_page_query_count, os_page_size, buffers_per_page, pages_per_buffer);
@@ -361,7 +375,6 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 		/*
 		 * If we ever get 0xff back from kernel inquiry, then we probably have
 		 * bug in our buffers to OS page mapping code here.
-		 *
 		 */
 		memset(os_pages_status, 0xff, sizeof(int) * os_page_query_count);
 
@@ -410,8 +423,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 				/*
 				 * Altough we could query just once per each OS page, we do it
 				 * repeatably for each Buffer and hit the same address as
-				 * move_pages(2) requires page aligment. This is also
-				 * simplifies retrieval code later on.
+				 * move_pages(2) requires page alignment. This also simplifies
+				 * retrieval code later on.
 				 */
 				os_page_ptrs[i] = (char *) TYPEALIGN(os_page_size,
 													 (char *) BufferGetBlock(i + 1));
-- 
2.49.0
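
For readers following the integer arithmetic in pg_buffercache_numa_pages(),
the sizing logic can be sketched as a standalone C function. This is only an
illustration, not code from the patch: the name numa_query_count() and its
parameters are hypothetical, and it takes the block size and OS page size as
arguments instead of using BLCKSZ and pg_numa_get_pagesize().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): how many addresses have to
 * be queried for nbuffers buffers, given the block and OS page sizes.
 */
uint64_t
numa_query_count(uint64_t nbuffers, size_t blcksz, size_t os_page_size)
{
	/* integer division: the smaller/larger ratio truncates to zero */
	uint64_t	buffers_per_page = os_page_size / blcksz;
	uint64_t	pages_per_buffer = blcksz / os_page_size;
	uint64_t	per_buffer;

	/* both sizes are powers of two, so one divides the other */
	assert(os_page_size % blcksz == 0 || blcksz % os_page_size == 0);

	/* either both counts are 1 (equal sizes), or exactly one is zero */
	assert((buffers_per_page == 1 && pages_per_buffer == 1) ||
		   ((buffers_per_page == 0) != (pages_per_buffer == 0)));

	/* at least one status per buffer, more if a buffer spans several pages */
	per_buffer = (pages_per_buffer > 1) ? pages_per_buffer : 1;

	return nbuffers * per_buffer;
}
```

With 16384 buffers of 8kB, this gives 32768 addresses for 4kB OS pages and
16384 addresses for 2MB huge pages, which matches the
Max(NBuffers, NBuffers * pages_per_buffer) expression in the hunk above.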

From 58e17af7c48fd6eeafcff9523ecdacbd53e90ede Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 21 Feb 2025 14:20:18 +0100
Subject: [PATCH v22 6/7] Add new pg_shmem_numa_allocations view

Introduce a new pg_shmem_numa_allocations view that shows how shared memory
allocations are distributed across NUMA nodes.

Author: Jakub Wartak <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Bertrand Drouvot <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
 doc/src/sgml/system-views.sgml           |  79 ++++++++++++++
 src/backend/catalog/system_views.sql     |   8 ++
 src/backend/storage/ipc/shmem.c          | 130 +++++++++++++++++++++++
 src/include/catalog/pg_proc.dat          |   8 ++
 src/test/regress/expected/numa.out       |  12 +++
 src/test/regress/expected/numa_1.out     |   3 +
 src/test/regress/expected/privileges.out |  16 ++-
 src/test/regress/expected/rules.out      |   4 +
 src/test/regress/parallel_schedule       |   2 +-
 src/test/regress/sql/numa.sql            |   9 ++
 src/test/regress/sql/privileges.sql      |   6 +-
 11 files changed, 272 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regress/expected/numa.out
 create mode 100644 src/test/regress/expected/numa_1.out
 create mode 100644 src/test/regress/sql/numa.sql

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..6bb5c8a5669 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
       <entry>shared memory allocations</entry>
      </row>
 
+     <row>
+      <entry><link linkend="view-pg-shmem-numa-allocations"><structname>pg_shmem_numa_allocations</structname></link></entry>
+      <entry>NUMA node mappings for shared memory allocations</entry>
+     </row>
+
      <row>
       <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
       <entry>planner statistics</entry>
@@ -4051,6 +4056,80 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   </para>
  </sect1>
 
+ <sect1 id="view-pg-shmem-numa-allocations">
+  <title><structname>pg_shmem_numa_allocations</structname></title>
+
+  <indexterm zone="view-pg-shmem-numa-allocations">
+   <primary>pg_shmem_numa_allocations</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_shmem_numa_allocations</structname> view shows how shared
+   memory allocations in the server's main shared memory segment are distributed
+   across NUMA nodes. This includes both memory allocated by
+   <productname>PostgreSQL</productname> itself and memory allocated
+   by extensions using the mechanisms detailed in
+   <xref linkend="xfunc-shared-addin" />.
+  </para>
+
+  <para>
+   Note that this view does not include memory allocated using the dynamic
+   shared memory infrastructure.
+  </para>
+
+  <table>
+   <title><structname>pg_shmem_numa_allocations</structname> Columns</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>name</structfield> <type>text</type>
+      </para>
+      <para>
+       The name of the shared memory allocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>node_id</structfield> <type>int4</type>
+      </para>
+      <para>
+      ID of <acronym>NUMA</acronym> node
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>size</structfield> <type>int8</type>
+      </para>
+      <para>
+       Size of the allocation on this particular NUMA memory node in bytes
+      </para></entry>
+     </row>
+
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   By default, the <structname>pg_shmem_numa_allocations</structname> view can be
+   read only by superusers or roles with privileges of the
+   <literal>pg_read_all_stats</literal> role.
+  </para>
+ </sect1>
+
  <sect1 id="view-pg-stats">
   <title><structname>pg_stats</structname></title>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..52ab03a37be 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
 REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
 
+CREATE VIEW pg_shmem_numa_allocations AS
+    SELECT * FROM pg_get_shmem_numa_allocations();
+
+REVOKE ALL ON pg_shmem_numa_allocations FROM PUBLIC;
+GRANT SELECT ON pg_shmem_numa_allocations TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_numa_allocations() TO pg_read_all_stats;
+
 CREATE VIEW pg_backend_memory_contexts AS
     SELECT * FROM pg_get_backend_memory_contexts();
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index e453f856794..36d89a58783 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "miscadmin.h"
+#include "port/pg_numa.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
 #include "storage/shmem.h"
@@ -90,6 +91,8 @@ slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
 
 static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
 
 /*
  *	InitShmemAccess() --- set up basic pointers to shared memory.
@@ -570,3 +573,130 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	return (Datum) 0;
 }
+
+/* SQL SRF showing NUMA memory nodes for allocated shared memory */
+Datum
+pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	HASH_SEQ_STATUS hstat;
+	ShmemIndexEnt *ent;
+	Datum		values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+	bool		nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+	Size		os_page_size;
+	void	  **page_ptrs;
+	int		   *pages_status;
+	uint64		shm_total_page_count,
+				shm_ent_page_count,
+				max_nodes;
+	Size	   *nodes;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	if (pg_numa_init() == -1)
+	{
+		elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+		return (Datum) 0;
+	}
+	max_nodes = pg_numa_get_max_node();
+	nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+	/*
+	 * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+	 * the OS may have different memory page sizes.
+	 *
+	 * To correctly map between them, we need to: 1. Determine the OS memory
+	 * page size 2. Calculate how many OS pages are used by all buffer blocks
+	 * 3. Calculate how many OS pages are contained within each database
+	 * block.
+	 *
+	 * This information is needed before calling move_pages() for NUMA memory
+	 * node inquiry.
+	 */
+	os_page_size = pg_numa_get_pagesize();
+
+	/*
+	 * Allocate memory for page pointers and status based on total shared
+	 * memory size. This simplified approach allocates enough space for all
+	 * pages in shared memory rather than calculating the exact requirements
+	 * for each segment.
+	 */
+	shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
+	page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+	pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+	if (firstNumaTouch)
+		elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+	LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+	hash_seq_init(&hstat, ShmemIndex);
+
+	/* output all allocated entries */
+	memset(nulls, 0, sizeof(nulls));
+	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+	{
+		int			i;
+
+		/* Get number of OS-aligned pages */
+		shm_ent_page_count = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
+
+		/*
+		 * If we ever get 0xff back from kernel inquiry, then we probably have
+		 * bug in our buffers to OS page mapping code here.
+		 */
+		memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+		for (i = 0; i < shm_ent_page_count; i++)
+		{
+			/*
+			 * In order to get reliable results we also need to touch memory
+			 * pages, so that inquiry about NUMA memory node doesn't return -2
+			 * (which indicates unmapped/unallocated pages).
+			 */
+			volatile uint64 touch pg_attribute_unused();
+
+			page_ptrs[i] = (char *) ent->location + (i * os_page_size);
+			if (firstNumaTouch)
+				pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+			CHECK_FOR_INTERRUPTS();
+		}
+
+		if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+			elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+		memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+		/* Count the pages on each NUMA node for this shared memory entry */
+		for (i = 0; i < shm_ent_page_count; i++)
+		{
+			int			s = pages_status[i];
+
+			/* Ensure we are adding only valid index to the array */
+			if (s >= 0 && s <= max_nodes)
+				nodes[s]++;
+		}
+
+		for (i = 0; i <= max_nodes; i++)
+		{
+			values[0] = CStringGetTextDatum(ent->key);
+			values[1] = Int32GetDatum(i);
+			values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+			tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+								 values, nulls);
+		}
+	}
+
+	/*
+	 * We are ignoring the following memory regions (as compared to
+	 * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+	 * counted via the shmem index 2. output as-of-yet unused shared memory.
+	 */
+
+	LWLockRelease(ShmemIndexLock);
+	firstNumaTouch = false;
+
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 63859661951..72efe8df667 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8522,6 +8522,14 @@
   proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_numa_available' },
 
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+  proname => 'pg_get_shmem_numa_allocations', prorows => '50', proretset => 't',
+  provolatile => 'v', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+  proargnames => '{name,node_id,size}',
+  prosrc => 'pg_get_shmem_numa_allocations' },
+
 # memory context of local backend
 { oid => '2282',
   descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..fb882c5b771
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,12 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
+ ok 
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 5588d83e1bf..f66cf1bbfbd 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3127,8 +3127,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
 -- clean up
 DROP TABLE lock_table;
 DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
 -- switch to superuser
 \c -
 CREATE ROLE regress_readallstats;
@@ -3150,6 +3150,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
  f
 (1 row)
 
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
+ has_table_privilege 
+---------------------
+ f
+(1 row)
+
 GRANT pg_read_all_stats TO regress_readallstats;
 SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
  has_table_privilege 
@@ -3169,6 +3175,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
  t
 (1 row)
 
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
+ has_table_privilege 
+---------------------
+ t
+(1 row)
+
 -- run query to ensure that functions within views can be executed
 SET ROLE regress_readallstats;
 SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..8b5862cb11a 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
     size,
     allocated_size
    FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_numa_allocations| SELECT name,
+    node_id,
+    size
+   FROM pg_get_shmem_numa_allocations() pg_get_shmem_numa_allocations(name, node_id, size);
 pg_stat_activity| SELECT s.datid,
     d.datname,
     s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..fddb21a260a
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,9 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 286b1d03756..ca51dfd7702 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1911,8 +1911,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
 DROP TABLE lock_table;
 DROP USER regress_locktable_user;
 
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_numa_allocations and pg_backend_memory_contexts.
 
 -- switch to superuser
 \c -
@@ -1922,12 +1922,14 @@ CREATE ROLE regress_readallstats;
 SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
 SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
 SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- no
 
 GRANT pg_read_all_stats TO regress_readallstats;
 
 SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
 SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
 SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_numa_allocations','SELECT'); -- yes
 
 -- run query to ensure that functions within views can be executed
 SET ROLE regress_readallstats;
-- 
2.49.0

From 13d7cd9087f64b0393f51b525d5267bbe57ce837 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 3 Apr 2025 12:50:18 +0200
Subject: [PATCH v22 7/7] review

---
 src/backend/storage/ipc/shmem.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 36d89a58783..f711e7411db 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -673,7 +673,11 @@ pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
 		{
 			int			s = pages_status[i];
 
-			/* Ensure we are adding only valid index to the array */
+			/* Ensure we are adding only valid index to the array
+			 *
+			 * XXX I think we should just error-out if this is untrue, so that
+			 * we don't silently hide issues.
+			 */
 			if (s >= 0 && s <= max_nodes)
 				nodes[s]++;
 		}
-- 
2.49.0
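
Regarding the XXX comment in 7/7, here is a minimal sketch of what "error out
instead of silently skipping" could look like for the per-node tally. It
assumes that any status outside 0..max_node is unexpected. The function name
tally_pages_per_node() and the return-code convention are hypothetical; in
the backend the failure would presumably be reported with elog(ERROR) rather
than a return value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): count pages per NUMA node.
 *
 * status[] holds one node ID per page, nodes[] must have max_node + 1
 * entries. Returns 0 on success, or -1 as soon as a status value is not a
 * valid node ID, so that the caller can report it.
 */
int
tally_pages_per_node(const int *status, size_t npages,
					 int max_node, size_t *nodes)
{
	assert(max_node >= 0);

	memset(nodes, 0, sizeof(size_t) * ((size_t) max_node + 1));

	for (size_t i = 0; i < npages; i++)
	{
		int			s = status[i];

		/* negative (error) or out-of-range status: do not hide it */
		if (s < 0 || s > max_node)
			return -1;

		nodes[s]++;
	}

	return 0;
}
```

The caller would multiply each nodes[] entry by the OS page size to get the
bytes per node, as the loop filling values[2] does in 6/7.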
