On 4/7/25 20:11, Bertrand Drouvot wrote:
> Hi,
> 
> On Mon, Apr 07, 2025 at 12:42:21PM -0400, Andres Freund wrote:
>> Hi,
>>
>> On 2025-04-07 18:36:24 +0200, Tomas Vondra wrote:
>>
>> I was thinking of checking if the BufferDesc indicates BM_VALID or
>> BM_TAG_VALID.
> 
> Yeah, that's what I did propose in [1] (when we were speaking about 
> get_mempolicy())
> and I think that would make sense as future improvement.
> 
>>
>>
>>> I think we need to decide whether the current patches are good enough
>>> for PG18, with the current behavior, and then maybe improve that in
>>> PG19.
>>
>> I think as long as the docs mention this with <note> or <warning> it's ok for
>> now.
> 
> +1
> 
> A few comments on v27:
> 
> === 1
> 
> pg_buffercache_numa() reports the node ID as "nodeid" while 
> pg_shmem_allocations_numa()
> reports it as node_id. Maybe we should use the same "naming" in both.
> 

This was renamed in v28 to "numa_node" in both parts.

> === 2
> 
> postgres=# select count(*) from pg_buffercache;
>  count
> -------
>  65536
> (1 row)
> 
> but 
> 
> postgres=# select count(*) from pg_buffercache_numa;
>  count
> -------
>     64
> (1 row)
> 
> with:
> 
> postgres=# show block_size;
>  block_size
> ------------
>  2048
> 
> and Hugepagesize:       2048 kB.
> 
> and
> 
> postgres=#  show shared_buffers;
>  shared_buffers
> ----------------
>  128MB
> (1 row)
> 
> And even if for testing I set:
> 
> -               funcctx->max_calls = idx;
> +               funcctx->max_calls = 65536;
> 
> then I start to see weird results:
> 
> postgres=# select count(*) from pg_buffercache_numa where bufferid not in 
> (select bufferid from pg_buffercache);
>  count
> -------
>  65472
> (1 row)
> 
> So it looks like that the new way to iterate on the buffers that has been 
> introduced
> in v26/v27 has some issue?
> 

Yeah, the calculations of the end pointers were wrong - we need to round
up (using TYPEALIGN()) when calculating number of pages, and just add
BLCKSZ (without any rounding) when calculating end of buffer. The 0004
fixes this for me (I tried this with various blocksizes / page sizes).

Thanks for noticing this!


regards

-- 
Tomas Vondra
From eac1dc9acbcbde8746d4fa48bc6f5647fd7dfa50 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@postgresql.org>
Date: Mon, 7 Apr 2025 17:31:17 +0200
Subject: [PATCH v29 1/4] Add support for basic NUMA awareness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add basic NUMA awareness routines, using a minimal src/port/pg_numa.c
portability wrapper and an optional build dependency, enabled by
--with-libnuma configure option. For now this is Linux-only, other
platforms may be supported later.

A built-in SQL function pg_numa_available() allows checking NUMA
support, i.e. that the server was built/linked with NUMA library.

The main function introduced is pg_numa_query_pages(), which allows
determining NUMA node for individual memory pages. Internally the
function uses move_pages(2) syscall, as it allows batching, and is
more efficient than get_mempolicy(2).

Author: Jakub Wartak <jakub.war...@enterprisedb.com>
Co-authored-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Reviewed-by: Andres Freund <and...@anarazel.de>
Reviewed-by: Álvaro Herrera <alvhe...@alvh.no-ip.org>
Reviewed-by: Tomas Vondra <to...@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
 .cirrus.tasks.yml                   |   2 +
 configure                           | 187 ++++++++++++++++++++++++++++
 configure.ac                        |  14 +++
 doc/src/sgml/func.sgml              |  13 ++
 doc/src/sgml/installation.sgml      |  22 ++++
 meson.build                         |  23 ++++
 meson_options.txt                   |   3 +
 src/Makefile.global.in              |   6 +-
 src/backend/utils/misc/guc_tables.c |   2 +-
 src/include/catalog/pg_proc.dat     |   4 +
 src/include/pg_config.h.in          |   3 +
 src/include/port/pg_numa.h          |  40 ++++++
 src/include/storage/pg_shmem.h      |   1 +
 src/makefiles/meson.build           |   3 +
 src/port/Makefile                   |   1 +
 src/port/meson.build                |   1 +
 src/port/pg_numa.c                  | 120 ++++++++++++++++++
 17 files changed, 443 insertions(+), 2 deletions(-)
 create mode 100644 src/include/port/pg_numa.h
 create mode 100644 src/port/pg_numa.c

diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
index 86a1fa9bbdb..6f4f5c674a1 100644
--- a/.cirrus.tasks.yml
+++ b/.cirrus.tasks.yml
@@ -471,6 +471,7 @@ task:
             --enable-cassert --enable-injection-points --enable-debug \
             --enable-tap-tests --enable-nls \
             --with-segsize-blocks=6 \
+            --with-libnuma \
             --with-liburing \
             \
             ${LINUX_CONFIGURE_FEATURES} \
@@ -523,6 +524,7 @@ task:
             -Dllvm=disabled \
             --pkg-config-path /usr/lib/i386-linux-gnu/pkgconfig/ \
             -DPERL=perl5.36-i386-linux-gnu \
+            -Dlibnuma=disabled \
             build-32
         EOF
 
diff --git a/configure b/configure
index 11615d1122d..e27badd83c3 100755
--- a/configure
+++ b/configure
@@ -708,6 +708,9 @@ XML2_LIBS
 XML2_CFLAGS
 XML2_CONFIG
 with_libxml
+LIBNUMA_LIBS
+LIBNUMA_CFLAGS
+with_libnuma
 LIBCURL_LIBS
 LIBCURL_CFLAGS
 with_libcurl
@@ -872,6 +875,7 @@ with_liburing
 with_uuid
 with_ossp_uuid
 with_libcurl
+with_libnuma
 with_libxml
 with_libxslt
 with_system_tzdata
@@ -906,6 +910,8 @@ LIBURING_CFLAGS
 LIBURING_LIBS
 LIBCURL_CFLAGS
 LIBCURL_LIBS
+LIBNUMA_CFLAGS
+LIBNUMA_LIBS
 XML2_CONFIG
 XML2_CFLAGS
 XML2_LIBS
@@ -1588,6 +1594,7 @@ Optional Packages:
   --with-uuid=LIB         build contrib/uuid-ossp using LIB (bsd,e2fs,ossp)
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libcurl          build with libcurl support
+  --with-libnuma          build with libnuma support
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
   --with-system-tzdata=DIR
@@ -1629,6 +1636,10 @@ Some influential environment variables:
               C compiler flags for LIBCURL, overriding pkg-config
   LIBCURL_LIBS
               linker flags for LIBCURL, overriding pkg-config
+  LIBNUMA_CFLAGS
+              C compiler flags for LIBNUMA, overriding pkg-config
+  LIBNUMA_LIBS
+              linker flags for LIBNUMA, overriding pkg-config
   XML2_CONFIG path to xml2-config utility
   XML2_CFLAGS C compiler flags for XML2, overriding pkg-config
   XML2_LIBS   linker flags for XML2, overriding pkg-config
@@ -9063,6 +9074,182 @@ $as_echo "$as_me: WARNING: *** OAuth support tests require --with-python to run"
 fi
 
 
+#
+# libnuma
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with libnuma support" >&5
+$as_echo_n "checking whether to build with libnuma support... " >&6; }
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+  withval=$with_libnuma;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libnuma=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_libnuma" >&5
+$as_echo "$with_libnuma" >&6; }
+
+
+if test "$with_libnuma" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_available in -lnuma" >&5
+$as_echo_n "checking for numa_available in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_available+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_available ();
+int
+main ()
+{
+return numa_available ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_numa_numa_available=yes
+else
+  ac_cv_lib_numa_numa_available=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_available" >&5
+$as_echo "$ac_cv_lib_numa_numa_available" >&6; }
+if test "x$ac_cv_lib_numa_numa_available" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+  LIBS="-lnuma $LIBS"
+
+else
+  as_fn_error $? "library 'libnuma' is required for NUMA support" "$LINENO" 5
+fi
+
+
+pkg_failed=no
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa" >&5
+$as_echo_n "checking for numa... " >&6; }
+
+if test -n "$LIBNUMA_CFLAGS"; then
+    pkg_cv_LIBNUMA_CFLAGS="$LIBNUMA_CFLAGS"
+ elif test -n "$PKG_CONFIG"; then
+    if test -n "$PKG_CONFIG" && \
+    { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+  ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; then
+  pkg_cv_LIBNUMA_CFLAGS=`$PKG_CONFIG --cflags "numa" 2>/dev/null`
+		      test "x$?" != "x0" && pkg_failed=yes
+else
+  pkg_failed=yes
+fi
+ else
+    pkg_failed=untried
+fi
+if test -n "$LIBNUMA_LIBS"; then
+    pkg_cv_LIBNUMA_LIBS="$LIBNUMA_LIBS"
+ elif test -n "$PKG_CONFIG"; then
+    if test -n "$PKG_CONFIG" && \
+    { { $as_echo "$as_me:${as_lineno-$LINENO}: \$PKG_CONFIG --exists --print-errors \"numa\""; } >&5
+  ($PKG_CONFIG --exists --print-errors "numa") 2>&5
+  ac_status=$?
+  $as_echo "$as_me:${as_lineno-$LINENO}: \$? = $ac_status" >&5
+  test $ac_status = 0; }; then
+  pkg_cv_LIBNUMA_LIBS=`$PKG_CONFIG --libs "numa" 2>/dev/null`
+		      test "x$?" != "x0" && pkg_failed=yes
+else
+  pkg_failed=yes
+fi
+ else
+    pkg_failed=untried
+fi
+
+
+
+if test $pkg_failed = yes; then
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+
+if $PKG_CONFIG --atleast-pkgconfig-version 0.20; then
+        _pkg_short_errors_supported=yes
+else
+        _pkg_short_errors_supported=no
+fi
+        if test $_pkg_short_errors_supported = yes; then
+	        LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --short-errors --print-errors --cflags --libs "numa" 2>&1`
+        else
+	        LIBNUMA_PKG_ERRORS=`$PKG_CONFIG --print-errors --cflags --libs "numa" 2>&1`
+        fi
+	# Put the nasty error message in config.log where it belongs
+	echo "$LIBNUMA_PKG_ERRORS" >&5
+
+	as_fn_error $? "Package requirements (numa) were not met:
+
+$LIBNUMA_PKG_ERRORS
+
+Consider adjusting the PKG_CONFIG_PATH environment variable if you
+installed software in a non-standard prefix.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details." "$LINENO" 5
+elif test $pkg_failed = untried; then
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: no" >&5
+$as_echo "no" >&6; }
+	{ { $as_echo "$as_me:${as_lineno-$LINENO}: error: in \`$ac_pwd':" >&5
+$as_echo "$as_me: error: in \`$ac_pwd':" >&2;}
+as_fn_error $? "The pkg-config script could not be found or is too old.  Make sure it
+is in your PATH or set the PKG_CONFIG environment variable to the full
+path to pkg-config.
+
+Alternatively, you may set the environment variables LIBNUMA_CFLAGS
+and LIBNUMA_LIBS to avoid the need to call pkg-config.
+See the pkg-config man page for more details.
+
+To get pkg-config, see <http://pkg-config.freedesktop.org/>.
+See \`config.log' for more details" "$LINENO" 5; }
+else
+	LIBNUMA_CFLAGS=$pkg_cv_LIBNUMA_CFLAGS
+	LIBNUMA_LIBS=$pkg_cv_LIBNUMA_LIBS
+        { $as_echo "$as_me:${as_lineno-$LINENO}: result: yes" >&5
+$as_echo "yes" >&6; }
+
+fi
+fi
+
 #
 # XML
 #
diff --git a/configure.ac b/configure.ac
index debdf165044..d365a486d3d 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1053,6 +1053,20 @@ if test "$with_libcurl" = yes ; then
 fi
 
 
+#
+# libnuma
+#
+AC_MSG_CHECKING([whether to build with libnuma support])
+PGAC_ARG_BOOL(with, libnuma, no, [build with libnuma support],
+              [AC_DEFINE([USE_LIBNUMA], 1, [Define to build with NUMA support. (--with-libnuma)])])
+AC_MSG_RESULT([$with_libnuma])
+AC_SUBST(with_libnuma)
+
+if test "$with_libnuma" = yes ; then
+  AC_CHECK_LIB(numa,    numa_available, [], [AC_MSG_ERROR([library 'libnuma' is required for NUMA support])])
+  PKG_CHECK_MODULES(LIBNUMA, numa)
+fi
+
 #
 # XML
 #
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 0224f93733d..9ab070adffb 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -25143,6 +25143,19 @@ SELECT * FROM pg_ls_dir('.') WITH ORDINALITY AS t(ls,n);
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
+        <indexterm>
+         <primary>pg_numa_available</primary>
+        </indexterm>
+        <function>pg_numa_available</function> ()
+        <returnvalue>boolean</returnvalue>
+       </para>
+       <para>
+        Returns true if the server has been compiled with <acronym>NUMA</acronym> support.
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <indexterm>
diff --git a/doc/src/sgml/installation.sgml b/doc/src/sgml/installation.sgml
index cc28f041330..077bcc20759 100644
--- a/doc/src/sgml/installation.sgml
+++ b/doc/src/sgml/installation.sgml
@@ -1156,6 +1156,17 @@ build-postgresql:
        </listitem>
       </varlistentry>
 
+      <varlistentry id="configure-option-with-libnuma">
+       <term><option>--with-libnuma</option></term>
+       <listitem>
+        <para>
+         Build with libnuma support for basic NUMA support.
+         Only supported on platforms for which the <productname>libnuma</productname>
+         library is implemented.
+        </para>
+       </listitem>
+      </varlistentry>
+
       <varlistentry id="configure-option-with-liburing">
        <term><option>--with-liburing</option></term>
        <listitem>
@@ -2645,6 +2656,17 @@ ninja install
       </listitem>
      </varlistentry>
 
+     <varlistentry id="configure-with-libnuma-meson">
+      <term><option>-Dlibnuma={ auto | enabled | disabled }</option></term>
+      <listitem>
+       <para>
+        Build with libnuma support for basic NUMA support.
+        Only supported on platforms for which the <productname>libnuma</productname>
+        library is implemented. The default for this option is auto.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="configure-with-libxml-meson">
       <term><option>-Dlibxml={ auto | enabled | disabled }</option></term>
       <listitem>
diff --git a/meson.build b/meson.build
index 454ed81f5ea..0a625047f33 100644
--- a/meson.build
+++ b/meson.build
@@ -943,6 +943,27 @@ else
 endif
 
 
+###############################################################
+# Library: libnuma
+###############################################################
+
+libnumaopt = get_option('libnuma')
+if not libnumaopt.disabled()
+  # via pkg-config
+  libnuma = dependency('numa', required: false)
+  if not libnuma.found()
+    libnuma = cc.find_library('numa', required: libnumaopt)
+  endif
+  if not cc.has_header('numa.h', dependencies: libnuma, required: libnumaopt)
+    libnuma = not_found_dep
+  endif
+  if libnuma.found()
+    cdata.set('USE_LIBNUMA', 1)
+  endif
+else
+  libnuma = not_found_dep
+endif
+
 
 ###############################################################
 # Library: liburing
@@ -3243,6 +3264,7 @@ backend_both_deps += [
   icu_i18n,
   ldap,
   libintl,
+  libnuma,
   liburing,
   libxml,
   lz4,
@@ -3899,6 +3921,7 @@ if meson.version().version_compare('>=0.57')
       'icu': icu,
       'ldap': ldap,
       'libcurl': libcurl,
+      'libnuma': libnuma,
       'liburing': liburing,
       'libxml': libxml,
       'libxslt': libxslt,
diff --git a/meson_options.txt b/meson_options.txt
index dd7126da3a7..06bf5627d3c 100644
--- a/meson_options.txt
+++ b/meson_options.txt
@@ -106,6 +106,9 @@ option('libcurl', type : 'feature', value: 'auto',
 option('libedit_preferred', type: 'boolean', value: false,
   description: 'Prefer BSD Libedit over GNU Readline')
 
+option('libnuma', type: 'feature', value: 'auto',
+  description: 'NUMA support')
+
 option('liburing', type : 'feature', value: 'auto',
   description: 'io_uring support, for asynchronous I/O')
 
diff --git a/src/Makefile.global.in b/src/Makefile.global.in
index 737b2dd1869..6722fbdf365 100644
--- a/src/Makefile.global.in
+++ b/src/Makefile.global.in
@@ -196,6 +196,7 @@ with_gssapi	= @with_gssapi@
 with_krb_srvnam	= @with_krb_srvnam@
 with_ldap	= @with_ldap@
 with_libcurl	= @with_libcurl@
+with_libnuma	= @with_libnuma@
 with_liburing	= @with_liburing@
 with_libxml	= @with_libxml@
 with_libxslt	= @with_libxslt@
@@ -223,6 +224,9 @@ krb_srvtab = @krb_srvtab@
 ICU_CFLAGS		= @ICU_CFLAGS@
 ICU_LIBS		= @ICU_LIBS@
 
+LIBNUMA_CFLAGS		= @LIBNUMA_CFLAGS@
+LIBNUMA_LIBS		= @LIBNUMA_LIBS@
+
 LIBURING_CFLAGS		= @LIBURING_CFLAGS@
 LIBURING_LIBS		= @LIBURING_LIBS@
 
@@ -250,7 +254,7 @@ CPP = @CPP@
 CPPFLAGS = @CPPFLAGS@
 PG_SYSROOT = @PG_SYSROOT@
 
-override CPPFLAGS := $(ICU_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
+override CPPFLAGS := $(ICU_CFLAGS) $(LIBNUMA_CFLAGS) $(LIBURING_CFLAGS) $(CPPFLAGS)
 
 ifdef PGXS
 override CPPFLAGS := -I$(includedir_server) -I$(includedir_internal) $(CPPFLAGS)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 4eaeca89f2c..ea8d796e7c4 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -566,7 +566,7 @@ static int	ssl_renegotiation_limit;
  */
 int			huge_pages = HUGE_PAGES_TRY;
 int			huge_page_size;
-static int	huge_pages_status = HUGE_PAGES_UNKNOWN;
+int			huge_pages_status = HUGE_PAGES_UNKNOWN;
 
 /*
  * These variables are all dummies that don't do anything, except in some
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5d5be8ba4e1..04834d130f9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8542,6 +8542,10 @@
   proargnames => '{name,off,size,allocated_size}',
   prosrc => 'pg_get_shmem_allocations' },
 
+{ oid => '9685', descr => 'Is NUMA support available?',
+  proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
+  proargtypes => '', prosrc => 'pg_numa_available' },
+
 # memory context of local backend
 { oid => '2282',
   descr => 'information about all memory contexts of local backend',
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c2f1241b234..b3166ec8f42 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -686,6 +686,9 @@
 /* Define to 1 to build with libcurl support. (--with-libcurl) */
 #undef USE_LIBCURL
 
+/* Define to 1 to build with NUMA support. (--with-libnuma) */
+#undef USE_LIBNUMA
+
 /* Define to build with io_uring support. (--with-liburing) */
 #undef USE_LIBURING
 
diff --git a/src/include/port/pg_numa.h b/src/include/port/pg_numa.h
new file mode 100644
index 00000000000..7e990d9f776
--- /dev/null
+++ b/src/include/port/pg_numa.h
@@ -0,0 +1,40 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.h
+ *	  Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ * 	src/include/port/pg_numa.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef PG_NUMA_H
+#define PG_NUMA_H
+
+#include "fmgr.h"
+
+extern PGDLLIMPORT int pg_numa_init(void);
+extern PGDLLIMPORT int pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status);
+extern PGDLLIMPORT int pg_numa_get_max_node(void);
+extern PGDLLIMPORT Size pg_numa_get_pagesize(void);
+
+#ifdef USE_LIBNUMA
+
+/*
+ * This is required on Linux, before pg_numa_query_pages() as we
+ * need to page-fault before move_pages(2) syscall returns valid results.
+ */
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+	ro_volatile_var = *(volatile uint64 *) ptr
+
+#else
+
+#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
+	do {} while(0)
+
+#endif
+
+#endif							/* PG_NUMA_H */
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..5f7d4b83a60 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
 extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;
 
 /* Possible values for huge_pages and huge_pages_status */
 typedef enum
diff --git a/src/makefiles/meson.build b/src/makefiles/meson.build
index 46d8da070e8..55da678ec27 100644
--- a/src/makefiles/meson.build
+++ b/src/makefiles/meson.build
@@ -200,6 +200,8 @@ pgxs_empty = [
 
   'ICU_LIBS',
 
+  'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
+
   'LIBURING_CFLAGS', 'LIBURING_LIBS',
 ]
 
@@ -232,6 +234,7 @@ pgxs_deps = {
   'icu': icu,
   'ldap': ldap,
   'libcurl': libcurl,
+  'libnuma': libnuma,
   'liburing': liburing,
   'libxml': libxml,
   'libxslt': libxslt,
diff --git a/src/port/Makefile b/src/port/Makefile
index f11896440d5..4274949dfa4 100644
--- a/src/port/Makefile
+++ b/src/port/Makefile
@@ -45,6 +45,7 @@ OBJS = \
 	path.o \
 	pg_bitutils.o \
 	pg_localeconv_r.o \
+	pg_numa.o \
 	pg_popcount_aarch64.o \
 	pg_popcount_avx512.o \
 	pg_strong_random.o \
diff --git a/src/port/meson.build b/src/port/meson.build
index 51041e75609..228888b2f66 100644
--- a/src/port/meson.build
+++ b/src/port/meson.build
@@ -8,6 +8,7 @@ pgport_sources = [
   'path.c',
   'pg_bitutils.c',
   'pg_localeconv_r.c',
+  'pg_numa.c',
   'pg_popcount_aarch64.c',
   'pg_popcount_avx512.c',
   'pg_strong_random.c',
diff --git a/src/port/pg_numa.c b/src/port/pg_numa.c
new file mode 100644
index 00000000000..5e2523cf798
--- /dev/null
+++ b/src/port/pg_numa.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_numa.c
+ * 		Basic NUMA portability routines
+ *
+ *
+ * Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/port/pg_numa.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include <unistd.h>
+
+#ifdef WIN32
+#include <windows.h>
+#endif
+
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+
+/*
+ * At this point we provide support only for Linux thanks to libnuma, but in
+ * future support for other platforms e.g. Win32 or FreeBSD might be possible
+ * too. For Win32 NUMA APIs see
+ * https://learn.microsoft.com/en-us/windows/win32/procthread/numa-support
+ */
+#ifdef USE_LIBNUMA
+
+#include <numa.h>
+#include <numaif.h>
+
+Datum		pg_numa_available(PG_FUNCTION_ARGS);
+
+/* libnuma requires initialization as per numa(3) on Linux */
+int
+pg_numa_init(void)
+{
+	int			r = numa_available();
+
+	return r;
+}
+
+/*
+ * We use move_pages(2) syscall here - instead of get_mempolicy(2) - as the
+ * first one allows us to batch and query about many memory pages in one single
+ * giant system call that is way faster.
+ */
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+	return numa_move_pages(pid, count, pages, NULL, status, 0);
+}
+
+int
+pg_numa_get_max_node(void)
+{
+	return numa_max_node();
+}
+
+#else
+
+Datum		pg_numa_available(PG_FUNCTION_ARGS);
+
+/* Empty wrappers */
+int
+pg_numa_init(void)
+{
+	/* We state that NUMA is not available */
+	return -1;
+}
+
+int
+pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
+{
+	return 0;
+}
+
+int
+pg_numa_get_max_node(void)
+{
+	return 0;
+}
+
+#endif
+
+Datum
+pg_numa_available(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BOOL(pg_numa_init() != -1);
+}
+
+/* This should be used only after the server is started */
+Size
+pg_numa_get_pagesize(void)
+{
+	Size		os_page_size;
+#ifdef WIN32
+	SYSTEM_INFO sysinfo;
+
+	GetSystemInfo(&sysinfo);
+	os_page_size = sysinfo.dwPageSize;
+#else
+	os_page_size = sysconf(_SC_PAGESIZE);
+#endif
+
+	Assert(IsUnderPostmaster);
+	Assert(huge_pages_status != HUGE_PAGES_UNKNOWN);
+
+	if (huge_pages_status == HUGE_PAGES_ON)
+		GetHugePageSize(&os_page_size, NULL);
+
+	return os_page_size;
+}
-- 
2.49.0

From b02e2f2cc0770633b565888936a4b9eb95232a19 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@postgresql.org>
Date: Mon, 7 Apr 2025 19:32:39 +0200
Subject: [PATCH v29 2/4] Introduce pg_shmem_allocations_numa view

Introduce new pg_shmem_alloctions_numa view with information about how
shared memory is distributed across NUMA nodes. For each shared memory
segment, the view returns one row for each NUMA node backing it, with
the total amount of memory allocated from that node.

The view may be relatively expensive, especially when executed for the
first time in a backend, as it has to touch all memory pages to get
reliable information about the NUMA node. This may also force allocation
of the shared memory.

Unlike pg_shmem_allocations, the view does not show anonymous shared
memory allocations. It also does not show memory allocated using the
dynamic shared memory infrastructure.

Author: Jakub Wartak <jakub.war...@enterprisedb.com>
Reviewed-by: Andres Freund <and...@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Reviewed-by: Tomas Vondra <to...@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
 doc/src/sgml/system-views.sgml           |  95 ++++++++++++++
 src/backend/catalog/system_views.sql     |   8 ++
 src/backend/storage/ipc/shmem.c          | 159 +++++++++++++++++++++++
 src/include/catalog/pg_proc.dat          |   8 ++
 src/test/regress/expected/numa.out       |  13 ++
 src/test/regress/expected/numa_1.out     |   5 +
 src/test/regress/expected/privileges.out |  16 ++-
 src/test/regress/expected/rules.out      |   4 +
 src/test/regress/parallel_schedule       |   2 +-
 src/test/regress/sql/numa.sql            |  10 ++
 src/test/regress/sql/privileges.sql      |   6 +-
 11 files changed, 321 insertions(+), 5 deletions(-)
 create mode 100644 src/test/regress/expected/numa.out
 create mode 100644 src/test/regress/expected/numa_1.out
 create mode 100644 src/test/regress/sql/numa.sql

diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
index 4f336ee0adf..0eba37268bf 100644
--- a/doc/src/sgml/system-views.sgml
+++ b/doc/src/sgml/system-views.sgml
@@ -181,6 +181,11 @@
       <entry>shared memory allocations</entry>
      </row>
 
+     <row>
+      <entry><link linkend="view-pg-shmem-allocations-numa"><structname>pg_shmem_allocations_numa</structname></link></entry>
+      <entry>NUMA node mappings for shared memory allocations</entry>
+     </row>
+
      <row>
       <entry><link linkend="view-pg-stats"><structname>pg_stats</structname></link></entry>
       <entry>planner statistics</entry>
@@ -4051,6 +4056,96 @@ SELECT * FROM pg_locks pl LEFT JOIN pg_prepared_xacts ppx
   </para>
  </sect1>
 
+ <sect1 id="view-pg-shmem-allocations-numa">
+  <title><structname>pg_shmem_allocations_numa</structname></title>
+
+  <indexterm zone="view-pg-shmem-allocations-numa">
+   <primary>pg_shmem_allocations_numa</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_shmem_allocations_numa</structname> shows how shared
+   memory allocations in the server's main shared memory segment are distributed
+   across NUMA nodes. This includes both memory allocated by
+   <productname>PostgreSQL</productname> itself and memory allocated
+   by extensions using the mechanisms detailed in
+   <xref linkend="xfunc-shared-addin" />. This view will output multiple rows
+   for each of the shared memory segments provided that they are spread accross
+   multiple NUMA nodes. This view should not be queried by monitoring systems
+   as it is very slow and may end up allocating shared memory in case it was not
+   used earlier.
+   Current limitation for this view is that won't show anonymous shared memory
+   allocations.
+  </para>
+
+  <para>
+   Note that this view does not include memory allocated using the dynamic
+   shared memory infrastructure.
+  </para>
+
+  <warning>
+    <para>
+      When determining the <acronym>NUMA</acronym> node, the view touches
+      all memory pages for the shared memory segment. This will force
+      allocation of the shared memory, if it wasn't allocated already,
+      and the memory may get allocated in a single <acronym>NUMA</acronym>
+      node (depending on system configuration).
+    </para>
+  </warning>
+
+  <table>
+   <title><structname>pg_shmem_allocations_numa</structname> Columns</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>name</structfield> <type>text</type>
+      </para>
+      <para>
+       The name of the shared memory allocation.
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>numa_node</structfield> <type>int4</type>
+      </para>
+      <para>
+      ID of <acronym>NUMA</acronym> node
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>size</structfield> <type>int4</type>
+      </para>
+      <para>
+       Size of the allocation on this particular NUMA memory node in bytes
+      </para></entry>
+     </row>
+
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   By default, the <structname>pg_shmem_allocations_numa</structname> view can be
+   read only by superusers or roles with privileges of the
+   <literal>pg_read_all_stats</literal> role.
+  </para>
+ </sect1>
+
  <sect1 id="view-pg-stats">
   <title><structname>pg_stats</structname></title>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 273008db37f..08f780a2e63 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -658,6 +658,14 @@ GRANT SELECT ON pg_shmem_allocations TO pg_read_all_stats;
 REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations() FROM PUBLIC;
 GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations() TO pg_read_all_stats;
 
+CREATE VIEW pg_shmem_allocations_numa AS
+    SELECT * FROM pg_get_shmem_allocations_numa();
+
+REVOKE ALL ON pg_shmem_allocations_numa FROM PUBLIC;
+GRANT SELECT ON pg_shmem_allocations_numa TO pg_read_all_stats;
+REVOKE EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() FROM PUBLIC;
+GRANT EXECUTE ON FUNCTION pg_get_shmem_allocations_numa() TO pg_read_all_stats;
+
 CREATE VIEW pg_backend_memory_contexts AS
     SELECT * FROM pg_get_backend_memory_contexts();
 
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..e10b380e5c7 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -68,6 +68,7 @@
 #include "fmgr.h"
 #include "funcapi.h"
 #include "miscadmin.h"
+#include "port/pg_numa.h"
 #include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
 #include "storage/shmem.h"
@@ -89,6 +90,8 @@ slock_t    *ShmemLock;			/* spinlock for shared memory and LWLock
 
 static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 
+/* To get reliable results for NUMA inquiry we need to "touch pages" once */
+static bool firstNumaTouch = true;
 
 /*
  *	InitShmemAccess() --- set up basic pointers to shared memory.
@@ -568,3 +571,159 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
 	return (Datum) 0;
 }
+
+/*
+ * SQL SRF showing NUMA memory nodes for allocated shared memory
+ *
+ * Compared to pg_get_shmem_allocations(), this function does not return
+ * information about shared anonymous allocations and unused shared memory.
+ */
+Datum
+pg_get_shmem_allocations_numa(PG_FUNCTION_ARGS)
+{
+#define PG_GET_SHMEM_NUMA_SIZES_COLS 3
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	HASH_SEQ_STATUS hstat;
+	ShmemIndexEnt *ent;
+	Datum		values[PG_GET_SHMEM_NUMA_SIZES_COLS];
+	bool		nulls[PG_GET_SHMEM_NUMA_SIZES_COLS];
+	Size		os_page_size;
+	void	  **page_ptrs;
+	int		   *pages_status;
+	uint64		shm_total_page_count,
+				shm_ent_page_count,
+				max_nodes;
+	Size	   *nodes;
+
+	if (pg_numa_init() == -1)
+		elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	max_nodes = pg_numa_get_max_node();
+	nodes = palloc(sizeof(Size) * (max_nodes + 1));
+
+	/*
+	 * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
+	 * the OS may have different memory page sizes.
+	 *
+	 * To correctly map between them, we need to: 1. Determine the OS memory
+	 * page size 2. Calculate how many OS pages are used by all buffer blocks
+	 * 3. Calculate how many OS pages are contained within each database
+	 * block.
+	 *
+	 * This information is needed before calling move_pages() for NUMA memory
+	 * node inquiry.
+	 */
+	os_page_size = pg_numa_get_pagesize();
+
+	/*
+	 * Allocate memory for page pointers and status based on total shared
+	 * memory size. This simplified approach allocates enough space for all
+	 * pages in shared memory rather than calculating the exact requirements
+	 * for each segment.
+	 *
+	 * Add 1, because we don't know how exactly the segments align to OS
+	 * pages, so the allocation might use one more memory page. In practice
+	 * this is not very likely, and moreover we have more entries, each of
+	 * them using only fraction of the total pages.
+	 */
+	shm_total_page_count = (ShmemSegHdr->totalsize / os_page_size) + 1;
+	page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
+	pages_status = palloc(sizeof(int) * shm_total_page_count);
+
+	if (firstNumaTouch)
+		elog(DEBUG1, "NUMA: page-faulting shared memory segments for proper NUMA readouts");
+
+	LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+	hash_seq_init(&hstat, ShmemIndex);
+
+	/* output all allocated entries */
+	memset(nulls, 0, sizeof(nulls));
+	while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
+	{
+		int			i;
+		char	   *startptr,
+				   *endptr;
+		Size		total_len;
+
+		/*
+		 * Calculate the range of OS pages used by this segment. The segment
+		 * may start / end half-way through a page, we want to count these
+		 * pages too. So we align the start/end pointers down/up, and then
+		 * calculate the number of pages from that.
+		 */
+		startptr = (char *) TYPEALIGN_DOWN(os_page_size, ent->location);
+		endptr = (char *) TYPEALIGN(os_page_size,
+									(char *) ent->location + ent->allocated_size);
+		total_len = (endptr - startptr);
+
+		shm_ent_page_count = total_len / os_page_size;
+
+		/*
+		 * If we ever get 0xff (-1) back from kernel inquiry, then we probably
+		 * have a bug in mapping buffers to OS pages.
+		 */
+		memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
+
+		/*
+		 * Setup page_ptrs[] with pointers to all OS pages for this segment,
+		 * and get the NUMA status using pg_numa_query_pages.
+		 *
+		 * In order to get reliable results we also need to touch memory
+		 * pages, so that inquiry about NUMA memory node doesn't return -2
+		 * (ENOENT, which indicates unmapped/unallocated pages).
+		 */
+		for (i = 0; i < shm_ent_page_count; i++)
+		{
+			volatile uint64 touch pg_attribute_unused();
+
+			page_ptrs[i] = startptr + (i * os_page_size);
+
+			if (firstNumaTouch)
+				pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
+
+			CHECK_FOR_INTERRUPTS();
+		}
+
+		if (pg_numa_query_pages(0, shm_ent_page_count, page_ptrs, pages_status) == -1)
+			elog(ERROR, "failed NUMA pages inquiry status: %m");
+
+		/* Count number of NUMA nodes used for this shared memory entry */
+		memset(nodes, 0, sizeof(Size) * (max_nodes + 1));
+
+		for (i = 0; i < shm_ent_page_count; i++)
+		{
+			int			s = pages_status[i];
+
+			/* Ensure we are adding only valid index to the array */
+			if (s < 0 || s > max_nodes)
+			{
+				elog(ERROR, "invalid NUMA node id outside of allowed range "
+					 "[0, " UINT64_FORMAT "]: %d", max_nodes, s);
+			}
+
+			nodes[s]++;
+		}
+
+		/*
+		 * Add one entry for each NUMA node, including those without allocated
+		 * memory for this segment.
+		 */
+		for (i = 0; i <= max_nodes; i++)
+		{
+			values[0] = CStringGetTextDatum(ent->key);
+			values[1] = i;
+			values[2] = Int64GetDatum(nodes[i] * os_page_size);
+
+			tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+								 values, nulls);
+		}
+	}
+
+	LWLockRelease(ShmemIndexLock);
+	firstNumaTouch = false;
+
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 04834d130f9..8597981d6b3 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -8546,6 +8546,14 @@
   proname => 'pg_numa_available', provolatile => 's', prorettype => 'bool',
   proargtypes => '', prosrc => 'pg_numa_available' },
 
+# shared memory usage with NUMA info
+{ oid => '9686', descr => 'NUMA mappings for the main shared memory segment',
+  proname => 'pg_get_shmem_allocations_numa', prorows => '50', proretset => 't',
+  provolatile => 'v', prorettype => 'record', proargtypes => '',
+  proallargtypes => '{text,int4,int8}', proargmodes => '{o,o,o}',
+  proargnames => '{name,numa_node,size}',
+  prosrc => 'pg_get_shmem_allocations_numa' },
+
 # memory context of local backend
 { oid => '2282',
   descr => 'information about all memory contexts of local backend',
diff --git a/src/test/regress/expected/numa.out b/src/test/regress/expected/numa.out
new file mode 100644
index 00000000000..8af5dfeb9a5
--- /dev/null
+++ b/src/test/regress/expected/numa.out
@@ -0,0 +1,13 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+-- switch to superuser
+\c -
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
+ ok 
+----
+ t
+(1 row)
+
diff --git a/src/test/regress/expected/numa_1.out b/src/test/regress/expected/numa_1.out
new file mode 100644
index 00000000000..c90042fa7cc
--- /dev/null
+++ b/src/test/regress/expected/numa_1.out
@@ -0,0 +1,5 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+ERROR:  libnuma initialization failed or NUMA is not supported on this platform
+\quit
diff --git a/src/test/regress/expected/privileges.out b/src/test/regress/expected/privileges.out
index 1fddb13b6ae..c25062c288f 100644
--- a/src/test/regress/expected/privileges.out
+++ b/src/test/regress/expected/privileges.out
@@ -3219,8 +3219,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
 -- clean up
 DROP TABLE lock_table;
 DROP USER regress_locktable_user;
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
 -- switch to superuser
 \c -
 CREATE ROLE regress_readallstats;
@@ -3242,6 +3242,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
  f
 (1 row)
 
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
+ has_table_privilege 
+---------------------
+ f
+(1 row)
+
 GRANT pg_read_all_stats TO regress_readallstats;
 SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
  has_table_privilege 
@@ -3261,6 +3267,12 @@ SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT
  t
 (1 row)
 
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
+ has_table_privilege 
+---------------------
+ t
+(1 row)
+
 -- run query to ensure that functions within views can be executed
 SET ROLE regress_readallstats;
 SELECT COUNT(*) >= 0 AS ok FROM pg_aios;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 673c63b8d1b..6cf828ca8d0 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1757,6 +1757,10 @@ pg_shmem_allocations| SELECT name,
     size,
     allocated_size
    FROM pg_get_shmem_allocations() pg_get_shmem_allocations(name, off, size, allocated_size);
+pg_shmem_allocations_numa| SELECT name,
+    numa_node,
+    size
+   FROM pg_get_shmem_allocations_numa() pg_get_shmem_allocations_numa(name, numa_node, size);
 pg_stat_activity| SELECT s.datid,
     d.datname,
     s.pid,
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 0a35f2f8f6a..0f38caa0d24 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -119,7 +119,7 @@ test: plancache limit plpgsql copy2 temp domain rangefuncs prepare conversion tr
 # The stats test resets stats, so nothing else needing stats access can be in
 # this group.
 # ----------
-test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate
+test: partition_join partition_prune reloptions hash_part indexing partition_aggregate partition_info tuplesort explain compression memoize stats predicate numa
 
 # event_trigger depends on create_am and cannot run concurrently with
 # any test that runs DDL
diff --git a/src/test/regress/sql/numa.sql b/src/test/regress/sql/numa.sql
new file mode 100644
index 00000000000..324481c33b7
--- /dev/null
+++ b/src/test/regress/sql/numa.sql
@@ -0,0 +1,10 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+SELECT COUNT(*) = 0 AS ok FROM pg_shmem_allocations_numa;
+\quit
+\endif
+
+-- switch to superuser
+\c -
+
+SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
diff --git a/src/test/regress/sql/privileges.sql b/src/test/regress/sql/privileges.sql
index 85d7280f35f..f337aa67c13 100644
--- a/src/test/regress/sql/privileges.sql
+++ b/src/test/regress/sql/privileges.sql
@@ -1947,8 +1947,8 @@ REVOKE MAINTAIN ON lock_table FROM regress_locktable_user;
 DROP TABLE lock_table;
 DROP USER regress_locktable_user;
 
--- test to check privileges of system views pg_shmem_allocations and
--- pg_backend_memory_contexts.
+-- test to check privileges of system views pg_shmem_allocations,
+-- pg_shmem_allocations_numa and pg_backend_memory_contexts.
 
 -- switch to superuser
 \c -
@@ -1958,12 +1958,14 @@ CREATE ROLE regress_readallstats;
 SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- no
 SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- no
 SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- no
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- no
 
 GRANT pg_read_all_stats TO regress_readallstats;
 
 SELECT has_table_privilege('regress_readallstats','pg_aios','SELECT'); -- yes
 SELECT has_table_privilege('regress_readallstats','pg_backend_memory_contexts','SELECT'); -- yes
 SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations','SELECT'); -- yes
+SELECT has_table_privilege('regress_readallstats','pg_shmem_allocations_numa','SELECT'); -- yes
 
 -- run query to ensure that functions within views can be executed
 SET ROLE regress_readallstats;
-- 
2.49.0

From b5c6dc19a198977929b176573678a84da1c12fbd Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.von...@postgresql.org>
Date: Mon, 7 Apr 2025 19:39:31 +0200
Subject: [PATCH v29 3/4] Add pg_buffercache_numa view with NUMA node info

Introduces a new view pg_buffercache_numa, showing NUMA memory nodes
for individual buffers. For each buffer the view returns an entry for
each memory page, with the associated NUMA node.

The database blocks and OS memory pages may have different size - the
default block size is 8KB, while the memory page is 4K (on x86). But
other combinations are possible, depending on configure parameters,
platform, etc. This means buffers may overlap with multiple memory
pages, each associated with a different NUMA node.

To determine the NUMA node for a buffer, we first need to touch the
memory pages using pg_numa_touch_mem_if_required, otherwise we might get
status -2 (ENOENT = The page is not present), indicating the page is
either unmapped or unallocated.

The view may be relatively expensive, especially when accessed for the
first time in a backend, as it touches all memory pages to get reliable
information about the NUMA node. This may also force allocation of the
shared memory.

Author: Jakub Wartak <jakub.war...@enterprisedb.com>
Reviewed-by: Andres Freund <and...@anarazel.de>
Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Reviewed-by: Tomas Vondra <to...@vondra.me>
Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
---
 contrib/pg_buffercache/Makefile               |   5 +-
 .../expected/pg_buffercache_numa.out          |  29 ++
 .../expected/pg_buffercache_numa_1.out        |   3 +
 contrib/pg_buffercache/meson.build            |   2 +
 .../pg_buffercache--1.5--1.6.sql              |  22 ++
 contrib/pg_buffercache/pg_buffercache.control |   2 +-
 contrib/pg_buffercache/pg_buffercache_pages.c | 283 ++++++++++++++++++
 .../sql/pg_buffercache_numa.sql               |  21 ++
 doc/src/sgml/pgbuffercache.sgml               |  85 +++++-
 src/tools/pgindent/typedefs.list              |   2 +
 10 files changed, 450 insertions(+), 4 deletions(-)
 create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa.out
 create mode 100644 contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
 create mode 100644 contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
 create mode 100644 contrib/pg_buffercache/sql/pg_buffercache_numa.sql

diff --git a/contrib/pg_buffercache/Makefile b/contrib/pg_buffercache/Makefile
index eae65ead9e5..5f748543e2e 100644
--- a/contrib/pg_buffercache/Makefile
+++ b/contrib/pg_buffercache/Makefile
@@ -8,10 +8,11 @@ OBJS = \
 EXTENSION = pg_buffercache
 DATA = pg_buffercache--1.2.sql pg_buffercache--1.2--1.3.sql \
 	pg_buffercache--1.1--1.2.sql pg_buffercache--1.0--1.1.sql \
-	pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql
+	pg_buffercache--1.3--1.4.sql pg_buffercache--1.4--1.5.sql \
+	pg_buffercache--1.5--1.6.sql
 PGFILEDESC = "pg_buffercache - monitoring of shared buffer cache in real-time"
 
-REGRESS = pg_buffercache
+REGRESS = pg_buffercache pg_buffercache_numa
 
 ifdef USE_PGXS
 PG_CONFIG = pg_config
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa.out b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
new file mode 100644
index 00000000000..a10b331a552
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa.out
@@ -0,0 +1,29 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+                    from pg_settings
+                    where name = 'shared_buffers')
+from pg_buffercache_numa;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ERROR:  permission denied for view pg_buffercache_numa
+RESET role;
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+ ?column? 
+----------
+ t
+(1 row)
+
+RESET role;
diff --git a/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
new file mode 100644
index 00000000000..6dd6824b4e4
--- /dev/null
+++ b/contrib/pg_buffercache/expected/pg_buffercache_numa_1.out
@@ -0,0 +1,3 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/contrib/pg_buffercache/meson.build b/contrib/pg_buffercache/meson.build
index 12d1fe48717..7cd039a1df9 100644
--- a/contrib/pg_buffercache/meson.build
+++ b/contrib/pg_buffercache/meson.build
@@ -23,6 +23,7 @@ install_data(
   'pg_buffercache--1.2.sql',
   'pg_buffercache--1.3--1.4.sql',
   'pg_buffercache--1.4--1.5.sql',
+  'pg_buffercache--1.5--1.6.sql',
   'pg_buffercache.control',
   kwargs: contrib_data_args,
 )
@@ -34,6 +35,7 @@ tests += {
   'regress': {
     'sql': [
       'pg_buffercache',
+      'pg_buffercache_numa',
     ],
   },
 }
diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
new file mode 100644
index 00000000000..f6668e41b37
--- /dev/null
+++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
@@ -0,0 +1,22 @@
+/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache UPDATE TO '1.6'" to load this file. \quit
+
+-- Register the new functions.
+CREATE OR REPLACE FUNCTION pg_buffercache_numa_pages()
+RETURNS SETOF RECORD
+AS 'MODULE_PATHNAME', 'pg_buffercache_numa_pages'
+LANGUAGE C PARALLEL SAFE;
+
+-- Create a view for convenient access.
+CREATE VIEW pg_buffercache_numa AS
+	SELECT P.* FROM pg_buffercache_numa_pages() AS P
+	(bufferid integer, os_page_num int4, numa_node int4);
+
+-- Don't want these to be available to public.
+REVOKE ALL ON FUNCTION pg_buffercache_numa_pages() FROM PUBLIC;
+REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
+
+GRANT EXECUTE ON FUNCTION pg_buffercache_numa_pages() TO pg_monitor;
+GRANT SELECT ON pg_buffercache_numa TO pg_monitor;
diff --git a/contrib/pg_buffercache/pg_buffercache.control b/contrib/pg_buffercache/pg_buffercache.control
index 5ee875f77dd..b030ba3a6fa 100644
--- a/contrib/pg_buffercache/pg_buffercache.control
+++ b/contrib/pg_buffercache/pg_buffercache.control
@@ -1,5 +1,5 @@
 # pg_buffercache extension
 comment = 'examine the shared buffer cache'
-default_version = '1.5'
+default_version = '1.6'
 module_pathname = '$libdir/pg_buffercache'
 relocatable = true
diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index 62602af1775..b9fdf87bcc4 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -11,6 +11,7 @@
 #include "access/htup_details.h"
 #include "catalog/pg_type.h"
 #include "funcapi.h"
+#include "port/pg_numa.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 
@@ -20,6 +21,8 @@
 #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
 #define NUM_BUFFERCACHE_USAGE_COUNTS_ELEM 4
 
+#define NUM_BUFFERCACHE_NUMA_ELEM	3
+
 PG_MODULE_MAGIC_EXT(
 					.name = "pg_buffercache",
 					.version = PG_VERSION
@@ -58,16 +61,44 @@ typedef struct
 	BufferCachePagesRec *record;
 } BufferCachePagesContext;
 
+/*
+ * Record structure holding the to be exposed cache data.
+ */
+typedef struct
+{
+	uint32		bufferid;
+	int32		page_num;
+	int32		numa_node;
+} BufferCacheNumaRec;
+
+/*
+ * Function context for data persisting over repeated calls.
+ */
+typedef struct
+{
+	TupleDesc	tupdesc;
+	int			buffers_per_page;
+	int			pages_per_buffer;
+	int			os_page_size;
+	BufferCacheNumaRec *record;
+} BufferCacheNumaContext;
+
 
 /*
  * Function returning data from the shared buffer cache - buffer number,
  * relation node/tablespace/database/blocknum and dirty indicator.
  */
 PG_FUNCTION_INFO_V1(pg_buffercache_pages);
+PG_FUNCTION_INFO_V1(pg_buffercache_numa_pages);
 PG_FUNCTION_INFO_V1(pg_buffercache_summary);
 PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
 PG_FUNCTION_INFO_V1(pg_buffercache_evict);
 
+
+/* Only need to touch memory once per backend process lifetime */
+static bool firstNumaTouch = true;
+
+
 Datum
 pg_buffercache_pages(PG_FUNCTION_ARGS)
 {
@@ -246,6 +277,258 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 		SRF_RETURN_DONE(funcctx);
 }
 
+/*
+ * Inquire about NUMA memory mappings for shared buffers.
+ *
+ * Returns NUMA node ID for each memory page used by the buffer. Buffers may
+ * be smaller or larger than OS memory pages. For each buffer we return one
+ * entry for each memory page used by the buffer (it fhe buffer is smaller,
+ * it only uses a part of one memory page).
+ *
+ * We expect both sizes (for buffers and memory pages) to be a power-of-2, so
+ * one is always a multiple of the other.
+ *
+ * In order to get reliable results we also need to touch memory pages, so
+ * that the inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages).
+ */
+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	MemoryContext oldcontext;
+	BufferCacheNumaContext *fctx;	/* User function context. */
+	TupleDesc	tupledesc;
+	TupleDesc	expected_tupledesc;
+	HeapTuple	tuple;
+	Datum		result;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		int			i,
+					idx;
+		Size		os_page_size = 0;
+		void	  **os_page_ptrs = NULL;
+		int		   *os_page_status;
+		uint64		os_page_count;
+		int			pages_per_buffer;
+		int			max_entries;
+		volatile uint64 touch pg_attribute_unused();
+		char	   *startptr,
+				   *endptr;
+
+		if (pg_numa_init() == -1)
+			elog(ERROR, "libnuma initialization failed or NUMA is not supported on this platform");
+
+		/*
+		 * The database block size and OS memory page size are unlikely to be
+		 * the same. The block size is 1-32KB, the memory page size depends on
+		 * platform. On x86 it's usually 4KB, on ARM it's 4KB or 64KB, but
+		 * there are also features like THP etc. Moreover, we don't quite know
+		 * how the pages and buffers "align" in memory - the buffers may be
+		 * shifted in some way, using more memory pages than necessary.
+		 *
+		 * So we need to be careful about mappping buffers to memory pages. We
+		 * calculate the maximum number of pages a buffer might use, so that
+		 * we allocate enough space for the entries. And then we count the
+		 * actual number of entries as we scan the buffers.
+		 *
+		 * This information is needed before calling move_pages() for NUMA
+		 * node id inquiry.
+		 */
+		os_page_size = pg_numa_get_pagesize();
+
+		/*
+		 * The pages and block size is expected to be 2^k, so one divides the
+		 * other (we don't know in which direction). This does not say
+		 * anything about relative alignment of pages/buffers.
+		 */
+		Assert((os_page_size % BLCKSZ == 0) || (BLCKSZ % os_page_size == 0));
+
+		/*
+		 * How many addresses we are going to query? Simply get the page for
+		 * the first buffer, and first page after the last buffer, and count
+		 * the pages from that.
+		 */
+		startptr = (char *) TYPEALIGN_DOWN(os_page_size,
+										   BufferGetBlock(1));
+		endptr = (char *) TYPEALIGN_DOWN(os_page_size,
+										 (char *) BufferGetBlock(NBuffers) + BLCKSZ);
+		os_page_count = (endptr - startptr) / os_page_size;
+
+		/* Used to determine the NUMA node for all OS pages at once */
+		os_page_ptrs = palloc0(sizeof(void *) * os_page_count);
+		os_page_status = palloc(sizeof(uint64) * os_page_count);
+
+		/* Fill pointers for all the memory pages. */
+		idx = 0;
+		for (char *ptr = startptr; ptr < endptr; ptr += os_page_size)
+		{
+			os_page_ptrs[idx++] = ptr;
+
+			/* Only need to touch memory once per backend process lifetime */
+			if (firstNumaTouch)
+				pg_numa_touch_mem_if_required(touch, ptr);
+		}
+
+		Assert(idx == os_page_count);
+
+		elog(DEBUG1, "NUMA: NBuffers=%d os_page_count=" UINT64_FORMAT " "
+			 "os_page_size=%zu", NBuffers, os_page_count, os_page_size);
+
+		/*
+		 * If we ever get 0xff back from kernel inquiry, then we probably have
+		 * bug in our buffers to OS page mapping code here.
+		 */
+		memset(os_page_status, 0xff, sizeof(int) * os_page_count);
+
+		/* Query NUMA status for all the pointers */
+		if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_page_status) == -1)
+			elog(ERROR, "failed NUMA pages inquiry: %m");
+
+		/* Initialize the multi-call context, load entries about buffers */
+
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/* Switch context when allocating stuff to be used in later calls */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* Create a user function context for cross-call persistence */
+		fctx = (BufferCacheNumaContext *) palloc(sizeof(BufferCacheNumaContext));
+
+		if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
+			elog(ERROR, "return type must be a row type");
+
+		if (expected_tupledesc->natts != NUM_BUFFERCACHE_NUMA_ELEM)
+			elog(ERROR, "incorrect number of output arguments");
+
+		/* Construct a tuple descriptor for the result rows. */
+		tupledesc = CreateTemplateTupleDesc(expected_tupledesc->natts);
+		TupleDescInitEntry(tupledesc, (AttrNumber) 1, "bufferid",
+						   INT4OID, -1, 0);
+		TupleDescInitEntry(tupledesc, (AttrNumber) 2, "os_page_num",
+						   INT4OID, -1, 0);
+		TupleDescInitEntry(tupledesc, (AttrNumber) 3, "numa_node",
+						   INT4OID, -1, 0);
+
+		fctx->tupdesc = BlessTupleDesc(tupledesc);
+
+		/*
+		 * Each buffer needs at least one entry, but it might be offset in
+		 * some way, and use one extra entry. So we allocate space for the
+		 * maximum number of entries we might need, and then count the exact
+		 * number as we're walking buffers. That way we can do it in one pass,
+		 * without reallocating memory.
+		 */
+		pages_per_buffer = Max(1, BLCKSZ / os_page_size) + 1;
+		max_entries = NBuffers * pages_per_buffer;
+
+		/* Allocate entries for BufferCachePagesRec records. */
+		fctx->record = (BufferCacheNumaRec *)
+			MemoryContextAllocHuge(CurrentMemoryContext,
+								   sizeof(BufferCacheNumaRec) * max_entries);
+
+		/* Return to original context when allocating transient memory */
+		MemoryContextSwitchTo(oldcontext);
+
+		if (firstNumaTouch)
+			elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
+
+		/*
+		 * Scan through all the buffers, saving the relevant fields in the
+		 * fctx->record structure.
+		 *
+		 * We don't hold the partition locks, so we don't get a consistent
+		 * snapshot across all buffers, but we do grab the buffer header
+		 * locks, so the information of each buffer is self-consistent.
+		 *
+		 * This loop touches and stores addresses into os_page_ptrs[] as input
+		 * to one big big move_pages(2) inquiry system call. Basically we ask
+		 * for all memory pages for NBuffers.
+		 */
+		startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char *) BufferGetBlock(1));
+		idx = 0;
+		for (i = 0; i < NBuffers; i++)
+		{
+			char	   *buffptr = (char *) BufferGetBlock(i + 1);
+			BufferDesc *bufHdr;
+			uint32		buf_state;
+			uint32		bufferid;
+			int32		page_num;
+			char	   *startptr_buff,
+					   *endptr_buff;
+
+			CHECK_FOR_INTERRUPTS();
+
+			bufHdr = GetBufferDescriptor(i);
+
+			/* Lock each buffer header before inspecting. */
+			buf_state = LockBufHdr(bufHdr);
+			bufferid = BufferDescriptorGetBuffer(bufHdr);
+			UnlockBufHdr(bufHdr, buf_state);
+
+			/* start of the first page of this buffer */
+			startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
+
+			/* start of the page right after this buffer */
+			endptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr + BLCKSZ);
+
+			/* calculate ID of the first page for this buffer */
+			page_num = (startptr_buff - startptr) / os_page_size;
+
+			/* Add an entry for each OS page overlapping with this buffer. */
+			for (char *ptr = startptr_buff; ptr < endptr_buff; ptr += os_page_size)
+			{
+				fctx->record[idx].bufferid = bufferid;
+				fctx->record[idx].page_num = page_num;
+				fctx->record[idx].numa_node = os_page_status[page_num];
+
+				/* advance to the next entry/page */
+				++idx;
+				++page_num;
+			}
+		}
+
+		Assert((idx >= os_page_count) && (idx <= max_entries));
+
+		/* Set max calls and remember the user function context. */
+		funcctx->max_calls = idx;
+		funcctx->user_fctx = fctx;
+
+		/* Remember this backend touched the pages */
+		firstNumaTouch = false;
+	}
+
+	funcctx = SRF_PERCALL_SETUP();
+
+	/* Get the saved state */
+	fctx = funcctx->user_fctx;
+
+	if (funcctx->call_cntr < funcctx->max_calls)
+	{
+		uint32		i = funcctx->call_cntr;
+		Datum		values[NUM_BUFFERCACHE_NUMA_ELEM];
+		bool		nulls[NUM_BUFFERCACHE_NUMA_ELEM];
+
+		values[0] = Int32GetDatum(fctx->record[i].bufferid);
+		nulls[0] = false;
+
+		values[1] = Int32GetDatum(fctx->record[i].page_num);
+		nulls[1] = false;
+
+		values[2] = Int32GetDatum(fctx->record[i].numa_node);
+		nulls[2] = false;
+
+		/* Build and return the tuple. */
+		tuple = heap_form_tuple(fctx->tupdesc, values, nulls);
+		result = HeapTupleGetDatum(tuple);
+
+		SRF_RETURN_NEXT(funcctx, result);
+	}
+	else
+		SRF_RETURN_DONE(funcctx);
+}
+
 Datum
 pg_buffercache_summary(PG_FUNCTION_ARGS)
 {
diff --git a/contrib/pg_buffercache/sql/pg_buffercache_numa.sql b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
new file mode 100644
index 00000000000..837f3d64e21
--- /dev/null
+++ b/contrib/pg_buffercache/sql/pg_buffercache_numa.sql
@@ -0,0 +1,21 @@
+SELECT NOT(pg_numa_available()) AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- We expect at least one entry for each buffer
+select count(*) >= (select setting::bigint
+                    from pg_settings
+                    where name = 'shared_buffers')
+from pg_buffercache_numa;
+
+-- Check that the functions / views can't be accessed by default. To avoid
+-- having to create a dedicated user, use the pg_database_owner pseudo-role.
+SET ROLE pg_database_owner;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
+
+-- Check that pg_monitor is allowed to query view / function
+SET ROLE pg_monitor;
+SELECT count(*) > 0 FROM pg_buffercache_numa;
+RESET role;
diff --git a/doc/src/sgml/pgbuffercache.sgml b/doc/src/sgml/pgbuffercache.sgml
index 802a5112d77..b5050cd7343 100644
--- a/doc/src/sgml/pgbuffercache.sgml
+++ b/doc/src/sgml/pgbuffercache.sgml
@@ -30,7 +30,9 @@
  <para>
   This module provides the <function>pg_buffercache_pages()</function>
   function (wrapped in the <structname>pg_buffercache</structname> view),
-  the <function>pg_buffercache_summary()</function> function, the
+  <function>pg_buffercache_numa_pages()</function> function (wrapped in the
+  <structname>pg_buffercache_numa</structname> view), the
+  <function>pg_buffercache_summary()</function> function, the
   <function>pg_buffercache_usage_counts()</function> function and
   the <function>pg_buffercache_evict()</function> function.
  </para>
@@ -42,6 +44,15 @@
   convenient use.
  </para>
 
+ <para>
+  The <function>pg_buffercache_numa_pages()</function> provides
+  <acronym>NUMA</acronym> node mappings for shared buffer entries. This
+  information is not part of <function>pg_buffercache_pages()</function>
+  itself, as it is much slower to retrieve.
+  The <structname>pg_buffercache_numa</structname> view wraps the function for
+  convenient use.
+ </para>
+
  <para>
   The <function>pg_buffercache_summary()</function> function returns a single
   row summarizing the state of the shared buffer cache.
@@ -200,6 +211,78 @@
   </para>
  </sect2>
 
+ <sect2 id="pgbuffercache-pg-buffercache-numa">
+  <title>The <structname>pg_buffercache_numa</structname> View</title>
+
+  <para>
+   The definitions of the columns exposed by the view are shown in <xref linkend="pgbuffercache-numa-columns"/>.
+  </para>
+
+  <table id="pgbuffercache-numa-columns">
+   <title><structname>pg_buffercache_numa</structname> Columns</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>bufferid</structfield> <type>integer</type>
+      </para>
+      <para>
+       ID, in the range 1..<varname>shared_buffers</varname>
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>os_page_num</structfield> <type>int</type>
+      </para>
+      <para>
+       number of OS memory page for this buffer
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>numa_node</structfield> <type>int</type>
+      </para>
+      <para>
+       ID of <acronym>NUMA</acronym> node
+      </para></entry>
+     </row>
+
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   As <acronym>NUMA</acronym> node ID inquiry for each page requires memory pages
+   to be paged-in, the first execution of this function can take a noticeable
+   amount of time. In all the cases (first execution or not), retrieving this
+   information is costly and querying the view at a high frequency is not recommended.
+  </para>
+
+  <warning>
+    <para>
+      When determining the <acronym>NUMA</acronym> node, the view touches
+      all memory pages for the shared memory segment. This will force
+      allocation of the shared memory, if it wasn't allocated already,
+      and the memory may get allocated in a single <acronym>NUMA</acronym>
+      node (depending on system configuration).
+    </para>
+  </warning>
+
+ </sect2>
+
  <sect2 id="pgbuffercache-summary">
   <title>The <function>pg_buffercache_summary()</function> Function</title>
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 229fbff47ae..714cee6d6f1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -340,6 +340,8 @@ BufFile
 Buffer
 BufferAccessStrategy
 BufferAccessStrategyType
+BufferCacheNumaRec
+BufferCacheNumaContext
 BufferCachePagesContext
 BufferCachePagesRec
 BufferDesc
-- 
2.49.0

From 8b0438045bbff5b665e65c4d1cf73b5da7d0d955 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <to...@vondra.me>
Date: Mon, 7 Apr 2025 21:33:20 +0200
Subject: [PATCH v29 4/4] fixup

---
 contrib/pg_buffercache/pg_buffercache_pages.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
index b9fdf87bcc4..a702a47efe9 100644
--- a/contrib/pg_buffercache/pg_buffercache_pages.c
+++ b/contrib/pg_buffercache/pg_buffercache_pages.c
@@ -307,8 +307,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 	{
 		int			i,
 					idx;
-		Size		os_page_size = 0;
-		void	  **os_page_ptrs = NULL;
+		Size		os_page_size;
+		void	  **os_page_ptrs;
 		int		   *os_page_status;
 		uint64		os_page_count;
 		int			pages_per_buffer;
@@ -352,8 +352,8 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 		 */
 		startptr = (char *) TYPEALIGN_DOWN(os_page_size,
 										   BufferGetBlock(1));
-		endptr = (char *) TYPEALIGN_DOWN(os_page_size,
-										 (char *) BufferGetBlock(NBuffers) + BLCKSZ);
+		endptr = (char *) TYPEALIGN(os_page_size,
+									(char *) BufferGetBlock(NBuffers) + BLCKSZ);
 		os_page_count = (endptr - startptr) / os_page_size;
 
 		/* Used to determine the NUMA node for all OS pages at once */
@@ -470,8 +470,10 @@ pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
 			/* start of the first page of this buffer */
 			startptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr);
 
-			/* start of the page right after this buffer */
-			endptr_buff = (char *) TYPEALIGN_DOWN(os_page_size, buffptr + BLCKSZ);
+			/* end of the buffer (no need to align to memory page) */
+			endptr_buff = buffptr + BLCKSZ;
+
+			Assert(startptr_buff < endptr_buff);
 
 			/* calculate ID of the first page for this buffer */
 			page_num = (startptr_buff - startptr) / os_page_size;
-- 
2.49.0

Reply via email to