I ran into a situation where a machine with 4 NUMA memory nodes and
40 cores had performance problems due to NUMA. The problems were
worst right after the OS was rebooted and the cache was warmed by
running a script of queries to read all tables. These were all run
on a single connection. As it turned out, the size of the database
was just over one-quarter of the size of RAM, and with default NUMA
policies both the OS cache for the database and the PostgreSQL
shared memory allocation were placed on a single NUMA segment, so
access to the CPU package managing that segment became a
bottleneck. On top of that, processes which happened to run on the
CPU package which had all the cached data had to allocate memory
for local use on more distant memory because there was none left in
the more local memory.
Through normal operations, things eventually tended to shift around
and get better (after several hours of heavy use with substandard
performance). I ran some benchmarks and found that even in
long-running tests, spreading these allocations among the memory
segments showed about a 2% benefit in a read-only load. The
biggest difference I saw in a long-running read-write load was
about a 20% hit for unbalanced allocations, but I only saw that
once. I talked to someone at PGCon who managed to engineer much
worse performance hits for an unbalanced load, although the
circumstances were fairly artificial. Still, fixing this seems
like something worth doing if further benchmarks confirm benefits
at this level.
By default, the OS cache and buffers are allocated in the memory
node with the shortest "distance" from the CPU a process is running
on. This is determined by the "cpuset" associated with the
process which reads or writes the disk page. Typically a NUMA
machine starts with a single cpuset with a policy specifying this
behavior. Fixing this aspect of things seems like an issue for
packagers, although we should probably document it for those
running from their own source builds.
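
To see what "distance" means on a given machine, the kernel's view
can be read from sysfs. The following is a minimal sketch, assuming
a Linux kernel that exposes nodes under /sys/devices/system/node;
the function name list_numa_distances is my own and is not part of
any tool.

```shell
# Hypothetical helper (not part of PostgreSQL or numactl): print each
# memory node with its distance vector. The default root is the usual
# Linux sysfs location; another directory can be passed for testing.
list_numa_distances() {
    node_root="${1:-/sys/devices/system/node}"
    for dir in "$node_root"/node[0-9]*; do
        # Skip the unexpanded glob on machines without NUMA sysfs entries.
        [ -r "$dir/distance" ] || continue
        printf '%s: %s\n' "$(basename "$dir")" "$(cat "$dir/distance")"
    done
}
```

Calling list_numa_distances with no argument on a multi-node machine
should print one line per node; the n-th number on a line is that
node's distance to node n, with the smallest value being the node's
distance to itself.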
To set an alternate policy for PostgreSQL, you first need to find
or create the location for cpuset specification, which uses a
filesystem in a way similar to the /proc directory. On a machine
with more than one memory node, the appropriate filesystem is
probably already mounted, although different distributions use
different filesystem names and mount locations. I will illustrate
the process on my Ubuntu machine. It has only one memory node, so
the policy makes no difference there, but I have it handy at the
moment to confirm the commands as I put them into the email.
# Sysadmin must create the root cpuset if not already done. (On a
# system with NUMA memory, this will probably already be mounted.)
# Location and options can vary by distro.
sudo mkdir /dev/cpuset
sudo mount -t cpuset none /dev/cpuset
# Sysadmin must create a cpuset for postgres and configure
# resources. This will normally be all cores and all RAM. This is
# where we specify that this cpuset will spread pages among its
# memory nodes.
sudo mkdir /dev/cpuset/postgres
sudo /bin/bash -c "echo 0-3 >/dev/cpuset/postgres/cpus"
sudo /bin/bash -c "echo 0 >/dev/cpuset/postgres/mems"
sudo /bin/bash -c "echo 1 >/dev/cpuset/postgres/memory_spread_page"
# Sysadmin must grant write permission on the tasks file to whoever
# starts PostgreSQL. This could be by user or group.
sudo chown postgres /dev/cpuset/postgres/tasks
# The pid of postmaster or an ancestor process must be written to
# the tasks "file" of the cpuset. This can be a shell from which
# pg_ctl is run, at least for bash shells. It could also be
# written by the postmaster itself, essentially as an extra pid
# file. Possible snippet from a service script:
echo $$ >/dev/cpuset/postgres/tasks
pg_ctl start ...
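
A service script might also want to confirm that the cpuset is set
up as intended before joining it. This is only a sketch, assuming
the /dev/cpuset/postgres layout and the unprefixed file names used
above (some distributions use other mount points or names); the
check_cpuset function is hypothetical.

```shell
# Hypothetical helper: report the cpuset's settings and succeed only if
# page cache spreading is enabled. $1 is the cpuset directory.
check_cpuset() {
    cs="${1:-/dev/cpuset/postgres}"
    if [ ! -r "$cs/memory_spread_page" ]; then
        echo "cpuset not found at $cs" >&2
        return 2
    fi
    spread=$(cat "$cs/memory_spread_page")
    echo "cpus=$(cat "$cs/cpus") mems=$(cat "$cs/mems") memory_spread_page=$spread"
    # Exit status of the function is the result of this comparison.
    [ "$spread" = "1" ]
}
```

The snippet above could then be guarded with something like
"check_cpuset /dev/cpuset/postgres || exit 1" ahead of the echo to
the tasks file.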
Where the OS cache is larger than shared_buffers, the above is
probably more important than the attached patch, which causes the
main shared memory segment to be spread among all available memory
nodes. This patch only compiles in the relevant code if configure
is run using the --with-libnuma option, in which case a dependency
on the numa library is created. It is v3 to avoid confusion with
earlier versions I have shared with a few people off-list. (The
only difference from v2 is fixing bitrot.)
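
One way to check whether the interleaving took effect is to look at
the postmaster's numa_maps file, where the kernel lists the memory
policy of each mapping in the second column. A rough sketch,
assuming a Linux kernel built with NUMA support so that
/proc/PID/numa_maps exists; count_interleaved is a name I made up
for illustration.

```shell
# Hypothetical helper: count mappings whose policy (second field of a
# numa_maps line) is "interleave". $1 is the path to a numa_maps file.
count_interleaved() {
    maps="$1"
    awk '$2 ~ /^interleave/ { n++ } END { print n + 0 }' "$maps"
}
```

For a running cluster this could be called as
count_interleaved "/proc/$(head -n 1 "$PGDATA/postmaster.pid")/numa_maps"
by the postgres user or root. A result of zero on a patched build
would suggest the shared memory segment was not interleaved.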
I'll add it to the next CF.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/configure b/configure
index ed1ff0a..79a0dea 100755
--- a/configure
+++ b/configure
@@ -702,6 +702,7 @@ EGREP
GREP
with_zlib
with_system_tzdata
+with_libnuma
with_libxslt
with_libxml
XML2_CONFIG
@@ -831,6 +832,7 @@ with_uuid
with_ossp_uuid
with_libxml
with_libxslt
+with_libnuma
with_system_tzdata
with_zlib
with_gnu_ld
@@ -1518,6 +1520,7 @@ Optional Packages:
--with-ossp-uuid obsolete spelling of --with-uuid=ossp
--with-libxml build with XML support
--with-libxslt use XSLT support when building contrib/xml2
+ --with-libnuma use libnuma for NUMA support
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
@@ -5822,6 +5825,39 @@ fi
+
+#
+# NUMA library
+#
+
+
+
+# Check whether --with-libnuma was given.
+if test "${with_libnuma+set}" = set; then :
+ withval=$with_libnuma;
+ case $withval in
+ yes)
+ $as_echo "#define USE_LIBNUMA 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-libnuma option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_libnuma=no
+
+fi
+
+
+
+
+
+
#
# tzdata
#
@@ -8781,6 +8817,56 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for numa_set_localalloc in -lnuma" >&5
+$as_echo_n "checking for numa_set_localalloc in -lnuma... " >&6; }
+if ${ac_cv_lib_numa_numa_set_localalloc+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lnuma $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char numa_set_localalloc ();
+int
+main ()
+{
+return numa_set_localalloc ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_numa_numa_set_localalloc=yes
+else
+ ac_cv_lib_numa_numa_set_localalloc=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_numa_numa_set_localalloc" >&5
+$as_echo "$ac_cv_lib_numa_numa_set_localalloc" >&6; }
+if test "x$ac_cv_lib_numa_numa_set_localalloc" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBNUMA 1
+_ACEOF
+
+ LIBS="-lnuma $LIBS"
+
+else
+ as_fn_error $? "library 'numa' is required for NUMA support" "$LINENO" 5
+fi
+
+fi
+
# for contrib/sepgsql
if test "$with_selinux" = yes; then
{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for security_compute_create_name in -lselinux" >&5
@@ -9466,6 +9552,17 @@ fi
fi
+if test "$with_libnuma" = yes ; then
+ ac_fn_c_check_header_mongrel "$LINENO" "numa.h" "ac_cv_header_numa_h" "$ac_includes_default"
+if test "x$ac_cv_header_numa_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <numa.h> is required for NUMA support" "$LINENO" 5
+fi
+
+
+fi
+
if test "$with_ldap" = yes ; then
if test "$PORTNAME" != "win32"; then
for ac_header in ldap.h
diff --git a/configure.in b/configure.in
index 80df1d7..fb06737 100644
--- a/configure.in
+++ b/configure.in
@@ -761,6 +761,16 @@ PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
AC_SUBST(with_libxslt)
+
+#
+# NUMA library
+#
+PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA support],
+ [AC_DEFINE([USE_LIBNUMA], 1, [Define to 1 to use NUMA features, like interleaved shared memory. (--with-libnuma)])])
+
+AC_SUBST(with_libnuma)
+
+
#
# tzdata
#
@@ -969,6 +979,10 @@ if test "$with_libxslt" = yes ; then
AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
fi
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_LIB(numa, numa_set_localalloc, [], [AC_MSG_ERROR([library 'numa' is required for NUMA support])])
+fi
+
# for contrib/sepgsql
if test "$with_selinux" = yes; then
AC_CHECK_LIB(selinux, security_compute_create_name, [],
@@ -1097,6 +1111,10 @@ if test "$with_libxslt" = yes ; then
AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
fi
+if test "$with_libnuma" = yes ; then
+ AC_CHECK_HEADER(numa.h, [], [AC_MSG_ERROR([header file <numa.h> is required for NUMA support])])
+fi
+
if test "$with_ldap" = yes ; then
if test "$PORTNAME" != "win32"; then
AC_CHECK_HEADERS(ldap.h, [],
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 7430757..6d6cd10 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -27,6 +27,9 @@
#ifdef HAVE_SYS_SHM_H
#include <sys/shm.h>
#endif
+#ifdef USE_LIBNUMA
+#include <numa.h>
+#endif
#include "miscadmin.h"
#include "portability/mem.h"
@@ -536,6 +539,24 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port,
*/
}
+#ifdef USE_LIBNUMA
+ /*
+ * If this is not a private segment and we are using libnuma, make the
+ * large memory segment interleaved.
+ */
+ if (!makePrivate && numa_available() != -1)
+ {
+ void *start;
+
+ if (AnonymousShmem == NULL)
+ start = memAddress;
+ else
+ start = AnonymousShmem;
+
+ numa_interleave_memory(start, size, numa_all_nodes_ptr);
+ }
+#endif
+
/*
* OK, we created a new segment. Mark it as created by this process. The
* order of assignments here is critical so that another Postgres process
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)