On 30.10.2013 19:11, Andres Freund wrote:
> On 2013-10-30 22:39:20 +0530, Abhijit Menon-Sen wrote:
>> At 2013-10-30 11:04:36 -0400, t...@sss.pgh.pa.us wrote:
>>>
>>> As a compromise, perhaps we can unconditionally round the size up to be
>>> a multiple of 2MB? […]
>>
>> That sounds reasonably painless to me.
>>
>> Here's a patch that does that and adds a DEBUG1 log message when we try
>> with MAP_HUGETLB and fail and fallback to ordinary mmap.
>
> But it's in no way guaranteed that the smallest hugepage size is
> 2MB. It'll be on current x86 hardware, but not on any other platform...

Sure, but there's no big harm done. We're just trying to avoid hitting a kernel bug, and as a bonus, we avoid wasting some memory that would otherwise be lost due to the kernel rounding the allocation. If the smallest hugepage size is smaller than 2MB, we round up the allocation unnecessarily, but that doesn't seem serious.
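
To make the rounding concrete, here is a minimal standalone sketch of it (not the code in the patch). The helper name round_up_to_multiple and the ASSUMED_HUGE_PAGE_SIZE constant are made up for illustration, and the 2MB value is exactly the assumption under discussion: it need not hold on non-x86 platforms.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 2MB huge pages, as on current x86 hardware. */
#define ASSUMED_HUGE_PAGE_SIZE ((size_t) 2 * 1024 * 1024)

/*
 * Round size up to the next multiple of 'multiple'.  Uses modulo rather
 * than bit masking, so 'multiple' need not be a power of two.
 */
static size_t
round_up_to_multiple(size_t size, size_t multiple)
{
	assert(multiple > 0);

	if (size % multiple != 0)
		size += multiple - (size % multiple);
	return size;
}
```

For example, a request of 2MB + 1 bytes becomes 4MB, so the tail of the second huge page is available to us instead of being lost to the kernel's own rounding.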


I spent some time whacking this around; a new patch version is attached. I moved the mmap() code into a new function, which leaves PGSharedMemoryCreate more readable.

I modified the patch so that it throws an error if you set huge_tlb_pages=on and the platform doesn't support MAP_HUGETLB (i.e. non-Linux, or EXEC_BACKEND). 'try' is the default, so this only affects you if you explicitly set it to 'on'. I think that's the right behavior; if you explicitly ask for it and you don't get it, that should be an error. But I'm not wedded to the idea if someone objects; a log message might also be reasonable: "LOG: huge TLB pages are not supported on this platform, but huge_tlb_pages was 'on'"

The error message on failed allocation, if huge_tlb_pages=on, needs updating:

$ bin/postmaster -D data
FATAL:  could not map anonymous shared memory: Cannot allocate memory
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded available memory or swap space. To reduce the request size (currently 189390848 bytes), reduce PostgreSQL's shared memory usage, perhaps by reducing shared_buffers or max_connections.

The reason the allocation failed in this case was that I used huge_tlb_pages=on, but had not configured the kernel for huge pages. The hint is quite misleading in that case; it should advise configuring the kernel or turning off huge_tlb_pages.
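
One possible shape for the fix is to choose the hint based on the setting. A rough sketch, where choose_mmap_hint, the SketchHugeMode enum, and the wording of both hints are placeholders rather than proposed final text:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the patch's HUGE_TLB_* values. */
typedef enum
{
	SKETCH_HUGE_OFF,
	SKETCH_HUGE_ON,
	SKETCH_HUGE_TRY
} SketchHugeMode;

/*
 * Pick a hint for a failed mmap().  Returns NULL when no hint applies.
 * Both hint texts are placeholders, not proposed final wording.
 */
static const char *
choose_mmap_hint(SketchHugeMode mode, int saved_errno)
{
	assert(mode == SKETCH_HUGE_OFF ||
		   mode == SKETCH_HUGE_ON ||
		   mode == SKETCH_HUGE_TRY);

	/* As in the existing code, only ENOMEM gets a hint. */
	if (saved_errno != ENOMEM)
		return NULL;

	/* With "on", the likely cause is missing kernel configuration. */
	if (mode == SKETCH_HUGE_ON)
		return "Configure enough huge pages in the kernel (for example "
			"via the nr_hugepages sysctl), or set huge_tlb_pages to "
			"\"try\" or \"off\".";

	return "Reduce PostgreSQL's shared memory usage, perhaps by reducing "
		"shared_buffers or max_connections.";
}
```

In the real code the chosen text would go through errhint() with the request size filled in; that part is left out here to keep the sketch self-contained.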

The documentation needs some work. I think it's pretty user-unfriendly to link to https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt. It gives a lot of details, and although it explains stuff that is relevant, like setting the nr_hugepages sysctl, it also contains a lot of stuff that is not relevant to us, like how to mount hugetlbfs. Can we do better than that? Is there a better guide somewhere on how to set the kernel settings? If not, we should include step-by-step instructions in our manual.

The "Managing Kernel Resources" section in the user manual should also be updated to mention how to enable huge pages.

Also, now that I changed huge_tlb_pages='on' to fail on platforms where it's not supported at all, the docs need to be updated to reflect it.

- Heikki
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 77a9303..7a60ad0 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1052,6 +1052,41 @@ include 'filename'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-huge-tlb-pages" xreflabel="huge_tlb_pages">
+      <term><varname>huge_tlb_pages</varname> (<type>enum</type>)</term>
+      <indexterm>
+       <primary><varname>huge_tlb_pages</> configuration parameter</primary>
+      </indexterm>
+      <listitem>
+       <para>
+        Enables/disables the use of huge TLB pages. Valid values are
+        <literal>try</literal> (the default), <literal>on</literal>,
+        and <literal>off</literal>.
+       </para>
+
+       <para>
+        At present, this feature is supported only on Linux. The setting
+        is ignored on other systems.
+       </para>
+
+       <para>
+        The use of huge TLB pages results in smaller page tables and
+        less CPU time spent on memory management, increasing performance. For
+        more details, see
+        <ulink url="https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt">hugetlbpage.txt</ulink>
+        in the Linux kernel documentation.
+       </para>
+
+       <para>
+        With <varname>huge_tlb_pages</varname> set to <literal>try</literal>,
+        the server will try to use huge pages, but fall back to using
+        normal allocation if that fails. With <literal>on</literal>, failure
+        to use huge pages will prevent the server from starting up. With
+        <literal>off</literal>, huge pages will not be used.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-temp-buffers" xreflabel="temp_buffers">
       <term><varname>temp_buffers</varname> (<type>integer</type>)</term>
       <indexterm>
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index b604407..3ccd2c2 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -32,6 +32,7 @@
 #include "portability/mem.h"
 #include "storage/ipc.h"
 #include "storage/pg_shmem.h"
+#include "utils/guc.h"
 
 
 typedef key_t IpcMemoryKey;		/* shared memory key passed to shmget(2) */
@@ -41,7 +42,7 @@ typedef int IpcMemoryId;		/* shared memory ID returned by shmget(2) */
 unsigned long UsedShmemSegID = 0;
 void	   *UsedShmemSegAddr = NULL;
 static Size AnonymousShmemSize;
-static void *AnonymousShmem;
+static void *AnonymousShmem = NULL;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -317,6 +318,79 @@ PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2)
 	return true;
 }
 
+/*
+ * Creates an anonymous mmap()ed shared memory segment.
+ *
+ * Pass the desired size in *size. This function will modify *size to the
+ * actual size of the allocation, if it ends up allocating a larger than
+ * desired segment.
+ */
+#ifndef EXEC_BACKEND
+static void *
+CreateAnonymousSegment(Size *size)
+{
+	Size		allocsize;
+	void	   *ptr = MAP_FAILED;
+
+#ifndef MAP_HUGETLB
+	if (huge_tlb_pages == HUGE_TLB_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge TLB pages not supported on this platform")));
+#else
+	if (huge_tlb_pages == HUGE_TLB_ON || huge_tlb_pages == HUGE_TLB_TRY)
+	{
+		/*
+		 * Round up the request size to a suitable large value.
+		 *
+		 * Some Linux kernel versions are known to have a bug, which
+		 * causes mmap() with MAP_HUGETLB to fail if the request size is
+		 * not a multiple of any supported huge page size. To work around
+		 * that, we round up the request size to the nearest multiple of 2MB.
+		 * 2MB is the most common huge page size on affected systems.
+		 *
+		 * Aside from that bug, even with a kernel that does the
+		 * allocation correctly, rounding it up ourselves avoids wasting
+		 * memory. Without it, if we for example make an allocation of
+		 * 2MB + 1 bytes, the kernel might decide to use two 2MB huge
+		 * pages for that, and waste 2MB - 1 bytes of memory. When we do the
+		 * rounding ourselves, we can use that space for allocations.
+		 */
+		int			hugepagesize = 2 * 1024 * 1024;
+
+		allocsize = *size;
+		if (allocsize % hugepagesize != 0)
+			allocsize += hugepagesize - (allocsize % hugepagesize);
+
+		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+				   PG_MMAP_FLAGS | MAP_HUGETLB, -1, 0);
+		if (huge_tlb_pages == HUGE_TLB_TRY && ptr == MAP_FAILED)
+			elog(DEBUG1, "mmap with MAP_HUGETLB failed, huge pages disabled: %m");
+	}
+#endif
+
+	if (ptr == MAP_FAILED && huge_tlb_pages != HUGE_TLB_ON)
+	{
+		allocsize = *size;
+		ptr = mmap(NULL, *size, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS, -1, 0);
+	}
+
+	if (ptr == MAP_FAILED)
+		ereport(FATAL,
+				(errmsg("could not map anonymous shared memory: %m"),
+				 (errno == ENOMEM) ?
+				 errhint("This error usually means that PostgreSQL's request "
+						"for a shared memory segment exceeded available memory "
+						 "or swap space. To reduce the request size (currently "
+						 "%lu bytes), reduce PostgreSQL's shared memory usage, "
+						 "perhaps by reducing shared_buffers or "
+						 "max_connections.",
+						 (unsigned long) *size) : 0));
+
+	*size = allocsize;
+	return ptr;
+}
+#endif
 
 /*
  * PGSharedMemoryCreate
@@ -344,7 +418,14 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
 	PGShmemHeader *hdr;
 	IpcMemoryId shmid;
 	struct stat statbuf;
-	Size		sysvsize = size;
+	Size		sysvsize;
+
+#if defined(EXEC_BACKEND) || !defined(MAP_HUGETLB)
+	if (huge_tlb_pages == HUGE_TLB_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge TLB pages not supported on this platform")));
+#endif
 
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
@@ -359,6 +440,12 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
 	 * to run many copies of PostgreSQL without needing to adjust system
 	 * settings.
 	 *
+	 * We assume that no one will attempt to run PostgreSQL 9.3 or later
+	 * on systems that are ancient enough that anonymous shared memory is
+	 * not supported, such as pre-2.4 versions of Linux.  If that turns
+	 * out to be false, we might need to add a run-time test here and do
+	 * this only if the running kernel supports it.
+	 *
 	 * However, we disable this logic in the EXEC_BACKEND case, and fall back
 	 * to the old method of allocating the entire segment using System V
 	 * shared memory, because there's no way to attach an mmap'd segment to a
@@ -366,44 +453,13 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
 	 * developer use, this shouldn't be a big problem.
 	 */
 #ifndef EXEC_BACKEND
-	{
-		long		pagesize = sysconf(_SC_PAGE_SIZE);
-
-		/*
-		 * Ensure request size is a multiple of pagesize.
-		 *
-		 * pagesize will, for practical purposes, always be a power of two.
-		 * But just in case it isn't, we do it this way instead of using
-		 * TYPEALIGN().
-		 */
-		if (pagesize > 0 && size % pagesize != 0)
-			size += pagesize - (size % pagesize);
+	AnonymousShmem = CreateAnonymousSegment(&size);
+	AnonymousShmemSize = size;
 
-		/*
-		 * We assume that no one will attempt to run PostgreSQL 9.3 or later
-		 * on systems that are ancient enough that anonymous shared memory is
-		 * not supported, such as pre-2.4 versions of Linux.  If that turns
-		 * out to be false, we might need to add a run-time test here and do
-		 * this only if the running kernel supports it.
-		 */
-		AnonymousShmem = mmap(NULL, size, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS,
-							  -1, 0);
-		if (AnonymousShmem == MAP_FAILED)
-			ereport(FATAL,
-					(errmsg("could not map anonymous shared memory: %m"),
-					 (errno == ENOMEM) ?
-				errhint("This error usually means that PostgreSQL's request "
-					 "for a shared memory segment exceeded available memory "
-					  "or swap space. To reduce the request size (currently "
-					  "%lu bytes), reduce PostgreSQL's shared memory usage, "
-						"perhaps by reducing shared_buffers or "
-						"max_connections.",
-						(unsigned long) size) : 0));
-		AnonymousShmemSize = size;
-
-		/* Now we need only allocate a minimal-sized SysV shmem block. */
-		sysvsize = sizeof(PGShmemHeader);
-	}
+	/* Now we need only allocate a minimal-sized SysV shmem block. */
+	sysvsize = sizeof(PGShmemHeader);
+#else
+	sysvsize = size;
 #endif
 
 	/* Make sure PGSharedMemoryAttach doesn't fail without need */
diff --git a/src/backend/port/win32_shmem.c b/src/backend/port/win32_shmem.c
index 0db8e8f..38efe37 100644
--- a/src/backend/port/win32_shmem.c
+++ b/src/backend/port/win32_shmem.c
@@ -128,6 +128,11 @@ PGSharedMemoryCreate(Size size, bool makePrivate, int port)
 	DWORD		size_high;
 	DWORD		size_low;
 
+	if (huge_tlb_pages == HUGE_TLB_ON)
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("huge TLB pages not supported on this platform")));
+
 	/* Room for a header? */
 	Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 54d8078..71b30a2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -64,6 +64,7 @@
 #include "storage/dsm_impl.h"
 #include "storage/standby.h"
 #include "storage/fd.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"
@@ -381,6 +382,23 @@ static const struct config_enum_entry synchronous_commit_options[] = {
 };
 
 /*
+ * Although only "on", "off", "try" are documented, we accept all the likely
+ * variants of "on" and "off".
+ */
+static const struct config_enum_entry huge_tlb_options[] = {
+	{"off", HUGE_TLB_OFF, false},
+	{"on", HUGE_TLB_ON, false},
+	{"try", HUGE_TLB_TRY, false},
+	{"true", HUGE_TLB_ON, true},
+	{"false", HUGE_TLB_OFF, true},
+	{"yes", HUGE_TLB_ON, true},
+	{"no", HUGE_TLB_OFF, true},
+	{"1", HUGE_TLB_ON, true},
+	{"0", HUGE_TLB_OFF, true},
+	{NULL, 0, false}
+};
+
+/*
  * Options for enum values stored in other modules
  */
 extern const struct config_enum_entry wal_level_options[];
@@ -441,6 +459,12 @@ int			tcp_keepalives_interval;
 int			tcp_keepalives_count;
 
 /*
+ * This really belongs in pg_shmem.c, but is defined here so that it doesn't
+ * need to be duplicated in all the different implementations of pg_shmem.c.
+ */
+int			huge_tlb_pages;
+
+/*
  * These variables are all dummies that don't do anything, except in some
  * cases provide the value for SHOW to display.  The real state is elsewhere
  * and is kept in sync by assign_hooks.
@@ -3377,6 +3401,15 @@ static struct config_enum ConfigureNamesEnum[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"huge_tlb_pages", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("Use of huge TLB pages on Linux"),
+			NULL
+		},
+		&huge_tlb_pages,
+		HUGE_TLB_TRY, huge_tlb_options,
+		NULL, NULL, NULL
+	},
 
 	/* End-of-list marker */
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 34a2d05..ed9573a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -113,6 +113,8 @@
 
 #shared_buffers = 32MB			# min 128kB
 					# (change requires restart)
+#huge_tlb_pages = try			# on, off, or try
+					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 251fbdf..3a6cbf5 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -38,6 +38,16 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+/* GUC variable */
+extern int huge_tlb_pages;
+
+/* Possible values for huge_tlb_pages */
+typedef enum
+{
+	HUGE_TLB_OFF,
+	HUGE_TLB_ON,
+	HUGE_TLB_TRY
+} HugeTlbType;
 
 #ifdef EXEC_BACKEND
 #ifndef WIN32