It's been known for a while that Postgres spends a lot of time translating
instruction addresses, and that mapping the text segment with huge pages
yields a substantial performance boost in OLTP workloads [1][2]. The
difficulty is that this normally requires a lot of painstaking manual setup
(unless your OS does automatic superpage promotion, like FreeBSD).

I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
remap the .text segment to huge pages at program start. Attached is a
hackish, Meson-only, "works on my machine" patchset to experiment with this
idea.

0001 adapts the library to our error logging and GUC system. The overview
(a condensed sketch of the sequence follows the list):

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text
segment
- mmap a second region at the aligned start address, with huge pages and
MAP_FIXED
- memcpy over from the temp region and revoke the PROT_WRITE bit
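
A condensed sketch of that sequence, for illustration only (error handling
and the non-huge-page fallback are omitted; the full version is
MoveRegionToLargePages() in 0001, and as explained next, the real function
must not itself live in the range being remapped):

#include <string.h>
#include <sys/mman.h>

static void
remap_text_sketch(char *aligned_start, char *aligned_end)
{
	size_t		size = aligned_end - aligned_start;
	void	   *tmp;

	/* stash a copy of the aligned portion of .text */
	tmp = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memcpy(tmp, aligned_start, size);

	/* replace the original mapping in place, now backed by huge pages */
	mmap(aligned_start, size, PROT_READ | PROT_WRITE | PROT_EXEC,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_HUGETLB, -1, 0);

	/* copy the code back and revoke the write bit */
	memcpy(aligned_start, tmp, size);
	mprotect(aligned_start, size, PROT_READ | PROT_EXEC);
	munmap(tmp, size);
}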

The reason this doesn't "saw off the branch you're standing on" is that the
remapping is done in a function that's forced to live in a different
section, and that doesn't call any non-libc functions living elsewhere:

static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)

Debug messages show

2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text end:   0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text end:   0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  un-mmapping temporary code region

Here, out of ~5MB of Postgres text, only 1 huge page can be used, but that
still replaces 512 4kB TLB entries with a single one and might bring a small
improvement. The un-remapped region below 0x600000 contains the ~600kB of
"cold" code, since the linker puts the cold section first, at least with
recent versions of ld and lld.
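
To make the arithmetic concrete, here is the alignment calculation applied
to the addresses above (a standalone toy, not part of the patch; the helpers
mirror largepage_align_up/down in 0001):

#include <stdint.h>
#include <stdio.h>

#define HUGE_PAGE_SIZE 0x200000UL		/* 2MB */

static uintptr_t align_down(uintptr_t a) { return a & ~(HUGE_PAGE_SIZE - 1); }
static uintptr_t align_up(uintptr_t a)   { return align_down(a + HUGE_PAGE_SIZE - 1); }

int
main(void)
{
	/* addresses from the DEBUG output above */
	printf("aligned start: 0x%lx\n", (unsigned long) align_up(0x487540));	/* 0x600000 */
	printf("aligned end:   0x%lx\n", (unsigned long) align_down(0x96cf12));	/* 0x800000 */
	/* 0x800000 - 0x600000 = one 2MB huge page = 512 4kB TLB entries */
	return 0;
}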

0002 is my attempt to force the linker's hand and get the entire text
segment mapped to huge pages. It's quite a finicky hack, and easily broken
(see below). That said, it still builds easily within our normal build
process, and maybe there is a better way to get the effect.

It does two things:

- Pass the linker -Wl,-zcommon-page-size=2097152
-Wl,-zmax-page-size=2097152, which aligns .init to a 2MB boundary. That's
done for predictability, but it means the next 2MB boundary is very nearly
2MB past the start of .text.

- Add a "cold" __asm__ filler function that just takes up space, enough to
push the end of the .text segment over the next aligned boundary, i.e. to
~8MB in size (a rough size calculation follows this list).
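
Back-of-envelope for the filler size (all numbers are from the readelf
output below and the .fill constant in 0002's filler.c):

  .text without filler (0001):  0x4e59d2 bytes  (~4.9MB)
  asm filler (.fill):           3251000 bytes   (~3.1MB, 0x319b38)
  .text with filler (0002):     0x7ff512 bytes  (~8.0MB)

  .text start (0002):  0x601540
  .text end:           0x601540 + 0x7ff512 = 0xe00a52

so the end spills only ~2.6kB past the 0xe00000 boundary, and the aligned
range 0x800000..0xe00000 (6MB) is what gets remapped.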

In a non-assert build:

0001:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE
 --------------  --------------
  53.7%  4.90Mi  58.7%  4.90Mi    .text
...
 100.0%  9.12Mi 100.0%  8.35Mi    TOTAL

$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [12] .init             PROGBITS        0000000000486000 086000 00001b 00  AX  0   0  4
  [13] .plt              PROGBITS        0000000000486020 086020 001520 10  AX  0   0 16
  [14] .text             PROGBITS        0000000000487540 087540 4e59d2 00  AX  0   0 16
...

0002:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE
 --------------  --------------
  46.9%  8.00Mi  69.9%  8.00Mi    .text
...
 100.0%  17.1Mi 100.0%  11.4Mi    TOTAL


$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
...
  [12] .init             PROGBITS        0000000000600000 200000 00001b 00  AX  0   0  4
  [13] .plt              PROGBITS        0000000000600020 200020 001520 10  AX  0   0 16
  [14] .text             PROGBITS        0000000000601540 201540 7ff512 00  AX  0   0 16
...

Debug messages with 0002 show 6MB mapped:

2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text end:   0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text end:   0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  un-mmapping temporary code region

Since the front of .text is all cold code, and there is very little at the
end, practically all hot pages are now remapped. The biggest problem with
the hackish filler function (in addition to maintainability) is that, if
explicit huge pages are turned off in the kernel, attempting mmap() with
MAP_HUGETLB causes complete startup failure when the .text segment is larger
than 8MB. I haven't looked into what's happening there yet, but I didn't
want to get too far into the weeds before getting feedback on whether the
entire approach in this thread is sound enough to justify working on it
further.
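
For what it's worth, one possible stopgap while diagnosing that (just a
sketch, assuming Linux's /proc/sys/vm/nr_hugepages interface, and not part
of the attached patches) would be to skip the MAP_HUGETLB attempt entirely
when no explicit huge pages are configured:

#include <stdbool.h>
#include <stdio.h>

/* Sketch only: are any explicit huge pages (default size) configured? */
static bool
explicit_huge_pages_configured(void)
{
	FILE	   *f = fopen("/proc/sys/vm/nr_hugepages", "r");
	long		nr = 0;

	if (f == NULL)
		return false;
	if (fscanf(f, "%ld", &nr) != 1)
		nr = 0;
	fclose(f);
	return nr > 0;
}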

[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
    (paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr

-- 
John Naylor
EDB: http://www.enterprisedb.com
From 9cde401f87937c1982f2355c8f81449514166376 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 31 Oct 2022 13:59:30 +0700
Subject: [PATCH v1 2/2] Put all non-cold .text in huge pages

Tell the linker to align addresses on 2MB boundaries. The .init
section will be so aligned, with the .text section soon after it.
Aligning the start address of .text up to the next 2MB boundary
therefore lands nearly 2MB past the actual start, so the first
nearly-2MB of .text will not map to huge pages.

We count on cold sections being linked at the front of the .text
segment: since the cold sections total about 600kB in size, we need
~1.4MB of additional padding to keep non-cold pages mappable to huge
pages. Since PG has about 5.0MB of .text, we also need an additional
~1MB to push the .text end just past an aligned boundary, so that when
we align the end down, only a small number of pages remain un-remapped
at their original 4kB size.
---
 meson.build                  |  3 +++
 src/backend/port/filler.c    | 29 +++++++++++++++++++++++++++++
 src/backend/port/meson.build |  3 +++
 3 files changed, 35 insertions(+)
 create mode 100644 src/backend/port/filler.c

diff --git a/meson.build b/meson.build
index bfacbdc0af..450946370c 100644
--- a/meson.build
+++ b/meson.build
@@ -239,6 +239,9 @@ elif host_system == 'freebsd'
 elif host_system == 'linux'
   sema_kind = 'unnamed_posix'
   cppflags += '-D_GNU_SOURCE'
+  # WIP: debug builds are huge
+  # TODO: add portability check
+  ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']
 
 elif host_system == 'netbsd'
   # We must resolve all dynamic linking in the core server at program start.
diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
new file mode 100644
index 0000000000..de4e33bb05
--- /dev/null
+++ b/src/backend/port/filler.c
@@ -0,0 +1,29 @@
+/*
+ * Add enough padding to .text segment to bring the end just
+ * past a 2MB alignment boundary. In practice, this means .text needs
+ * to be at least 8MB. It shouldn't be much larger than this,
+ * because then more hot pages will remain in 4kB pages.
+ *
+ * FIXME: With this filler added, if explicit huge pages are turned off
+ * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
+ * instead of reporting failure if the .text segment is larger than 8MB.
+ *
+ * See MapStaticCodeToLargePages() in large_page.c
+ *
+ * XXX: The exact amount of filler must be determined experimentally
+ * on platforms of interest, in non-assert builds.
+ *
+ */
+static void
+__attribute__((used))
+__attribute__((cold))
+fill_function(int x)
+{
+	/* TODO: More architectures */
+#ifdef __x86_64__
+__asm__(
+	".fill 3251000"
+);
+#endif
+	(void) x;
+}
\ No newline at end of file
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 5ab65115e9..d876712e0c 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif
 
 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  if host_system == 'linux'
+    backend_sources += files('filler.c')
+  endif
   backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
-- 
2.37.3

From 0012baab70779f5fc06c8717392dc76e8f156270 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v1 1/2] Partly remap the .text segment into huge pages at
 postmaster start

Based on the MIT-licensed library at https://github.com/intel/iodlr

The basic steps are:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap temporary region and memcpy aligned portion of .text segment
- mmap start address to new region with huge pages and MAP_FIXED
- memcpy over and revoke the PROT_WRITE bit

The Postgres .text segment is ~5.0MB in a non-assert build, so this
method can put 2-4MB into huge pages.
---
 src/backend/port/large_page.c       | 348 ++++++++++++++++++++++++++++
 src/backend/port/meson.build        |   1 +
 src/backend/postmaster/postmaster.c |   7 +
 src/include/port/large_page.h       |  18 ++
 4 files changed, 374 insertions(+)
 create mode 100644 src/backend/port/large_page.c
 create mode 100644 src/include/port/large_page.h

diff --git a/src/backend/port/large_page.c b/src/backend/port/large_page.c
new file mode 100644
index 0000000000..66a584f785
--- /dev/null
+++ b/src/backend/port/large_page.c
@@ -0,0 +1,348 @@
+/*-------------------------------------------------------------------------
+ *
+ * large_page.c
+ *	  Map .text segment of binary to huge pages
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/backend/port/large_page.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Based on Intel iodlr library:
+ * https://github.com/intel/iodlr.git
+ * MIT license and copyright notice follows
+ */
+
+/*
+ * Copyright (C) 2018 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom
+ * the Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included
+ * in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
+ * OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
+ * OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * SPDX-License-Identifier: MIT
+ */
+
+#include "postgres.h"
+
+#include <link.h>
+#include <sys/mman.h>
+
+#include "port/large_page.h"
+#include "storage/pg_shmem.h"
+
+typedef struct
+{
+	char	   *from;
+	char	   *to;
+}			mem_range;
+
+typedef struct
+{
+	uintptr_t	start;
+	uintptr_t	end;
+	bool		found;
+}			FindParams;
+
+static inline uintptr_t
+largepage_align_down(uintptr_t addr, size_t hugepagesize)
+{
+	return (addr & ~(hugepagesize - 1));
+}
+
+static inline uintptr_t
+largepage_align_up(uintptr_t addr, size_t hugepagesize)
+{
+	return largepage_align_down(addr + hugepagesize - 1, hugepagesize);
+}
+
+static bool
+FindTextSection(const char *fname, ElfW(Shdr) * text_section)
+{
+	ElfW(Ehdr) ehdr;
+	FILE	   *bin;
+
+	ElfW(Shdr) * shdrs = NULL;
+	ElfW(Shdr) * sh_strab;
+	char	   *section_names = NULL;
+
+	bin = fopen(fname, "r");
+	if (bin == NULL)
+		return false;
+
+	/* Read the header. */
+	if (fread(&ehdr, sizeof(ehdr), 1, bin) != 1)
+		return false;
+
+	/* Read the section headers. */
+	shdrs = (ElfW(Shdr) *) palloc(ehdr.e_shnum * sizeof(ElfW(Shdr)));
+	if (fseek(bin, ehdr.e_shoff, SEEK_SET) != 0)
+		return false;
+	if (fread(shdrs, sizeof(shdrs[0]), ehdr.e_shnum, bin) != ehdr.e_shnum)
+		return false;
+
+	/* Read the string table. */
+	sh_strab = &shdrs[ehdr.e_shstrndx];
+	section_names = palloc(sh_strab->sh_size * sizeof(char));
+
+	if (fseek(bin, sh_strab->sh_offset, SEEK_SET) != 0)
+		return false;
+	if (fread(section_names, sh_strab->sh_size, 1, bin) != 1)
+		return false;
+
+	/* Find the ".text" section. */
+	for (uint32_t idx = 0; idx < ehdr.e_shnum; idx++)
+	{
+		ElfW(Shdr) * sh = &shdrs[idx];
+		if (!memcmp(&section_names[sh->sh_name], ".text", 5))
+		{
+			*text_section = *sh;
+			fclose(bin);
+			return true;
+		}
+	}
+	return false;
+}
+
+/* Callback for dl_iterate_phdr to set the start and end of the .text segment */
+static int
+FindMapping(struct dl_phdr_info *hdr, size_t size, void *data)
+{
+	ElfW(Shdr) text_section;
+	FindParams *find_params = (FindParams *) data;
+
+	/*
+	 * We are only interested in the mapping matching the main executable.
+	 * This has the empty string for a name.
+	 */
+	if (hdr->dlpi_name[0] != '\0')
+		return 0;
+
+	/*
+	 * Open the info structure for the executable on disk to find the location
+	 * of its .text section. We use the base address given to calculate the
+	 * .text section offset in memory.
+	 */
+	text_section.sh_size = 0;
+#ifdef __linux__
+	if (FindTextSection("/proc/self/exe", &text_section))
+	{
+		find_params->start = hdr->dlpi_addr + text_section.sh_addr;
+		find_params->end = find_params->start + text_section.sh_size;
+		find_params->found = true;
+		return 1;
+	}
+#endif
+	return 0;
+}
+
+/*
+ * Identify and return the text segment in the currently mapped memory region.
+ */
+static bool
+FindTextRegion(mem_range * region)
+{
+	FindParams	find_params = {0, 0, false};
+
+	/*
+	 * Note: the upstream source worked with shared libraries as well, hence
+	 * the iteration over all ojects.
+	 * the iteration over all objects.
+	dl_iterate_phdr(FindMapping, &find_params);
+	if (find_params.found)
+	{
+		region->from = (char *) find_params.start;
+		region->to = (char *) find_params.end;
+	}
+
+	return find_params.found;
+}
+
+/*
+ * Move specified region to large pages.
+ *
+ * NB: We need to be very careful:
+ * 1. This function itself should not be moved. We use compiler attributes:
+ *      WIP: if these aren't available, the function should do nothing
+ * (__section__) to put it outside the ".text" section
+ * (__noinline__) to not inline this function
+ *
+ * 2. This function should not call any function(s) that might be moved.
+ */
+static void
+__attribute__((__section__("lpstub")))
+__attribute__((__noinline__))
+MoveRegionToLargePages(const mem_range * r, int mmap_flags)
+{
+	void	   *nmem = MAP_FAILED;
+	void	   *tmem = MAP_FAILED;
+	int			ret = 0;
+	int			mmap_errno = 0;
+	void	   *start = r->from;
+	size_t		size = r->to - r->from;
+	bool		success = false;
+
+	/* Allocate temporary region */
+	nmem = mmap(NULL, size,
+				PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (nmem == MAP_FAILED)
+	{
+		elog(DEBUG1, "failed to allocate temporary region");
+		return;
+	}
+
+	/* copy the original code */
+	memcpy(nmem, r->from, size);
+
+	/*
+	 * mmap using the start address with MAP_FIXED so we get exactly the same
+	 * virtual address. We already know the original page is r-xp (PROT_READ,
+	 * PROT_EXEC, MAP_PRIVATE) We want PROT_WRITE because we are writing into
+	 * PROT_EXEC, MAP_PRIVATE). We want PROT_WRITE because we are writing into
+	 */
+	Assert(mmap_flags & MAP_HUGETLB);
+	tmem = mmap(start, size,
+				PROT_READ | PROT_WRITE | PROT_EXEC,
+				MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | mmap_flags,
+				-1, 0);
+	mmap_errno = errno;
+
+	if (tmem == MAP_FAILED && huge_pages == HUGE_PAGES_ON)
+	{
+		/*
+		 * WIP: need a way for the user to determine total huge pages needed,
+		 * perhaps with shared_memory_size_in_huge_pages
+		 */
+		errno = mmap_errno;
+		ereport(FATAL,
+				errmsg("mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", size),
+				(mmap_errno == ENOMEM) ?
+				errhint("This usually means not enough explicit huge pages were "
+						"configured in the kernel") : 0);
+		goto cleanup_tmp;
+	}
+	else if (tmem == MAP_FAILED)
+	{
+		Assert(huge_pages == HUGE_PAGES_TRY);
+
+		errno = mmap_errno;
+		elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", size);
+
+		/*
+		 * try remapping again with normal pages
+		 *
+		 * XXX we cannot just back out now
+		 */
+		tmem = mmap(start, size,
+					PROT_READ | PROT_WRITE | PROT_EXEC,
+					MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+					-1, 0);
+		mmap_errno = errno;
+
+		if (tmem == MAP_FAILED)
+		{
+			/*
+			 * If we get here we cannot start the server. It's unlikely we
+			 * will fail here after the postmaster successfully set up shared
+			 * memory, but maybe we should have a GUC to turn off code
+			 * remapping, hinted here.
+			 */
+			errno = mmap_errno;
+			ereport(FATAL,
+					errmsg("mmap(%zu) failed for fallback code region: %m", size));
+			goto cleanup_tmp;
+		}
+	}
+	else
+		success = true;
+
+	/* copy the code to the newly mapped area and unset the write bit */
+	memcpy(start, nmem, size);
+	ret = mprotect(start, size, PROT_READ | PROT_EXEC);
+	if (ret < 0)
+	{
+		/* WIP: see note above about GUC and hint */
+		ereport(FATAL,
+				errmsg("failed to protect remapped code pages"));
+
+		/* Cannot start but at least try to clean up after ourselves */
+		munmap(tmem, size);
+		goto cleanup_tmp;
+	}
+
+	if (success)
+		elog(DEBUG1, "binary mapped to huge pages");
+
+cleanup_tmp:
+	/* Release the old/temporary mapped region */
+	elog(DEBUG3, "un-mmapping temporary code region");
+	ret = munmap(nmem, size);
+	if (ret < 0)
+		/* WIP: not sure of severity here */
+		ereport(LOG,
+				errmsg("failed to unmap temporary region"));
+
+	return;
+}
+
+/*  Align the region to be mapped to huge page boundaries. */
+static void
+AlignRegionToPageBoundary(mem_range * r, size_t hugepagesize)
+{
+	r->from = (char *) largepage_align_up((uintptr_t) r->from, hugepagesize);
+	r->to = (char *) largepage_align_down((uintptr_t) r->to, hugepagesize);
+}
+
+
+/*  Map the postgres .text segment into huge pages. */
+void
+MapStaticCodeToLargePages(void)
+{
+	size_t		hugepagesize;
+	int			mmap_flags;
+	mem_range	r = {0};
+
+	if (huge_pages == HUGE_PAGES_OFF)
+		return;
+
+	GetHugePageSize(&hugepagesize, &mmap_flags);
+	if (hugepagesize == 0)
+		return;
+
+	FindTextRegion(&r);
+	if (r.from == NULL || r.to == NULL)
+		return;
+
+	elog(DEBUG3, ".text start: %p", r.from);
+	elog(DEBUG3, ".text end:   %p", r.to);
+
+	AlignRegionToPageBoundary(&r, hugepagesize);
+
+	elog(DEBUG3, "aligned .text start: %p", r.from);
+	elog(DEBUG3, "aligned .text end:   %p", r.to);
+
+	/* check if aligned map region is large enough for huge pages */
+	if (r.to - r.from < hugepagesize || r.from > r.to)
+		return;
+
+	MoveRegionToLargePages(&r, mmap_flags);
+}
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index a22c25dd95..5ab65115e9 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,7 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif
 
 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 30fb576ac3..b30769c2b2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -106,6 +106,7 @@
 #include "pg_getopt.h"
 #include "pgstat.h"
 #include "port/pg_bswap.h"
+#include "port/large_page.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgworker_internals.h"
@@ -1084,6 +1085,12 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	CreateSharedMemoryAndSemaphores();
 
+	/*
+	 * If enough huge pages are available after setting up shared memory, try
+	 * to map the binary code to huge pages.
+	 */
+	MapStaticCodeToLargePages();
+
 	/*
 	 * Estimate number of openable files.  This must happen after setting up
 	 * semaphores, because on some platforms semaphores count as open files.
diff --git a/src/include/port/large_page.h b/src/include/port/large_page.h
new file mode 100644
index 0000000000..171819dd53
--- /dev/null
+++ b/src/include/port/large_page.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * large_page.h
+ *	  Map .text segment of binary to huge pages
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	  src/include/port/large_page.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LARGE_PAGE_H
+#define LARGE_PAGE_H
+
+extern void MapStaticCodeToLargePages(void);
+
+#endif							/* LARGE_PAGE_H */
-- 
2.37.3
