It's been known for a while that Postgres spends a lot of time translating instruction addresses, and that using huge pages for the text segment yields a substantial performance boost in OLTP workloads [1][2]. The difficulty is that this normally requires a lot of painstaking work (unless your OS does superpage promotion, like FreeBSD).
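For anyone wanting to try this out: the remapping below relies on explicit huge pages via MAP_HUGETLB, so the kernel's huge page pool needs a few free pages beyond what shared memory takes. These knobs are the standard Linux ones, nothing specific to the patches (the "64" is just an example figure):

$ grep Huge /proc/meminfo         # HugePages_Total/Free show the explicit pool
$ sudo sysctl vm.nr_hugepages=64  # reserve 64 huge pages (2MB each on x86_64)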
I found an MIT-licensed library "iodlr" from Intel [3] that allows one to remap the .text segment to huge pages at program start. Attached is a hackish, Meson-only, "works on my machine" patchset to experiment with this idea.

0001 adapts the library to our error logging and GUC system. The overview:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text segment
- mmap the aligned start address to a second region with huge pages and MAP_FIXED
- memcpy the code back from the temp region and revoke the PROT_WRITE bit

The reason this doesn't "saw off the branch you're standing on" is that the remapping is done in a function that's forced to live in a different segment, and doesn't call any non-libc functions living elsewhere:

static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)

Debug messages show:

2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  .text end: 0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG:  aligned .text end: 0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG:  un-mmapping temporary code region

Here, out of 5MB of Postgres text, only 1 huge page can be used, but that still saves 512 entries in the TLB and might bring a small improvement. The un-remapped region below 0x600000 contains the ~600kB of "cold" code, since the linker puts the cold section first, at least in recent versions of ld and lld.

0002 is my attempt to force the linker's hand and get the entire text segment mapped to huge pages. It's quite a finicky hack, and easily broken (see below). That said, it still builds easily within our normal build process, and maybe there is a better way to get the same effect. It does two things:

- Pass the linker -Wl,-zcommon-page-size=2097152 -Wl,-zmax-page-size=2097152, which aligns .init to a 2MB boundary. That's done for predictability, but it means the next 2MB boundary is very nearly 2MB away.

- Add a "cold" __asm__ filler function that just takes up space, enough to push the end of the .text segment over the next aligned boundary, to ~8MB in size.

In a non-assert build:

0001:

$ bloaty inst-perf/bin/postgres
    FILE SIZE        VM SIZE
 --------------  --------------
  53.7%  4.90Mi  58.7%  4.90Mi    .text
...
 100.0%  9.12Mi 100.0%  8.35Mi    TOTAL

$ readelf -S --wide inst-perf/bin/postgres
  [Nr] Name   Type      Address          Off    Size   ES Flg Lk Inf Al
...
  [12] .init  PROGBITS  0000000000486000 086000 00001b 00  AX  0   0  4
  [13] .plt   PROGBITS  0000000000486020 086020 001520 10  AX  0   0 16
  [14] .text  PROGBITS  0000000000487540 087540 4e59d2 00  AX  0   0 16
...

0002:

$ bloaty inst-perf/bin/postgres
    FILE SIZE        VM SIZE
 --------------  --------------
  46.9%  8.00Mi  69.9%  8.00Mi    .text
...
 100.0%  17.1Mi 100.0%  11.4Mi    TOTAL

$ readelf -S --wide inst-perf/bin/postgres
  [Nr] Name   Type      Address          Off    Size   ES Flg Lk Inf Al
...
  [12] .init  PROGBITS  0000000000600000 200000 00001b 00  AX  0   0  4
  [13] .plt   PROGBITS  0000000000600020 200020 001520 10  AX  0   0 16
  [14] .text  PROGBITS  0000000000601540 201540 7ff512 00  AX  0   0 16
...
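As a sanity check of the alignment arithmetic, here is a small standalone sketch (not part of the patchset; it hardcodes a 2MB huge page size and the addresses from the 0001 debug output above) showing how the align-up/align-down helpers arrive at those numbers:

#include <stdio.h>
#include <stdint.h>

#define HUGEPAGESIZE ((uintptr_t) 0x200000)	/* assume 2MB huge pages */

static uintptr_t
align_down(uintptr_t addr)
{
	return addr & ~(HUGEPAGESIZE - 1);
}

static uintptr_t
align_up(uintptr_t addr)
{
	return align_down(addr + HUGEPAGESIZE - 1);
}

int
main(void)
{
	uintptr_t	start = 0x487540;	/* .text start from the log above */
	uintptr_t	end = 0x96cf12;		/* .text end from the log above */

	/*
	 * The start rounds up and the end rounds down, so the remapped range
	 * never strays outside the original .text boundaries.
	 */
	printf("aligned start: 0x%lx\n", (unsigned long) align_up(start));	/* 0x600000 */
	printf("aligned end:   0x%lx\n", (unsigned long) align_down(end));	/* 0x800000 */
	printf("huge pages:    %lu\n",
		   (unsigned long) ((align_down(end) - align_up(start)) / HUGEPAGESIZE)); /* 1 */
	return 0;
}

With 0002's addresses (0x601540 and 0xe00a52), the same arithmetic yields 0x800000 through 0xe00000, i.e. the three 2MB pages (6MB) seen below.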
Debug messages with 0002 show 6MB mapped:

2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  .text end: 0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG:  aligned .text end: 0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG:  un-mmapping temporary code region

Since the front is all cold code, and there is very little at the end, practically all hot pages are now remapped.

The biggest problem with the hackish filler function (in addition to maintainability) is that, if explicit huge pages are turned off in the kernel, attempting mmap() with MAP_HUGETLB causes complete startup failure if the .text segment is larger than 8MB. I haven't looked into what's happening there yet, but I didn't want to get too far into the weeds before getting feedback on whether the entire approach in this thread is sound enough to justify working on it further.

[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf (paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr

--
John Naylor
EDB: http://www.enterprisedb.com
From 9cde401f87937c1982f2355c8f81449514166376 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 31 Oct 2022 13:59:30 +0700
Subject: [PATCH v1 2/2] Put all non-cold .text in huge pages

Tell the linker to align addresses on 2MB boundaries. The .init
section will be so aligned, with the .text section soon after that.
Therefore, the start address of .text must always be aligned up to
nearly 2MB ahead of the actual start. The first nearly-2MB of .text
will not map to huge pages.

We count on cold sections linking to the front of the .text segment:
since the cold sections total about 600kB in size, we need ~1.4MB of
additional padding to keep non-cold pages mappable to huge pages.
Since PG has about 5.0MB of .text, we also need an additional 1MB to
push the .text end just past an aligned boundary, so when we align
the end down, only a small number of pages will remain un-remapped at
their original 4kB size.
---
 meson.build                  |  3 +++
 src/backend/port/filler.c    | 29 +++++++++++++++++++++++++++++
 src/backend/port/meson.build |  3 +++
 3 files changed, 35 insertions(+)
 create mode 100644 src/backend/port/filler.c

diff --git a/meson.build b/meson.build
index bfacbdc0af..450946370c 100644
--- a/meson.build
+++ b/meson.build
@@ -239,6 +239,9 @@ elif host_system == 'freebsd'
 elif host_system == 'linux'
   sema_kind = 'unnamed_posix'
   cppflags += '-D_GNU_SOURCE'
+  # WIP: debug builds are huge
+  # TODO: add portability check
+  ldflags += ['-Wl,-zcommon-page-size=2097152', '-Wl,-zmax-page-size=2097152']
 
 elif host_system == 'netbsd'
   # We must resolve all dynamic linking in the core server at program start.
diff --git a/src/backend/port/filler.c b/src/backend/port/filler.c
new file mode 100644
index 0000000000..de4e33bb05
--- /dev/null
+++ b/src/backend/port/filler.c
@@ -0,0 +1,29 @@
+/*
+ * Add enough padding to the .text segment to bring the end just
+ * past a 2MB alignment boundary. In practice, this means .text needs
+ * to be at least 8MB. It shouldn't be much larger than this,
+ * because then more hot pages will remain in 4kB pages.
+ *
+ * FIXME: With this filler added, if explicit huge pages are turned off
+ * in the kernel, attempting mmap() with MAP_HUGETLB causes a crash
+ * instead of reporting failure if the .text segment is larger than 8MB.
+ *
+ * See MapStaticCodeToLargePages() in large_page.c
+ *
+ * XXX: The exact amount of filler must be determined experimentally
+ * on platforms of interest, in non-assert builds.
+ *
+ */
+static void
+__attribute__((used))
+__attribute__((cold))
+fill_function(int x)
+{
+	/* TODO: More architectures */
+#ifdef __x86_64__
+__asm__(
+	".fill 3251000"
+);
+#endif
+	(void) x;
+}
\ No newline at end of file
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index 5ab65115e9..d876712e0c 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,9 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif
 
 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  if host_system == 'linux'
+    backend_sources += files('filler.c')
+  endif
   backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
-- 
2.37.3
From 0012baab70779f5fc06c8717392dc76e8f156270 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Mon, 31 Oct 2022 15:24:29 +0700
Subject: [PATCH v1 1/2] Partly remap the .text segment into huge pages at
 postmaster start

Based on the MIT-licensed library at https://github.com/intel/iodlr

The basic steps are:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text segment
- mmap the start address to a new region with huge pages and MAP_FIXED
- memcpy the code back and revoke the PROT_WRITE bit

The Postgres .text segment is ~5.0MB in a non-assert build, so this
method can put 2-4MB into huge pages.
---
 src/backend/port/large_page.c       | 348 ++++++++++++++++++++++++++++
 src/backend/port/meson.build        |   1 +
 src/backend/postmaster/postmaster.c |   7 +
 src/include/port/large_page.h       |  18 ++
 4 files changed, 374 insertions(+)
 create mode 100644 src/backend/port/large_page.c
 create mode 100644 src/include/port/large_page.h

diff --git a/src/backend/port/large_page.c b/src/backend/port/large_page.c
new file mode 100644
index 0000000000..66a584f785
--- /dev/null
+++ b/src/backend/port/large_page.c
@@ -0,0 +1,348 @@
+/*-------------------------------------------------------------------------
+ *
+ * large_page.c
+ *	  Map .text segment of binary to huge pages
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/port/large_page.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/*
+ * Based on the Intel iodlr library:
+ * https://github.com/intel/iodlr.git
+ * MIT license and copyright notice follow
+ */
+
+/*
+ * Copyright (C) 2018 Intel Corporation
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sublicense,
+ * and/or sell copies of the Software, and to permit persons to whom
+ * the Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included
+ * in all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
+ * OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES
+ * OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+ * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
+ * OR OTHER DEALINGS IN THE SOFTWARE.
+ *
+ * SPDX-License-Identifier: MIT
+ */
+
+#include "postgres.h"
+
+#include <link.h>
+#include <sys/mman.h>
+
+#include "port/large_page.h"
+#include "storage/pg_shmem.h"
+
+typedef struct
+{
+	char	   *from;
+	char	   *to;
+} mem_range;
+
+typedef struct
+{
+	uintptr_t	start;
+	uintptr_t	end;
+	bool		found;
+} FindParams;
+
+static inline uintptr_t
+largepage_align_down(uintptr_t addr, size_t hugepagesize)
+{
+	return (addr & ~(hugepagesize - 1));
+}
+
+static inline uintptr_t
+largepage_align_up(uintptr_t addr, size_t hugepagesize)
+{
+	return largepage_align_down(addr + hugepagesize - 1, hugepagesize);
+}
+
+static bool
+FindTextSection(const char *fname, ElfW(Shdr) * text_section)
+{
+	ElfW(Ehdr) ehdr;
+	FILE	   *bin;
+	ElfW(Shdr) * shdrs = NULL;
+	ElfW(Shdr) * sh_strab;
+	char	   *section_names = NULL;
+
+	bin = fopen(fname, "r");
+	if (bin == NULL)
+		return false;
+
+	/* Read the header. */
+	if (fread(&ehdr, sizeof(ehdr), 1, bin) != 1)
+		return false;
+
+	/* Read the section headers. */
+	shdrs = (ElfW(Shdr) *) palloc(ehdr.e_shnum * sizeof(ElfW(Shdr)));
+	if (fseek(bin, ehdr.e_shoff, SEEK_SET) != 0)
+		return false;
+	if (fread(shdrs, sizeof(shdrs[0]), ehdr.e_shnum, bin) != ehdr.e_shnum)
+		return false;
+
+	/* Read the string table. */
+	sh_strab = &shdrs[ehdr.e_shstrndx];
+	section_names = palloc(sh_strab->sh_size * sizeof(char));
+
+	if (fseek(bin, sh_strab->sh_offset, SEEK_SET) != 0)
+		return false;
+	if (fread(section_names, sh_strab->sh_size, 1, bin) != 1)
+		return false;
+
+	/* Find the ".text" section. */
+	for (uint32_t idx = 0; idx < ehdr.e_shnum; idx++)
+	{
+		ElfW(Shdr) * sh = &shdrs[idx];
+
+		if (!memcmp(&section_names[sh->sh_name], ".text", 5))
+		{
+			*text_section = *sh;
+			fclose(bin);
+			return true;
+		}
+	}
+	return false;
+}
+
+/* Callback for dl_iterate_phdr to set the start and end of the .text segment */
+static int
+FindMapping(struct dl_phdr_info *hdr, size_t size, void *data)
+{
+	ElfW(Shdr) text_section;
+	FindParams *find_params = (FindParams *) data;
+
+	/*
+	 * We are only interested in the mapping matching the main executable.
+	 * This has the empty string for a name.
+	 */
+	if (hdr->dlpi_name[0] != '\0')
+		return 0;
+
+	/*
+	 * Open the info structure for the executable on disk to find the
+	 * location of its .text section. We use the base address given to
+	 * calculate the .text section offset in memory.
+	 */
+	text_section.sh_size = 0;
+#ifdef __linux__
+	if (FindTextSection("/proc/self/exe", &text_section))
+	{
+		find_params->start = hdr->dlpi_addr + text_section.sh_addr;
+		find_params->end = find_params->start + text_section.sh_size;
+		find_params->found = true;
+		return 1;
+	}
+#endif
+	return 0;
+}
+
+/*
+ * Identify and return the text segment in the currently mapped memory region.
+ */
+static bool
+FindTextRegion(mem_range * region)
+{
+	FindParams	find_params = {0, 0, false};
+
+	/*
+	 * Note: the upstream source worked with shared libraries as well, hence
+	 * the iteration over all objects.
+	 */
+	dl_iterate_phdr(FindMapping, &find_params);
+	if (find_params.found)
+	{
+		region->from = (char *) find_params.start;
+		region->to = (char *) find_params.end;
+	}
+
+	return find_params.found;
+}
+
+/*
+ * Move specified region to large pages.
+ *
+ * NB: We need to be very careful:
+ * 1. This function itself should not be moved. We use compiler attributes:
+ *    WIP: if these aren't available, the function should do nothing
+ *    (__section__) to put it outside the ".text" section
+ *    (__noinline__) to not inline this function
+ *
+ * 2. This function should not call any function(s) that might be moved.
+ */
+static void
+__attribute__((__section__("lpstub")))
+__attribute__((__noinline__))
+MoveRegionToLargePages(const mem_range * r, int mmap_flags)
+{
+	void	   *nmem = MAP_FAILED;
+	void	   *tmem = MAP_FAILED;
+	int			ret = 0;
+	int			mmap_errno = 0;
+	void	   *start = r->from;
+	size_t		size = r->to - r->from;
+	bool		success = false;
+
+	/* Allocate temporary region */
+	nmem = mmap(NULL, size,
+				PROT_READ | PROT_WRITE,
+				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (nmem == MAP_FAILED)
+	{
+		elog(DEBUG1, "failed to allocate temporary region");
+		return;
+	}
+
+	/* copy the original code */
+	memcpy(nmem, r->from, size);
+
+	/*
+	 * mmap using the start address with MAP_FIXED so we get exactly the same
+	 * virtual address. We already know the original page is r-xp (PROT_READ,
+	 * PROT_EXEC, MAP_PRIVATE). We want PROT_WRITE because we are writing
+	 * into it.
+	 */
+	Assert(mmap_flags & MAP_HUGETLB);
+	tmem = mmap(start, size,
+				PROT_READ | PROT_WRITE | PROT_EXEC,
+				MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | mmap_flags,
+				-1, 0);
+	mmap_errno = errno;
+
+	if (tmem == MAP_FAILED && huge_pages == HUGE_PAGES_ON)
+	{
+		/*
+		 * WIP: need a way for the user to determine total huge pages needed,
+		 * perhaps with shared_memory_size_in_huge_pages
+		 */
+		errno = mmap_errno;
+		ereport(FATAL,
+				errmsg("mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", size),
+				(mmap_errno == ENOMEM) ?
+				errhint("This usually means not enough explicit huge pages were "
+						"configured in the kernel") : 0);
+		goto cleanup_tmp;
+	}
+	else if (tmem == MAP_FAILED)
+	{
+		Assert(huge_pages == HUGE_PAGES_TRY);
+
+		errno = mmap_errno;
+		elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", size);
+
+		/*
+		 * try remapping again with normal pages
+		 *
+		 * XXX we cannot just back out now
+		 */
+		tmem = mmap(start, size,
+					PROT_READ | PROT_WRITE | PROT_EXEC,
+					MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
+					-1, 0);
+		mmap_errno = errno;
+
+		if (tmem == MAP_FAILED)
+		{
+			/*
+			 * If we get here we cannot start the server. It's unlikely we
+			 * will fail here after the postmaster successfully set up shared
+			 * memory, but maybe we should have a GUC to turn off code
+			 * remapping, hinted here.
+			 */
+			errno = mmap_errno;
+			ereport(FATAL,
+					errmsg("mmap(%zu) failed for fallback code region: %m", size));
+			goto cleanup_tmp;
+		}
+	}
+	else
+		success = true;
+
+	/* copy the code to the newly mapped area and unset the write bit */
+	memcpy(start, nmem, size);
+	ret = mprotect(start, size, PROT_READ | PROT_EXEC);
+	if (ret < 0)
+	{
+		/* WIP: see note above about GUC and hint */
+		ereport(FATAL,
+				errmsg("failed to protect remapped code pages"));
+
+		/* Cannot start but at least try to clean up after ourselves */
+		munmap(tmem, size);
+		goto cleanup_tmp;
+	}
+
+	if (success)
+		elog(DEBUG1, "binary mapped to huge pages");
+
+cleanup_tmp:
+	/* Release the old/temporary mapped region */
+	elog(DEBUG3, "un-mmapping temporary code region");
+	ret = munmap(nmem, size);
+	if (ret < 0)
+		/* WIP: not sure of severity here */
+		ereport(LOG,
+				errmsg("failed to unmap temporary region"));
+
+	return;
+}
+
+/* Align the region to be mapped to huge page boundaries. */
+static void
+AlignRegionToPageBoundary(mem_range * r, size_t hugepagesize)
+{
+	r->from = (char *) largepage_align_up((uintptr_t) r->from, hugepagesize);
+	r->to = (char *) largepage_align_down((uintptr_t) r->to, hugepagesize);
+}
+
+/* Map the postgres .text segment into huge pages. */
+void
+MapStaticCodeToLargePages(void)
+{
+	size_t		hugepagesize;
+	int			mmap_flags;
+	mem_range	r = {0};
+
+	if (huge_pages == HUGE_PAGES_OFF)
+		return;
+
+	GetHugePageSize(&hugepagesize, &mmap_flags);
+	if (hugepagesize == 0)
+		return;
+
+	FindTextRegion(&r);
+	if (r.from == NULL || r.to == NULL)
+		return;
+
+	elog(DEBUG3, ".text start: %p", r.from);
+	elog(DEBUG3, ".text end: %p", r.to);
+
+	AlignRegionToPageBoundary(&r, hugepagesize);
+
+	elog(DEBUG3, "aligned .text start: %p", r.from);
+	elog(DEBUG3, "aligned .text end: %p", r.to);
+
+	/* check if the aligned map region is large enough for huge pages */
+	if (r.to - r.from < hugepagesize || r.from > r.to)
+		return;
+
+	MoveRegionToLargePages(&r, mmap_flags);
+}
diff --git a/src/backend/port/meson.build b/src/backend/port/meson.build
index a22c25dd95..5ab65115e9 100644
--- a/src/backend/port/meson.build
+++ b/src/backend/port/meson.build
@@ -16,6 +16,7 @@ if cdata.has('USE_WIN32_SEMAPHORES')
 endif
 
 if cdata.has('USE_SYSV_SHARED_MEMORY')
+  backend_sources += files('large_page.c')
   backend_sources += files('sysv_shmem.c')
 endif
 
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 30fb576ac3..b30769c2b2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -106,6 +106,7 @@
 #include "pg_getopt.h"
 #include "pgstat.h"
 #include "port/pg_bswap.h"
+#include "port/large_page.h"
 #include "postmaster/autovacuum.h"
 #include "postmaster/auxprocess.h"
 #include "postmaster/bgworker_internals.h"
@@ -1084,6 +1085,12 @@ PostmasterMain(int argc, char *argv[])
 	 */
 	CreateSharedMemoryAndSemaphores();
 
+	/*
+	 * If enough huge pages are available after setting up shared memory, try
+	 * to map the binary code to huge pages.
+	 */
+	MapStaticCodeToLargePages();
+
 	/*
 	 * Estimate number of openable files. This must happen after setting up
 	 * semaphores, because on some platforms semaphores count as open files.
diff --git a/src/include/port/large_page.h b/src/include/port/large_page.h
new file mode 100644
index 0000000000..171819dd53
--- /dev/null
+++ b/src/include/port/large_page.h
@@ -0,0 +1,18 @@
+/*-------------------------------------------------------------------------
+ *
+ * large_page.h
+ *	  Map .text segment of binary to huge pages
+ *
+ * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/port/large_page.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef LARGE_PAGE_H
+#define LARGE_PAGE_H
+
+extern void MapStaticCodeToLargePages(void);
+
+#endif							/* LARGE_PAGE_H */
-- 
2.37.3