> Documentation syntax error "<literal>2MB<literal>" shows up as:
Oops, sorry, should be fixed now.

> The build is currently failing on Windows:

Ah, thanks. Looks like the Windows stuff isn't autogenerated, so maybe
this new patch works.

> When using huge_pages=on, huge_page_size=1GB, but default
> shared_buffers, I noticed that the error message reports the wrong
> (unrounded) size in this message:

Ah, yes, that is correct. Switched to printing the _real_ allocsize now!

> 1GB pages are so big that it becomes a little tricky to set shared
> buffers large enough without wasting RAM. What I mean is, if I want
> to use shared_buffers=16GB, I need to have at least 17 huge pages
> available, but the 17th page is nearly entirely wasted! Imagine that
> on POWER 16GB pages. That makes me wonder if we should actually
> redefine these GUCs differently so that you state the total, or at
> least use the rounded memory for buffers... I think we could consider
> that to be a separate problem with a separate patch though.

Yes, that is a good point! But as you say, I guess that fits better in
another patch.

> Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a
> 3.5GB table against itself. [...]

Thanks for the results! I will look into your patch when I get time, but
it certainly looks cool! I have a 4-node NUMA machine with ~100GiB of
memory and a single-node NUMA machine, so I'll run some benchmarks when
I get time!

> I wondered if this was something to do
> with NUMA effects on this two node box, so I tried running that again
> with postgres under numactl --cpunodebind 0 --membind 0 and I got: [...]

Yes, making this "properly" NUMA aware to avoid/limit cross-NUMA memory
access is kinda tricky.
When reserving huge pages, they are distributed more or less evenly
between the nodes. The per-node counts can be found by using
`grep -R "" /sys/devices/system/node/node*/hugepages/hugepages-*/nr_hugepages`
(they can also be written to), so there _may_ be a chance that the huge
pages you got were on another node than 0 (due to the fact that there
were not enough on node 0), but that is just guessing.

On Thu, 18 Jun 2020 at 06:01, Thomas Munro <thomas.mu...@gmail.com> wrote:
>
> Hi Odin,
>
> Documentation syntax error "<literal>2MB<literal>" shows up as:
>
> config.sgml:1605: parser error : Opening and ending tag mismatch:
> literal line 1602 and para
> </para>
> ^
>
> Please install the documentation tools
> https://www.postgresql.org/docs/devel/docguide-toolsets.html, rerun
> configure and "make docs" to see these kinds of errors.
>
> The build is currently failing on Windows:
>
> undefined symbol: HAVE_DECL_MAP_HUGE_MASK at src/include/pg_config.h
> line 143 at src/tools/msvc/Mkvcbuild.pm line 851.
>
> I think that's telling us that you need to add this stuff into
> src/tools/msvc/Solution.pm, so that we can say it doesn't have it. I
> don't have Windows but whenever you post a new version we'll see if
> Windows likes it here:
>
> http://cfbot.cputube.org/odin-ugedal.html
>
> When using huge_pages=on, huge_page_size=1GB, but default
> shared_buffers, I noticed that the error message reports the wrong
> (unrounded) size in this message:
>
> 2020-06-18 02:06:30.407 UTC [73552] HINT: This error usually means
> that PostgreSQL's request for a shared memory segment exceeded
> available memory, swap space, or huge pages. To reduce the request
> size (currently 149069824 bytes), reduce PostgreSQL's shared memory
> usage, perhaps by reducing shared_buffers or max_connections.
>
> The request size was actually:
>
> mmap(NULL, 1073741824, PROT_READ|PROT_WRITE,
> MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB|30<<MAP_HUGE_SHIFT, -1, 0) = -1
> ENOMEM (Cannot allocate memory)
>
> 1GB pages are so big that it becomes a little tricky to set shared
> buffers large enough without wasting RAM. What I mean is, if I want
> to use shared_buffers=16GB, I need to have at least 17 huge pages
> available, but the 17th page is nearly entirely wasted! Imagine that
> on POWER 16GB pages. That makes me wonder if we should actually
> redefine these GUCs differently so that you state the total, or at
> least use the rounded memory for buffers... I think we could consider
> that to be a separate problem with a separate patch though.
>
> Just for fun, I compared 4KB, 2MB and 1GB pages for a hash join of a
> 3.5GB table against itself. Hash joins are the perfect way to
> exercise the TLB because they're very likely to miss. I also applied
> my patch[1] to allow parallel queries to use shared memory from the
> main shared memory area, so that they benefit from the configured page
> size, using pages that are allocated once at start up. (Without that,
> you'd have to mess around with /dev/shm mount options, and then hope
> that pages were available at query time, and it'd also be slower for
> other stupid implementation reasons).
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> # echo 8500 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
> # echo 17 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
>
> shared_buffers=8GB
> dynamic_shared_memory_main_size=8GB
>
> create table t as select generate_series(1, 100000000)::int i;
> alter table t set (parallel_workers = 7);
> create extension pg_prewarm;
> select pg_prewarm('t');
> set max_parallel_workers_per_gather=7;
> set work_mem='1GB';
>
> select count(*) from t t1 join t t2 using (i);
>
> 4KB pages: 12.42 seconds
> 2MB pages: 9.12 seconds
> 1GB pages: 9.07 seconds
>
> Unfortunately I can't access the TLB miss counters on this system due
> to virtualisation restrictions, and the systems where I can don't have
> 1GB pages. According to cpuid(1) this system has a fairly typical
> setup:
>
> cache and TLB information (2):
> 0x63: data TLB: 2M/4M pages, 4-way, 32 entries
> data TLB: 1G pages, 4-way, 4 entries
> 0x03: data TLB: 4K pages, 4-way, 64 entries
>
> This operation is touching about 8GB of data (scanning 3.5GB of table,
> building a 4.5GB hash table) so 4 x 1GB is not enough to do this without
> TLB misses.
>
> Let's try that again, except this time with shared_buffers=4GB,
> dynamic_shared_memory_main_size=4GB, and only half as many tuples in
> t, so it ought to fit:
>
> 4KB pages: 6.37 seconds
> 2MB pages: 4.96 seconds
> 1GB pages: 5.07 seconds
>
> Well that's disappointing. I wondered if this was something to do
> with NUMA effects on this two node box, so I tried running that again
> with postgres under numactl --cpunodebind 0 --membind 0 and I got:
>
> 4KB pages: 5.43 seconds
> 2MB pages: 4.05 seconds
> 1GB pages: 4.00 seconds
>
> From this I can't really conclude that it's terribly useful to use
> larger page sizes, but it's certainly useful to have the ability to do
> further testing using the proposed GUC.
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
From fa3b30a32032bf38c8dc72de9656526a5d5e8daa Mon Sep 17 00:00:00 2001
From: Odin Ugedal <o...@ugedal.com>
Date: Sun, 7 Jun 2020 21:04:57 +0200
Subject: [PATCH v4] Add support for choosing huge page size

This adds support for using non-default huge page sizes for shared
memory. This is achieved via the new "huge_page_size" config entry.
The config value defaults to 0, meaning it will use the system default.
---
 configure                                     | 26 +++++++
 configure.in                                  |  4 ++
 doc/src/sgml/config.sgml                      | 27 ++++++++
 doc/src/sgml/runtime.sgml                     | 41 ++++++-----
 src/backend/port/sysv_shmem.c                 | 69 ++++++++++++++-----
 src/backend/utils/misc/guc.c                  | 25 +++++++
 src/backend/utils/misc/postgresql.conf.sample |  2 +
 src/include/pg_config.h.in                    |  8 +++
 src/include/pg_config_manual.h                |  6 ++
 src/include/storage/pg_shmem.h                |  1 +
 src/tools/msvc/Solution.pm                    |  2 +
 11 files changed, 179 insertions(+), 32 deletions(-)

diff --git a/configure b/configure
index 2feff37fe3..11e3112ee4 100755
--- a/configure
+++ b/configure
@@ -15488,6 +15488,32 @@ _ACEOF
 fi
 # fi
+# Check if system supports mmap flags for allocating huge page memory with page sizes
+# other than the default
+ac_fn_c_check_decl "$LINENO" "MAP_HUGE_MASK" "ac_cv_have_decl_MAP_HUGE_MASK" "#include <sys/mman.h>
+"
+if test "x$ac_cv_have_decl_MAP_HUGE_MASK" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_MAP_HUGE_MASK $ac_have_decl
+_ACEOF
+ac_fn_c_check_decl "$LINENO" "MAP_HUGE_SHIFT" "ac_cv_have_decl_MAP_HUGE_SHIFT" "#include <sys/mman.h>
+"
+if test "x$ac_cv_have_decl_MAP_HUGE_SHIFT" = xyes; then :
+  ac_have_decl=1
+else
+  ac_have_decl=0
+fi
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE_DECL_MAP_HUGE_SHIFT $ac_have_decl
+_ACEOF
+
+
 ac_fn_c_check_decl "$LINENO" "fdatasync" "ac_cv_have_decl_fdatasync" "#include <unistd.h>
 "
 if test "x$ac_cv_have_decl_fdatasync" = xyes; then :
diff --git a/configure.in b/configure.in
index 0188c6ff07..f56c06eb3d 100644
--- a/configure.in
+++ b/configure.in
@@ -1687,6 +1687,10 @@ AC_CHECK_FUNCS(posix_fadvise)
 AC_CHECK_DECLS(posix_fadvise, [], [], [#include <fcntl.h>])
 ])
 # fi
+# Check if system supports mmap flags for allocating huge page memory with page sizes
+# other than the default
+AC_CHECK_DECLS([MAP_HUGE_MASK, MAP_HUGE_SHIFT], [], [], [#include <sys/mman.h>])
+
 AC_CHECK_DECLS(fdatasync, [], [], [#include <unistd.h>])
 AC_CHECK_DECLS([strlcat, strlcpy, strnlen])
 # This is probably only present on macOS, but may as well check always
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index aca8f73a50..8f73720327 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1582,6 +1582,33 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-huge-page-size" xreflabel="huge_page_size">
+      <term><varname>huge_page_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>huge_page_size</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Controls what size of huge pages is used in conjunction with
+        <xref linkend="guc-huge-pages"/>.
+        The default is zero (<literal>0</literal>).
+        When set to <literal>0</literal>, the default huge page size on the system will
+        be used.
+       </para>
+       <para>
+        Some commonly available page sizes on modern 64 bit server architectures include:
+        <literal>2MB</literal> and <literal>1GB</literal> (Intel and AMD), <literal>16MB</literal> and
+        <literal>16GB</literal> (IBM POWER), and <literal>64kB</literal>, <literal>2MB</literal>,
+        <literal>32MB</literal> and <literal>1GB</literal> (ARM). For more information
+        about usage and support, see <xref linkend="linux-huge-pages"/>.
+       </para>
+       <para>
+        Controlling huge page size is currently not supported on Windows.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-temp-buffers" xreflabel="temp_buffers">
       <term><varname>temp_buffers</varname> (<type>integer</type>)
       <indexterm>
diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml
index 88210c4a5d..cbdbcb4fdf 100644
--- a/doc/src/sgml/runtime.sgml
+++ b/doc/src/sgml/runtime.sgml
@@ -1391,41 +1391,50 @@ export PG_OOM_ADJUST_VALUE=0
     using large values of <xref linkend="guc-shared-buffers"/>. To use this
     feature in <productname>PostgreSQL</productname> you need a kernel
     with <varname>CONFIG_HUGETLBFS=y</varname> and
-    <varname>CONFIG_HUGETLB_PAGE=y</varname>. You will also have to adjust
-    the kernel setting <varname>vm.nr_hugepages</varname>. To estimate the
-    number of huge pages needed, start <productname>PostgreSQL</productname>
-    without huge pages enabled and check the
-    postmaster's anonymous shared memory segment size, as well as the system's
-    huge page size, using the <filename>/proc</filename> file system. This might
-    look like:
+    <varname>CONFIG_HUGETLB_PAGE=y</varname>. You will also have to pre-allocate
+    huge pages with the desired huge page size. To estimate the number of
+    huge pages needed, start <productname>PostgreSQL</productname> without huge
+    pages enabled and check the postmaster's anonymous shared memory segment size,
+    as well as the system's supported huge page sizes, using the
+    <filename>/sys</filename> file system. This might look like:
 <programlisting>
 $ <userinput>head -1 $PGDATA/postmaster.pid</userinput>
 4170
 $ <userinput>pmap 4170 | awk '/rw-s/ && /zero/ {print $2}'</userinput>
 6490428K
+$ <userinput>ls /sys/kernel/mm/hugepages</userinput>
+hugepages-1048576kB hugepages-2048kB
+</programlisting>
+
+    You can now choose between the supported sizes, 2MiB and 1GiB in this case.
+    By default <productname>PostgreSQL</productname> will use the default huge
+    page size on the system, but that can be configured via
+    <xref linkend="guc-huge-page-size"/>.
+    The default huge page size can be found with:
+<programlisting>
 $ <userinput>grep ^Hugepagesize /proc/meminfo</userinput>
 Hugepagesize:       2048 kB
 </programlisting>
+
+    For <literal>2MiB</literal>, <literal>6490428</literal> / <literal>2048</literal>
+    gives approximately <literal>3169.154</literal>, so in this example we need at
+    least <literal>3170</literal> huge pages, which we can set with:
 <programlisting>
-$ <userinput>sysctl -w vm.nr_hugepages=3170</userinput>
+$ <userinput>echo 3170 | tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages</userinput>
 </programlisting>
     A larger setting would be appropriate if other programs on the machine
-    also need huge pages. Don't forget to add this setting
-    to <filename>/etc/sysctl.conf</filename> so that it will be reapplied
-    after reboots.
+    also need huge pages. It is also possible to pre-allocate huge pages on boot
+    by adding the kernel parameters <literal>hugepagesz=2M hugepages=3170</literal>.
    </para>

    <para>
     Sometimes the kernel is not able to allocate the desired number of huge
-    pages immediately, so it might be necessary to repeat the command or to
-    reboot. (Immediately after a reboot, most of the machine's memory
-    should be available to convert into huge pages.) To verify the huge
-    page allocation situation, use:
+    pages immediately due to external fragmentation, so it might be necessary to
+    repeat the command or to reboot. To verify the huge page allocation situation
+    for a given size, use:
 <programlisting>
-$ <userinput>grep Huge /proc/meminfo</userinput>
+$ <userinput>cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages</userinput>
 </programlisting>
    </para>
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 198a6985bf..0f45f6c9ac 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -32,6 +32,7 @@
 #endif
 
 #include "miscadmin.h"
+#include "port/pg_bitutils.h"
 #include "portability/mem.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
@@ -464,25 +465,15 @@ PGSharedMemoryAttach(IpcMemoryId shmId,
  * hugepage sizes, we might want to think about more invasive strategies,
  * such as increasing shared_buffers to absorb the extra space.
  *
- * Returns the (real or assumed) page size into *hugepagesize,
+ * Returns the (real, assumed or config provided) page size into *hugepagesize,
  * and the hugepage-related mmap flags to use into *mmap_flags.
- *
- * Currently *mmap_flags is always just MAP_HUGETLB. Someday, on systems
- * that support it, we might OR in additional bits to specify a particular
- * non-default huge page size.
  */
+
+
 static void
 GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 {
-	/*
-	 * If we fail to find out the system's default huge page size, assume it
-	 * is 2MB. This will work fine when the actual size is less. If it's
-	 * more, we might get mmap() or munmap() failures due to unaligned
-	 * requests; but at this writing, there are no reports of any non-Linux
-	 * systems being picky about that.
-	 */
-	*hugepagesize = 2 * 1024 * 1024;
-	*mmap_flags = MAP_HUGETLB;
+	Size		default_hugepagesize = 0;
 
 	/*
 	 * System-dependent code to find out the default huge page size.
@@ -491,6 +482,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 	 * nnnn kB". Ignore any failures, falling back to the preset default.
 	 */
 #ifdef __linux__
+
 	{
 		FILE	   *fp = AllocateFile("/proc/meminfo", "r");
 		char		buf[128];
@@ -505,7 +497,7 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 			{
 				if (ch == 'k')
 				{
-					*hugepagesize = sz * (Size) 1024;
+					default_hugepagesize = sz * (Size) 1024;
 					break;
 				}
 				/* We could accept other units besides kB, if needed */
@@ -515,6 +507,51 @@ GetHugePageSize(Size *hugepagesize, int *mmap_flags)
 		}
 	}
 #endif							/* __linux__ */
+
+	if (huge_page_size != 0)
+	{
+		/* If huge page size is provided in config we use that size */
+		*hugepagesize = (Size) huge_page_size * 1024;
+	}
+	else if (default_hugepagesize != 0)
+	{
+		*hugepagesize = default_hugepagesize;
+	}
+	else
+	{
+		/*
+		 * If we fail to find out the system's default huge page size, or no
+		 * huge page size is provided in config, assume it is 2MB. This will
+		 * work fine when the actual size is less. If it's more, we might get
+		 * mmap() or munmap() failures due to unaligned requests; but at this
+		 * writing, there are no reports of any non-Linux systems being picky
+		 * about that.
+		 */
+		*hugepagesize = 2 * 1024 * 1024;
+	}
+
+
+	*mmap_flags = MAP_HUGETLB;
+
+	/*
+	 * System-dependent code to configure mmap_flags.
+	 *
+	 * On Linux, configure flags to include page size, since default huge page
+	 * size will be used in case no size is provided.
+	 */
+#ifdef USE_NON_DEFAULT_HUGE_PAGE_SIZES
+
+	/*
+	 * If the selected huge page size is not the default, add flag to mmap to
+	 * specify it
+	 */
+	if (*hugepagesize != default_hugepagesize)
+	{
+		int			shift = pg_ceil_log2_64(*hugepagesize);
+
+		*mmap_flags |= (shift & MAP_HUGE_MASK) << MAP_HUGE_SHIFT;
+	}
+#endif							/* USE_NON_DEFAULT_HUGE_PAGE_SIZES */
 }
 #endif							/* MAP_HUGETLB */
@@ -583,7 +620,7 @@ CreateAnonymousSegment(Size *size)
 						 "(currently %zu bytes), reduce PostgreSQL's shared "
 						 "memory usage, perhaps by reducing shared_buffers or "
 						 "max_connections.",
-						 *size) : 0));
+						 allocsize) : 0));
 	}
 
 	*size = allocsize;
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2f3e0a70e0..019e2690c3 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -197,6 +197,7 @@ static bool check_autovacuum_max_workers(int *newval, void **extra, GucSource so
 static bool check_max_wal_senders(int *newval, void **extra, GucSource source);
 static bool check_autovacuum_work_mem(int *newval, void **extra, GucSource source);
 static bool check_effective_io_concurrency(int *newval, void **extra, GucSource source);
+static bool check_huge_page_size(int *newval, void **extra, GucSource source);
 static bool check_maintenance_io_concurrency(int *newval, void **extra, GucSource source);
 static void assign_pgstat_temp_directory(const char *newval, void *extra);
 static bool check_application_name(char **newval, void **extra, GucSource source);
@@ -585,6 +586,7 @@ int			ssl_renegotiation_limit;
  * need to be duplicated in all the different implementations of pg_shmem.c.
  */
 int			huge_pages;
+int			huge_page_size;
 
 /*
  * These variables are all dummies that don't do anything, except in some
@@ -2269,6 +2271,16 @@ static struct config_int ConfigureNamesInt[] =
 		1024, 16, INT_MAX / 2,
 		NULL, NULL, NULL
 	},
+	{
+		{"huge_page_size", PGC_POSTMASTER, RESOURCES_MEM,
+			gettext_noop("The size of huge page that should be used."),
+			NULL,
+			GUC_UNIT_KB
+		},
+		&huge_page_size,
+		0, 0, INT_MAX,
+		check_huge_page_size, NULL, NULL
+	},
 
 	{
 		{"temp_buffers", PGC_USERSET, RESOURCES_MEM,
@@ -11573,6 +11585,19 @@ check_effective_io_concurrency(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+static bool
+check_huge_page_size(int *newval, void **extra, GucSource source)
+{
+#ifndef USE_NON_DEFAULT_HUGE_PAGE_SIZES
+	if (*newval != 0)
+	{
+		GUC_check_errdetail("huge_page_size must be set to 0 on platforms that lack support for choosing huge page size.");
+		return false;
+	}
+#endif							/* USE_NON_DEFAULT_HUGE_PAGE_SIZES */
+	return true;
+}
+
 static bool
 check_maintenance_io_concurrency(int *newval, void **extra, GucSource source)
 {
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ac02bd0c00..750d3f6245 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -122,6 +122,8 @@
 					# (change requires restart)
 #huge_pages = try			# on, off, or try
 					# (change requires restart)
+#huge_page_size = 0			# use default huge page size when set to zero
+					# (change requires restart)
 #temp_buffers = 8MB			# min 800kB
 #max_prepared_transactions = 0		# zero disables the feature
 					# (change requires restart)
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c199cd46d2..4ee8e23b47 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -138,6 +138,14 @@
    to 0 if you don't. */
 #undef HAVE_DECL_LLVMORCGETSYMBOLADDRESSIN
 
+/* Define to 1 if you have the declaration of `MAP_HUGE_MASK', and to 0 if you
+   don't. */
+#undef HAVE_DECL_MAP_HUGE_MASK
+
+/* Define to 1 if you have the declaration of `MAP_HUGE_SHIFT', and to 0 if
+   you don't. */
+#undef HAVE_DECL_MAP_HUGE_SHIFT
+
 /* Define to 1 if you have the declaration of `posix_fadvise', and to 0 if you
    don't. */
 #undef HAVE_DECL_POSIX_FADVISE
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index 8f3ec6bde1..f994652190 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -156,6 +156,12 @@
 #define USE_PREFETCH
 #endif
 
+/*
+ * USE_NON_DEFAULT_HUGE_PAGE_SIZES */
+#if defined(HAVE_DECL_MAP_HUGE_SHIFT) && defined(HAVE_DECL_MAP_HUGE_MASK)
+#define USE_NON_DEFAULT_HUGE_PAGE_SIZES
+#endif
+
 /*
  * Default and maximum values for backend_flush_after, bgwriter_flush_after
  * and checkpoint_flush_after; measured in blocks. Currently, these are
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 0de26b3427..9992932a00 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -44,6 +44,7 @@ typedef struct PGShmemHeader	/* standard header for all Postgres shmem */
 /* GUC variables */
 extern int	shared_memory_type;
 extern int	huge_pages;
+extern int	huge_page_size;
 
 /* Possible values for huge_pages */
 typedef enum
diff --git a/src/tools/msvc/Solution.pm b/src/tools/msvc/Solution.pm
index a13ca6e02e..32179b7e23 100644
--- a/src/tools/msvc/Solution.pm
+++ b/src/tools/msvc/Solution.pm
@@ -234,6 +234,8 @@ sub GenerateFiles
 		HAVE_DECL_LLVMGETHOSTCPUNAME    => 0,
 		HAVE_DECL_LLVMGETHOSTCPUFEATURES => 0,
 		HAVE_DECL_LLVMORCGETSYMBOLADDRESSIN => 0,
+		HAVE_DECL_MAP_HUGE_MASK         => undef,
+		HAVE_DECL_MAP_HUGE_SHIFT        => undef,
 		HAVE_DECL_POSIX_FADVISE         => undef,
 		HAVE_DECL_RTLD_GLOBAL           => 0,
 		HAVE_DECL_RTLD_NOW              => 0,
-- 
2.27.0