> On Fri, Oct 18, 2024 at 09:21:19PM GMT, Dmitry Dolgov wrote:
> TL;DR A PoC for changing shared_buffers without PostgreSQL restart, via
> changing shared memory mapping layout. Any feedback is appreciated.

Hi,

Here is a new version of the patch, which contains a proposal for how to
coordinate shared memory resizing between backends. The rest is more or less
the same, and feedback on the coordination part is appreciated. It's a lot to
read, but the main differences are:

1. Allow decoupling a GUC value change from actually applying it, a sort of
"pending" change. The idea is to let custom logic be triggered in an assign
hook, which then takes responsibility for what happens later and how the change
is going to be applied. This makes it possible to use the regular GUC
infrastructure in cases where a value change requires some complicated
processing. I was trying to make the change not too invasive, and it's still
missing GUC reporting.
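To illustrate the intended shape of such a hook (just a rough sketch; the
names NBuffersPending and pending_shared_buffers are invented here for
illustration and are not necessarily what the patch does):

    /* Hypothetical assign hook: instead of applying the new value
     * immediately, only record it as pending. The resize machinery is
     * then responsible for applying it later and reporting the result. */
    static void
    assign_shared_buffers(int newval, void *extra)
    {
        NBuffersPending = newval;       /* picked up by the coordinator */
        pending_shared_buffers = true;  /* marks the change as pending */
    }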

2. The shared memory resizing patch became more complicated due to the
coordination between backends. The current implementation was chosen from a few
more or less equal alternatives, which evolve along the following lines:

* There should be one "coordinator" process overseeing the change. Having the
postmaster fulfill this role, as in this patch, seems like a natural idea, but
it poses certain challenges, since the postmaster doesn't have locking
infrastructure. Another option would be to elect a single backend as the
coordinator, which would then have to handle the postmaster as a special case.
If there is ever a dedicated "coordinator" worker in Postgres, it would be
useful here.

* The coordinator uses EmitProcSignalBarrier to reach out to all other backends
and trigger the resize process. Backends join a Barrier to synchronize and wait
until everyone is finished (see the sketch after this list).

* There is some resizing state stored in shared memory, which is there to
handle backends that were late for some reason or didn't receive the signal.
What to store there is open for discussion.

* Since we want to make sure all processes share the same understanding of what
the NBuffers value is, any failure is mostly a hard stop: rolling back the
change would need coordination as well, which sounds a bit too complicated for
now.
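To sketch how the pieces above are meant to fit together (EmitProcSignalBarrier,
WaitForProcSignalBarrier and the Barrier API are existing infrastructure;
PROCSIGNAL_BARRIER_SHMEM_RESIZE and ShmemResizeBarrier are hypothetical names
used only for illustration):

    /* Coordinator: signal all backends to start resizing, then wait. */
    uint64 gen = EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
    WaitForProcSignalBarrier(gen);

    /* Each backend, when processing the barrier: remap its segments,
     * then synchronize so nobody proceeds with a mixed NBuffers view. */
    BarrierAttach(ShmemResizeBarrier);
    /* ... remap shared memory segments to the new size ... */
    BarrierArriveAndWait(ShmemResizeBarrier, 0);
    BarrierDetach(ShmemResizeBarrier);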

We've tested this change manually for now, although it might be useful to try
out injection points. The testing strategy, which has caught plenty of bugs,
was simply to run a pgbench workload against a running instance and change
shared_buffers on the fly. Some more subtle cases were verified by manually
injecting delays to trigger the expected scenarios.
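For reference, the workflow looked roughly like this (an example of the
strategy, not a script from the patch set):

    $ pgbench -i -s 50 postgres
    $ pgbench -c 16 -T 600 postgres &
    $ psql -c "ALTER SYSTEM SET shared_buffers = '512MB'" postgres
    $ psql -c "SELECT pg_reload_conf()" postgres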

To reiterate, here is the patch breakdown:

Patches 1-3 prepare the infrastructure and the shared memory layout. They could
be useful even with multithreaded PostgreSQL, when there will be no need for
shared memory. I assume that in the multithreaded world there will still be a
need for a contiguous chunk of memory to share between threads, and its layout
would be similar to the one used with shared memory mappings. Note that patch
nr 2 will go away as soon as I get to implementing shared memory address
reservation, but for now it's needed.

Patch 4 is a new addition to handle "pending" GUC changes.

Patch 5 actually does the resizing. It's shared memory specific of course, and
utilizes the Linux-specific mremap, which leaves open portability questions.
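At its core the resize operation on Linux boils down to something like this
(a simplified sketch without error handling or coordination; the
AnonymousMapping fields are the ones introduced in patch 1):

    /* Grow an anonymous mapping in place. The address space layout from
     * patches 1-3 leaves free room above each mapping, so the start
     * address must not change and MREMAP_MAYMOVE is not passed. */
    void *ptr = mremap(mapping->shmem, mapping->shmem_size, new_size, 0);
    if (ptr == MAP_FAILED)
        elog(FATAL, "could not remap shared memory segment: %m");
    mapping->shmem_size = new_size;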

Patch 6 is somewhat independent, but quite convenient to have. It also utilizes
the Linux-specific call memfd_create.
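For context, memfd_create provides an anonymous, fd-backed piece of memory
that can be resized via ftruncate and then mapped; a minimal sketch of the
idea (not the exact code from the patch):

    /* Create an anonymous in-memory file, size it, and map it shared. */
    int fd = memfd_create("postgres_shmem", MFD_CLOEXEC);
    if (fd < 0 || ftruncate(fd, size) < 0)
        elog(FATAL, "could not create shared memory file: %m");
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);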

I would like to get some feedback on the synchronization part. While waiting,
I'll proceed with implementing shared memory address space reservation, and
Ashutosh will continue with buffer eviction to support shared memory reduction.
From d88185fb3b4a3a0e102a3af52f4fb5564468db15 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Wed, 19 Feb 2025 17:43:13 +0100
Subject: [PATCH v2 1/6] Allow to use multiple shared memory mappings

Currently all the work with shared memory is done via a single anonymous
memory mapping, which limits the ways shared memory can be organized.

Introduce the possibility of allocating multiple shared memory mappings, where
a single mapping is associated with a specified shared memory segment.
There is only a fixed number of available segments; currently only one
main shared memory segment is allocated. A new shared memory API is
introduced, extended with a segment as a new parameter. As a path of
least resistance, the original API is kept in place, utilizing the main
shared memory segment.
---
 src/backend/port/posix_sema.c       |   4 +-
 src/backend/port/sysv_sema.c        |   4 +-
 src/backend/port/sysv_shmem.c       | 138 ++++++++++++++++++---------
 src/backend/port/win32_sema.c       |   2 +-
 src/backend/storage/ipc/ipc.c       |   4 +-
 src/backend/storage/ipc/ipci.c      |  63 +++++++------
 src/backend/storage/ipc/shmem.c     | 141 +++++++++++++++++++---------
 src/backend/storage/lmgr/lwlock.c   |   5 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/ipc.h           |   2 +-
 src/include/storage/pg_sema.h       |   2 +-
 src/include/storage/pg_shmem.h      |  18 ++++
 src/include/storage/shmem.h         |  12 +++
 13 files changed, 272 insertions(+), 124 deletions(-)

diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c
index 269c7460817..401e1113fa1 100644
--- a/src/backend/port/posix_sema.c
+++ b/src/backend/port/posix_sema.c
@@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * we don't have to expose the counters to other processes.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
        struct stat statbuf;
 
@@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas)
         * ShmemAlloc() won't be ready yet.
         */
        sharedSemas = (PGSemaphore)
-               ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+               ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
 #endif
 
        numSems = 0;
diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c
index f7c8638aec5..b6301463ac7 100644
--- a/src/backend/port/sysv_sema.c
+++ b/src/backend/port/sysv_sema.c
@@ -313,7 +313,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * have clobbered.)
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
        struct stat statbuf;
 
@@ -334,7 +334,7 @@ PGReserveSemaphores(int maxSemas)
         * ShmemAlloc() won't be ready yet.
         */
        sharedSemas = (PGSemaphore)
-               ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas));
+               ShmemAllocUnlockedInSegment(PGSemaphoreShmemSize(maxSemas), shmem_segment);
        numSharedSemas = 0;
        maxSharedSemas = maxSemas;
 
diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 197926d44f6..843b1b3220f 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -94,8 +94,19 @@ typedef enum
 unsigned long UsedShmemSegID = 0;
 void      *UsedShmemSegAddr = NULL;
 
-static Size AnonymousShmemSize;
-static void *AnonymousShmem = NULL;
+typedef struct AnonymousMapping
+{
+       int shmem_segment;
+       Size shmem_size;                        /* Size of the mapping */
+       void *shmem;                            /* Pointer to the start of the mapped memory */
+       void *seg_addr;                         /* SysV shared memory for the header */
+       unsigned long seg_id;           /* IPC key */
+} AnonymousMapping;
+
+static AnonymousMapping Mappings[ANON_MAPPINGS];
+
+/* Keeps track of used mapping segments */
+static int next_free_segment = 0;
 
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
@@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId,
                                                                                   void *attachAt,
                                                                                   PGShmemHeader **addr);
 
+static const char*
+MappingName(int shmem_segment)
+{
+       switch (shmem_segment)
+       {
+               case MAIN_SHMEM_SEGMENT:
+                       return "main";
+               default:
+                       return "unknown";
+       }
+}
+
+static void
+DebugMappings()
+{
+       for(int i = 0; i < next_free_segment; i++)
+       {
+               AnonymousMapping m = Mappings[i];
+               elog(DEBUG1, "Mapping[%s]: addr %p, size %zu",
+                        MappingName(i), m.shmem, m.shmem_size);
+       }
+}
 
 /*
  *     InternalIpcMemoryCreate(memKey, size)
@@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source)
 /*
  * Creates an anonymous mmap()ed shared memory segment.
  *
- * Pass the requested size in *size.  This function will modify *size to the
- * actual size of the allocation, if it ends up allocating a segment that is
- * larger than requested.
+ * This function will modify mapping size to the actual size of the allocation,
+ * if it ends up allocating a segment that is larger than requested.
  */
-static void *
-CreateAnonymousSegment(Size *size)
+static void
+CreateAnonymousSegment(AnonymousMapping *mapping)
 {
-       Size            allocsize = *size;
+       Size            allocsize = mapping->shmem_size;
        void       *ptr = MAP_FAILED;
        int                     mmap_errno = 0;
 
@@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size)
                                   PG_MMAP_FLAGS | mmap_flags, -1, 0);
                mmap_errno = errno;
                if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
-                       elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge 
pages disabled: %m",
-                                allocsize);
+               {
+                       DebugMappings();
+                       elog(DEBUG1, "segment[%s]: mmap(%zu) with MAP_HUGETLB 
failed, huge pages disabled: %m",
+                                MappingName(mapping->shmem_segment), 
allocsize);
+               }
        }
 #endif
 
@@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size)
                 * Use the original size, not the rounded-up value, when falling back
                 * to non-huge pages.
                 */
-               allocsize = *size;
+               allocsize = mapping->shmem_size;
                ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
                                   PG_MMAP_FLAGS, -1, 0);
                mmap_errno = errno;
@@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size)
        if (ptr == MAP_FAILED)
        {
                errno = mmap_errno;
+               DebugMappings();
                ereport(FATAL,
-                               (errmsg("could not map anonymous shared memory: 
%m"),
+                               (errmsg("segment[%s]: could not map anonymous 
shared memory: %m",
+                                               
MappingName(mapping->shmem_segment)),
                                 (mmap_errno == ENOMEM) ?
                                 errhint("This error usually means that 
PostgreSQL's request "
                                                 "for a shared memory segment 
exceeded available memory, "
@@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size)
                                                 allocsize) : 0));
        }
 
-       *size = allocsize;
-       return ptr;
+       mapping->shmem = ptr;
+       mapping->shmem_size = allocsize;
 }
 
 /*
@@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size)
 static void
 AnonymousShmemDetach(int status, Datum arg)
 {
-       /* Release anonymous shared memory block, if any. */
-       if (AnonymousShmem != NULL)
+       for(int i = 0; i < next_free_segment; i++)
        {
-               if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-                       elog(LOG, "munmap(%p, %zu) failed: %m",
-                                AnonymousShmem, AnonymousShmemSize);
-               AnonymousShmem = NULL;
+               AnonymousMapping m = Mappings[i];
+
+               /* Release anonymous shared memory block, if any. */
+               if (m.shmem != NULL)
+               {
+                       if (munmap(m.shmem, m.shmem_size) < 0)
+                               elog(LOG, "munmap(%p, %zu) failed: %m",
+                                        m.shmem, m.shmem_size);
+                       m.shmem = NULL;
+               }
        }
 }
 
@@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size,
        PGShmemHeader *hdr;
        struct stat statbuf;
        Size            sysvsize;
+       AnonymousMapping *mapping = &Mappings[next_free_segment];
 
        /*
         * We use the data directory's ID info (inode and device numbers) to
@@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size,
 
        /* Room for a header? */
        Assert(size > MAXALIGN(sizeof(PGShmemHeader)));
+       mapping->shmem_size = size;
+       mapping->shmem_segment = next_free_segment;
 
        if (shared_memory_type == SHMEM_TYPE_MMAP)
        {
-               AnonymousShmem = CreateAnonymousSegment(&size);
-               AnonymousShmemSize = size;
+               /* On success, mapping data will be modified. */
+               CreateAnonymousSegment(mapping);
+
+               next_free_segment++;
 
                /* Register on-exit routine to unmap the anonymous segment */
                on_shmem_exit(AnonymousShmemDetach, (Datum) 0);
@@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size,
         * loop simultaneously.  (CreateDataDirLockFile() does not entirely ensure
         * that, but prefer fixing it over coping here.)
         */
-       NextShmemSegID = statbuf.st_ino;
+       NextShmemSegID = statbuf.st_ino + next_free_segment;
 
        for (;;)
        {
@@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size,
        /*
         * Initialize space allocation status for segment.
         */
-       hdr->totalsize = size;
+       hdr->totalsize = mapping->shmem_size;
        hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader));
        *shim = hdr;
 
        /* Save info for possible future use */
-       UsedShmemSegAddr = memAddress;
-       UsedShmemSegID = (unsigned long) NextShmemSegID;
+       mapping->seg_addr = memAddress;
+       mapping->seg_id = (unsigned long) NextShmemSegID;
 
        /*
         * If AnonymousShmem is NULL here, then we're not using anonymous shared
@@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size,
         * block. Otherwise, the System V shared memory block is only a shim, 
and
         * we must return a pointer to the real block.
         */
-       if (AnonymousShmem == NULL)
+       if (mapping->shmem == NULL)
                return hdr;
-       memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader));
-       return (PGShmemHeader *) AnonymousShmem;
+       memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader));
+       return (PGShmemHeader *) mapping->shmem;
 }
 
 #ifdef EXEC_BACKEND
@@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void)
 void
 PGSharedMemoryDetach(void)
 {
-       if (UsedShmemSegAddr != NULL)
+       for(int i = 0; i < next_free_segment; i++)
        {
-               if ((shmdt(UsedShmemSegAddr) < 0)
+               AnonymousMapping m = Mappings[i];
+
+               if (m.seg_addr != NULL)
+               {
+                       if ((shmdt(m.seg_addr) < 0)
 #if defined(EXEC_BACKEND) && defined(__CYGWIN__)
-               /* Work-around for cygipc exec bug */
-                       && shmdt(NULL) < 0
+                       /* Work-around for cygipc exec bug */
+                               && shmdt(NULL) < 0
 #endif
-                       )
-                       elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr);
-               UsedShmemSegAddr = NULL;
-       }
+                               )
+                               elog(LOG, "shmdt(%p) failed: %m", m.seg_addr);
+                       m.seg_addr = NULL;
+               }
 
-       if (AnonymousShmem != NULL)
-       {
-               if (munmap(AnonymousShmem, AnonymousShmemSize) < 0)
-                       elog(LOG, "munmap(%p, %zu) failed: %m",
-                                AnonymousShmem, AnonymousShmemSize);
-               AnonymousShmem = NULL;
+               if (m.shmem != NULL)
+               {
+                       if (munmap(m.shmem, m.shmem_size) < 0)
+                               elog(LOG, "munmap(%p, %zu) failed: %m",
+                                        m.shmem, m.shmem_size);
+                       m.shmem = NULL;
+               }
        }
 }
diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c
index 5854ad1f54d..e7365ff8060 100644
--- a/src/backend/port/win32_sema.c
+++ b/src/backend/port/win32_sema.c
@@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas)
  * process exits.
  */
 void
-PGReserveSemaphores(int maxSemas)
+PGReserveSemaphores(int maxSemas, int shmem_segment)
 {
        mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE));
        if (mySemSet == NULL)
diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c
index e4d5b944e12..9d526eb43fd 100644
--- a/src/backend/storage/ipc/ipc.c
+++ b/src/backend/storage/ipc/ipc.c
@@ -61,6 +61,8 @@ static void proc_exit_prepare(int code);
  * but provide some additional features we need --- in particular,
  * we want to register callbacks to invoke when we are disconnecting
  * from a broken shared-memory context but not exiting the postmaster.
+ * Maximum number of such exit callbacks depends on the number of shared
+ * segments.
  *
  * Callback functions can take zero, one, or two args: the first passed
  * arg is the integer exitcode, the second is the Datum supplied when
@@ -68,7 +70,7 @@ static void proc_exit_prepare(int code);
  * ----------------------------------------------------------------
  */
 
-#define MAX_ON_EXITS 20
+#define MAX_ON_EXITS 40
 
 struct ONEXIT
 {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 174eed70367..4f6c707c204 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -85,7 +85,7 @@ RequestAddinShmemSpace(Size size)
  * required.
  */
 Size
-CalculateShmemSize(int *num_semaphores)
+CalculateShmemSize(int *num_semaphores, int shmem_segment)
 {
        Size            size;
        int                     numSemas;
@@ -204,33 +204,38 @@ CreateSharedMemoryAndSemaphores(void)
 
        Assert(!IsUnderPostmaster);
 
-       /* Compute the size of the shared-memory block */
-       size = CalculateShmemSize(&numSemas);
-       elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
-
-       /*
-        * Create the shmem segment
-        */
-       seghdr = PGSharedMemoryCreate(size, &shim);
-
-       /*
-        * Make sure that huge pages are never reported as "unknown" while the
-        * server is running.
-        */
-       Assert(strcmp("unknown",
-                                 GetConfigOption("huge_pages_status", false, false)) != 0);
-
-       InitShmemAccess(seghdr);
-
-       /*
-        * Create semaphores
-        */
-       PGReserveSemaphores(numSemas);
-
-       /*
-        * Set up shared memory allocation mechanism
-        */
-       InitShmemAllocation();
+       for(int segment = 0; segment < ANON_MAPPINGS; segment++)
+       {
+               /* Compute the size of the shared-memory block */
+               size = CalculateShmemSize(&numSemas, segment);
+               elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size);
+
+               /*
+                * Create the shmem segment.
+                *
+                * XXX: Are multiple shims needed, one per segment?
+                */
+               seghdr = PGSharedMemoryCreate(size, &shim);
+
+               /*
+                * Make sure that huge pages are never reported as "unknown" while the
+                * server is running.
+                */
+               Assert(strcmp("unknown",
+                                         GetConfigOption("huge_pages_status", 
false, false)) != 0);
+
+               InitShmemAccessInSegment(seghdr, segment);
+
+               /*
+                * Create semaphores
+                */
+               PGReserveSemaphores(numSemas, segment);
+
+               /*
+                * Set up shared memory allocation mechanism
+                */
+               InitShmemAllocationInSegment(segment);
+       }
 
        /* Initialize subsystems */
        CreateOrAttachShmemStructs();
@@ -360,7 +365,7 @@ InitializeShmemGUCs(void)
        /*
         * Calculate the shared memory size and round up to the nearest megabyte.
         */
-       size_b = CalculateShmemSize(&num_semas);
+       size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SEGMENT);
        size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024);
        sprintf(buf, "%zu", size_mb);
        SetConfigOption("shared_memory_size", buf,
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..389abc82519 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -75,19 +75,19 @@
 #include "utils/builtins.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
+static void *ShmemAllocRawInSegment(Size size, Size *allocated_size,
+                                                                int shmem_segment);
 
 /* shared memory global variables */
 
-static PGShmemHeader *ShmemSegHdr;     /* shared mem segment header */
+ShmemSegment Segments[ANON_MAPPINGS];
 
-static void *ShmemBase;                        /* start address of shared memory */
-
-static void *ShmemEnd;                 /* end+1 address of shared memory */
-
-slock_t    *ShmemLock;                 /* spinlock for shared memory and LWLock
-                                                                * allocation */
-
-static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
+/*
+ * Primary index hashtable for shmem. For simplicity we use a single one for
+ * all shared memory segments. There may be performance consequences of that,
+ * and an alternative option would be to have one index per shared memory segment.
+ */
+static HTAB *ShmemIndex = NULL;
 
 
 /*
@@ -96,9 +96,17 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */
 void
 InitShmemAccess(PGShmemHeader *seghdr)
 {
-       ShmemSegHdr = seghdr;
-       ShmemBase = seghdr;
-       ShmemEnd = (char *) ShmemBase + seghdr->totalsize;
+       InitShmemAccessInSegment(seghdr, MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAccessInSegment(PGShmemHeader *seghdr, int shmem_segment)
+{
+       PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr;
+       ShmemSegment *seg = &Segments[shmem_segment];
+       seg->ShmemSegHdr = shmhdr;
+       seg->ShmemBase = (void *) shmhdr;
+       seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize;
 }
 
 /*
@@ -109,7 +117,13 @@ InitShmemAccess(PGShmemHeader *seghdr)
 void
 InitShmemAllocation(void)
 {
-       PGShmemHeader *shmhdr = ShmemSegHdr;
+       InitShmemAllocationInSegment(MAIN_SHMEM_SEGMENT);
+}
+
+void
+InitShmemAllocationInSegment(int shmem_segment)
+{
+       PGShmemHeader *shmhdr = Segments[shmem_segment].ShmemSegHdr;
        char       *aligned;
 
        Assert(shmhdr != NULL);
@@ -118,9 +132,9 @@ InitShmemAllocation(void)
         * Initialize the spinlock used by ShmemAlloc.  We must use
         * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet.
         */
-       ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
+       Segments[shmem_segment].ShmemLock = (slock_t *) ShmemAllocUnlockedInSegment(sizeof(slock_t), shmem_segment);
 
-       SpinLockInit(ShmemLock);
+       SpinLockInit(Segments[shmem_segment].ShmemLock);
 
        /*
         * Allocations after this point should go through ShmemAlloc, which
@@ -145,11 +159,17 @@ InitShmemAllocation(void)
  */
 void *
 ShmemAlloc(Size size)
+{
+       return ShmemAllocInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocInSegment(Size size, int shmem_segment)
 {
        void       *newSpace;
        Size            allocated_size;
 
-       newSpace = ShmemAllocRaw(size, &allocated_size);
+       newSpace = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
        if (!newSpace)
                ereport(ERROR,
                                (errcode(ERRCODE_OUT_OF_MEMORY),
@@ -179,6 +199,12 @@ ShmemAllocNoError(Size size)
  */
 static void *
 ShmemAllocRaw(Size size, Size *allocated_size)
+{
+       return ShmemAllocRawInSegment(size, allocated_size, MAIN_SHMEM_SEGMENT);
+}
+
+static void *
+ShmemAllocRawInSegment(Size size, Size *allocated_size, int shmem_segment)
 {
        Size            newStart;
        Size            newFree;
@@ -198,22 +224,22 @@ ShmemAllocRaw(Size size, Size *allocated_size)
        size = CACHELINEALIGN(size);
        *allocated_size = size;
 
-       Assert(ShmemSegHdr != NULL);
+       Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-       SpinLockAcquire(ShmemLock);
+       SpinLockAcquire(Segments[shmem_segment].ShmemLock);
 
-       newStart = ShmemSegHdr->freeoffset;
+       newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
        newFree = newStart + size;
-       if (newFree <= ShmemSegHdr->totalsize)
+       if (newFree <= Segments[shmem_segment].ShmemSegHdr->totalsize)
        {
-               newSpace = (char *) ShmemBase + newStart;
-               ShmemSegHdr->freeoffset = newFree;
+               newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
+               Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
        }
        else
                newSpace = NULL;
 
-       SpinLockRelease(ShmemLock);
+       SpinLockRelease(Segments[shmem_segment].ShmemLock);
 
        /* note this assert is okay with newSpace == NULL */
        Assert(newSpace == (void *) CACHELINEALIGN(newSpace));
@@ -231,6 +257,12 @@ ShmemAllocRaw(Size size, Size *allocated_size)
  */
 void *
 ShmemAllocUnlocked(Size size)
+{
+       return ShmemAllocUnlockedInSegment(size, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemAllocUnlockedInSegment(Size size, int shmem_segment)
 {
        Size            newStart;
        Size            newFree;
@@ -241,19 +273,19 @@ ShmemAllocUnlocked(Size size)
         */
        size = MAXALIGN(size);
 
-       Assert(ShmemSegHdr != NULL);
+       Assert(Segments[shmem_segment].ShmemSegHdr != NULL);
 
-       newStart = ShmemSegHdr->freeoffset;
+       newStart = Segments[shmem_segment].ShmemSegHdr->freeoffset;
 
        newFree = newStart + size;
-       if (newFree > ShmemSegHdr->totalsize)
+       if (newFree > Segments[shmem_segment].ShmemSegHdr->totalsize)
                ereport(ERROR,
                                (errcode(ERRCODE_OUT_OF_MEMORY),
                                 errmsg("out of shared memory (%zu bytes 
requested)",
                                                size)));
-       ShmemSegHdr->freeoffset = newFree;
+       Segments[shmem_segment].ShmemSegHdr->freeoffset = newFree;
 
-       newSpace = (char *) ShmemBase + newStart;
+       newSpace = (char *) Segments[shmem_segment].ShmemBase + newStart;
 
        Assert(newSpace == (void *) MAXALIGN(newSpace));
 
@@ -268,7 +300,13 @@ ShmemAllocUnlocked(Size size)
 bool
 ShmemAddrIsValid(const void *addr)
 {
-       return (addr >= ShmemBase) && (addr < ShmemEnd);
+       return ShmemAddrIsValidInSegment(addr, MAIN_SHMEM_SEGMENT);
+}
+
+bool
+ShmemAddrIsValidInSegment(const void *addr, int shmem_segment)
+{
+       return (addr >= Segments[shmem_segment].ShmemBase) && (addr < Segments[shmem_segment].ShmemEnd);
 }
 
 /*
@@ -329,6 +367,18 @@ ShmemInitHash(const char *name,            /* table string name for shmem index */
                          long max_size,        /* max size of the table */
                          HASHCTL *infoP,       /* info about key and bucket size */
                          int hash_flags)       /* info about infoP */
+{
+       return ShmemInitHashInSegment(name, init_size, max_size, infoP, hash_flags,
+                                                          MAIN_SHMEM_SEGMENT);
+}
+
+HTAB *
+ShmemInitHashInSegment(const char *name,               /* table string name for shmem index */
+                         long init_size,               /* initial table size */
+                         long max_size,                /* max size of the table */
+                         HASHCTL *infoP,               /* info about key and bucket size */
+                         int hash_flags,               /* info about infoP */
+                         int shmem_segment)    /* in which segment to keep the table */
 {
        bool            found;
        void       *location;
@@ -345,9 +395,9 @@ ShmemInitHash(const char *name,             /* table string name for shmem index */
        hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
        /* look it up in the shmem index */
-       location = ShmemInitStruct(name,
+       location = ShmemInitStructInSegment(name,
                                                           hash_get_shared_size(infoP, hash_flags),
-                                                          &found);
+                                                          &found, shmem_segment);
 
        /*
         * if it already exists, attach to it rather than allocate and initialize
@@ -380,6 +430,13 @@ ShmemInitHash(const char *name,            /* table string name for shmem index */
  */
 void *
 ShmemInitStruct(const char *name, Size size, bool *foundPtr)
+{
+       return ShmemInitStructInSegment(name, size, foundPtr, MAIN_SHMEM_SEGMENT);
+}
+
+void *
+ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
+                                         int shmem_segment)
 {
        ShmemIndexEnt *result;
        void       *structPtr;
@@ -388,7 +445,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
        if (!ShmemIndex)
        {
-               PGShmemHeader *shmemseghdr = ShmemSegHdr;
+               PGShmemHeader *shmemseghdr = Segments[shmem_segment].ShmemSegHdr;
 
                /* Must be trying to create/attach to ShmemIndex itself */
                Assert(strcmp(name, "ShmemIndex") == 0);
@@ -411,7 +468,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
                         * process can be accessing shared memory yet.
                         */
                        Assert(shmemseghdr->index == NULL);
-                       structPtr = ShmemAlloc(size);
+                       structPtr = ShmemAllocInSegment(size, shmem_segment);
                        shmemseghdr->index = structPtr;
                        *foundPtr = false;
                }
@@ -428,8 +485,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
                LWLockRelease(ShmemIndexLock);
                ereport(ERROR,
                                (errcode(ERRCODE_OUT_OF_MEMORY),
-                                errmsg("could not create ShmemIndex entry for 
data structure \"%s\"",
-                                               name)));
+                                errmsg("could not create ShmemIndex entry for 
data structure \"%s\" in segment %d",
+                                               name, shmem_segment)));
        }
 
        if (*foundPtr)
@@ -454,7 +511,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
                Size            allocated_size;
 
                /* It isn't in the table yet. allocate and initialize it */
-               structPtr = ShmemAllocRaw(size, &allocated_size);
+               structPtr = ShmemAllocRawInSegment(size, &allocated_size, shmem_segment);
                if (structPtr == NULL)
                {
                        /* out of memory; remove the failed ShmemIndex entry */
@@ -473,14 +530,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
 
        LWLockRelease(ShmemIndexLock);
 
-       Assert(ShmemAddrIsValid(structPtr));
+       Assert(ShmemAddrIsValidInSegment(structPtr, shmem_segment));
 
        Assert(structPtr == (void *) CACHELINEALIGN(structPtr));
 
        return structPtr;
 }
 
-
 /*
  * Add two Size values, checking for overflow
  */
@@ -537,10 +593,11 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
 
        /* output all allocated entries */
        memset(nulls, 0, sizeof(nulls));
+       /* XXX: take all shared memory segments into account. */
        while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
        {
                values[0] = CStringGetTextDatum(ent->key);
-               values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr);
+               values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr);
                values[2] = Int64GetDatum(ent->size);
                values[3] = Int64GetDatum(ent->allocated_size);
                named_allocated += ent->allocated_size;
@@ -552,15 +609,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS)
        /* output shared memory allocated but not counted via the shmem index */
        values[0] = CStringGetTextDatum("<anonymous>");
        nulls[1] = true;
-       values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated);
+       values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset - named_allocated);
        values[3] = values[2];
        tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
        /* output as-of-yet unused shared memory */
        nulls[0] = true;
-       values[1] = Int64GetDatum(ShmemSegHdr->freeoffset);
+       values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
        nulls[1] = false;
-       values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset);
+       values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SEGMENT].ShmemSegHdr->freeoffset);
        values[3] = values[2];
        tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f1e74f184f1..40aa4014b5f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -81,6 +81,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "postmaster/postmaster.h"
+#include "storage/pg_shmem.h"
 #include "storage/proc.h"
 #include "storage/proclist.h"
 #include "storage/spin.h"
@@ -607,9 +608,9 @@ LWLockNewTrancheId(void)
 
        LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int));
        /* We use the ShmemLock spinlock to protect LWLockCounter */
-       SpinLockAcquire(ShmemLock);
+       SpinLockAcquire(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
        result = (*LWLockCounter)++;
-       SpinLockRelease(ShmemLock);
+       SpinLockRelease(Segments[MAIN_SHMEM_SEGMENT].ShmemLock);
 
        return result;
 }
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1a65342177d..4595f5a9676 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -22,6 +22,7 @@
 #include "storage/condition_variable.h"
 #include "storage/lwlock.h"
 #include "storage/shmem.h"
+#include "storage/pg_shmem.h"
 #include "storage/smgr.h"
 #include "storage/spin.h"
 #include "utils/relcache.h"
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index e0f5f92e947..c0439f2206b 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void);
 /* ipci.c */
 extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook;
 
-extern Size CalculateShmemSize(int *num_semaphores);
+extern Size CalculateShmemSize(int *num_semaphores, int shmem_segment);
 extern void CreateSharedMemoryAndSemaphores(void);
 #ifdef EXEC_BACKEND
 extern void AttachSharedMemoryStructs(void);
diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h
index fa6ca35a51f..8ae9637fcd0 100644
--- a/src/include/storage/pg_sema.h
+++ b/src/include/storage/pg_sema.h
@@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore;
 extern Size PGSemaphoreShmemSize(int maxSemas);
 
 /* Module initialization (called during postmaster start or shmem reinit) */
-extern void PGReserveSemaphores(int maxSemas);
+extern void PGReserveSemaphores(int maxSemas, int shmem_segment);
 
 /* Allocate a PGSemaphore structure with initial count 1 */
 extern PGSemaphore PGSemaphoreCreate(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index b99ebc9e86f..138078c29c5 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -25,6 +25,7 @@
 #define PG_SHMEM_H
 
 #include "storage/dsm_impl.h"
+#include "storage/spin.h"
 
 typedef struct PGShmemHeader   /* standard header for all Postgres shmem */
 {
@@ -41,6 +42,20 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
 #endif
 } PGShmemHeader;
 
+typedef struct ShmemSegment
+{
+       PGShmemHeader *ShmemSegHdr;     /* shared mem segment header */
+       void *ShmemBase;                                /* start address of shared memory */
+       void *ShmemEnd;                                 /* end+1 address of shared memory */
+       slock_t    *ShmemLock;                  /* spinlock for shared memory and LWLock
+                                                                        * allocation */
+} ShmemSegment;
+
+/* Number of available segments for anonymous memory mappings */
+#define ANON_MAPPINGS 1
+
+extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -90,4 +105,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/* The main segment, contains everything except buffer blocks and related data. */
+#define MAIN_SHMEM_SEGMENT 0
+
 #endif                                                 /* PG_SHMEM_H */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 904a336b851..5929f140236 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -29,15 +29,27 @@
 extern PGDLLIMPORT slock_t *ShmemLock;
 struct PGShmemHeader;                  /* avoid including storage/pg_shmem.h here */
 extern void InitShmemAccess(struct PGShmemHeader *seghdr);
+extern void InitShmemAccessInSegment(struct PGShmemHeader *seghdr,
+                                                                        int shmem_segment);
 extern void InitShmemAllocation(void);
+extern void InitShmemAllocationInSegment(int shmem_segment);
 extern void *ShmemAlloc(Size size);
+extern void *ShmemAllocInSegment(Size size, int shmem_segment);
 extern void *ShmemAllocNoError(Size size);
 extern void *ShmemAllocUnlocked(Size size);
+extern void *ShmemAllocUnlockedInSegment(Size size, int shmem_segment);
 extern bool ShmemAddrIsValid(const void *addr);
+extern bool ShmemAddrIsValidInSegment(const void *addr, int shmem_segment);
 extern void InitShmemIndex(void);
+extern void InitVariableShmemIndex(void);
 extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
                                                   HASHCTL *infoP, int hash_flags);
+extern HTAB *ShmemInitHashInSegment(const char *name, long init_size,
+                                                                       long max_size, HASHCTL *infoP,
+                                                                       int hash_flags, int shmem_segment);
 extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemInitStructInSegment(const char *name, Size size,
+                                                                         bool *foundPtr, int shmem_segment);
 extern Size add_size(Size s1, Size s2);
 extern Size mul_size(Size s1, Size s2);
 

base-commit: 80d7f990496b1c7be61d9a00a2635b7d96b96197
-- 
2.45.1

From 7543fcdfc8ca1a0e1c85f397eb6dddfe1426b379 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Wed, 16 Oct 2024 20:21:33 +0200
Subject: [PATCH v2 2/6] Allow placing shared memory mapping with an offset

Currently the kernel is responsible for choosing an address at which to place
each shared memory mapping: the lowest possible address that does not clash
with any other mapping. This is considered to be the most portable approach,
but one of the downsides is that there is no room left to resize allocated
mappings. Here is how it looks for one mapping in /proc/$PID/maps, where
/dev/zero (deleted) represents the anonymous shared memory in question:

    00400000-00490000         /path/bin/postgres
    ...
    012d9000-0133e000         [heap]
    7f443a800000-7f470a800000 /dev/zero (deleted)
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
    ...
    7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
    7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)

By specifying the mapping address directly it's possible to place the
mapping in a way that leaves room for resizing. The idea is first to get
the address chosen by the kernel, then apply some offset derived from
the expected upper limit. Because we base the layout on the address
chosen by the kernel, things like address space randomization should not
be a problem, since the randomization is applied to the mmap base, which
is one per process. The result looks like this:

    012d9000-0133e000         [heap]
    7f443a800000-7f444196c000 /dev/zero (deleted)
    [...free space...]
    7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
    7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2

This approach does not impact the actual memory usage as reported by the kernel.
Here is the output of /proc/$PID/status for the master version with
shared_buffers = 128 MB:

    // Peak virtual memory size, which is described as total pages mapped in mm_struct
    VmPeak:           422780 kB
    // Size of memory portions. It contains RssAnon + RssFile + RssShmem
    VmRSS:             21248 kB
    // Size of resident anonymous memory
    RssAnon:             640 kB
    // Size of resident file mappings
    RssFile:            9728 kB
    // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and
    // shared anonymous mappings)
    RssShmem:          10880 kB

Here is the same for the patch with the shared mapping placed at
an offset 10 GB:

    VmPeak:          1102844 kB
    VmRSS:             21376 kB
    RssAnon:             640 kB
    RssFile:            9856 kB
    RssShmem:          10880 kB

Cgroup v2 doesn't have any problems with this either. To verify, a new cgroup
was created with a memory limit of 256 MB, then PostgreSQL was launched within
this cgroup with shared_buffers = 128 MB:

    $ cd /sys/fs/cgroup
    $ mkdir postgres
    $ cd postgres
    $ echo 268435456 > memory.max

    $ echo $MASTER_PID_SHELL > cgroup.procs
    # postgres from the master branch has been successfully launched
    #  from that shell
    $ cat memory.current
    17465344 (~16 MB)
    # stop postgres

    $ echo $PATCH_PID_SHELL > cgroup.procs
    # postgres from the patch has been successfully launched from that shell
    $ cat memory.current
    18219008 (~17 MB)

Note that currently the implementation makes assumptions about the upper limit.
Ideally it should be based on the maximum available memory.
---
 src/backend/port/sysv_shmem.c | 120 +++++++++++++++++++++++++++++++++-
 1 file changed, 119 insertions(+), 1 deletion(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 843b1b3220f..62f01d8218a 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -108,6 +108,63 @@ static AnonymousMapping Mappings[ANON_MAPPINGS];
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
+/*
+ * Anonymous mapping placing (/dev/zero (deleted) below) looks like this:
+ *
+ * 00400000-00490000         /path/bin/postgres
+ * ...
+ * 012d9000-0133e000         [heap]
+ * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842
+ * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted)
+ * ...
+ *
+ * We would like to place multiple mappings in such a way that there will be
+ * enough space between them in the address space to be able to resize up to a
+ * certain size, but without counting towards the total memory consumption.
+ *
+ * By letting Linux choose a mapping address, it will pick the lowest possible
+ * address that does not clash with any other mappings, which will be right
+ * before the locales in the example above. This information (maximum allowed
+ * size of mappings and the lowest mapping address) is enough to place every
+ * mapping as follows:
+ *
+ * - Take the lowest mapping address, which we later call the probe address.
+ * - Subtract the offset of the previous mapping.
+ * - Subtract the maximum allowed size for the current mapping from the
+ *   address.
+ * - Place the mapping at the resulting address.
+ *
+ * The result would look like this:
+ *
+ * 012d9000-0133e000         [heap]
+ * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * [...free space...]
+ * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
+ * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
+ * ...
+ */
+Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
+       0,                                                                      /* MAIN_SHMEM_SLOT */
+};
+
+/* Remembers offset of the last mapping from the probe address */
+static Size last_offset = 0;
+
+/*
+ * Size of the mapping, which will be used to calculate anonymous mapping
+ * address. It should not be too small, otherwise there is a chance the probe
+ * mapping will be created between other mappings, leaving no room for extending
+ * it. But it should not be too large either, in case there are limitations
+ * on the mapping size. Current value is the default shared_buffers.
+ */
+#define PROBE_MAPPING_SIZE (Size) 128 * 1024 * 1024
+
 static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size);
 static void IpcMemoryDetach(int status, Datum shmaddr);
 static void IpcMemoryDelete(int status, Datum shmId);
@@ -673,13 +730,74 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
        if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON)
        {
+               void *probe = NULL;
+
                /*
                 * Use the original size, not the rounded-up value, when 
falling back
                 * to non-huge pages.
                 */
                allocsize = mapping->shmem_size;
-               ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+
+               /*
+                * Try to create the mapping at an address that will allow
+                * extending it later:
+                *
+                * - First create a temporary probe mapping of a fixed size and let
+                *   the kernel place it at an address of its choice. By virtue of
+                *   the probe mapping size we expect it to be located at the lowest
+                *   possible address, expecting some non mapped space above.
+                *
+                * - Unmap the probe mapping, remember the address.
+                *
+                * - Create the actual anonymous mapping at that address with the
+                *   offset. The offset is calculated in such a way as to allow
+                *   growing the mapping within certain boundaries. For this mapping
+                *   we use MAP_FIXED_NOREPLACE, which will error out with EEXIST if
+                *   there is any mapping clash.
+                *
+                * - If the last step fails, fall back to the regular mapping
+                *   creation and signal that shared buffers could not be resized
+                *   without a restart.
+                */
+               probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
                                   PG_MMAP_FLAGS, -1, 0);
+
+               if (probe == MAP_FAILED)
+               {
+                       mmap_errno = errno;
+                       DebugMappings();
+                       elog(DEBUG1, "segment[%s]: probe mmap(%zu) failed: %m",
+                                       MappingName(mapping->shmem_segment), allocsize);
+               }
+               else
+               {
+                       Size offset = last_offset + SHMEM_EXTRA_SIZE_LIMIT[next_free_segment] + allocsize;
+                       last_offset = offset;
+
+                       munmap(probe, PROBE_MAPPING_SIZE);
+
+                       ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
+                                          PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+                       mmap_errno = errno;
+                       if (ptr == MAP_FAILED)
+                       {
+                               DebugMappings();
+                               elog(DEBUG1, "segment[%s]: mmap(%zu) at address %p failed: %m",
+                                        MappingName(mapping->shmem_segment), allocsize, probe - offset);
+                       }
+
+               }
+       }
+
+       if (ptr == MAP_FAILED)
+       {
+               /*
+                * Fallback to the portable way of creating a mapping.
+                */
+               allocsize = mapping->shmem_size;
+
+               ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
+                                                  PG_MMAP_FLAGS, -1, 0);
                mmap_errno = errno;
        }
 
-- 
2.45.1

From d7af86299878acb73019ac699ef57c120199f1ee Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Mon, 24 Feb 2025 20:08:28 +0100
Subject: [PATCH v2 3/6] Introduce multiple shmem segments for shared buffers

Add more shmem segments to split shared buffers into the following chunks:
* BUFFERS_SHMEM_SEGMENT: contains buffer blocks
* BUFFER_DESCRIPTORS_SHMEM_SEGMENT: contains buffer descriptors
* BUFFER_IOCV_SHMEM_SEGMENT: contains condition variables for buffers
* CHECKPOINT_BUFFERS_SHMEM_SEGMENT: contains checkpoint buffer ids
* STRATEGY_SHMEM_SEGMENT: contains buffer strategy status

The size of the corresponding shared data directly depends on NBuffers,
meaning that if we would like to change NBuffers, they have to be
resized correspondingly. Placing each of them in a separate shmem
segment makes that possible.

There are some assumptions made about each shmem segment's upper size
limit. The buffer blocks segment has the largest, while the rest claim
less extra room for resizing. Ideally those limits should be deduced
from the maximum allowed shared memory.
---
 src/backend/port/sysv_shmem.c          | 19 ++++++-
 src/backend/storage/buffer/buf_init.c  | 79 +++++++++++++++++---------
 src/backend/storage/buffer/buf_table.c |  5 +-
 src/backend/storage/buffer/freelist.c  |  4 +-
 src/backend/storage/ipc/ipci.c         |  2 +-
 src/include/storage/bufmgr.h           |  2 +-
 src/include/storage/pg_shmem.h         | 24 +++++++-
 7 files changed, 99 insertions(+), 36 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 62f01d8218a..59aa67cb135 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -149,8 +149,13 @@ static int next_free_segment = 0;
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
  */
-Size SHMEM_EXTRA_SIZE_LIMIT[1] = {
-       0,                                                                      /* MAIN_SHMEM_SLOT */
+Size SHMEM_EXTRA_SIZE_LIMIT[6] = {
+       0,                                                                      /* MAIN_SHMEM_SEGMENT */
+       (Size) 1024 * 1024 * 1024 * 10,         /* BUFFERS_SHMEM_SEGMENT */
+       (Size) 1024 * 1024 * 1024 * 1,          /* BUFFER_DESCRIPTORS_SHMEM_SEGMENT */
+       (Size) 1024 * 1024 * 100,                       /* BUFFER_IOCV_SHMEM_SEGMENT */
+       (Size) 1024 * 1024 * 100,                       /* CHECKPOINT_BUFFERS_SHMEM_SEGMENT */
+       (Size) 1024 * 1024 * 100,                       /* STRATEGY_SHMEM_SEGMENT */
 };
 
 /* Remembers offset of the last mapping from the probe address */
@@ -179,6 +184,16 @@ MappingName(int shmem_segment)
        {
                case MAIN_SHMEM_SEGMENT:
                        return "main";
+               case BUFFERS_SHMEM_SEGMENT:
+                       return "buffers";
+               case BUFFER_DESCRIPTORS_SHMEM_SEGMENT:
+                       return "descriptors";
+               case BUFFER_IOCV_SHMEM_SEGMENT:
+                       return "iocv";
+               case CHECKPOINT_BUFFERS_SHMEM_SEGMENT:
+                       return "checkpoint";
+               case STRATEGY_SHMEM_SEGMENT:
+                       return "strategy";
                default:
                        return "unknown";
        }
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index ed1f8e03190..f5b9290a640 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -61,7 +61,10 @@ CkptSortItem *CkptBufferIds;
  * Initialize shared buffer pool
  *
  * This is called once during shared-memory initialization (either in the
- * postmaster, or in a standalone backend).
+ * postmaster, or in a standalone backend). Size of data structures initialized
+ * here depends on NBuffers, and to be able to change NBuffers without a
+ * restart we store each structure into a separate shared memory segment, which
+ * could be resized on demand.
  */
 void
 BufferManagerShmemInit(void)
@@ -73,22 +76,22 @@ BufferManagerShmemInit(void)
 
        /* Align descriptors to a cacheline boundary. */
        BufferDescriptors = (BufferDescPadded *)
-               ShmemInitStruct("Buffer Descriptors",
+               ShmemInitStructInSegment("Buffer Descriptors",
                                                NBuffers * sizeof(BufferDescPadded),
-                                               &foundDescs);
+                                               &foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
 
        /* Align buffer pool on IO page size boundary. */
        BufferBlocks = (char *)
                TYPEALIGN(PG_IO_ALIGN_SIZE,
-                                 ShmemInitStruct("Buffer Blocks",
+                                 ShmemInitStructInSegment("Buffer Blocks",
                                                                  NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
-                                                                 &foundBufs));
+                                                                 &foundBufs, BUFFERS_SHMEM_SEGMENT));
 
        /* Align condition variables to cacheline boundary. */
        BufferIOCVArray = (ConditionVariableMinimallyPadded *)
-               ShmemInitStruct("Buffer IO Condition Variables",
+               ShmemInitStructInSegment("Buffer IO Condition Variables",
                                                NBuffers * sizeof(ConditionVariableMinimallyPadded),
-                                               &foundIOCV);
+                                               &foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
 
        /*
         * The array used to sort to-be-checkpointed buffer ids is located in
@@ -98,8 +101,9 @@ BufferManagerShmemInit(void)
         * painful.
         */
        CkptBufferIds = (CkptSortItem *)
-               ShmemInitStruct("Checkpoint BufferIds",
-                                               NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
+               ShmemInitStructInSegment("Checkpoint BufferIds",
+                                               NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+                                               CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
 
        if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
        {
@@ -153,33 +157,54 @@ BufferManagerShmemInit(void)
  * BufferManagerShmemSize
  *
  * compute the size of shared memory for the buffer pool including
- * data pages, buffer descriptors, hash tables, etc.
+ * data pages, buffer descriptors, hash tables, etc., for the given
+ * shared memory segment. The main segment must not allocate anything
+ * related to buffers; every other segment receives its part of the
+ * data.
  */
 Size
-BufferManagerShmemSize(void)
+BufferManagerShmemSize(int shmem_segment)
 {
        Size            size = 0;
 
-       /* size of buffer descriptors */
-       size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
-       /* to allow aligning buffer descriptors */
-       size = add_size(size, PG_CACHE_LINE_SIZE);
+       if (shmem_segment == MAIN_SHMEM_SEGMENT)
+               return size;
 
-       /* size of data pages, plus alignment padding */
-       size = add_size(size, PG_IO_ALIGN_SIZE);
-       size = add_size(size, mul_size(NBuffers, BLCKSZ));
+       if (shmem_segment == BUFFER_DESCRIPTORS_SHMEM_SEGMENT)
+       {
+               /* size of buffer descriptors */
+               size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded)));
+               /* to allow aligning buffer descriptors */
+               size = add_size(size, PG_CACHE_LINE_SIZE);
+       }
 
-       /* size of stuff controlled by freelist.c */
-       size = add_size(size, StrategyShmemSize());
+       if (shmem_segment == BUFFERS_SHMEM_SEGMENT)
+       {
+               /* size of data pages, plus alignment padding */
+               size = add_size(size, PG_IO_ALIGN_SIZE);
+               size = add_size(size, mul_size(NBuffers, BLCKSZ));
+       }
 
-       /* size of I/O condition variables */
-       size = add_size(size, mul_size(NBuffers,
-                                                                  sizeof(ConditionVariableMinimallyPadded)));
-       /* to allow aligning the above */
-       size = add_size(size, PG_CACHE_LINE_SIZE);
+       if (shmem_segment == STRATEGY_SHMEM_SEGMENT)
+       {
+               /* size of stuff controlled by freelist.c */
+               size = add_size(size, StrategyShmemSize());
+       }
 
-       /* size of checkpoint sort array in bufmgr.c */
-       size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+       if (shmem_segment == BUFFER_IOCV_SHMEM_SEGMENT)
+       {
+               /* size of I/O condition variables */
+               size = add_size(size, mul_size(NBuffers,
+                                                                          sizeof(ConditionVariableMinimallyPadded)));
+               /* to allow aligning the above */
+               size = add_size(size, PG_CACHE_LINE_SIZE);
+       }
+
+       if (shmem_segment == CHECKPOINT_BUFFERS_SHMEM_SEGMENT)
+       {
+               /* size of checkpoint sort array in bufmgr.c */
+               size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
+       }
 
        return size;
 }
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index a50955d5286..ac449954dab 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -59,10 +59,11 @@ InitBufTable(int size)
        info.entrysize = sizeof(BufferLookupEnt);
        info.num_partitions = NUM_BUFFER_PARTITIONS;
 
-       SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
+       SharedBufHash = ShmemInitHashInSegment("Shared Buffer Lookup Table",
                                                                  size, size,
                                                                  &info,
-                                                                 HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
+                                                                 HASH_ELEM | HASH_BLOBS | HASH_PARTITION,
+                                                                 STRATEGY_SHMEM_SEGMENT);
 }
 
 /*
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 336715b6c63..4919a92f2be 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -491,9 +491,9 @@ StrategyInitialize(bool init)
         * Get or create the shared strategy control block
         */
        StrategyControl = (BufferStrategyControl *)
-               ShmemInitStruct("Buffer Strategy Status",
+               ShmemInitStructInSegment("Buffer Strategy Status",
                                                sizeof(BufferStrategyControl),
-                                               &found);
+                                               &found, STRATEGY_SHMEM_SEGMENT);
 
        if (!found)
        {
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 4f6c707c204..68778522591 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -112,7 +112,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
                                                                                         sizeof(ShmemIndexEnt)));
        size = add_size(size, dsm_estimate_size());
        size = add_size(size, DSMRegistryShmemSize());
-       size = add_size(size, BufferManagerShmemSize());
+       size = add_size(size, BufferManagerShmemSize(shmem_segment));
        size = add_size(size, LockManagerShmemSize());
        size = add_size(size, PredicateLockShmemSize());
        size = add_size(size, ProcGlobalShmemSize());
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 7c1e4316dde..bb7fe02e243 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -297,7 +297,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
-extern Size BufferManagerShmemSize(void);
+extern Size BufferManagerShmemSize(int);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index 138078c29c5..ba0192baf95 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -52,7 +52,7 @@ typedef struct ShmemSegment
 } ShmemSegment;
 
 /* Number of available segments for anonymous memory mappings */
-#define ANON_MAPPINGS 1
+#define ANON_MAPPINGS 6
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
@@ -105,7 +105,29 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+/*
+ * To be able to dynamically resize the largest parts of the data stored in
+ * shared memory, we split it into multiple shared memory segments. Each
+ * segment contains only a certain part of the data, whose size depends on
+ * NBuffers.
+ */
+
 /* The main segment, contains everything except buffer blocks and related data. */
 #define MAIN_SHMEM_SEGMENT 0
 
+/* Buffer blocks */
+#define BUFFERS_SHMEM_SEGMENT 1
+
+/* Buffer descriptors */
+#define BUFFER_DESCRIPTORS_SHMEM_SEGMENT 2
+
+/* Condition variables for buffers */
+#define BUFFER_IOCV_SHMEM_SEGMENT 3
+
+/* Checkpoint BufferIds */
+#define CHECKPOINT_BUFFERS_SHMEM_SEGMENT 4
+
+/* Buffer strategy status */
+#define STRATEGY_SHMEM_SEGMENT 5
+
 #endif                                                 /* PG_SHMEM_H */
-- 
2.45.1

>From 0173967e8b0fd6c23b158c34b92651fc37ab7660 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Wed, 19 Feb 2025 17:45:40 +0100
Subject: [PATCH v2 4/6] Introduce pending flag for GUC assign hooks

Currently an assign hook can perform some preprocessing of a new value,
but it cannot change the behavior, which dictates that the new value
will be applied immediately after the hook. Certain GUC options (like
shared_buffers, coming in subsequent patches) may need to coordinate work
between backends to change, meaning we cannot apply them right away.

Add a new flag "pending" for an assign hook to allow the hook to indicate
exactly that. If the pending flag is set after the hook, the new value
will not be applied, and its handling becomes the responsibility of the
hook's implementation.

Note that this also requires changes in the way GUCs are reported, but
the patch does not cover that yet.
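
To illustrate the contract, here is a minimal sketch of a hook using the
new flag (change_needs_coordination and pending_newval are hypothetical
names invented for this example; the real shared_buffers hook in a later
patch follows this pattern):

    /* hypothetical flag and storage, for illustration only */
    static bool change_needs_coordination = false;
    static int  pending_newval = -1;

    void
    assign_some_option(int newval, void *extra, bool *pending)
    {
        (void) extra;               /* unused in this sketch */

        if (change_needs_coordination)
        {
            /*
             * Defer the change: the GUC machinery will not assign newval;
             * applying it later is this hook's responsibility.
             */
            *pending = true;
            pending_newval = newval;
        }
        /* otherwise *pending stays false and newval is applied as usual */
    }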
---
 src/backend/access/transam/xlog.c    |  2 +-
 src/backend/commands/variable.c      |  2 +-
 src/backend/libpq/pqcomm.c           |  8 ++--
 src/backend/tcop/postgres.c          |  2 +-
 src/backend/utils/misc/guc.c         | 59 +++++++++++++++++++---------
 src/backend/utils/misc/stack_depth.c |  2 +-
 src/include/utils/guc.h              |  2 +-
 src/include/utils/guc_hooks.h        | 16 ++++----
 8 files changed, 57 insertions(+), 36 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9bf5ba7509..ff82ba0a53d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2188,7 +2188,7 @@ CalculateCheckpointSegments(void)
 }
 
 void
-assign_max_wal_size(int newval, void *extra)
+assign_max_wal_size(int newval, void *extra, bool *pending)
 {
        max_wal_size_mb = newval;
        CalculateCheckpointSegments();
diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 4ad6e236d69..f24c2a0d252 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -1143,7 +1143,7 @@ check_cluster_name(char **newval, void **extra, GucSource source)
  * GUC assign_hook for maintenance_io_concurrency
  */
 void
-assign_maintenance_io_concurrency(int newval, void *extra)
+assign_maintenance_io_concurrency(int newval, void *extra, bool *pending)
 {
 #ifdef USE_PREFETCH
        /*
diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 61ea3722ae2..cdf21847d7e 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -1949,7 +1949,7 @@ pq_settcpusertimeout(int timeout, Port *port)
  * GUC assign_hook for tcp_keepalives_idle
  */
 void
-assign_tcp_keepalives_idle(int newval, void *extra)
+assign_tcp_keepalives_idle(int newval, void *extra, bool *pending)
 {
        /*
         * The kernel API provides no way to test a value without setting it; and
@@ -1982,7 +1982,7 @@ show_tcp_keepalives_idle(void)
  * GUC assign_hook for tcp_keepalives_interval
  */
 void
-assign_tcp_keepalives_interval(int newval, void *extra)
+assign_tcp_keepalives_interval(int newval, void *extra, bool *pending)
 {
        /* See comments in assign_tcp_keepalives_idle */
        (void) pq_setkeepalivesinterval(newval, MyProcPort);
@@ -2005,7 +2005,7 @@ show_tcp_keepalives_interval(void)
  * GUC assign_hook for tcp_keepalives_count
  */
 void
-assign_tcp_keepalives_count(int newval, void *extra)
+assign_tcp_keepalives_count(int newval, void *extra, bool *pending)
 {
        /* See comments in assign_tcp_keepalives_idle */
        (void) pq_setkeepalivescount(newval, MyProcPort);
@@ -2028,7 +2028,7 @@ show_tcp_keepalives_count(void)
  * GUC assign_hook for tcp_user_timeout
  */
 void
-assign_tcp_user_timeout(int newval, void *extra)
+assign_tcp_user_timeout(int newval, void *extra, bool *pending)
 {
        /* See comments in assign_tcp_keepalives_idle */
        (void) pq_settcpusertimeout(newval, MyProcPort);
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 1149d89d7a1..13fb8c31702 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -3555,7 +3555,7 @@ check_log_stats(bool *newval, void **extra, GucSource source)
 
 /* GUC assign hook for transaction_timeout */
 void
-assign_transaction_timeout(int newval, void *extra)
+assign_transaction_timeout(int newval, void *extra, bool *pending)
 {
        if (IsTransactionState())
        {
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 12192445218..bab1c5d08f6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1679,6 +1679,7 @@ InitializeOneGUCOption(struct config_generic *gconf)
                                struct config_int *conf = (struct config_int *) gconf;
                                int                     newval = conf->boot_val;
                                void       *extra = NULL;
+                               bool       pending = false;
 
                                Assert(newval >= conf->min);
                                Assert(newval <= conf->max);
@@ -1687,9 +1688,13 @@ InitializeOneGUCOption(struct config_generic *gconf)
                                        elog(FATAL, "failed to initialize %s to %d",
                                                 conf->gen.name, newval);
                                if (conf->assign_hook)
-                                       conf->assign_hook(newval, extra);
-                               *conf->variable = conf->reset_val = newval;
-                               conf->gen.extra = conf->reset_extra = extra;
+                                       conf->assign_hook(newval, extra, &pending);
+
+                               if (!pending)
+                               {
+                                       *conf->variable = conf->reset_val = newval;
+                                       conf->gen.extra = conf->reset_extra = extra;
+                               }
                                break;
                        }
                case PGC_REAL:
@@ -2041,13 +2046,18 @@ ResetAllOptions(void)
                        case PGC_INT:
                                {
                                        struct config_int *conf = (struct config_int *) gconf;
+                                       bool            pending = false;
 
                                        if (conf->assign_hook)
                                                conf->assign_hook(conf->reset_val,
-                                                                                 conf->reset_extra);
-                                       *conf->variable = conf->reset_val;
-                                       set_extra_field(&conf->gen, &conf->gen.extra,
-                                                                       conf->reset_extra);
+                                                                                 conf->reset_extra,
+                                                                                 &pending);
+                                       if (!pending)
+                                       {
+                                               *conf->variable = conf->reset_val;
+                                               set_extra_field(&conf->gen, &conf->gen.extra,
+                                                                               conf->reset_extra);
+                                       }
                                        break;
                                }
                        case PGC_REAL:
@@ -2424,16 +2434,21 @@ AtEOXact_GUC(bool isCommit, int nestLevel)
                                                        struct config_int *conf = (struct config_int *) gconf;
                                                        int                     newval = newvalue.val.intval;
                                                        void       *newextra = newvalue.extra;
+                                                       bool            pending = false;
 
                                                        if (*conf->variable != newval ||
                                                                conf->gen.extra != newextra)
                                                        {
                                                                if (conf->assign_hook)
-                                                                       conf->assign_hook(newval, newextra);
-                                                               *conf->variable = newval;
-                                                               set_extra_field(&conf->gen, &conf->gen.extra,
-                                                                                               newextra);
-                                                               changed = true;
+                                                                       conf->assign_hook(newval, newextra, &pending);
+
+                                                               if (!pending)
+                                                               {
+                                                                       *conf->variable = newval;
+                                                                       set_extra_field(&conf->gen, &conf->gen.extra,
+                                                                                                       newextra);
+                                                                       changed = true;
+                                                               }
                                                        }
                                                        break;
                                                }
@@ -3850,18 +3865,24 @@ set_config_with_handle(const char *name, config_handle *handle,
 
                                if (changeVal)
                                {
+                                       bool pending = false;
+
                                        /* Save old value to support transaction abort */
                                        if (!makeDefault)
                                                push_old_value(&conf->gen, action);
 
                                        if (conf->assign_hook)
-                                               conf->assign_hook(newval, newextra);
-                                       *conf->variable = newval;
-                                       set_extra_field(&conf->gen, &conf->gen.extra,
-                                                                       newextra);
-                                       set_guc_source(&conf->gen, source);
-                                       conf->gen.scontext = context;
-                                       conf->gen.srole = srole;
+                                               conf->assign_hook(newval, newextra, &pending);
+
+                                       if (!pending)
+                                       {
+                                               *conf->variable = newval;
+                                               set_extra_field(&conf->gen, &conf->gen.extra,
+                                                                               newextra);
+                                               set_guc_source(&conf->gen, source);
+                                               conf->gen.scontext = context;
+                                               conf->gen.srole = srole;
+                                       }
                                }
                                if (makeDefault)
                                {
diff --git a/src/backend/utils/misc/stack_depth.c b/src/backend/utils/misc/stack_depth.c
index 8f7cf531fbc..ef59ae62008 100644
--- a/src/backend/utils/misc/stack_depth.c
+++ b/src/backend/utils/misc/stack_depth.c
@@ -156,7 +156,7 @@ check_max_stack_depth(int *newval, void **extra, GucSource source)
 
 /* GUC assign hook for max_stack_depth */
 void
-assign_max_stack_depth(int newval, void *extra)
+assign_max_stack_depth(int newval, void *extra, bool *pending)
 {
        ssize_t         newval_bytes = newval * (ssize_t) 1024;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 1233e07d7da..ce9f258100d 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -187,7 +187,7 @@ typedef bool (*GucStringCheckHook) (char **newval, void **extra, GucSource sourc
 typedef bool (*GucEnumCheckHook) (int *newval, void **extra, GucSource source);
 
 typedef void (*GucBoolAssignHook) (bool newval, void *extra);
-typedef void (*GucIntAssignHook) (int newval, void *extra);
+typedef void (*GucIntAssignHook) (int newval, void *extra, bool *pending);
 typedef void (*GucRealAssignHook) (double newval, void *extra);
 typedef void (*GucStringAssignHook) (const char *newval, void *extra);
 typedef void (*GucEnumAssignHook) (int newval, void *extra);
diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h
index 951451a9765..3e380f29e5a 100644
--- a/src/include/utils/guc_hooks.h
+++ b/src/include/utils/guc_hooks.h
@@ -81,12 +81,12 @@ extern void assign_log_timezone(const char *newval, void *extra);
 extern const char *show_log_timezone(void);
 extern bool check_maintenance_io_concurrency(int *newval, void **extra,
                                                                                         GucSource source);
-extern void assign_maintenance_io_concurrency(int newval, void *extra);
+extern void assign_maintenance_io_concurrency(int newval, void *extra, bool *pending);
 extern bool check_max_slot_wal_keep_size(int *newval, void **extra,
                                                                                 GucSource source);
-extern void assign_max_wal_size(int newval, void *extra);
+extern void assign_max_wal_size(int newval, void *extra, bool *pending);
 extern bool check_max_stack_depth(int *newval, void **extra, GucSource source);
-extern void assign_max_stack_depth(int newval, void *extra);
+extern void assign_max_stack_depth(int newval, void *extra, bool *pending);
 extern bool check_multixact_member_buffers(int *newval, void **extra,
                                                                                   GucSource source);
 extern bool check_multixact_offset_buffers(int *newval, void **extra,
@@ -141,13 +141,13 @@ extern void assign_synchronous_standby_names(const char *newval, void *extra);
 extern void assign_synchronous_commit(int newval, void *extra);
 extern void assign_syslog_facility(int newval, void *extra);
 extern void assign_syslog_ident(const char *newval, void *extra);
-extern void assign_tcp_keepalives_count(int newval, void *extra);
+extern void assign_tcp_keepalives_count(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_count(void);
-extern void assign_tcp_keepalives_idle(int newval, void *extra);
+extern void assign_tcp_keepalives_idle(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_idle(void);
-extern void assign_tcp_keepalives_interval(int newval, void *extra);
+extern void assign_tcp_keepalives_interval(int newval, void *extra, bool *pending);
 extern const char *show_tcp_keepalives_interval(void);
-extern void assign_tcp_user_timeout(int newval, void *extra);
+extern void assign_tcp_user_timeout(int newval, void *extra, bool *pending);
 extern const char *show_tcp_user_timeout(void);
 extern bool check_temp_buffers(int *newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra,
@@ -163,7 +163,7 @@ extern bool check_transaction_buffers(int *newval, void **extra, GucSource sourc
 extern bool check_transaction_deferrable(bool *newval, void **extra, GucSource source);
 extern bool check_transaction_isolation(int *newval, void **extra, GucSource source);
 extern bool check_transaction_read_only(bool *newval, void **extra, GucSource source);
-extern void assign_transaction_timeout(int newval, void *extra);
+extern void assign_transaction_timeout(int newval, void *extra, bool *pending);
 extern const char *show_unix_socket_permissions(void);
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
 extern bool check_wal_consistency_checking(char **newval, void **extra,
-- 
2.45.1

>From 78ea0efde8799445b90a70ca321e40b75fea52c9 Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Thu, 20 Feb 2025 21:12:26 +0100
Subject: [PATCH v2 5/6] Allow resizing shared memory without restart

Add an assign hook for shared_buffers to resize shared memory using the
space introduced in the previous commits, without requiring a PostgreSQL
restart. Essentially the implementation is based on two mechanisms: a
global Barrier to coordinate backends that simultaneously change
shared_buffers, and state in shared memory to coordinate backends that
are late to the party for some reason.

The resize process looks like this:

* The GUC assign hook sets a flag to let the postmaster know that a
  resize was requested.

* The postmaster checks the flag in its event loop and starts the resize
  by emitting a ProcSignal barrier. Afterwards it resizes its own shared
  memory as well.

* All backends that participate in the ProcSignal mechanism recalculate
  the shared memory size based on the new NBuffers and extend it using
  mremap.

* When finished, a backend waits on a global ShmemControl barrier until
  all other backends are finished as well. This way we ensure three
  stages with clear boundaries: before the resize, when all processes use
  the old NBuffers; during the resize, when processes have a mix of old
  and new NBuffers and wait until it's done; and after the resize, when
  all processes use the new NBuffers.

* After all backends are using the new value, one backend initializes the
  new shared structures (buffer blocks, descriptors, etc.) as needed and
  broadcasts the new value of NBuffers via ShmemControl in shared memory.
  Other backends wait for this operation to finish as well. Then the
  barrier is lifted and everything goes on as usual (a condensed sketch
  of this sequence follows the list).
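
Condensed, the per-backend sequence looks like this (a sketch of
ProcessBarrierShmemResize from this patch, with error handling and
details omitted):

    BarrierAttach(barrier);
    BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
    AnonymousShmemResize();     /* mremap segments; one backend reinits */
    BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
    if (BarrierArriveAndDetach(barrier))
        ResetShmemBarrier();    /* the last backend resets the barrier */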

Here is how it looks after raising shared_buffers from 128 MB to
512 MB and calling pg_reload_conf():

    -- 128 MB
    7f5a2bd04000-7f5a32e52000  /dev/zero (deleted)
    7f5a39252000-7f5a4030e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d7ba000  /dev/zero (deleted)
    7f5a53bba000-7f5a5ad26000  /dev/zero (deleted)
    7f5a9ad26000-7f5aa9d94000  /dev/zero (deleted)
    ^ buffers mapping, ~240 MB
    7f5d29d94000-7f5d30e00000  /dev/zero (deleted)

    -- 512 MB
    7f5a2bd04000-7f5a33274000  /dev/zero (deleted)
    7f5a39252000-7f5a4057e000  /dev/zero (deleted)
    7f5a4670e000-7f5a4d9fa000  /dev/zero (deleted)
    7f5a53bba000-7f5a5b1a6000  /dev/zero (deleted)
    7f5a9ad26000-7f5ac1f14000  /dev/zero (deleted)
    ^ buffers mapping, ~625 MB
    7f5d29d94000-7f5d30f80000  /dev/zero (deleted)

The implementation only supports increasing shared_buffers. Decreasing
the value needs a similar procedure, but the buffer blocks containing
data have to be drained first, so that the actual data set fits into the
new, smaller space.

From experimenting it turns out that shared mappings have to be extended
separately for each process that uses them. Another rough edge is that a
backend blocked on ReadCommand will not apply the shared_buffers change
until it reads something.

Note that mremap is Linux-specific, thus the implementation is not very
portable.
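
For reference, a minimal standalone sketch of the relevant failure mode
(assumes Linux and glibc; not part of the patch):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t  old_size = 1 << 20;     /* 1 MiB */
        size_t  new_size = 1 << 21;     /* 2 MiB */
        char   *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            return 1;

        /* Grow in place; without MREMAP_MAYMOVE this fails with ENOMEM
         * if the adjacent address range is already occupied. */
        if (mremap(p, old_size, new_size, 0) == MAP_FAILED)
            perror("mremap");

        return 0;
    }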

Authors: Dmitrii Dolgov, Ashutosh Bapat
---
 src/backend/port/sysv_shmem.c                 | 300 ++++++++++++++++++
 src/backend/postmaster/postmaster.c           |  15 +
 src/backend/storage/buffer/buf_init.c         | 152 ++++++++-
 src/backend/storage/ipc/ipci.c                |  11 +
 src/backend/storage/ipc/procsignal.c          |  45 +++
 src/backend/storage/ipc/shmem.c               |  14 +-
 src/backend/tcop/postgres.c                   |  15 +
 .../utils/activity/wait_event_names.txt       |   3 +
 src/backend/utils/misc/guc_tables.c           |   4 +-
 src/include/storage/bufmgr.h                  |   1 +
 src/include/storage/ipc.h                     |   2 +
 src/include/storage/lwlocklist.h              |   1 +
 src/include/storage/pg_shmem.h                |  24 ++
 src/include/storage/procsignal.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 15 files changed, 577 insertions(+), 12 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 59aa67cb135..35a8ff92175 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -30,13 +30,17 @@
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "portability/mem.h"
+#include "storage/bufmgr.h"
 #include "storage/dsm.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lwlock.h"
 #include "storage/pg_shmem.h"
+#include "storage/procsignal.h"
 #include "utils/guc.h"
 #include "utils/guc_hooks.h"
 #include "utils/pidfile.h"
+#include "utils/wait_event.h"
 
 
 /*
@@ -105,6 +109,13 @@ typedef struct AnonymousMapping
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
 
+/* Flag telling postmaster that resize is needed */
+volatile bool pending_pm_shmem_resize = false;
+
+/* Keeps track of the previous NBuffers value */
+static int NBuffersOld = -1;
+static int NBuffersPending = -1;
+
 /* Keeps track of used mapping segments */
 static int next_free_segment = 0;
 
@@ -859,6 +870,274 @@ AnonymousShmemDetach(int status, Datum arg)
        }
 }
 
+/*
+ * Resize all shared memory segments based on the current NBuffers value,
+ * which is applied from NBuffersPending. The actual segment resizing is done
+ * via mremap, which will fail if there is not sufficient space to expand the
+ * mapping. When finished, based on the new and old values, initialize new
+ * buffer blocks if any.
+ *
+ * If reinitializing took place, as the last step this function broadcasts
+ * NSharedBuffers to its new value, allowing any other backends to rely on
+ * this new value and skip buffers reinitialization.
+ */
+static bool
+AnonymousShmemResize(void)
+{
+       int                     numSemas;
+       bool            reinit = false;
+
+       NBuffers = NBuffersPending;
+
+       elog(DEBUG1, "Resize shmem from %d to %d", NBuffersOld, NBuffers);
+
+       /*
+        * XXX: Where to reset the flag is still an open question. E.g. do we
+        * consider a no-op, when NBuffers is equal to NBuffersOld, a genuine
+        * resize and reset the flag?
+        */
+       pending_pm_shmem_resize = false;
+
+       /*
+        * XXX: Currently only increasing of shared_buffers is supported. For
+        * decreasing something similar has to be done, but buffer blocks with
+        * data have to be drained first.
+        */
+       if (NBuffersOld > NBuffers)
+               return false;
+
+       for (int i = 0; i < next_free_segment; i++)
+       {
+               /* Note that CalculateShmemSize indirectly depends on NBuffers */
+               Size new_size = CalculateShmemSize(&numSemas, i);
+               AnonymousMapping *m = &Mappings[i];
+
+               if (m->shmem == NULL)
+                       continue;
+
+               if (m->shmem_size == new_size)
+                       continue;
+
+               /*
+                * Fail hard if we face any issues. In theory we could try to
+                * handle this more gracefully and proceed with shared memory as
+                * before, but some other backends might have succeeded and have
+                * a different size. If we would like to go this way, to be
+                * consistent we would need to synchronize again, and it's not
+                * clear if it's worth the effort.
+                */
+               if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
+                       ereport(FATAL,
+                                       (errcode(ERRCODE_SYSTEM_ERROR),
+                                        errmsg("could not resize shared memory %p to %d (%zu): %m",
+                                                       m->shmem, NBuffers, m->shmem_size)));
+               else
+               {
+                       reinit = true;
+                       m->shmem_size = new_size;
+               }
+       }
+
+       if (reinit)
+       {
+               if (IsUnderPostmaster &&
+                       LWLockConditionalAcquire(ShmemResizeLock, LW_EXCLUSIVE))
+               {
+                       /*
+                        * If the new NBuffers was already broadcasted, the buffer
+                        * pool was already initialized before.
+                        *
+                        * Since we're not on a hot path, we use lwlocks and do not
+                        * need to involve a memory barrier.
+                        */
+                       if (pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers) != NBuffers)
+                       {
+                               /*
+                                * Allow the first backend that managed to get the lock
+                                * to reinitialize the new portion of the buffer pool.
+                                * Every other process will wait on the shared barrier
+                                * for that to finish, since it's a part of the
+                                * SHMEM_RESIZE_DONE phase.
+                                *
+                                * XXX: This is the right place for buffer eviction as
+                                * well.
+                                */
+                               ResizeBufferPool(NBuffersOld, true);
+
+                               /* If all fine, broadcast the new value */
+                               pg_atomic_write_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+                       }
+                       else
+                               ResizeBufferPool(NBuffersOld, false);
+
+                       LWLockRelease(ShmemResizeLock);
+               }
+       }
+
+       return true;
+}
+
+/*
+ * We are asked to resize shared memory. Do the resize and make sure to wait on
+ * the provided barrier until all simultaneously participating backends finish
+ * resizing as well; otherwise we risk inconsistency between
+ * backends.
+ *
+ * XXX: If a backend is blocked on ReadCommand in PostgresMain, it will not
+ * proceed with AnonymousShmemResize after receiving SIGHUP until something
+ * is sent.
+ */
+bool
+ProcessBarrierShmemResize(Barrier *barrier)
+{
+       elog(DEBUG1, "Handle a barrier for shmem resizing from %d to %d, %d",
+                NBuffersOld, NBuffersPending, pending_pm_shmem_resize);
+
+       /* Wait until we have seen the new NBuffers value */
+       if (!pending_pm_shmem_resize)
+               return false;
+
+       /*
+        * After attaching to the barrier we could be in any of these states:
+        *
+        * - Initial SHMEM_RESIZE_REQUESTED, nothing has been done yet
+        * - SHMEM_RESIZE_START, some of the backends have started to resize
+        * - SHMEM_RESIZE_DONE, participating backends have finished resizing
+        * - SHMEM_RESIZE_REQUESTED after the reset, the shared memory was
+        *   already resized
+        *
+        * The first three states take place while the actual resize is in
+        * progress, and all we need to do is join and proceed with resizing.
+        * This way all simultaneously participating backends will remap and
+        * wait until one of them initializes the new buffers.
+        *
+        * The last state happens when we are too late and everything is
+        * already done. In that case proceed as well, relying on
+        * AnonymousShmemResize not to reinitialize anything, since
+        * NSharedBuffers is already broadcasted.
+        */
+       BarrierAttach(barrier);
+
+       /* First phase means the resize has begun, SHMEM_RESIZE_START */
+       BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);
+
+       /* XXX: Split mremap and buffer reinitialization into two barrier phases */
+       AnonymousShmemResize();
+
+       /* The second phase means the resize has finished, SHMEM_RESIZE_DONE */
+       BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);
+
+       /* Allow the last backend to reset the barrier */
+       if (BarrierArriveAndDetach(barrier))
+               ResetShmemBarrier();
+
+       return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers. It's recommended for an assign hook to
+ * be as minimal as possible, thus we just request shared memory resize and
+ * remember the previous value.
+ */
+void
+assign_shared_buffers(int newval, void *extra, bool *pending)
+{
+       elog(DEBUG1, "Received SIGHUP for shmem resizing");
+
+       /* Request shared memory resize only when it was initialized */
+       if (next_free_segment != 0)
+       {
+               elog(DEBUG1, "Set pending signal");
+               pending_pm_shmem_resize = true;
+               *pending = true;
+               NBuffersPending = newval;
+       }
+
+       NBuffersOld = NBuffers;
+}
+
+/*
+ * Test if we have somehow missed a shmem resize signal and the NBuffers value
+ * differs from NSharedBuffers. If yes, catch up and do the resize.
+ */
+void
+AdjustShmemSize(void)
+{
+       uint32 NSharedBuffers = pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers);
+
+       if (NSharedBuffers != NBuffers)
+       {
+               /*
+                * If the broadcasted shared_buffers is different from the one we
+                * see, it could be that the backend has missed a resize signal.
+                * To avoid any inconsistency, adjust the shared mappings before
+                * having a chance to access the buffer pool.
+                */
+               ereport(LOG,
+                               (errmsg("shared_buffers has been changed from %d to %d, "
+                                               "resize shared memory",
+                                               NBuffers, NSharedBuffers)));
+               NBuffers = NSharedBuffers;
+               AnonymousShmemResize();
+       }
+}
+
+/*
+ * Coordinate all existing processes to make sure they all have a consistent
+ * view of the shared memory size. Must be called only in postmaster.
+ */
+void
+CoordinateShmemResize(void)
+{
+       elog(DEBUG1, "Coordinating shmem resize from %d to %d",
+                NBuffersOld, NBuffers);
+       Assert(!IsUnderPostmaster);
+
+       /*
+        * If the value did not change, or shared memory segments are not
+        * initialized yet, skip the resize.
+        */
+       if (NBuffersPending == NBuffersOld || next_free_segment == 0)
+       {
+               elog(DEBUG1, "Skip resizing, new %d, old %d, free segment %d",
+                        NBuffers, NBuffersOld, next_free_segment);
+               return;
+       }
+
+       /*
+        * Shared memory resize requires some coordination done by postmaster,
+        * and consists of three phases:
+        *
+        * - Before the resize all existing backends have the same old NBuffers.
+        * - When the resize is in progress, backends are expected to have a
+        *   mixture of old and new values. They're not allowed to touch the
+        *   buffer pool during this time frame.
+        * - After the resize has finished, all existing backends that can
+        *   access the buffer pool are expected to have the same new value of
+        *   NBuffers. There might still be some backends that are sleeping or
+        *   for some other reason not doing any work yet and have the old
+        *   NBuffers -- but as soon as they get some time slice, they will
+        *   acquire the new value.
+        */
+       elog(DEBUG1, "Emit a barrier for shmem resizing");
+       EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SHMEM_RESIZE);
+
+       AnonymousShmemResize();
+
+       /*
+        * Normally we would call WaitForProcSignalBarrier here to wait until
+        * every backend has reported on the ProcSignalBarrier. But for shared
+        * memory resize we don't need this, as every participating backend
+        * will synchronize on the ProcSignal barrier, and there is no
+        * sequential logic we have to perform afterwards. In fact, even if we
+        * would like to wait here, it wouldn't be possible -- we're in the
+        * postmaster, without any waiting infrastructure available.
+        *
+        * If at some point it turns out that waiting is essential, we would
+        * need to consider some alternatives. E.g. it could be a designated
+        * coordination process, which is not a postmaster. Another option
+        * would be to introduce a CoordinateShmemResize lock and allow only
+        * one process to take it (this probably would have to be something
+        * different than LWLocks, since they block interrupts, and
+        * coordination relies on them).
+        */
+}
+
 /*
  * PGSharedMemoryCreate
  *
@@ -1174,3 +1453,24 @@ PGSharedMemoryDetach(void)
                }
        }
 }
+
+void
+WaitOnShmemBarrier(int phase)
+{
+       Barrier *barrier = &ShmemCtrl->Barrier;
+
+       if (BarrierPhase(barrier) == phase)
+       {
+               ereport(LOG,
+                               (errmsg("ProcSignal barrier is in phase %d, waiting", phase)));
+               BarrierAttach(barrier);
+               BarrierArriveAndWait(barrier, 0);
+               BarrierDetach(barrier);
+       }
+}
+
+void
+ResetShmemBarrier(void)
+{
+       BarrierInit(&ShmemCtrl->Barrier, 0);
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index bb22b13adef..f3e508141b2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -418,6 +418,7 @@ static void process_pm_pmsignal(void);
 static void process_pm_child_exit(void);
 static void process_pm_reload_request(void);
 static void process_pm_shutdown_request(void);
+static void process_pm_shmem_resize(void);
 static void dummy_handler(SIGNAL_ARGS);
 static void CleanupBackend(PMChild *bp, int exitstatus);
 static void HandleChildCrash(int pid, int exitstatus, const char *procname);
@@ -1680,6 +1681,9 @@ ServerLoop(void)
                        if (pending_pm_pmsignal)
                                process_pm_pmsignal();
 
+                       if (pending_pm_shmem_resize)
+                               process_pm_shmem_resize();
+
                        if (events[i].events & WL_SOCKET_ACCEPT)
                        {
                                ClientSocket s;
@@ -2026,6 +2030,17 @@ process_pm_reload_request(void)
        }
 }
 
+static void
+process_pm_shmem_resize(void)
+{
+       /*
+        * Failure to resize is considered to be fatal and will not be
+        * retried, which means we can disable the pending flag right here.
+        */
+       pending_pm_shmem_resize = false;
+       CoordinateShmemResize();
+}
+
 /*
  * pg_ctl uses SIGTERM, SIGINT and SIGQUIT to request different types of
  * shutdown.
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index f5b9290a640..b7de0ab6b0d 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -23,6 +23,41 @@ ConditionVariableMinimallyPadded *BufferIOCVArray;
 WritebackContext BackendWritebackContext;
 CkptSortItem *CkptBufferIds;
 
+/*
+ * Currently broadcasted value of NBuffers in shared memory.
+ *
+ * Most of the time this value is going to be equal to NBuffers. But if
+ * postmaster is resizing shared memory and a new backend was created
+ * at the same time, there is a possibility for the new backend to inherit the
+ * old NBuffers value, but miss the resize signal if ProcSignal infrastructure
+ * was not initialized yet. Consider this situation:
+ *
+ *     Postmaster ------> New Backend
+ *         |                   |
+ *         |                Launch
+ *         |                   |
+ *         |             Inherit NBuffers
+ *         |                   |
+ *     Resize NBuffers         |
+ *         |                   |
+ *     Emit Barrier            |
+ *         |            Init ProcSignal
+ *         |                   |
+ *     Finish resize           |
+ *         |                   |
+ *     New NBuffers       Old NBuffers
+ *
+ * In this case the backend is not yet ready to receive a signal from
+ * EmitProcSignalBarrier and will be ignored by it. The same happens if
+ * ProcSignal is initialized even later, after the resizing has finished.
+ *
+ * To address the resulting inconsistency, the postmaster broadcasts the
+ * current NBuffers value via shared memory. Every new backend has to verify
+ * this value before it accesses the buffer pool: if it differs from its own
+ * value, a shared memory resize has happened and the backend has to
+ * synchronize with the rest of the pack first.
+ */
+ShmemControl *ShmemCtrl = NULL;
 
 /*
  * Data Structures:
@@ -72,7 +107,19 @@ BufferManagerShmemInit(void)
        bool            foundBufs,
                                foundDescs,
                                foundIOCV,
-                               foundBufCkpt;
+                               foundBufCkpt,
+                               foundShmemCtrl;
+
+       ShmemCtrl = (ShmemControl *)
+               ShmemInitStruct("Shmem Control", sizeof(ShmemControl),
+                                               &foundShmemCtrl);
+
+       if (!foundShmemCtrl)
+       {
+               /* Initialize with the currently known value */
+               pg_atomic_init_u32(&ShmemCtrl->NSharedBuffers, NBuffers);
+               BarrierInit(&ShmemCtrl->Barrier, 0);
+       }
 
        /* Align descriptors to a cacheline boundary. */
        BufferDescriptors = (BufferDescPadded *)
@@ -153,6 +200,109 @@ BufferManagerShmemInit(void)
                                                 &backend_flush_after);
 }
 
+/*
+ * Reinitialize shared memory structures whose size depends on NBuffers. It's
+ * similar to BufferManagerShmemInit, but applied only to the buffers in the
+ * range between NBuffersOld and NBuffers.
+ *
+ * NBuffersOld gives the original value of NBuffers. It will be used to
+ * identify new and not yet initialized buffers.
+ *
+ * The initNew flag indicates that the caller wants new buffers to be
+ * initialized. No locks are taken in this function; it is the caller's
+ * responsibility to make sure only one backend can work with the new buffers.
+ */
+void
+ResizeBufferPool(int NBuffersOld, bool initNew)
+{
+       bool            foundBufs,
+                               foundDescs,
+                               foundIOCV,
+                               foundBufCkpt;
+       int                     i;
+
+       elog(DEBUG1, "Resizing buffer pool from %d to %d", NBuffersOld, NBuffers);
+
+       /* XXX: Only increasing of shared_buffers is supported in this function */
+       if (NBuffersOld > NBuffers)
+               return;
+
+       /* Align descriptors to a cacheline boundary. */
+       BufferDescriptors = (BufferDescPadded *)
+               ShmemInitStructInSegment("Buffer Descriptors",
+                                               NBuffers * sizeof(BufferDescPadded),
+                                               &foundDescs, BUFFER_DESCRIPTORS_SHMEM_SEGMENT);
+
+       /* Align condition variables to cacheline boundary. */
+       BufferIOCVArray = (ConditionVariableMinimallyPadded *)
+               ShmemInitStructInSegment("Buffer IO Condition Variables",
+                                               NBuffers * sizeof(ConditionVariableMinimallyPadded),
+                                               &foundIOCV, BUFFER_IOCV_SHMEM_SEGMENT);
+
+       /*
+        * The array used to sort to-be-checkpointed buffer ids is located in
+        * shared memory, to avoid having to allocate significant amounts of
+        * memory at runtime. As that'd be in the middle of a checkpoint, or when
+        * the checkpointer is restarted, memory allocation failures would be
+        * painful.
+        */
+       CkptBufferIds = (CkptSortItem *)
+               ShmemInitStructInSegment("Checkpoint BufferIds",
+                                               NBuffers * sizeof(CkptSortItem), &foundBufCkpt,
+                                               CHECKPOINT_BUFFERS_SHMEM_SEGMENT);
+
+       /* Align buffer pool on IO page size boundary. */
+       BufferBlocks = (char *)
+               TYPEALIGN(PG_IO_ALIGN_SIZE,
+                                 ShmemInitStructInSegment("Buffer Blocks",
+                                                                 NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
+                                                                 &foundBufs, BUFFERS_SHMEM_SEGMENT));
+
+       /*
+        * It's enough to only resize shmem structures if some other backend
+        * will do the initialization of new buffers for us.
+        */
+       if (!initNew)
+               return;
+
+       elog(DEBUG1, "Initialize new buffers");
+
+       /*
+        * Initialize the headers for new buffers.
+        */
+       for (i = NBuffersOld; i < NBuffers; i++)
+       {
+               BufferDesc *buf = GetBufferDescriptor(i);
+
+               ClearBufferTag(&buf->tag);
+
+               pg_atomic_init_u32(&buf->state, 0);
+               buf->wait_backend_pgprocno = INVALID_PROC_NUMBER;
+
+               buf->buf_id = i;
+
+               /*
+                * Initially link all the buffers together as unused. Subsequent
+                * management of this list is done by freelist.c.
+                */
+               buf->freeNext = i + 1;
+
+               LWLockInitialize(BufferDescriptorGetContentLock(buf),
+                                                LWTRANCHE_BUFFER_CONTENT);
+
+               ConditionVariableInit(BufferDescriptorGetIOCV(buf));
+       }
+
+       /* Correct last entry of linked list */
+       GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
+
+       /* Init other shared buffer-management stuff */
+       StrategyInitialize(!foundDescs);
+
+       /* Initialize per-backend file flush context */
+       WritebackContextInit(&BackendWritebackContext,
+                                                &backend_flush_after);
+}
+
 /*
  * BufferManagerShmemSize
  *
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 68778522591..a2c635f288e 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -83,6 +83,9 @@ RequestAddinShmemSpace(Size size)
  *
  * If num_semaphores is not NULL, it will be set to the number of semaphores
  * required.
+ *
+ * XXX: Calculations for non-main shared memory segments are incorrect; they
+ * include more than is needed for buffers only.
  */
 Size
 CalculateShmemSize(int *num_semaphores, int shmem_segment)
@@ -149,6 +152,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_segment)
        size = add_size(size, InjectionPointShmemSize());
        size = add_size(size, SlotSyncShmemSize());
 
+       /*
+        * XXX: For some reason slightly more memory is needed for larger
+        * shared_buffers, but this size is enough for any large value I've
+        * tested with. Is it a mistake in how slots are split, or was there a
+        * hidden inconsistency in the shmem calculation?
+        */
+       size = add_size(size, 1024 * 1024 * 100);
+
        /* include additional requested shmem from preload libraries */
        size = add_size(size, total_addin_request);
 
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7401b6e625e..bec0e00f901 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -27,6 +27,7 @@
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
 #include "storage/latch.h"
+#include "storage/pg_shmem.h"
 #include "storage/shmem.h"
 #include "storage/sinval.h"
 #include "storage/smgr.h"
@@ -108,6 +109,10 @@ static bool CheckProcSignal(ProcSignalReason reason);
 static void CleanupProcSignalState(int status, Datum arg);
 static void ResetProcSignalBarrierBits(uint32 flags);
 
+#ifdef DEBUG_SHMEM_RESIZE
+bool delay_proc_signal_init = false;
+#endif
+
 /*
  * ProcSignalShmemSize
  *             Compute space needed for ProcSignal's shared memory
@@ -168,6 +173,42 @@ ProcSignalInit(bool cancel_key_valid, int32 cancel_key)
        ProcSignalSlot *slot;
        uint64          barrier_generation;
 
+#ifdef DEBUG_SHMEM_RESIZE
+       /*
+        * Introduced for debugging purposes. You can change the variable at
+        * runtime using gdb, then start new backends with delayed ProcSignal
+        * initialization. A simple pg_usleep won't work here due to the SIGHUP
+        * interrupt needed for testing. Taken from pg_sleep.
+        */
+       if (delay_proc_signal_init)
+       {
+#define GetNowFloat()  ((float8) GetCurrentTimestamp() / 1000000.0)
+               float8          endtime = GetNowFloat() + 5;
+
+               for (;;)
+               {
+                       float8          delay;
+                       long            delay_ms;
+
+                       CHECK_FOR_INTERRUPTS();
+
+                       delay = endtime - GetNowFloat();
+                       if (delay >= 600.0)
+                               delay_ms = 600000;
+                       else if (delay > 0.0)
+                               delay_ms = (long) (delay * 1000.0);
+                       else
+                               break;
+
+                       (void) WaitLatch(MyLatch,
+                                                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+                                                        delay_ms,
+                                                        WAIT_EVENT_PG_SLEEP);
+                       ResetLatch(MyLatch);
+               }
+       }
+#endif
+
        if (MyProcNumber < 0)
                elog(ERROR, "MyProcNumber not set");
        if (MyProcNumber >= NumProcSignalSlots)
@@ -573,6 +614,10 @@ ProcessProcSignalBarrier(void)
                                        case PROCSIGNAL_BARRIER_SMGRRELEASE:
                                                processed = ProcessBarrierSmgrRelease();
                                                break;
+                                       case PROCSIGNAL_BARRIER_SHMEM_RESIZE:
+                                               processed = ProcessBarrierShmemResize(&ShmemCtrl->Barrier);
+                                               break;
                                }
 
                                /*
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 389abc82519..226b38ba979 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -493,17 +493,13 @@ ShmemInitStructInSegment(const char *name, Size size, bool *foundPtr,
        {
                /*
                 * Structure is in the shmem index so someone else has allocated it
-                * already.  The size better be the same as the size we are trying to
-                * initialize to, or there is a name conflict (or worse).
+                * already. Verify the structure's size:
+                * - If it's the same, we've found the expected structure.
+                * - If it's different, we're resizing the expected structure.
                 */
                if (result->size != size)
-               {
-                       LWLockRelease(ShmemIndexLock);
-                       ereport(ERROR,
-                                       (errmsg("ShmemIndex entry size is wrong for data structure"
-                                                       " \"%s\": expected %zu, actual %zu",
-                                                       name, size, result->size)));
-               }
+                       result->size = size;
+
                structPtr = result->location;
        }
        else
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 13fb8c31702..04cdd0d24d8 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -62,6 +62,7 @@
 #include "rewrite/rewriteHandler.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
+#include "storage/pg_shmem.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
@@ -4267,6 +4268,20 @@ PostgresMain(const char *dbname, const char *username)
         */
        BeginReportingGUCOptions();
 
+       /*
+        * Verify the shared barrier, if it's still active: join and wait.
+        *
+        * XXX: Any potential race condition if not a single backend has
+        * incremented the barrier phase?
+        */
+       WaitOnShmemBarrier(SHMEM_RESIZE_START);
+
+       /*
+        * After waiting on the barrier above we are guaranteed to have
+        * NSharedBuffers broadcasted, so we can use it in the function below.
+        */
+       AdjustShmemSize();
+
        /*
         * Also set up handler to log session end; we have to wait till now to be
         * sure Log_disconnections has its final value.
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..012acb98169 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -154,6 +154,8 @@ REPLICATION_ORIGIN_DROP     "Waiting for a replication origin to become inactive so
 REPLICATION_SLOT_DROP  "Waiting for a replication slot to become inactive so it can be dropped."
 RESTORE_COMMAND        "Waiting for <xref linkend="guc-restore-command"/> to complete."
 SAFE_SNAPSHOT  "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
+SHMEM_RESIZE_START     "Waiting for other backends to start resizing shared memory."
+SHMEM_RESIZE_DONE      "Waiting for other backends to finish resizing shared memory."
 SYNC_REP       "Waiting for confirmation from a remote server during synchronous replication."
 WAL_RECEIVER_EXIT      "Waiting for the WAL receiver to exit."
 WAL_RECEIVER_WAIT_START        "Waiting for startup process to send initial data for streaming replication."
@@ -346,6 +348,7 @@ WALSummarizer       "Waiting to read or update WAL summarization state."
 DSMRegistry    "Waiting to read or update the dynamic shared memory registry."
 InjectionPoint "Waiting to read or update information related to injection points."
 SerialControl  "Waiting to read or update shared <filename>pg_serial</filename> state."
+ShmemResize    "Waiting to resize shared memory."
 
 #
 # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE)
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 3cde94a1759..efdaa71c8fb 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -2339,14 +2339,14 @@ struct config_int ConfigureNamesInt[] =
         * checking for overflow, so we mustn't allow more than INT_MAX / 2.
         */
        {
-               {"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM,
+               {"shared_buffers", PGC_SIGHUP, RESOURCES_MEM,
                        gettext_noop("Sets the number of shared memory buffers 
used by the server."),
                        NULL,
                        GUC_UNIT_BLOCKS
                },
                &NBuffers,
                16384, 16, INT_MAX / 2,
-               NULL, NULL, NULL
+               NULL, assign_shared_buffers, NULL
        },
 
        {
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index bb7fe02e243..fff80214822 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -298,6 +298,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf);
 /* in buf_init.c */
 extern void BufferManagerShmemInit(void);
 extern Size BufferManagerShmemSize(int);
+extern void ResizeBufferPool(int, bool);
 
 /* in localbuf.c */
 extern void AtProcExit_LocalBuffers(void);
diff --git a/src/include/storage/ipc.h b/src/include/storage/ipc.h
index c0439f2206b..5f5b45c88bd 100644
--- a/src/include/storage/ipc.h
+++ b/src/include/storage/ipc.h
@@ -64,6 +64,7 @@ typedef void (*shmem_startup_hook_type) (void);
 /* ipc.c */
 extern PGDLLIMPORT bool proc_exit_inprogress;
 extern PGDLLIMPORT bool shmem_exit_inprogress;
+extern PGDLLIMPORT volatile bool pending_pm_shmem_resize;
 
 extern void proc_exit(int code) pg_attribute_noreturn();
 extern void shmem_exit(int code);
@@ -83,5 +84,6 @@ extern void CreateSharedMemoryAndSemaphores(void);
 extern void AttachSharedMemoryStructs(void);
 #endif
 extern void InitializeShmemGUCs(void);
+extern void CoordinateShmemResize(void);
 
 #endif                                                 /* IPC_H */
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..61e89c6e8fd 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -83,3 +83,4 @@ PG_LWLOCK(49, WALSummarizer)
 PG_LWLOCK(50, DSMRegistry)
 PG_LWLOCK(51, InjectionPoint)
 PG_LWLOCK(52, SerialControl)
+PG_LWLOCK(53, ShmemResize)
diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h
index ba0192baf95..b597df0d3a3 100644
--- a/src/include/storage/pg_shmem.h
+++ b/src/include/storage/pg_shmem.h
@@ -24,6 +24,7 @@
 #ifndef PG_SHMEM_H
 #define PG_SHMEM_H
 
+#include "storage/barrier.h"
 #include "storage/dsm_impl.h"
 #include "storage/spin.h"
 
@@ -56,6 +57,23 @@ typedef struct ShmemSegment
 
 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS];
 
+/*
+ * ShmemControl is shared between backends and helps coordinate shared
+ * memory resizing.
+ */
+typedef struct
+{
+       pg_atomic_uint32        NSharedBuffers;
+       Barrier                         Barrier;
+} ShmemControl;
+
+extern PGDLLIMPORT ShmemControl *ShmemCtrl;
+
+/* The phases for shared memory resizing, used by the ProcSignal barrier. */
+#define SHMEM_RESIZE_REQUESTED                 0
+#define SHMEM_RESIZE_START                             1
+#define SHMEM_RESIZE_DONE                              2
+
 /* GUC variables */
 extern PGDLLIMPORT int shared_memory_type;
 extern PGDLLIMPORT int huge_pages;
@@ -105,6 +123,12 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2);
 extern void PGSharedMemoryDetach(void);
 extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags);
 
+extern bool ProcessBarrierShmemResize(Barrier *barrier);
+extern void assign_shared_buffers(int newval, void *extra, bool *pending);
+extern void AdjustShmemSize(void);
+extern void WaitOnShmemBarrier(int phase);
+extern void ResetShmemBarrier(void);
+
 /*
  * To be able to dynamically resize largest parts of the data stored in shared
  * memory, we split it into multiple shared memory mappings segments. Each
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 022fd8ed933..4c9973dc2d9 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -54,6 +54,7 @@ typedef enum
 typedef enum
 {
        PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
+       PROCSIGNAL_BARRIER_SHMEM_RESIZE,        /* ask backends to resize shared memory */
 } ProcSignalBarrierType;
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fb39c915d76..5bf6d099808 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2671,6 +2671,7 @@ ShellTypeInfo
 ShippableCacheEntry
 ShippableCacheKey
+ShmemControl
 ShmemIndexEnt
 ShutdownForeignScan_function
 ShutdownInformation
 ShutdownMode
-- 
2.45.1
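
To make the coordination flow easier to follow, here is a minimal sketch of
how a backend participates in a resize, built on the stock Barrier API from
storage/barrier.h. ShmemCtrl, ResizeBufferPool and the SHMEM_RESIZE_* wait
events are the ones introduced above; the function name, the choice of
arguments and the omitted error handling are illustrative only, not the
actual implementation:

/* Sketch: a backend's side of a coordinated shared memory resize. */
static bool
ProcessBarrierShmemResizeSketch(Barrier *barrier)
{
    /* Join the barrier; a backend that is late attaches at the current phase. */
    BarrierAttach(barrier);

    /* Wait until all backends have arrived at the starting point... */
    BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_START);

    /* ...resize the buffer pool to the broadcast NSharedBuffers value... */
    ResizeBufferPool((int) pg_atomic_read_u32(&ShmemCtrl->NSharedBuffers),
                     false);

    /* ...and wait for everyone else to finish before resuming normal work. */
    BarrierArriveAndWait(barrier, WAIT_EVENT_SHMEM_RESIZE_DONE);

    BarrierDetach(barrier);
    return true;
}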

From e511bab55891a2d60152e913df6c20e20314e71b Mon Sep 17 00:00:00 2001
From: Dmitrii Dolgov <9erthali...@gmail.com>
Date: Sun, 23 Feb 2025 14:42:39 +0100
Subject: [PATCH v2 6/6] Use anonymous files to back shared memory segments

Allow using anonymous files for shared memory, instead of plain
anonymous memory. Such an anonymous file is created via memfd_create; it
lives in memory, behaves like a regular file, and is semantically
equivalent to anonymous memory allocated via mmap with MAP_ANONYMOUS.

The advantages of using anonymous files are the following:

* We get a file descriptor, which can be used for regular file
  operations (modification, truncation, you name it).

* The file can be given a name, which improves readability of process
  maps. Here is how it looks:

7f5a2bd04000-7f5a32e52000 rw-s 00000000 00:01 1845 /memfd:strategy (deleted)
7f5a39252000-7f5a4030e000 rw-s 00000000 00:01 1842 /memfd:checkpoint (deleted)
7f5a4670e000-7f5a4d7ba000 rw-s 00000000 00:01 1839 /memfd:iocv (deleted)
7f5a53bba000-7f5a5ad26000 rw-s 00000000 00:01 1836 /memfd:descriptors (deleted)
7f5a9ad26000-7f5aa9d94000 rw-s 00000000 00:01 1833 /memfd:buffers (deleted)
7f5d29d94000-7f5d30e00000 rw-s 00000000 00:01 1830 /memfd:main (deleted)

* By default, Linux will not include file-backed shared mappings in a core
  dump, making it more convenient to work with them in PostgreSQL: no more
  huge dumps to process.

The downside is that memfd_create is Linux specific.
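
For illustration, the mechanism in isolation is just a couple of syscalls.
A minimal standalone example (Linux only, glibc 2.27+; the segment name and
size here are arbitrary, not the ones the patch uses):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    size_t  size = 16 * 1024 * 1024;
    char   *p;

    /* The name shows up in /proc/<pid>/maps as "/memfd:buffers (deleted)". */
    int     fd = memfd_create("buffers", 0);

    if (fd < 0)
        return 1;

    /* The anonymous file starts empty; give it a size. */
    if (ftruncate(fd, size) < 0)
        return 1;

    /* No MAP_ANONYMOUS: the in-memory file backs the pages. */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;

    strcpy(p, "backed by an anonymous file");
    printf("%s\n", p);

    munmap(p, size);
    close(fd);
    return 0;
}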
---
 src/backend/port/sysv_shmem.c | 46 +++++++++++++++++++++++++++++------
 src/include/portability/mem.h |  2 +-
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 35a8ff92175..8864866f26c 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -105,6 +105,7 @@ typedef struct AnonymousMapping
        void *shmem;                            /* Pointer to the start of the mapped memory */
        void *seg_addr;                         /* SysV shared memory for the header */
        unsigned long seg_id;           /* IPC key */
+       int segment_fd;                         /* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -125,7 +126,7 @@ static int next_free_segment = 0;
  * 00400000-00490000         /path/bin/postgres
  * ...
  * 012d9000-0133e000         [heap]
- * 7f443a800000-7f470a800000 /dev/zero (deleted)
+ * 7f443a800000-7f470a800000 /memfd:main (deleted)
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
  * ...
@@ -152,9 +153,9 @@ static int next_free_segment = 0;
  * The result would look like this:
  *
  * 012d9000-0133e000         [heap]
- * 7f4426f54000-7f442e010000 /dev/zero (deleted)
+ * 7f4426f54000-7f442e010000 /memfd:main (deleted)
  * [...free space...]
- * 7f443a800000-7f444196c000 /dev/zero (deleted)
+ * 7f443a800000-7f444196c000 /memfd:buffers (deleted)
  * [...free space...]
  * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
  * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -717,6 +718,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
        void       *ptr = MAP_FAILED;
        int                     mmap_errno = 0;
 
+       /*
+        * Prepare an anonymous file backing the segment. Its size will be
+        * specified later via ftruncate.
+        *
+        * The file behaves like a regular file, but lives in memory. Once all
+        * references to the file are dropped, it is automatically released.
+        * Anonymous memory is used for all backing pages of the file, thus it
+        * has the same semantics as anonymous memory allocated via mmap with
+        * the MAP_ANONYMOUS flag.
+        */
+       mapping->segment_fd = memfd_create(MappingName(mapping->shmem_segment), 0);
+
 #ifndef MAP_HUGETLB
        /* PGSharedMemoryCreate should have dealt with this case */
        Assert(huge_pages != HUGE_PAGES_ON);
@@ -734,8 +747,13 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
                if (allocsize % hugepagesize != 0)
                        allocsize += hugepagesize - (allocsize % hugepagesize);
 
+               /*
+                * Do not use an anonymous file here yet. When adding it, do not
+                * forget to use ftruncate and the MFD_HUGETLB &
+                * MFD_HUGE_2MB/MFD_HUGE_1GB flags in memfd_create.
+                */
                ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-                                  PG_MMAP_FLAGS | mmap_flags, -1, 0);
+                                  PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
-1, 0);
                mmap_errno = errno;
                if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
                {
@@ -771,7 +789,8 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
                 * - First create the temporary probe mapping of a fixed size and let
                 *   kernel to place it at address of its choice. By the virtue of the
                 *   probe mapping size we expect it to be located at the lowest
-                *   possible address, expecting some non mapped space above.
+                *   possible address, expecting some non mapped space above. The probe
+                *   does not need to be backed by an anonymous file.
                 *
                 * - Unmap the probe mapping, remember the address.
                 *
@@ -786,7 +805,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
                 *   without a restart.
                 */
                probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
-                                  PG_MMAP_FLAGS, -1, 0);
+                                  PG_MMAP_FLAGS | MAP_ANONYMOUS, -1, 0);
 
                if (probe == MAP_FAILED)
                {
@@ -802,8 +821,14 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
                        munmap(probe, PROBE_MAPPING_SIZE);
 
+                       /*
+                        * Specify the segment file size using allocsize, which
+                        * contains the potentially modified size.
+                        */
+                       ftruncate(mapping->segment_fd, allocsize);
+
                        ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
-                                          PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+                                          PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, mapping->segment_fd, 0);
                        mmap_errno = errno;
                        if (ptr == MAP_FAILED)
                        {
@@ -822,8 +847,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
                 */
                allocsize = mapping->shmem_size;
 
+               /* Specify the segment file size using allocsize. */
+               ftruncate(mapping->segment_fd, allocsize);
+
                ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-                                                  PG_MMAP_FLAGS, -1, 0);
+                                                  PG_MMAP_FLAGS, mapping->segment_fd, 0);
                mmap_errno = errno;
        }
 
@@ -917,6 +945,8 @@ AnonymousShmemResize(void)
                if (m->shmem_size == new_size)
                        continue;
 
+               /* Resize the backing anon file. */
+               ftruncate(m->segment_fd, new_size);
 
                /*
                 * Fail hard if faced any issues. In theory we could try to handle this
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index ef9800732d9..40588ff6968 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC                     0
 #endif
 
-#define PG_MMAP_FLAGS                  (MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS                  (MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
-- 
2.45.1
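
The AnonymousShmemResize hunk above pairs the added ftruncate with a remap of
the segment itself. Reduced to its essence, the combination looks roughly like
the sketch below; the helper name and arguments are illustrative, and it
assumes the mapping has to keep its address (hence no MREMAP_MAYMOVE), since
pointers into shared memory must stay valid across all backends:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: grow a memfd-backed segment in place. */
static void *
resize_segment(int fd, void *addr, size_t old_size, size_t new_size)
{
    /* Grow the backing anonymous file first... */
    if (ftruncate(fd, new_size) < 0)
        return MAP_FAILED;

    /*
     * ...then the mapping itself. Growing in place can only succeed if the
     * address space above the mapping is still unoccupied, which is what
     * the probe/offset placement is there to arrange.
     */
    return mremap(addr, old_size, new_size, 0);
}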
