TL;DR: A PoC for changing shared_buffers without a PostgreSQL restart, by changing the shared memory mapping layout. Any feedback is appreciated.
Hi,

Being able to change PostgreSQL configuration on the fly is an important property for performance tuning, since it reduces the feedback time and the invasiveness of the process. In certain cases it even becomes highly desirable, e.g. when doing automatic tuning. But there are a couple of important configuration options that cannot be modified without a restart, the most notorious example being shared_buffers. I've been working recently on an idea for how to change that, allowing shared_buffers to be modified without a restart. To demonstrate the approach, I've prepared a PoC that ignores lots of things, but works in the limited set of use cases I was testing. I would like to discuss the idea and get some feedback.

Patches 1-3 prepare the infrastructure and the shared memory layout. They could be useful even with a multithreaded PostgreSQL, when there will be no need for shared memory: I assume that in the multithreaded world there will still be a need for a contiguous chunk of memory to share between threads, and its layout would be similar to the one with shared memory mappings. Patch 4 actually does the resizing. It's shared-memory specific of course, and utilizes the Linux-specific mremap, which leaves portability questions open. Patch 5 is somewhat independent, but quite convenient to have. It also utilizes the Linux-specific call memfd_create.

The patch set still doesn't address lots of things, e.g. shared memory segment detach/reattach and portability questions, and it doesn't touch the EXEC_BACKEND code or huge pages.
So far I've done some rudimentary testing: spinning up PostgreSQL, then increasing shared_buffers and running pgbench with a scale factor large enough to extend the data set into the newly allocated buffers:

-- shared_buffers 128 MB
=# SELECT * FROM pg_buffercache_summary();
 buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+----------------
          134 |          16250 |             1 |              0

-- change shared_buffers to 512 MB
=# select pg_reload_conf();

=# SELECT * FROM pg_buffercache_summary();
 buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+----------------
          221 |          65315 |             1 |              0

-- round of pgbench read-only load
=# SELECT * FROM pg_buffercache_summary();
 buffers_used | buffers_unused | buffers_dirty | buffers_pinned
--------------+----------------+---------------+----------------
        41757 |          23779 |           216 |              0

Here is the breakdown:

v1-0001-Allow-to-use-multiple-shared-memory-mappings.patch

Preparation, introduces the possibility to work with many shmem mappings. To make it less invasive, I've duplicated the shmem API to extend it with the shmem_slot argument, while redirecting the original API to it. There are probably better ways of doing that; I'm open to suggestions.

v1-0002-Allow-placing-shared-memory-mapping-with-an-offse.patch

Implements a new layout of shared memory mappings that includes room for resizing. I've done a couple of tests to verify that such space in between doesn't affect how the kernel calculates actual used memory, to make sure that e.g. cgroup will not trigger an OOM. The only change seems to be in VmPeak, which is the total of mapped pages.

v1-0003-Introduce-multiple-shmem-slots-for-shared-buffers.patch

Splits shared_buffers into multiple slots, moving structures that depend on NBuffers out into separate mappings.
There are two large gaps here:

* Shmem size calculation for those mappings is not correct yet, it includes too many other things (no particular issues here, just haven't had time).

* It makes hardcoded assumptions about the upper limit for resizing, which is currently low purely for experiments. Ideally there should be a new configuration option to specify the total available memory, which would be the base for subsequent calculations.

v1-0004-Allow-to-resize-shared-memory-without-restart.patch

Does the shared_buffers change without a restart. The current approach is clumsy: it adds an assign hook for shared_buffers and goes from there, using mremap to resize the mappings. But I haven't immediately found any better approach. Currently it supports only an increase of shared_buffers.

v1-0005-Use-anonymous-files-to-back-shared-memory-segment.patch

Allows an anonymous file to back a shared mapping. This makes certain things easier, e.g. the visual representation of mappings, and gives an fd for possible future customizations.

In this thread I'm hoping to answer the following questions:

* Are there any concerns about this approach?
* What would be a better mechanism to handle resizing than an assign hook?
* Assuming I'll be able to address the already known missing bits, what are the chances the patch series could be accepted?
>From 954613a63cb1102d7eb88f92e7ff561828bbb5c9 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Wed, 9 Oct 2024 15:41:32 +0200 Subject: [PATCH v1 1/5] Allow to use multiple shared memory mappings Currently all the work with shared memory is done via a single anonymous memory mapping, which limits ways how the shared memory could be organized. Introduce possibility to allocate multiple shared memory mappings, where a single mapping is associated with a specified shared memory slot. There is only fixed amount of available slots, currently only one main shared memory slot is allocated. A new shared memory API is introduces, extended with a slot as a new parameter. As a path of least resistance, the original API is kept in place, utilizing the main shared memory slot. --- src/backend/port/posix_sema.c | 4 +- src/backend/port/sysv_sema.c | 4 +- src/backend/port/sysv_shmem.c | 138 +++++++++++++++++++--------- src/backend/port/win32_sema.c | 2 +- src/backend/storage/ipc/ipc.c | 2 +- src/backend/storage/ipc/ipci.c | 61 ++++++------ src/backend/storage/ipc/shmem.c | 133 ++++++++++++++++++--------- src/backend/storage/lmgr/lwlock.c | 5 +- src/include/storage/buf_internals.h | 1 + src/include/storage/ipc.h | 2 +- src/include/storage/pg_sema.h | 2 +- src/include/storage/pg_shmem.h | 18 ++++ src/include/storage/shmem.h | 10 ++ 13 files changed, 258 insertions(+), 124 deletions(-) diff --git a/src/backend/port/posix_sema.c b/src/backend/port/posix_sema.c index 64186ec0a7..b97723d2ed 100644 --- a/src/backend/port/posix_sema.c +++ b/src/backend/port/posix_sema.c @@ -193,7 +193,7 @@ PGSemaphoreShmemSize(int maxSemas) * we don't have to expose the counters to other processes.) */ void -PGReserveSemaphores(int maxSemas) +PGReserveSemaphores(int maxSemas, int shmem_slot) { struct stat statbuf; @@ -220,7 +220,7 @@ PGReserveSemaphores(int maxSemas) * ShmemAlloc() won't be ready yet. 
*/ sharedSemas = (PGSemaphore) - ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas)); + ShmemAllocUnlockedInSlot(PGSemaphoreShmemSize(maxSemas), shmem_slot); #endif numSems = 0; diff --git a/src/backend/port/sysv_sema.c b/src/backend/port/sysv_sema.c index 5b88a92bc9..8ef95b12c9 100644 --- a/src/backend/port/sysv_sema.c +++ b/src/backend/port/sysv_sema.c @@ -307,7 +307,7 @@ PGSemaphoreShmemSize(int maxSemas) * have clobbered.) */ void -PGReserveSemaphores(int maxSemas) +PGReserveSemaphores(int maxSemas, int shmem_slot) { struct stat statbuf; @@ -328,7 +328,7 @@ PGReserveSemaphores(int maxSemas) * ShmemAlloc() won't be ready yet. */ sharedSemas = (PGSemaphore) - ShmemAllocUnlocked(PGSemaphoreShmemSize(maxSemas)); + ShmemAllocUnlockedInSlot(PGSemaphoreShmemSize(maxSemas), shmem_slot); numSharedSemas = 0; maxSharedSemas = maxSemas; diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c index 362a37d3b3..065a5b63ac 100644 --- a/src/backend/port/sysv_shmem.c +++ b/src/backend/port/sysv_shmem.c @@ -94,8 +94,19 @@ typedef enum unsigned long UsedShmemSegID = 0; void *UsedShmemSegAddr = NULL; -static Size AnonymousShmemSize; -static void *AnonymousShmem = NULL; +typedef struct AnonymousMapping +{ + int shmem_slot; + Size shmem_size; /* Size of the mapping */ + void *shmem; /* Pointer to the start of the mapped memory */ + void *seg_addr; /* SysV shared memory for the header */ + unsigned long seg_id; /* IPC key */ +} AnonymousMapping; + +static AnonymousMapping Mappings[ANON_MAPPINGS]; + +/* Keeps track of used mapping slots */ +static int next_free_slot = 0; static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size); static void IpcMemoryDetach(int status, Datum shmaddr); @@ -104,6 +115,28 @@ static IpcMemoryState PGSharedMemoryAttach(IpcMemoryId shmId, void *attachAt, PGShmemHeader **addr); +static const char* +MappingName(int shmem_slot) +{ + switch (shmem_slot) + { + case MAIN_SHMEM_SLOT: + return "main"; + default: + return "unknown"; + } 
+} + +static void +DebugMappings() +{ + for(int i = 0; i < next_free_slot; i++) + { + AnonymousMapping m = Mappings[i]; + elog(DEBUG1, "Mapping[%s]: addr %p, size %zu", + MappingName(i), m.shmem, m.shmem_size); + } +} /* * InternalIpcMemoryCreate(memKey, size) @@ -591,14 +624,13 @@ check_huge_page_size(int *newval, void **extra, GucSource source) /* * Creates an anonymous mmap()ed shared memory segment. * - * Pass the requested size in *size. This function will modify *size to the - * actual size of the allocation, if it ends up allocating a segment that is - * larger than requested. + * This function will modify mapping size to the actual size of the allocation, + * if it ends up allocating a segment that is larger than requested. */ -static void * -CreateAnonymousSegment(Size *size) +static void +CreateAnonymousSegment(AnonymousMapping *mapping) { - Size allocsize = *size; + Size allocsize = mapping->shmem_size; void *ptr = MAP_FAILED; int mmap_errno = 0; @@ -623,8 +655,11 @@ CreateAnonymousSegment(Size *size) PG_MMAP_FLAGS | mmap_flags, -1, 0); mmap_errno = errno; if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED) - elog(DEBUG1, "mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", - allocsize); + { + DebugMappings(); + elog(DEBUG1, "slot[%s]: mmap(%zu) with MAP_HUGETLB failed, huge pages disabled: %m", + MappingName(mapping->shmem_slot), allocsize); + } } #endif @@ -642,7 +677,7 @@ CreateAnonymousSegment(Size *size) * Use the original size, not the rounded-up value, when falling back * to non-huge pages. 
*/ - allocsize = *size; + allocsize = mapping->shmem_size; ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS, -1, 0); mmap_errno = errno; @@ -651,8 +686,10 @@ CreateAnonymousSegment(Size *size) if (ptr == MAP_FAILED) { errno = mmap_errno; + DebugMappings(); ereport(FATAL, - (errmsg("could not map anonymous shared memory: %m"), + (errmsg("slot[%s]: could not map anonymous shared memory: %m", + MappingName(mapping->shmem_slot)), (mmap_errno == ENOMEM) ? errhint("This error usually means that PostgreSQL's request " "for a shared memory segment exceeded available memory, " @@ -663,8 +700,8 @@ CreateAnonymousSegment(Size *size) allocsize) : 0)); } - *size = allocsize; - return ptr; + mapping->shmem = ptr; + mapping->shmem_size = allocsize; } /* @@ -674,13 +711,18 @@ CreateAnonymousSegment(Size *size) static void AnonymousShmemDetach(int status, Datum arg) { - /* Release anonymous shared memory block, if any. */ - if (AnonymousShmem != NULL) + for(int i = 0; i < next_free_slot; i++) { - if (munmap(AnonymousShmem, AnonymousShmemSize) < 0) - elog(LOG, "munmap(%p, %zu) failed: %m", - AnonymousShmem, AnonymousShmemSize); - AnonymousShmem = NULL; + AnonymousMapping m = Mappings[i]; + + /* Release anonymous shared memory block, if any. */ + if (m.shmem != NULL) + { + if (munmap(m.shmem, m.shmem_size) < 0) + elog(LOG, "munmap(%p, %zu) failed: %m", + m.shmem, m.shmem_size); + m.shmem = NULL; + } } } @@ -705,6 +747,7 @@ PGSharedMemoryCreate(Size size, PGShmemHeader *hdr; struct stat statbuf; Size sysvsize; + AnonymousMapping *mapping = &Mappings[next_free_slot]; /* * We use the data directory's ID info (inode and device numbers) to @@ -733,11 +776,15 @@ PGSharedMemoryCreate(Size size, /* Room for a header? 
*/ Assert(size > MAXALIGN(sizeof(PGShmemHeader))); + mapping->shmem_size = size; + mapping->shmem_slot = next_free_slot; if (shared_memory_type == SHMEM_TYPE_MMAP) { - AnonymousShmem = CreateAnonymousSegment(&size); - AnonymousShmemSize = size; + /* On success, mapping data will be modified. */ + CreateAnonymousSegment(mapping); + + next_free_slot++; /* Register on-exit routine to unmap the anonymous segment */ on_shmem_exit(AnonymousShmemDetach, (Datum) 0); @@ -760,7 +807,7 @@ PGSharedMemoryCreate(Size size, * loop simultaneously. (CreateDataDirLockFile() does not entirely ensure * that, but prefer fixing it over coping here.) */ - NextShmemSegID = statbuf.st_ino; + NextShmemSegID = statbuf.st_ino + next_free_slot; for (;;) { @@ -852,13 +899,13 @@ PGSharedMemoryCreate(Size size, /* * Initialize space allocation status for segment. */ - hdr->totalsize = size; + hdr->totalsize = mapping->shmem_size; hdr->freeoffset = MAXALIGN(sizeof(PGShmemHeader)); *shim = hdr; /* Save info for possible future use */ - UsedShmemSegAddr = memAddress; - UsedShmemSegID = (unsigned long) NextShmemSegID; + mapping->seg_addr = memAddress; + mapping->seg_id = (unsigned long) NextShmemSegID; /* * If AnonymousShmem is NULL here, then we're not using anonymous shared @@ -866,10 +913,10 @@ PGSharedMemoryCreate(Size size, * block. Otherwise, the System V shared memory block is only a shim, and * we must return a pointer to the real block. 
*/ - if (AnonymousShmem == NULL) + if (mapping->shmem == NULL) return hdr; - memcpy(AnonymousShmem, hdr, sizeof(PGShmemHeader)); - return (PGShmemHeader *) AnonymousShmem; + memcpy(mapping->shmem, hdr, sizeof(PGShmemHeader)); + return (PGShmemHeader *) mapping->shmem; } #ifdef EXEC_BACKEND @@ -969,23 +1016,28 @@ PGSharedMemoryNoReAttach(void) void PGSharedMemoryDetach(void) { - if (UsedShmemSegAddr != NULL) + for(int i = 0; i < next_free_slot; i++) { - if ((shmdt(UsedShmemSegAddr) < 0) + AnonymousMapping m = Mappings[i]; + + if (m.seg_addr != NULL) + { + if ((shmdt(m.seg_addr) < 0) #if defined(EXEC_BACKEND) && defined(__CYGWIN__) - /* Work-around for cygipc exec bug */ - && shmdt(NULL) < 0 + /* Work-around for cygipc exec bug */ + && shmdt(NULL) < 0 #endif - ) - elog(LOG, "shmdt(%p) failed: %m", UsedShmemSegAddr); - UsedShmemSegAddr = NULL; - } + ) + elog(LOG, "shmdt(%p) failed: %m", m.seg_addr); + m.seg_addr = NULL; + } - if (AnonymousShmem != NULL) - { - if (munmap(AnonymousShmem, AnonymousShmemSize) < 0) - elog(LOG, "munmap(%p, %zu) failed: %m", - AnonymousShmem, AnonymousShmemSize); - AnonymousShmem = NULL; + if (m.shmem != NULL) + { + if (munmap(m.shmem, m.shmem_size) < 0) + elog(LOG, "munmap(%p, %zu) failed: %m", + m.shmem, m.shmem_size); + m.shmem = NULL; + } } } diff --git a/src/backend/port/win32_sema.c b/src/backend/port/win32_sema.c index f2b54bdfda..d62084cc0d 100644 --- a/src/backend/port/win32_sema.c +++ b/src/backend/port/win32_sema.c @@ -44,7 +44,7 @@ PGSemaphoreShmemSize(int maxSemas) * process exits. 
*/ void -PGReserveSemaphores(int maxSemas) +PGReserveSemaphores(int maxSemas, int shmem_slot) { mySemSet = (HANDLE *) malloc(maxSemas * sizeof(HANDLE)); if (mySemSet == NULL) diff --git a/src/backend/storage/ipc/ipc.c b/src/backend/storage/ipc/ipc.c index b06e4b8452..2aabd4a77f 100644 --- a/src/backend/storage/ipc/ipc.c +++ b/src/backend/storage/ipc/ipc.c @@ -68,7 +68,7 @@ static void proc_exit_prepare(int code); * ---------------------------------------------------------------- */ -#define MAX_ON_EXITS 20 +#define MAX_ON_EXITS 40 struct ONEXIT { diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 35fa2e1dda..8224015b53 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -88,7 +88,7 @@ RequestAddinShmemSpace(Size size) * required. */ Size -CalculateShmemSize(int *num_semaphores) +CalculateShmemSize(int *num_semaphores, int shmem_slot) { Size size; int numSemas; @@ -202,33 +202,36 @@ CreateSharedMemoryAndSemaphores(void) Assert(!IsUnderPostmaster); - /* Compute the size of the shared-memory block */ - size = CalculateShmemSize(&numSemas); - elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size); - - /* - * Create the shmem segment - */ - seghdr = PGSharedMemoryCreate(size, &shim); - - /* - * Make sure that huge pages are never reported as "unknown" while the - * server is running. 
- */ - Assert(strcmp("unknown", - GetConfigOption("huge_pages_status", false, false)) != 0); - - InitShmemAccess(seghdr); - - /* - * Create semaphores - */ - PGReserveSemaphores(numSemas); - - /* - * Set up shared memory allocation mechanism - */ - InitShmemAllocation(); + for(int slot = 0; slot < ANON_MAPPINGS; slot++) + { + /* Compute the size of the shared-memory block */ + size = CalculateShmemSize(&numSemas, slot); + elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size); + + /* + * Create the shmem segment + */ + seghdr = PGSharedMemoryCreate(size, &shim); + + /* + * Make sure that huge pages are never reported as "unknown" while the + * server is running. + */ + Assert(strcmp("unknown", + GetConfigOption("huge_pages_status", false, false)) != 0); + + InitShmemAccessInSlot(seghdr, slot); + + /* + * Create semaphores + */ + PGReserveSemaphores(numSemas, slot); + + /* + * Set up shared memory allocation mechanism + */ + InitShmemAllocationInSlot(slot); + } /* Initialize subsystems */ CreateOrAttachShmemStructs(); @@ -359,7 +362,7 @@ InitializeShmemGUCs(void) /* * Calculate the shared memory size and round up to the nearest megabyte. 
*/ - size_b = CalculateShmemSize(&num_semas); + size_b = CalculateShmemSize(&num_semas, MAIN_SHMEM_SLOT); size_mb = add_size(size_b, (1024 * 1024) - 1) / (1024 * 1024); sprintf(buf, "%zu", size_mb); SetConfigOption("shared_memory_size", buf, diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c index 6d5f083986..c670b9cf43 100644 --- a/src/backend/storage/ipc/shmem.c +++ b/src/backend/storage/ipc/shmem.c @@ -75,17 +75,12 @@ #include "utils/builtins.h" static void *ShmemAllocRaw(Size size, Size *allocated_size); +static void *ShmemAllocRawInSlot(Size size, Size *allocated_size, + int shmem_slot); /* shared memory global variables */ -static PGShmemHeader *ShmemSegHdr; /* shared mem segment header */ - -static void *ShmemBase; /* start address of shared memory */ - -static void *ShmemEnd; /* end+1 address of shared memory */ - -slock_t *ShmemLock; /* spinlock for shared memory and LWLock - * allocation */ +ShmemSegment Segments[ANON_MAPPINGS]; static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */ @@ -99,11 +94,17 @@ static HTAB *ShmemIndex = NULL; /* primary index hashtable for shmem */ void InitShmemAccess(void *seghdr) { - PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr; + InitShmemAccessInSlot(seghdr, MAIN_SHMEM_SLOT); +} - ShmemSegHdr = shmhdr; - ShmemBase = (void *) shmhdr; - ShmemEnd = (char *) ShmemBase + shmhdr->totalsize; +void +InitShmemAccessInSlot(void *seghdr, int shmem_slot) +{ + PGShmemHeader *shmhdr = (PGShmemHeader *) seghdr; + ShmemSegment *seg = &Segments[shmem_slot]; + seg->ShmemSegHdr = shmhdr; + seg->ShmemBase = (void *) shmhdr; + seg->ShmemEnd = (char *) seg->ShmemBase + shmhdr->totalsize; } /* @@ -114,7 +115,13 @@ InitShmemAccess(void *seghdr) void InitShmemAllocation(void) { - PGShmemHeader *shmhdr = ShmemSegHdr; + InitShmemAllocationInSlot(MAIN_SHMEM_SLOT); +} + +void +InitShmemAllocationInSlot(int shmem_slot) +{ + PGShmemHeader *shmhdr = Segments[shmem_slot].ShmemSegHdr; char *aligned; 
Assert(shmhdr != NULL); @@ -123,9 +130,9 @@ InitShmemAllocation(void) * Initialize the spinlock used by ShmemAlloc. We must use * ShmemAllocUnlocked, since obviously ShmemAlloc can't be called yet. */ - ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t)); + Segments[shmem_slot].ShmemLock = (slock_t *) ShmemAllocUnlockedInSlot(sizeof(slock_t), shmem_slot); - SpinLockInit(ShmemLock); + SpinLockInit(Segments[shmem_slot].ShmemLock); /* * Allocations after this point should go through ShmemAlloc, which @@ -150,11 +157,17 @@ InitShmemAllocation(void) */ void * ShmemAlloc(Size size) +{ + return ShmemAllocInSlot(size, MAIN_SHMEM_SLOT); +} + +void * +ShmemAllocInSlot(Size size, int shmem_slot) { void *newSpace; Size allocated_size; - newSpace = ShmemAllocRaw(size, &allocated_size); + newSpace = ShmemAllocRawInSlot(size, &allocated_size, shmem_slot); if (!newSpace) ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), @@ -184,6 +197,12 @@ ShmemAllocNoError(Size size) */ static void * ShmemAllocRaw(Size size, Size *allocated_size) +{ + return ShmemAllocRawInSlot(size, allocated_size, MAIN_SHMEM_SLOT); +} + +static void * +ShmemAllocRawInSlot(Size size, Size *allocated_size, int shmem_slot) { Size newStart; Size newFree; @@ -203,22 +222,22 @@ ShmemAllocRaw(Size size, Size *allocated_size) size = CACHELINEALIGN(size); *allocated_size = size; - Assert(ShmemSegHdr != NULL); + Assert(Segments[shmem_slot].ShmemSegHdr != NULL); - SpinLockAcquire(ShmemLock); + SpinLockAcquire(Segments[shmem_slot].ShmemLock); - newStart = ShmemSegHdr->freeoffset; + newStart = Segments[shmem_slot].ShmemSegHdr->freeoffset; newFree = newStart + size; - if (newFree <= ShmemSegHdr->totalsize) + if (newFree <= Segments[shmem_slot].ShmemSegHdr->totalsize) { - newSpace = (void *) ((char *) ShmemBase + newStart); - ShmemSegHdr->freeoffset = newFree; + newSpace = (void *) ((char *) Segments[shmem_slot].ShmemBase + newStart); + Segments[shmem_slot].ShmemSegHdr->freeoffset = newFree; } else newSpace = NULL; - 
SpinLockRelease(ShmemLock); + SpinLockRelease(Segments[shmem_slot].ShmemLock); /* note this assert is okay with newSpace == NULL */ Assert(newSpace == (void *) CACHELINEALIGN(newSpace)); @@ -236,6 +255,12 @@ ShmemAllocRaw(Size size, Size *allocated_size) */ void * ShmemAllocUnlocked(Size size) +{ + return ShmemAllocUnlockedInSlot(size, MAIN_SHMEM_SLOT); +} + +void * +ShmemAllocUnlockedInSlot(Size size, int shmem_slot) { Size newStart; Size newFree; @@ -246,19 +271,19 @@ ShmemAllocUnlocked(Size size) */ size = MAXALIGN(size); - Assert(ShmemSegHdr != NULL); + Assert(Segments[shmem_slot].ShmemSegHdr != NULL); - newStart = ShmemSegHdr->freeoffset; + newStart = Segments[shmem_slot].ShmemSegHdr->freeoffset; newFree = newStart + size; - if (newFree > ShmemSegHdr->totalsize) + if (newFree > Segments[shmem_slot].ShmemSegHdr->totalsize) ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), errmsg("out of shared memory (%zu bytes requested)", size))); - ShmemSegHdr->freeoffset = newFree; + Segments[shmem_slot].ShmemSegHdr->freeoffset = newFree; - newSpace = (void *) ((char *) ShmemBase + newStart); + newSpace = (void *) ((char *) Segments[shmem_slot].ShmemBase + newStart); Assert(newSpace == (void *) MAXALIGN(newSpace)); @@ -273,7 +298,13 @@ ShmemAllocUnlocked(Size size) bool ShmemAddrIsValid(const void *addr) { - return (addr >= ShmemBase) && (addr < ShmemEnd); + return ShmemAddrIsValidInSlot(addr, MAIN_SHMEM_SLOT); +} + +bool +ShmemAddrIsValidInSlot(const void *addr, int shmem_slot) +{ + return (addr >= Segments[shmem_slot].ShmemBase) && (addr < Segments[shmem_slot].ShmemEnd); } /* @@ -334,6 +365,18 @@ ShmemInitHash(const char *name, /* table string name for shmem index */ long max_size, /* max size of the table */ HASHCTL *infoP, /* info about key and bucket size */ int hash_flags) /* info about infoP */ +{ + return ShmemInitHashInSlot(name, init_size, max_size, infoP, hash_flags, + MAIN_SHMEM_SLOT); +} + +HTAB * +ShmemInitHashInSlot(const char *name, /* table string name for 
shmem index */ + long init_size, /* initial table size */ + long max_size, /* max size of the table */ + HASHCTL *infoP, /* info about key and bucket size */ + int hash_flags, /* info about infoP */ + int shmem_slot) /* in which slot to keep the table */ { bool found; void *location; @@ -350,9 +393,9 @@ ShmemInitHash(const char *name, /* table string name for shmem index */ hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE; /* look it up in the shmem index */ - location = ShmemInitStruct(name, + location = ShmemInitStructInSlot(name, hash_get_shared_size(infoP, hash_flags), - &found); + &found, shmem_slot); /* * if it already exists, attach to it rather than allocate and initialize @@ -385,6 +428,13 @@ ShmemInitHash(const char *name, /* table string name for shmem index */ */ void * ShmemInitStruct(const char *name, Size size, bool *foundPtr) +{ + return ShmemInitStructInSlot(name, size, foundPtr, MAIN_SHMEM_SLOT); +} + +void * +ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr, + int shmem_slot) { ShmemIndexEnt *result; void *structPtr; @@ -393,7 +443,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr) if (!ShmemIndex) { - PGShmemHeader *shmemseghdr = ShmemSegHdr; + PGShmemHeader *shmemseghdr = Segments[shmem_slot].ShmemSegHdr; /* Must be trying to create/attach to ShmemIndex itself */ Assert(strcmp(name, "ShmemIndex") == 0); @@ -416,7 +466,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr) * process can be accessing shared memory yet. 
*/ Assert(shmemseghdr->index == NULL); - structPtr = ShmemAlloc(size); + structPtr = ShmemAllocInSlot(size, shmem_slot); shmemseghdr->index = structPtr; *foundPtr = false; } @@ -433,8 +483,8 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr) LWLockRelease(ShmemIndexLock); ereport(ERROR, (errcode(ERRCODE_OUT_OF_MEMORY), - errmsg("could not create ShmemIndex entry for data structure \"%s\"", - name))); + errmsg("could not create ShmemIndex entry for data structure \"%s\" in slot %d", + name, shmem_slot))); } if (*foundPtr) @@ -459,7 +509,7 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr) Size allocated_size; /* It isn't in the table yet. allocate and initialize it */ - structPtr = ShmemAllocRaw(size, &allocated_size); + structPtr = ShmemAllocRawInSlot(size, &allocated_size, shmem_slot); if (structPtr == NULL) { /* out of memory; remove the failed ShmemIndex entry */ @@ -478,14 +528,13 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr) LWLockRelease(ShmemIndexLock); - Assert(ShmemAddrIsValid(structPtr)); + Assert(ShmemAddrIsValidInSlot(structPtr, shmem_slot)); Assert(structPtr == (void *) CACHELINEALIGN(structPtr)); return structPtr; } - /* * Add two Size values, checking for overflow */ @@ -545,7 +594,7 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS) while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL) { values[0] = CStringGetTextDatum(ent->key); - values[1] = Int64GetDatum((char *) ent->location - (char *) ShmemSegHdr); + values[1] = Int64GetDatum((char *) ent->location - (char *) Segments[MAIN_SHMEM_SLOT].ShmemSegHdr); values[2] = Int64GetDatum(ent->size); values[3] = Int64GetDatum(ent->allocated_size); named_allocated += ent->allocated_size; @@ -557,15 +606,15 @@ pg_get_shmem_allocations(PG_FUNCTION_ARGS) /* output shared memory allocated but not counted via the shmem index */ values[0] = CStringGetTextDatum("<anonymous>"); nulls[1] = true; - values[2] = Int64GetDatum(ShmemSegHdr->freeoffset - named_allocated); + 
values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset - named_allocated); values[3] = values[2]; tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls); /* output as-of-yet unused shared memory */ nulls[0] = true; - values[1] = Int64GetDatum(ShmemSegHdr->freeoffset); + values[1] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset); nulls[1] = false; - values[2] = Int64GetDatum(ShmemSegHdr->totalsize - ShmemSegHdr->freeoffset); + values[2] = Int64GetDatum(Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->totalsize - Segments[MAIN_SHMEM_SLOT].ShmemSegHdr->freeoffset); values[3] = values[2]; tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc, values, nulls); diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c index e765754d80..fb0c33bf17 100644 --- a/src/backend/storage/lmgr/lwlock.c +++ b/src/backend/storage/lmgr/lwlock.c @@ -81,6 +81,7 @@ #include "pgstat.h" #include "port/pg_bitutils.h" #include "postmaster/postmaster.h" +#include "storage/pg_shmem.h" #include "storage/proc.h" #include "storage/proclist.h" #include "storage/spin.h" @@ -607,9 +608,9 @@ LWLockNewTrancheId(void) LWLockCounter = (int *) ((char *) MainLWLockArray - sizeof(int)); /* We use the ShmemLock spinlock to protect LWLockCounter */ - SpinLockAcquire(ShmemLock); + SpinLockAcquire(Segments[MAIN_SHMEM_SLOT].ShmemLock); result = (*LWLockCounter)++; - SpinLockRelease(ShmemLock); + SpinLockRelease(Segments[MAIN_SHMEM_SLOT].ShmemLock); return result; } diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index f190e6e5e4..aef80e049b 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -23,6 +23,7 @@ #include "storage/latch.h" #include "storage/lwlock.h" #include "storage/shmem.h" +#include "storage/pg_shmem.h" #include "storage/smgr.h" #include "storage/spin.h" #include "utils/relcache.h" diff --git a/src/include/storage/ipc.h 
b/src/include/storage/ipc.h index b2d062781e..be4b131288 100644 --- a/src/include/storage/ipc.h +++ b/src/include/storage/ipc.h @@ -77,7 +77,7 @@ extern void check_on_shmem_exit_lists_are_empty(void); /* ipci.c */ extern PGDLLIMPORT shmem_startup_hook_type shmem_startup_hook; -extern Size CalculateShmemSize(int *num_semaphores); +extern Size CalculateShmemSize(int *num_semaphores, int shmem_slot); extern void CreateSharedMemoryAndSemaphores(void); #ifdef EXEC_BACKEND extern void AttachSharedMemoryStructs(void); diff --git a/src/include/storage/pg_sema.h b/src/include/storage/pg_sema.h index dfef79ac96..081fffaf16 100644 --- a/src/include/storage/pg_sema.h +++ b/src/include/storage/pg_sema.h @@ -41,7 +41,7 @@ typedef HANDLE PGSemaphore; extern Size PGSemaphoreShmemSize(int maxSemas); /* Module initialization (called during postmaster start or shmem reinit) */ -extern void PGReserveSemaphores(int maxSemas); +extern void PGReserveSemaphores(int maxSemas, int shmem_slot); /* Allocate a PGSemaphore structure with initial count 1 */ extern PGSemaphore PGSemaphoreCreate(void); diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h index 3065ff5be7..e968deeef7 100644 --- a/src/include/storage/pg_shmem.h +++ b/src/include/storage/pg_shmem.h @@ -25,6 +25,7 @@ #define PG_SHMEM_H #include "storage/dsm_impl.h" +#include "storage/spin.h" typedef struct PGShmemHeader /* standard header for all Postgres shmem */ { @@ -41,6 +42,20 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */ #endif } PGShmemHeader; +typedef struct ShmemSegment +{ + PGShmemHeader *ShmemSegHdr; /* shared mem segment header */ + void *ShmemBase; /* start address of shared memory */ + void *ShmemEnd; /* end+1 address of shared memory */ + slock_t *ShmemLock; /* spinlock for shared memory and LWLock + * allocation */ +} ShmemSegment; + +// Number of available slots for anonymous memory mappings +#define ANON_MAPPINGS 1 + +extern PGDLLIMPORT ShmemSegment 
Segments[ANON_MAPPINGS]; + /* GUC variables */ extern PGDLLIMPORT int shared_memory_type; extern PGDLLIMPORT int huge_pages; @@ -90,4 +105,7 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2); extern void PGSharedMemoryDetach(void); extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags); +/* The main slot, contains everything except buffer blocks and related data. */ +#define MAIN_SHMEM_SLOT 0 + #endif /* PG_SHMEM_H */ diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h index 842989111c..d3e9cc721d 100644 --- a/src/include/storage/shmem.h +++ b/src/include/storage/shmem.h @@ -28,15 +28,25 @@ /* shmem.c */ extern PGDLLIMPORT slock_t *ShmemLock; extern void InitShmemAccess(void *seghdr); +extern void InitShmemAccessInSlot(void *seghdr, int shmem_slot); extern void InitShmemAllocation(void); +extern void InitShmemAllocationInSlot(int shmem_slot); extern void *ShmemAlloc(Size size); +extern void *ShmemAllocInSlot(Size size, int shmem_slot); extern void *ShmemAllocNoError(Size size); extern void *ShmemAllocUnlocked(Size size); +extern void *ShmemAllocUnlockedInSlot(Size size, int shmem_slot); extern bool ShmemAddrIsValid(const void *addr); +extern bool ShmemAddrIsValidInSlot(const void *addr, int shmem_slot); extern void InitShmemIndex(void); +extern void InitVariableShmemIndex(void); extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size, HASHCTL *infoP, int hash_flags); +extern HTAB *ShmemInitHashInSlot(const char *name, long init_size, long max_size, + HASHCTL *infoP, int hash_flags, int shmem_slot); extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr); +extern void *ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr, + int shmem_slot); extern Size add_size(Size s1, Size s2); extern Size mul_size(Size s1, Size s2); base-commit: 2488058dc356a43455b21a099ea879fff9266634 -- 2.45.1
>From e9980f76cbd1ea6f6d732e2a27dd1342258d26e5 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Wed, 16 Oct 2024 20:21:33 +0200 Subject: [PATCH v1 2/5] Allow placing shared memory mapping with an offset Currently the kernel is responsible for choosing an address at which to place each shared memory mapping, which is the lowest possible address that does not clash with any other mappings. This is considered to be the most portable approach, but one of the downsides is that there is no room to resize allocated mappings anymore. Here is how it looks for one mapping in /proc/$PID/maps, where /dev/zero represents the anonymous shared memory in question: 00400000-00490000 /path/bin/postgres ... 012d9000-0133e000 [heap] 7f443a800000-7f470a800000 /dev/zero (deleted) 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2 ... 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted) By specifying the mapping address directly it's possible to place the mapping in a way that leaves room for resizing. The idea is first to get the address chosen by the kernel, then apply some offset derived from the expected upper limit. Because we base the layout on the address chosen by the kernel, things like address space randomization should not be a problem, since the randomization is applied to the mmap base, which is one per process. The result looks like this: 012d9000-0133e000 [heap] 7f443a800000-7f444196c000 /dev/zero (deleted) [...free space...] 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2 This approach does not impact the actual memory usage as reported by the kernel. Here is the output of /proc/$PID/status for the master version with shared_buffers = 128 MB: // Peak virtual memory size, which is described as total pages mapped in mm_struct VmPeak: 422780 kB // Size of memory portions.
It contains RssAnon + RssFile + RssShmem VmRSS: 21248 kB // Size of resident anonymous memory RssAnon: 640 kB // Size of resident file mappings RssFile: 9728 kB // Size of resident shmem memory (includes SysV shm, mapping of tmpfs and // shared anonymous mappings) RssShmem: 10880 kB Here is the same for the patch with the shared mapping placed at an offset of 10 GB: VmPeak: 1102844 kB VmRSS: 21376 kB RssAnon: 640 kB RssFile: 9856 kB RssShmem: 10880 kB Cgroup v2 doesn't have any problems with this either. To verify, a new cgroup was created with a memory limit of 256 MB, then PostgreSQL was launched within this cgroup with shared_buffers = 128 MB: $ cd /sys/fs/cgroup $ mkdir postgres $ cd postgres $ echo 268435456 > memory.max $ echo $MASTER_PID_SHELL > cgroup.procs # postgres from the master branch has been successfully launched # from that shell $ cat memory.current 17465344 (~16 MB) # stop postgres $ echo $PATCH_PID_SHELL > cgroup.procs # postgres from the patch has been successfully launched from that shell $ cat memory.current 18219008 (~17 MB) Note that currently the implementation makes assumptions about the upper limit. Ideally it should be based on the maximum available memory. --- src/backend/port/sysv_shmem.c | 120 +++++++++++++++++++++++++++++++++- 1 file changed, 119 insertions(+), 1 deletion(-) diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c index 065a5b63ac..7e6c8bb78d 100644 --- a/src/backend/port/sysv_shmem.c +++ b/src/backend/port/sysv_shmem.c @@ -108,6 +108,63 @@ static AnonymousMapping Mappings[ANON_MAPPINGS]; /* Keeps track of used mapping slots */ static int next_free_slot = 0; +/* + * Anonymous mapping placement (/dev/zero (deleted) below) looks like this: + * + * 00400000-00490000 /path/bin/postgres + * ... + * 012d9000-0133e000 [heap] + * 7f443a800000-7f470a800000 /dev/zero (deleted) + * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive + * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2 + * ...
+ * 7f471aef2000-7f471aef9000 /dev/shm/PostgreSQL.3859891842 + * 7f471aef9000-7f471aefa000 /SYSV007dbf7d (deleted) + * ... + * + * We would like to place multiple mappings in such a way that there will be + * enough space between them in the address space to be able to resize up to a + * certain size, but without counting towards the total memory consumption. + * + * By letting Linux choose a mapping address, it will pick the lowest + * possible address that does not clash with any other mapping, which will be + * right before the locales in the example above. This information (maximum allowed + * size of mappings and the lowest mapping address) is enough to place every + * mapping as follows: + * + * - Take the lowest mapping address, which we call later the probe address. + * - Subtract the offset of the previous mapping. + * - Subtract the maximum allowed size for the current mapping from the + * address. + * - Place the mapping at the resulting address. + * + * The result would look like this: + * + * 012d9000-0133e000 [heap] + * 7f4426f54000-7f442e010000 /dev/zero (deleted) + * [...free space...] + * 7f443a800000-7f444196c000 /dev/zero (deleted) + * [...free space...] + * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive + * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2 + * ... + */ +Size SHMEM_EXTRA_SIZE_LIMIT[1] = { + 0, /* MAIN_SHMEM_SLOT */ +}; + +/* Remembers offset of the last mapping from the probe address */ +static Size last_offset = 0; + +/* + * Size of the probe mapping, which will be used to calculate the anonymous + * mapping address. It should not be too small, otherwise there is a chance the probe + * mapping will be created between other mappings, leaving no room for extending + * it. But it should not be too large either, in case there are limitations + * on the mapping size. Current value is the default shared_buffers.
+ */ +#define PROBE_MAPPING_SIZE (Size) 128 * 1024 * 1024 + static void *InternalIpcMemoryCreate(IpcMemoryKey memKey, Size size); static void IpcMemoryDetach(int status, Datum shmaddr); static void IpcMemoryDelete(int status, Datum shmId); @@ -673,13 +730,74 @@ CreateAnonymousSegment(AnonymousMapping *mapping) if (ptr == MAP_FAILED && huge_pages != HUGE_PAGES_ON) { + void *probe = NULL; + /* * Use the original size, not the rounded-up value, when falling back * to non-huge pages. */ allocsize = mapping->shmem_size; - ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE, + + /* + * Try to create the mapping at an address that will allow extending it + * later: + * + * - First create the temporary probe mapping of a fixed size and let + * the kernel place it at an address of its choice. By virtue of the + * probe mapping size we expect it to be located at the lowest + * possible address, expecting some unmapped space above. + * + * - Unmap the probe mapping, remember the address. + * + * - Create the actual anonymous mapping at that address with the + * offset. The offset is calculated in such a way as to allow growing + * the mapping within certain boundaries. For this mapping we use + * MAP_FIXED_NOREPLACE, which will error out with EEXIST if there is + * any mapping clash. + * + * - If the last step fails, fall back to the regular mapping + * creation and signal that shared buffers could not be resized + * without a restart.
+ */ + probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS, -1, 0); + + if (probe == MAP_FAILED) + { + mmap_errno = errno; + DebugMappings(); + elog(DEBUG1, "slot[%s]: probe mmap(%zu) failed: %m", + MappingName(mapping->shmem_slot), PROBE_MAPPING_SIZE); + } + else + { + Size offset = last_offset + SHMEM_EXTRA_SIZE_LIMIT[next_free_slot] + allocsize; + last_offset = offset; + + munmap(probe, PROBE_MAPPING_SIZE); + + ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE, + PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0); + mmap_errno = errno; + if (ptr == MAP_FAILED) + { + DebugMappings(); + elog(DEBUG1, "slot[%s]: mmap(%zu) at address %p failed: %m", + MappingName(mapping->shmem_slot), allocsize, probe - offset); + } + + } + } + + if (ptr == MAP_FAILED) + { + /* + * Fall back to the portable way of creating a mapping. + */ + allocsize = mapping->shmem_size; + + ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE, + PG_MMAP_FLAGS, -1, 0); mmap_errno = errno; } -- 2.45.1
>From 62ae567f1c7a56c32722508b60251f9cec245ea3 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Wed, 16 Oct 2024 20:24:04 +0200 Subject: [PATCH v1 3/5] Introduce multiple shmem slots for shared buffers Add more shmem slots to split shared buffers into the following chunks: * BUFFERS_SHMEM_SLOT: contains buffer blocks * BUFFER_DESCRIPTORS_SHMEM_SLOT: contains buffer descriptors * BUFFER_IOCV_SHMEM_SLOT: contains condition variables for buffers * CHECKPOINT_BUFFERS_SHMEM_SLOT: contains checkpoint buffer ids * STRATEGY_SHMEM_SLOT: contains buffer strategy status The size of the corresponding shared data directly depends on NBuffers, meaning that to change NBuffers these structures have to be resized correspondingly. Placing each of them in a separate shmem slot makes that possible. There are some assumptions made about each shmem slot's upper size limit. The buffer blocks slot has the largest, while the rest claim less extra room for resizing. Ideally those limits should be derived from the maximum allowed shared memory. --- src/backend/port/sysv_shmem.c | 17 +++++- src/backend/storage/buffer/buf_init.c | 79 +++++++++++++++++--------- src/backend/storage/buffer/buf_table.c | 5 +- src/backend/storage/buffer/freelist.c | 4 +- src/backend/storage/ipc/ipci.c | 2 +- src/include/storage/bufmgr.h | 2 +- src/include/storage/pg_shmem.h | 23 +++++++- 7 files changed, 97 insertions(+), 35 deletions(-) diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c index 7e6c8bb78d..beebd4d85e 100644 --- a/src/backend/port/sysv_shmem.c +++ b/src/backend/port/sysv_shmem.c @@ -149,8 +149,13 @@ static int next_free_slot = 0; * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2 * ...
*/ -Size SHMEM_EXTRA_SIZE_LIMIT[1] = { +Size SHMEM_EXTRA_SIZE_LIMIT[6] = { 0, /* MAIN_SHMEM_SLOT */ + (Size) 1024 * 1024 * 1024 * 10, /* BUFFERS_SHMEM_SLOT */ + (Size) 1024 * 1024 * 1024 * 1, /* BUFFER_DESCRIPTORS_SHMEM_SLOT */ + (Size) 1024 * 1024 * 100, /* BUFFER_IOCV_SHMEM_SLOT */ + (Size) 1024 * 1024 * 100, /* CHECKPOINT_BUFFERS_SHMEM_SLOT */ + (Size) 1024 * 1024 * 100, /* STRATEGY_SHMEM_SLOT */ }; /* Remembers offset of the last mapping from the probe address */ @@ -179,6 +184,16 @@ MappingName(int shmem_slot) { case MAIN_SHMEM_SLOT: return "main"; + case BUFFERS_SHMEM_SLOT: + return "buffers"; + case BUFFER_DESCRIPTORS_SHMEM_SLOT: + return "descriptors"; + case BUFFER_IOCV_SHMEM_SLOT: + return "iocv"; + case CHECKPOINT_BUFFERS_SHMEM_SLOT: + return "checkpoint"; + case STRATEGY_SHMEM_SLOT: + return "strategy"; default: return "unknown"; } diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index 46116a1f64..6bca286bef 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -62,7 +62,10 @@ CkptSortItem *CkptBufferIds; * Initialize shared buffer pool * * This is called once during shared-memory initialization (either in the - * postmaster, or in a standalone backend). + * postmaster, or in a standalone backend). Size of data structures initialized + * here depends on NBuffers, and to be able to change NBuffers without a + * restart we store each structure into a separate shared memory slot, which + * could be resized on demand. */ void InitBufferPool(void) @@ -74,22 +77,22 @@ InitBufferPool(void) /* Align descriptors to a cacheline boundary. */ BufferDescriptors = (BufferDescPadded *) - ShmemInitStruct("Buffer Descriptors", + ShmemInitStructInSlot("Buffer Descriptors", NBuffers * sizeof(BufferDescPadded), - &foundDescs); + &foundDescs, BUFFER_DESCRIPTORS_SHMEM_SLOT); /* Align buffer pool on IO page size boundary. 
*/ BufferBlocks = (char *) TYPEALIGN(PG_IO_ALIGN_SIZE, - ShmemInitStruct("Buffer Blocks", + ShmemInitStructInSlot("Buffer Blocks", NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE, - &foundBufs)); + &foundBufs, BUFFERS_SHMEM_SLOT)); /* Align condition variables to cacheline boundary. */ BufferIOCVArray = (ConditionVariableMinimallyPadded *) - ShmemInitStruct("Buffer IO Condition Variables", + ShmemInitStructInSlot("Buffer IO Condition Variables", NBuffers * sizeof(ConditionVariableMinimallyPadded), - &foundIOCV); + &foundIOCV, BUFFER_IOCV_SHMEM_SLOT); /* * The array used to sort to-be-checkpointed buffer ids is located in @@ -99,8 +102,9 @@ InitBufferPool(void) * painful. */ CkptBufferIds = (CkptSortItem *) - ShmemInitStruct("Checkpoint BufferIds", - NBuffers * sizeof(CkptSortItem), &foundBufCkpt); + ShmemInitStructInSlot("Checkpoint BufferIds", + NBuffers * sizeof(CkptSortItem), &foundBufCkpt, + CHECKPOINT_BUFFERS_SHMEM_SLOT); if (foundDescs || foundBufs || foundIOCV || foundBufCkpt) { @@ -154,33 +158,54 @@ InitBufferPool(void) * BufferShmemSize * * compute the size of shared memory for the buffer pool including - * data pages, buffer descriptors, hash tables, etc. + * data pages, buffer descriptors, hash tables, etc. based on the + * shared memory slot. The main slot must not allocate anything + * related to buffers, every other slot will receive part of the + * data. 
*/ Size -BufferShmemSize(void) +BufferShmemSize(int shmem_slot) { Size size = 0; - /* size of buffer descriptors */ - size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded))); - /* to allow aligning buffer descriptors */ - size = add_size(size, PG_CACHE_LINE_SIZE); + if (shmem_slot == MAIN_SHMEM_SLOT) + return size; - /* size of data pages, plus alignment padding */ - size = add_size(size, PG_IO_ALIGN_SIZE); - size = add_size(size, mul_size(NBuffers, BLCKSZ)); + if (shmem_slot == BUFFER_DESCRIPTORS_SHMEM_SLOT) + { + /* size of buffer descriptors */ + size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded))); + /* to allow aligning buffer descriptors */ + size = add_size(size, PG_CACHE_LINE_SIZE); + } - /* size of stuff controlled by freelist.c */ - size = add_size(size, StrategyShmemSize()); + if (shmem_slot == BUFFERS_SHMEM_SLOT) + { + /* size of data pages, plus alignment padding */ + size = add_size(size, PG_IO_ALIGN_SIZE); + size = add_size(size, mul_size(NBuffers, BLCKSZ)); + } - /* size of I/O condition variables */ - size = add_size(size, mul_size(NBuffers, - sizeof(ConditionVariableMinimallyPadded))); - /* to allow aligning the above */ - size = add_size(size, PG_CACHE_LINE_SIZE); + if (shmem_slot == STRATEGY_SHMEM_SLOT) + { + /* size of stuff controlled by freelist.c */ + size = add_size(size, StrategyShmemSize()); + } - /* size of checkpoint sort array in bufmgr.c */ - size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem))); + if (shmem_slot == BUFFER_IOCV_SHMEM_SLOT) + { + /* size of I/O condition variables */ + size = add_size(size, mul_size(NBuffers, + sizeof(ConditionVariableMinimallyPadded))); + /* to allow aligning the above */ + size = add_size(size, PG_CACHE_LINE_SIZE); + } + + if (shmem_slot == CHECKPOINT_BUFFERS_SHMEM_SLOT) + { + /* size of checkpoint sort array in bufmgr.c */ + size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem))); + } return size; } diff --git a/src/backend/storage/buffer/buf_table.c 
b/src/backend/storage/buffer/buf_table.c index 0fa5468930..ccbaed8010 100644 --- a/src/backend/storage/buffer/buf_table.c +++ b/src/backend/storage/buffer/buf_table.c @@ -59,10 +59,11 @@ InitBufTable(int size) info.entrysize = sizeof(BufferLookupEnt); info.num_partitions = NUM_BUFFER_PARTITIONS; - SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table", + SharedBufHash = ShmemInitHashInSlot("Shared Buffer Lookup Table", size, size, &info, - HASH_ELEM | HASH_BLOBS | HASH_PARTITION); + HASH_ELEM | HASH_BLOBS | HASH_PARTITION, + STRATEGY_SHMEM_SLOT); } /* diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c index 19797de31a..8ce1611db2 100644 --- a/src/backend/storage/buffer/freelist.c +++ b/src/backend/storage/buffer/freelist.c @@ -491,9 +491,9 @@ StrategyInitialize(bool init) * Get or create the shared strategy control block */ StrategyControl = (BufferStrategyControl *) - ShmemInitStruct("Buffer Strategy Status", + ShmemInitStructInSlot("Buffer Strategy Status", sizeof(BufferStrategyControl), - &found); + &found, STRATEGY_SHMEM_SLOT); if (!found) { diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 8224015b53..fbaddba396 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -115,7 +115,7 @@ CalculateShmemSize(int *num_semaphores, int shmem_slot) sizeof(ShmemIndexEnt))); size = add_size(size, dsm_estimate_size()); size = add_size(size, DSMRegistryShmemSize()); - size = add_size(size, BufferShmemSize()); + size = add_size(size, BufferShmemSize(shmem_slot)); size = add_size(size, LockShmemSize()); size = add_size(size, PredicateLockShmemSize()); size = add_size(size, ProcGlobalShmemSize()); diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index c8422571b7..4c09d270c9 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -301,7 +301,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf); /* in buf_init.c */ extern void 
InitBufferPool(void); -extern Size BufferShmemSize(void); +extern Size BufferShmemSize(int); /* in localbuf.c */ extern void AtProcExit_LocalBuffers(void); diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h index e968deeef7..c0143e3899 100644 --- a/src/include/storage/pg_shmem.h +++ b/src/include/storage/pg_shmem.h @@ -52,7 +52,7 @@ typedef struct ShmemSegment } ShmemSegment; // Number of available slots for anonymous memory mappings -#define ANON_MAPPINGS 1 +#define ANON_MAPPINGS 6 extern PGDLLIMPORT ShmemSegment Segments[ANON_MAPPINGS]; @@ -105,7 +105,28 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2); extern void PGSharedMemoryDetach(void); extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags); +/* + * To be able to dynamically resize the largest parts of the data stored in shared + * memory, we split it into multiple shared memory mapping slots. Each slot + * contains only a certain part of the data, whose size depends on NBuffers. + */ + /* The main slot, contains everything except buffer blocks and related data. */ #define MAIN_SHMEM_SLOT 0 +/* Buffer blocks */ +#define BUFFERS_SHMEM_SLOT 1 + +/* Buffer descriptors */ +#define BUFFER_DESCRIPTORS_SHMEM_SLOT 2 + +/* Condition variables for buffers */ +#define BUFFER_IOCV_SHMEM_SLOT 3 + +/* Checkpoint BufferIds */ +#define CHECKPOINT_BUFFERS_SHMEM_SLOT 4 + +/* Buffer strategy status */ +#define STRATEGY_SHMEM_SLOT 5 + #endif /* PG_SHMEM_H */ -- 2.45.1
>From 7183999bba1cbeebd059d18e5a590cbef7aff2d1 Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Wed, 16 Oct 2024 20:24:58 +0200 Subject: [PATCH v1 4/5] Allow to resize shared memory without restart Add an assign hook for shared_buffers to resize shared memory using the space introduced in the previous commits, without requiring a PostgreSQL restart. The size of every shared memory slot is recalculated based on the new NBuffers, and the mapping is extended using mremap. After allocating new space, new shared structures (buffer blocks, descriptors, etc) are allocated as needed. Here is how it looks after raising shared_buffers from 128 MB to 512 MB and calling pg_reload_conf(): -- 128 MB 7f5a2bd04000-7f5a32e52000 /dev/zero (deleted) 7f5a39252000-7f5a4030e000 /dev/zero (deleted) 7f5a4670e000-7f5a4d7ba000 /dev/zero (deleted) 7f5a53bba000-7f5a5ad26000 /dev/zero (deleted) 7f5a9ad26000-7f5aa9d94000 /dev/zero (deleted) ^ buffers mapping, ~240 MB 7f5d29d94000-7f5d30e00000 /dev/zero (deleted) -- 512 MB 7f5a2bd04000-7f5a33274000 /dev/zero (deleted) 7f5a39252000-7f5a4057e000 /dev/zero (deleted) 7f5a4670e000-7f5a4d9fa000 /dev/zero (deleted) 7f5a53bba000-7f5a5b1a6000 /dev/zero (deleted) 7f5a9ad26000-7f5ac1f14000 /dev/zero (deleted) ^ buffers mapping, ~625 MB 7f5d29d94000-7f5d30f80000 /dev/zero (deleted) The implementation supports only increasing shared_buffers. Decreasing the value needs a similar procedure, but the buffer blocks containing data have to be drained first, so that the actual data set fits into the new smaller space. >From experiment it turns out that shared mappings have to be extended separately for each process that uses them. Another rough edge is that a backend executing pg_reload_conf interactively will not resize mappings immediately; for some reason it requires another command. Note that mremap is Linux-specific, thus the implementation is not very portable.
--- src/backend/port/sysv_shmem.c | 62 +++++++++++++ src/backend/storage/buffer/buf_init.c | 86 +++++++++++++++++++ src/backend/storage/ipc/ipci.c | 11 +++ src/backend/storage/ipc/shmem.c | 14 ++- .../utils/activity/wait_event_names.txt | 1 + src/backend/utils/misc/guc_tables.c | 4 +- src/include/storage/bufmgr.h | 1 + src/include/storage/lwlocklist.h | 1 + src/include/storage/pg_shmem.h | 2 + 9 files changed, 171 insertions(+), 11 deletions(-) diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c index beebd4d85e..4bdadbb0e2 100644 --- a/src/backend/port/sysv_shmem.c +++ b/src/backend/port/sysv_shmem.c @@ -30,9 +30,11 @@ #include "miscadmin.h" #include "port/pg_bitutils.h" #include "portability/mem.h" +#include "storage/bufmgr.h" #include "storage/dsm.h" #include "storage/fd.h" #include "storage/ipc.h" +#include "storage/lwlock.h" #include "storage/pg_shmem.h" #include "utils/guc.h" #include "utils/guc_hooks.h" @@ -859,6 +861,66 @@ AnonymousShmemDetach(int status, Datum arg) } } +/* + * An assign callback for the shared_buffers GUC -- a somewhat clumsy way of + * resizing shared memory without a restart. On NBuffers change, use the new + * value to recalculate the required size for every shmem slot, then based on + * the new and old values initialize the new buffer blocks. + * + * The actual slot resizing is done via mremap, which will fail if there is + * not sufficient space to expand the mapping. + * + * XXX: For some reason in the current implementation the change is applied to + * the backend calling pg_reload_conf only at the backend exit. + */ +void +AnonymousShmemResize(int newval, void *extra) +{ + int numSemas; + bool reinit = false; + int NBuffersOld = NBuffers; + + /* + * XXX: Currently only increasing shared_buffers is supported. For + * decreasing something similar has to be done, but buffer blocks with + * data have to be drained first.
+ */ + if(NBuffers > newval) + return; + + /* XXX: Hack, NBuffers has to be exposed in the interface for + * memory calculation and buffer blocks reinitialization instead. */ + NBuffers = newval; + + for(int i = 0; i < next_free_slot; i++) + { + Size new_size = CalculateShmemSize(&numSemas, i); + AnonymousMapping *m = &Mappings[i]; + + if (m->shmem == NULL) + continue; + + if (m->shmem_size == new_size) + continue; + + if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0) + elog(LOG, "mremap(%p, %zu) failed: %m", + m->shmem, m->shmem_size); + else + { + reinit = true; + m->shmem_size = new_size; + } + } + + if (reinit) + { + LWLockAcquire(ShmemResizeLock, LW_EXCLUSIVE); + ResizeBufferPool(NBuffersOld); + LWLockRelease(ShmemResizeLock); + } +} + /* * PGSharedMemoryCreate * diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index 6bca286bef..4054abf0e8 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -154,6 +154,92 @@ InitBufferPool(void) &backend_flush_after); } +/* + * Reinitialize shared memory structures whose size depends on NBuffers. It's + * similar to InitBufferPool, but applied only to the buffers in the range + * between NBuffersOld and NBuffers. + */ +void +ResizeBufferPool(int NBuffersOld) +{ + bool foundBufs, + foundDescs, + foundIOCV, + foundBufCkpt; + int i; + + /* XXX: Only increasing of shared_buffers is supported in this function */ + if(NBuffersOld > NBuffers) + return; + + /* Align descriptors to a cacheline boundary. */ + BufferDescriptors = (BufferDescPadded *) + ShmemInitStructInSlot("Buffer Descriptors", + NBuffers * sizeof(BufferDescPadded), + &foundDescs, BUFFER_DESCRIPTORS_SHMEM_SLOT); + + /* Align condition variables to cacheline boundary.
*/ + BufferIOCVArray = (ConditionVariableMinimallyPadded *) + ShmemInitStructInSlot("Buffer IO Condition Variables", + NBuffers * sizeof(ConditionVariableMinimallyPadded), + &foundIOCV, BUFFER_IOCV_SHMEM_SLOT); + + /* + * The array used to sort to-be-checkpointed buffer ids is located in + * shared memory, to avoid having to allocate significant amounts of + * memory at runtime. As that'd be in the middle of a checkpoint, or when + * the checkpointer is restarted, memory allocation failures would be + * painful. + */ + CkptBufferIds = (CkptSortItem *) + ShmemInitStructInSlot("Checkpoint BufferIds", + NBuffers * sizeof(CkptSortItem), &foundBufCkpt, + CHECKPOINT_BUFFERS_SHMEM_SLOT); + + /* Align buffer pool on IO page size boundary. */ + BufferBlocks = (char *) + TYPEALIGN(PG_IO_ALIGN_SIZE, + ShmemInitStructInSlot("Buffer Blocks", + NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE, + &foundBufs, BUFFERS_SHMEM_SLOT)); + + /* + * Initialize the headers for new buffers. + */ + for (i = NBuffersOld - 1; i < NBuffers; i++) + { + BufferDesc *buf = GetBufferDescriptor(i); + + ClearBufferTag(&buf->tag); + + pg_atomic_init_u32(&buf->state, 0); + buf->wait_backend_pgprocno = INVALID_PROC_NUMBER; + + buf->buf_id = i; + + /* + * Initially link all the buffers together as unused. Subsequent + * management of this list is done by freelist.c. 
+ */ + buf->freeNext = i + 1; + + LWLockInitialize(BufferDescriptorGetContentLock(buf), + LWTRANCHE_BUFFER_CONTENT); + + ConditionVariableInit(BufferDescriptorGetIOCV(buf)); + } + + /* Correct last entry of linked list */ + GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST; + + /* Init other shared buffer-management stuff */ + StrategyInitialize(!foundDescs); + + /* Initialize per-backend file flush context */ + WritebackContextInit(&BackendWritebackContext, + &backend_flush_after); +} + /* * BufferShmemSize * diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index fbaddba396..56fa339f55 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -86,6 +86,9 @@ RequestAddinShmemSpace(Size size) * * If num_semaphores is not NULL, it will be set to the number of semaphores * required. + * + * XXX: Calculations for non-main shared memory slots are incorrect; they + * include more than is needed for buffers only. */ Size CalculateShmemSize(int *num_semaphores, int shmem_slot) @@ -153,6 +156,14 @@ CalculateShmemSize(int *num_semaphores, int shmem_slot) size = add_size(size, SlotSyncShmemSize()); size = add_size(size, WaitLSNShmemSize()); + /* + * XXX: For some reason slightly more memory is needed for larger + * shared_buffers, but this size is enough for any large value I've tested + * with. Is it a mistake in how slots are split, or was there a hidden + * inconsistency in the shmem calculation? + */ + size = add_size(size, 1024 * 1024 * 100); + /* include additional requested shmem from preload libraries */ size = add_size(size, total_addin_request); diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c index c670b9cf43..20c4b1d5ad 100644 --- a/src/backend/storage/ipc/shmem.c +++ b/src/backend/storage/ipc/shmem.c @@ -491,17 +491,13 @@ ShmemInitStructInSlot(const char *name, Size size, bool *foundPtr, { /* * Structure is in the shmem index so someone else has allocated it - already.
The size better be the same as the size we are trying to - * initialize to, or there is a name conflict (or worse). + * already. Verify the structure's size: + * - If it's the same, we've found the expected structure. + * - If it's different, we're resizing the expected structure. */ if (result->size != size) - { - LWLockRelease(ShmemIndexLock); - ereport(ERROR, - (errmsg("ShmemIndex entry size is wrong for data structure" - " \"%s\": expected %zu, actual %zu", - name, size, result->size))); - } + result->size = size; + structPtr = result->location; } else diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt index d10ca723dc..42296d950e 100644 --- a/src/backend/utils/activity/wait_event_names.txt +++ b/src/backend/utils/activity/wait_event_names.txt @@ -347,6 +347,7 @@ DSMRegistry "Waiting to read or update the dynamic shared memory registry." InjectionPoint "Waiting to read or update information related to injection points." SerialControl "Waiting to read or update shared <filename>pg_serial</filename> state." WaitLSN "Waiting to read or update shared Wait-for-LSN state." +ShmemResize "Waiting to resize shared memory." # # END OF PREDEFINED LWLOCKS (DO NOT CHANGE THIS LINE) diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c index 636780673b..7f2c45b7f9 100644 --- a/src/backend/utils/misc/guc_tables.c +++ b/src/backend/utils/misc/guc_tables.c @@ -2301,14 +2301,14 @@ struct config_int ConfigureNamesInt[] = * checking for overflow, so we mustn't allow more than INT_MAX / 2. 
*/ { - {"shared_buffers", PGC_POSTMASTER, RESOURCES_MEM, + {"shared_buffers", PGC_SIGHUP, RESOURCES_MEM, gettext_noop("Sets the number of shared memory buffers used by the server."), NULL, GUC_UNIT_BLOCKS }, &NBuffers, 16384, 16, INT_MAX / 2, - NULL, NULL, NULL + NULL, AnonymousShmemResize, NULL }, { diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h index 4c09d270c9..ff75c46307 100644 --- a/src/include/storage/bufmgr.h +++ b/src/include/storage/bufmgr.h @@ -302,6 +302,7 @@ extern bool EvictUnpinnedBuffer(Buffer buf); /* in buf_init.c */ extern void InitBufferPool(void); extern Size BufferShmemSize(int); +extern void ResizeBufferPool(int); /* in localbuf.c */ extern void AtProcExit_LocalBuffers(void); diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h index 88dc79b2bd..fb310e8b9d 100644 --- a/src/include/storage/lwlocklist.h +++ b/src/include/storage/lwlocklist.h @@ -84,3 +84,4 @@ PG_LWLOCK(50, DSMRegistry) PG_LWLOCK(51, InjectionPoint) PG_LWLOCK(52, SerialControl) PG_LWLOCK(53, WaitLSN) +PG_LWLOCK(54, ShmemResize) diff --git a/src/include/storage/pg_shmem.h b/src/include/storage/pg_shmem.h index c0143e3899..ff4736c6c8 100644 --- a/src/include/storage/pg_shmem.h +++ b/src/include/storage/pg_shmem.h @@ -105,6 +105,8 @@ extern bool PGSharedMemoryIsInUse(unsigned long id1, unsigned long id2); extern void PGSharedMemoryDetach(void); extern void GetHugePageSize(Size *hugepagesize, int *mmap_flags); +void AnonymousShmemResize(int newval, void *extra); + /* * To be able to dynamically resize largest parts of the data stored in shared * memory, we split it into multiple shared memory mappings slots. Each slot -- 2.45.1
>From 6df85a35e8f6cca94a963d516f1b6974850ba05b Mon Sep 17 00:00:00 2001 From: Dmitrii Dolgov <9erthali...@gmail.com> Date: Tue, 15 Oct 2024 16:18:45 +0200 Subject: [PATCH v1 5/5] Use anonymous files to back shared memory segments Allow using anonymous files for shared memory, instead of plain anonymous memory. Such an anonymous file is created via memfd_create; it lives in memory, behaves like a regular file, and is semantically equivalent to anonymous memory allocated via mmap with MAP_ANONYMOUS. The advantages of using anonymous files are the following: * We've got a file descriptor, which can be used for regular file operations (modification, truncation, you name it). * The file can be given a name, which improves readability when it comes to process maps. Here is how it looks: 7f5a2bd04000-7f5a32e52000 rw-s 00000000 00:01 1845 /memfd:strategy (deleted) 7f5a39252000-7f5a4030e000 rw-s 00000000 00:01 1842 /memfd:checkpoint (deleted) 7f5a4670e000-7f5a4d7ba000 rw-s 00000000 00:01 1839 /memfd:iocv (deleted) 7f5a53bba000-7f5a5ad26000 rw-s 00000000 00:01 1836 /memfd:descriptors (deleted) 7f5a9ad26000-7f5aa9d94000 rw-s 00000000 00:01 1833 /memfd:buffers (deleted) 7f5d29d94000-7f5d30e00000 rw-s 00000000 00:01 1830 /memfd:main (deleted) * By default, Linux will not add file-backed shared mappings into a core dump, making it more convenient to work with them in PostgreSQL: no more huge dumps to process. The downside is that memfd_create is Linux-specific.
---
 src/backend/port/sysv_shmem.c | 47 +++++++++++++++++++++++++++++------
 src/include/portability/mem.h |  2 +-
 2 files changed, 40 insertions(+), 9 deletions(-)

diff --git a/src/backend/port/sysv_shmem.c b/src/backend/port/sysv_shmem.c
index 4bdadbb0e2..a01c3e4789 100644
--- a/src/backend/port/sysv_shmem.c
+++ b/src/backend/port/sysv_shmem.c
@@ -103,6 +103,7 @@ typedef struct AnonymousMapping
 	void	   *shmem;			/* Pointer to the start of the mapped memory */
 	void	   *seg_addr;		/* SysV shared memory for the header */
 	unsigned long seg_id;		/* IPC key */
+	int			segment_fd;		/* fd for the backing anon file */
 } AnonymousMapping;
 
 static AnonymousMapping Mappings[ANON_MAPPINGS];
@@ -116,7 +117,7 @@ static int	next_free_slot = 0;
 * 00400000-00490000 /path/bin/postgres
 * ...
 * 012d9000-0133e000 [heap]
-* 7f443a800000-7f470a800000 /dev/zero (deleted)
+* 7f443a800000-7f470a800000 /memfd:main (deleted)
 * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
 * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
 * ...
@@ -143,9 +144,9 @@ static int	next_free_slot = 0;
 * The result would look like this:
 *
 * 012d9000-0133e000 [heap]
-* 7f4426f54000-7f442e010000 /dev/zero (deleted)
+* 7f4426f54000-7f442e010000 /memfd:main (deleted)
 * [...free space...]
-* 7f443a800000-7f444196c000 /dev/zero (deleted)
+* 7f443a800000-7f444196c000 /memfd:buffers (deleted)
 * [...free space...]
 * 7f470a800000-7f471831d000 /usr/lib/locale/locale-archive
 * 7f4718400000-7f4718401000 /usr/lib64/libicudata.so.74.2
@@ -708,6 +709,18 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	void	   *ptr = MAP_FAILED;
 	int			mmap_errno = 0;
 
+	/*
+	 * Prepare an anonymous file backing the segment. Its size will be
+	 * specified later via ftruncate.
+	 *
+	 * The file behaves like a regular file, but lives in memory. Once all
+	 * references to the file are dropped, it is automatically released.
+	 * Anonymous memory is used for all backing pages of the file, thus it has
+	 * the same semantics as anonymous memory allocations using mmap with the
+	 * MAP_ANONYMOUS flag.
+	 */
+	mapping->segment_fd = memfd_create(MappingName(mapping->shmem_slot), 0);
+
 #ifndef MAP_HUGETLB
 	/* PGSharedMemoryCreate should have dealt with this case */
 	Assert(huge_pages != HUGE_PAGES_ON);
@@ -725,8 +738,13 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		if (allocsize % hugepagesize != 0)
 			allocsize += hugepagesize - (allocsize % hugepagesize);
 
+		/*
+		 * Do not use an anonymous file here yet. When adding it, do not forget
+		 * to use ftruncate and the flags MFD_HUGETLB & MFD_HUGE_2MB/MFD_HUGE_1GB
+		 * in memfd_create.
+		 */
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | mmap_flags, -1, 0);
+				   PG_MMAP_FLAGS | MAP_ANONYMOUS | mmap_flags, -1, 0);
 		mmap_errno = errno;
 		if (huge_pages == HUGE_PAGES_TRY && ptr == MAP_FAILED)
 		{
@@ -762,7 +780,8 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	 * - First create the temporary probe mapping of a fixed size and let
 	 *   kernel to place it at address of its choice. By the virtue of the
 	 *   probe mapping size we expect it to be located at the lowest
-	 *   possible address, expecting some non mapped space above.
+	 *   possible address, expecting some non mapped space above. The probe
+	 *   does not need to be backed by an anonymous file.
 	 *
 	 * - Unmap the probe mapping, remember the address.
 	 *
@@ -777,7 +796,7 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 	 * without a restart.
 	 */
 	probe = mmap(NULL, PROBE_MAPPING_SIZE, PROT_READ | PROT_WRITE,
-				 PG_MMAP_FLAGS, -1, 0);
+				 PG_MMAP_FLAGS | MAP_ANONYMOUS, -1, 0);
 
 	if (probe == MAP_FAILED)
 	{
@@ -793,8 +812,14 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 
 		munmap(probe, PROBE_MAPPING_SIZE);
 
+		/*
+		 * Specify the segment file size using allocsize, which contains
+		 * the potentially modified size.
+		 */
+		ftruncate(mapping->segment_fd, allocsize);
+
 		ptr = mmap(probe - offset, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, -1, 0);
+				   PG_MMAP_FLAGS | MAP_FIXED_NOREPLACE, mapping->segment_fd, 0);
 		mmap_errno = errno;
 		if (ptr == MAP_FAILED)
 		{
@@ -813,8 +838,11 @@ CreateAnonymousSegment(AnonymousMapping *mapping)
 		 */
 		allocsize = mapping->shmem_size;
 
+		/* Specify the segment file size using allocsize. */
+		ftruncate(mapping->segment_fd, allocsize);
+
 		ptr = mmap(NULL, allocsize, PROT_READ | PROT_WRITE,
-				   PG_MMAP_FLAGS, -1, 0);
+				   PG_MMAP_FLAGS, mapping->segment_fd, 0);
 		mmap_errno = errno;
 	}
 
@@ -903,6 +931,9 @@ AnonymousShmemResize(int newval, void *extra)
 		if (m->shmem_size == new_size)
 			continue;
 
+		/* Resize the backing anon file. */
+		ftruncate(m->segment_fd, new_size);
+
 		if (mremap(m->shmem, m->shmem_size, new_size, 0) < 0)
 			elog(LOG, "mremap(%p, %zu) failed: %m",
 				 m->shmem, m->shmem_size);
diff --git a/src/include/portability/mem.h b/src/include/portability/mem.h
index 2cd05313b8..50db0da28d 100644
--- a/src/include/portability/mem.h
+++ b/src/include/portability/mem.h
@@ -38,7 +38,7 @@
 #define MAP_NOSYNC			0
 #endif
 
-#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE)
+#define PG_MMAP_FLAGS			(MAP_SHARED|MAP_HASSEMAPHORE)
 
 /* Some really old systems don't define MAP_FAILED. */
 #ifndef MAP_FAILED
-- 
2.45.1