On Tue, Dec 3, 2024 at 8:01 PM Robert Haas <robertmh...@gmail.com> wrote:
>
> On Mon, Dec 2, 2024 at 2:18 PM Dmitry Dolgov <9erthali...@gmail.com> wrote:
> > I've asked about that in linux-mm [1]. To my surprise, the
> > recommendations were to stick to creating a large mapping in advance,
> > and slice smaller mappings out of that, which could be resized later.
> > The OOM score should not be affected, and hugetlb could be avoided using
> > MAP_NORESERVE flag for the initial mapping (I've experimented with that,
> > seems to be working just fine, even if the slices are not using
> > MAP_NORESERVE).
> >
> > I guess that would mean I'll try to experiment with this approach as
> > well. But what others think? How much research do we need to do, to gain
> > some confidence about large shared mappings and make it realistically
> > acceptable?
>
> Personally, I like this approach. It seems to me that this opens up
> the possibility of a system where the virtual addresses of data
> structures in shared memory never change, which I think will avoid an
> absolutely massive amount of implementation complexity. It's obviously
> not ideal that we have to specify in advance an upper limit on the
> potential size of shared_buffers, but we can live with it. It's better
> than what we have today; and certainly cloud providers will have no
> issue with pre-setting that to a reasonable value. I don't know if we
> can port it to other operating systems, but it seems at least possible
> that they offer similar primitives, or will in the future; if not, we
> can disable the feature on those platforms.
>
> I still think the synchronization is going to be tricky. For example
> when you go to shrink a mapping, you need to make sure that it's free
> of buffers that anyone might touch; and when you grow a mapping, you
> need to make sure that nobody tries to touch that address space before
> they grow the mapping, which goes back to my earlier point about
> someone doing a lookup into the buffer mapping table and finding a
> buffer number that is beyond the end of what they've already mapped.
> But I think it may be doable with sufficient cleverness.
>

From the discussion so far, the protocol for each shared memory slot (or
segment, as suggested by Robert) seems to be the following.

1. At the start, create a memory mapping using mmap() with the maximum
   allocation (maxsize), PROT_READ/PROT_WRITE and MAP_NORESERVE, to
   reserve address space. Assume this is created at virtual address
   maddr.
2. Resize it to the required size (size) using mremap(). This mapping
   will be used to create shared memory objects.
3. Map a segment with PROT_NONE and MAP_NORESERVE at maddr + size. This
   segment prevents any other mapping from being added in the reserved
   space. PROT_NONE protects against unintentional reads and writes in
   this space.
4. When resizing the segment, remove the mapping created in step 3 and
   execute steps 2 and 3 again. Synchronization, mentioned by Robert,
   should be carried out somewhere in this step.

Note that the addresses need to be aligned as per mmap() and mremap()
requirements.

Please correct me if I am wrong.

I wrote the attached simple program simulating this protocol. It seems
to work as expected. However, mmap'ing with MAP_FIXED would still be
able to dislodge the reserved memory. But that's true of any mapped
segment, not just of reserved memory.

A bit about the program: it reserves a 300MB memory segment and resizes
it to 100MB, 200MB and back to 300MB, thus exercising both shrinking and
enlarging the memory. It forks a child process after resizing the memory
segment the first time. At every step it makes sure that the parent and
child processes can write and read at the boundaries of the resized
memory segment. The program waits on getchar() at these steps, so in
case the program seems to be stuck, try pressing Enter once or twice.

I could verify the memory mappings, their sizes etc. by looking at
/proc/PID/maps and /proc/PID/status, but I did not find a way to verify
the amount of memory actually allocated, and so could not confirm that
it is actually shrinking and expanding. Please let me know how to verify
that.

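For readers who want steps 1 to 4 in isolation, here is a minimal sketch
of my own, separate from the attached program. The names
reserve_segment() and resize_segment() are hypothetical, and using
MAP_FIXED_NOREPLACE for the guard mapping is an assumption of this
sketch (it needs Linux 4.17 or later). It does no synchronization at
all and assumes all sizes are multiples of the page size.

```c
#define _GNU_SOURCE 1
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Step 1: reserve maxsize bytes of address space without reserving memory. */
static void *
reserve_segment(size_t maxsize)
{
    void *maddr = mmap(NULL, maxsize, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return (maddr == MAP_FAILED) ? NULL : maddr;
}

/*
 * Steps 2 to 4: drop the old guard (if any), resize the usable part in
 * place, then install a new PROT_NONE guard covering the rest of the
 * reserved range. Returns 0 on success, -1 on failure.
 */
static int
resize_segment(void *maddr, size_t oldsize, size_t newsize, size_t maxsize)
{
    char   *base = maddr;
    size_t  pagesz = (size_t) sysconf(_SC_PAGESIZE);

    assert(oldsize % pagesz == 0 && newsize % pagesz == 0 &&
           maxsize % pagesz == 0);
    assert(newsize > 0 && newsize <= maxsize && oldsize <= maxsize);

    /* Remove the guard created by the previous call. */
    if (oldsize < maxsize && munmap(base + oldsize, maxsize - oldsize) < 0)
        return -1;

    /* flags = 0: fail instead of moving the mapping to a new address. */
    if (mremap(maddr, oldsize, newsize, 0) != maddr)
        return -1;

    /* Guard the unused tail so that nothing else gets mapped there. */
    if (newsize < maxsize)
    {
        void *guard = mmap(base + newsize, maxsize - newsize, PROT_NONE,
                           MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE |
                           MAP_FIXED_NOREPLACE, -1, 0);

        if (guard != (void *) (base + newsize))
            return -1;
    }
    return 0;
}
```

A caller would reserve once with reserve_segment(maxsize), call
resize_segment(maddr, maxsize, size, maxsize) for the initial size, and
then call resize_segment() again with the current and new sizes for
every later resize. The start address stays the same throughout.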
-- Best Wishes, Ashutosh Bapat

#define _GNU_SOURCE 1    /* See feature_test_macros(7) */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdbool.h>

/*
 * Note: arithmetic on void * below relies on the GNU C extension that
 * treats sizeof(void) as 1.
 */

/* Unmap the given range; exit the process on failure. */
void
unmap_memory(void *memaddr, size_t size, const char *tag)
{
    if (munmap(memaddr, size) < 0)
    {
        printf("%s: shared memory unmapping failed with eno %d\n", tag, errno);
        exit(__LINE__);
    }

    printf("%s: unmapped memory of size %zu from %p.\n", tag, size, memaddr);
}

void
unmap_on_exit(void *memaddr, size_t size, int exit_code, const char *tag)
{
    unmap_memory(memaddr, size, tag);
    exit(exit_code);
}

/*
 * Create a shared anonymous mapping. Returns NULL on failure. When
 * "fixed" is set, the mapping must land exactly at addr and must not
 * replace an existing mapping.
 */
void *
map_memory(void *addr, size_t size, bool noreserve, bool fixed, int protection)
{
    void   *memaddr;
    int     flags = MAP_SHARED | MAP_ANONYMOUS;

    if (noreserve)
        flags = flags | MAP_NORESERVE;

    if (fixed)
        flags = flags | MAP_FIXED_NOREPLACE;

    memaddr = mmap(addr, size, protection, flags, -1, 0);
    if (memaddr == MAP_FAILED)
    {
        printf("shared memory mapping at %p of size %zu with %s failed with eno %d\n",
               addr, size, noreserve ? "no reservation" : "reservation", errno);
        return NULL;
    }

    if (addr != NULL && memaddr != addr)
    {
        printf("Expected memory of size %zu with %s to be mapped at %p but got mapped at %p\n",
               size, noreserve ? "no reservation" : "reservation", addr, memaddr);
    }

    printf("mapped memory of size %zu with %s at %p.\n", size,
           noreserve ? "no reservation" : "reservation", memaddr);

    return memaddr;
}

/* Parent side of the handshake: write wsign, then wait for readsign. */
void
p_write_and_readwait(int *addr, int wsign, int readsign)
{
    *addr = wsign;
    printf("parent wrote value %d at %p\n", wsign, addr);
    while (*addr != readsign)
    {
        printf("parent is sleeping for child to increment signature value to %d at %p\n",
               readsign, addr);
        sleep(1);
    }
    printf("parent found signature value of %d at %p.\n", *addr, addr);
}

/* Child side of the handshake: wait for readsign, then write wsign. */
void
c_readwait_and_write(int *addr, int readsign, int wsign)
{
    while (*addr != readsign)
    {
        printf("child is sleeping for parent to write signature value of %d at %p\n",
               readsign, addr);
        sleep(1);
    }
    printf("child found signature value of %d at %p.\n", *addr, addr);
    *addr = wsign;
    printf("child wrote value %d at %p\n", wsign, addr);
}

/*
 * Unmap reserved space, resize memory and add reserved space.
 */
void
resize_memory(void *addr, size_t oldsize, size_t newsize, size_t maxsize,
              int sign, const char *tag)
{
    int    *signaddr = addr;
    void   *oldaddr = addr;

    /* Unmap existing unreserved memory first */
    if (oldsize != maxsize)
        unmap_memory(addr + oldsize, maxsize - oldsize, tag);

    /* Resize memory and check sanity */
    addr = mremap(addr, oldsize, newsize, 0);
    if (addr == MAP_FAILED)
    {
        printf("%s: resizing memory at %p from %zu to %zu failed with errno %d\n",
               tag, oldaddr, oldsize, newsize, errno);
        return;
    }

    if (*signaddr != sign)
    {
        printf("%s: didn't find expected value %d at %p after resizing, Instead found %d\n",
               tag, sign, signaddr, *signaddr);
        return;
    }

    if (addr != oldaddr)
    {
        printf("%s: remapped to %p instead of %p\n", tag, addr, oldaddr);
        return;
    }

    /* Re-create the PROT_NONE reservation after the resized segment */
    if (newsize != maxsize)
        map_memory(addr + newsize, maxsize - newsize, true, false, PROT_NONE);

    printf("%s: resized memory at %p from %zu to %zu successfully retaining old value %d at %p. Press Enter\n",
           tag, addr, oldsize, newsize, sign, signaddr);
    getchar();
}

void
parent_process(void *memaddr, size_t *sizes, int *signs, int numsizes,
               size_t maxsize)
{
    void   *otheraddr;

    printf("parent: *** checking consistency before resizing ***\n");
    p_write_and_readwait(memaddr, signs[0], signs[0] + 1);
    p_write_and_readwait(memaddr + sizes[0] - sizeof(int) - 1, signs[0] + 2, signs[0] + 3);
    getchar();

    resize_memory(memaddr, sizes[0], sizes[1], maxsize, signs[0] + 1, "parent");
    printf("parent: *** checking consistency after resizing ***\n");
    p_write_and_readwait(memaddr + sizes[0], signs[1], signs[1] + 2);
    p_write_and_readwait(memaddr + sizes[1] - sizeof(int) - 1, signs[1] + 3, signs[1] + 4);
    getchar();

    /*
     * Try adding a mapping between current boundary and max boundary. This
     * should not succeed because of reserved space at the end.
     */
    otheraddr = map_memory(memaddr + sizes[1], maxsize - sizes[1] - 1024, false, true,
                           PROT_WRITE | PROT_READ);
    if (otheraddr != NULL)
    {
        printf("Extra memory segment mapped in the reserved space from %p to %p.\n",
               memaddr, memaddr + maxsize);
        /* The original passed "child" here; this is the parent. */
        unmap_on_exit(memaddr, maxsize, __LINE__, "parent");
    }

    resize_memory(memaddr, sizes[1], sizes[2], maxsize, signs[0] + 1, "parent");
    printf("parent: ***** checking consistency after 2nd resizing *****\n");
    p_write_and_readwait(memaddr + sizes[1], signs[2], signs[2] + 5);
    p_write_and_readwait(memaddr + sizes[2] - sizeof(int) - 1, signs[2] + 6, signs[2] + 7);
    getchar();
}

void
child_process(void *memaddr, size_t *sizes, int *signs, int numsizes,
              size_t maxsize)
{
    void   *otheraddr;

    printf("child: check memory mapping /proc/%d/maps and status /proc/%d/status\n",
           getpid(), getpid());

    /* Read and write: at boundaries */
    printf("child: *** checking consistency before resizing ***\n");
    c_readwait_and_write(memaddr, signs[0], signs[0] + 1);
    c_readwait_and_write(memaddr + sizes[0] - sizeof(int) - 1, signs[0] + 2, signs[0] + 3);
    getchar();

    resize_memory(memaddr, sizes[0], sizes[1], maxsize, signs[0] + 1, "child");
    printf("child: *** checking consistency after resizing ***\n");
    c_readwait_and_write(memaddr + sizes[0], signs[1], signs[1] + 2);
    c_readwait_and_write(memaddr + sizes[1] - sizeof(int) - 1, signs[1] + 3, signs[1] + 4);
    getchar();

    /*
     * Try adding a mapping between current boundary and max boundary. This
     * should not succeed because of reserved space at the end.
     */
    otheraddr = map_memory(memaddr + sizes[1], maxsize - sizes[1] - 1024, false, true,
                           PROT_WRITE | PROT_READ);
    if (otheraddr != NULL)
    {
        if (otheraddr >= memaddr && otheraddr <= memaddr + maxsize)
            printf("Extra memory segment mapped in the reserved space from %p to %p.\n",
                   memaddr, memaddr + maxsize);
        unmap_on_exit(memaddr, maxsize, __LINE__, "child");
    }

    resize_memory(memaddr, sizes[1], sizes[2], maxsize, signs[0] + 1, "child");
    printf("child: *** checking consistency after 2nd resizing ***\n");
    c_readwait_and_write(memaddr + sizes[1], signs[2], signs[2] + 5);
    c_readwait_and_write(memaddr + sizes[2] - sizeof(int) - 1, signs[2] + 6, signs[2] + 7);
    getchar();
}

int
main(int argc, char **argv)
{
    size_t  sizes[] = {100 * 1024 * 1024, 200 * 1024 * 1024, 300 * 1024 * 1024};
    int     signs[] = {435, 643, 586};
    int     numsizes = sizeof(sizes) / sizeof(sizes[0]);
    void   *memaddr;
    size_t  maxsize = 0;
    pid_t   chldpid;
    char   *tag = "parent";

#define FIRST_SIGN 100

    if (numsizes != sizeof(signs) / sizeof(signs[0]))
        printf("mismatch in number of sizes and number of signs %d vs %zu\n",
               numsizes, sizeof(signs) / sizeof(signs[0]));

    /* The reservation must cover the largest size we will resize to */
    for (int i = 0; i < numsizes; i++)
    {
        if (maxsize < sizes[i])
            maxsize = sizes[i];
    }

    printf("parent: check memory mapping /proc/%d/maps and status /proc/%d/status\n",
           getpid(), getpid());

    /* Reserve memory but don't allocate */
    memaddr = map_memory(NULL, maxsize, true, false, PROT_WRITE | PROT_READ);
    if (memaddr == NULL)
        exit(1);
    *(int *) memaddr = FIRST_SIGN;
    getchar();

    resize_memory(memaddr, maxsize, sizes[0], maxsize, FIRST_SIGN, "parent");

    chldpid = fork();
    if (chldpid < 0)
    {
        printf("forking a child failed\n");
        unmap_on_exit(memaddr, maxsize, __LINE__, tag);
    }
    else if (chldpid == 0)
    {
        child_process(memaddr, sizes, signs, numsizes, maxsize);
        tag = "child";
    }
    else
    {
        parent_process(memaddr, sizes, signs, numsizes, maxsize);
    }

    unmap_on_exit(memaddr, maxsize, __LINE__, tag);
}