On Sun, Dec 8, 2024 at 7:48 PM Tomas Vondra <to...@vondra.me> wrote:

[..]

> >> I have previously encountered situations where the non-garbage-collected
> >> memory of wal_sender was approximately hundreds of megabytes or even
> >> exceeded 1GB, but I was unable to reproduce this situation using simple
> >> SQL. Therefore, I introduced an asynchronous processing function, hoping
> >> to manage memory more efficiently without affecting performance.
> >>
> >
> > I doubt a system function is the right approach to deal with these
> > memory allocation issues. The function has to be called by the user,
> > which means the user is expected to monitor the system and decide when
> > to invoke the function. That seems far from trivial - it would require
> > collecting OS-level information about memory usage, and I suppose it'd
> > need to happen fairly often to actually help with OOM reliably.

[..]
> > Sure, forcing the system to release memory more aggressively may affect
> > performance - that's the tradeoff done by glibc. But calling the new
> > pg_trim_backend_heap_free_memory() function is not free either.
> >
> > But why would it force returning the memory to be returned immediately?
> > The decision whether to trim memory is driven by M_TRIM_THRESHOLD, and
> > that does not need to be 0. In fact, it's 128kB by default, i.e. glibc
> > trims memory automatically, if it can trim at least 128kB.

[..]

> To propose something less abstract / more tangible, I think we should do
> something like this:
>
> 1) add a bit of code for glibc-based systems, that adjusts selected
> malloc parameters using mallopt() during startup
>
> 2) add a GUC that enables this, with the default being the regular glibc
> behavior (with dynamic adjustment of various thresholds)
>
> Which exact parameters would this set is an open question, but based on
> my earlier experiments, Ronan's earlier patches, etc. I think it should
> adjust at least
>
> M_TRIM_THRESHOLD - to make sure we trim heap regularly
> M_TOP_PAD - to make sure we cache some allocated memory
>
> I wonder if maybe we should tune M_MMAP_THRESHOLD, which on 64-bit
> systems defaults to 32MB, so we don't really mmap() very often for
> regular memory contexts. But I don't know if that's a good idea, that
> would need some experiments.
>
> I believe that's essentially what Ronan Dunklau proposed, but it
> stalled. Not because of some inherent complexity, but because of
> concerns about introducing glibc-specific code.
>
> Based on my recent experiments I think it's clearly worth it (esp. with
> high concurrency workloads). If glibc was a niche, it'd be a different
> situation, but I'd guess vast majority of databases runs on glibc. Yes,
> it's possible to do these changes without new code (e.g. by setting the
> environment variables), but that's rather inconvenient.
> > Perhaps it'd be possible to make it a bit smarter by looking at malloc
> > stats, and adjust the trim/pad thresholds, but I'd leave that for the
> > future. It might even lead to similar issues with excessive memory usage
> > just like the logic built into glibc.
> >
> > But maybe we could at least print / provide some debugging information?
> > That would help with adjusting the GUC ...

Hi all,

Thread bump. Just to add one single data point to this discussion: we
have been chasing some ghost memory leaks that apparently were not
memory leaks after all (they stopped at a certain threshold, like
1.2GB), yet OOMs were still present. After some experimentation it
turned out that the memory ended up being used in MemoryContexts, but
was afterwards released (outside of TopMemoryContext) when the session
went idle/idle in transaction - and yet the process *still* had it
allocated. Injecting a call to malloc_trim() released backend memory
for sessions that had been idle for some time.

E.g. with PG 13.x I've got a more or less sample reproducer (thanks to
my colleague Matthew Gwillam-Kelly, who worked on the initial
identification of the problem):

DROP TABLE p;
CREATE TABLE p (
    id int not null,
    sensor_id bigint not null,
    val bigint
) PARTITION BY HASH (sensor_id);
CREATE INDEX p_idx ON p (val);
SELECT 'CREATE TABLE p_'||g||' PARTITION OF p FOR VALUES WITH (MODULUS 1000, REMAINDER ' || g || ');'
FROM generate_series(0, 999) g; \gexec
INSERT INTO p SELECT g, g, g FROM generate_series(1, 1000000) g;
ANALYZE p;

Run `UPDATE p SET val = val;` a minimum of 3 or 4 times in a new
session; in my case the backend will use something like ~400MB and stay (!)
like that indefinitely:

$ grep ^Pss /proc/27421/smaps_rollup
Pss:              399291 kB
Pss_Dirty:        397351 kB
Pss_Anon:         353859 kB
Pss_File:           1939 kB
Pss_Shmem:         43492 kB

After injecting a call to malloc_trim(0) it shows a much lower Pss_Anon:

$ grep ^Pss /proc/27421/smaps_rollup
Pss:               65904 kB
Pss_Dirty:         64189 kB
Pss_Anon:          23231 kB
Pss_File:           1715 kB
Pss_Shmem:         40957 kB

NOTE: it does not depend on the (maintenance_)work_mem variables; it is
more a function of the PG version involved, extensions, probably
encoding, partition count, maybe triggers.

That's like ~353MB wasted above (our customer was hitting it in the
~1.2 GB range, but they had additional extensions loaded which could
further amplify the effect) - memory fully allocated by the process but
unused by the memory contexts (the pfree() calls were successful,
free() did nothing; it's just not returned back to the OS). Before the
trim it looks like this:

TopMemoryContext: 801664 total in 29 blocks; 498048 free (2033 chunks); 303616 used
[..]
Grand total: 22213784 bytes in 3129 blocks; 9674384 free (3393 chunks); 12539400 used

Such a single UPDATE causes the following frequency histogram of sizes
passed to malloc():

@:
[1]                 1 |                                                    |
[2, 4)             43 |                                                    |
[4, 8)             81 |                                                    |
[8, 16)           261 |@                                                   |
[16, 32)        10049 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64)         8951 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[64, 128)         446 |@@                                                  |
[128, 256)        133 |                                                    |
[256, 512)        118 |                                                    |
[512, 1K)          11 |                                                    |
[1K, 2K)          134 |                                                    |
[2K, 4K)            5 |                                                    |
[4K, 8K)           94 |                                                    |
[8K, 16K)        1020 |@@@@@                                               |
[16K, 32K)       4122 |@@@@@@@@@@@@@@@@@@@@@                               |
[32K, 64K)         29 |                                                    |
[64K, 128K)        14 |                                                    |
[128K, 256K)     2196 |@@@@@@@@@@@                                         |
[256K, 512K)        2 |                                                    |
[..]

E.g. one of the hot paths for this (remember, it's still PG13) is
heap_update->RelationGetBufferForTuple->GetPageWithFreeSpace->fsm_search->fsm_readbuf->mdopenfork->mdopenfork->PathNameOpenFile->PathNameOpenFilePerm->__GI___strdup.
Here it's strdup(), but it could be anything - and that's the point.
This effect in libc is completely reproducible - please see attached:
any program allocating small chunks (<= 120 bytes) ends up not
releasing the memory back to the OS.

$ gcc mwr.c -o mwr -DMALLOC_SIZE=120 && ./mwr
done
Rss:             1251460 kB
Pss:             1250136 kB
Pss_Dirty:       1250112 kB
Pss_Anon:        1250100 kB
Pss_File:             36 kB
Pss_Shmem:             0 kB
after malloc_trim:
Rss:                1460 kB
Pss:                 136 kB
Pss_Dirty:           100 kB
Pss_Anon:            100 kB
Pss_File:             36 kB
Pss_Shmem:             0 kB

$ gcc mwr.c -o mwr -DMALLOC_SIZE=121 && ./mwr   # 121 + header crosses the 128-byte fastbin limit
done
Rss:                1676 kB
Pss:                 259 kB
Pss_Dirty:           224 kB
Pss_Anon:            224 kB
Pss_File:             35 kB
Pss_Shmem:             0 kB
after malloc_trim:
Rss:                1548 kB
Pss:                 131 kB
Pss_Dirty:            96 kB
Pss_Anon:             96 kB
Pss_File:             35 kB
Pss_Shmem:             0 kB

Now, current PG18 behaved much better in that regard, without that many
small mallocs at runtime (strdup() is still there, it's just that the
hot path is not exercised that often):

@:
[8, 16)          2697 |@@@@@@@                                             |
[16, 32)         2203 |@@@@@@                                              |
[32, 64)            0 |                                                    |
[64, 128)           0 |                                                    |
[128, 256)          0 |                                                    |
[256, 512)          0 |                                                    |
[512, 1K)           0 |                                                    |
[1K, 2K)         3014 |@@@@@@@@                                            |
[2K, 4K)            5 |                                                    |
[4K, 8K)            2 |                                                    |
[8K, 16K)        1107 |@@@                                                 |
[16K, 32K)      18112 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K)         12 |                                                    |
[..]

Yet I could still drop Pss_Anon from ~44MB to ~13MB by calling
malloc_trim(0). Assume we have 1k idle connections like this and you
theoretically end up wasting 30GB of RAM.

So basically we have two generic solutions to this class of problems,
to avoid OOMs due to GNU libc's malloc() not releasing memory:

0. Disconnecting the backend (I'm not counting it, as it doesn't seem
to be a solid long-term solution, but it explains why people push for
poolers with refreshable connection pools).

1. Call malloc_trim(0), but Tom stated it might not be portable, so
maybe there is a chance for an extension or #ifdefs.
I do think that calling it after every query might not be ideal due to
the overhead, but perhaps once a query is done we could schedule an
interrupt aimed at now()+X seconds (where X >= 5?), so that it executes
only when the backend has gone really inactive (to avoid re-allocating
the memory again), and abort launching it if the next query has already
started. I haven't looked at the code, so I don't know if that can be
done cheaply.

2. Or use GLIBC_TUNABLES; e.g. disabling mxfast bin allocations shows
some promise, even with many small allocations still present:

$ gcc mwr.c -o mwr -DMALLOC_SIZE=120 && GLIBC_TUNABLES=glibc.malloc.mxfast=0 ./mwr
done
Rss:                1680 kB
Pss:                 257 kB
Pss_Dirty:           236 kB   # no need for malloc_trim()
Pss_Anon:            224 kB
Pss_File:             33 kB
Pss_Shmem:             0 kB
[..]

From my side also -1 to the idea of exposing a
pg_trim_backend_heap_free_memory() function as per the original patch
proposal - how is the user supposed to embed this within his
application?

I have not quantified the overhead for #1 and #2.

-J.
/* malloc wont release.c */
#include <unistd.h>
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM 10000000
//#define MALLOC_SIZE 128

int
main(int argc, char *argv[])
{
	char		cmd[128];
	void	  **memtab = (void **) malloc(sizeof(void *) * NUM);
	int			i;

	/* allocate NUM chunks of MALLOC_SIZE bytes, then free them all */
	for (i = 0; i < NUM; i++)
		memtab[i] = malloc(MALLOC_SIZE);
	for (i = 0; i < NUM; i++)
		free(memtab[i]);
	free(memtab);

	printf("done\n");
	snprintf(cmd, sizeof(cmd) - 1,
			 "grep -e ^Pss -e ^Rss /proc/%d/smaps_rollup", getpid());
	system(cmd);
	getchar();

	malloc_trim(0);
	printf("after malloc_trim:\n");
	system(cmd);
	getchar();
	return 0;
}