Hi,

## Thomas Ziegler (thomas.zieg...@holmsecurity.com):

There's a lot of information missing here. Let's start from the top.

> I have had my database killed by the kernel oom-killer. After that I
> set turned off memory over-committing and that is where things got weird.

What exactly did you set? When playing with vm.overcommit, did you
understand "Committed Address Space" and the workings of the
overcommit accounting? This is the document:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/mm/overcommit-accounting.rst
Hint: when setting overcommit_memory=2 you might end up with way
less available address space than you thought you would. Also keep
an eye on /proc/meminfo - it's sometimes hard to estimate off the
cuff what's in memory and how it's mapped. (Also, is there anything
else on that machine which might hog memory?)

> I have `shared_buffers` at `16000MB`, `work_mem` at `80MB`, `temp_buffers`
> at `8MB`, `max_connections` at `300` and `maintenance_work_mem` at `1GB`.
> So all in all, I get to roughly 42GB of max memory usage
> (`16000+(80+8)*300=42400`).

That work_mem is "per query operation"; you can have multiple of
those in a single query:
https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-WORK-MEM
Also, there's hash_mem_multiplier:
https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-HASH-MEM-MULTIPLIER
Then I've seen that your query uses parallel workers; remember that
each worker requests memory.
Next, maintenance_work_mem is a per-process limit, and depending on
what's running at any given time, that can add up.

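To put rough numbers on the points above, here is a small Python sketch
(an illustration, not a sizing tool). It computes the commit limit the
kernel enforces with overcommit_memory=2, following the formula in the
kernel documentation, and a deliberately pessimistic upper bound for the
PostgreSQL side. The 64 GiB of RAM, zero swap, overcommit_ratio=50, three
memory-hungry operations per query, two parallel workers per query and
hash_mem_multiplier=2.0 are assumptions picked for illustration, not
values from the report, and the function names are made up.

```python
def commit_limit_mb(ram_mb, swap_mb, overcommit_ratio=50, hugetlb_mb=0):
    """Commit limit under vm.overcommit_memory=2.

    Assumes vm.overcommit_kbytes is unset, so the ratio applies:
    (RAM - huge pages) * overcommit_ratio / 100 + swap.
    """
    return (ram_mb - hugetlb_mb) * overcommit_ratio / 100 + swap_mb


def worst_case_mb(shared_buffers_mb, work_mem_mb, temp_buffers_mb,
                  max_connections, ops_per_query=1, workers_per_query=0,
                  hash_mem_multiplier=1.0):
    """Pessimistic upper bound on memory use, in MB.

    work_mem applies per sort/hash operation, and every process taking
    part in a query (leader plus parallel workers) can use it.
    Applying hash_mem_multiplier to every operation overstates things,
    since it only affects hash-based operations; this is a ceiling,
    not a prediction. maintenance_work_mem is left out entirely.
    """
    per_process = ops_per_query * work_mem_mb * hash_mem_multiplier
    per_connection = (1 + workers_per_query) * per_process + temp_buffers_mb
    return shared_buffers_mb + max_connections * per_connection


if __name__ == "__main__":
    # The estimate from the original mail: one operation per query,
    # no parallel workers, no hash_mem_multiplier.
    naive = worst_case_mb(16000, 80, 8, 300)

    # Same settings with the assumed extra factors described above.
    pessimistic = worst_case_mb(16000, 80, 8, 300, ops_per_query=3,
                                workers_per_query=2, hash_mem_multiplier=2.0)

    # Assumed machine: 64 GiB RAM, no swap, default overcommit_ratio.
    limit = commit_limit_mb(ram_mb=65536, swap_mb=0)

    print(f"naive estimate:       {naive:.0f} MB")
    print(f"pessimistic estimate: {pessimistic:.0f} MB")
    print(f"commit limit:         {limit:.0f} MB")
```

On the assumed machine the commit limit is only half of physical RAM,
so even the naive estimate does not fit. The real values are CommitLimit
and Committed_AS in /proc/meminfo.
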
Finally, there's this:

> 2024-09-12 05:18:36.073 UTC [1932776] LOG: background worker "parallel
> worker" (PID 3808076) exited with exit code 1
> terminate called after throwing an instance of 'std::bad_alloc'
> what(): std::bad_alloc
> 2024-09-12 05:18:36.083 UTC [1932776] LOG: background worker "parallel
> worker" (PID 3808077) was terminated by signal 6: Aborted

That "std::bad_alloc" sounds a lot like C++ and not like the C our
database is written in. My first suspicion would be that you're using
LLVM-JIT (unless you have other - maybe even your own - C++ extensions
in the database?) and that in itself can use a good chunk of memory.
And it looks like that exception bubbled up as a signal 6 (SIGABRT),
which made the process terminate immediately without any cleanup, and
after that the server has no choice but to crash-restart.

I recommend starting with understanding the actual memory limits
as set by your configuration (personally I believe that memory
overcommit is less evil than some people think). Have a close look
at /proc/meminfo and, if possible, disable JIT and check whether that
changes anything. Also, if possible, try starting with only a few active
connections and increase the load carefully once a steady state (in
terms of memory usage) has been reached.

Regards,
Christoph

-- 
Spare Space