Re: Safe vm.overcommit_ratio for Large Multi-Instance PostgreSQL Fleet

Frits Hoogland Fri, 08 Aug 2025 01:22:08 -0700

Joe, I am trying to help, and make people think about things correctly.

The Linux kernel is constantly changing, sometimes subtly and sometimes less 
subtly, and there is a general lack of clear statistics on the more nuanced 
memory operations, as well as of documentation about them.
And there are a lot of myths about memory management. Some describe a 
situation that was once true but, given the changes in the kernel code, is not 
true anymore; others were simply never true.

The best technical description of recent memory management that I could find 
is: 
https://lpc.events/event/11/contributions/896/attachments/793/1493/slides-r2.pdf

>>> On 6 Aug 2025, at 18:33, Joe Conway <[email protected]> wrote:
> 
>>> * Swap is what is used when anonymous memory must be reclaimed to
>>> allow for an allocation of anonymous memory.

Correct. Swapped-out pages are exclusively anonymous memory pages.
Swapping is the result of memory reclaim for anonymous pages. Unlike non-dirty, 
unpinned file pages, whose content does not need to be saved, anonymous pages 
cannot simply be discarded.

>>> * The Linux kernel will aggressively use all available memory for
>>> file buffers, pushing usage against the limits.

It's an explicit design choice of the Linux kernel not to reclaim file pages 
when they are unpinned/not used anymore, leaving them as cached pages.
(Anonymous pages are freed explicitly when released by the owner and put on the 
free list.)
There is no aggressive push: file pages are simply left in place after use, so 
nothing is pushing usage against the limits.
It's the swapper ('page daemon') that eventually frees file pages, based on 
LRU, once free memory drops to a zone limit called 'memory low' (which is 
vm.min_free_kbytes*2). When free memory gets down to vm.min_free_kbytes*1 
(called 'pages min'), tasks are forced to free memory themselves (called 
'direct reclaim').
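
As a rough illustration of the thresholds described above, the per-zone 
watermarks can be read from /proc/zoneinfo rather than derived by hand. The 
sketch below is only an assumption-laden example: the function names and the 
sample numbers are made up, and reclaim_state() is a simplified reading of the 
description above, not the kernel's actual logic.

```python
import re

def parse_zone_watermarks(zoneinfo_text):
    """Return {(node, zone): {'free', 'min', 'low', 'high'}} in pages."""
    zones = {}
    current = None
    for line in zoneinfo_text.splitlines():
        header = re.match(r"Node\s+(\d+),\s+zone\s+(\S+)", line)
        if header:
            current = (int(header.group(1)), header.group(2))
            zones[current] = {}
            continue
        if current is None:
            continue
        free = re.match(r"\s+pages free\s+(\d+)", line)
        if free:
            zones[current]["free"] = int(free.group(1))
            continue
        # Watermark lines look like "        min      11000". The per-cpu
        # "high:" lines carry a colon and are not matched here.
        mark = re.match(r"\s+(min|low|high)\s+(\d+)\s*$", line)
        if mark and mark.group(1) not in zones[current]:
            zones[current][mark.group(1)] = int(mark.group(2))
    return zones

def reclaim_state(marks):
    """Simplified: which reclaim path the description above would expect."""
    if marks["free"] <= marks["min"]:
        return "direct reclaim"
    if marks["free"] <= marks["low"]:
        return "background reclaim"
    return "no reclaim"

# Illustrative sample only; the numbers are invented.
SAMPLE_ZONEINFO = """\
Node 0, zone   Normal
  pages free     20000
        min      11000
        low      13750
        high     16500
"""

zones = parse_zone_watermarks(SAMPLE_ZONEINFO)
for key, marks in zones.items():
    print(key, marks, reclaim_state(marks))
```

On a Linux host the same function can be fed the contents of /proc/zoneinfo 
instead of the sample string.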

>>> * Especially in the older 4 series kernels, file buffers often
>>> cannot be reclaimed fast enough

I am not sure what is described here, or whether this is about the swapper or 
direct reclaim.
There is no need for this to happen 'fast enough'; see the slide deck above.
This is probably aimed at the swapper not reclaiming 'fast enough'. However, 
that is not how this works: if memory requests make free memory drop to 'pages 
min', a task will perform 'direct reclaim'.
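
Whether reclaim on a given system is being done by the swapper or by direct 
reclaim can be checked from the counters in /proc/vmstat. The following is a 
minimal sketch, assuming the pgscan_*, pgsteal_* and allocstall* counter names 
found in that file (older kernels split them per zone, which is why the sketch 
sums by prefix). The function name and the sample values are hypothetical.

```python
def reclaim_counters(vmstat_text):
    """Sum background (kswapd) and direct reclaim counters from vmstat text."""
    totals = {
        "pgscan_kswapd": 0,
        "pgscan_direct": 0,
        "pgsteal_kswapd": 0,
        "pgsteal_direct": 0,
        "allocstall": 0,
    }
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        name, value = parts[0], int(parts[1])
        # pgscan_direct_throttle is a separate counter, not a per-zone
        # variant of pgscan_direct, so it is left out of the sums.
        if name == "pgscan_direct_throttle":
            continue
        for prefix in totals:
            if name == prefix or name.startswith(prefix + "_"):
                totals[prefix] += value
    return totals

# Illustrative sample only; the numbers are invented.
SAMPLE_VMSTAT = """\
pgsteal_kswapd 500
pgsteal_direct 20
pgscan_kswapd 800
pgscan_direct 40
pgscan_direct_throttle 7
allocstall_normal 3
allocstall_movable 1
"""

print(reclaim_counters(SAMPLE_VMSTAT))
```

The counters are cumulative since boot, so comparing two snapshots taken some 
time apart shows whether direct reclaim is happening now.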

>>> * With no swap and a large-ish anonymous memory request, it is
>>> easy to push over the limit to cause the OOM killer to strike.

I am afraid that this is not a correct representation of the actual mechanism. 
Again: look at the slide deck and the explanations above.
The swapper frees memory, which is then used by a task requesting pages at page 
fault, and for that it doesn't matter whether it is anonymous memory or file 
memory.
If memory gets down to pages min, the swapper did not reclaim memory fast 
enough, and a task will perform direct reclaim.

In direct reclaim, the choice of what memory type to reclaim is between file 
memory and anonymous memory.
If there is no swap, the option to reclaim anonymous memory is not available, 
because anonymous pages cannot be discarded like non-dirty, unpinned file pages 
can; they have to be preserved.
If swappiness is set to 0 but swap is available, some documentation suggests 
the kernel will never reclaim anonymous memory. However, I found this not to be 
true: Linux might still choose anonymous memory to reclaim. Obviously, the 
lower the swappiness, the less often reclaim will choose anonymous memory pages.

What you seem to suggest is that with no swap, and thus without the option to 
use anonymous pages for reclaim, the reclaim mechanism depends on the speed of 
(file) reclaim, possibly by the swapper. I hope it's clear this is not true.

Obviously, when there is swap, the total number of pages that become 
potentially available for reclaim is higher, because anonymous pages up to the 
size of swap can be reclaimed.
But if swap is set to a low amount (as suggested: 'You don't need a huge 
amount'), the actual increase in pages available for reclaim is negligible, and 
so is the benefit it provides for not running out of memory.
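
To put a number on 'negligible', swap can be compared with total memory using 
the MemTotal and SwapTotal fields of /proc/meminfo. This is a rough sketch 
only: the 16 GB of swap is the figure suggested earlier in the thread, while 
the 512 GiB machine is a hypothetical example, the function names are made up, 
and swap as a share of MemTotal is just a crude proxy for the extra reclaimable 
pages.

```python
def meminfo_kb(meminfo_text):
    """Parse 'Name:   12345 kB' lines into a {name: value_in_kB} dict."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values

def swap_share_percent(values):
    """Swap size as a percentage of physical memory."""
    return 100.0 * values["SwapTotal"] / values["MemTotal"]

# Hypothetical host: 512 GiB of RAM with 16 GiB of swap.
SAMPLE_MEMINFO = """\
MemTotal:       536870912 kB
SwapTotal:       16777216 kB
"""

values = meminfo_kb(SAMPLE_MEMINFO)
print(f"swap adds about {swap_share_percent(values):.3f}% on top of RAM")
```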

>>> * On the other hand, with swap enabled anon memory can be
>>> reclaimed giving the kernel more time to deal with file buffer
>>> reclamation.

See the explanations given for the previous comments. Time is not a factor in 
reclaim failing to find pages for a task that page faults to add memory, 
because a task will do direct reclaim if it exhausts the free memory provided 
by the swapper.

>>> At least that is what I have observed.

The kernel code for direct reclaim shows that when direct reclaim has finished 
scanning memory pages (only file pages when there is no swap, or, in the case 
of having swap, both file and anonymous pages) and wasn't able to satisfy the 
request for the pages it needs, it will trigger the kernel out-of-memory (OOM) 
killer, because it has run out of available pages.

Again, as I mentioned in the beginning, there are lots and lots of nuances and 
mechanisms in play. This is a reasonable basic explanation of the mechanism, 
based on the above slide deck and on reading the kernel code.
One thing that can very easily be misleading is that memory is not a general, 
system-wide pool, but is instead separated into zones. This might lead to a 
situation where there is still memory available for reclaim system-wide, but 
not in the zone the process is scanning. The system might thus seem to run out 
of memory and trigger the OOM killer when there still is memory, which can be 
very confusing if you're not aware of these details.
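
The per-zone view can be made visible with /proc/buddyinfo, which lists the 
number of free blocks of each order for every zone. The sketch below converts 
those counts into free pages per zone; the function name and the sample 
numbers are made up for illustration.

```python
def free_pages_per_zone(buddyinfo_text):
    """Return {(node, zone): free_pages} from /proc/buddyinfo-style text."""
    result = {}
    for line in buddyinfo_text.splitlines():
        fields = line.replace(",", " ").split()
        # Expected shape: Node <n> zone <name> <count order 0> <count order 1> ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        counts = [int(c) for c in fields[4:]]
        # A free block of order k covers 2**k contiguous pages.
        pages = sum(count << order for order, count in enumerate(counts))
        result[(int(fields[1]), fields[3])] = pages
    return result

# Illustrative sample only; the numbers are invented.
SAMPLE_BUDDYINFO = """\
Node 0, zone      DMA      1      1      1      0      0      0      0      0      0      0      0
Node 0, zone   Normal      4      2      0      0      0      0      0      0      0      0      1
"""

for zone, pages in free_pages_per_zone(SAMPLE_BUDDYINFO).items():
    print(zone, pages)
```

A zone showing very few free pages while others have plenty is the kind of 
situation described above.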

I have read, experimented, searched, tested and diagnosed a lot of issues. 
And this is what I have come up with, which does fit the kernel code and the 
documentation that I trust.

Based on these mechanisms, and especially for database systems, removing swap 
takes away a mechanism that has no benefit for database systems on modern, 
high-memory machines.
That does not mean swap is not beneficial in other cases. If memory usage is 
very dynamic, memory is more constrained, and the operation is less latency 
sensitive, it might be a good idea to have an overflow, with all the downsides 
that it brings.

Frits Hoogland

> On 7 Aug 2025, at 03:12, Joe Conway <[email protected]> wrote:
> 
> On 8/6/25 17:14, Frits Hoogland wrote:
>>> As I said, do not disable swap. You don't need a huge amount, but
>>> maybe 16 GB or so would do it.
> 
>> Joe, please, can you state a technical reason for saying this?
>> All you are saying is ‘don’t do this’.
>> I’ve stated my reasons for why this doesn’t make sense, and you don’t give 
>> any reason.
> 
> What do you call the below?
> 
>>> On 6 Aug 2025, at 18:33, Joe Conway <[email protected]> wrote:
> 
>>> * Swap is what is used when anonymous memory must be reclaimed to
>>> allow for an allocation of anonymous memory.
>>> * The Linux kernel will aggressively use all available memory for
>>> file buffers, pushing usage against the limits.
>>> * Especially in the older 4 series kernels, file buffers often
>>> cannot be reclaimed fast enough
>>> * With no swap and a large-ish anonymous memory request, it is
>>> easy to push over the limit to cause the OOM killer to strike.
>>> * On the other hand, with swap enabled anon memory can be
>>> reclaimed giving the kernel more time to deal with file buffer
>>> reclamation.
>>> At least that is what I have observed.
> 
> If you don't think that is adequate technical reason, feel free to ignore my 
> advice.
> 
> -- 
> Joe Conway
> PostgreSQL Contributors Team
> Amazon Web Services: https://aws.amazon.com
