Garrett Wollman <wollman_at_bimajority.org> wrote on Date: Tue, 09 Sep 2025 16:19:42 UTC :
> On some of our newer large-memory NFS servers, we are seeing services
> killed with "failed to reclaim memory". According to our monitoring,
> the server has >100G of physmem free at the time,

Was that 100G+ somewhat before any reclaiming of memory
started, in the lead-up to the notice? Any likelihood of sudden,
rapid, huge drops in free RAM based on workload behavior?

Some other figures of interest from the lead-up to the OOM
activity would be snapshots of the likes of top's:

Active, Inact, Laundry, Wired, and Free
(things in Buf also show up in the other categories)

Is NUMA involved?

> and the only
> solution seems to be rebooting. (There is a small amount of swap
> configured and even less of it in use.)

That swap is in use at all could be of interest. I wonder what
the system was doing when the swap was put to use, or when laundry
was growing in a way that led to swap being put to use.

> Does this sound familiar to
> anyone? What should we be monitoring that we evidently aren't now?

I'll note that you can delay the "failed to reclaim memory"
OOM activity via the use of the likes of:

# sysctl vm.pageout_oom_seq=120

FYI:

# sysctl -d vm.pageout_oom_seq
vm.pageout_oom_seq: back-to-back calls to oom detector to start OOM

The default is 12. Larger values give more delay by requiring
more attempts to meet the threshold involved before OOM is used.
No figure gives an unbounded delay so far as I know. (I do not
know anything about the "counts wrap" behavior.) But if the
conditions have a bounded duration, a larger vm.pageout_oom_seq
can fairly generally avoid OOM activity over that duration.

(Even just one thread can keep the Active memory so large as
to not meet the free RAM threshold(s) involved, even if swap
is unused.)

Someone might want to see some of the output from something
like:

# sysctl vm | grep -v "^vm\.uma\." | grep -e "\.v_" -e stats -e oom_seq | sort

from the lead-up to a "failed to reclaim memory".
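For capturing that sort of output over the lead-up time frame, something like the following sketch might do. The function names, the log path, and the interval and count defaults are my own illustrative assumptions, not anything provided by FreeBSD; only the sysctl/grep/sort pipeline is the one shown above.

```shell
#!/bin/sh
# Sketch only: names, paths, and defaults here are hypothetical.

# Filter "sysctl vm" style text arriving on stdin down to the
# counters of interest (same pipeline as the one-liner above).
vm_filter() {
    grep -v '^vm\.uma\.' | grep -e '\.v_' -e stats -e oom_seq | sort
}

# Append timestamped snapshots to a log file.
#   $1: log file path        (assumed default)
#   $2: seconds between runs (assumed default)
#   $3: number of snapshots  (assumed default)
vm_log_snapshots() {
    logfile=${1:-/var/tmp/vm-snapshots.log}
    interval=${2:-10}
    count=${3:-360}
    i=0
    while [ "$i" -lt "$count" ]; do
        {
            # UTC timestamp header so snapshots can be lined up
            # against the "failed to reclaim memory" console notice.
            printf '=== %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
            sysctl vm | vm_filter
        } >> "$logfile"
        i=$((i + 1))
        # Skip the sleep after the final snapshot.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (not run here): one snapshot every 5 seconds for an hour.
#   vm_log_snapshots /var/tmp/vm-snapshots.log 5 720
```

The log is bounded by the snapshot count, so it does not grow without limit if left running; a shorter interval gives better resolution on sudden drops in free RAM at the cost of a larger log.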
Having a larger vm.pageout_oom_seq can make it easier to observe
the lead-up time frame.

===
Mark Millard
marklmi at yahoo.com