On 2022-Apr-29, at 13:41, Pete Wright <p...@nomadlogic.org> wrote:
> 
> On 4/29/22 11:38, Mark Millard wrote:
>> On 2022-Apr-29, at 11:08, Pete Wright <p...@nomadlogic.org> wrote:
>> 
>>> On 4/23/22 19:20, Pete Wright wrote:
>>>>> The developers handbook has a section debugging deadlocks that he
>>>>> referenced in a response to another report (on freebsd-hackers).
>>>>> 
>>>>> https://docs.freebsd.org/en/books/developers-handbook/kerneldebug/#kerneldebug-deadlocks
>>>> d'oh - thanks for the correction!
>>>> 
>>>> -pete
>>>> 
>>>> 
>>> hello, i just wanted to provide an update on this issue. so the good news
>>> is that by removing the file backed swap the deadlocks have indeed gone
>>> away! thanks for sorting me out on that front Mark!
>> Glad it helped.
> 
> d'oh - went out for lunch and workstation locked up. i *knew* i shouldn't
> have said anything lol.
Any interesting console messages ( or dmesg -a or /var/log/messages )?

>>> i still am seeing a memory leak with either firefox or chrome (maybe both
>>> where they create a voltron of memory leaks?). this morning firefox and
>>> chrome had been killed when i first logged in. fortunately the system has
>>> remained responsive for several hours which was not the case previously.
>>> 
>>> when looking at my metrics i see vm.domain.0.stats.inactive take a nose
>>> dive from around 9GB to 0 over the course of 1min. the timing seems to
>>> align with around the time when firefox crashed, and is preceded by a
>>> large spike in vm.domain.0.stats.active from ~1GB to 7GB 40mins before the
>>> apps crashed. after the binaries were killed memory metrics seem to have
>>> recovered (laundry size grew, and inactive size grew by several gigs for
>>> example).
>> Since the form of kill here is tied to sustained low free memory
>> ("failed to reclaim memory"), you might want to report the
>> vm.domain.0.stats.free_count figures from various time frames as
>> well:
>> 
>> vm.domain.0.stats.free_count: Free pages
>> 
>> (It seems you are converting pages to byte counts in your report,
>> the units I'm not really worried about so long as they are
>> obvious.)
>> 
>> There are also figures possibly tied to the handling of the kill
>> activity but some being more like thresholds than usage figures,
>> such as:
>> 
>> vm.domain.0.stats.free_severe: Severe free pages
>> vm.domain.0.stats.free_min: Minimum free pages
>> vm.domain.0.stats.free_reserved: Reserved free pages
>> vm.domain.0.stats.free_target: Target free pages
>> vm.domain.0.stats.inactive_target: Target inactive pages
> ok thanks Mark, based on this input and the fact i did manage to lock up my
> system, i'm going to get some metrics up on my website and share them
> publicly when i have time. i'll definitely take your input into account when
> sharing this info.
> 
>> 
>> Also, what value were you using for:
>> 
>> vm.pageout_oom_seq
> $ sysctl vm.pageout_oom_seq
> vm.pageout_oom_seq: 120
> $

Without knowing vm.domain.0.stats.free_count it is hard to tell, but
you might try, say, sysctl vm.pageout_oom_seq=12000 in hopes of getting
notably more time with the vm.domain.0.stats.free_count staying small.
That may give you more time to notice the low free RAM (if you are
checking periodically, rather than just waiting for failure to make it
obvious).

===
Mark Millard
marklmi at yahoo.com
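
For the kind of periodic checking Mark describes, a minimal sketch of a
watcher script is below. The one-minute interval, the log path, and the
reuse of the 12000 value are only illustrative choices (not anything
prescribed in the thread), and the sysctl write needs root:

#!/bin/sh
# Give the page-out OOM logic more patience before it kills anything
# (runtime setting; the same line in /etc/sysctl.conf persists it across
# reboots). The 12000 figure is just the example value from the thread.
sysctl vm.pageout_oom_seq=12000

# Once a minute, append a timestamped snapshot of the free-page count and
# the related threshold values; multiplying the page counts by hw.pagesize
# converts them to bytes.
while true; do
    date
    sysctl hw.pagesize \
        vm.domain.0.stats.free_count \
        vm.domain.0.stats.free_severe \
        vm.domain.0.stats.free_min \
        vm.domain.0.stats.free_reserved \
        vm.domain.0.stats.free_target \
        vm.domain.0.stats.inactive_target
    sleep 60
done >> /var/tmp/vm-free-log.txt

The resulting log should make it easy to spot stretches where free_count
sits down near free_min/free_severe, which is the sustained low free
memory condition the "failed to reclaim memory" kills are tied to.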