Yes, it keeps crashing with OOM kills, roughly once every 10 to 15 minutes. 
Unfortunately, this time someone again deleted those files to recover 
Prometheus, and it still crashes and auto-restarts.

Feb 18 05:49:18 kernel: Out of memory: Kill process 61845 (prometheus) score 844 or sacrifice child
Feb 18 05:52:26 kernel: Out of memory: Kill process 63185 (prometheus) score 844 or sacrifice child
Feb 18 05:55:47 kernel: Out of memory: Kill process 64500 (prometheus) score 844 or sacrifice child
Feb 18 05:58:51 kernel: Out of memory: Kill process 875 (prometheus) score 844 or sacrifice child
Feb 18 05:58:51 kernel: Out of memory: Kill process 1754 (prometheus) score 844 or sacrifice child
Feb 18 06:02:05 kernel: Out of memory: Kill process 2328 (prometheus) score 845 or sacrifice child
Feb 18 06:05:39 kernel: Out of memory: Kill process 3155 (prometheus) score 844 or sacrifice child
Feb 18 06:09:06 kernel: Out of memory: Kill process 5273 (prometheus) score 845 or sacrifice child
Feb 18 06:12:24 kernel: Out of memory: Kill process 6549 (prometheus) score 844 or sacrifice child
Feb 18 06:15:29 kernel: Out of memory: Kill process 6756 (prometheus) score 845 or sacrifice child
Feb 18 06:18:28 kernel: Out of memory: Kill process 8474 (prometheus) score 844 or sacrifice child
Feb 18 06:21:36 kernel: Out of memory: Kill process 8649 (prometheus) score 845 or sacrifice child
Feb 18 06:24:41 kernel: Out of memory: Kill process 9708 (prometheus) score 844 or sacrifice child
Feb 18 06:27:52 kernel: Out of memory: Kill process 11003 (prometheus) score 844 or sacrifice child
Feb 18 06:30:50 kernel: Out of memory: Kill process 11189 (prometheus) score 844 or sacrifice child
Feb 18 06:33:47 kernel: Out of memory: Kill process 12210 (prometheus) score 844 or sacrifice child

On Thursday, February 17, 2022 at 5:09:54 PM UTC-5 Brian Candler wrote:

> Now would be a good time to do:
>
> ls -l /var/lib/prometheus/data/chunks_head/
> du -sck /var/lib/prometheus/data/chunks_head/*
>
> My suspicion is your out-of-memory condition is messing up the writing of 
> chunks.  Are you using cgroups/containers?
>
> Also, is prometheus continually crashing and being restarted by systemd? 
> Try looking in "journalctl -eu prometheus".  That might explain why you see 
> lots of free memory most of the time (when prometheus is stopped).
>
> On Thursday, 17 February 2022 at 14:57:25 UTC Senthil wrote:
>
>> The issue started again. 
>>
>> 629G    chunks_head
>> 0       lock
>> 4.0K    queries.active
>> 9.3G    wal
>>
>> There are numerous restarts of Prometheus:
>> Feb 17 09:02:02 kernel: Out of memory: Kill process 36580 (prometheus) score 844 or sacrifice child
>> Feb 17 09:08:36 kernel: Out of memory: Kill process 39001 (prometheus) score 846 or sacrifice child
>> Feb 17 09:16:02 kernel: Out of memory: Kill process 41074 (prometheus) score 845 or sacrifice child
>> Feb 17 09:22:17 kernel: Out of memory: Kill process 44665 (prometheus) score 844 or sacrifice child
>> Feb 17 09:29:25 kernel: Out of memory: Kill process 47234 (prometheus) score 844 or sacrifice child
>> Feb 17 09:36:06 kernel: Out of memory: Kill process 48970 (prometheus) score 846 or sacrifice child
>> Feb 17 09:43:21 kernel: Out of memory: Kill process 50661 (prometheus) score 844 or sacrifice child
>>
>> but there is plenty of memory available on the servers.
>>
>>               total        used        free      shared  buff/cache   available
>> Mem:             47           5          31           0          10          40
>> Swap:             5           1           3
>> Total:           52           7          35
>>
>> On Tuesday, February 1, 2022 at 5:21:32 PM UTC-5 Brian Candler wrote:
>>
>>> On Tuesday, 1 February 2022 at 21:52:30 UTC Senthil wrote:
>>>
>>>> I started on Jan 31, so it's a day.
>>>>
>>>> # du -sck chunks_head/*
>>>> 54140   chunks_head/024326
>>>> 4       chunks_head/024327
>>>> 54144   total
>>>>
>>>
>>> That's perfectly reasonable: it's only 54MB (which is a long way from 
>>> 689GB!)
>>>
>>> Here's what I see on a moderately busy system:
>>>
>>> root@ldex-prometheus:~# du -sck /var/lib/prometheus/data/chunks_head/*
>>> 81004        /var/lib/prometheus/data/chunks_head/006831
>>> 77824        /var/lib/prometheus/data/chunks_head/006832
>>> 158828        total
>>>
>>> That's comparable to yours.
>>>
>>> Therefore, I think you need to keep an eye on this periodically.  If 
>>> only you had a monitoring system which could do this for you :-)
>>>
>>> If it does start to rise, that's when you'll need to check prometheus 
>>> log output and find out what's happening.  But this is very strange, and it 
>>> does seem to be something specific to your system.
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/0ea90196-5fca-46f5-ae06-373c42abb410n%40googlegroups.com.
