Yes, it keeps crashing with OOM, once every 10 to 15 minutes. Unfortunately, this time too someone deleted those files to recover Prometheus, but it still crashes and auto-restarts.
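To put a number on how often the kills happen, the kernel log lines that follow can be parsed for the gap between consecutive OOM kills. This is only a minimal sketch: the function names `oom_kill_times` and `gaps_in_seconds` are made up for illustration, the year is an assumption (syslog timestamps do not carry one), and the month abbreviation is parsed with `%b`, which assumes an English/C locale.

```python
import re
from datetime import datetime

# Matches lines such as:
#   Feb 18 05:49:18 kernel: Out of memory: Kill process 61845 (prometheus) score 844 or sacrifice child
# An optional hostname between the timestamp and "kernel:" is tolerated.
OOM_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?:\S+ )?kernel: "
    r"Out of memory: Kill process (?P<pid>\d+) \((?P<name>[^)]+)\)"
)


def oom_kill_times(text, process="prometheus", year=2022):
    """Return the timestamps of OOM kills of `process` found in `text`."""
    times = []
    for match in OOM_RE.finditer(text):
        if match.group("name") != process:
            continue
        # Collapse the double space syslog uses for single-digit days.
        ts = " ".join(match.group("ts").split())
        # The year is an assumption: the log lines do not include one.
        times.append(datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the seconds elapsed between each pair of consecutive kills."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

The text passed in could be, for example, saved kernel log output; `gaps_in_seconds(oom_kill_times(text))` then gives one number per restart cycle, which is easier to compare across days than reading the timestamps by eye.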
Feb 18 05:49:18 kernel: Out of memory: Kill process 61845 (prometheus) score 844 or sacrifice child
Feb 18 05:52:26 kernel: Out of memory: Kill process 63185 (prometheus) score 844 or sacrifice child
Feb 18 05:55:47 kernel: Out of memory: Kill process 64500 (prometheus) score 844 or sacrifice child
Feb 18 05:58:51 kernel: Out of memory: Kill process 875 (prometheus) score 844 or sacrifice child
Feb 18 05:58:51 kernel: Out of memory: Kill process 1754 (prometheus) score 844 or sacrifice child
Feb 18 06:02:05 kernel: Out of memory: Kill process 2328 (prometheus) score 845 or sacrifice child
Feb 18 06:05:39 kernel: Out of memory: Kill process 3155 (prometheus) score 844 or sacrifice child
Feb 18 06:09:06 kernel: Out of memory: Kill process 5273 (prometheus) score 845 or sacrifice child
Feb 18 06:12:24 kernel: Out of memory: Kill process 6549 (prometheus) score 844 or sacrifice child
Feb 18 06:15:29 kernel: Out of memory: Kill process 6756 (prometheus) score 845 or sacrifice child
Feb 18 06:18:28 kernel: Out of memory: Kill process 8474 (prometheus) score 844 or sacrifice child
Feb 18 06:21:36 kernel: Out of memory: Kill process 8649 (prometheus) score 845 or sacrifice child
Feb 18 06:24:41 kernel: Out of memory: Kill process 9708 (prometheus) score 844 or sacrifice child
Feb 18 06:27:52 kernel: Out of memory: Kill process 11003 (prometheus) score 844 or sacrifice child
Feb 18 06:30:50 kernel: Out of memory: Kill process 11189 (prometheus) score 844 or sacrifice child
Feb 18 06:33:47 kernel: Out of memory: Kill process 12210 (prometheus) score 844 or sacrifice child

On Thursday, February 17, 2022 at 5:09:54 PM UTC-5 Brian Candler wrote:

> Now would be a good time to do:
>
> ls -l /var/lib/prometheus/data/chunks_head/
> du -sck /var/lib/prometheus/data/chunks_head/*
>
> My suspicion is your out-of-memory condition is messing up the writing of
> chunks. Are you using cgroups/containers?
>
> Also, is prometheus continually crashing and being restarted by systemd?
> Try looking in "journalctl -eu prometheus". That might explain why you see
> lots of free memory most of the time (when prometheus is stopped).
>
> On Thursday, 17 February 2022 at 14:57:25 UTC Senthil wrote:
>
>> The issue started again.
>>
>> 629G chunks_head
>> 0 lock
>> 4.0K queries.active
>> 9.3G wal
>>
>> There is numerous restart of Prometheus
>> Feb 17 09:02:02 kernel: Out of memory: Kill process 36580 (prometheus) score 844 or sacrifice child
>> Feb 17 09:08:36 kernel: Out of memory: Kill process 39001 (prometheus) score 846 or sacrifice child
>> Feb 17 09:16:02 kernel: Out of memory: Kill process 41074 (prometheus) score 845 or sacrifice child
>> Feb 17 09:22:17 kernel: Out of memory: Kill process 44665 (prometheus) score 844 or sacrifice child
>> Feb 17 09:29:25 kernel: Out of memory: Kill process 47234 (prometheus) score 844 or sacrifice child
>> Feb 17 09:36:06 kernel: Out of memory: Kill process 48970 (prometheus) score 846 or sacrifice child
>> Feb 17 09:43:21 kernel: Out of memory: Kill process 50661 (prometheus) score 844 or sacrifice child
>>
>> but there is plenty of mem available in the servers.
>>
>>               total        used        free      shared  buff/cache   available
>> Mem:             47           5          31           0          10          40
>> Swap:             5           1           3
>> Total:           52           7          35
>>
>> On Tuesday, February 1, 2022 at 5:21:32 PM UTC-5 Brian Candler wrote:
>>
>>> On Tuesday, 1 February 2022 at 21:52:30 UTC Senthil wrote:
>>>
>>>> I started on Jan 31, so it's a day.
>>>>
>>>> # du -sck chunks_head/*
>>>> 54140 chunks_head/024326
>>>> 4 chunks_head/024327
>>>> 54144 total
>>>>
>>>
>>> That's perfectly reasonable: it's only 54MB (which is a long way from
>>> 689GB!)
>>>
>>> Here's what I see on a moderately busy system:
>>>
>>> root@ldex-prometheus:~# du -sck /var/lib/prometheus/data/chunks_head/*
>>> 81004 /var/lib/prometheus/data/chunks_head/006831
>>> 77824 /var/lib/prometheus/data/chunks_head/006832
>>> 158828 total
>>>
>>> That's comparable to yours.
>>>
>>> Therefore, I think you need to keep an eye on this periodically. If
>>> only you had a monitoring system which could do this for you :-)
>>>
>>> If it does start to rise, that's when you'll need to check prometheus
>>> log output and find out what's happening. But this is very strange, and it
>>> does seem to be something specific to your system.
>>>
>>

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/0ea90196-5fca-46f5-ae06-373c42abb410n%40googlegroups.com.

