Not answering your question directly, but just pointing out that Prometheus 2.19 was released on 9 June 2020, so it is now over two and a half years old.
Hence, before reporting a performance problem, I suggest you upgrade to something newer and see whether the problem disappears. Version 2.37.x is the current long-term-support release <https://prometheus.io/docs/introduction/release-cycle/> (although it is approaching the end of its committed support window, so I would expect another release branch to be promoted to LTS soon).

On Saturday, 21 January 2023 at 14:09:39 UTC amar wrote:

> Hello All,
>
> We have the following situation and got hit by Prometheus OOM issues:
>
> 1. Prometheus (2.19) running on K8s with a Thanos sidecar.
> 2. CPU/Memory: 1 core / 60GB.
> 3. Retention: 1w/75GB.
> 4. Head block with 6 to 7M active time series. Earlier we used to have
> 200k to 300k, but due to a recent change we hit this scenario.
>
> Prometheus is continuously getting restarted due to OOM. So far, below
> are our findings:
>
> 1. Compaction is not happening even after tsdb.min-block-duration (2h
> by default). Sometimes it fails, leaving *.tmp files behind. Changing
> tsdb.min-block-duration and tsdb.max-block-duration is not recommended
> as we are running the Thanos sidecar.
> 2. The WAL keeps growing because compaction is not happening.
> 3. Replaying the WAL takes 5 to 10 minutes due to the frequent restarts
> and the large accumulation of events.
> 4. Queries from alerting rules running against the Prometheus TSDB are
> timing out. Theoretically, it looks like the memory-mapped chunks are
> being loaded from disk into memory, causing OOM.
>
> Could you please help me understand the following:
>
> 1. Since the data is not compacted into TSDB blocks, I believe the
> alerting rules are running on the head block (which holds the
> memory-mapped chunks) in memory. Initially we had memory set to 30GB
> and raised it to 60GB. I don't think we have data beyond that limit,
> even with 6 to 7M active time series.
> Why would the memory keep growing, causing OOM? (Consider that we are
> not scraping anything new after reaching 6 to 7M active time series,
> and the WAL is stored on disk.)
> 2. We haven't enabled debug logging, and there is at least no trace in
> the logs explaining why compaction is not happening while the alerting
> rules are timing out.
>
> The issue was resolved after we stopped the alerting rules and stopped
> scraping new metrics.
>
> Please excuse my typos.
>
> Thanks,
> Amar

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/d5faa8a2-7a98-4bd4-bbad-3079206651a4n%40googlegroups.com.
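
P.S. Once you are on a newer version, it may also be worth tracking down where the jump from ~300k to ~7M active series came from. As a hedged sketch (standard PromQL against the instance itself; note the all-series selector is expensive on a large head, so run it ad hoc rather than in a recording or alerting rule):

```promql
# Top 10 metric names by number of series currently in the head
# (expensive on a large head -- run ad hoc, not in a rule)
topk(10, count by (__name__) ({__name__=~".+"}))

# Built-in gauge of active head series; graph it over your retention
# window to see exactly when the series count exploded
prometheus_tsdb_head_series
```

Breaking the count down by `job` instead of `__name__` is another cheap way to identify which scrape target introduced the extra cardinality.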

