Not answering your question, but just pointing out that Prometheus 2.19 was 
released on 9 June 2020, so it is now over two and a half years old.

Hence, before reporting a performance problem, I suggest you upgrade to 
something newer and see if the problem disappears.  Version 2.37.x is the 
current long-term-support release 
<https://prometheus.io/docs/introduction/release-cycle/> (although it is 
approaching the end of its committed support window, so I would expect 
another branch to be promoted to LTS soon).
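
Independently of the upgrade, it is worth watching the TSDB metrics Prometheus 
exposes about itself to see whether the head and WAL really keep growing. A few 
standard self-monitoring expressions (the `job="prometheus"` label is an 
assumption about your scrape config; adjust it to match your setup):

```
# Active series in the head block - should plateau, not climb indefinitely
prometheus_tsdb_head_series

# Compaction attempts vs. failures over the last day - failures here would
# explain the leftover *.tmp files and the growing WAL
increase(prometheus_tsdb_compactions_total[1d])
increase(prometheus_tsdb_compactions_failed_total[1d])

# Resident memory of the Prometheus process itself
process_resident_memory_bytes{job="prometheus"}
```

If the compaction failure counter is increasing, the Prometheus log (even at 
the default log level) normally records the reason for each failed compaction.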

On Saturday, 21 January 2023 at 14:09:39 UTC amar wrote:

> Hello All,
>
> We have the following situation and got hit by Prometheus OOM issues.
>
> 1. Prometheus(2.19)  running on K8s with thanos side car.
> 2. CPU/Memory: 1 core/ 60GB
> 3. Retention: 1w/75GB
> 4. Head block with 6 to 7M active time series. Earlier we used to have 200k 
> to 300k, but due to a recent change we have hit this scenario.
>
> Prometheus is continuously getting restarted due to OOM. So far, below are 
> our findings:
> 1. Compaction is not happening even after tsdb.min-block-duration (2h by 
> default). Sometimes it fails, leaving *.tmp files behind. Changing 
> tsdb.min-block-duration and tsdb.max-block-duration is not recommended as 
> we are running the Thanos sidecar. 
> 2. The WAL keeps growing because compaction is not happening.
> 3. Replaying the WAL takes 5 to 10 minutes, due to the frequent restarts 
> and the large accumulation of events.
> 4. We have queries from alerting rules running against the Prometheus TSDB 
> which are timing out. Theoretically, it looks like the memory-mapped chunks 
> are being loaded from disk into memory, causing OOM.
>
> Could you please help me understand the queries below:
> 1. Since the data is not compacted into TSDB blocks, I believe the alerting 
> rules are running against the head block (which has the memory-mapped 
> chunks), which is in memory. Initially we had memory set to 30GB and 
> raised it to 60GB. I don't think we have data beyond that limit even with 
> 6 to 7M active time series. Why would the memory keep growing, causing 
> OOM? (Consider that we are not scraping anything new after reaching 6 to 
> 7M active time series, and the WAL is stored on disk.)
> 2. We haven't enabled debug logging, and at least there are no traces in 
> the logs to explain why compaction is not happening while the alerting 
> rules are timing out.
>
> The issue was resolved after we stopped the alerting rules and stopped 
> scraping new metrics.
> Please excuse my typos.
>
> Thanks,
> Amar
>
>
>
