Hello All,

We are running the setup below and have been hit by Prometheus OOM issues.

1. Prometheus (2.19) running on K8s with the Thanos sidecar.
2. CPU/Memory: 1 core / 60GB
3. Retention: 1w/75GB
4. Head block with 6M to 7M active time series. We used to have 200k to 
300k, but a recent change pushed us into this scenario.

Prometheus is continuously getting restarted due to OOM. So far, below are 
our findings:
1. Compaction is not happening even after tsdb.min-block-duration (2h by 
default) has elapsed. Sometimes it fails, leaving *.tmp files behind. 
Changing tsdb.min-block-duration and tsdb.max-block-duration is not 
recommended because we are running the Thanos sidecar.
2. The WAL keeps growing because compaction is not happening.
3. Replaying the WAL takes 5 to 10 minutes on each of the frequent restarts 
because of the large accumulation of events.
4. We have queries from alerting rules running against the Prometheus TSDB 
which are timing out. In theory, it looks like the memory-mapped chunks are 
getting loaded from disk into memory, causing the OOM.
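
Findings 1 and 2 can be checked directly on disk. Below is a minimal Python 
sketch, not an official tool: it lists leftover entries whose names end in 
".tmp" and sums the size of the wal/ subdirectory. The function name 
inspect_tsdb_dir and the path you pass to it are assumptions for 
illustration; point it at whatever --storage.tsdb.path is set to.

```python
import os
from pathlib import Path


def inspect_tsdb_dir(data_dir):
    """Return (tmp_entries, wal_bytes) for a Prometheus data directory.

    data_dir is an assumption: it should be the directory that
    --storage.tsdb.path points to in your deployment.
    """
    root = Path(data_dir)

    # Entries whose names end in ".tmp", as left behind by failed compactions.
    tmp_entries = sorted(p.name for p in root.iterdir() if p.name.endswith(".tmp"))

    # Total size of all regular files under the wal/ subdirectory.
    wal_bytes = 0
    wal_dir = root / "wal"
    if wal_dir.is_dir():
        for dirpath, _dirnames, filenames in os.walk(wal_dir):
            for name in filenames:
                wal_bytes += (Path(dirpath) / name).stat().st_size

    return tmp_entries, wal_bytes
```

Calling inspect_tsdb_dir on the data directory a few times between restarts 
would show whether the WAL size keeps climbing and whether new *.tmp entries 
appear after each failed compaction.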

Could you please help me with the questions below:
1. Since the data is not being compacted into TSDB blocks, I believe the 
alerting rules are running against the head block (which has the 
memory-mapped chunks), which is in memory. Initially we had the memory 
limit set to 30GB and then raised it to 60GB. I don't think we have more 
data than that, even with 6M to 7M active time series. Why would the memory 
keep growing and cause an OOM? (Assume we are not scraping anything once we 
have 6M to 7M active time series, and the WAL is stored on disk.)
2. We haven't enabled debug logging, and at the current log level there are 
no traces in the log that explain why compaction is not happening while the 
alerting rules are timing out.
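
To put question 1 in numbers, here is a rough back-of-the-envelope sketch in 
Python of the average memory budget per active series at the two limits we 
tried. It only divides the figures quoted above; it assumes "GB" means GiB 
and that the whole limit is available to the head block, which is 
optimistic. The helper name bytes_per_series is made up for this example.

```python
GIB = 1024 ** 3  # assumption: the "GB" limits above are really GiB


def bytes_per_series(memory_limit_gib, active_series):
    """Average bytes available per active series if the whole limit went to the head."""
    return memory_limit_gib * GIB / active_series


# Print the per-series budget for both memory limits and both ends of the
# 6M to 7M active series range.
for limit_gib in (30, 60):
    for series in (6_000_000, 7_000_000):
        budget_kib = bytes_per_series(limit_gib, series) / 1024
        print(f"{limit_gib} GiB / {series:,} series = {budget_kib:.1f} KiB per series")
```

Whatever each series actually costs, that per-series budget would also have 
to cover WAL replay and the alerting rule queries, which is the part we 
cannot account for.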

The issue was resolved after we stopped the alerting rules and stopped 
scraping new metrics.
Please excuse my typos.

Thanks,
Amar

