[prometheus-users] Can Prometheus compaction take a hit if there are long running queries on head block?

amar Sat, 21 Jan 2023 06:09:42 -0800

Hello All,

We are running into Prometheus OOM issues with the following setup.

1. Prometheus (2.19) running on K8s with a Thanos sidecar.
2. CPU/Memory: 1 core / 60GB
3. Retention: 1w/75GB
4. Head block with 6 to 7M active time series. Earlier we used to have 200k
to 300k, but a recent change pushed us into this scenario.

Prometheus keeps getting restarted due to OOM. So far, these are
our findings:
1. Compaction is not happening even after tsdb.min-block-duration (2h by
default) has elapsed. Sometimes it fails, leaving *.tmp files behind. Changing
tsdb.min-block-duration and tsdb.max-block-duration is not recommended because
we are running the Thanos sidecar.
2. The WAL keeps growing because compaction is not happening.
3. Replaying the WAL takes 5 to 10 minutes because of the frequent restarts
and the large accumulation of events.
4. Queries from alerting rules running against the Prometheus TSDB are
timing out. In theory, it looks like the memory-mapped chunks on
disk are being loaded into memory, causing the OOM.
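
For anyone who wants to check for the same symptoms (findings 1 and 2), here
is a minimal Python sketch that reports the size of the wal directory and
lists leftover entries with ".tmp" in their name. The data directory path
"/prometheus" is only a placeholder for wherever --storage.tsdb.path points,
and matching on ".tmp" is my assumption about how the failed compaction
output is named, based on what we saw on disk.

```python
from __future__ import annotations

import sys
from pathlib import Path


def dir_size_bytes(path: Path) -> int:
    """Total size of all regular files under path (0 if path is missing)."""
    if not path.is_dir():
        return 0
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def leftover_tmp_entries(data_dir: Path) -> list[str]:
    """Names of top-level entries in data_dir whose name contains '.tmp'."""
    if not data_dir.is_dir():
        return []
    return sorted(p.name for p in data_dir.iterdir() if ".tmp" in p.name)


def summarize(data_dir: Path) -> dict:
    """Collect the WAL size and the leftover temporary entries."""
    return {
        "wal_bytes": dir_size_bytes(data_dir / "wal"),
        "tmp_entries": leftover_tmp_entries(data_dir),
    }


if __name__ == "__main__":
    # "/prometheus" is a placeholder; pass the real data directory as argv[1].
    data_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("/prometheus")
    print(summarize(data_dir))
```

Running it between restarts should show whether the WAL is still growing and
whether new temporary entries appear after each failed compaction.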

Could you please help me with the questions below:
1. Since the data is not being compacted into TSDB blocks, I believe
the alerting rules are running against the head block (which has the
memory-mapped chunks), which is in memory. Initially we had memory set to
30GB and then raised it to 60GB. I don't think we have data beyond that limit
even with 6M to 7M active time series. Why would memory keep growing and
cause an OOM? (Assume we are not scraping anything once we have 6 to 7M
active time series, and the WAL is stored on disk.)
2. We haven't enabled debug logging, and there are no traces in the log
that explain why compaction is not happening when the alerting rules
are timing out.
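
To put rough numbers on question 1, the sketch below divides the memory limit
by the number of active series to get the budget available per series. It is
plain arithmetic on the figures above and says nothing about what Prometheus
actually allocates per series or per query. The function name is made up for
illustration, and I am assuming GB means 10^9 bytes.

```python
GB = 10**9  # assumption: decimal gigabytes


def per_series_budget_bytes(memory_limit_bytes: float, active_series: int) -> float:
    """Memory available per active series if the limit were split evenly."""
    if active_series <= 0:
        raise ValueError("active_series must be positive")
    return memory_limit_bytes / active_series


if __name__ == "__main__":
    # Limits and series counts taken from the setup described above.
    for limit_gb in (30, 60):
        for series in (6_000_000, 7_000_000):
            budget = per_series_budget_bytes(limit_gb * GB, series)
            print(f"{limit_gb} GB / {series:,} series = {budget:,.0f} bytes per series")
```

This budget has to cover everything, including WAL replay and the alerting
rule queries, which is why I am unsure where the memory is going.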

The issue was resolved after we stopped the alerting rules and stopped
scraping new metrics.
Please excuse my typos.

Thanks,
Amar

