[prometheus-users] Relation between the WAL files and chunk_snapshot directories

hartfordfive Fri, 09 Jun 2023 14:27:36 -0700

I've recently encountered an issue with a Prometheus instance where it
started experiencing failed compactions. For some additional context, I'm
running Prometheus in Kubernetes as a StatefulSet with a replica count of 2.

I quickly realized the compactions were failing on replica #1 as there was
no space left on the disk. The unusual part is that replica #2 was running
fine and was using only approx. 40% of the disk. The number of series in
the head block was mostly identical, so that led me to inspect the
contents on the disk.

Upon a more detailed inspection, I noticed the following:

1. Replica #1 had many more *chunk_snapshot.*.tmp* directories compared
to replica #2
2. Replica #1 had a much larger number of files in the "*wal/*" directory
compared to replica #2.
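
For anyone wanting to make the same comparison between replicas, the two
observations above can be gathered with a small read-only script. This is a
minimal sketch in Python, not part of any Prometheus tooling: the function
name summarize_data_dir is made up for illustration, the data directory path
is whatever your pod mounts, and it assumes the leftover directories match
chunk_snapshot.*.tmp and that WAL segments are regular files directly under
wal/.

```python
import sys
from pathlib import Path


def summarize_data_dir(data_dir):
    """Count leftover snapshot temp dirs and WAL files under data_dir (read-only)."""
    root = Path(data_dir)
    # Assumption: leftover snapshot directories match chunk_snapshot.*.tmp.
    tmp_snapshots = [p for p in root.glob("chunk_snapshot.*.tmp") if p.is_dir()]
    wal_dir = root / "wal"
    # Only regular files directly under wal/ are counted; subdirectories are skipped.
    wal_files = [p for p in wal_dir.iterdir() if p.is_file()] if wal_dir.is_dir() else []
    return {
        "tmp_snapshot_dirs": len(tmp_snapshots),
        "wal_files": len(wal_files),
        "wal_bytes": sum(p.stat().st_size for p in wal_files),
    }


if __name__ == "__main__":
    # Usage: python summarize.py <path-to-prometheus-data-dir>
    print(summarize_data_dir(sys.argv[1]))
```

Running it against the data directory of each replica and comparing the two
results should make a difference like the one described here easy to spot.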

Based on what I understand, the chunk_snapshot directories are
automatically deleted upon the successful completion of a snapshot (see source
code
<https://github.com/prometheus/prometheus/blob/v2.44.0/tsdb/head_wal.go#L1051-L1062>).

I assume these were left over because the Prometheus instance was in some
kind of state where the snapshots would be started but never successfully
completed, thus never deleted.

Secondly, if my understanding is correct (based on this
<https://www.robustperception.io/new-features-in-prometheus-2-18-0/> and
this <https://github.com/prometheus/prometheus/pull/7098>), the files in
the "wal/" directory usually shouldn't be older than 3 hours. This seems
to line up with what I've observed on my other running instances, as none
of them had WAL files older than 3 hours. In my case, replica #1 had many
WAL files older than 3 hours, which combined accounted for the large
majority of the disk usage on that pod. Given my observations, I made the
decision to take the following steps, which appear to have essentially
brought the disk usage back to the same level as replica #2:

1. delete all chunk_snapshot directories older than 5 days
2. delete all files in the wal directory which are older than 3 hours
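
The selection behind those two steps can be sketched as a dry run that only
lists candidates and deletes nothing. This is an illustrative Python sketch
under stated assumptions: stale_candidates is a hypothetical helper, age is
judged by file modification time (mtime), and the 5-day and 3-hour cutoffs
are simply the ones I used above. It says nothing about whether removing the
listed paths is safe, which is exactly the question I'm asking.

```python
import time
from pathlib import Path


def stale_candidates(data_dir, snapshot_max_age_s=5 * 24 * 3600,
                     wal_max_age_s=3 * 3600, now=None):
    """List (without deleting) old chunk_snapshot dirs and old files under wal/."""
    now = time.time() if now is None else now
    root = Path(data_dir)
    # Assumption: "older than" means the mtime is further back than the cutoff.
    snapshots = sorted(
        p for p in root.glob("chunk_snapshot.*")
        if p.is_dir() and now - p.stat().st_mtime > snapshot_max_age_s
    )
    wal_dir = root / "wal"
    wal_files = []
    if wal_dir.is_dir():
        # Only regular files directly under wal/; subdirectories are left alone.
        wal_files = sorted(
            p for p in wal_dir.iterdir()
            if p.is_file() and now - p.stat().st_mtime > wal_max_age_s
        )
    return snapshots, wal_files
```

Printing both returned lists before touching anything gives a chance to
review exactly what would be removed.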

I would like to gain a better understanding of the relation between the
chunk_snapshot directories and the files in the wal directory. I would
also like to better understand any risks involved in deleting old
*chunk_snapshot.*.tmp* directories as well as old wal files beyond 3 hours
of age.

I appreciate any help.

Thank you

