I recently ran into an issue with a Prometheus instance whose compactions started failing. For context, I'm running Prometheus in Kubernetes as a StatefulSet with a replica count of 2.
I quickly realized that the compactions were failing on replica #1 because there was no space left on the disk. The unusual part is that replica #2 was running fine and using only about 40% of its disk. The number of series in the head block was nearly identical on both replicas, which led me to inspect the contents of the disk. On closer inspection, I noticed the following:

1. Replica #1 had many more chunk_snapshot.*.tmp directories than replica #2.
2. Replica #1 had a much larger number of files in the "wal/" directory than replica #2.

As far as I understand, the chunk_snapshot directories are deleted automatically once a snapshot completes successfully (see the source code: <https://github.com/prometheus/prometheus/blob/v2.44.0/tsdb/head_wal.go#L1051-L1062>). I assume these were left over because the Prometheus instance was in some state where snapshots were started but never completed successfully, and so were never deleted.

Secondly, if my understanding is correct (based on <https://www.robustperception.io/new-features-in-prometheus-2-18-0/> and <https://github.com/prometheus/prometheus/pull/7098>), the files in the "wal/" directory usually shouldn't be older than 3 hours. This lines up with what I see on my other running instances: none of them had WAL files older than 3 hours. Replica #1, however, had many WAL files older than 3 hours, which together accounted for the large majority of the disk usage on that pod.

Given these observations, I decided to take the following steps, which appear to have brought the disk usage back to roughly the same level as replica #2:

1. Delete all chunk_snapshot directories older than 5 days.
2. Delete all files in the wal directory that are older than 3 hours.

I would like to gain a better understanding of the relationship between the chunk_snapshot directories and the files in the wal directory.
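For anyone wanting to reproduce the age-based checks described above, here is a minimal dry-run sketch in Python. It only lists candidates and deletes nothing. The function name find_stale, the use of file modification time as the "age", and the choice to skip subdirectories inside wal/ are assumptions made for illustration, not the exact commands used on the pod.

```python
import time
from pathlib import Path

SNAPSHOT_MAX_AGE = 5 * 24 * 3600  # 5 days, the threshold from step 1
WAL_MAX_AGE = 3 * 3600            # 3 hours, the threshold from step 2


def find_stale(data_dir, now=None):
    """Return (stale_snapshot_dirs, stale_wal_files); nothing is deleted.

    Age is measured by modification time (an assumption of this sketch).
    """
    root = Path(data_dir)
    now = time.time() if now is None else now

    # Leftover temporary snapshot directories in the data directory.
    snapshots = sorted(
        p for p in root.glob("chunk_snapshot.*.tmp")
        if p.is_dir() and now - p.stat().st_mtime > SNAPSHOT_MAX_AGE
    )

    # Regular files directly under wal/; subdirectories are skipped.
    wal_dir = root / "wal"
    wal_files = []
    if wal_dir.is_dir():
        wal_files = sorted(
            p for p in wal_dir.iterdir()
            if p.is_file() and now - p.stat().st_mtime > WAL_MAX_AGE
        )
    return snapshots, wal_files
```

Calling find_stale with the path of the Prometheus data directory returns two lists of paths that can be reviewed before anything is removed by hand.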
I would also like to better understand any risks involved in deleting old chunk_snapshot.*.tmp directories, as well as WAL files more than 3 hours old.

I appreciate any help. Thank you

