I recently ran into an issue with a Prometheus instance that started 
experiencing failed compactions. For context, I'm running Prometheus in 
Kubernetes as a StatefulSet with a replica count of 2.

I quickly realized the compactions were failing on replica #1 because there 
was no space left on the disk. The unusual part is that replica #2 was running 
fine and was using only approx. 40% of its disk. The number of series in 
the head block was nearly identical on both replicas, which led me to 
inspect the contents on disk.

Upon a more detailed inspection, I noticed the following: 

   1. Replica #1 had many more *chunk_snapshot.*.tmp* directories compared 
   to replica #2
   2. Replica #1 had a much larger number of files in the "*wal/*" directory 
   compared to replica #2.
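
A comparison like the one above could be scripted roughly as follows. This 
is only a minimal sketch in Python: the function names (summarize, dir_size) 
are hypothetical, and it assumes the *chunk_snapshot.*.tmp* directories and 
the "wal/" directory sit directly under the Prometheus data directory. It 
only reads metadata and changes nothing on disk.

```python
from pathlib import Path


def dir_size(path: Path) -> int:
    """Total size in bytes of all regular files under path."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def summarize(data_dir: str) -> dict:
    """Count leftover snapshot temp directories and WAL files in a data dir.

    Assumption: data_dir is the Prometheus data directory of one replica.
    """
    root = Path(data_dir)
    # Temporary snapshot directories, matching the pattern from the post.
    tmp_dirs = [p for p in root.glob("chunk_snapshot.*.tmp*") if p.is_dir()]
    wal_dir = root / "wal"
    # Only files directly inside wal/; subdirectories are not counted.
    wal_files = [p for p in wal_dir.iterdir() if p.is_file()] if wal_dir.is_dir() else []
    return {
        "tmp_snapshot_dirs": len(tmp_dirs),
        "tmp_snapshot_bytes": sum(dir_size(p) for p in tmp_dirs),
        "wal_files": len(wal_files),
        "wal_bytes": sum(p.stat().st_size for p in wal_files),
    }
```

Running summarize() against the data directory of each replica and comparing 
the two results would show whether the extra usage comes from the temporary 
snapshot directories, the WAL files, or both.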

Based on what I understand, the chunk_snapshot directories are 
automatically deleted upon the successful completion of a snapshot (see source 
code 
<https://github.com/prometheus/prometheus/blob/v2.44.0/tsdb/head_wal.go#L1051-L1062>).
  
I assume these were left over because the Prometheus instance was in some 
kind of state where snapshots were started but never completed 
successfully, and so were never deleted.

Secondly, if my understanding is correct (based on this 
<https://www.robustperception.io/new-features-in-prometheus-2-18-0/> and 
this <https://github.com/prometheus/prometheus/pull/7098>), the files in 
the "wal/" directory usually shouldn't be older than 3 hours. This lines up 
with what I see on my other running instances, none of which had WAL files 
older than 3 hours. Replica #1, however, had many WAL files older than 
3 hours, which together accounted for the large majority of the disk usage 
on that pod. Given these observations, I took the following steps, which 
appear to have brought the disk usage back to roughly the same level as 
replica #2:

   1. delete all chunk_snapshot directories older than 5 days
   2. delete all files in the wal directory which are older than 3 hours
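
Before deleting anything, the candidates for the two steps above could be 
listed with a dry run along these lines. Again a sketch under assumptions: 
stale_candidates is a hypothetical helper, age is taken from each entry's 
modification time, and the 5-day and 3-hour thresholds are simply the ones 
used in this post, not values Prometheus itself documents as safe. The 
function only returns paths and removes nothing.

```python
import time
from pathlib import Path

SNAPSHOT_MAX_AGE = 5 * 24 * 3600  # 5 days, as in step 1
WAL_MAX_AGE = 3 * 3600            # 3 hours, as in step 2


def stale_candidates(data_dir, now=None):
    """Return (snapshot_dirs, wal_files) older than the thresholds.

    Dry run only: nothing is deleted. Age is based on mtime, which is an
    assumption about how "older than" should be measured.
    """
    now = time.time() if now is None else now
    root = Path(data_dir)
    snapshots = sorted(
        p for p in root.glob("chunk_snapshot.*")
        if p.is_dir() and now - p.stat().st_mtime > SNAPSHOT_MAX_AGE
    )
    wal_dir = root / "wal"
    wal_files = sorted(
        p for p in wal_dir.iterdir()
        if p.is_file() and now - p.stat().st_mtime > WAL_MAX_AGE
    ) if wal_dir.is_dir() else []
    return snapshots, wal_files
```

Reviewing the returned lists first makes it easier to confirm that only old 
entries would be touched before any manual cleanup is attempted.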

I would like to gain a better understanding of the relationship between the 
chunk_snapshot directories and the files in the wal directory. I would 
also like to better understand any risks involved in deleting old 
*chunk_snapshot.*.tmp* directories, as well as WAL files more than 3 hours 
old.

I appreciate any help.


Thank you
