It's ext4.

-rw-r--r--. 1 prometheus prometheus 134053888 Nov  2 09:55 00021327
-rw-r--r--. 1 prometheus prometheus 0 Dec  1 09:11 00036028
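
Going by the listing above, the oldest and newest segments are about 29 days apart. A minimal sketch for computing that span directly, assuming GNU coreutils `stat` is available; the function name `mtime_span_days` is made up for illustration:

```shell
# Print the whole number of days between the modification times of two files.
# Assumes GNU coreutils `stat` (-c %Y prints the mtime as epoch seconds).
mtime_span_days() {
    local older newer
    older=$(stat -c %Y "$1") || return 1
    newer=$(stat -c %Y "$2") || return 1
    echo $(( (newer - older) / 86400 ))
}

# Example, run from inside the wal directory:
#   mtime_span_days 00021327 00036028
```

The result is rounded down to whole days, which is enough to see how long the WAL has been growing.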

journalctl -eu prometheus

Initially, we were getting a "no space left on device" error.

prometheus: level=error ts=2021-11-30T20:31:29.438Z caller=scrape.go:1088 
component="scrape manager" scrape_pool=proc_exporter target=XXXXXXXX 
msg="Scrape commit failed" err="write to WAL: log samples: write 
/apps/prometheus/prometheus_data/wal/00035966: no space left on device"

We increased the storage from 300 GB to 1 TB and restarted. It then took a
long time to load the WAL, and at last we got:

prometheus: level=error ts=2021-11-30T20:32:26.338Z caller=db.go:766 
component=tsdb msg="compaction failed" err="plan compaction: open 
/apps/prometheus/prometheus_data: too many open files"

I increased the open file limit and restarted; we then got a corruption error.
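
For anyone checking the same thing, here is a rough sketch for confirming which open file limit the running process actually has. It assumes a Linux /proc filesystem and a process named `prometheus`; the helper name `soft_nofile` is hypothetical. Under systemd the limit is normally set with `LimitNOFILE=` in the unit file, so a value raised only in a login shell may not reach the service.

```shell
# Read the text of a /proc/<pid>/limits file on stdin and print the
# soft "Max open files" value (the 4th whitespace-separated field).
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Example (assumes the process is named "prometheus" and pidof is installed):
#   soft_nofile < "/proc/$(pidof prometheus)/limits"
# Number of descriptors currently open by that process:
#   ls "/proc/$(pidof prometheus)/fd" | wc -l
```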

prometheus: ts=2021-12-01T04:47:51.988Z caller=db.go:756 level=warn 
component=tsdb msg="Encountered WAL read error, attempting repair" 
err="read records: corruption in segment 
/apps/prometheus/prometheus_data/wal/00035966 at 94863471: unexpected full 
record"
prometheus: ts=2021-12-01T04:47:52.017Z caller=wal.go:364 level=warn 
component=tsdb msg="Starting corruption repair" segment=35966 
offset=94863471
prometheus: ts=2021-12-01T04:47:52.356Z caller=wal.go:372 level=warn 
component=tsdb msg="Deleting all segments newer than corrupted segment" 
segment=35966
prometheus: ts=2021-12-01T04:47:52.364Z caller=wal.go:394 level=warn 
component=tsdb msg="Rewrite corrupted segment" segment=35966
prometheus: ts=2021-12-01T04:47:53.865Z caller=main.go:866 level=info 
fs_type=EXT4_SUPER_MAGIC
prometheus: ts=2021-12-01T04:47:53.865Z caller=main.go:869 level=info 
msg="TSDB started"
prometheus: ts=2021-12-01T04:47:53.865Z caller=main.go:996 level=info 
msg="Loading configuration file" 
filename=/apps/prometheus/prometheus_server/prometheus.yml
prometheus: ts=2021-12-01T04:47:54.741Z caller=main.go:1033 level=info 
msg="Completed loading of configuration file" 
filename=/apps/prometheus/prometheus_server/prometheus.yml 
totalDuration=876.279412ms db_storage=1.661153ms remote_storage=6.844µs 
web_handler=24.458µs query_engine=4.525µs scrape=65.638736ms 
scrape_sd=6.201988ms notify=71.586µs notify_sd=3.788352ms rules=735.536426ms
prometheus: ts=2021-12-01T04:47:54.741Z caller=main.go:811 level=info 
msg="Server is ready to receive web requests."

Since we can't wait that long, I have now moved the wal directory to wal.bkp
and restarted Prometheus.
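
The move can be sketched as below. This is an illustration rather than the exact commands that were run: the function name `set_aside_wal` and the systemd unit name are assumptions. Note that samples which existed only in the WAL and had not yet been compacted into blocks will not be visible to Prometheus after this.

```shell
# Rename <data_dir>/wal to <data_dir>/wal.bkp.
# Refuses to run if there is no wal directory or a backup already exists.
set_aside_wal() {
    data_dir=$1
    if [ ! -d "$data_dir/wal" ]; then
        echo "no wal directory under $data_dir" >&2
        return 1
    fi
    if [ -e "$data_dir/wal.bkp" ]; then
        echo "$data_dir/wal.bkp already exists" >&2
        return 1
    fi
    mv "$data_dir/wal" "$data_dir/wal.bkp"
}

# Intended use (assumes a systemd unit named "prometheus"):
#   systemctl stop prometheus
#   set_aside_wal /apps/prometheus/prometheus_data
#   systemctl start prometheus
```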

On Thursday, December 2, 2021 at 10:45:26 AM UTC-5 Brian Candler wrote:

> "local storage" means what filesystem - ext4? xfs? something else?
>
> What's the difference between the oldest and newest timestamps?  That will 
> show you how long the problem has been going for.
>
> ls -ld 00021327
> ls -ld 00036028
>
> Are you getting any error output from the prometheus server process? For 
> example, if you're running it under systemd then try
>
> journalctl -eu prometheus
>
> My guess is that prometheus is failing to update the main database from 
> the WAL, and if so will probably be logging some message saying what it's 
> stuck on.  And hence the WAL just keeps growing as new data is scraped.
>
