It's ext4.

-rw-r--r--. 1 prometheus prometheus 134053888 Nov 2 09:55 00021327
-rw-r--r--. 1 prometheus prometheus 0 Dec 1 09:11 00036028
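As an aside, the span between the oldest and newest segment mtimes (i.e. how long the WAL has been accumulating) can be computed with stat. A minimal sketch, simulated here in a throwaway directory with the same two timestamps; on the real host you would point wal= at the wal directory under prometheus_data instead:

```shell
#!/bin/sh
# Simulated WAL directory (stand-in for /apps/prometheus/prometheus_data/wal).
wal=$(mktemp -d)
TZ=UTC touch -d '2021-11-02 09:55' "$wal/00021327"   # oldest segment
TZ=UTC touch -d '2021-12-01 09:11' "$wal/00036028"   # newest segment

# ls -tr sorts oldest-first by mtime.
oldest=$(ls -tr "$wal" | head -n 1)
newest=$(ls -tr "$wal" | tail -n 1)

# Difference of the mtimes (epoch seconds), in whole days.
days=$(( ( $(stat -c %Y "$wal/$newest") - $(stat -c %Y "$wal/$oldest") ) / 86400 ))
echo "WAL spans $days days ($oldest .. $newest)"
rm -rf "$wal"
```

With these two segments the span is 28 full days, which is how long the problem has been going on.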
From journalctl -eu prometheus: initially, we were receiving a "no space left on device" error:

prometheus: level=error ts=2021-11-30T20:31:29.438Z caller=scrape.go:1088 component="scrape manager" scrape_pool=proc_exporter target=XXXXXXXX msg="Scrape commit failed" err="write to WAL: log samples: write /apps/prometheus/prometheus_data/wal/00035966: no space left on device"

We increased the filesystem from 300 GB to 1 TB and restarted. It then took a long time to load the WAL, and at last we got:

prometheus: level=error ts=2021-11-30T20:32:26.338Z caller=db.go:766 component=tsdb msg="compaction failed" err="plan compaction: open /apps/prometheus/prometheus_data: too many open files"

I increased the open-file limit and restarted, and then we got a corruption error:

prometheus: ts=2021-12-01T04:47:51.988Z caller=db.go:756 level=warn component=tsdb msg="Encountered WAL read error, attempting repair" err="read records: corruption in segment /apps/prometheus/prometheus_data/wal/00035966 at 94863471: unexpected full record"
prometheus: ts=2021-12-01T04:47:52.017Z caller=wal.go:364 level=warn component=tsdb msg="Starting corruption repair" segment=35966 offset=94863471
prometheus: ts=2021-12-01T04:47:52.356Z caller=wal.go:372 level=warn component=tsdb msg="Deleting all segments newer than corrupted segment" segment=35966
prometheus: ts=2021-12-01T04:47:52.364Z caller=wal.go:394 level=warn component=tsdb msg="Rewrite corrupted segment" segment=35966
prometheus: ts=2021-12-01T04:47:53.865Z caller=main.go:866 level=info fs_type=EXT4_SUPER_MAGIC
prometheus: ts=2021-12-01T04:47:53.865Z caller=main.go:869 level=info msg="TSDB started"
prometheus: ts=2021-12-01T04:47:53.865Z caller=main.go:996 level=info msg="Loading configuration file" filename=/apps/prometheus/prometheus_server/prometheus.yml
prometheus: ts=2021-12-01T04:47:54.741Z caller=main.go:1033 level=info msg="Completed loading of configuration file" filename=/apps/prometheus/prometheus_server/prometheus.yml totalDuration=876.279412ms db_storage=1.661153ms remote_storage=6.844µs web_handler=24.458µs query_engine=4.525µs scrape=65.638736ms scrape_sd=6.201988ms notify=71.586µs notify_sd=3.788352ms rules=735.536426ms
prometheus: ts=2021-12-01T04:47:54.741Z caller=main.go:811 level=info msg="Server is ready to receive web requests."

Since we couldn't wait that long, for now I just moved the wal directory to wal.bkp and restarted Prometheus.

On Thursday, December 2, 2021 at 10:45:26 AM UTC-5 Brian Candler wrote:
> "local storage" means what filesystem - ext4? xfs? something else?
>
> What's the difference between the oldest and newest timestamps? That will
> show you how long the problem has been going for.
>
> ls -ld 00021327
> ls -ld 00036028
>
> Are you getting any error output from the prometheus server process? For
> example, if you're running it under systemd then try
>
> journalctl -eu prometheus
>
> My guess is that prometheus is failing to update the main database from
> the WAL, and if so will probably be logging some message saying what it's
> stuck on. And hence the WAL just keeps growing as new data is scraped.

-- 
You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/b53b6140-4875-42f5-b33a-3b5e5ff87144n%40googlegroups.com.
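For anyone hitting the same "too many open files" error with Prometheus running under systemd: rather than raising the limit ad hoc, it can be made persistent with a unit drop-in. A sketch, assuming the unit is named prometheus.service (adjust the name and the value to your setup):

```ini
# /etc/systemd/system/prometheus.service.d/limits.conf  (hypothetical drop-in)
[Service]
LimitNOFILE=65536
```

After creating the drop-in, run systemctl daemon-reload and then restart the prometheus unit so the new limit takes effect.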

