Hello list, last night I got ENOSPC on the WAL partition while running a huge pg_restore into an empty, idle database. The checkpoint that had started 7 minutes earlier never finished:
  04:31:09 LOG:  checkpoint starting: wal
  ...
  04:38:04 PANIC:  could not write to file "pg_wal/xlogtemp.546204": No space left on device
  CONTEXT:  COPY table_partition_123, line 120872534

This is most likely because most of the data is written to a very slow "archival" tablespace. There are no I/O errors in the system logs; I just know things go slowly on that device (an iSCSI drive on a shared spinning-disk pool over a busy network).

It happened even though I keep the WAL on its own partition (128GB partition size), which leaves plenty of free space above max_wal_size (64GB). Somehow the WAL managed to grow past max_wal_size and fill 100% of the partition.

I'm running the latest PostgreSQL 17, and the settings have been temporarily tweaked for the fastest possible pg_restore:

  max_wal_size = 64GB
  max_replication_slots = 0
  max_logical_replication_workers = 0
  max_wal_senders = 0
  wal_level = minimal
  autovacuum = off

Several things seem to have gone wrong here. Questions:

+ The WAL partition is much faster than the archival tablespace. Am I in constant danger of overrunning max_wal_size? How can I make 100% sure this never happens again?

+ After recovery, with the database idle, I notice that WAL disk usage stays constant at 64GB. Why doesn't it shrink back down to min_wal_size (1GB)?

+ I just created a 13GB zstd-compressed tarball of those 64GB of WAL files, which suggests the files are quite compressible despite the wal_compression = zstd setting. Could it be that Postgres ignores the setting and does not compress the WAL? How can I check?

+ I'm using parallel pg_restore --data-only and can't avoid that for now. Even though all the tables are empty (I truncated everything before starting pg_restore), I can't find a way to avoid going through the WAL. Ideas?

Thanks in advance,
Dimitris
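P.S. For the compression question, here is a sketch of what I plan to check next (the segment name 000000010000000000000001 is just a placeholder; I'd substitute a real file from pg_wal):

```shell
# Confirm the setting is actually live in the running server
psql -Atc 'SHOW wal_compression'

# Break WAL volume down by record type; if I read the docs correctly,
# wal_compression only compresses full-page images (the FPI columns),
# so the rest of the record payload would stay compressible either way
pg_waldump --stats 000000010000000000000001
```

If the FPI share of the volume turns out to be small, the tarball ratio wouldn't say much about whether the setting is working.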