Hi,

I investigated what I thought was an issue with my patch to amortize aggressive vacuum [1] and found some interesting behavior that is also present on master. It seems vacuum will scan many all-frozen pages due to an interaction between the opportunistic freeze heuristic and SKIP_PAGES_THRESHOLD. This is not really an actionable finding (there is no clear way to fix it), but I wanted to put a repro and analysis on the mailing list in case anyone else ever notices this odd behavior.
Vacuum will opportunistically freeze a page if we just emitted an FPI for it (or would emit one). This has a very peculiar interaction with several other subsystems that can eventually lead us to scan huge numbers of already-frozen pages during vacuum.

In an insert-only workload, you would expect to end up freezing contiguous ranges of pages once tuples' xmins are old enough, because transaction IDs mostly increase along with block numbers. However, when checksums are enabled, we don't end up freezing contiguous ranges.

In an insert-only workload (like the one I provide at the bottom of the email), the first time we vacuum a page is often when its hint bits get set. If checksums are enabled and this is the first time we are dirtying the buffer since the last checkpoint, setting the hint bits will emit an FPI. So, if bgwriter or checkpointer writes out the page after the last tuple is inserted but before the page is vacuumed, we will likely emit an FPI when vacuuming it. And because we opportunistically freeze pages if we emitted an FPI while setting their hint bits, we end up freezing a number of pages for this reason.

What is interesting is that the ranges of contiguous frozen and unfrozen pages are pretty small. For example (a sketch of a query for producing this kind of breakdown appears at the end of this explanation):

 start_block | end_block | nblocks | all_visible | all_frozen
-------------+-----------+---------+-------------+------------
     1396695 |   1396695 |       1 | t           | f
     1396696 |   1396710 |      14 | t           | t
     1396711 |   1396711 |       1 | t           | f
     1396712 |   1396730 |      18 | t           | t
     1396731 |   1396731 |       1 | t           | f
     1396732 |   1396758 |      26 | t           | t
     1396759 |   1396759 |       1 | t           | f
     1396760 |   1396767 |       7 | t           | t
     1396768 |   1396768 |       1 | t           | f

This fragmentation usually starts with the first vacuum after we have dirtied enough buffers to provide work for bgwriter and at least one checkpoint has happened. The most likely explanation is that, because bgwriter writes out buffers in buffer clock-sweep order (not the order in which the blocks appear in the table), there are small, effectively random ranges of dirty and clean buffers in the table. Combine that with checkpointer writing out other buffers and the effect can be even more dramatic.

Here is some log output for the first vacuum after a checkpoint had started (I patched master to print the number of hint bit FPIs in the autovacuum log output):

automatic vacuum of table "melanieplageman.public.history":
pages: 0 removed, 605229 remain, 363538 scanned (60.07% of total)
frozen: 190990 pages from table (31.56% of total) had 10503305 tuples frozen
WAL usage: 745501 records, 205124 full page images, 190999 hint bit fpis, 205494134 bytes

During an aggressive vacuum, all all-visible pages will be scanned (because we must scan every unfrozen tuple to advance relfrozenxid). All-frozen pages can be skipped. However, if a range of skippable blocks is shorter than SKIP_PAGES_THRESHOLD (hard-coded to 32), we scan those skippable blocks anyway -- on the assumption that skipping small ranges would turn our reads into many small, discontiguous requests and defeat the kernel's readahead. The result of all of this is that aggressive vacuums may read large numbers of frozen pages when checksums are enabled.
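For anyone who wants to reproduce that kind of frozen/unfrozen range breakdown, something roughly along these lines works -- just a sketch using the pg_visibility extension and a standard gaps-and-islands grouping, not necessarily the exact query I ran:

-- Group the visibility map of the repro's "history" table into contiguous
-- ranges with the same all_visible/all_frozen status.
CREATE EXTENSION IF NOT EXISTS pg_visibility;

WITH vm AS (
    SELECT blkno, all_visible, all_frozen,
           blkno - row_number() OVER (
               PARTITION BY all_visible, all_frozen ORDER BY blkno
           ) AS grp
    FROM pg_visibility_map('history')
)
SELECT min(blkno) AS start_block,
       max(blkno) AS end_block,
       count(*)   AS nblocks,
       all_visible,
       all_frozen
FROM vm
GROUP BY all_visible, all_frozen, grp
ORDER BY start_block;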
Here is the autovacuum logging output of one example (I've patched master to print the pages scanned due to SKIP_PAGES_THRESHOLD and how many of those were frozen):

automatic aggressive vacuum to prevent wraparound of table "melanieplageman.public.history":
pages: 0 removed, 2454172 remain, 1437630 scanned (58.58% of total)
frozen skippable blocks scanned: 162335

As I said, I don't think we can do anything about this unless we find a good alternative freeze heuristic (I've been trying and failing at that for two years). I've attached a patch to log the frozen skippable blocks scanned by vacuum, as well as a repro.

- Melanie

--------

Repro:

The repro below includes a lot of GUCs that are required to get an aggressive vacuum in a reasonable amount of time and to see the effect. It doesn't need to run for the full number of transactions -- just until the first aggressive vacuum of the history table.

psql -c "ALTER SYSTEM SET shared_buffers = '1GB';" \
     -c "ALTER SYSTEM SET log_checkpoints = on;" \
     -c "ALTER SYSTEM SET max_wal_size = '150 GB';" \
     -c "ALTER SYSTEM SET min_wal_size = '150 GB';" \
     -c "ALTER SYSTEM SET autovacuum_naptime = 10;" \
     -c "ALTER SYSTEM SET log_autovacuum_min_duration = 0;" \
     -c "ALTER SYSTEM SET synchronous_commit = off;" \
     -c "ALTER SYSTEM SET checkpoint_timeout = '2min';" \
     -c "ALTER SYSTEM SET vacuum_cost_limit = 2000;" \
     -c "ALTER SYSTEM SET wal_compression = 'zstd';" \
     -c "ALTER SYSTEM SET wal_buffers = '2MB';" \
     -c "ALTER SYSTEM SET vacuum_freeze_min_age = 10000000;" \
     -c "ALTER SYSTEM SET autovacuum_freeze_max_age = 10000000;" \
     -c "ALTER SYSTEM SET vacuum_freeze_table_age = 8000000;" \
     -c "ALTER SYSTEM SET autovacuum_vacuum_insert_scale_factor = '0.01';" \
     -c "ALTER SYSTEM SET maintenance_work_mem = '1GB';" \
     -c "ALTER SYSTEM SET autovacuum_vacuum_insert_threshold = 10000000;"

pg_ctl restart

psql -c "CREATE TABLE history(
    id BIGINT PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY,
    client_id INT NOT NULL,
    mtime TIMESTAMPTZ DEFAULT NOW(),
    data TEXT);"

pgbench \
    --random-seed=0 \
    --no-vacuum \
    -M prepared \
    -c 8 \
    -j 8 \
    -t 11100000 \
    -f- <<EOF
INSERT INTO history(client_id, data)
VALUES
    (:client_id, repeat('a', 90)),
    (:client_id, repeat('b', 90)),
    (:client_id, repeat('c', 90)),
    (:client_id, repeat('d', 90)),
    (:client_id, repeat('e', 90)),
    (:client_id, repeat('f', 90)),
    (:client_id, repeat('g', 90)),
    (:client_id, repeat('h', 90)),
    (:client_id, repeat('i', 90)),
    (:client_id, repeat('j', 90));
EOF

[1] https://www.postgresql.org/message-id/flat/CAAKRu_ZF_KCzZuOrPrOqjGVe8iRVWEAJSpzMgRQs%3D5-v84cXUg%40mail.gmail.com
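P.S. A convenience query (not part of the repro itself) for watching how close the history table is to its first aggressive vacuum while the workload runs: with the settings above, a vacuum of the table becomes aggressive once age(relfrozenxid) exceeds vacuum_freeze_table_age (8000000), and autovacuum forces an anti-wraparound vacuum at autovacuum_freeze_max_age (10000000). Running it under psql's \watch works well.

-- Watch the age of the history table's relfrozenxid while the repro runs.
SELECT relname,
       age(relfrozenxid) AS relfrozenxid_age,
       pg_size_pretty(pg_table_size('history')) AS table_size
FROM pg_class
WHERE relname = 'history';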
From c861e77d627be8cc8cf2eda540e1f268ce47ed7b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Wed, 4 Dec 2024 16:40:16 -0500
Subject: [PATCH] Track skippable blocks scanned

Add counters and logging to vacuum to track what skippable pages were
scanned due to SKIP_PAGES_THRESHOLD and which of those were frozen.
---
 src/backend/access/heap/vacuumlazy.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6a3588cf817..8af64e1a720 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -214,6 +214,9 @@ typedef struct LVRelState
 	BlockNumber next_unskippable_block; /* next unskippable block */
 	bool		next_unskippable_allvis;	/* its visibility status */
 	Buffer		next_unskippable_vmbuffer;	/* buffer containing its VM bit */
+
+	BlockNumber skippable_blocks_scanned;
+	BlockNumber frozen_skippable_blocks_scanned;
 } LVRelState;
 
 /* Struct for saving and restoring vacuum error information. */
@@ -427,6 +430,8 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	vacrel->live_tuples = 0;
 	vacrel->recently_dead_tuples = 0;
 	vacrel->missed_dead_tuples = 0;
+	vacrel->skippable_blocks_scanned = 0;
+	vacrel->frozen_skippable_blocks_scanned = 0;
 
 	/*
 	 * Get cutoffs that determine which deleted tuples are considered DEAD,
@@ -664,6 +669,9 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 							 vacrel->scanned_pages,
 							 orig_rel_pages == 0 ? 100.0 :
 							 100.0 * vacrel->scanned_pages / orig_rel_pages);
+			appendStringInfo(&buf, _("skippable blocks scanned: %u. frozen skippable blocks scanned: %u.\n"),
+							 vacrel->skippable_blocks_scanned,
+							 vacrel->frozen_skippable_blocks_scanned);
 			appendStringInfo(&buf,
 							 _("tuples: %lld removed, %lld remain, %lld are dead but not yet removable\n"),
 							 (long long) vacrel->tuples_deleted,
@@ -1150,13 +1158,28 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
 	/* Now we must be in one of the two remaining states: */
 	if (next_block < vacrel->next_unskippable_block)
 	{
+		uint8		vmbits = 0;
+		Buffer		vmbuffer = InvalidBuffer;
+
 		/*
 		 * 2. We are processing a range of blocks that we could have skipped
 		 * but chose not to.  We know that they are all-visible in the VM,
 		 * otherwise they would've been unskippable.
 		 */
+		vacrel->skippable_blocks_scanned++;
+
 		*blkno = vacrel->current_block = next_block;
 		*all_visible_according_to_vm = true;
+
+		vmbits = visibilitymap_get_status(vacrel->rel,
+										  vacrel->current_block,
+										  &vmbuffer);
+		if (vmbits & VISIBILITYMAP_ALL_FROZEN)
+			vacrel->frozen_skippable_blocks_scanned++;
+
+		if (vmbuffer != InvalidBuffer)
+			ReleaseBuffer(vmbuffer);
+
 		return true;
 	}
 	else
-- 
2.34.1