Hi,

Because of the way autovacuum_vacuum_[insert]_scale_factor works, autovacuums trigger less frequently as the relation gets larger.
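To make that concrete (illustrative numbers, not taken from the benchmark below): with the default autovacuum_vacuum_insert_threshold of 1000 and autovacuum_vacuum_insert_scale_factor of 0.2, a table with 1 million tuples needs roughly 1000 + 0.2 * 1,000,000 = 201,000 newly inserted tuples before an insert-triggered vacuum fires, while a table with 1 billion tuples needs roughly 200 million.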
See this math in relation_needs_vacanalyze:

    vacinsthresh = (float4) vac_ins_base_thresh + vac_ins_scale_factor * reltuples;

For an insert-only table, nearly all of the unvacuumed pages will be eligible to be set all-visible, and many will be eligible to be set all-frozen. Because normal vacuums can skip all-visible pages, proactively setting these pages all-visible by vacuuming them sooner often reduces I/O overhead: the sooner they are vacuumed after last being touched, the more likely they are to still be in shared buffers. Vacuuming these pages more proactively and setting them frozen also helps amortize the work of aggressive vacuums -- which often negatively impact the performance of the system.

By considering only the unfrozen portion of the table when calculating the vacuum insert threshold, we can trigger vacuums more proactively on insert-heavy tables. This changes the definition of autovacuum_vacuum_insert_scale_factor to a percentage of the "active" (unfrozen) table size.

The attached patch does this; a rough sketch of the resulting calculation appears after the benchmark results below. I've estimated the unfrozen percentage of the table by adding a new field to pg_class, relallfrozen, which is updated in the same places as relallvisible.

As an example of this patch in action, I designed a benchmark in which a table is bulk-loaded with 1 GB of data using COPY FREEZE. Then I run a custom pgbench workload with four clients, each inserting 10 tuples per transaction into the table for 1,000,000 transactions. Note that I configured Postgres so that the effects of this patch can be observed on a compressed timeline. At the bottom of this mail, I've included details on all of the GUCs I set and why.

Over the course of the same number of transactions, master triggered 8 autovacuums of the table and the patch triggered 16. With the patch, despite doing twice as many vacuums, autovacuum workers did 10% fewer reads and 93% fewer writes. At the end of the benchmark, master had emitted twice as many FPIs as the patched version of Postgres.

More frequent vacuums mean each vacuum scans fewer pages, but, more interestingly, the first vacuum after a checkpoint is much more efficient. With the patch, the first vacuum after a checkpoint emits half as many FPIs, and you can see that only 18 pages were newly dirtied. So, with the patch, the pages being vacuumed are usually still in shared buffers and still dirty.

Master
------
2024-10-22 13:53:14.293 EDT [3594] LOG:  checkpoint starting: time
2024-10-22 13:53:27.849 EDT [3964] LOG:  automatic vacuum of table "history"
        pages: 0 removed, 753901 remain, 151589 scanned (20.11% of total)
        I/O timings: read: 77.962 ms, write: 92.879 ms
        avg read rate: 95.840 MB/s, avg write rate: 96.852 MB/s
        buffer usage: 268318 hits, 35133 reads, 35504 dirtied
        WAL usage: 218320 records, 98672 full page images, 71314906 bytes

Patch
-----
2024-10-22 13:48:43.951 EDT [1471] LOG:  checkpoint starting: time
2024-10-22 13:48:59.741 EDT [1802] LOG:  automatic vacuum of table "history"
        pages: 0 removed, 774375 remain, 121367 scanned (15.67% of total)
        I/O timings: read: 2.974 ms, write: 4.434 ms
        avg read rate: 1.363 MB/s, avg write rate: 0.126 MB/s
        buffer usage: 242817 hits, 195 reads, 18 dirtied
        WAL usage: 121389 records, 49216 full page images, 34408291 bytes

While it is true that timing will change significantly from run to run, I observed over many runs that the more frequent vacuums of the table led to less overall overhead, because pages are vacuumed before they are evicted from shared buffers.
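To recap the mechanism in code form, here is a rough sketch (simplified; variable names and guards may differ from what the attached patch actually does) of how relallfrozen could feed into the threshold calculation in relation_needs_vacanalyze:

    /*
     * Estimate the unfrozen fraction of the table from the new
     * pg_class.relallfrozen counter and scale the reltuples term of the
     * insert threshold by it, so a mostly-frozen table is vacuumed again
     * after a comparatively small number of new inserts.
     */
    float4      pcnt_unfrozen = 1.0;

    if (relpages > 0)
        pcnt_unfrozen = 1.0 - ((float4) relallfrozen / relpages);

    vacinsthresh = (float4) vac_ins_base_thresh +
        vac_ins_scale_factor * reltuples * pcnt_unfrozen;

The intended effect is that once most of the table is frozen, only the unfrozen portion counts against the scale factor, so insert-heavy tables are vacuumed while the recently inserted pages are still likely to be in shared buffers.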
Below is a detailed description of the benchmark and Postgres configuration:

Benchmark
=========

Set these GUCs:

-- initial table data should fill shared buffers
shared_buffers=1GB
-- give us a chance to try and vacuum the table a bunch of times
autovacuum_naptime=2
-- all checkpoints should be triggered by timing
max/min_wal_size=150GB
-- let's get at least 1 checkpoint during the short benchmark
checkpoint_timeout='2min'
-- let's not be bottlenecked on WAL I/O
wal_buffers='128MB'
wal_compression='zstd'
-- let's get a lot of inserts done quickly
synchronous_commit='off'
-- let's not take too many breaks for vacuum delay
vacuum_cost_limit=2000
-- so we can see what happened
log_checkpoints=on
log_autovacuum_min_duration=0
-- so we can get more stats
track_wal_io_timing=on
track_io_timing=on

First I created the table that you will see later in the DDL and loaded it by running pgbench in the same way as I do in the benchmark until there was 1 GB of table data. Then I copied that out to a file 'history.data'. I included an index because the more up-to-date visibility map would benefit index-only scans -- which you could add to the benchmark if you want.

DDL
---

BEGIN;
DROP TABLE IF EXISTS history;
CREATE TABLE history(
    id BIGINT,
    client_id INT NOT NULL,
    mtime TIMESTAMPTZ DEFAULT NOW(),
    data TEXT);
COPY history FROM 'history.data' WITH (freeze on);
CREATE INDEX ON history(id);
COMMIT;

pgbench \
  --random-seed=0 \
  --no-vacuum \
  -M prepared \
  -c 4 \
  -j 4 \
  -t 1000000 \
  -R 27000 \
  -f- <<EOF
INSERT INTO history(id, client_id, data)
VALUES
  (:client_id, :client_id, repeat('a', 90)),
  (:client_id, :client_id, repeat('b', 90)),
  (:client_id, :client_id, repeat('c', 90)),
  (:client_id, :client_id, repeat('d', 90)),
  (:client_id, :client_id, repeat('e', 90)),
  (:client_id, :client_id, repeat('f', 90)),
  (:client_id, :client_id, repeat('g', 90)),
  (:client_id, :client_id, repeat('h', 90)),
  (:client_id, :client_id, repeat('i', 90)),
  (:client_id, :client_id, repeat('j', 90));
EOF

- Melanie
v1-0002-Trigger-more-frequent-autovacuums-with-relallfroz.patch
v1-0001-Add-relallfrozen-to-pg_class.patch