On Mon, Jul 15, 2024 at 6:02 PM Peter Geoghegan <p...@bowt.ie> wrote:
>
> On Mon, Jul 8, 2024 at 2:25 PM Melanie Plageman
> <melanieplage...@gmail.com> wrote:
> > I could still use another pair of eyes on the test (looking out for
> > stability enhancing measures I could take).
>
> First, the basics: I found that your test failed reliably without your
> fix, and passed reliably with your fix.
Thanks for the review.

> Minor nitpicking about the comments in your TAP test:
>
> * It is necessary but not sufficient for your test to "skewer"
> maybe_needed, relative to OldestXmin. Obviously, it is not sufficient
> because the test can only fail when VACUUM prunes a heap page after
> the backend's horizons have been "skewered" in this sense.
>
> Pruning is when we get stuck, and if there's no more pruning then
> there's no opportunity for VACUUM to get stuck. Perhaps this point
> should be noted directly in the comments. You could add a sentence
> immediately after the existing sentence "Then vacuum's first pass will
> continue and pruning...". This new sentence would then add commentary
> such as "Finally, vacuum's second pass over the heap...".

I've added a description of the required scenario to the top of the test
and reworked the comment you are describing to try to make this clearer.

> * Perhaps you should point out that you're using VACUUM FREEZE for
> this because it'll force the backend to always get a cleanup lock.
> This is something you rely on to make the repro reliable, but that's
> it.
>
> In other words, point out to the reader that this bug has nothing to
> do with freezing; it just so happens to be convenient to use VACUUM
> FREEZE this way, due to implementation details.

I've mentioned this in a comment.

> * The sentence "VACUUM proceeds with pruning and does a visibility
> check on each tuple..." describes the bug in terms of the current
> state of things on Postgres 17, but Postgres 17 hasn't been released
> just yet. Does that really make sense?

In the patch targeted at master, I think it makes sense to describe the
code as it is. In the backpatched versions, I reworked this comment to be
correct for those versions.

> If you want to describe the invariant that caused
> heap_pre_freeze_checks() to error-out on HEAD/Postgres 17, then the
> commit message of your fix seems like the right place for that. You
> could reference these errors in passing. The errors seem fairly
> incidental to the real problem, at least to me.

The errors are mentioned in the fix commit message.

> I think that there is some chance that this test will break the build
> farm in whatever way, since there is a long history of VACUUM not
> quite behaving as expected with these sorts of tests. I think that you
> should commit the test case separately, first thing in the morning,
> and then keep an eye on the build farm for the rest of the day. I
> don't think that it's sensible to bend over backwards, just to avoid
> breaking the build farm in this way.

Sounds good.

- Melanie
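P.S. For anyone who wants to poke at the scenario by hand before reading the
attached TAP test, the heart of what it automates looks roughly like the
sketch below. This is only a sketch: the table, column, and cursor setup
mirror the attached test, but the real test also sizes the table so that a
round of index vacuuming happens while the VACUUM is waiting.

    -- Session B: keep a pin on the table's first heap page.
    BEGIN;
    DECLARE c CURSOR FOR SELECT col1 FROM vac_horizon_floor_table WHERE col1 = 7;
    FETCH c;

    -- Session A: VACUUM FREEZE computes OldestXmin and GlobalVisState up
    -- front, then blocks waiting for a cleanup lock on the pinned page
    -- (wait_event = 'BufferPin') instead of skipping it.
    VACUUM FREEZE vac_horizon_floor_table;

    -- While session A waits, reconnect a standby (hot_standby_feedback = on)
    -- whose oldest xmin is older than OldestXmin, so the primary's
    -- GlobalVisState->maybe_needed can move backwards before pruning starts.

The FETCH keeps the page pinned, and because VACUUM FREEZE is aggressive it
waits for the cleanup lock rather than skipping the page -- that wait is the
window the test uses to reconnect the standby.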
From dab4725c8e5dc7103607790d915bb47e2139d0ce Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Wed, 17 Jul 2024 11:01:00 -0400
Subject: [PATCH v4 1/2] Ensure vacuum removes all visibly dead tuples older
 than OldestXmin

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed,
it may attempt to freeze the tuple's xmax and then ERROR out in
pre-freeze checks with "cannot freeze committed xmax".

Fix this by having vacuum always remove tuples older than OldestXmin.

It is possible for GlobalVisState->maybe_needed to precede OldestXmin if
maybe_needed is forced to go backward while vacuum is running. This can
happen if a disconnected standby with a running transaction older than
VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin.

In back branches starting with 14, the first version using
GlobalVisState, failing to remove tuples older than OldestXmin during
pruning caused vacuum to infinitely loop in lazy_scan_prune(), as
investigated on this [1] thread. After 1ccc1e05ae removed the retry loop
in lazy_scan_prune() and stopped comparing tuples to OldestXmin, the
hang could no longer happen, but we could still attempt to freeze dead
tuples with xmax older than OldestXmin -- resulting in an ERROR.

Fix this by always removing dead tuples with xmax older than
VacuumCutoffs->OldestXmin. This is okay because the standby won't replay
the tuple removal until the tuple is removable. Thus, the worst that can
happen is a recovery conflict.

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee

Back-patch through 14
Author: Melanie Plageman
Reviewed-by: Peter Geoghegan, Robert Haas, Andres Freund, Heikki Linnakangas, and Noah Misch
Discussion: https://postgr.es/m/CAAKRu_bDD7oq9ZwB2OJqub5BovMG6UjEYsoK2LVttadjEqyRGg%40mail.gmail.com
---
 src/backend/access/heap/pruneheap.c  | 23 ++++++++++++++++++++++-
 src/backend/access/heap/vacuumlazy.c | 14 +++++++-------
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 3cdfc5b7f1b..869d82ad667 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -325,6 +325,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * cutoffs contains the freeze cutoffs, established by VACUUM at the beginning
  * of vacuuming the relation. Required if HEAP_PRUNE_FREEZE option is set.
+ * cutoffs->OldestXmin is also used to determine if dead tuples are
+ * HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD.
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
@@ -922,8 +924,27 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
 	if (res != HEAPTUPLE_RECENTLY_DEAD)
 		return res;
 
+	/*
+	 * For VACUUM, we must be sure to prune tuples with xmax older than
+	 * OldestXmin -- a visibility cutoff determined at the beginning of
+	 * vacuuming the relation. OldestXmin is used for freezing determination
+	 * and we cannot freeze dead tuples' xmaxes.
+	 */
+	if (prstate->cutoffs &&
+		TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
+		NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
+		return HEAPTUPLE_DEAD;
+
+	/*
+	 * Determine whether or not the tuple is considered dead when compared
+	 * with the provided GlobalVisState. On-access pruning does not provide
+	 * VacuumCutoffs. And for vacuum, even if the tuple's xmax is not older
+	 * than OldestXmin, GlobalVisTestIsRemovableXid() could find the row dead
+	 * if the GlobalVisState has been updated since the beginning of vacuuming
+	 * the relation.
+	 */
 	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
-		res = HEAPTUPLE_DEAD;
+		return HEAPTUPLE_DEAD;
 
 	return res;
 }
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3f88cf1e8ef..abb47ae5960 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -438,13 +438,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * as an upper bound on the XIDs stored in the pages we'll actually scan
 	 * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
 	 *
-	 * Next acquire vistest, a related cutoff that's used in pruning. We
-	 * expect vistest will always make heap_page_prune_and_freeze() remove any
-	 * deleted tuple whose xmax is < OldestXmin. lazy_scan_prune must never
-	 * become confused about whether a tuple should be frozen or removed. (In
-	 * the future we might want to teach lazy_scan_prune to recompute vistest
-	 * from time to time, to increase the number of dead tuples it can prune
-	 * away.)
+	 * Next acquire vistest, a related cutoff that's used in pruning. We use
+	 * vistest in combination with OldestXmin to ensure that
+	 * heap_page_prune_and_freeze() always removes any deleted tuple whose
+	 * xmax is < OldestXmin. lazy_scan_prune must never become confused about
+	 * whether a tuple should be frozen or removed. (In the future we might
+	 * want to teach lazy_scan_prune to recompute vistest from time to time,
+	 * to increase the number of dead tuples it can prune away.)
 	 */
 	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
-- 
2.34.1
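As an aside, the horizon movement that the test in the next patch relies on
can be watched from SQL while reproducing by hand. The queries below are just
an illustrative sketch (not part of either patch); they assume
hot_standby_feedback is on and that the stalled backend is the one running
VACUUM FREEZE:

    -- Standby feedback xmin as seen by the primary once the walreceiver
    -- reconnects:
    SELECT application_name, backend_xmin FROM pg_stat_replication;

    -- The VACUUM backend stuck waiting on the cursor's buffer pin:
    SELECT pid, wait_event_type, wait_event
    FROM pg_stat_activity
    WHERE query LIKE 'VACUUM%';

Once pg_stat_replication.backend_xmin reports an xid older than the xid of
the test's UPDATE, the vacuuming backend's GlobalVisState can be pulled
backwards the next time it is refreshed -- in the test, at the end of a round
of index vacuuming.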
From 4effac7b206c9bb00b87e748a91ed321d3721789 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Wed, 17 Jul 2024 11:00:53 -0400
Subject: [PATCH v4 2/2] Test that vacuum removes tuples older than OldestXmin

If vacuum fails to prune a tuple killed before OldestXmin, it will
decide to freeze its xmax and later error out in pre-freeze checks.

Add a test reproducing this scenario to the recovery suite which creates
a table on a primary, updates the table to generate dead tuples for
vacuum, and then, during the vacuum, uses a replica to force
GlobalVisState->maybe_needed on the primary to move backwards and
precede the value of OldestXmin set at the beginning of vacuuming the
table.

This commit is separate from the fix in case there are test stability
issues.

Author: Melanie Plageman
Reviewed-by: Peter Geoghegan
Discussion: https://postgr.es/m/CAAKRu_apNU2MPBK96V%2BbXjTq0RiZ-%3DA4ZTaysakpx9jxbq1dbQ%40mail.gmail.com
---
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/043_vacuum_horizon_floor.pl    | 250 ++++++++++++++++++
 2 files changed, 251 insertions(+)
 create mode 100644 src/test/recovery/t/043_vacuum_horizon_floor.pl

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index b1eb77b1ec1..1d55d6bf560 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,6 +51,7 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
+      't/043_vacuum_horizon_floor.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_vacuum_horizon_floor.pl b/src/test/recovery/t/043_vacuum_horizon_floor.pl
new file mode 100644
index 00000000000..d454c8bbaec
--- /dev/null
+++ b/src/test/recovery/t/043_vacuum_horizon_floor.pl
@@ -0,0 +1,250 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use Test::More;
+
+# Test that vacuum prunes away all dead tuples killed before OldestXmin
+#
+# This test creates a table on a primary, updates the table to generate dead
+# tuples for vacuum, and then, during the vacuum, uses the replica to force
+# GlobalVisState->maybe_needed on the primary to move backwards and precede the
+# value of OldestXmin set at the beginning of vacuuming the table.
+
+# Set up nodes
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 'physical');
+
+$node_primary->append_conf(
+	'postgresql.conf', qq[
+hot_standby_feedback = on
+log_recovery_conflict_waits = true
+autovacuum = off
+maintenance_work_mem = 1024
+]);
+$node_primary->start;
+
+my $node_replica = PostgreSQL::Test::Cluster->new('standby');
+
+$node_primary->backup('my_backup');
+$node_replica->init_from_backup($node_primary, 'my_backup',
+	has_streaming => 1);
+
+$node_replica->start;
+
+my $test_db = "test_db";
+$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
+
+# Save the original connection info for later use
+my $orig_conninfo = $node_primary->connstr();
+
+my $table1 = "vac_horizon_floor_table";
+
+# Long-running Primary Session A
+my $psql_primaryA =
+	$node_primary->background_psql($test_db, on_error_stop => 1);
+
+# Long-running Primary Session B
+my $psql_primaryB =
+	$node_primary->background_psql($test_db, on_error_stop => 1);
+
+# Because vacuum's first pass, pruning, is where we use the GlobalVisState to
+# check tuple visibility, GlobalVisState->maybe_needed must move backwards
+# during pruning before checking the visibility for a tuple which would have
+# been considered HEAPTUPLE_DEAD prior to maybe_needed moving backwards but
+# HEAPTUPLE_RECENTLY_DEAD compared to the new, older value of maybe_needed.
+#
+# We must not only force the horizon on the primary to move backwards but also
+# force the vacuuming backend's GlobalVisState to be updated. GlobalVisState
+# is forced to update during index vacuuming.
+#
+# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the
+# end of a round of index vacuuming, updating the backend's GlobalVisState
+# and, in our case, moving maybe_needed backwards.
+#
+# Then vacuum's first (pruning) pass will continue and pruning will find our
+# later inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
+# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
+#
+# Thus, we must force at least two rounds of index vacuuming to ensure that
+# some tuple visibility checks will happen after a round of index vacuuming.
+# To accomplish this, we set maintenance_work_mem to its minimum value and
+# insert and update enough rows that we force at least one round of index
+# vacuuming before getting to a dead tuple which was killed after the standby
+# is disconnected.
+$node_primary->safe_psql($test_db, qq[
+	CREATE TABLE ${table1}(col1 int) with (autovacuum_enabled=false, fillfactor=10);
+	INSERT INTO $table1 VALUES(7);
+	INSERT INTO $table1 SELECT generate_series(1, 400000) % 3;
+	CREATE INDEX on ${table1}(col1);
+	UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
+	INSERT INTO $table1 VALUES(7);
+]);
+
+# We will later move the primary forward while the standby is disconnected.
+# For now, however, there is no reason not to wait for the standby to catch
+# up.
+my $primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
+
+# Test that the WAL receiver is up and running.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 't');
+
+# Set primary_conninfo to something invalid on the replica and reload the
+# config. Once the config is reloaded, the startup process will force the WAL
+# receiver to restart and it will be unable to reconnect because of the
+# invalid connection information.
+$node_replica->safe_psql($test_db, qq[
+	ALTER SYSTEM SET primary_conninfo = '';
+	SELECT pg_reload_conf();
+	]);
+
+# Wait until the WAL receiver has shut down and been unable to start up again.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 'f');
+
+# Now insert and update a tuple which will be visible to the vacuum on the
+# primary but which will have xmax newer than the oldest xmin on the standby
+# that was recently disconnected.
+my $res = $psql_primaryA->query_safe(
+	qq[
+	INSERT INTO $table1 VALUES (99);
+	UPDATE $table1 SET col1 = 100 WHERE col1 = 99;
+	SELECT 'after_update';
+	]
+	);
+
+# Make sure the UPDATE finished
+like($res, qr/^after_update$/m, "UPDATE occurred on primary session A");
+
+# Open a cursor on the primary whose pin will keep VACUUM from getting a
+# cleanup lock on the first page of the relation. We want VACUUM to be able to
+# start, calculate initial values for OldestXmin and GlobalVisState and then
+# be unable to proceed with pruning our dead tuples. This will allow us to
+# reconnect the standby and push the horizon back before we start actual
+# pruning and vacuuming.
+my $primary_cursor1 = "vac_horizon_floor_cursor1";
+
+# The first value inserted into the table was a 7, so FETCH FORWARD should
+# return a 7. That's how we know the cursor has a pin.
+$res = $psql_primaryB->query_safe(
+	qq[
+	BEGIN;
+	DECLARE $primary_cursor1 CURSOR FOR SELECT col1 FROM $table1 WHERE col1 = 7;
+	FETCH $primary_cursor1;
+	]
+	);
+
+is($res, 7, qq[Cursor query returned $res. Expected value 7.]);
+
+# Get the PID of the session which will run the VACUUM FREEZE so that we can
+# use it to filter pg_stat_activity later.
+my $vacuum_pid = $psql_primaryA->query_safe("SELECT pg_backend_pid();");
+
+# Now start a VACUUM FREEZE on the primary. It will call vacuum_get_cutoffs()
+# and establish values of OldestXmin and GlobalVisState which are newer than
+# all of our dead tuples. Then it will be unable to get a cleanup lock to
+# start pruning, so it will hang. We use VACUUM FREEZE because it will wait
+# for a cleanup lock instead of skipping the page pinned by the cursor.
+$psql_primaryA->{stdin} .= qq[
+	VACUUM FREEZE $table1;
+	\\echo VACUUM
+	];
+
+# Make sure the VACUUM command makes it to the server.
+$psql_primaryA->{run}->pump_nb();
+
+# Make sure that the VACUUM has already called vacuum_get_cutoffs() and is
+# just waiting on the lock to start vacuuming. We don't want the standby to
+# re-establish a connection to the primary and push the horizon back until
+# we've saved initial values in GlobalVisState and calculated OldestXmin.
+$node_primary->poll_query_until($test_db,
+	qq[
+	SELECT count(*) >= 1 FROM pg_stat_activity
+		WHERE pid = $vacuum_pid
+		AND wait_event = 'BufferPin';
+	],
+	't');
+
+# Ensure the WAL receiver is still not active on the replica.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 'f');
+
+# Allow the WAL receiver connection to re-establish.
+$node_replica->safe_psql(
+	$test_db, qq[
+	ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
+	SELECT pg_reload_conf();
+	]);
+
+# Ensure the new WAL receiver has connected.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 't');
+
+# Once the WAL sender is shown on the primary, the replica should have
+# connected with the primary and pushed the horizon backward.
+# Primary Session A won't see that until the VACUUM FREEZE proceeds and does
+# its first round of index vacuuming.
+$node_primary->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_replication);] , 't');
+
+# Move the cursor forward to the next 7. We inserted the 7 much later, so
+# advancing the cursor should allow vacuum to proceed vacuuming most pages of
+# the relation. Because we set maintenance_work_mem sufficiently low, we
+# expect that a round of index vacuuming has happened and that the vacuum is
+# now waiting for the cursor to release its pin on the last page of the
+# relation.
+$res = $psql_primaryB->query_safe("FETCH $primary_cursor1");
+is($res, 7, qq[Cursor query returned $res from second fetch. Expected value 7.]);
+
+# Prevent the test from incorrectly passing by confirming that we did indeed
+# do a pass of index vacuuming.
+$node_primary->poll_query_until($test_db, qq[
+	SELECT index_vacuum_count > 0
+	FROM pg_stat_progress_vacuum
+	WHERE datname='$test_db' AND relid::regclass = '$table1'::regclass;
+	] , 't');
+
+# Commit the transaction with the open cursor so that the VACUUM can finish.
+$psql_primaryB->query_until(
+	qr/^commit$/m,
+	qq[
+	COMMIT;
+	\\echo commit
+	]
+	);
+
+# VACUUM proceeds with pruning and does a visibility check on each tuple. In
+# older versions of Postgres, pruning found our final dead tuple
+# non-removable (HEAPTUPLE_RECENTLY_DEAD) since its xmax is after the new
+# value of maybe_needed. Then heap_prepare_freeze_tuple() would decide the
+# tuple xmax should be frozen because it precedes OldestXmin. Vacuum would
+# then error out in heap_pre_freeze_checks() with "cannot freeze committed
+# xmax". This was fixed by changing pruning to treat all
+# HEAPTUPLE_RECENTLY_DEAD tuples with xmaxes preceding OldestXmin as
+# HEAPTUPLE_DEAD and remove them.
+
+# With the fix, VACUUM should finish successfully, incrementing the table
+# vacuum_count.
+$node_primary->poll_query_until($test_db,
+	qq[
+	SELECT vacuum_count > 0
+	FROM pg_stat_all_tables WHERE relname = '${table1}';
+	]
+	, 't');
+
+$primary_lsn = $node_primary->lsn('flush');
+
+# Make sure something causes us to flush
+$node_primary->safe_psql($test_db, "INSERT INTO $table1 VALUES (1);");
+
+# Nothing on the replica should cause a recovery conflict, so this should
+# finish successfully.
+$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
+
+## Shut down psqls
+$psql_primaryA->quit;
+$psql_primaryB->quit;
+
+$node_replica->stop();
+$node_primary->stop();
+
+done_testing();
-- 
2.34.1