Hi,
On 5/26/23 9:27 AM, Yu Shi (Fujitsu) wrote:
Hi hackers,
I saw a buildfarm failure on "dikkop"[1]. It failed in
035_standby_logical_decoding.pl, because the slots row_removal_inactiveslot and
row_removal_activeslot are not invalidated after vacuum.
Thanks for sharing!
regress_log_035_standby_logical_decoding:
```
[12:15:05.943](4.442s) not ok 22 - inactiveslot slot invalidation is logged
with vacuum on pg_class
[12:15:05.945](0.003s)
[12:15:05.946](0.000s) # Failed test 'inactiveslot slot invalidation is
logged with vacuum on pg_class'
# at t/035_standby_logical_decoding.pl line 238.
[12:15:05.948](0.002s) not ok 23 - activeslot slot invalidation is logged with
vacuum on pg_class
[12:15:05.949](0.001s)
[12:15:05.950](0.000s) # Failed test 'activeslot slot invalidation is logged
with vacuum on pg_class'
# at t/035_standby_logical_decoding.pl line 244.
[13:38:26.977](5001.028s) # poll_query_until timed out executing this query:
# select (confl_active_logicalslot = 1) from pg_stat_database_conflicts where
datname = 'testdb'
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[13:38:26.980](0.003s) not ok 24 - confl_active_logicalslot updated
[13:38:26.982](0.002s)
[13:38:26.982](0.000s) # Failed test 'confl_active_logicalslot updated'
# at t/035_standby_logical_decoding.pl line 251.
Timed out waiting confl_active_logicalslot to be updated at
t/035_standby_logical_decoding.pl line 251.
```
035_standby_logical_decoding.pl:
```
# This should trigger the conflict
$node_primary->safe_psql(
'testdb', qq[
CREATE TABLE conflict_test(x integer, y text);
DROP TABLE conflict_test;
VACUUM pg_class;
INSERT INTO flush_wal DEFAULT VALUES; -- see create table flush_wal
]);
$node_primary->wait_for_replay_catchup($node_standby);
# Check invalidation in the logfile and in pg_stat_database_conflicts
check_for_invalidation('row_removal_', $logstart, 'with vacuum on pg_class');
```
Is it possible that the vacuum command didn't remove tuples and then the
conflict was not triggered?
The flush_wal table added by Andres should guarantee that the WAL is flushed, so
the only reason I can think about is indeed that the vacuum did not remove
tuples (
but I don't get why/how that could be the case).
It seems we can't confirm this because there is not
enough information.
Right, and looking at its status history most of the time the test is green
(making it
even more difficult to diagnose).
Maybe "vacuum verbose" can be used to provide more
information.
I can see that dikkop "belongs" to Tomas (adding Tomas to this thread).
Tomas, do you think it would be possible to run some
035_standby_logical_decoding.pl
manually with "vacuum verbose" in the test mentioned above? (or any other way
you can think
about that would help diagnose this random failure?).
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com