On Tue, Jun 8, 2021 at 11:13 AM Dilip Kumar <dilipbal...@gmail.com> wrote:
>
> # Wait until the node exits recovery.
> $standby->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
> or die "Timed out while waiting for promotion";
>
> I will try to generate a version for 9.6 based on this idea and see how it
> goes
I have adapted it for 9.6, but I am seeing a crash (both with and without the
fix). I could not figure out the reason: no core dump was generated, even
though I changed the pg_ctl invocation in PostgresNode.pm to use "-c" so that
it can generate core files. This is the log from the cascading node
(025_stuck_on_old_timeline_cascade.log):
-------------
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_52dW/archives/000000010000000000000003’: No such file or directory
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
FATAL: could not receive database system identifier and timeline ID from the primary server: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
--------------
The attached logs are from a run without the fix.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
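(For reference, the PostgresNode.pm tweak mentioned above is roughly along
these lines. This is only a sketch, assuming the 9.6-era start() routine
launches the server through TestLib::system_log; the exact invocation in the
tree may differ. pg_ctl's -c/--core-files switch raises the core file size
limit so that a crashing server process is allowed to dump core.)

# In src/test/perl/PostgresNode.pm, sub start: pass -c (--core-files) to
# pg_ctl so that crashing server processes may produce core files.
TestLib::system_log('pg_ctl', '-c', '-w', '-D', $self->data_dir,
	'-l', $self->logfile, 'start');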
From 42886664062e4aae35ee33a6b3a87bc242ec86ad Mon Sep 17 00:00:00 2001
From: Robert Haas <rh...@postgresql.org>
Date: Fri, 4 Jun 2021 10:55:11 -0400
Subject: [PATCH v6] Fix corner case failure of new standby to follow new primary.

This only happens if (1) the new standby has no WAL available locally,
(2) the new standby is starting from the old timeline, (3) the promotion
happened in the WAL segment from which the new standby is starting,
(4) the timeline history file for the new timeline is available from the
archive but the WAL files for it are not (i.e. this is a race),
(5) the WAL files for the new timeline are available via streaming, and
(6) recovery_target_timeline='latest'.

Commit ee994272ca50f70b53074f0febaec97e28f83c4e introduced this logic
and was an improvement over the previous code, but it mishandled this
case. If recovery_target_timeline='latest' and restore_command is set,
validateRecoveryParameters() can change recoveryTargetTLI to be
different from receiveTLI. If streaming is then tried afterward,
expectedTLEs gets initialized with the history of the wrong timeline.
It's supposed to be a list of entries explaining how to get to the
target timeline, but in this case it ends up with a list of entries
explaining how to get to the new standby's original timeline, which
isn't right.

Dilip Kumar and Robert Haas, reviewed by Kyotaro Horiguchi.

Discussion: http://postgr.es/m/CAFiTN-sE-jr=lb8jquxeqikd-ux+jhixyh4ydizmpedgqku...@mail.gmail.com
---
 src/backend/access/transam/xlog.c                |  2 +-
 src/test/recovery/t/025_stuck_on_old_timeline.pl | 96 ++++++++++++++++++++++++
 src/test/recovery/t/cp_history_files             | 10 +++
 3 files changed, 107 insertions(+), 1 deletion(-)
 create mode 100644 src/test/recovery/t/025_stuck_on_old_timeline.pl
 create mode 100644 src/test/recovery/t/cp_history_files

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9ad6da9..a7ee14b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11839,7 +11839,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					if (readFile < 0)
 					{
 						if (!expectedTLEs)
-							expectedTLEs = readTimeLineHistory(receiveTLI);
+							expectedTLEs = readTimeLineHistory(recoveryTargetTLI);
 						readFile = XLogFileRead(readSegNo, PANIC,
 												receiveTLI,
 												XLOG_FROM_STREAM, false);
diff --git a/src/test/recovery/t/025_stuck_on_old_timeline.pl b/src/test/recovery/t/025_stuck_on_old_timeline.pl
new file mode 100644
index 0000000..1eafbbd
--- /dev/null
+++ b/src/test/recovery/t/025_stuck_on_old_timeline.pl
@@ -0,0 +1,96 @@
+
+# Copyright (c) 2021, PostgreSQL Global Development Group
+
+# Testing streaming replication where standby is promoted and a new cascading
+# standby (without WAL) is connected to the promoted standby. Both archiving
+# and streaming are enabled, but only the history file is available from the
+# archive, so the WAL files all have to be streamed. Test that the cascading
+# standby can follow the new primary (promoted standby).
+use strict;
+use warnings;
+use PostgresNode;
+use TestLib;
+use FindBin;
+use Test::More tests => 1;
+
+# Initialize primary node
+my $node_primary = get_new_node('primary');
+
+# Set up an archive command that will copy the history file but not the WAL
+# files. No real archive command should behave this way; the point is to
+# simulate a race condition where the new cascading standby starts up after
+# the timeline history file reaches the archive but before any of the WAL
+# files get there.
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+my $archivedir_primary = $node_primary->archive_dir;
+$node_primary->append_conf(
+	'postgresql.conf', qq(
+archive_command = '$^X "$FindBin::RealBin/cp_history_files" "%p" "$archivedir_primary/%f"'
+));
+$node_primary->start;
+
+# Take backup from primary
+my $backup_name = 'my_backup';
+$node_primary->backup($backup_name);
+
+# Create streaming standby linking to primary
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_primary, $backup_name,
+	allows_streaming => 1, has_streaming => 1, has_archiving => 1);
+$node_standby->start;
+
+# Take backup of standby, use -Xnone so that pg_wal is empty.
+$node_standby->backup($backup_name, backup_options => ['-Xnone']);
+
+# Create cascading standby but don't start it yet.
+# Must set up both streaming and archiving.
+my $node_cascade = get_new_node('cascade');
+$node_cascade->init_from_backup($node_standby, $backup_name,
+	has_streaming => 1);
+$node_cascade->enable_restoring($node_primary);
+
+# Promote the standby.
+$node_standby->promote;
+
+# Wait until the node exits recovery.
+$node_standby->poll_query_until('postgres', "SELECT pg_is_in_recovery() = 'f';")
+or die "Timed out while waiting for promotion";
+
+# Find next WAL segment to be archived
+my $walfile_to_be_archived = $node_standby->safe_psql('postgres',
+	"SELECT pg_xlogfile_name(pg_current_xlog_location());");
+
+# Make WAL segment eligible for archival
+$node_standby->safe_psql('postgres', 'SELECT pg_switch_xlog()');
+
+# Wait until the WAL segment has been archived.
+# Since the history file gets created on promotion and is archived before any
+# WAL segment, this is enough to guarantee that the history file was
+# archived.
+my $archive_wait_query =
+  "SELECT '$walfile_to_be_archived' <= last_archived_wal FROM pg_stat_archiver;";
+$node_standby->poll_query_until('postgres', $archive_wait_query)
+  or die "Timed out while waiting for WAL segment to be archived";
+my $last_archived_wal_file = $walfile_to_be_archived;
+
+# Start cascade node
+$node_cascade->start;
+
+# Create some content on promoted standby and check its presence on the
+# cascading standby.
+$node_standby->safe_psql('postgres', "CREATE TABLE tab_int AS SELECT 1 AS a");
+
+# Wait for the replication to catch up
+$node_standby->wait_for_catchup($node_cascade, 'replay',
+	$node_standby->lsn('insert'));
+
+# Check that cascading standby has the new content
+my $result =
+  $node_cascade->safe_psql('postgres', "SELECT count(*) FROM tab_int");
+print "cascade: $result\n";
+is($result, 1, 'check streamed content on cascade standby');
+
+# clean up
+$node_primary->teardown_node;
+$node_standby->teardown_node;
+$node_cascade->teardown_node;
diff --git a/src/test/recovery/t/cp_history_files b/src/test/recovery/t/cp_history_files
new file mode 100644
index 0000000..cfeea41
--- /dev/null
+++ b/src/test/recovery/t/cp_history_files
@@ -0,0 +1,10 @@
+#!/usr/bin/perl
+
+use File::Copy;
+use strict;
+use warnings;
+
+die "wrong number of arguments" if @ARGV != 2;
+my ($source, $target) = @ARGV;
+exit if $source !~ /history/;
+copy($source, $target) or die "couldn't copy $source to $target: $!";
-- 
1.8.3.1
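(The crash leaves little to go on in the TAP output, so a small debugging aid
may help. The following is only a sketch, not part of the patch: it assumes
the test's $result and $node_cascade variables and TestLib's slurp_file
helper from the 9.6-era test harness, appended at the end of
025_stuck_on_old_timeline.pl.)

# Debugging only: if the streamed-content check fails, emit the cascading
# standby's server log as TAP diagnostics so the failure can be inspected
# without digging through tmp_check.
diag("cascade log:\n" . TestLib::slurp_file($node_cascade->logfile))
  if $result ne '1';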
LOG: database system was interrupted while in recovery at log time 2021-06-08 14:13:52 IST
HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG: entering standby mode
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000002’: No such file or directory
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/3000000
LOG: database system is ready to accept read only connections
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000003’: No such file or directory
LOG: fetching timeline history file for timeline 2 from primary server
LOG: primary server contains no more WAL on requested timeline 1
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000003’: No such file or directory
LOG: primary server contains no more WAL on requested timeline 1
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000003’: No such file or directory
LOG: primary server contains no more WAL on requested timeline 1
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000003’: No such file or directory
LOG: primary server contains no more WAL on requested timeline 1
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000003’: No such file or directory
LOG: primary server contains no more WAL on requested timeline 1
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000003’: No such file or directory
LOG: primary server contains no more WAL on requested timeline 1
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000003’: No such file or directory
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
FATAL: could not receive database system identifier and timeline ID from the primary server: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
cp: cannot stat ‘/home/dilipkumar/work/PG/postgresql/src/test/recovery/tmp_check/data_primary_4Evv/archives/000000010000000000000003’: No such file or directory
FATAL: could not connect to the primary server: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/tmp/H1JFqD4gm3/.s.PGSQL.62229"?
LOG: received immediate shutdown request
LOG: database system is shut down
LOG: database system was shut down at 2021-06-08 14:13:51 IST
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: received immediate shutdown request
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: archiver process (PID 66818) exited with exit code 2
LOG: database system is shut down
LOG: database system was interrupted; last known up at 2021-06-08 14:13:52 IST
LOG: entering standby mode
LOG: redo starts at 0/2000028
LOG: consistent recovery state reached at 0/20000F8
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 0/3000000 on timeline 1
LOG: received promote request
FATAL: terminating walreceiver process due to administrator command
LOG: redo done at 0/20000F8
LOG: selected new timeline ID: 2
LOG: statement: SELECT pg_is_in_recovery() = 'f';
LOG: archive recovery complete
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: statement: SELECT pg_is_in_recovery() = 'f';
LOG: statement: SELECT pg_xlogfile_name(pg_current_xlog_location());
LOG: statement: SELECT pg_switch_xlog()
LOG: statement: SELECT '000000020000000000000003' <= last_archived_wal FROM pg_stat_archiver;
LOG: statement: CREATE TABLE tab_int AS SELECT 1 AS a
LOG: received immediate shutdown request
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: archiver process (PID 66845) exited with exit code 2
LOG: database system is shut down
Attachment: regress_log_025_stuck_on_old_timeline