Hello pgsql-hackers,

There appears to be an issue with timeline switching in archive
recovery mode when the startup process is interrupted for some
external reason (e.g. a system power loss) within a specific tight
time window (after the new history file is created, but before the
end-of-recovery record is written). If this happens and the Postgres
instance is launched again, it performs the usual crash recovery and
successfully reaches the 'in production' state, but it continues on
the previous timeline. This leaves the WAL in an inconsistent state:
new WAL segment files are written on the old timeline, while the new,
now orphaned, timeline is recorded in its history file.
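
For illustration, the mismatch is easy to observe from the outside;
the commands below are just one way to check it, assuming $PGDATA
points at the affected data directory:

pg_controldata "$PGDATA" | grep "Latest checkpoint's TimeLineID"
# -> still reports the old timeline, e.g. 1
ls "$PGDATA"/pg_wal/*.history
# -> a history file for the new timeline (e.g. 00000002.history) exists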

This issue can reveal itself in a streaming replication scenario, for
example:
1. Assume we have instance_1 (Master) and instance_2 (Replica);
2. instance_1 (Master) fails for some reason, and instance_2 (Replica)
starts to promote. While instance_2 is still in `StartupXLOG()`,
somewhere between `writeTimeLineHistory()` and
`CreateEndOfRecoveryRecord()`, the instance_2 server suffers, for
example, a power loss.
3. Once the instance_2 server is up again, instance_2 is restarted.
The timeline problem already exists at this point, but it is very
likely to go unnoticed, as instance_2 (now Master) is accepting
connections and working.
4. Bring back instance_1 as Replica.
5. Switch instance_1 and instance_2 back to the original
configuration. Here, when we try to start instance_2 as Replica,
we get a FATAL:
"FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/303FF90 on timeline 1, but in the
history of the requested timeline, the server forked off from that
timeline at 0/3023538."

I agree that one has to be very unlucky to get into such a situation,
with two consecutive failures hitting that specific time window, but
if it does happen, the result looks quite severe and is hard to
debug. We ran into this situation in our synthetic tests.

To reproduce the issue in a stable manner on the latest code from the
Postgres master branch, I used the attached
'test-code-for-timeline-issue.patch'. It postpones execution in
`StartupXLOG()` when we need to hit the time window in question. The
detailed reproduction steps I used with the patch are in the attached
file 'steps - NG scenario.txt'.

A possible fix that we are now considering is to postpone the removal
of the signal files ("recovery.signal" and "standby.signal") until the
end-of-recovery record has been written, so that an interrupted
promotion is retried on the next start. The only minor negative
outcome of this approach seems to be the possible assignment of one
more timeline ID. But we would like to get more feedback from the
community: do you foresee any drawbacks to this approach? Please
advise.
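
To make the idea concrete, here is a rough sketch (not a patch) of the
ordering we have in mind near the end of `StartupXLOG()`. Guards such
as checking which signal files were actually found are omitted, the
`writeTimeLineHistory()` arguments are elided, and `durable_unlink()`
is the existing helper that makes the unlink itself crash-safe:

	/* Promotion path at the end of StartupXLOG(), proposed ordering. */

	/* 1. Make the new timeline official. */
	writeTimeLineHistory(...);

	/* 2. Durably mark recovery as finished on the new timeline. */
	CreateEndOfRecoveryRecord();

	/*
	 * 3. Only now remove the signal files. If we crash anywhere above,
	 * the next start still finds recovery.signal/standby.signal,
	 * re-enters archive recovery, and retries the promotion (possibly
	 * consuming one more timeline ID) instead of silently resuming the
	 * old timeline.
	 */
	durable_unlink(STANDBY_SIGNAL_FILE, FATAL);
	durable_unlink(RECOVERY_SIGNAL_FILE, FATAL);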

Many thanks in advance!

P.S. The topic above is a bit similar to this old one -
https://www.postgresql.org/message-id/flat/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq%3DXxmXLmF6ipNFecbShQ%40mail.gmail.com.

Best regards,
Roman Eskin
TEST_PG_PORT_MASTER=5433
TEST_PG_PORT_REPLICA=5434

TEST_PG_DIR_MASTER=~/postgres_data_master
TEST_PG_DIR_REPLICA=~/postgres_data_replica


# Setup master
mkdir $TEST_PG_DIR_MASTER

pg_ctl -D $TEST_PG_DIR_MASTER initdb

sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '*'/g" $TEST_PG_DIR_MASTER/postgresql.conf
sed -i "s/#port = 5432/port = $TEST_PG_PORT_MASTER/g" $TEST_PG_DIR_MASTER/postgresql.conf

pg_ctl -D $TEST_PG_DIR_MASTER -l logfile start

# Setup replica
pg_basebackup -h localhost --checkpoint=fast -D $TEST_PG_DIR_REPLICA -R -X stream --port=$TEST_PG_PORT_MASTER
sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '*'/g" $TEST_PG_DIR_REPLICA/postgresql.conf
sed -i "s/port = $TEST_PG_PORT_MASTER/port = $TEST_PG_PORT_REPLICA/g" $TEST_PG_DIR_REPLICA/postgresql.conf

pg_ctl -D $TEST_PG_DIR_REPLICA -l logfile_replica start

# Check replication is working
psql postgres --port=$TEST_PG_PORT_MASTER -c "
create table t (
  a integer,
  b text,
  d timestamp without time zone
);
insert into t select i, 'test-master-2-replica'||i, now() from generate_series(1, 5) i;
"
sleep 2
psql postgres --port=$TEST_PG_PORT_REPLICA -c "select * from t;";

# Shutdown master and switch to replica
pg_ctl -D $TEST_PG_DIR_MASTER stop

# enable test changes that will postpone CreateEndOfRecoveryRecord()
touch $TEST_PG_DIR_REPLICA/delay_recovery.signal

psql postgres --port=$TEST_PG_PORT_REPLICA -c "SELECT pg_promote();" &

# Immediately shut down Postgres (in real life this could be an accidental power loss, or any other system-level problem...)
sleep 5 # sleep a while to ensure that the history file was created
pg_ctl -D $TEST_PG_DIR_REPLICA stop -m i
sleep 5
# cleanup - we do not need the file anymore
rm $TEST_PG_DIR_REPLICA/delay_recovery.signal
# start replica again
pg_ctl -D $TEST_PG_DIR_REPLICA  -l logfile_replica start
# after this point the replica is started and working, but the timeline is still 1 !!!
# it can be checked with `pg_controldata $TEST_PG_DIR_REPLICA`
# and we will get problems later once we switch the replica back to its original role (refer to the steps below).
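# An illustrative way to check it: the control file still points at timeline 1
# while a history file for timeline 2 already exists in pg_wal:
pg_controldata $TEST_PG_DIR_REPLICA | grep "Latest checkpoint's TimeLineID"
ls $TEST_PG_DIR_REPLICA/pg_wal/*.history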

# Bring back master as replica
touch $TEST_PG_DIR_MASTER/standby.signal
sed -i "s/#primary_conninfo = ''/primary_conninfo = 'host=localhost port=$TEST_PG_PORT_REPLICA'/g" $TEST_PG_DIR_MASTER/postgresql.conf
pg_ctl -D $TEST_PG_DIR_MASTER -l logfile start

# Check replication is working
psql postgres --port=$TEST_PG_PORT_REPLICA -c "
create table t2 (
  a integer,
  b text,
  d timestamp without time zone
);
insert into t2 select i, 'test-replica-2-master'||i, now() from generate_series(1, 5) i;
"
sleep 2
psql postgres --port=$TEST_PG_PORT_MASTER -c "select * from t2;";

# Switch back
pg_ctl -D $TEST_PG_DIR_REPLICA stop

psql postgres --port=$TEST_PG_PORT_MASTER -c "SELECT pg_promote();"

touch $TEST_PG_DIR_REPLICA/standby.signal
pg_ctl -D $TEST_PG_DIR_REPLICA -l logfile_replica start

# attempt to start replica here will result in:
#       'pg_ctl: could not start server'
#       'Examine the log output.'
# and in the log:
#       2025-01-16 17:30:27.174 +10 [4806] FATAL:  requested timeline 2 is not a child of this server's history
#       2025-01-16 17:30:27.174 +10 [4806] DETAIL:  Latest checkpoint is at 0/303FF90 on timeline 1, but in the history of the requested timeline, the server forked off from that timeline at 0/3023538.
#
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901..bdfc92cd6e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6126,6 +6126,19 @@ StartupXLOG(void)
        Insert->fullPageWrites = lastFullPageWrites;
        UpdateFullPageWrites();
 
+       /*
+        * Code below is for test purposes only to reveal possible timeline issue.
+        */
+       {
+               struct stat st;
+
+               elog(LOG, "Start delay in StartupXLOG()");
+
+               while (stat("delay_recovery.signal", &st) == 0)
+                       pg_usleep(10000L); /* sleep for 10 ms */
+
+               elog(LOG, "Stop delay in StartupXLOG()");
+       }
        /*
         * Emit checkpoint or end-of-recovery record in XLOG, if required.
         */
