Re: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done

2025-01-21 Thread Roman Eskin

Hi Andrey,

Thank you for your feedback!


> I think here you can just specify target timeline for the standby instance_1
> and it will continue recovery from instance_2.


Most likely yes, but that nevertheless looks more like a workaround than a fix.
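For reference, the workaround Andrey suggests (pinning the standby to the new timeline explicitly) would look roughly like the following sketch. The data directory (here a temp dir) and the timeline ID '2' are assumptions for illustration; `recovery_target_timeline` is the real GUC involved:

```shell
# Sketch of the workaround: pin instance_1 to the timeline instance_2 ended
# up on, instead of relying on timeline auto-selection.
PGDATA_1=$(mktemp -d)                   # stand-in for instance_1's data directory
# '2' is an assumed timeline ID for illustration
echo "recovery_target_timeline = '2'" >> "$PGDATA_1/postgresql.conf"
touch "$PGDATA_1/standby.signal"        # make the instance start as a standby
line=$(grep recovery_target_timeline "$PGDATA_1/postgresql.conf")
echo "$line"
rm -rf "$PGDATA_1"
```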


> Persisting recovery signal file for some _timeout_ seems super dangerous to me.
> In distributed systems every extra _timeout_ is a source of complexity,
> uncertainty and despair.


The approach is not about persisting the signal files for some
timeout. Currently the files are removed in StartupXLOG() before
writeTimeLineHistory() and PerformRecoveryXLogAction() are called.
The suggestion is to move the file removal to after
PerformRecoveryXLogAction() inside StartupXLOG().


Best regards,
Roman Eskin






Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done

2025-01-16 Thread Roman Eskin

Hello pgsql-hackers,

There appears to be an issue with timeline switching in archive
recovery mode if the startup process is interrupted for some external
reason (e.g. system power loss) within a specific tight time window
(after the new history file is created, but before the
end-of-recovery record is written). If this happens and the Postgres
instance is launched again, it performs the usual crash recovery and
starts successfully into the 'in production' state, but continues on
the previous timeline. We thus end up with an inconsistent WAL state:
WAL segment files are written to the old timeline, while an orphaned
new timeline is present in the history file.

This issue can reveal itself in a streaming replication scenario, for
example:
1. Assume we have instance_1 (Master) and instance_2 (Replica);
2. instance_1 (Master) fails for any reason and instance_2 (Replica)
starts to promote. But while instance_2 is still in `StartupXLOG()`,
somewhere between `writeTimeLineHistory()` and
`CreateEndOfRecoveryRecord()`, we get, for example, a power loss on
the instance_2 server.
3. Once the instance_2 server is up again, instance_2 is restarted.
The timeline problem already exists at this point, but it is very
likely to go unnoticed, as instance_2 (now Master) is accepting
connections and working.
4. Bring back instance_1 as Replica.
5. Switch instance_1 and instance_2 back to the original
configuration. Here, when we try to start instance_2 as Replica,
we'll get a FATAL:
"FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/303FF90 on timeline 1, but in the
history of the requested timeline, the server forked off from that
timeline at 0/3023538."
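When debugging a case like this, the mismatch can be spotted by comparing the timeline from pg_controldata with the highest-numbered history file in pg_wal. A minimal sketch of that check follows; the data directory, history file, and control-data timeline are simulated here for illustration (in real use, `ctl_tli` would come from `pg_controldata`):

```shell
# Sketch: detect an orphaned timeline (history file newer than the active TLI).
dir=$(mktemp -d)
mkdir -p "$dir/pg_wal"
touch "$dir/pg_wal/00000002.history"   # simulate the orphaned history file
ctl_tli=1   # simulated; really: pg_controldata "$dir" output, TimeLineID line
latest=$(ls "$dir/pg_wal" | grep '\.history$' | sort | tail -n 1)
hist_tli=$((10#${latest%.history}))    # strip suffix, drop leading zeros
if [ "$hist_tli" -gt "$ctl_tli" ]; then
  msg="WARNING: history file for timeline $hist_tli exists, but server is on timeline $ctl_tli"
fi
echo "$msg"
rm -rf "$dir"
```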

I agree that one has to be very unlucky to get into such a situation,
with two consecutive failures hitting that specific time window, but
if it does happen, the result looks quite severe and is hard to
debug. We hit this situation in our synthetic tests.

To reproduce the issue in a stable manner on the latest code from the
Postgres master branch, I used the attached
'test-code-for-timeline-issue.patch'. It postpones execution in
`StartupXLOG()` so that the required time window can be hit. The
detailed reproduction steps I used with the patch are in the attached
file 'steps - NG scenario.txt'.

A possible fix that we are now considering is to postpone the removal
of the signal files ("recovery.signal" and "standby.signal") until
the end-of-recovery record is written, and thus retry an interrupted
promotion in case the previous one failed. The only insignificant
negative outcome of this approach seems to be the possible assignment
of one extra timeline identity. But we would like to get more
feedback from the community. Do you foresee any drawbacks to this
approach? Please advise.

Many thanks in advance!

P.S. The topic above is a bit similar to this old one -
https://www.postgresql.org/message-id/flat/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq%3DXxmXLmF6ipNFecbShQ%40mail.gmail.com.

Best regards,
Roman Eskin

--- steps - NG scenario.txt ---

TEST_PG_PORT_MASTER=5433
TEST_PG_PORT_REPLICA=5434

TEST_PG_DIR_MASTER=~/postgres_data_master
TEST_PG_DIR_REPLICA=~/postgres_data_replica


# Setup master
mkdir $TEST_PG_DIR_MASTER

pg_ctl -D $TEST_PG_DIR_MASTER initdb

sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '*'/g" $TEST_PG_DIR_MASTER/postgresql.conf
sed -i "s/#port = 5432/port = $TEST_PG_PORT_MASTER/g" $TEST_PG_DIR_MASTER/postgresql.conf

pg_ctl -D $TEST_PG_DIR_MASTER -l logfile start

# Setup replica
pg_basebackup -h localhost --checkpoint=fast -D $TEST_PG_DIR_REPLICA -R -X stream --port=$TEST_PG_PORT_MASTER
sed -i "s/#listen_addresses = 'localhost'/listen_addresses = '*'/g" $TEST_PG_DIR_REPLICA/postgresql.conf
sed -i "s/port = $TEST_PG_PORT_MASTER/port = $TEST_PG_PORT_REPLICA/g" $TEST_PG_DIR_REPLICA/postgresql.conf

pg_ctl -D $TEST_PG_DIR_REPLICA -l logfile_replica start

# Check replication is working
psql postgres --port=$TEST_PG_PORT_MASTER -c "
create table t (
  a integer,
  b text,
  d timestamp without time zone
);
insert into t select i, 'test-master-2-replica'||i , now() from 
generate_series(1, 5)i;
"
sleep 2
psql postgres --port=$TEST_PG_PORT_REPLICA -c "select * from t;";

# Shutdown master and switch to replica
pg_ctl -D $TEST_PG_DIR_MASTER stop

# enable test changes that will postpone CreateEndOfRecoveryRecord()
touch $TEST_PG_DIR_REPLICA/delay_recovery.signal

psql postgres --port=$TEST_PG_PORT_REPLICA -c "SELECT pg_promote();" &

# Shut down Postgres immediately (in real life that could be an accidental
# power loss, or any other system-level problem...)
sleep 5 # sleep a while to ensure that history file was created
pg_ctl -D $TEST_PG_DIR_REPLICA stop -m i
sleep 5
# cleanup - we do not need the file anymore
rm $TEST_PG_DIR_REPLICA/delay_recovery.signal
# start replica again
pg_ctl -D $TEST_PG_DIR_REPLICA  -l logfile_replica start
# after this point the replica is started and working, but the timeline is
# still 1 !!!
# it can be checked with `pg_controldata $TEST_PG_DIR_REPLICA`
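The pg_controldata check mentioned in the last comment can be scripted. A sketch that extracts the TimeLineID from pg_controldata-style output follows; the sample line is illustrative, not captured from a real run (in real use, `ctl_out` would be the output of `pg_controldata $TEST_PG_DIR_REPLICA`):

```shell
# Sketch: extract the current TimeLineID from pg_controldata output.
# Sample line for illustration; really: ctl_out=$(pg_controldata $TEST_PG_DIR_REPLICA)
ctl_out="Latest checkpoint's TimeLineID:       1"
# split on ':', take the first TimeLineID line, strip the padding spaces
tli=$(printf '%s\n' "$ctl_out" | awk -F: '/TimeLineID/ {gsub(/ /, "", $2); print $2; exit}')
echo "current timeline: $tli"
```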