Hi Horiguchi-san,
On 2021/05/31 16:58, Kyotaro Horiguchi wrote:
So, I started a thread for this topic diverged from the following
thread.
https://www.postgresql.org/message-id/4698027d-5c0d-098f-9a8e-8cf09e36a...@nttcom.co.jp_1
So, what should we do for the user? I think we should put some notes
in postgresql.conf or in the documentation. For example, something
like this:
I'm not sure about the exact configuration you have in mind, but that
would happen on the cascaded standby in the case where the upstream
promotes. In this case, the history file for the new timeline is
archived twice. walreceiver triggers archiving of the new history
file at the time of the promotion, then startup does the same when it
restores the file from archive. Is it what you complained about?
Thank you for creating a new thread and explaining this.
We are not using cascade replication in our environment, but I think
the situation is similar. As an overview, when I do a promote,
the archive_command fails due to the history file.
I've created a reproduction script that includes building replication,
and I'll share it with you. (I used Robert's test.sh as a reference
for creating the reproduction script. Thanks)
The scenario (sr_test_historyfile.sh) is as follows.
#1 Start pgprimary as a main
#2 Create standby
#3 Start pgstandby as a standby
#4 Execute archive command
#5 Shutdown pgprimary
#6 Start pgprimary as a standby
#7 Promote pgprimary
#8 Execute archive_command again, but failed since duplicate history
file exists (see pgstandby.log)
Note that this may not be appropriate if you consider it as a recovery
procedure for replication configuration. However, I'm sharing it as it is
because this seems to be the procedure used in the customer's environment
(PG-REX).
The same workaround using the alternative archive script works for the
case.
We could check pg_wal before fetching archive, however, archiving is
not controlled so strictly that duplicate archiving never happens and
I think we choose possible duplicate archiving than having holes in
archive. (so we suggest the "test ! -f" script)
====
Note: If you use archive_mode=always, the archive_command on the
standby side should not be used "test ! -f".
====
It could be one workaround. However, I would suggest not to overwrite
existing files (with a file with different content) to protect archive
from corruption.
We might need to write that in the documentation...
I think you're right, replacing it with an alternative archive script
that includes the cmp command will resolve the error. The reason is that
I checked with the diff command that the history files are identical.
=====
$ diff -s pgprimary/arc/00000002.history pgstandby/arc/00000002.history
Files pgprimary/arc/00000002.history and pgstandby/arc/00000002.history are
identical
=====
Regarding "test ! -f",
I am wondering how many people are using the test command for
archive_command. If I remember correctly, the guide provided by
NTT OSS Center that we are using does not recommend using the test command.
Regards,
Tatsuro Yamada
2021-06-01 13:11:50.168 JST [3041] LOG: starting PostgreSQL 14beta1 on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39),
64-bit
2021-06-01 13:11:50.169 JST [3041] LOG: listening on IPv6 address "::1", port
5450
2021-06-01 13:11:50.169 JST [3041] LOG: listening on IPv4 address "127.0.0.1",
port 5450
2021-06-01 13:11:50.171 JST [3041] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5450"
2021-06-01 13:11:50.176 JST [3042] LOG: database system was shut down at
2021-06-01 13:11:50 JST
2021-06-01 13:11:50.179 JST [3041] LOG: database system is ready to accept
connections
2021-06-01 13:11:56.890 JST [3041] LOG: received fast shutdown request
2021-06-01 13:11:56.891 JST [3041] LOG: aborting any active transactions
2021-06-01 13:11:56.891 JST [3041] LOG: background worker "logical replication
launcher" (PID 3049) exited with exit code 1
2021-06-01 13:11:56.891 JST [3043] LOG: shutting down
2021-06-01 13:11:56.946 JST [3041] LOG: database system is shut down
2021-06-01 13:12:00.023 JST [3120] LOG: starting PostgreSQL 14beta1 on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39),
64-bit
2021-06-01 13:12:00.023 JST [3120] LOG: listening on IPv6 address "::1", port
5450
2021-06-01 13:12:00.023 JST [3120] LOG: listening on IPv4 address "127.0.0.1",
port 5450
2021-06-01 13:12:00.025 JST [3120] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5450"
2021-06-01 13:12:00.028 JST [3121] LOG: database system was shut down at
2021-06-01 13:11:56 JST
cp: cannot stat ‘arc/00000002.history’: No such file or directory
2021-06-01 13:12:00.031 JST [3121] LOG: entering standby mode
cp: cannot stat ‘arc/000000010000000000000007’: No such file or directory
2021-06-01 13:12:00.035 JST [3121] LOG: consistent recovery state reached at
0/70000A0
2021-06-01 13:12:00.035 JST [3121] LOG: invalid record length at 0/70000A0:
wanted 24, got 0
2021-06-01 13:12:00.035 JST [3120] LOG: database system is ready to accept
read only connections
cp: cannot stat ‘arc/00000002.history’: No such file or directory
cp: cannot stat ‘arc/000000010000000000000007’: No such file or directory
cp: cannot stat ‘arc/00000002.history’: No such file or directory
cp: cannot stat ‘arc/000000010000000000000007’: No such file or directory
2021-06-01 13:12:03.119 JST [3121] LOG: received promote request
2021-06-01 13:12:03.119 JST [3121] LOG: redo is not required
cp: cannot stat ‘arc/000000010000000000000007’: No such file or directory
cp: cannot stat ‘arc/00000002.history’: No such file or directory
2021-06-01 13:12:03.134 JST [3121] LOG: selected new timeline ID: 2
2021-06-01 13:12:03.156 JST [3121] LOG: archive recovery complete
cp: cannot stat ‘arc/00000001.history’: No such file or directory
2021-06-01 13:12:03.166 JST [3120] LOG: database system is ready to accept
connections
2021-06-01 13:12:13.296 JST [3136] ERROR: requested WAL segment
000000020000000000000008 has already been removed
2021-06-01 13:12:13.296 JST [3136] STATEMENT: START_REPLICATION 0/7000000
TIMELINE 2
2021-06-01 13:11:50.564 JST [3065] LOG: starting PostgreSQL 14beta1 on
x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39),
64-bit
2021-06-01 13:11:50.565 JST [3065] LOG: listening on IPv6 address "::1", port
5451
2021-06-01 13:11:50.565 JST [3065] LOG: listening on IPv4 address "127.0.0.1",
port 5451
2021-06-01 13:11:50.568 JST [3065] LOG: listening on Unix socket
"/tmp/.s.PGSQL.5451"
2021-06-01 13:11:50.572 JST [3066] LOG: database system was interrupted; last
known up at 2021-06-01 13:11:50 JST
cp: cannot stat ‘arc/00000002.history’: No such file or directory
2021-06-01 13:11:50.684 JST [3066] LOG: entering standby mode
cp: cannot stat ‘arc/000000010000000000000002’: No such file or directory
2021-06-01 13:11:50.694 JST [3066] LOG: redo starts at 0/2000028
2021-06-01 13:11:50.695 JST [3066] LOG: consistent recovery state reached at
0/2000100
2021-06-01 13:11:50.695 JST [3065] LOG: database system is ready to accept
read only connections
cp: cannot stat ‘arc/000000010000000000000003’: No such file or directory
2021-06-01 13:11:50.708 JST [3074] LOG: started streaming WAL from primary at
0/3000000 on timeline 1
2021-06-01 13:11:56.945 JST [3074] LOG: replication terminated by primary
server
2021-06-01 13:11:56.945 JST [3074] DETAIL: End of WAL reached on timeline 1 at
0/70000A0.
2021-06-01 13:11:56.945 JST [3074] FATAL: could not send end-of-streaming
message to primary: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
cp: cannot stat ‘arc/00000002.history’: No such file or directory
cp: cannot stat ‘arc/000000010000000000000007’: No such file or directory
2021-06-01 13:11:56.949 JST [3066] LOG: invalid record length at 0/70000A0:
wanted 24, got 0
2021-06-01 13:11:56.950 JST [3112] FATAL: could not connect to the primary
server: connection to server on socket "/tmp/.s.PGSQL.5450" failed: No such
file or directory
Is the server running locally and accepting connections on that
socket?
cp: cannot stat ‘arc/00000002.history’: No such file or directory
cp: cannot stat ‘arc/000000010000000000000007’: No such file or directory
2021-06-01 13:12:01.966 JST [3135] LOG: started streaming WAL from primary at
0/7000000 on timeline 1
2021-06-01 13:12:03.165 JST [3135] LOG: replication terminated by primary
server
2021-06-01 13:12:03.165 JST [3135] DETAIL: End of WAL reached on timeline 1 at
0/70000A0.
2021-06-01 13:12:03.165 JST [3135] LOG: fetching timeline history file for
timeline 2 from primary server
cp: cannot stat ‘arc/00000002.history’: No such file or directory
cp: cannot stat ‘arc/00000003.history’: No such file or directory
2021-06-01 13:12:03.176 JST [3066] LOG: restored log file "00000002.history"
from archive
2021-06-01 13:12:03.178 JST [3066] LOG: new target timeline is 2
cp: cannot stat ‘arc/000000020000000000000007’: No such file or directory
2021-06-01 13:12:03.186 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:03.186 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:03.191 JST [3066] LOG: restored log file
"000000010000000000000007" from archive
2021-06-01 13:12:03.202 JST [3066] LOG: invalid record length at 0/70000A0:
wanted 24, got 0
2021-06-01 13:12:03.202 JST [3135] LOG: restarted WAL streaming at 0/7000000
on timeline 2
2021-06-01 13:12:04.191 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:04.191 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:05.195 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:05.195 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:05.195 JST [3071] WARNING: archiving write-ahead log file
"00000002.history" failed too many times, will try again later
2021-06-01 13:12:05.198 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:05.198 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:06.204 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:06.204 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:07.210 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:07.210 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:07.210 JST [3071] WARNING: archiving write-ahead log file
"00000002.history" failed too many times, will try again later
2021-06-01 13:12:13.270 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:13.270 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:13.296 JST [3135] FATAL: could not receive data from WAL
stream: ERROR: requested WAL segment 000000020000000000000008 has already been
removed
cp: cannot stat ‘arc/00000003.history’: No such file or directory
cp: cannot stat ‘arc/000000020000000000000009’: No such file or directory
2021-06-01 13:12:13.306 JST [3197] LOG: started streaming WAL from primary at
0/9000000 on timeline 2
2021-06-01 13:12:14.276 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:14.276 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:15.283 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:15.283 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:15.283 JST [3071] WARNING: archiving write-ahead log file
"00000002.history" failed too many times, will try again later
2021-06-01 13:12:15.288 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:15.288 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:16.294 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:16.294 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:17.300 JST [3071] LOG: archive command failed with exit code 1
2021-06-01 13:12:17.300 JST [3071] DETAIL: The failed archive command was:
test ! -f arc/00000002.history && cp pg_wal/00000002.history
arc/00000002.history
2021-06-01 13:12:17.300 JST [3071] WARNING: archiving write-ahead log file
"00000002.history" failed too many times, will try again later
#!/bin/bash
PG_PRIMARY_PORT=5450
PG_STANDBY_PORT=5451
# Cleanup anything left over from previous runs.
for d in pgprimary pgstandby; do
if test -d $d; then
pg_ctl stop -D $d;
rm -rf $d
fi
rm -f $d.log
done
# Echo commands from this point onward and exit on failure.
set -ex
# Initialize and Create primary.
# NB: Must enable archiving, but only for history files not WAL files.
initdb -D pgprimary --encoding=UTF8 --no-locale
# Create WAL archive directory
mkdir -p pgprimary/arc
cat >> pgprimary/postgresql.auto.conf <<EOM
port=$PG_PRIMARY_PORT
archive_mode=always
archive_command = 'test ! -f arc/%f && cp %p arc/%f'
restore_command = 'cp arc/%f %p'
recovery_target_timeline = 'latest'
EOM
#1 Start pgprimary as a main
pg_ctl start -D pgprimary -l pgprimary.log
#2 Create standby
pg_basebackup -D pgstandby -R -d "port=$PG_PRIMARY_PORT"
cat >> pgstandby/postgresql.auto.conf <<EOM
port=$PG_STANDBY_PORT
archive_mode=always
archive_command = 'test ! -f arc/%f && cp %p arc/%f'
restore_command = 'cp arc/%f %p'
recovery_target_timeline = 'latest'
EOM
#3 Start pgstandby as a standby
touch pgstandby/standby.signal
pg_ctl start -D pgstandby -l pgstandby.log
sleep 3
#4 Execute archive command
ls -l pgprimary/arc
psql -p $PG_PRIMARY_PORT -c "create table test (a int);"
# These are the trigger for archive_command
psql -p $PG_PRIMARY_PORT -c "insert into test values (1); checkpoint; select
pg_switch_wal();" # OK
psql -p $PG_PRIMARY_PORT -c "insert into test values (1); checkpoint; select
pg_switch_wal();" # OK
psql -p $PG_PRIMARY_PORT -c "insert into test values (1); checkpoint; select
pg_switch_wal();" # OK
sleep 3
ls -l pgprimary/arc
#5 Shutdown pgprimary
pg_ctl stop -D pgprimary
sleep 3
#6 Start pgprimary as a standby
touch pgprimary/standby.signal
pg_ctl start -D pgprimary -l pgprimary.log
sleep 3
#7 Promote pgprimary
pg_ctl promote -D pgprimary
sleep 10
#8 Execute archive_command again, but failed since duplicate history file exist
(see pgstandby.log)
ls -l pgprimary/arc
ls -l pgstandby/arc
# These are the trigger for archive_command
echo "---- NG ---- (If we use archive_mode=on, archive_command is successful)"
psql -p $PG_PRIMARY_PORT -c "insert into test values (1); checkpoint; select
pg_switch_wal();" # NG: archive_command failed
psql -p $PG_PRIMARY_PORT -c "insert into test values (1); checkpoint; select
pg_switch_wal();" # NG: archive_command failed
psql -p $PG_PRIMARY_PORT -c "insert into test values (1); checkpoint; select
pg_switch_wal();" # NG: archive_command failed
ls -l pgprimary/arc
ls -l pgstandby/arc
exit