On 2020/12/09 17:43, Kyotaro Horiguchi wrote:
Hello.

We found a behavioral change (which seems to be a bug) in recovery at
PG13.

The following steps might seem somewhat strange, but the replication
code deliberately copes with this case.  This is a sequence seen while
operating an HA cluster using Pacemaker.

- Run initdb to create a primary.
- Set archive_mode=on on the primary.
- Start the primary.

- Create a standby using pg_basebackup from the primary.
- Stop the standby.
- Stop the primary.

- Put standby.signal in the primary's data directory, then start it.
- Promote the primary.

- Start the standby.


Until PG12, the primary signals end-of-timeline to the standby and
switches to the next timeline.  Since PG13, that doesn't happen and
the standby keeps requesting the segment of the older timeline,
which no longer exists.

FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000003 has already been removed

This happens because WalSndSegmentOpen() can fail to detect a timeline
switch on a historic timeline: it checks the wrong variable.  It uses
state->seg.ws_segno, which seems to be a thinko introduced when the
surrounding code was refactored in 709d003fbd.
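
To illustrate the failure mode described above, here is a minimal Python model. It is not PostgreSQL source. It assumes that the stale value describes the previously opened segment, that the intended check compares the segment being requested, and that the default 16MB segment size is in use. The function names and the scenario values (switch position, timelines 1 and 2) are hypothetical.

```python
# Simplified model of the timeline choice made when a walsender opens a WAL
# segment while streaming a historic timeline. Illustrative only.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024          # assumed default segment size
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE


def lsn_to_segno(lsn):
    """Return the number of the segment containing byte position lsn."""
    return lsn // WAL_SEGMENT_SIZE


def wal_file_name(tli, segno):
    """Build the 24-hex-digit WAL file name for a timeline and segment."""
    return "%08X%08X%08X" % (
        tli,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def choose_tli(segno_to_check, send_tli, next_tli, valid_upto_lsn):
    """Pick the timeline to read from, given the segment number compared
    against the segment in which the historic timeline ends."""
    if segno_to_check == lsn_to_segno(valid_upto_lsn):
        return next_tli      # the switch happens inside this segment
    return send_tli


# Hypothetical scenario: timeline 1 ends inside segment 3, the standby asks
# for segment 3, and the reader state still describes segment 2.
switch_lsn = 3 * WAL_SEGMENT_SIZE + 0x28
requested_segno = 3
previously_open_segno = 2

# Comparing the stale segment number misses the switch.
buggy = wal_file_name(
    choose_tli(previously_open_segno, 1, 2, switch_lsn), requested_segno)

# Comparing the requested segment number detects it.
fixed = wal_file_name(
    choose_tli(requested_segno, 1, 2, switch_lsn), requested_segno)

print(buggy)
print(fixed)
```

In this model the stale comparison selects the timeline-1 file name, the same name that appears in the error message above, while comparing the requested segment selects the timeline-2 file.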

The first patch detects the wrong behavior.  The second small patch
fixes it.

Thanks for reporting this! This looks like a bug.

When I applied the two patches to the master branch and
ran "make check-world", I got the following error.

============== creating database "contrib_regression" ==============
# Looks like you planned 37 tests but ran 36.
# Looks like your test exited with 255 just after 36.
t/001_stream_rep.pl ..................
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/37 subtests
...
Test Summary Report
-------------------
t/001_stream_rep.pl                (Wstat: 65280 Tests: 36 Failed: 0)
  Non-zero exit status: 255
  Parse errors: Bad plan.  You planned 37 tests but ran 36.
Files=21, Tests=239, 302 wallclock secs ( 0.10 usr  0.05 sys + 41.69 cusr 39.84 csys = 81.68 CPU)
Result: FAIL
make[2]: *** [check] Error 1
make[1]: *** [check-recovery-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
t/070_dropuser.pl ......... ok


Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

