On 2020/12/09 17:43, Kyotaro Horiguchi wrote:
Hello.

We found a behavioral change (which seems to be a bug) in recovery at PG13.

The following steps might seem somewhat strange, but the replication code deliberately copes with the case. This is a sequence seen while operating an HA cluster using Pacemaker.

- Run initdb to create a primary.
- Set archive_mode=on on the primary.
- Start the primary.
- Create a standby using pg_basebackup from the primary.
- Stop the standby.
- Stop the primary.
- Put standby.signal to the primary then start it.
- Promote the primary.
- Start the standby.

Until PG12, the primary signals end-of-timeline to the standby and switches to the next timeline. Since PG13, that doesn't happen and the standby continues to request the segment of the older timeline, which no longer exists.

FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 000000010000000000000003 has already been removed

This is because WalSndSegmentOpen() can fail to detect a timeline switch on a historic timeline, due to the use of a wrong variable for that check. It is using state->seg.ws_segno, which seems to be a thinko introduced when the surrounding code was refactored in 709d003fbd.

The first patch detects the wrong behavior. The second small patch fixes it.
Thanks for reporting this! This looks like a bug.

When I applied the two patches to the master branch and ran "make check-world", I got the following error.

============== creating database "contrib_regression" ==============
# Looks like you planned 37 tests but ran 36.
# Looks like your test exited with 255 just after 36.
t/001_stream_rep.pl ..................
Dubious, test returned 255 (wstat 65280, 0xff00)
Failed 1/37 subtests
...
Test Summary Report
-------------------
t/001_stream_rep.pl                (Wstat: 65280 Tests: 36 Failed: 0)
  Non-zero exit status: 255
  Parse errors: Bad plan.  You planned 37 tests but ran 36.
Files=21, Tests=239, 302 wallclock secs ( 0.10 usr  0.05 sys + 41.69 cusr 39.84 csys = 81.68 CPU)
Result: FAIL
make[2]: *** [check] Error 1
make[1]: *** [check-recovery-recurse] Error 2
make[1]: *** Waiting for unfinished jobs....
t/070_dropuser.pl ......... ok

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION