We're *still* not out of the woods with 026_overwrite_contrecord.pl,
as we are continuing to see occasional "mismatching overwritten LSN"
failures, further down in the test where it tries to start up the
standby:

  sysname   |    branch     |      snapshot       |     stage     | l
------------+---------------+---------------------+---------------+------------------------------------------------------------------------------------------------------------
 spurfowl   | REL_13_STABLE | 2021-10-18 03:56:26 | recoveryCheck | 2021-10-18 00:08:09.324 EDT [2455:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 sidewinder | HEAD          | 2021-10-19 04:32:36 | recoveryCheck | 2021-10-19 06:46:23.168 CEST [26393:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 francolin  | REL9_6_STABLE | 2021-10-26 01:41:39 | recoveryCheck | 2021-10-26 01:48:05.646 UTC [3417202][][1/0:0] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-05 00:20:03 | recoveryCheck | 2021-11-05 02:58:12.146 CET [61848fb3.28d157:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 lapwing    | REL_11_STABLE | 2021-11-05 17:24:49 | recoveryCheck | 2021-11-05 17:39:29.741 UTC [9831:6] FATAL:  mismatching overwritten LSN 0/1FFE014 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-10 02:51:12 | recoveryCheck | 2021-11-10 04:03:33.576 CET [73561:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 petalura   | HEAD          | 2021-11-16 15:20:03 | recoveryCheck | 2021-11-16 18:16:47.875 CET [6193e77f.35b87f:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 morepork   | HEAD          | 2021-11-17 03:45:36 | recoveryCheck | 2021-11-17 04:57:04.359 CET [32089:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
 spurfowl   | REL_10_STABLE | 2021-11-22 22:21:03 | recoveryCheck | 2021-11-22 17:29:35.520 EST [16011:6] FATAL:  mismatching overwritten LSN 0/1FFE018 -> 0/1FFE000
(9 rows)

Looking at adjacent successful runs, it seems that the exact point
where the "missing contrecord" starts varies substantially, even after
our previous fix to disable autovacuum in this test.  How could that be?

It's probably for the best though, because I think this is exposing
an actual bug that we would not have seen if the start point were
completely consistent.  I have not dug into the code, but it looks to
me like if the "consistent recovery state" is reached exactly at a
page boundary (0/1FFE000 in all these cases), then the standby expects
that to be what the OVERWRITE_CONTRECORD record will point at.  But
actually it points to the first WAL record on that page, resulting
in a bogus failure.
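
To make the page-boundary observation concrete, here is a small
sketch that decodes the two LSNs from each failure message and checks
where they sit relative to the WAL page.  It assumes the default
XLOG_BLCKSZ of 8192 bytes; the helper names (parse_lsn, describe) are
made up for illustration and are not taken from the server code.

```python
# Assumption: default WAL page size (XLOG_BLCKSZ) of 8192 bytes.
XLOG_BLCKSZ = 8192


def parse_lsn(text):
    """Convert an LSN printed as 'hi/lo' (both hex) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def describe(reported, expected):
    """Compare the two LSNs from a 'mismatching overwritten LSN' message."""
    r = parse_lsn(reported)
    e = parse_lsn(expected)
    return {
        # Do both LSNs fall on the same WAL page?
        "same_page": r // XLOG_BLCKSZ == e // XLOG_BLCKSZ,
        # Is the second LSN exactly at the start of a page?
        "expected_on_boundary": e % XLOG_BLCKSZ == 0,
        # Distance in bytes between the two values.
        "offset": r - e,
    }


# The two distinct LSN pairs that appear in the buildfarm failures.
for reported in ("0/1FFE018", "0/1FFE014"):
    print(reported, "->", "0/1FFE000", describe(reported, "0/1FFE000"))
```

In every reported failure the second LSN is exactly on a page
boundary and the first is 24 bytes past it (20 bytes on lapwing).
That gap looks like the size of a short WAL page header, which would
fit the theory that one value is the page start and the other is the
first record on that page, but that reading is an inference from the
numbers and not something confirmed against the code.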

                        regards, tom lane

