Thomas Munro <thomas.mu...@enterprisedb.com> writes: > On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> Fair enough. But I'd still like an explanation of why only about >> half of the population is showing a failure here. Seems like every >> machine should be seeing the LSN as moving backwards in this test. >> So (a) why aren't they all failing, and (b) should we change the >> test to make sure every platform sees that happening?
> Every machine sees the LSN moving backwards, but the code path that > had the assertion only reached if it decides to interpolate, which is > timing dependent: there needs to be a future sample in the lag > tracking buffer, which I guess is not the case in those runs. I'm dissatisfied with this explanation because if it's just timing, it doesn't seem very likely that some machines would reproduce the failure every single time while others never would. Maybe that can be blamed on kernel scheduler vagaries + different numbers of cores, but I can't escape the feeling that there's something here we've not fully understood. While chasing after this earlier today, I turned on some debug logging and noted that the standby's reports look like 2017-04-23 15:46:46.206 EDT [34829] LOG: database system is ready to accept read only connections 2017-04-23 15:46:46.212 EDT [34834] LOG: fetching timeline history file for timeline 2 from primary server 2017-04-23 15:46:46.212 EDT [34834] LOG: started streaming WAL from primary at 0/3000000 on timeline 1 2017-04-23 15:46:46.213 EDT [34834] LOG: sending write 0/3020000 flush 0/3028470 apply 0/3028470 2017-04-23 15:46:46.214 EDT [34834] LOG: replication terminated by primary server 2017-04-23 15:46:46.214 EDT [34834] DETAIL: End of WAL reached on timeline 1 at 0/3028470. 2017-04-23 15:46:46.214 EDT [34834] LOG: sending write 0/3028470 flush 0/3028470 apply 0/3028470 2017-04-23 15:46:46.214 EDT [34830] LOG: new target timeline is 2 2017-04-23 15:46:46.214 EDT [34834] LOG: restarted WAL streaming at 0/3000000 on timeline 2 2017-04-23 15:46:46.228 EDT [34834] LOG: sending write 0/3020000 flush 0/3028470 apply 0/3028470 So you're right that the standby's reported "write" position can go backward, but it seems pretty darn odd that the flush and apply positions didn't go backward too. Is there a bug there? I remain of the opinion that if we can't tell from the transmitted data whether a timeline switch has caused the position to go backward, then that's a protocol shortcoming that ought to be fixed. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers