Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

Tom Lane Sun, 23 Apr 2017 19:54:06 -0700

Thomas Munro <[email protected]> writes:
> On Sun, Apr 23, 2017 at 6:01 PM, Tom Lane <[email protected]> wrote:
>> Fair enough.  But I'd still like an explanation of why only about
>> half of the population is showing a failure here.  Seems like every
>> machine should be seeing the LSN as moving backwards in this test.
>> So (a) why aren't they all failing, and (b) should we change the
>> test to make sure every platform sees that happening?


> Every machine sees the LSN moving backwards, but the code path that
> had the assertion only reached if it decides to interpolate, which is
> timing dependent: there needs to be a future sample in the lag
> tracking buffer, which I guess is not the case in those runs.

I'm dissatisfied with this explanation because if it's just timing,
it doesn't seem very likely that some machines would reproduce the
failure every single time while others never would.  Maybe that can be
blamed on kernel scheduler vagaries + different numbers of cores, but
I can't escape the feeling that there's something here we've not
fully understood.

While chasing after this earlier today, I turned on some debug logging
and noted that the standby's reports look like

2017-04-23 15:46:46.206 EDT [34829] LOG:  database system is ready to accept 
read only connections
2017-04-23 15:46:46.212 EDT [34834] LOG:  fetching timeline history file for 
timeline 2 from primary server
2017-04-23 15:46:46.212 EDT [34834] LOG:  started streaming WAL from primary at 
0/3000000 on timeline 1
2017-04-23 15:46:46.213 EDT [34834] LOG:  sending write 0/3020000 flush 
0/3028470 apply 0/3028470
2017-04-23 15:46:46.214 EDT [34834] LOG:  replication terminated by primary 
server
2017-04-23 15:46:46.214 EDT [34834] DETAIL:  End of WAL reached on timeline 1 
at 0/3028470.
2017-04-23 15:46:46.214 EDT [34834] LOG:  sending write 0/3028470 flush 
0/3028470 apply 0/3028470
2017-04-23 15:46:46.214 EDT [34830] LOG:  new target timeline is 2
2017-04-23 15:46:46.214 EDT [34834] LOG:  restarted WAL streaming at 0/3000000 
on timeline 2
2017-04-23 15:46:46.228 EDT [34834] LOG:  sending write 0/3020000 flush 
0/3028470 apply 0/3028470

So you're right that the standby's reported "write" position can go
backward, but it seems pretty darn odd that the flush and apply
positions didn't go backward too.  Is there a bug there?

I remain of the opinion that if we can't tell from the transmitted
data whether a timeline switch has caused the position to go backward,
then that's a protocol shortcoming that ought to be fixed.

                        regards, tom lane


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [COMMITTERS] pgsql: Replication lag tracking for walsenders

Reply via email to