Hi all,

Recently I found a bug in the update timing of the walrcv->flushedUpto 
variable. Consider the following scenario: there is one Primary node and one 
Standby node streaming from the Primary.
A large number of SQL statements are running on the Primary, and an xlog 
record generated by them may be longer than the space left on the current 
page, so it has to be written across pages. As shown below, the last_xlog of 
wal_1 is longer than the remaining space of last_page, so its tail has to be 
written into wal_2. If the Primary crashes after flushing the last_page of 
wal_1 to disk but before the remaining content of last_xlog was flushed, the 
last_xlog in wal_1 will be incomplete. In this case the Standby has also 
already received wal_1 via WAL streaming.
[Attachment 日志1.png: diagram of last_xlog spanning the end of wal_1 into wal_2]
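
For reference, this is roughly how such a cross-page record looks on disk: 
the page that carries the tail of the record starts with a page header whose 
XLP_FIRST_IS_CONTD flag is set and whose xlp_rem_len gives the number of 
record bytes still to come. Simplified from src/include/access/xlog_internal.h 
as I read the 13.2 source (other flags and comments omitted):

    typedef struct XLogPageHeaderData
    {
        uint16      xlp_magic;      /* magic value for correctness checks */
        uint16      xlp_info;       /* flag bits, see below */
        TimeLineID  xlp_tli;        /* TimeLineID of first record on page */
        XLogRecPtr  xlp_pageaddr;   /* XLOG address of this page */

        /*
         * When there is not enough space on the current page for the whole
         * record, we continue on the next page.  xlp_rem_len is the number
         * of bytes remaining from a previous page.
         */
        uint32      xlp_rem_len;    /* total len of remaining data for record */
    } XLogPageHeaderData;

    /* When set, the first "record" on the page is actually a continuation
     * of the record that started on the previous page. */
    #define XLP_FIRST_IS_CONTD      0x0001

So after the crash, wal_1 ends with a record header that promises xl_tot_len 
bytes, but the continuation page that should begin wal_2 was never flushed on 
the Primary.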

When the Primary restarts after the crash, during crash recovery it will find 
that the last_xlog of wal_1 is invalid, and it will overwrite the space of 
last_xlog with newly inserted xlog records. The Standby, however, will not do 
this, so at this point the xlog contents of Primary and Standby have diverged.


When the Standby restarts and replays last_xlog, it first reads the 
XLogRecord header (the header of last_xlog was completely flushed) and finds 
that it has to reassemble last_xlog: the continuation of the record is on the 
next page, which lies within wal_2, and wal_2 does not exist in the Standby's 
pg_wal. So the Standby requests xlog streaming from the Primary to get wal_2, 
and updates walrcv->flushedUpto once it has received the new xlog and flushed 
it to disk; now walrcv->flushedUpto is some LSN within wal_2.
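
For context, the walreceiver advances walrcv->flushedUpto after each fsync of 
received WAL, roughly like this (abridged from XLogWalRcvFlush in 
src/backend/replication/walreceiver.c of 13.2; wakeups, progress reporting 
and replies to the sender are omitted):

    static void
    XLogWalRcvFlush(bool dying)
    {
        if (LogstreamResult.Flush < LogstreamResult.Write)
        {
            WalRcvData *walrcv = WalRcv;

            issue_xlog_fsync(recvFile, recvSegNo);

            LogstreamResult.Flush = LogstreamResult.Write;

            /*
             * Update shared-memory status.  This is where flushedUpto moves
             * forward -- in the scenario above, to an LSN inside wal_2.
             */
            SpinLockAcquire(&walrcv->mutex);
            if (walrcv->flushedUpto < LogstreamResult.Flush)
            {
                walrcv->latestChunkStart = walrcv->flushedUpto;
                walrcv->flushedUpto = LogstreamResult.Flush;
            }
            SpinLockRelease(&walrcv->mutex);
        }
    }

Note that flushedUpto only ever moves forward here; nothing on this path can 
take it back to a position inside wal_1.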


The Standby gets wal_2 from the Primary, but the first page of wal_2 no 
longer contains the remaining content of last_xlog; that content has already 
been overwritten by new xlog on the Primary. The Standby checks the record, 
finds it invalid, goes back to read last_xlog again, and calls 
WaitForWALToBecomeAvailable; in this function it shuts down WAL streaming and 
reads the record from pg_wal instead.
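
The fallback happens in the failure-handling part of 
WaitForWALToBecomeAvailable, roughly as follows (abridged from 
src/backend/access/transam/xlog.c of 13.2; the timeline rescan and the 
wal_retrieve_retry_interval sleep are omitted):

    /* In WaitForWALToBecomeAvailable(), when the last source failed */
    if (lastSourceFailed)
    {
        switch (currentSource)
        {
            case XLOG_FROM_STREAM:

                /*
                 * Failure while streaming -- here, because the streamed
                 * record was found invalid.  Make sure the walreceiver is
                 * not active, then retry from archive/pg_wal.
                 */
                if (WalRcvStreaming())
                    ShutdownWalRcv();

                currentSource = XLOG_FROM_ARCHIVE;
                break;

            /* ... other sources ... */
        }
    }

Shutting down the walreceiver does not reset walrcv->flushedUpto; the value 
survives in shared memory for the next round.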


Again, the record read from pg_wal is invalid, so the Standby requests WAL 
streaming once more. It is worth noting that walrcv->flushedUpto has already 
been set to an LSN within wal_2, which is greater than the LSN the Standby 
actually needs, so the variable havedata in WaitForWALToBecomeAvailable is 
always true; the Standby concludes that it has received the xlog and reads 
the content from wal_2.
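
The check in question looks roughly like this (abridged from the 
XLOG_FROM_STREAM arm of WaitForWALToBecomeAvailable in xlog.c of 13.2; 
flushedUpto here is a static local caching walrcv->flushedUpto):

    if (WalRcvStreaming())
    {
        /*
         * Walreceiver is active, so see if new data has arrived.  In our
         * scenario RecPtr is the LSN of last_xlog in wal_1 while
         * flushedUpto already points into wal_2, so the first test always
         * succeeds and we never wait for fresh WAL from the Primary.
         */
        if (RecPtr < flushedUpto)
            havedata = true;
        else
        {
            XLogRecPtr  latestChunkStart;

            flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI);
            if (RecPtr < flushedUpto && receiveTLI == curFileTLI)
                havedata = true;
            else
                havedata = false;
        }
    }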


Next comes the endless loop: the Standby finds the xlog is invalid -> reads 
last_xlog again -> shuts down WAL streaming and reads the xlog from pg_wal -> 
finds the xlog is invalid -> requests WAL streaming again, expecting to get 
the correct xlog, but returns from WaitForWALToBecomeAvailable immediately 
because walrcv->flushedUpto is always greater than the LSN it needs -> reads 
and finds the xlog is invalid -> reads last_xlog again -> ......


In this case, the Standby will never get a correct xlog record until it is 
restarted.


The confusing point is: why is walrcv->flushedUpto updated only at the first 
startup of the walreceiver on a given timeline, and not each time xlog 
streaming is requested? In the case above, it would also be reasonable to 
reset walrcv->flushedUpto to a position within wal_1 when the Standby 
re-requests wal_1. So I changed the code to update walrcv->flushedUpto each 
time xlog streaming is requested; that is the patch I want to share with you, 
based on postgresql-13.2 (a sketch of the idea follows). What do you think of 
this change?
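
To be concrete, the current logic in RequestXLogStreaming 
(src/backend/replication/walreceiverfuncs.c, 13.2) is:

    /*
     * If this is the first startup of walreceiver (on this timeline),
     * initialize flushedUpto and latestChunkStart to the starting point.
     */
    if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
    {
        walrcv->flushedUpto = recptr;
        walrcv->latestChunkStart = recptr;
    }
    walrcv->receivedTLI = tli;

and the idea of the patch is simply to drop the condition, i.e. something like 
the following (a sketch only; the attached patch is the actual change):

    walrcv->flushedUpto = recptr;
    walrcv->latestChunkStart = recptr;
    walrcv->receivedTLI = tli;

With that, when the Standby re-requests streaming from a point inside wal_1, 
flushedUpto is pulled back to the requested start point, and havedata no 
longer reports stale data.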

By the way, I would also like to know why the pgstat_reset_all function is 
called during the recovery process.

Thanks & Best Regards

Attachment: 0001-Update-walrcv-flushedUpto-each-time-when-request-xlog-streaming.patch
