Hi, Fujii

Thanks for your reply. I would like to share a patch for this bug: it adds an XLogFlush() call in xact_redo_abort() to update the minimum recovery point.
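
To illustrate why updating the minimum recovery point matters here, the following is a minimal toy model of the failure sequence from the quoted analysis. It is only a sketch, not PostgreSQL code: the types, the LSN values (100, 200, 300), and the functions replay() and restart_panics() are all hypothetical names made up for illustration. It assumes that the consistency check runs once the replayed LSN reaches the minimum recovery point, and that replaying the abort record forgets the invalid pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical types and functions, not PostgreSQL code. */
typedef uint64_t Lsn;
typedef enum { REC_INSERT, REC_ABORT_DROP } RecType;
typedef struct { Lsn lsn; RecType type; } WalRecord;

typedef struct {
    bool file_exists;         /* does the relation file exist on disk? */
    Lsn  min_recovery_point;  /* survives a standby restart */
} Standby;

/*
 * Replay all records from the redo point.
 * Returns true if the consistency check finds invalid pages ("PANIC").
 */
static bool replay(Standby *s, const WalRecord *recs, size_t n,
                   bool flush_on_abort)
{
    int  invalid_pages = 0;
    bool consistent = false;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        if (recs[i].type == REC_INSERT) {
            /* A record touching a missing file is remembered as invalid. */
            if (!s->file_exists)
                invalid_pages++;
        } else {
            /* Abort: drop the file and forget its invalid pages. */
            s->file_exists = false;
            invalid_pages = 0;
            /* The proposed fix: advance the minimum recovery point. */
            if (flush_on_abort && recs[i].lsn > s->min_recovery_point)
                s->min_recovery_point = recs[i].lsn;
        }

        /* Consistency is checked once replay reaches the recovery point. */
        if (!consistent && recs[i].lsn >= s->min_recovery_point) {
            if (invalid_pages > 0)
                return true;
            consistent = true;
        }
    }
    return false;
}

/* First replay, then a restart that replays the same records again. */
bool restart_panics(bool flush_on_abort)
{
    const WalRecord recs[] = {
        {100, REC_INSERT}, {200, REC_INSERT}, {300, REC_ABORT_DROP}
    };
    Standby s = { true, 100 };  /* assumed state after the checkpoint */
    bool first = replay(&s, recs, 3, flush_on_abort);

    assert(!first);             /* the first pass never sees a missing file */
    return replay(&s, recs, 3, flush_on_abort);
}
```

In this model, restart_panics(false) reports the failure because consistency is checked at LSN 100 while the invalid page is still remembered. restart_panics(true) does not, because the check is deferred until the abort record at LSN 300 has been replayed.
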
Best Regards,
Suyu

------------------------------------------------------------------
From: Fujii Masao <masao.fu...@oss.nttdata.com>
Sent: Tuesday, July 27, 2021 16:26
To: 蔡梦娟(玊于) <mengjuan....@alibaba-inc.com>; pgsql-hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Why don't update minimum recovery point in xact_redo_abort

On 2021/07/27 2:38, 蔡梦娟(玊于) wrote:
> Hi, all
>
> Recently, I got a PANIC while restarting a standby, which can be reproduced by
> the following steps, based on pg 11:
> 1. begin a transaction on the primary node;
> 2. create a table in the transaction;
> 3. insert lots of data into the table;
> 4. do a checkpoint, and restart the standby after the checkpoint is done on the
> primary node;
> 5. insert/update lots of data into the table again;
> 6. abort the transaction.

I could reproduce the issue on the master branch by using similar steps and disabling full_page_writes.

>
> After step 6, fast shutdown the standby node and then restart it; you will
> get a PANIC log, and the backtrace is:
> #0 0x00007fc663e5a277 in raise () from /lib64/libc.so.6
> #1 0x00007fc663e5b968 in abort () from /lib64/libc.so.6
> #2 0x0000000000c89f01 in errfinish (dummy=0) at elog.c:707
> #3 0x0000000000c8cba3 in elog_finish (elevel=22, fmt=0xdccc18 "WAL contains references to invalid pages") at elog.c:1658
> #4 0x00000000005e476a in XLogCheckInvalidPages () at xlogutils.c:253
> #5 0x00000000005cbc1a in CheckRecoveryConsistency () at xlog.c:9477
> #6 0x00000000005ca5c5 in StartupXLOG () at xlog.c:8609
> #7 0x0000000000a025a5 in StartupProcessMain () at startup.c:274
> #8 0x0000000000643a5c in AuxiliaryProcessMain (argc=2, argv=0x7ffe4e4849a0) at bootstrap.c:485
> #9 0x0000000000a00620 in StartChildProcess (type=StartupProcess) at postmaster.c:6215
> #10 0x00000000009f92c6 in PostmasterMain (argc=3, argv=0x4126500) at postmaster.c:1506
> #11 0x00000000008eab64 in main (argc=3, argv=0x4126500) at main.c:232
>
> I think the reason for the above error is as follows:
> 1. the transaction on the primary node was finally aborted; after replaying
> the abort xlog record, the standby node also deleted the table files, but
> without updating the minimum recovery point;
> 2. the primary node did a checkpoint before the abort, and then the standby
> node was restarted, so the standby node will recover from a point where the
> table has already been created and data has been inserted into the table;
> 3. when the standby node restarts after step 6, it will find that a page
> needed during recovery doesn't exist, because it was already deleted by
> xact_redo_abort, so the standby node will treat this page as an invalid
> page;
> 4. xact_redo_abort drops relation files without updating the minimum recovery
> point, so before the standby node replays the abort xlog record and forgets
> the invalid pages again, it will reach consistency, because the abort xlog
> record's lsn is greater than minRecoveryPoint;
> 5. during CheckRecoveryConsistency, it will check invalid pages, find that
> there is an invalid page, and generate the PANIC log.
>
> So why not update the minimum recovery point in xact_redo_abort, just like
> the XLogFlush in xact_redo_commit? That way the standby could reach
> consistency and check invalid pages after replaying the abort xlog record.

ISTM that you're right. xact_redo_abort() should call XLogFlush() to update the minimum recovery point on truncation. This seems to be an oversight in commit 7bffc9b7bf.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

0001-Update-minimum-recovery-point-on-file-deletion-durin.patch
Description: Binary data