Hi, Fujii

Thanks for your reply.
I'd like to share a patch for this bug with you. It adds an XLogFlush() call in 
xact_redo_abort() to update the minimum recovery point.

Best Regards,
Suyu



------------------------------------------------------------------
From: Fujii Masao <masao.fu...@oss.nttdata.com>
Sent: Tuesday, July 27, 2021, 16:26
To: 蔡梦娟(玊于) <mengjuan....@alibaba-inc.com>; pgsql-hackers 
<pgsql-hackers@lists.postgresql.org>
Subject: Re: Why don't update minimum recovery point in xact_redo_abort



On 2021/07/27 2:38, 蔡梦娟(玊于) wrote:
> Hi, all
> 
> Recently, I got a PANIC while restarting a standby. It can be reproduced with 
> the following steps, based on PG 11:
> 1. begin a transaction on the primary node;
> 2. create a table in the transaction;
> 3. insert lots of data into the table;
> 4. do a checkpoint, and restart the standby after the checkpoint is done on 
> the primary node;
> 5. insert/update lots of data in the table again;
> 6. abort the transaction.

I was able to reproduce the issue on the master branch by using
similar steps with full_page_writes disabled.


> 
> After step 6, do a fast shutdown of the standby node and then restart it. You 
> will get a PANIC log, and the backtrace is:
> #0  0x00007fc663e5a277 in raise () from /lib64/libc.so.6
> #1  0x00007fc663e5b968 in abort () from /lib64/libc.so.6
> #2  0x0000000000c89f01 in errfinish (dummy=0) at elog.c:707
> #3  0x0000000000c8cba3 in elog_finish (elevel=22, fmt=0xdccc18 "WAL contains 
> references to invalid pages") at elog.c:1658
> #4  0x00000000005e476a in XLogCheckInvalidPages () at xlogutils.c:253
> #5  0x00000000005cbc1a in CheckRecoveryConsistency () at xlog.c:9477
> #6  0x00000000005ca5c5 in StartupXLOG () at xlog.c:8609
> #7  0x0000000000a025a5 in StartupProcessMain () at startup.c:274
> #8  0x0000000000643a5c in AuxiliaryProcessMain (argc=2, argv=0x7ffe4e4849a0) 
> at bootstrap.c:485
> #9  0x0000000000a00620 in StartChildProcess (type=StartupProcess) at 
> postmaster.c:6215
> #10 0x00000000009f92c6 in PostmasterMain (argc=3, argv=0x4126500) at 
> postmaster.c:1506
> #11 0x00000000008eab64 in main (argc=3, argv=0x4126500) at main.c:232
> 
> I think the reason for the above error is as follows:
> 1. the transaction on the primary node was eventually aborted, and the standby 
> node deleted the table files after replaying the abort xlog record, but 
> without updating the minimum recovery point;
> 2. the primary node did a checkpoint before the abort, and the standby node 
> was then restarted, so the standby will recover from a point where the table 
> has already been created and data has been inserted into it;
> 3. when the standby node restarts after step 6, it finds that a page needed 
> during recovery doesn't exist, because it was already deleted by 
> xact_redo_abort, so the standby treats this page as an invalid page;
> 4. because xact_redo_abort dropped the relation files without updating the 
> minimum recovery point, the LSN of the abort xlog record is greater than 
> minRecoveryPoint, so the standby reaches consistency before it replays the 
> abort xlog record again and forgets the invalid pages;
> 5. CheckRecoveryConsistency then checks for invalid pages, finds that one 
> exists, and the PANIC log is generated.
> 
> So why not update the minimum recovery point in xact_redo_abort, just like 
> the XLogFlush call in xact_redo_commit? That way the standby would reach 
> consistency and check invalid pages only after replaying the abort xlog record.

ISTM that you're right. xact_redo_abort() should call XLogFlush() to
update the minimum recovery point on truncation. This seems to be
an oversight in commit 7bffc9b7bf.
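
To make the reasoning above concrete, here is a small toy model of the
failure. It is only a sketch and is not PostgreSQL code: the Standby struct,
the record types, and the functions replay(), recover() and
restart_survives() are hypothetical names invented for this illustration,
and the LSN values are arbitrary. It assumes the consistency check fires as
soon as replay has reached the stored minimum recovery point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Hypothetical, heavily simplified WAL record kinds. */
typedef enum { REC_INSERT, REC_ABORT_DROP } RecType;

typedef struct { Lsn end_lsn; RecType type; } WalRecord;

/* Toy standby state; all fields except invalid_pages survive a restart. */
typedef struct {
    bool file_exists;        /* relation file present on disk */
    Lsn  min_recovery_point; /* persisted minimum recovery point */
    int  invalid_pages;      /* in-memory list of invalid page references */
} Standby;

/* Replay one record; flush_on_abort models the proposed XLogFlush() call. */
static void replay(Standby *s, const WalRecord *r, bool flush_on_abort)
{
    if (r->type == REC_INSERT) {
        if (!s->file_exists)
            s->invalid_pages++;          /* page is missing: remember it */
        return;
    }
    /* REC_ABORT_DROP: advance the minimum recovery point before deleting. */
    if (flush_on_abort && s->min_recovery_point < r->end_lsn)
        s->min_recovery_point = r->end_lsn;
    s->file_exists = false;              /* drop the relation file */
    s->invalid_pages = 0;                /* forget its invalid pages */
}

/* Replay from the checkpoint; return false where the PANIC would occur. */
static bool recover(Standby *s, const WalRecord *wal, int n, bool flush_on_abort)
{
    assert(s != NULL && wal != NULL);
    s->invalid_pages = 0;
    for (int i = 0; i < n; i++) {
        replay(s, &wal[i], flush_on_abort);
        /* Consistency reached with invalid pages outstanding: PANIC. */
        if (wal[i].end_lsn >= s->min_recovery_point && s->invalid_pages > 0)
            return false;
    }
    return true;
}

/* Replay once, then restart and replay the same WAL again. */
bool restart_survives(bool flush_on_abort)
{
    const WalRecord wal[] = {
        {100, REC_INSERT}, {200, REC_INSERT}, {300, REC_ABORT_DROP}
    };
    Standby s = { .file_exists = true, .min_recovery_point = 50,
                  .invalid_pages = 0 };

    recover(&s, wal, 3, flush_on_abort);        /* first replay: file gets dropped */
    return recover(&s, wal, 3, flush_on_abort); /* replay after fast shutdown */
}
```

In this model, restart_survives(false) returns false because the second
replay reaches the stale minimum recovery point while an invalid page is
still outstanding, and restart_survives(true) returns true because the
consistency check is deferred until the abort record has been replayed again.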

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment: 0001-Update-minimum-recovery-point-on-file-deletion-durin.patch
Description: Binary data
