Dear Hackers,

I would like to discuss the ProcessTwoPhaseBuffer function. It reads two-phase 
transaction states from disk or the WAL. It takes an xid as well as some other 
input parameters and executes the following steps:

Step #1: Check whether xid is committed or aborted in clog 
(TransactionIdDidCommit, TransactionIdDidAbort)
Step #2: Check that xid is not equal to or greater than 
ShmemVariableCache->nextXid
Step #3: Read the two-phase state for the specified xid from memory or the 
corresponding file and return it

In some very rare scenarios, the postgres instance will never recover because 
of this logic. Imagine that the two_phase directory contains some files with 
two-phase states of transactions from the distant future. I assume this can 
happen if some WAL segments are broken and ignored (as well as clog data) but 
the two_phase directory is intact. During recovery, PostgreSQL reads all the 
files in two_phase and tries to restore the two-phase states.

The problem appears in the functions TransactionIdDidCommit and 
TransactionIdDidAbort. These functions may fail with a FATAL message like the 
one below when no clog state is available on disk for the xid:

FATAL:  could not access status of transaction 286331153
DETAIL:  Could not open file "pg_xact/0111": No such file or directory.

Such an error prevents the PostgreSQL instance from starting.

My guess is that swapping Step #1 with Step #2 would make this error disappear, 
because such transactions would be filtered out by comparing xid with 
ShmemVariableCache->nextXid before clog is accessed. The function would be more 
robust. In general it works, but I'm not sure that this logic will not break 
some rare boundary cases. Another solution is to catch and ignore the error, 
but the first solution is the simpler one. I would appreciate any thoughts on 
this topic. Maybe you know of cases where such a change in logic is not 
appropriate?
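
To make the idea concrete, here is a minimal standalone sketch of the two 
orderings. It is not the actual PostgreSQL code: MockState, mock_clog_lookup, 
the Outcome values and both process_* functions are hypothetical names made up 
for illustration, and the clog is modelled as a simple xid limit. The only part 
taken from real behaviour is that xids are compared modulo 2^32.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;               /* stand-in for TransactionId */

typedef enum
{
    TWOPHASE_LOAD,                  /* Step #3 would read the state */
    TWOPHASE_SKIP_FINISHED,         /* already committed or aborted */
    TWOPHASE_SKIP_FUTURE,           /* xid >= nextXid, file is ignored */
    TWOPHASE_FATAL                  /* clog page missing, startup fails */
} Outcome;

/* Hypothetical model of the relevant server state. */
typedef struct
{
    Xid next_xid;                   /* stand-in for ShmemVariableCache->nextXid */
    Xid clog_limit;                 /* first xid with no clog data on disk */
    Xid finished_below;             /* xids below this are committed/aborted */
} MockState;

/* Wraparound-aware "a >= b" (modulo 2^32 comparison). */
bool
xid_follows_or_equals(Xid a, Xid b)
{
    return (int32_t) (a - b) >= 0;
}

/*
 * Mock of the clog lookup done in Step #1. Returns false when no clog data
 * exists for the xid; this models the FATAL error from the report.
 */
bool
mock_clog_lookup(const MockState *s, Xid xid, bool *finished)
{
    if (xid_follows_or_equals(xid, s->clog_limit))
        return false;
    *finished = !xid_follows_or_equals(xid, s->finished_below);
    return true;
}

/* Current order: clog check (Step #1) first, nextXid check (Step #2) second. */
Outcome
process_current_order(const MockState *s, Xid xid)
{
    bool finished;

    if (!mock_clog_lookup(s, xid, &finished))
        return TWOPHASE_FATAL;
    if (finished)
        return TWOPHASE_SKIP_FINISHED;
    if (xid_follows_or_equals(xid, s->next_xid))
        return TWOPHASE_SKIP_FUTURE;
    return TWOPHASE_LOAD;
}

/* Proposed order: nextXid check (Step #2) first, clog check (Step #1) second. */
Outcome
process_swapped_order(const MockState *s, Xid xid)
{
    bool finished;

    if (xid_follows_or_equals(xid, s->next_xid))
        return TWOPHASE_SKIP_FUTURE;
    if (!mock_clog_lookup(s, xid, &finished))
        return TWOPHASE_FATAL;
    if (finished)
        return TWOPHASE_SKIP_FINISHED;
    return TWOPHASE_LOAD;
}
```

In this model, with next_xid = 1000 and clog data available only below xid 
1024, the xid 286331153 from the message above ends in TWOPHASE_FATAL with the 
current order and in TWOPHASE_SKIP_FUTURE with the swapped order, while xids 
below nextXid are handled identically by both. The sketch says nothing about 
the boundary cases I am worried about; it only shows the intended effect.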

Thank you in advance!

With best regards,
Vitaly


