While testing the crash resilience of the recent 2-part-commit improvements, I've run into a problem where sometimes after a crash the recovery process creates zeroed files in pg_subtrans until it exhausts all disk space.
Looking at the code, it does not seem to anticipate that the xid might wrap around, meaning startPage/endPage might also wrap around. But obviously they should not wrap at INT_MAX, but rather at some much smaller value.

Here is the state near the time of disaster:

    (gdb) print startPage
    $1 = 2813758
    (gdb) print endPage
    $2 = 179
    (gdb) p oldestActiveXID
    $3 = 4293679649
    (gdb) p ShmemVariableCache->nextXid
    $4 = 367568

Attached is my attempt at a fix. I've tested it for the ability to start up the crashed server again, but have not tested the full stack from initdb to crash with this in place.

Assuming I'm right, I am curious how this problem has been around so long without being discovered previously. So maybe I'm not right. I found this with some code to accelerate the consumption of xids, but I don't see how that would lead to a false positive here. I think I found it while testing 2-part-commit because that inherently means leaving an active XID hanging around for a few checkpoint cycles, which is something I've never intentionally tested before.

Cheers,

Jeff
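To make the wraparound concrete, here is a minimal standalone sketch of the page walk. It is not the attached patch and not the server code. It assumes 2048 xids per pg_subtrans page (an 8192-byte block divided by a 4-byte xid), and the names xid_to_page and pages_to_zero are hypothetical, chosen only for illustration. Under that assumption the last valid page is 2097151, so a startPage of 2813758 is consistent with a loop that kept incrementing past the end instead of wrapping back to page 0.

```c
#include <stdint.h>

/* Assumption: 8192-byte block / 4-byte xid = 2048 xids per page. */
#define XACTS_PER_PAGE 2048u
#define MAX_XID        0xFFFFFFFFu

/* Hypothetical helper: map an xid to the page that holds it. */
static uint32_t xid_to_page(uint32_t xid)
{
    return xid / XACTS_PER_PAGE;
}

/*
 * Hypothetical helper: count the pages visited while walking from
 * start_page up to (but not including) end_page.  When wrap is nonzero,
 * the walk returns to page 0 after the page holding MAX_XID.  Returns -1
 * if more than limit pages would be visited, which stands in for
 * "keeps creating files until the disk fills up".
 */
static long pages_to_zero(uint32_t start_page, uint32_t end_page,
                          int wrap, long limit)
{
    long visited = 0;

    while (start_page != end_page)
    {
        if (visited >= limit)
            return -1;          /* runaway walk */
        visited++;              /* this is where a page would be zeroed */
        start_page++;
        if (wrap && start_page > xid_to_page(MAX_XID))
            start_page = 0;     /* account for xid wraparound */
    }
    return visited;
}
```

With the values from the gdb session, xid_to_page(4293679649) is 2096523 and xid_to_page(367568) is 179. A walk that wraps visits 808 pages before it reaches endPage, while a walk that does not wrap runs past any reasonable limit.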
StartupSUB.patch
Description: Binary data