Affects: 7.2.3 (possibly down to 7.1.0) Frequency: Very Rare Effect When Occurring: Database corruption Difficulty of Fix: Trivial Certianty of Diagnosis: about 50% according to Tom
This bug was analyzed by Tom Lane; I'm just writing it up. In 7.2.3 (and possibly in earlier versions) these two rows in xlog.c are out of order: FlushBufferPool(); CheckPointCLOG(); Per Tom's analysis: "7.2.* does checkpoint operations in the wrong order: CreateCheckPoint does FlushBufferPool(); CheckPointCLOG(); ... create and write checkpoint WAL record ... The reason this is the wrong order is that CheckPointCLOG() only issues write()s of dirty pg_clog pages; it does not fsync them. Thus, it is possible that the checkpoint WAL record will be flushed to disk while the clog page writes are still hanging about in kernel disk cache. If there is a system crash before the kernel gets around to sync'ing the dirty clog pages to disk, then we lose --- on restart, the WAL logic will only replay WAL entries after the latest checkpoint, and so any transaction commits occurring before the checkpoint would fail to be re-marked in pg_clog." As an error scenario, this seems rather farfetched; Postgres would have to be killed, a second time, while in recovery mode at a moment between FlushBufferPool() and CheckPointCLOG(). A remote enough possibility to ignore. Except that it seems to have happened twice, to two different users. The scenario under which this bug becomes critical is this: 1) In the middle of a large UPDATE statement, the Postgres server loses power from a general power outage or local building short. 2) This server is not buffered by a UPS. 3) Due to work on the power system or weather damage, power comes back on, then off after a few minutes, cycling off-and-on 4-5 times (this is not farfetched; during the California "power crisis" I saw it happen several times). 4) This has the possible effect of repeatedly downing Postgres while it is in recovery mode. 5) Sooner or later, the up-down effect "gets lucky" and postgres goes down while FlushBufferCache() is finishing up. 6) The user ends up with two versions of one or more of their records marked as valid by Postgres. Per Tom's analysis of one such problem: "Well, here's what I've found so far. The two tuples in question have header data like so (as printed by pg_filedump): Item 28 -- Length: 248 Offset: 7944 (0x1f08) Flags: USED OID: 487894 CID: min(0) max(0) XID: min(9776912) max(17920315) Block Id: 4664 linp Index: 1 Attributes: 31 Size: 36 infomask: 0x0903 (HASNULL|HASVARLENA|XMIN_COMMITTED|XMAX_INVALID) t_bits: [0]: 0xff [1]: 0xfe [2]: 0xdf [3]: 0x7d Item 2 -- Length: 248 Offset: 7944 (0x1f08) Flags: USED OID: 487894 CID: min(0) max(0) XID: min(9777615) max(10180711) Block Id: 4666 linp Index: 1 Attributes: 31 Size: 36 infomask: 0x2903 (HASNULL|HASVARLENA|XMIN_COMMITTED|XMAX_INVALID|UPDATED) t_bits: [0]: 0xff [1]: 0xfe [2]: 0xdf [3]: 0x7d What appears to have happened is this: transaction 9776912 created the row initially (the first of the two items is evidently the first incarnation of the row, since it does not have HEAP_UPDATED set). A little while later, transaction 9777615 updated the row, creating the second tuple. Our problem is that both tuples appear to be committed good --- both have XMIN_COMMITTED set. Digging into the pg_clog data, I find that 9776912 is shown as "committed", as expected. But 9777615 is shown as "in progress" --- the clog entry has not been marked as either committed or aborted!" Since this bug has been fixed in 7.3.1, it's not cirtical to release a patch. HOWEVER, given the triviality of the fix ... simply swapping those two lines in xlog.c ... does everyone think it would be a good idea to post a notice on the lists of the issue and the fix? While it easy enough to tell users, "Upgrade, or get a UPS" this is not practical for everyone. And is there any potential issue with swapping those two lines? -- -Josh Berkus Aglio Database Solutions San Francisco ---------------------------(end of broadcast)--------------------------- TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]