[BUGS] WAL Recovery Bug in 7.2.3

Josh Berkus Thu, 16 Jan 2003 14:38:34 -0800

Affects: 7.2.3 (possibly down to 7.1.0)
Frequency:  Very Rare
Effect When Occurring:   Database corruption
Difficulty of Fix:  Trivial
Certianty of Diagnosis:  about 50% according to Tom


This bug was analyzed by Tom Lane; I'm just writing it up.

In 7.2.3 (and possibly in earlier versions) these two rows in xlog.c are out 
of order:
        FlushBufferPool();
        CheckPointCLOG();

Per Tom's analysis:
"7.2.* does checkpoint operations in the wrong order: CreateCheckPoint
does
        FlushBufferPool();
        CheckPointCLOG();
        ... create and write checkpoint WAL record ...
The reason this is the wrong order is that CheckPointCLOG() only issues
write()s of dirty pg_clog pages; it does not fsync them.  Thus, it is
possible that the checkpoint WAL record will be flushed to disk while
the clog page writes are still hanging about in kernel disk cache.
If there is a system crash before the kernel gets around to sync'ing
the dirty clog pages to disk, then we lose --- on restart, the WAL logic
will only replay WAL entries after the latest checkpoint, and so any
transaction commits occurring before the checkpoint would fail to be
re-marked in pg_clog."

As an error scenario, this seems rather farfetched; Postgres would have to be 
killed, a second time, while in recovery mode at a moment between 
FlushBufferPool() and CheckPointCLOG().   A remote enough possibility to 
ignore.   Except that it seems to have happened twice, to two different 
users.

The scenario under which this bug becomes critical is this:

1) In the middle of a large UPDATE statement, the Postgres server loses power 
from a general power outage or local building short.
2) This server is not buffered by a UPS.
3) Due to work on the power system or weather damage, power comes back on, 
then off after a few minutes, cycling off-and-on 4-5 times (this is not 
farfetched; during the California "power crisis" I saw it happen several 
times).
4) This has the possible effect of repeatedly downing Postgres while it is in 
recovery mode.   
5) Sooner or later, the up-down effect "gets lucky" and postgres goes down 
while FlushBufferCache() is finishing up.
6) The user ends up with two versions of one or more of their records marked 
as valid by Postgres.  Per Tom's analysis of one such problem:

"Well, here's what I've found so far.  The two tuples in question have
header data like so (as printed by pg_filedump):

 Item  28 -- Length:  248  Offset: 7944 (0x1f08)  Flags: USED
  OID: 487894  CID: min(0) max(0)  XID: min(9776912) max(17920315)
  Block Id: 4664  linp Index: 1   Attributes: 31   Size: 36
  infomask: 0x0903 (HASNULL|HASVARLENA|XMIN_COMMITTED|XMAX_INVALID) 
  t_bits: [0]: 0xff [1]: 0xfe [2]: 0xdf [3]: 0x7d 

 Item   2 -- Length:  248  Offset: 7944 (0x1f08)  Flags: USED
  OID: 487894  CID: min(0) max(0)  XID: min(9777615) max(10180711)
  Block Id: 4666  linp Index: 1   Attributes: 31   Size: 36
  infomask: 0x2903 (HASNULL|HASVARLENA|XMIN_COMMITTED|XMAX_INVALID|UPDATED) 
  t_bits: [0]: 0xff [1]: 0xfe [2]: 0xdf [3]: 0x7d 

What appears to have happened is this: transaction 9776912 created the
row initially (the first of the two items is evidently the first
incarnation of the row, since it does not have HEAP_UPDATED set).  A
little while later, transaction 9777615 updated the row, creating the
second tuple.  Our problem is that both tuples appear to be committed
good --- both have XMIN_COMMITTED set.

Digging into the pg_clog data, I find that 9776912 is shown as
"committed", as expected.  But 9777615 is shown as "in progress" --- the
clog entry has not been marked as either committed or aborted!"


Since this bug has been fixed in 7.3.1, it's not cirtical to release a patch.   
HOWEVER, given the triviality of the fix ... simply swapping those two lines 
in xlog.c ... does everyone think it would be a good idea to post a notice on 
the lists of the issue and the fix?

While it easy enough to tell users, "Upgrade, or get a UPS" this is not 
practical for everyone.

And is there any potential issue with swapping those two lines?


-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]

[BUGS] WAL Recovery Bug in 7.2.3

Reply via email to