[HACKERS] Report: race conditions in WAL replay routines

Tom Lane Sun, 05 Feb 2012 11:19:54 -0800

Pursuant to the recent discussion about bugs 6200 and 6245, I went
trawling through all the WAL redo routines looking for bugs such as
inadequate locking or transiently trashing shared buffers.  Here's
what I found:


* As we already knew, RestoreBkpBlocks is broken because it transiently
trashes a shared buffer, which another process could be accessing while
holding only a pin.

* seq_redo() has the same disease, since we allow SELECT * FROM
sequences.

* Everything else seems to be free of that specific issue; in particular
the index-related replay routines are at fairly low risk since we don't
have any coding rules allowing index pages to be examined without
holding a buffer lock.

* There are assorted replay routines that assume they can whack fields
of ShmemVariableCache around without any lock.  However, it's pretty
inconsistent; about half do it like that, while the other half assume
that they can read ShmemVariableCache without lock but should acquire
lock to modify it.  I think the latter coding rule is a whole lot safer
in the presence of Hot Standby and should be adopted throughout.

* Same goes for MultiXactSetNextMXact and MultiXactAdvanceNextMXact.
It's not entirely clear to me that no read-only transaction can ever
examine the shared-memory variables they change.  In any case, if there
is in fact no other process examining those variables, there can be no
contention for the lock so it should be cheap to get.

* Not exactly a race condition, but: tblspc_redo does ereport(ERROR)
if it fails to clean out tablespace directories.  This seems to me to be
the height of folly, especially when the failure is more or less an
expected case.  If the error occurs the database is dead in the water,
because that error is actually a PANIC and will recur on subsequent
restart attempts.  Therefore there is no way to recover short of manual
intervention to clean out the non-empty directory.  And why are we
pulling the fire alarm like this?  Well, uh, it's because we might fail
to recover some disk space in the dropped tablespace.  Seems to me to be
a lot better to just elog(LOG) and move on.  This is quite analogous to
the case of failing to unlink a file after commit --- wasting disk space
might be bad, but it's very much the lesser evil compared to this.

Barring objections I'm going to fix all this stuff and back-patch as
far as 9.0 where hot standby was added.

                        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Report: race conditions in WAL replay routines

Reply via email to