As for fixing the problem we do understand: ISTM it's just an awful idea for pgrename and pgunlink to be willing to loop forever. I think they should time out and report the failure after some reasonable period (say between 10 sec and a minute).
is the main problem really in the rename/delete function? while i'm in no position to actually know what's going on under the hood, my observations in 10+ cases during this afternoon/evening revealed some patterns:
it is definitely the writer process that blocks the db. but in every case the writer process seems to fail to rename the file because another postgresql process is still holding a file handle to the very xlog file that should be renamed. ProcessExplorer lets you force a close of the file handle - as soon as you do this [which is a bad thing to do, i assume], the rename succeeds and processing continues normally.
i can actually reproduce the error at will now - i just need to pump enough data into the db (~200mb of data seems sufficient) to have it lock up.
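for reference, here is a rough sketch of what the quoted "time out and report the failure" idea could look like. this is not the actual pgrename code: rename_with_timeout and sleep_ms are hypothetical names, it uses plain rename() and POSIX nanosleep() instead of the windows calls, and treating EACCES/EBUSY as the "somebody still holds the file" case is an assumption on my part.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif

#include <errno.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
    struct timespec ts;

    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/*
 * Hypothetical bounded-retry rename.
 *
 * Retries while the failure looks transient (assumed here: EACCES or
 * EBUSY) and gives up once roughly timeout_ms has been spent sleeping.
 * Returns 0 on success, -1 on failure with errno as set by rename().
 */
static int rename_with_timeout(const char *from, const char *to,
                               long timeout_ms, long interval_ms)
{
    long waited_ms = 0;

    if (interval_ms <= 0)
        interval_ms = 1;        /* avoid a loop that never advances */

    for (;;)
    {
        if (rename(from, to) == 0)
            return 0;
        if (errno != EACCES && errno != EBUSY)
            return -1;          /* not a "file in use" error: fail now */
        if (waited_ms >= timeout_ms)
            return -1;          /* timed out: report instead of looping */
        sleep_ms(interval_ms);
        waited_ms += interval_ms;
    }
}
```

a caller would pass something like a 30000 ms timeout with a 100 ms interval and log the errno when -1 comes back. of course this only turns the hang into a reported error; it does nothing about the other process still holding the file handle.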
- thomas