What happens is that when we recycle WAL segments, we rename them and then sync
them using fdatasync (which is the default on Linux). However fdatasync does not
force fsync on the parent directory, so in case of power failure the rename may
get lost. The recovery won't realize those segments actually contain changes
Agreed. I ran into this some time ago, although it wasn't with Postgres.

So, what's going on? The problem is that while rename() is atomic, it's not
guaranteed to be durable without an explicit fsync on the parent directory. And
by default we only do fdatasync on the recycled segments, which may not force
fsync on the directory (and ext4 does not do that, apparently).
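
To make the point above concrete, here is a minimal sketch in C of a rename that is followed by an explicit fsync on the parent directory. This is not the PostgreSQL code or any proposed patch; the names fsync_dir() and durable_rename_sketch() are made up for illustration, and the sketch assumes a POSIX system where a directory can be opened read-only and passed to fsync() (true on Linux).

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: fsync a directory so its entries reach stable storage. */
static int
fsync_dir(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY);
    int rc;
    int saved_errno;

    if (fd < 0)
        return -1;

    rc = fsync(fd);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return rc;
}

/*
 * Hypothetical helper: flush the file, rename it, then fsync the parent
 * directory of the new name. Assumes old and new names live in the same
 * directory, as a recycled segment would.
 */
int
durable_rename_sketch(const char *oldpath, const char *newpath)
{
    int   fd;
    int   rc;
    char *copy;

    /* Flush the file contents before the rename. */
    fd = open(oldpath, O_RDWR);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    if (rc != 0)
        return -1;

    /* Atomic, but not yet durable. */
    if (rename(oldpath, newpath) != 0)
        return -1;

    /* dirname() may modify its argument, so work on a copy. */
    copy = strdup(newpath);
    if (copy == NULL)
        return -1;
    rc = fsync_dir(dirname(copy));
    free(copy);
    return rc;
}
```

Without the final fsync_dir() call, the new directory entry may still be sitting only in memory when power is lost, which is the failure described above.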

This impacts all current kernels (tested on 2.6.32.68, 4.0.5 and 4.4-rc1), and
also all supported PostgreSQL versions (tested on 9.1.19, but I believe all
versions since spread checkpoints were introduced are vulnerable).

FWIW this has nothing to do with storage reliability - you may have good drives,
RAID controller with BBU, reliable SSDs or whatever, and you're still not safe.
This issue is at the filesystem level, not storage.
Agreed again.

I plan to do more power failure testing soon, with more complex test scenarios.
I suspect there might be other similar issues (e.g. when we rename a file before
a checkpoint and don't fsync the directory - then the rename won't be replayed
and will be lost).
That would be very useful, but I hope you won't find any new bugs :)

--
Teodor Sigaev                                   E-mail: teo...@sigaev.ru
                                                   WWW: http://www.sigaev.ru/

