Hello Michaël,
I have to say, my conclusion from this debate is that it's simply a
mistake to do this without involving the server, which can use locking
to prevent this kind of issue. It seems pretty absurd to me to have
hacky workarounds for the partial writes of a live server, for
truncation, etc., when the server has ways to deal with all of that.
I agree with Andres on this one. We are never going to make this stuff
safe if we don't handle page reads with the proper locks, because of the
risk of torn pages. What I think we should do is provide a SQL function
which reads a page in shared mode, and then checks its checksum only if
its LSN is older than the previous redo point. This skips rather hot
pages, but if a page is hot enough then the next backend re-reading it
from disk would verify the page checksum by itself anyway.
--
Michael
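
To make the proposal concrete, here is a minimal sketch of such a
function. This is not an actual patch: the name pg_verify_page is made
up, permission and data-checksum checks are omitted, and a production
version would also need to look at the on-disk copy of the block, since
a page sitting in shared buffers only gets a fresh checksum when it is
written out.

/*
 * Sketch only: "pg_verify_page" is a hypothetical name, and this takes
 * the description above literally.
 */
#include "postgres.h"

#include "access/relation.h"
#include "access/xlog.h"
#include "fmgr.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_verify_page);

Datum
pg_verify_page(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    BlockNumber blkno = PG_GETARG_UINT32(1);
    Relation    rel;
    Buffer      buf;
    Page        page;
    char        copy[BLCKSZ];
    bool        ok = true;

    rel = relation_open(relid, AccessShareLock);
    buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, NULL);

    /* Shared content lock: nobody can modify the page under us. */
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);

    /*
     * Check the checksum only when the page LSN is older than the
     * previous redo point; a newer LSN means the page changed recently
     * and its checksum may legitimately be in flux.  BufferGetLSNAtomic()
     * is needed since we hold only a shared lock, and we work on a copy
     * because pg_checksum_page() temporarily writes to the header it is
     * given.
     */
    if (!PageIsNew(page) && BufferGetLSNAtomic(buf) < GetRedoRecPtr())
    {
        memcpy(copy, page, BLCKSZ);
        ok = (pg_checksum_page(copy, blkno) ==
              ((PageHeader) copy)->pd_checksum);
    }

    UnlockReleaseBuffer(buf);
    relation_close(rel, AccessShareLock);

    PG_RETURN_BOOL(ok);
}

The intended use would then be something like
SELECT pg_verify_page('my_table'::regclass, 0) (again hypothetical),
with a bandwidth-controlled scan looping over the blocks of each
relation.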
My 0.02€ about that, as one of the reviewers of the patch:
I agree that having a server function (an extension?) to do a full
checksum verification, possibly bandwidth-controlled, would be a good
thing. However, it would have side effects, such as interfering deeply
with the server page cache, which may or may not be desirable.
On the other hand, I also see value in an independent, system-level
external tool capable of a best-effort checksum verification: the check
that currently keeps pg_verify_checksums from running while the cluster
is online is kind of artificial, and when online, simply counting the
checksum issues that can be explained by concurrent database activity
looks like a reasonable compromise.
So basically I think that allowing pg_verify_checksums to run on an
online cluster is still a good thing, provided that the expected errors
are correctly handled.
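
To sketch what "correctly handled" could look like in such a tool (the
names here are mine, not the patch's: check_block is hypothetical,
redo_lsn would come from pg_control, and pg_checksum_page() is
PostgreSQL's own implementation, which frontend code can pick up from
storage/checksum_impl.h): a short read is treated as concurrent
truncation, and a checksum mismatch on a page whose LSN is newer than
the last checkpoint's redo pointer is counted as a probable in-flight
write rather than as corruption.

/*
 * Sketch of a best-effort online block check for an external tool.
 */
#include "postgres_fe.h"

#include "storage/bufpage.h"
#include "storage/checksum_impl.h"

#include <unistd.h>

typedef enum { BLOCK_OK, BLOCK_SKIPPED, BLOCK_CORRUPT } BlockStatus;

static BlockStatus
check_block(int fd, BlockNumber blkno, XLogRecPtr redo_lsn)
{
    char        buf[BLCKSZ];
    int         attempt;

    /* Re-read once: a first failure may just be a torn, in-flight write. */
    for (attempt = 0; attempt < 2; attempt++)
    {
        PageHeader  hdr = (PageHeader) buf;

        if (pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            return BLOCK_SKIPPED;   /* concurrent truncation, most likely */

        if (PageIsNew(buf))
            return BLOCK_OK;        /* all-zero pages carry no checksum */

        if (pg_checksum_page(buf, blkno) == hdr->pd_checksum)
            return BLOCK_OK;

        /*
         * The page LSN postdates the last checkpoint's redo pointer, so
         * the server may well be rewriting this block right now: count it
         * as skipped, not corrupt; the next checkpoint gives it a fresh
         * checksum that a later scan can verify.
         */
        if (PageGetLSN(buf) > redo_lsn)
            return BLOCK_SKIPPED;
    }
    return BLOCK_CORRUPT;
}

A full scan would then report corrupt blocks while merely tallying the
skipped ones, which are expected on a live cluster.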
--
Fabien.