Bonjour Michaël,

I gotta say, my conclusion from this debate is that it's simply a
mistake to do this without involvement of the server, which can use
locking to prevent these kinds of issues.  It seems pretty absurd to me
to have hacky workarounds for partial writes of a live server, for
truncation, etc., even though the server has ways to deal with that.

I agree with Andres on this one. We are never going to make this stuff safe if we do not handle page reads with the proper locks, because of torn pages. What I think we should do is provide a SQL function which reads a page in shared mode and then checks its checksum, but only if its LSN is older than the previous redo point. This discards cases with rather hot pages, but if the page is hot enough then a backend re-reading the page would just do the same work by verifying the page checksum itself.

--
Michael
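To make the idea concrete, here is a rough sketch of that skip-by-LSN logic. The helper names (read_page_shared, page_get_lsn, get_prev_redo_ptr, checksum_matches) are hypothetical stand-ins for the relevant server internals, not actual APIs:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical helpers standing in for server internals. */
bool read_page_shared(uint32_t blkno, char *page);
XLogRecPtr page_get_lsn(const char *page);
XLogRecPtr get_prev_redo_ptr(void);
bool checksum_matches(const char *page, uint32_t blkno);

/*
 * Verify one page.  Returns false only on a mismatch we can trust:
 * a page whose LSN is newer than the previous redo point has been
 * rewritten since the last checkpoint, so any torn or in-flux state
 * will be fixed by the full-page image replayed at crash recovery,
 * and we skip it instead of reporting a false positive.
 */
bool
page_checksum_trustworthy(uint32_t blkno, char *page)
{
    if (!read_page_shared(blkno, page))
        return true;            /* I/O failure: handled elsewhere */

    if (page_get_lsn(page) > get_prev_redo_ptr())
        return true;            /* hot page: inconclusive, skip */

    return checksum_matches(page, blkno);
}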

My 0.02€ about that, as one of the reviewers of the patch:

I agree that having a server function (extension?) to do a full checksum verification, possibly bandwidth-controlled, would be a good thing. However, it would have side effects, such as interfering deeply with the server's page cache, which may or may not be desirable.

On the other hand, I also see value in an independent, system-level external tool capable of a best-effort checksum verification: the current check that the cluster is offline before pg_verify_checksum will run is kind of artificial, and when online, simply counting the checksum issues attributable to the database being live looks like a reasonable compromise.
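As a sketch of what that best-effort compromise could look like inside the tool, under stated assumptions: read_block, checksum_matches, and lsn_newer_than_start are hypothetical stand-ins, and the retry heuristic is only one possible policy, where a mismatch is counted only if it survives a re-read and the page does not appear to have been rewritten while the tool was running.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BLCKSZ 8192

/* Hypothetical helpers; not the tool's actual functions. */
bool read_block(FILE *fp, uint32_t blkno, char *buf);
bool checksum_matches(const char *buf, uint32_t blkno);
bool lsn_newer_than_start(const char *buf);

/*
 * Decide whether a block should be counted as a checksum failure.
 * A first mismatch may just be a torn read racing a live server
 * write, so re-read once; a page stamped with an LSN past the
 * tool's starting point was legitimately rewritten under us and
 * is skipped as inconclusive rather than reported.
 */
bool
count_as_failure(FILE *fp, uint32_t blkno)
{
    char buf[BLCKSZ];

    if (!read_block(fp, blkno, buf))
        return false;           /* short read: report separately */
    if (checksum_matches(buf, blkno))
        return false;

    if (!read_block(fp, blkno, buf) || checksum_matches(buf, blkno))
        return false;           /* transient torn read */

    if (lsn_newer_than_start(buf))
        return false;           /* concurrently rewritten: skip */

    return true;
}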

So basically I think that allowing pg_verify_checksum to run on an online cluster is still a good thing, provided that expected errors are correctly handled.

--
Fabien.
