Hello Michaël,
I have to say, my conclusion from this debate is that it's simply a
mistake to do this without involving the server, which can use locking
to prevent this kind of issue. It seems pretty absurd to me to have
hacky workarounds for the partial writes of a live server, for
truncation, etc., when the server has ways to deal with all of that.
I agree with Andres on this one. We are never going to make this stuff
safe if we don't handle page reads with the proper locks, because of the
risk of torn pages. What I think we should do is provide a SQL function
which reads a page in shared mode, and then checks its checksum only if
its LSN is older than the previous redo point. This skips rather hot
pages, but if a page is hot enough then the next backend re-reading it
from disk would verify the page checksum by itself anyway.
--
Michael
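
To make the proposal concrete, here is a minimal sketch of such a
function. This is not an actual patch: the name pg_verify_page is made
up, permission and data-checksum checks are omitted, and a production
version would also need to look at the on-disk copy of the block, since
a page sitting in shared buffers only gets a fresh checksum when it is
written out.

/*
 * Sketch only: "pg_verify_page" is a hypothetical name, and this takes
 * the description above literally.
 */
#include "postgres.h"

#include "access/relation.h"
#include "access/xlog.h"
#include "fmgr.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "utils/rel.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_verify_page);

Datum
pg_verify_page(PG_FUNCTION_ARGS)
{
    Oid         relid = PG_GETARG_OID(0);
    BlockNumber blkno = PG_GETARG_UINT32(1);
    Relation    rel;
    Buffer      buf;
    Page        page;
    char        copy[BLCKSZ];
    bool        ok = true;

    rel = relation_open(relid, AccessShareLock);
    buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL, NULL);

    /* Shared content lock: nobody can modify the page under us. */
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);

    /*
     * Check the checksum only when the page LSN is older than the
     * previous redo point; a newer LSN means the page changed recently
     * and its checksum may legitimately be in flux.  BufferGetLSNAtomic()
     * is needed since we hold only a shared lock, and we work on a copy
     * because pg_checksum_page() temporarily writes to the header it is
     * given.
     */
    if (!PageIsNew(page) && BufferGetLSNAtomic(buf) < GetRedoRecPtr())
    {
        memcpy(copy, page, BLCKSZ);
        ok = (pg_checksum_page(copy, blkno) ==
              ((PageHeader) copy)->pd_checksum);
    }

    UnlockReleaseBuffer(buf);
    relation_close(rel, AccessShareLock);

    PG_RETURN_BOOL(ok);
}

The intended use would then be something like
SELECT pg_verify_page('my_table'::regclass, 0) (again hypothetical),
with a bandwidth-controlled scan looping over the blocks of each
relation.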
My 0.02€ about that, as one of the reviewers of the patch:
I agree that having a server function (an extension?) to do a full
checksum verification, possibly bandwidth-controlled, would be a good
thing. However, it would have side effects, such as interfering deeply
with the server page cache, which may or may not be desirable.
On the other hand, I also see value in an independent, system-level
external tool capable of a best-effort checksum verification: the check
that currently keeps pg_verify_checksums from running while the cluster
is online is kind of artificial, and when online, simply counting the
checksum issues that can be explained by concurrent database activity
looks like a reasonable compromise.
So basically I think that allowing pg_verify_checksums to run on an
online cluster is still a good thing, provided that the expected errors
are correctly handled.
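
To sketch what "correctly handled" could look like in such a tool (the
names here are mine, not the patch's: check_block is hypothetical,
redo_lsn would come from pg_control, and pg_checksum_page() is
PostgreSQL's own implementation, which frontend code can pick up from
storage/checksum_impl.h): a short read is treated as concurrent
truncation, and a checksum mismatch on a page whose LSN is newer than
the last checkpoint's redo pointer is counted as a probable in-flight
write rather than as corruption.

/*
 * Sketch of a best-effort online block check for an external tool.
 */
#include "postgres_fe.h"

#include "storage/bufpage.h"
#include "storage/checksum_impl.h"

#include <unistd.h>

typedef enum { BLOCK_OK, BLOCK_SKIPPED, BLOCK_CORRUPT } BlockStatus;

static BlockStatus
check_block(int fd, BlockNumber blkno, XLogRecPtr redo_lsn)
{
    char        buf[BLCKSZ];
    int         attempt;

    /* Re-read once: a first failure may just be a torn, in-flight write. */
    for (attempt = 0; attempt < 2; attempt++)
    {
        PageHeader  hdr = (PageHeader) buf;

        if (pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ) != BLCKSZ)
            return BLOCK_SKIPPED;   /* concurrent truncation, most likely */

        if (PageIsNew(buf))
            return BLOCK_OK;        /* all-zero pages carry no checksum */

        if (pg_checksum_page(buf, blkno) == hdr->pd_checksum)
            return BLOCK_OK;

        /*
         * The page LSN postdates the last checkpoint's redo pointer, so
         * the server may well be rewriting this block right now: count it
         * as skipped, not corrupt; the next checkpoint gives it a fresh
         * checksum that a later scan can verify.
         */
        if (PageGetLSN(buf) > redo_lsn)
            return BLOCK_SKIPPED;
    }
    return BLOCK_CORRUPT;
}

A full scan would then report corrupt blocks while merely tallying the
skipped ones, which are expected on a live cluster.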
--
Fabien.