> On Feb 17, 2021, at 12:56 PM, Robert Haas <robertmh...@gmail.com> wrote:
> 
> On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger
> <mark.dil...@enterprisedb.com> wrote:
>> It will reconnect and retry a command one time on error.  That should cover 
>> the case that the connection to the database was merely lost.  If the second 
>> attempt also fails, no further retry of the same command is attempted, 
>> though commands for remaining relation targets will still be attempted, both 
>> for the database that had the error and for other remaining databases in the 
>> list.
>> 
>> Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` 
>> could result in two failures per relation in db2 before finally moving on to 
>> db3.  That seems pretty awful considering how many relations that could be, 
>> but failing to soldier on in the face of errors seems a strange design for a 
>> corruption checking tool.
> 
> That doesn't seem right at all. I think a PQsendQuery() failure is so
> remote that it's probably justification for giving up on the entire
> operation. If it's caused by a problem with some object, it probably
> means that accessing that object caused the whole database to go down,
> and retrying the object will take the database down again. Retrying
> the object is betting that the user interrupted connectivity between
> pg_amcheck and the database but the interruption is only momentary and
> the user actually wants to complete the operation. That seems unlikely
> to me. I think it's far more probably that the database crashed or got
> shut down and continuing is futile.
> 
> My proposal is: if we get an ERROR trying to *run* a query, give up on
> that object but still try the other ones after reconnecting. If we get
> a FATAL or PANIC trying to *run* a query, give up on the entire
> operation. If even sending a query fails, also give up.

This is changed in v40 as you propose to exit on FATAL and PANIC level errors 
and on error to send a query.  On lesser errors (which includes all corruption 
reports about btrees and some heap corruption related errors), the slot's 
connection is still useable, I think.  Are there cases where the error is lower 
than FATAL and yet the connection needs to be reestablished?  It does not seem 
so from the testing I have done, but perhaps I'm not thinking of the right sort 
of non-fatal error?

(I'll wait to post v40 until after hearing your thoughts on this.)

>> In v39, exit(1) is used for all errors which are intended to stop the 
>> program.  It is important to recognize that finding corruption is not an 
>> error in this sense.  A query to verify_heapam() can fail if the relation's 
>> checksums are bad, and that happens beyond verify_heapam()'s control when 
>> the page is not allowed into the buffers.  There can be errors if the file 
>> backing a relation is missing.  There may be other corruption error cases 
>> that I have not yet thought about.  The connections' errors get reported to 
>> the user, but pg_amcheck does not exit as a consequence of them.  As 
>> discussed above, failing to send the query to the server is not viewed as a 
>> reason to exit, either.  It would be hard to quantify all the failure modes, 
>> but presumably the catalogs for a database could be messed up enough to 
>> cause such failures, and I'm not sure that pg_amcheck should just abort.
> 
> I agree that exit(1) should happen after any error intended to stop
> the program. But I think it should also happen at the end of the run
> if we hit any problems for which we did not stop, so that exit(0)
> means your database is healthy.

In v40, exit(1) means the program encountered fatal errors leading it to stop, 
and exit(2) means that a non-fatal error and/or corruption reports occurred 
somewhere during the processing.  Otherwise, exit(0) means your database was 
successfully checked and is healthy.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Reply via email to