> On Feb 17, 2021, at 12:56 PM, Robert Haas <robertmh...@gmail.com> wrote:
>
> On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger
> <mark.dil...@enterprisedb.com> wrote:
>> It will reconnect and retry a command one time on error. That should cover
>> the case that the connection to the database was merely lost. If the second
>> attempt also fails, no further retry of the same command is attempted,
>> though commands for remaining relation targets will still be attempted, both
>> for the database that had the error and for other remaining databases in the
>> list.
>>
>> Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3`
>> could result in two failures per relation in db2 before finally moving on to
>> db3. That seems pretty awful considering how many relations that could be,
>> but failing to soldier on in the face of errors seems a strange design for a
>> corruption checking tool.
>
> That doesn't seem right at all. I think a PQsendQuery() failure is so
> remote that it's probably justification for giving up on the entire
> operation. If it's caused by a problem with some object, it probably
> means that accessing that object caused the whole database to go down,
> and retrying the object will take the database down again. Retrying
> the object is betting that the user interrupted connectivity between
> pg_amcheck and the database but the interruption is only momentary and
> the user actually wants to complete the operation. That seems unlikely
> to me. I think it's far more probably that the database crashed or got
> shut down and continuing is futile.
>
> My proposal is: if we get an ERROR trying to *run* a query, give up on
> that object but still try the other ones after reconnecting. If we get
> a FATAL or PANIC trying to *run* a query, give up on the entire
> operation. If even sending a query fails, also give up.
This is changed in v40 as you propose to exit on FATAL and PANIC level errors
and on error to send a query. On lesser errors (which includes all corruption
reports about btrees and some heap corruption related errors), the slot's
connection is still useable, I think. Are there cases where the error is lower
than FATAL and yet the connection needs to be reestablished? It does not seem
so from the testing I have done, but perhaps I'm not thinking of the right sort
of non-fatal error?
(I'll wait to post v40 until after hearing your thoughts on this.)
>> In v39, exit(1) is used for all errors which are intended to stop the
>> program. It is important to recognize that finding corruption is not an
>> error in this sense. A query to verify_heapam() can fail if the relation's
>> checksums are bad, and that happens beyond verify_heapam()'s control when
>> the page is not allowed into the buffers. There can be errors if the file
>> backing a relation is missing. There may be other corruption error cases
>> that I have not yet thought about. The connections' errors get reported to
>> the user, but pg_amcheck does not exit as a consequence of them. As
>> discussed above, failing to send the query to the server is not viewed as a
>> reason to exit, either. It would be hard to quantify all the failure modes,
>> but presumably the catalogs for a database could be messed up enough to
>> cause such failures, and I'm not sure that pg_amcheck should just abort.
>
> I agree that exit(1) should happen after any error intended to stop
> the program. But I think it should also happen at the end of the run
> if we hit any problems for which we did not stop, so that exit(0)
> means your database is healthy.
In v40, exit(1) means the program encountered fatal errors leading it to stop,
and exit(2) means that a non-fatal error and/or corruption reports occurred
somewhere during the processing. Otherwise, exit(0) means your database was
successfully checked and is healthy.
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company