Re: BUG #17212: pg_amcheck fails on checking temporary relations

Mark Dilger Mon, 04 Oct 2021 17:33:15 -0700

> On Oct 4, 2021, at 4:45 PM, Peter Geoghegan <p...@bowt.ie> wrote:
> 
> I'm guessing that you meant REINDEX CONCURRENTLY.

Yes.

> Since you're talking about the case where it has an error

Sorry, I realized after hitting <send> that you might take it that way, but I 
mean the logs generally, not just postgres logs.  If somebody runs "reindex 
concurrently" on all tables at midnight every night, and they see one morning 
in the (non-postgres) logs from the midnight hour weird error messages about 
their RAID controller, they may well want to check all their indexes to see if 
any of them were corrupted.  This is a totally made-up example, but the idea 
that a user might want to check their indexes, tables, or both owing to errors 
of some kind is not far-fetched.  

> , the whole
> index build must have failed. So the user would get exactly what
> they'd expect -- verification of the original index, without any
> hindrance from the new/failed index.

Right, in that case, but not if hardware errors corrupted the index, and 
generated logs, without happening to trip up the reindex concurrently statement 
itself.

> (Thinks for a moment...)
> 
> Actually, I think that we'd only verify the original index, even
> before the error with CONCURRENTLY (though I've not checked that point
> myself).

To get back on track, let me say that I'm not taking the position that what 
pg_amcheck currently does is necessarily correct, but just that I'd like to be 
careful about what we change, if anything.  There are three things that seem to 
irritate people:

1) A non-zero exit code means "not everything was checked and passed" rather 
than "at least one thing is definitely corrupt".

2) Skipping of indexes is reported to the user with the word 'ERROR' in the 
report rather than, say, 'NOTICE'.

3) Deadlocks can occur

I have resisted changing #1 on the theory that `pg_amcheck --all && 
./post_all_checks_pass.sh` should only run the post_all_checks_pass.sh if 
indeed all checks have passed, and I'm interpreting skipping an index check as 
being contrary to that.  But maybe that's wrong of me.  I don't know.  There is 
already sloppiness between the time that pg_amcheck resolves which database 
relations are matched by --all, --table, --index, etc. and the time that all 
the checks are started, and again between that time and when the last one is 
complete.  Database objects could be created or dropped during those spans of 
time, in which case --all doesn't have quite so well defined a meaning.  But 
the user running pg_amcheck might also *know* that they aren't running any such 
DDL, and therefore expect --all to actually result in everything being checked.

I find it strange that I should do anything about #2 in pg_amcheck, since it's 
the function in verify_nbtree that phrases the situation as an error.  But I 
suppose I can just ignore that and have it print as a notice.  I'm genuinely 
not trying to give you grief here -- I simply don't like that pg_amcheck is 
adding commentary atop what the checking functions are doing.  I see a clean 
division between what pg_amcheck is doing and what amcheck is doing, and this 
feels to me to put that on the wrong side of the divide.  If refusing to check 
the index because it is not in the requisite state is a notice, then why 
wouldn't verify_nbtree raise it as one and return early rather than raising an 
error?

I also find it strange that #3 is being attributed to pg_amcheck's choice of 
how to call the checking function, because I can't think of any other function 
where we require the SQL caller to do anything like what you are requiring here 
in order to prevent deadlocks, and also because the docs for the functions 
don't say that a deadlock is possible, merely that the function may return with 
an error.  I was totally content to get an error back, since errors are how the 
verify_nbtree functions communicate everything else, and the handler for those 
functions is already prepared to deal with the error messages so returned.  But 
it clearly annoys you that pg_amcheck is doing this, so I'll go forward with 
the patch that I already have written which does otherwise.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: BUG #17212: pg_amcheck fails on checking temporary relations

Reply via email to