Re: pg_amcheck contrib application

Robert Haas Wed, 03 Mar 2021 09:16:18 -0800

On Wed, Mar 3, 2021 at 10:22 AM Mark Dilger
<mark.dil...@enterprisedb.com> wrote:
> > Your four messages about there being nothing to check seem like they
> > could be consolidated down to one: "nothing to check for pattern
> > \"%s\"".
>
> I anticipated your review comment, but I'm worried about the case that 
> somebody runs
>
>   pg_amcheck -t "foo" -i "foo"
>
> and one of those matches and the other does not.  The message 'nothing to 
> check for pattern "foo"' will be wrong (because there was something to check 
> for it) and unhelpful (because it doesn't say which failed to match.)


Fair point.

> Changed, though I assumed your parens for corruption() were not intended.

Uh, yeah.

> Thanks for the review!

+        fprintf(stderr, "%s: no relations to check", progname);

Missing newline.

Generally, I would favor using pg_log_whatever as a way of reporting
messages starting when option parsing is complete. In other words,
starting here:

+        fprintf(stderr, "%s: no databases to check\n", progname);

I see no real advantage in having a bunch of these using
fprintf(stderr, ...), which to me seems most appropriate only for very
early failures.

Perhaps amcheck_sql could be spread across fewer lines, now that it
doesn't have so many decorations?

pg_basebackup uses -P as a short form for --progress, so maybe we
should match that here.

When I do "pg_amcheck --progress", it just says "259/259 (100%)" which
I don't find too clear. The corresponding pg_basebackup output is
"32332/32332 kB (100%), 1/1 tablespace" which has the advantage of
including units. I think if you just add the word "relations" to your
message it will be nicer.

When I do "pg_amcheck -s public" it tells me that there are no
relations to check in schemas for "public". I think "schemas matching"
would read better than "schemas for." Similar with the other messages.
When I try "pg_amcheck -t nessie" it tells me that there are no tables
to check for "nessie" but saying that there are no tables to check
matching "nessie" to me sounds more natural.

The code doesn't seem real clear on the difference between a database
name and a pattern. Consider:

createdb rhaas
createdb 'rh*s'
PGDATABASE='rh*s' pg_amcheck

It checks the rhaas database, which I venture to say is just plain wrong.

The error message when I exclude the only checkable database is not
very clear. "pg_amcheck -D rhaas"  says pg_amcheck: no checkable
database: "rhaas". Well, I get that there's no checkable database. But
as a user I have no idea what "rhaas" is. I can even get it to issue
this complaint more than once:

createdb q
createdb qq
pg_amcheck -D 'q*' q qq

Now it issues the "no checkable database" complaint twice, once for q
and once for qq. But if there's no checkable database, I only need to
know that once. Either the message is wrongly-worded, or it should
only be issued once and doesn't need to include the pattern. I think
it's the second one, but I could be wrong.

Using a pattern as the only or first argument doesn't work; i.e.
"pg_amcheck rhaas" works but "pg_amcheck rhaa?" fails because there is
no database with that exact literal name. This seems like another
instance of confusion between a literal database name and a database
name pattern. I'm not quite sure what the right solution is here. We
could give up on having database patterns altogether -- the comparable
issue does not arise for database and schema name patterns -- or the
maintenance database could default to something that's not going to be
a pattern, like "postgres," rather than being taken from a
command-line argument that is intended to be a pattern. Or some hybrid
approach e.g. -d options are patterns, but don't set the maintenance
database, while extra command line arguments are literal database
names, and thus are presumably OK to use as the maintenance DB. But
it's too weird IMHO to support patterns here and then have supplying
one inevitably fail unless you also specify --maintenance-db.

It's sorta annoying that there doesn't seem to be an easy way to find
out exactly what relations got checked as a result of whatever I did.
Perhaps pg_amcheck -v should print a line for each relation saying
that it's checking that relation; it's not actually that verbose as
things stand. If we thought that was overdoing it, we could set things
up so that multiple -v options keep increasing the verbosity level, so
that you can get this via pg_amcheck -vv. I submit that pg_amcheck -e
is not useful for this purpose because the queries, besides being
long, use the relation OIDs rather than the names, so it's not easy to
see what happened.

I think that something's not working in terms of schema exclusion. If
I create a brand-new database and then run "pg_amcheck -S pg_catalog
-S information_schema -S pg_toast" it still checks stuff. In fact it
seems to check the exact same amount of stuff that it checks if I run
it with no command-line options at all. In fact, if I run "pg_amcheck
-S '*'" that still checks everything. Unless I'm misunderstanding what
this option is supposed to do, the fact that a version of this patch
where this seemingly doesn't work at all escaped to the list suggests
that your testing has got some gaps.

I like the semantics of --no-toast-expansion and --no-index-expansion
as you now have them, but I find I don't really like the names. Could
I suggest --no-dependent-indexes and --no-dependent-toast?

I tried pg_amcheck --startblock=tsgsdg and got an error message
without a trailing newline. I tried --startblock=-525523 and got no
error. I tried --startblock=99999999999999999999999999 and got a
complaint that the value was out of bounds, but without a trailing
newline. Maybe there's an argument that the bounds don't need to be
checked, but surely there's no argument for checking one and not the
other. I haven't tried the corresponding cases with --endblock but you
should. I tried --startblock=2 --endblock=1 and got a complaint that
the ending block precedes the starting block, which is totally
reasonable (though I might say "start block" and "end block" rather
than using the -ing forms) but this message is prefixed with
"pg_amcheck: " whereas the messages about an altogether invalid
starting block where not so prefixed. Is there a reason not to make
this consistent?

I also tried using a random positive integer for startblock, and for
every relation I am told "ERROR:  starting block number must be
between 0 and <whatever>". That makes sense, because I used a big
number for the start block and I don't have any big relations, but it
makes for an absolute ton of output, because every verify_heapam query
is 11 lines long. This suggests a couple of possible improvements.
First, maybe we should only display the query that produced the error
in verbose mode. Second, maybe the verify_heapam() query should be
tightened up so that it doesn't stretch across quite so many lines. I
think the call to verify_heapam() could be spread across like 2 lines
rather than 7, which would improve readability. On a related note, I
wonder why we need every verify_heapam() call to join to pg_class and
pg_namespace just to fetch the schema and table name which,
presumably, we should or at least could already have. This kinda
relates to my comment earlier about making -v print a message per
relation so that we can see, in human-readable format, which relations
are getting checked. Right now, if you got an error checking just one
relation, how would you know which relation you got it from? Unless
the server happens to report that information in the message, you're
just in the dark, because pg_amcheck won't tell you.

The line "Read the description of the amcheck contrib module for
details" seems like it could be omitted. Perhaps the first line of the
help message could be changed to read "pg_amcheck uses amcheck to find
corruption in a PostgreSQL database." or something like that, instead.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Re: pg_amcheck contrib application

Reply via email to