> On Mar 15, 2021, at 11:57 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> 
> Not sure that I believe the theory that this is from bad luck of
> concurrent autovacuum timing, though.  The fact that we're seeing
> this on just those two animals suggests strongly to me that it's
> architecture-correlated, instead.

I find it a little hard to see how amcheck is tripping over a toasted value 
just in this one table, pg_statistic, and not in any of the others.  The error 
message says the problem is in attribute 27, which I believe means it is 
stavalues2.  The comment in the header for this catalog is intriguing:

    /*
     * Values in these arrays are values of the column's data type, or of some
     * related type such as an array element type.  We presently have to cheat
     * quite a bit to allow polymorphic arrays of this kind, but perhaps
     * someday it'll be a less bogus facility.
     */
    anyarray    stavalues1;
    anyarray    stavalues2;
    anyarray    stavalues3;
    anyarray    stavalues4;
    anyarray    stavalues5;

This is hard to duplicate in a test, because you can't normally create tables 
with pseudo-type columns.  However, if amcheck is walking the tuple and does 
not correctly update the offset with the length of attribute 26, it may try to 
read attribute 27 at the wrong offset, unsurprisingly leading to garbage, 
perhaps a garbage toast pointer.  The attached patch v7-0004 adds a check to 
verify_heapam to see if the va_toastrelid matches the expected toast table oid 
for the table we're reading.  That check almost certainly should have been 
included in the initial version of verify_heapam, so even if it does nothing to 
help us in this issue, it's good that it be committed, I think.

It is unfortunate that the failing test only runs pg_amcheck after creating 
numerous corruptions, as we can't know if pg_amcheck would have complained 
about pg_statistic before the corruptions were created in other tables, or if 
it only does so after.  The attached patch v7-0003 adds a call to pg_amcheck 
after all tables are created and populated, but before any corruptions are 
caused.  This should help narrow down what is happening, and doesn't hurt to 
leave in place long-term.

I don't immediately see anything wrong with how pg_statistic uses a 
pseudo-type, but it leads me to want to poke a bit more at pg_statistic on 
hornet and tern, though I don't have any regression tests specifically for 
doing so.

Tests v7-0001 and v7-0002 are just repeats of the tests posted previously.

Attachment: v7-0001-Turning-off-autovacuum-during-corruption-tests.patch
Description: Binary data

Attachment: v7-0002-Fixing-a-confusing-amcheck-corruption-message.patch
Description: Binary data

Attachment: v7-0003-Extend-pg_amcheck-test-suite.patch
Description: Binary data

Attachment: v7-0004-Add-extra-check-of-toast-pointer-in-amcheck.patch
Description: Binary data


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Reply via email to