Heikki Linnakangas <hlinn...@iki.fi> writes: > After my commit c532d15ddd to split up copy.c, buildfarm animal "prion" > failed in pg_upgrade > (https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2020-11-23%2009%3A23%3A16):
prion's continued to fail, rarely, in this same way: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-09%2009%3A13%3A16 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-02-15%2004%3A08%3A06 https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2021-05-12%2018%3A43%3A14 The failures are remarkably identical, and they also look a lot like field reports we've been seeing off and on for years. I do not know why it always seems to be pg_toast_2619 (i.e. pg_statistic) that's affected, but the pattern is pretty undeniable by now. What I do have that's new is that *I can reproduce it*, at long last. For me, installing the attached patch and running pg_upgrade's "make check" fails, pretty much every time, with symptoms identical to prion's. The patch consists of (1) 100ms delay just before detoasting, when loading a pg_statistic catcache entry that has toasted datums (2) provisions to mark such catcache entries dead immediately (effectively, CATCACHE_FORCE_RELEASE for only these tuples); this is to force us through (1) as often as possible (3) some quick-hack debugging aids added to HeapTupleSatisfiesToast, plus convert the missing-chunk error to PANIC to get a stack trace. If it doesn't reproduce for you, try adjusting the delay. 100ms was the first value I tried, though, so I think it's probably not too sensitive. The trace I'm getting shows pretty positively that autovacuum has fired on pg_statistic, and removed the needed toast entries, just before the failure. So something is wrong with our logic about when toast entries can be removed. I do not have a lot of idea why, but I see something that is probably a honkin' big clue: 2021-05-15 17:28:05.965 EDT [2469444] LOG: HeapTupleSatisfiesToast: xmin 2 t_infomask 0x0b02 That is, the toast tuples in question are not just frozen, but actually have xmin = FrozenTransactionId. I do not think that is normal --- at least, it's not the state immediately after initdb, and I can't make it happen with "vacuum freeze pg_statistic". A plausible theory is that pg_upgrade caused this to happen (but how?) and then vacuuming of toast rows goes off the rails because of it. Anyway, I have no more time to poke at this right now, so I'm posting the reproducer in case someone else wants to look at it. regards, tom lane
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c index d3c57cd16a..5830df83c2 100644 --- a/src/backend/access/heap/heapam_visibility.c +++ b/src/backend/access/heap/heapam_visibility.c @@ -73,11 +73,14 @@ #include "access/xlog.h" #include "storage/bufmgr.h" #include "storage/procarray.h" +#include "tcop/tcopprot.h" #include "utils/builtins.h" #include "utils/combocid.h" #include "utils/snapmgr.h" +bool trace_toast_visibility = false; + /* * SetHintBits() * @@ -366,6 +369,13 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot, Assert(ItemPointerIsValid(&htup->t_self)); Assert(htup->t_tableOid != InvalidOid); + if (trace_toast_visibility) + ereport(LOG, + errmsg("HeapTupleSatisfiesToast: xmin %u t_infomask 0x%04x", + HeapTupleHeaderGetXmin(tuple), + tuple->t_infomask), + debug_query_string ? 0 : errbacktrace()); + if (!HeapTupleHeaderXminCommitted(tuple)) { if (HeapTupleHeaderXminInvalid(tuple)) @@ -420,6 +430,11 @@ HeapTupleSatisfiesToast(HeapTuple htup, Snapshot snapshot, return false; } + if (trace_toast_visibility) + elog(LOG, "HeapTupleSatisfiesToast: xmin %u t_infomask 0x%04x OK", + HeapTupleHeaderGetXmin(tuple), + tuple->t_infomask); + /* otherwise assume the tuple is valid for TOAST. */ return true; } diff --git a/src/backend/access/heap/heaptoast.c b/src/backend/access/heap/heaptoast.c index 55bbe1d584..03ea2dc80d 100644 --- a/src/backend/access/heap/heaptoast.c +++ b/src/backend/access/heap/heaptoast.c @@ -781,7 +781,7 @@ heap_fetch_toast_slice(Relation toastrel, Oid valueid, int32 attrsize, * Final checks that we successfully fetched the datum */ if (expectedchunk != (endchunk + 1)) - ereport(ERROR, + ereport(PANIC, (errcode(ERRCODE_DATA_CORRUPTED), errmsg_internal("missing chunk number %d for toast value %u in %s", expectedchunk, valueid, diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c index 4fbdc62d8c..a1f923e02c 100644 --- a/src/backend/utils/cache/catcache.c +++ b/src/backend/utils/cache/catcache.c @@ -23,6 +23,7 @@ #include "access/xact.h" #include "catalog/pg_collation.h" #include "catalog/pg_operator.h" +#include "catalog/pg_statistic.h" #include "catalog/pg_type.h" #include "common/hashfn.h" #include "miscadmin.h" @@ -39,6 +40,7 @@ #include "utils/resowner_private.h" #include "utils/syscache.h" +extern bool trace_toast_visibility; /* #define CACHEDEBUG */ /* turns DEBUG elogs on */ @@ -1810,6 +1812,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, { CatCTup *ct; HeapTuple dtp; + bool makedead = false; MemoryContext oldcxt; /* negative entries have no tuple associated */ @@ -1827,7 +1830,16 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, * something using a slightly stale catcache entry. */ if (HeapTupleHasExternal(ntp)) + { + if (cache->cc_reloid == StatisticRelationId) + { + pg_usleep(100000); + makedead = true; + trace_toast_visibility = true; + } dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc); + trace_toast_visibility = false; + } else dtp = ntp; @@ -1887,7 +1899,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments, ct->my_cache = cache; ct->c_list = NULL; ct->refcount = 0; /* for the moment */ - ct->dead = false; + ct->dead = makedead; ct->negative = negative; ct->hash_value = hashValue;