On Sun, Sep 20, 2020 at 10:43 PM Tom Lane <t...@sss.pgh.pa.us> wrote: > > I wrote: > > So this confirms the suspicion that the cause of the buildfarm > > failures is a concurrently-open transaction, presumably from > > autovacuum. I don't have time to poke further right now. > .. > 2. As Amit suspected, there's an inconsistency between pruneheap.c's > rules for which tuples are removable and vacuum.c's rules for that. > This seems like a massive bug in its own right: what's the point of > pruneheap.c going to huge effort to decide whether it should keep a > tuple if vacuum will then kill it anyway? I do not understand why > whoever put in the GlobalVisState stuff only applied it in pruneheap.c > and not VACUUM proper. > > These two points interact, in that we don't get to the "cannot freeze" > failure except when pruneheap has decided not to remove something that > is removable according to VACUUM's rules. (VACUUM doesn't actually > remove it, because lazy_scan_heap won't try to remove HeapOnly tuples > even when it thinks they're HEAPTUPLE_DEAD; but then it tries to freeze > the tuple, and heap_prepare_freeze_tuple spits up.) However, if I revert > a7212be8b then the pg_surgery test still fails in the presence of a > concurrent transaction (both of the expected "skipping TID" notices > disappear). So reverting that patch wouldn't get us out of trouble. > > I think to move forward, we need to figure out what the freezing > behavior ought to be for temp tables. We could make it the same > as it was before a7212be8b, which'd just require some more complexity > in vacuum_set_xid_limits. However, that negates the idea that we'd > like VACUUM's behavior on a temp table to be fully independent of > whether concurrent transactions exist. I'd prefer to allow a7212be8b's > behavior to stand, but then it seems we need to lobotomize the error > check in heap_prepare_freeze_tuple to some extent. > > Independently of that, it seems like we need to fix things so that > when pruneheap.c is called by vacuum, it makes EXACTLY the same > dead-or-not-dead decisions that the main vacuum code makes. This > business with applying some GlobalVisState rule or other instead > seems just as unsafe as can be. >
Yeah, on a quick look it seems before commit dc7420c2c9 the pruneheap.c and the main Vacuum code use to make the same decision and that is commit which has introduced GlobalVisState stuff. > AFAICS, there is no chance of the existing pg_surgery regression test > being fully stable if we don't fix both things. > What if ensure that it runs with autovacuum = off and there is no parallel test running? I am not sure about the second part but if we can do that then the test will be probably stable. -- With Regards, Amit Kapila.