On 2019-07-17 11:53:48 -0400, Tom Lane wrote: > David Rowley <david.row...@2ndquadrant.com> writes: > > Surely it can't be that since that just sets what *pages gets set to. > > Tom mentioned that following was returning 0 pages and tuples: > > > -- Temporary hack to investigate whether extra vacuum/analyze is happening > > select relname, relpages, reltuples > > from pg_class > > where relname like '__star' order by relname; > > relname | relpages | reltuples > > ---------+----------+----------- > > a_star | 1 | 3 > > I poked around a little and came up with a much simpler theory: > VACUUM will not change relpages/reltuples if it does not scan any pages > (cf. special case for tupcount_pages == 0 in heap_vacuum_rel, at line 343 > in HEAD's vacuumlazy.c). And, because sanity_check.sql's VACUUM is a > plain unaggressive vacuum, all that it takes to make it skip over a_star's > one page is for somebody else to have a pin on that page.
I wonder if we could set log_min_messages to DEBUG2 on occasionally failing machines to test that theory. That ought to hit appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ", "Skipped %u pages due to buffer pins, ", vacrelstats->pinskipped_pages), ... ereport(elevel, (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", RelationGetRelationName(onerel), tups_vacuumed, num_tuples, vacrelstats->scanned_pages, nblocks), errdetail_internal("%s", buf.data))); > So a chance > collision with the bgwriter or checkpointer could cause the observed > symptom, not just for a_star but for the other single-page relations that > are at stake here. Those pages are certainly dirty after create_misc.sql, > so it's hardly implausible for one of these processes to be holding pin > while trying to write out the buffer at the time sanity_check.sql runs. > > A brute-force way to fix this (or at least reduce the odds quite a bit) > would be to have sanity_check.sql issue a CHECKPOINT before its VACUUM, > thereby guaranteeing that none of these pages are still in need of being > written. Not sure how much that'd penalize the regression tests' runtime, > or whether we'd have a loss of test coverage of VACUUM behaviors. Alternatively we could VACUUM FREEZE the relevant tables? That then ought to hit the blocking codepath in lazu_scan_heap()? Greetings, Andres Freund