Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

Justin Pryzby Tue, 08 Jun 2021 14:23:59 -0700

On Tue, Jun 08, 2021 at 01:52:40PM -0700, Peter Geoghegan wrote:
> On Tue, Jun 8, 2021 at 12:27 PM Justin Pryzby <pry...@telsasoft.com> wrote:
> > > They're running this:
> > > | PGOPTIONS="--deadlock_timeout=333ms -cstatement-timeout=3600s" psql -c 
> > > "REINDEX INDEX CONCURRENTLY $i"
> > > And if it times out, it then runs: $PSQL "DROP INDEX CONCURRENTLY $bad"
> > ...
> > > $ date -d @1623121264
> > > Mon Jun  7 22:01:04 CDT 2021
> 
> Perhaps reindex was waiting on the VACUUM process to finish, while
> VACUUM was (in effect) busy waiting on the REINDEX to finish.


But when the reindex exited, the vacuum kept spinning until I sent SIGABRT 12
hours later.

> The other bug that you reported back in July of last year [1] (which
> involved a "REINDEX INDEX pg_class_tblspc_relfilenode_index") was
> pretty easy to recreate, just by running the REINDEX in a tight loop.
> Could you describe how tricky it is to repro this issue now?

I didn't try to reproduce it, but now hit it twice in 3 days.
(Actuallly, I did try to reproduce it, by running tight loops around
vacuum/analyze pg_statistic, which didn't work.  Maybe because reindex is
what's important.)

I mentioned that we've been running pg14b1 since 2021-05-20. So it ran fine for
13 days before breaking in an obvious way.

OH - in the first instance, I recorded the stuck process, but not its
timestamp.  It looks like that autovacuum process *also* started right after
10pm, which is when the reindex job starts.  So it seems like REINDEX may
trigger this pretty consistently:

(gdb) frame 4
#4  heap_vacuum_rel (rel=0x7f0349466d28, params=0x1c77b7c, bstrategy=<optimized 
out>) at vacuumlazy.c:612
612     vacuumlazy.c: No such file or directory.
(gdb) info locals
starttime = 676177375524485

$ date -d '2000-01-01 UTC + 676177375seconds'
Fri Jun  4 22:02:55 CDT 2021

> If you instrument the "goto retry" code added to lazy_scan_prune() by
> commit 8523492d, then you might notice that it is hit in contexts that
> it was never intended to work with. If you can reduce reproducing the
> problem to reproducing hitting that goto in the absence of an aborted
> transaction, then it might be a lot easier to produce a simple repro.

I'm not sure what you're suggesting ?  Maybe I should add some NOTICES there.

I'm not sure why/if pg_statistic is special, but I guess when analyze happens,
it gets updated, and eventually processed by autovacuum.

The main table here is a partitioned table which receives UPDATEs which moves
tuples into a different partition (probably more often than what's
recommended).

 autovacuum_analyze_threshold      | 2
 autovacuum_analyze_scale_factor   | 0.005
 autovacuum_vacuum_scale_factor    | 0.005
 log_autovacuum_min_duration       | 9000
 checkpoint_timeout                | 60
 wal_level                         | minimal

In pg14, the parent table is auto-analyzed.

-- 
Justin

Re: pg14b1 stuck in lazy_scan_prune/heap_page_prune of pg_statistic

Reply via email to