On Tue, Jun 08, 2021 at 01:52:40PM -0700, Peter Geoghegan wrote: > On Tue, Jun 8, 2021 at 12:27 PM Justin Pryzby <pry...@telsasoft.com> wrote: > > > They're running this: > > > | PGOPTIONS="--deadlock_timeout=333ms -cstatement-timeout=3600s" psql -c > > > "REINDEX INDEX CONCURRENTLY $i" > > > And if it times out, it then runs: $PSQL "DROP INDEX CONCURRENTLY $bad" > > ... > > > $ date -d @1623121264 > > > Mon Jun 7 22:01:04 CDT 2021 > > Perhaps reindex was waiting on the VACUUM process to finish, while > VACUUM was (in effect) busy waiting on the REINDEX to finish.
But when the reindex exited, the vacuum kept spinning until I sent SIGABRT 12 hours later. > The other bug that you reported back in July of last year [1] (which > involved a "REINDEX INDEX pg_class_tblspc_relfilenode_index") was > pretty easy to recreate, just by running the REINDEX in a tight loop. > Could you describe how tricky it is to repro this issue now? I didn't try to reproduce it, but now hit it twice in 3 days. (Actuallly, I did try to reproduce it, by running tight loops around vacuum/analyze pg_statistic, which didn't work. Maybe because reindex is what's important.) I mentioned that we've been running pg14b1 since 2021-05-20. So it ran fine for 13 days before breaking in an obvious way. OH - in the first instance, I recorded the stuck process, but not its timestamp. It looks like that autovacuum process *also* started right after 10pm, which is when the reindex job starts. So it seems like REINDEX may trigger this pretty consistently: (gdb) frame 4 #4 heap_vacuum_rel (rel=0x7f0349466d28, params=0x1c77b7c, bstrategy=<optimized out>) at vacuumlazy.c:612 612 vacuumlazy.c: No such file or directory. (gdb) info locals starttime = 676177375524485 $ date -d '2000-01-01 UTC + 676177375seconds' Fri Jun 4 22:02:55 CDT 2021 > If you instrument the "goto retry" code added to lazy_scan_prune() by > commit 8523492d, then you might notice that it is hit in contexts that > it was never intended to work with. If you can reduce reproducing the > problem to reproducing hitting that goto in the absence of an aborted > transaction, then it might be a lot easier to produce a simple repro. I'm not sure what you're suggesting ? Maybe I should add some NOTICES there. I'm not sure why/if pg_statistic is special, but I guess when analyze happens, it gets updated, and eventually processed by autovacuum. The main table here is a partitioned table which receives UPDATEs which moves tuples into a different partition (probably more often than what's recommended). autovacuum_analyze_threshold | 2 autovacuum_analyze_scale_factor | 0.005 autovacuum_vacuum_scale_factor | 0.005 log_autovacuum_min_duration | 9000 checkpoint_timeout | 60 wal_level | minimal In pg14, the parent table is auto-analyzed. -- Justin