On 2018-06-07 14:19:18 -0700, Andres Freund wrote: > Hi, > > On 2018-03-29 12:17:24 +0100, Greg Stark wrote: > > I'm poking around to see debug a vacuuming problem and wondering if > > I've found something more serious. > > > > As far as I can tell the snapshots on HOT standby are built using a > > list of running xids that the primary builds and puts in the WAL and > > that seems to include all xids from transactions running in all > > databases. The HOT standby would then build a snapshot and eventually > > send the xmin of that snapshot back to the primary in the hot standby > > feedback and that would block vacuuming tuples that might be visible > > to the standby. > > > Many ages ago Alvaro sweated blood to ensure vacuums could run for > > long periods of time without holding back the xmin horizon and > > blocking other vacuums from cleaning up tuples. That's the purpose of > > the excludeVacuum flag in GetCurrentVirtualXIDs(). That's possible > > because we know vacuums won't insert any tuples that queries might try > > to view and also vacuums won't try to perform any sql queries on other > > tables. > > > I can't find anywhere that the standby snapshot building mechanism > > gets this same information about which xids are actually vacuums that > > can be ignored when building a snapshot. So I'm concerned that the hot > > standby sending back its xmin would be effectively undermining this > > mechanism and forcing vacuum xids to be included in the xmin horizon > > and prevent vacuuming of tuples. > > > Am I missing something obvious? Is this a known problem? > > Maybe I'm missing something, but the running transaction data reported > to the standby does *NOT* include anything about lazy vacuums - they > don't have an xid. The reason there's PROC_IN_VACUUM etc isn't the xid, > it's *xmin*, no? > > We currently do acquire an xid when truncating the relation - but I > think it'd somewhat fair to argue that that's somewhat of a bug. The > reason a log is acquired is that we need to log AEL locks, and that > currently means they have to be assigned to a transaction. > > Given that the truncation happens at the end of VACUUM and it *NEEDS* to > be present on the standby - otherwise the locking stuff is useless - I > don't think the fix commited in this thread is correct. > > Wonder if the right thing here wouldn't be to instead transiently > acquire an AEL lock during replay when truncating a relation?
Isn't the fact that vacuum truncation requires an AEL, and that the change committed today excludes those transactions from running xacts records, flat out broken? Look at: void ProcArrayApplyRecoveryInfo(RunningTransactions running) ... /* * Remove stale locks, if any. * * Locks are always assigned to the toplevel xid so we don't need to care * about subxcnt/subxids (and by extension not about ->suboverflowed). */ StandbyReleaseOldLocks(running->xcnt, running->xids); by excluding running transactions you have, as far as I can tell, effectively removed the vacuum truncation AEL from the standby. Now that only happens when a running xact record is logged, but as that happens in the background... I also don't understand why this change would be backpatched in the first place. It's a relatively minor efficiency thing, no? Greetings, Andres Freund