On Thu, Feb 3, 2022 at 3:14 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> + The only other option that may be combined with > <literal>VERBOSE</literal>, although in single-user mode no client > messages are > + output. > > Given VERBOSE with EMERGENCY can work only in multi-user mode, why > only VERBOSE can be specified with EMERGENCY? I think the same is true > for other options like PARALLEL; PARALLEL can work only in multi-user > mode. You are right; it makes sense to allow options that would be turned off automatically in single-user mode. Even if we don't expect it to be used in normal mode, the restrictions should make sense. Also, maybe documenting the allowed combinations is a distraction in the main entry and should be put in the notes at the bottom. > + It performs a database-wide vacuum on tables, toast tables, and > materialized views whose > + xid age or mxid age is older than 1 billion. > > Do we need to allow the user to specify the threshold or need a higher > value (at least larger than 1.6 billion, default value of > vacuum_failsafe_age)? I imagined a case where there are a few very-old > tables (say 2 billion old) and many tables that are older than 1 > billion. In this case, VACUUM (EMERGENCY) would take a long time to > complete. I still don't think fine-tuning is helpful here. Shutdown vacuum should be just as trivial to run as it is now, but with better behavior. I believe a user knowledgeable enough to come up with the best number is unlikely to get in this situation in the first place. I'm also not sure a production support engineer would (or should) immediately figure out a better number than a good default. That said, the 1 billion figure was a suggestion from Peter G. upthread, and a higher number could be argued. > But to minimize the downtime, we might want to run VACUUM > (EMERGENCY) on only the very-old tables, start the cluster in > multi-user mode, and run vacuum on multiple tables in parallel. That's exactly the idea. Also, back in normal mode, we can start streaming WAL again. However, we don't want to go back online so close to the limit that we risk shutdown again. People have a reasonable expectation that if you fix an emergency, it's now fixed and the application can go back online. Falling down repeatedly, or worrying if it's possible, is very bad. > + if (params->options & VACOPT_EMERGENCY) > + { > + /* > + * Only consider relations able to hold unfrozen XIDs (anything > else > + * should have InvalidTransactionId in relfrozenxid anyway). > + */ > + if (classForm->relkind != RELKIND_RELATION && > + classForm->relkind != RELKIND_MATVIEW && > + classForm->relkind != RELKIND_TOASTVALUE) > + { > + Assert(!TransactionIdIsValid(classForm->relfrozenxid)); > + Assert(!MultiXactIdIsValid(classForm->relminmxid)); > + continue; > + } > + > + table_xid_age = DirectFunctionCall1(xid_age, > classForm->relfrozenxid); > + table_mxid_age = DirectFunctionCall1(mxid_age, > classForm->relminmxid); > + > > I think that instead of calling xid_age and mxid_age for each > relation, we can compute the thresholds for xid and mxid once, and > then compare them to relation's relfrozenxid and relminmxid. That sounds like a good idea if it's simple to implement, so I will try it. If it adds complexity, I don't think it's worth it. Scanning a few thousand rows in pg_class along with the function calls is tiny compared to the actual vacuum work. -- John Naylor EDB: http://www.enterprisedb.com