On Thu, 20 Jan 2022 at 14:31, Robert Haas <robertmh...@gmail.com> wrote:
>
> In my view, previous efforts in this area have been too simplistic.
>

One thing I've been wanting to do something about: I think autovacuum needs to be a little cleverer about when *not* to vacuum a table because it won't do any good.

I've seen a lot of cases where autovacuum kicks off a vacuum of a table even though the globalxmin hasn't really advanced significantly over the oldest frozen xid. When it's a large table this really hurts, because it could be hours or days before it finishes, and at that point there's quite a bit of bloat.

This isn't a common occurrence; it happens when the system is broken in some way. Either there's an idle-in-transaction session or something else is keeping the global xmin held back.

What it does, though, is make things *much* worse and *much* harder for a non-expert to hit on the right remediation. It's easy enough to tell them to look for these idle-in-transaction sessions or set timeouts. It's much harder to determine whether it's a good idea for them to go and kill the vacuum that's been running for days. And it's not a great thing for people to be getting in the habit of doing either.

I want to be able to stop telling people to kill vacuums kicked off by autovacuum. I feel like it's a bad thing for someone to ever have to do, and I know that some fraction of the time I tell them to do it, it will have been a terrible thing to have done (but we'll never know which times those were).

Determining whether a running vacuum is actually doing any good is pretty hard, and on older versions probably impossible.

I was thinking of just putting a check in before kicking off a vacuum: if the globalxmin is lagging a significant fraction of the distance back to the relfrozenxid, then log a warning instead. Basically it means "we can't keep the bloat below the threshold due to the idle transactions et al, not because there's insufficient i/o bandwidth".

At the same time it would be nice if autovacuum could recognize when the i/o bandwidth is insufficient.
If it finishes a vacuum, it could recheck whether the table is still eligible for vacuuming and log that it's unable to keep up with the vacuuming requirements -- but right now that would be a lie much of the time, because often it's not a lack of bandwidth that is preventing it from keeping up.

--
greg
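
For illustration only, the pre-vacuum check described above could be sketched roughly as follows. This is not PostgreSQL code: the function names, the 0.5 threshold, and the idea of comparing the two distances as a simple ratio are all assumptions made for the example. The post itself only says "a significant fraction". The sketch also assumes the three xids are ordered relfrozenxid, then globalxmin, then next xid in circular xid space, and it ignores the reserved special xids.

```python
# Hypothetical sketch of the proposed pre-vacuum check. Nothing here is an
# actual PostgreSQL API; names and the default threshold are assumptions.

XID_MODULO = 2**32  # transaction ids are 32-bit counters that wrap around


def xid_distance(older: int, newer: int) -> int:
    """Number of xids from `older` forward to `newer`, allowing for wraparound."""
    return (newer - older) % XID_MODULO


def vacuum_looks_worthwhile(next_xid: int, global_xmin: int, relfrozenxid: int,
                            max_lag_fraction: float = 0.5) -> bool:
    """Return False when the global xmin is held back so far that a vacuum
    of this table is unlikely to remove much.

    age: how far the table's relfrozenxid is behind the next xid.
    lag: how far the global xmin is behind the next xid.
    If lag is a large fraction of age, most dead rows are still visible
    to some old snapshot and cannot be cleaned up yet.
    """
    age = xid_distance(relfrozenxid, next_xid)
    if age == 0:
        return False  # nothing has happened since the last freeze
    lag = xid_distance(global_xmin, next_xid)
    return lag / age < max_lag_fraction
```

A caller would log a warning instead of starting the vacuum whenever this returns False, for example when an idle-in-transaction session pins the global xmin close to the table's relfrozenxid.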