On Fri, Mar 12, 2021 at 9:34 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> I agreed that when we're close to overrunning the
> maintenance_work_mem space, the situation changes. If we skip it even
> in that case, the next vacuum will likely use up
> maintenance_work_mem, leading to a second index scan. Which is
> bad.
>
> If this threshold is aimed at avoiding a second index scan due to
> overrunning maintenance_work_mem, using a ratio of
> maintenance_work_mem would be a good idea. On the other hand, if it's
> to avoid accumulating debt that affects the cost of index vacuuming,
> using a ratio of the total heap tuples seems better.
It's both, together. These are two *independent* considerations/thresholds, at least in the code that decides whether or not we skip. Either threshold can force a full index scan (index vacuuming).

What I'm really worried about is falling behind (in terms of the amount of memory available for TIDs to delete in indexes) without any natural limit. Suppose we just have the SKIP_VACUUM_PAGES_RATIO threshold (i.e. no maintenance_work_mem threshold thing). With just SKIP_VACUUM_PAGES_RATIO there will be lots of tables where index vacuuming is almost always avoided, which is good. But SKIP_VACUUM_PAGES_RATIO might be a bit *too* effective. If we have to do 2 or even 3 scans of the index when we finally get to index vacuuming, then that's not great -- it's inefficient -- but it's at least *survivable*. But what if there are 10, 100, even 1000 bulk delete calls for each index when it finally happens? That's completely intolerable.

In other words, I am not worried about debt, exactly. Debt is normal in moderation. Healthy, even. I am worried about bankruptcy, perhaps following a rare and extreme event. It's okay to be imprecise, but all of the problems must be survivable.

The important thing to me for a maintenance_work_mem threshold is that there is *some* limit. At the same time, it may totally be worth accepting 2 or 3 index scans during some eventual VACUUM operation if there are many more VACUUM operations that don't even touch the indexes -- that's a good deal! Also, it may actually be inherently necessary to accept a small risk of a future VACUUM operation that does multiple scans of each index -- that is probably an unavoidable part of skipping index vacuuming each time.

Think about the cost of index vacuuming (the amount of I/O and the duration of index vacuuming) as less and less memory is available for TIDs. It's non-linear. The cost explodes once we're past a certain point. The truly important thing is to "never get killed by the explosion".

> The situation we need to deal with here is a very large table
> that has a lot of dead tuples, but those fit in few heap pages (less
> than 1% of all heap blocks). In this case, it's likely that the number
> of dead tuples is also relatively small compared to the total heap
> tuples, as you mentioned. If dead tuples fit in few pages but
> accounted for most of the heap tuples, it would be a more
> serious situation, but there would definitely already be other problems.
> So considering those conditions, I agreed to use a ratio of
> maintenance_work_mem as a threshold. Maybe we can increase the
> constant to 70, 80, or so.

You mean 70% of maintenance_work_mem? That seems fine to me. See my "Why does lazy_vacuum_table_and_indexes() not make one decision for the entire VACUUM on the first call, and then stick to its decision?" remarks at the end of this email, though -- maybe it should not be an explicit threshold at all.

High level philosophical point: In general I think that the algorithm for deciding whether or not to perform index vacuuming should *not* be clever. It should also not focus on getting the benefit of skipping index vacuuming. I think that a truly robust design will be one that always starts with the assumption that index vacuuming will be skipped, and then "works backwards" by considering thresholds/reasons to *not* skip. For example, the SKIP_VACUUM_PAGES_RATIO thing. The risk of "explosions" or "bankruptcy" can be thought of as a cost here, too.
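To make the "two independent thresholds" idea concrete, here is a rough standalone sketch of the decision. Everything here is made up for illustration (it is not the patch's code): PAGES_WITH_DEAD_ITEMS_RATIO stands in for SKIP_VACUUM_PAGES_RATIO, and the 1%/70% figures are just the numbers we've been discussing.

/*
 * Sketch only, not the patch's code: two independent reasons to *not*
 * skip index vacuuming.  Either one on its own forces a full round of
 * bulk delete calls for every index.
 */
#include <stdbool.h>
#include <stdint.h>

#define PAGES_WITH_DEAD_ITEMS_RATIO  0.01   /* ~1% of heap blocks */
#define DEAD_TID_MEMORY_RATIO        0.70   /* ~70% of TID capacity */

static bool
should_do_index_vacuuming(int64_t npages_deadlp,   /* heap pages with LP_DEAD items */
                          int64_t rel_nblocks,     /* blocks in the whole table */
                          int64_t num_dead_tuples, /* TIDs collected so far */
                          int64_t max_dead_tuples) /* maintenance_work_mem capacity */
{
    /* Reason 1: too many heap pages have dead line pointers */
    if (npages_deadlp > rel_nblocks * PAGES_WITH_DEAD_ITEMS_RATIO)
        return true;

    /*
     * Reason 2: close to overrunning maintenance_work_mem -- skipping
     * now risks many index scans later ("bankruptcy")
     */
    if (num_dead_tuples > max_dead_tuples * DEAD_TID_MEMORY_RATIO)
        return true;

    /* Default assumption: skip index vacuuming; the debt stays bounded */
    return false;
}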
We should simply focus on the costs directly, without even trying to understand the relationship between each of the costs, and without really trying to understand the benefit to the user from skipping index vacuuming.

> > Have you thought more about how the skipping of index vacuuming can
> > be configured by users? Maybe a new storage param that works like
> > the current SKIP_VACUUM_PAGES_RATIO constant?
>
> Since it's unclear to me yet that adding a new storage parameter or
> GUC parameter for this feature would be useful even for future
> improvements in this area, I haven't thought yet about having users
> control skipping index vacuuming. I'm okay with a constant value for
> the threshold for now.

I agree -- a GUC will become obsolete in only a year or two anyway. And it will be too hard to tune.

Question about your patch: lazy_vacuum_table_and_indexes() can be called multiple times (when low on maintenance_work_mem). Each time it is called we decide what to do for that call and that batch of TIDs. But why should it work that way? The whole idea of a SKIP_VACUUM_PAGES_RATIO-style threshold doesn't make sense to me if the code in lazy_vacuum_table_and_indexes() resets npages_deadlp (sets it to 0) on each call.

I think that npages_deadlp should never be reset during a single VACUUM operation. npages_deadlp is supposed to be something that we track for the entire table. The patch actually compares it to the size of the whole table * SKIP_VACUUM_PAGES_RATIO inside lazy_vacuum_table_and_indexes():

> +    if (*npages_deadlp > RelationGetNumberOfBlocks(onerel) *
> SKIP_VACUUM_PAGES_RATIO)
> +    {
> +    }

The code that I have quoted here is actually how I expect SKIP_VACUUM_PAGES_RATIO to work, but I notice an inconsistency: lazy_vacuum_table_and_indexes() resets npages_deadlp later on, which makes either the quoted code or the reset code wrong (at least when VACUUM needs multiple calls to lazy_vacuum_table_and_indexes()). With multiple calls to lazy_vacuum_table_and_indexes() (due to low memory), we'll be comparing npages_deadlp to the wrong thing -- because npages_deadlp cannot be treated as a proportion of the blocks in the *whole table*.

Maybe the resetting of npages_deadlp would be okay if you also used the number of heap blocks that were considered since the last npages_deadlp reset, and then multiplied that by SKIP_VACUUM_PAGES_RATIO (instead of RelationGetNumberOfBlocks(onerel)). But I suspect that the real solution is to not reset npages_deadlp at all (without changing the quoted code, which seems basically okay).

With tables/workloads that the patch helps a lot, we expect that the SKIP_VACUUM_PAGES_RATIO threshold will *eventually* be crossed by one of these VACUUM operations, which *finally* triggers index vacuuming. So not only do we expect npages_deadlp to be tracked at the level of the entire VACUUM operation -- we might even imagine it growing slowly over multiple VACUUM operations, perhaps over many months. At least conceptually, it should only grow across VACUUM operations, until index vacuuming finally takes place. That's my mental model for npages_deadlp, at least. It tracks an easy-to-understand cost, which, as I said, is what the threshold/algorithm/lazy_vacuum_table_and_indexes() should focus on.

Why does lazy_vacuum_table_and_indexes() not make one decision for the entire VACUUM on the first call, and then stick to its decision? That seems much cleaner.
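Roughly what I have in mind is something like this (a sketch only -- VacSkipState and its fields are made up here for illustration, not the patch's actual data structures):

/*
 * Sketch only: npages_deadlp is never reset during a VACUUM, and the
 * skip-or-not decision is made once, on the first
 * lazy_vacuum_table_and_indexes() call, then reused by every later
 * call in the same VACUUM operation.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct VacSkipState
{
    bool    decided;            /* has the decision been made yet? */
    bool    do_index_vacuuming; /* the decision, once made */
    int64_t npages_deadlp;      /* only ever grows during this VACUUM */
} VacSkipState;

static bool
decide_index_vacuuming_once(VacSkipState *vs, int64_t rel_nblocks,
                            double skip_pages_ratio)
{
    if (!vs->decided)
    {
        /*
         * Compare against the blocks in the *whole table* -- valid
         * precisely because npages_deadlp is never reset.
         */
        vs->do_index_vacuuming =
            (vs->npages_deadlp > rel_nblocks * skip_pages_ratio);
        vs->decided = true;
    }

    /* Later calls (e.g. after running low on memory) stick to it */
    return vs->do_index_vacuuming;
}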
Another advantage of that approach is that it might be enough to handle low maintenance_work_mem risks -- perhaps those can be covered by simply waiting until the first VACUUM operation that runs out of memory, and so requires multiple lazy_vacuum_table_and_indexes() calls. If at that point we decide to do index vacuuming throughout the entire VACUUM operation, then we will not allow the table to accumulate many more TIDs than we can expect to fit in an entire maintenance_work_mem space. Under this scheme, the "maintenance_work_mem threshold" can be thought of as an implicit thing (it's not a constant/multiplier or anything) -- it is >= 100% of maintenance_work_mem, in effect.
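A sketch of that implicit version (again illustrative only -- the ran_out_of_dead_tid_space flag stands in for however we'd detect that a second lazy_vacuum_table_and_indexes() call was needed):

/*
 * Sketch only: no explicit maintenance_work_mem ratio at all.  Needing
 * more than one lazy_vacuum_table_and_indexes() call (because the
 * dead-TID space filled up before the end of the heap scan) is itself
 * the signal to do index vacuuming for the rest of this VACUUM.
 */
#include <stdbool.h>
#include <stdint.h>

static bool
should_do_index_vacuuming_implicit(bool ran_out_of_dead_tid_space,
                                   int64_t npages_deadlp,
                                   int64_t rel_nblocks,
                                   double skip_pages_ratio)
{
    if (ran_out_of_dead_tid_space)
        return true;    /* the implicit ">= 100% of maintenance_work_mem" limit */

    return npages_deadlp > rel_nblocks * skip_pages_ratio;
}

--
Peter Geoghegan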