On Thu, Mar 11, 2021 at 8:31 AM Robert Haas <robertmh...@gmail.com> wrote:
> I agree, but all you need is one long-lived tuple toward the end of
> the array and you're stuck never being able to truncate it. It seems
> like a worthwhile improvement, but whether it actually helps will be
> workload-dependant.
When it comes to improving VACUUM, I think that most of the really interesting scenarios are workload dependent in one way or another. In fact, even that concept becomes a little meaningless much of the time. For example, with workloads that really benefit from bottom-up deletion, the vast majority of individual leaf pages have quite a bit of spare capacity at any given time. Again, "rare" events can have outsized importance in the aggregate -- most of the time every leaf page taken individually is a-okay!

It's certainly not just indexing stuff. We have a tendency to imagine that HOT updates will reliably occur whenever an UPDATE doesn't logically modify any indexed column, except perhaps in the presence of some kind of stressor, like a long-running transaction. I guess that I do the same, informally. But let's not forget the reality: very few tables *consistently* get HOT updates, regardless of the shape of the indexes and UPDATE statements. So in the long run, practically all tables consist of pages that in many ways resemble those of a table that "only gets non-HOT updates" in the simplest sense.

I suspect that the general preference for using lower-offset LP_UNUSED items first (inside PageAddItemExtended() -- see the sketch below) will tend to make this problem of "one high tuple that isn't dead" not so bad in many cases. In any case Matthias' patch makes the situation strictly better, and we can only fix one problem at a time. We have to start by eliminating individual low-level behaviors that *don't make sense*.

Jan Wieck told me that he had to set heap fill factor to the ludicrously conservative setting of 50 just to get the TPC-C/BenchmarkSQL OORDER and ORDER_LINE tables to be stable over time [1] -- and on-disk size stability is absolutely expected there. And these are the biggest tables! It takes hours, if not days or even weeks, for the situation to really get out of hand with a normal fill factor setting. I am almost certain that this is due to second-order effects (even third-order effects) that start from things like line pointer bloat and FSM inefficiencies. I suspect that it doesn't matter much whether you make heap fill factor 70 or 90 with these tables, because the effect is non-linear -- for whatever reason, 50 was found to be the magic number, through trial and error.

"Incremental VACUUM" (the broad concept, not just this one patch) is likely to rely on our being able to make the performance characteristics more linear, at least in future iterations. Of course it's true that we should eliminate line pointer bloat and any kind of irreversible bloat, because the overall effect is non-linear, unstable behavior, which is highly undesirable on its face. But it's also true that these improvements leave us with more linear behavior at a high level, which is itself much easier to understand and model in a top-down fashion. It then becomes possible to build a cost model that makes VACUUM sensitive to the needs of the application, including the need to keep on-disk sizes *stable* under a variety of conditions. So in that sense I'd say that Matthias' patch is totally relevant.

I know that I sound hippy-dippy here. But the fact is that bottom-up index deletion has *already* made the performance characteristics much simpler, and therefore much easier to model. I hope to do more of that.

[1] https://github.com/wieck/benchmarksql/blob/29b62435dc5c9eaf178983b43818fcbba82d4286/run/sql.postgres/extraCommandsBeforeLoad.sql#L1

--
Peter Geoghegan
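
To make the lower-offset preference mentioned above a little more concrete, here is a minimal C sketch of the idea -- an illustration only, not PostgreSQL's actual PageAddItemExtended() code. The helper name choose_offset_for_new_tuple() is made up for the example; the page and line pointer macros are the standard ones from bufpage.h, itemid.h and off.h.

    /*
     * Minimal sketch (not PostgreSQL's actual code): when placing a new heap
     * tuple and no target offset is given, prefer recycling the
     * lowest-numbered unused line pointer over extending the line pointer
     * array.  Reusing low offsets first keeps the array short, which is what
     * later makes it possible to truncate it.
     */
    #include "postgres.h"

    #include "storage/bufpage.h"
    #include "storage/itemid.h"
    #include "storage/off.h"

    static OffsetNumber
    choose_offset_for_new_tuple(Page page)   /* hypothetical helper */
    {
        OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
        OffsetNumber off;

        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            ItemId      lp = PageGetItemId(page, off);

            if (!ItemIdIsUsed(lp))
                return off;     /* recycle lowest-offset LP_UNUSED slot */
        }

        return OffsetNumberNext(maxoff);    /* no unused slot -- grow the array */
    }

(The real PageAddItemExtended() also consults a page-level free-line-pointer hint before bothering to scan, and skips unused slots that still have storage, but the low-offset-first scan is the part that matters for this discussion.)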