On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan <p...@bowt.ie> wrote:
> Attached is a POC patch that teaches nbtree to delete old duplicate
> versions from unique indexes. The optimization targets non-HOT
> duplicate version bloat. Although the patch is rather rough, it
> nevertheless manages to more or less eliminate a whole class of index
> bloat: Unique index bloat from non-HOT updates in workloads where no
> transaction lasts for more than a few seconds.
I'm slightly surprised that this thread didn't generate more interest back in June. After all, maintaining the pristine initial state of (say) a primary key index even after many high-throughput non-HOT updates (i.e. avoiding "logically unnecessary" page splits entirely) is quite appealing. It arguably goes some way towards addressing long-held criticisms of our approach to MVCC. Especially if it can be generalized to all B-Tree indexes -- the Uber blog post mentioned tables that have several indexes, which presumably means that there can be no HOT updates (the author of the blog post didn't seem to be aware of HOT at all).

I've been trying to generalize my approach to work with all indexes. I think that I can find a strategy that is largely effective at preventing the version-churn page splits that occur in workloads with many non-HOT updates, without any serious downsides for workloads that do not benefit. I want to get feedback on that now, since I expect that it will be controversial. Teaching indexes about how tuples are versioned or chaining tuples seems like a non-starter, so the trick will be to come up with something that behaves in approximately the same way as such an approach would, in the cases where it helps.

The approach I have in mind is to pass down a hint from the executor to btinsert(), which lets nbtree know that the index tuple insert is in fact associated with a non-HOT update. This hint is only given when the update did not logically modify the columns that the index covers (so presumably the majority of indexes -- those left unchanged -- get the hint, but not the one or two indexes whose columns were modified by our UPDATE statement -- or maybe the non-HOT update was caused by not being able to fit a new version on the same heap page, in which case all the btinsert() calls/all the indexes on the table get the hint).

Of course, this hint makes it all but certain that the index tuple is the successor for some other index tuple located on the same leaf page. We don't actually include information about which other existing tuple it is, since it pretty much doesn't matter. Even if we did, we definitely cannot opportunistically delete it, because it needs to stay in the index at least until our transaction commits (this should be obvious). Actually, we already try to delete it from within _bt_check_unique() today for unique indexes -- we just never opportunistically mark it dead at that point (as I said, it's needed until the xact commits at the earliest).

Here is the maybe-controversial part: The algorithm initially assumes that all indexes more or less have the same properties as unique indexes from a versioning point of view, even though that's clearly not true. That is, it initially assumes that the only reason why there can be duplicates on any leaf page it looks at is that some previous transaction also did a non-HOT update that added a new, unchanged index tuple. The new algorithm only runs when the hint is passed down from the executor and when the only alternative is to split the page (or have a deduplication pass), so clearly there is some justification for this assumption -- it really is highly unlikely that this update (which is on the verge of splitting the page) just so happened to be the first such update that affected the page. It's extremely likely that there will be at least a couple of other tuples like that on the page, and probably quite a few more. And even if we only manage to free one or two tuples, that'll still generally be enough to fit the incoming tuple.
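To make the shape of the hint a bit more concrete, here is a toy standalone C sketch. This is not patch code -- IndexStub, index_unchanged_by_update() and everything else in it are invented purely for illustration -- but it shows, under those assumptions, how the executor side might decide per index whether to pass the hint down to btinsert():

/*
 * Toy sketch of the "index logically unchanged by UPDATE" hint.
 * Not PostgreSQL code; all types and names are invented for
 * illustration.  Compile with: cc -std=c99 -o hint hint_sketch.c
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_INDEX_COLS 4

typedef struct IndexStub
{
    const char *name;
    int         natts;                   /* number of key columns */
    int         keycols[MAX_INDEX_COLS]; /* table attribute numbers */
} IndexStub;

/*
 * Would the UPDATE leave this index's key columns unchanged?  If so,
 * the executor passes the hint down with the index tuple insert: the
 * new tuple is just a new version of some existing tuple.
 */
static bool
index_unchanged_by_update(const IndexStub *idx,
                          const bool *changed_attrs, /* per table column */
                          bool heap_page_was_full)
{
    /*
     * If the update only became non-HOT because the new heap tuple
     * version couldn't fit on the same heap page, then no indexed
     * column changed at all -- every index on the table gets the hint.
     */
    if (heap_page_was_full)
        return true;

    for (int i = 0; i < idx->natts; i++)
    {
        if (changed_attrs[idx->keycols[i]])
            return false;       /* this index's key columns were modified */
    }

    return true;                /* unchanged -- hint btinsert() */
}

int
main(void)
{
    /* table with 3 columns: 0=id, 1=balance, 2=filler */
    bool        changed[3] = {false, true, false};  /* UPDATE ... SET balance = ... */
    IndexStub   pkey = {"accounts_pkey", 1, {0}};
    IndexStub   bal_idx = {"accounts_balance_idx", 1, {1}};

    printf("%s gets hint: %d\n", pkey.name,
           index_unchanged_by_update(&pkey, changed, false));     /* 1 */
    printf("%s gets hint: %d\n", bal_idx.name,
           index_unchanged_by_update(&bal_idx, changed, false));  /* 0 */
    return 0;
}

The real thing would obviously work in terms of the existing executor structures; the point is just that the hint can be derived from the same "which columns changed" information that already gets computed when deciding whether a HOT update is possible.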
In general that is usually quite valuable. Even with a busy database, that might buy us minutes or hours before the question of splitting the same leaf page arises again. By the time that happens, longer running transactions may have committed, VACUUM may have run, etc. Like unique index deduplication, this isn't about freeing space -- it's about buying time.

To be blunt: It may be controversial that we're accessing multiple heap pages while holding an exclusive lock on a leaf page, in the hopes that we can avoid a page split, but without any certainty that it'll work out. Sometimes (maybe even most of the time), this assumption turns out to be mostly correct, and we benefit in the obvious way -- no "unnecessary" page splits for affected non-unique indexes. Other times it won't work out, usually because the duplicates are in fact logical duplicates from distinct logical rows.

When the new deletion thing doesn't work out, the situation works itself out in the obvious way: we get a deduplication pass. If that doesn't work out, we get a page split. So we have three legitimate strategies for resolving the "pressure" against a leaf page: last minute emergency duplicate checks + deletion (the new thing), a deduplication pass, or a page split. The strategies are in competition with each other (though only when we have non-HOT updates).

We're only willing to access a fixed number of heap pages (heap pages pointed to by duplicate index tuples) to try to delete some index tuples and avoid a split, and only in the specific context of the hint being received at the point a leaf page fills and it looks like we might have to split it. I think that we should probably avoid doing anything with posting list tuples left behind by deduplication, except maybe if there are only 2 or 3 TIDs -- just skip over them. That way, cases with duplicates across logical rows (not version churn) tend to get made into a posting list early (e.g. during an index build), so we don't even bother trying to delete from them later. Non-posting-list duplicates suggest recency, which suggests version churn -- those dups must at least have come after the most recent deduplication pass.

Plus we have heuristics that maximize the chances of finding versions to kill. And we tend to look at the same blocks again and again -- like in the patch I posted, we look at the value with the most dups for things to kill first, and so on. So repeated version deletion passes won't end up accessing totally different heap blocks each time, unless they're successful at killing old versions. (I think that this new feature should be framed as extending the deduplication infrastructure to do deletes -- so it can only be used on indexes that use deduplication.)

Even if this new mechanism ends up slowing down non-HOT updates noticeably -- which is *not* something that I can see with my benchmarks now -- that could still be okay. I think that there is something sensible about imposing a penalty on non-HOT update queries that can cause problems for everybody today. Why shouldn't they have to clean up their own mess?

I think that it buys us a lot to condition cleanup on avoiding page splits, because any possible penalty is only paid in cases where there isn't something else that keeps the bloat under control. If something like the kill_prior_tuple mechanism mostly keeps bloat under control already, then we'll resolve the situation that way instead.
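To restate that ordering in code form, here is another toy C sketch. Again, every name in it is made up, and MAX_HEAP_PAGE_ACCESSES is only a placeholder for whatever fixed cap on heap page visits we settle on -- it just shows the control flow I have in mind when an incoming tuple arrives at a leaf page with no free space:

/*
 * Toy sketch of the three-way "pressure release" order described
 * above.  Not the actual nbtree code; names are invented.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_HEAP_PAGE_ACCESSES 10   /* placeholder cap, not a real number */

typedef enum LeafPageAction
{
    ACTION_NONE,            /* tuple already fits, nothing to do */
    ACTION_DELETE_VERSIONS, /* new thing: killed enough old versions */
    ACTION_DEDUPLICATE,     /* merge duplicates into posting lists */
    ACTION_SPLIT            /* last resort */
} LeafPageAction;

/* Stand-ins for the real checks, which need the leaf page and heap. */
static bool page_has_room(void) { return false; }
static bool delete_old_versions(int max_heap_pages) { (void) max_heap_pages; return true; }
static bool deduplicate_page(void) { return false; }

/*
 * Called when an incoming index tuple arrives at a full leaf page.
 * The version-deletion pass only runs when the executor hint said
 * this insert stems from a non-HOT update that left the key unchanged.
 */
static LeafPageAction
relieve_leaf_page_pressure(bool executor_hint)
{
    if (page_has_room())
        return ACTION_NONE;

    /* 1. Last minute duplicate checks + deletion (bounded heap accesses) */
    if (executor_hint && delete_old_versions(MAX_HEAP_PAGE_ACCESSES))
        return ACTION_DELETE_VERSIONS;

    /* 2. Fall back to a deduplication pass */
    if (deduplicate_page())
        return ACTION_DEDUPLICATE;

    /* 3. Only split the page when nothing else frees enough space */
    return ACTION_SPLIT;
}

int
main(void)
{
    printf("action = %d\n", relieve_leaf_page_pressure(true));
    return 0;
}

The important property is that the expensive step (visiting heap pages) is gated on both the executor hint and the imminent split, so workloads that never get into trouble never pay for it.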
An important point about this technique is that it's just a backstop, so it can run very infrequently while still preventing an index from growing -- an index that can double in size today. If existing mechanisms against "logically unnecessary" page splits are 99% effective today, they may still be almost useless to users -- your index still doubles in size. It just takes a little longer in Postgres 13 (with deduplication) compared to Postgres 12. So there is a really huge asymmetry that we still aren't doing enough about, even with deduplication.

Deduplication cannot prevent the first "wave" of page splits with primary key style indexes due to the dimensions of the index tuples on the page. The first wave may be all that matters (deduplication does help more when things get really bad, but I'd like to avoid "merely bad" performance characteristics, too).

Consider the simplest possible real-world example. If we start out with 366 items on a leaf page initially (the actual number with default fillfactor + 64-bit alignment for the pgbench indexes), we can store another 40 non-HOT dups on the same leaf page before the page splits. We only save 4 bytes by merging a dup into one of the 366 original tuples. It's unlikely that many of the 40 dups that go on the page will be duplicates of *each other*, and deduplication only really starts to save space when posting list tuples have 4 or 5 TIDs each. So eventually all of the original leaf pages split when they really shouldn't, despite the availability of deduplication.

--
Peter Geoghegan