On Tue, Jun 30, 2020 at 5:03 PM Peter Geoghegan <p...@bowt.ie> wrote:
> Attached is a POC patch that teaches nbtree to delete old duplicate
> versions from unique indexes. The optimization targets non-HOT
> duplicate version bloat. Although the patch is rather rough, it
> nevertheless manages to more or less eliminate a whole class of index
> bloat: Unique index bloat from non-HOT updates in workloads where no
> transaction lasts for more than a few seconds.
I'm slightly surprised that this thread didn't generate more interest back in June. After all, maintaining the pristine initial state of (say) a primary key index even after many high-throughput non-HOT updates (i.e. avoiding "logically unnecessary" page splits entirely) is quite appealing. It arguably goes some way towards addressing long-held criticisms of our approach to MVCC. Especially if it can be generalized to all B-Tree indexes -- the Uber blog post mentioned tables that have several indexes, which presumably means that there can be no HOT updates (the author of the blog post didn't seem to be aware of HOT at all).

I've been trying to generalize my approach to work with all indexes. I think that I can find a strategy that is largely effective at preventing the version-churn page splits that occur in workloads with many non-HOT updates, without any serious downsides for workloads that do not benefit. I want to get feedback on that now, since I expect that it will be controversial. Teaching indexes about how tuples are versioned or chaining tuples seems like a non-starter, so the trick will be to come up with something that behaves in approximately the same way as such an approach would, in the cases where it helps.

The approach I have in mind is to pass down a hint from the executor to btinsert(), which lets nbtree know that the index tuple insert is in fact associated with a non-HOT update. This hint is only given when the update did not logically modify the columns that the index covers (so presumably the majority of indexes -- those left unchanged -- get the hint, but not the one or two indexes whose columns were modified by our UPDATE statement -- or maybe the non-HOT update was caused by not being able to fit a new version on the same heap page, in which case all the btinsert() calls/all the indexes on the table get the hint).

Of course, this hint makes it all but certain that the index tuple is the successor for some other index tuple located on the same leaf page. We don't actually include information about which other existing tuple it is, since it pretty much doesn't matter. Even if we did, we definitely cannot opportunistically delete it, because it needs to stay in the index at least until our transaction commits (this should be obvious). Actually, we already try to delete it from within _bt_check_unique() today for unique indexes -- we just never opportunistically mark it dead at that point (as I said, it's needed until the xact commits at the earliest).

Here is the maybe-controversial part: The algorithm initially assumes that all indexes more or less have the same properties as unique indexes from a versioning point of view, even though that's clearly not true. That is, it initially assumes that the only reason why there can be duplicates on any leaf page it looks at is that some previous transaction also did a non-HOT update that added a new, unchanged index tuple. The new algorithm only runs when the hint is passed down from the executor and when the only alternative is to split the page (or have a deduplication pass), so clearly there is some justification for this assumption -- it really is highly unlikely that this update (which is on the verge of splitting the page) just so happened to be the first such update that affected the page. It's extremely likely that there will be at least a couple of other tuples like that on the page, and probably quite a few more. And even if we only manage to free one or two tuples, that'll still generally be enough to fit the incoming tuple.
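To make the shape of the hint a bit more concrete, here is a toy standalone C sketch. This is not patch code -- IndexStub, index_unchanged_by_update() and everything else in it are invented purely for illustration -- but it shows, under those assumptions, how the executor side might decide per index whether to pass the hint down to btinsert():

/*
 * Toy sketch of the "index logically unchanged by UPDATE" hint.
 * Not PostgreSQL code; all types and names are invented for
 * illustration.  Compile with: cc -std=c99 -o hint hint_sketch.c
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_INDEX_COLS 4

typedef struct IndexStub
{
    const char *name;
    int         natts;                   /* number of key columns */
    int         keycols[MAX_INDEX_COLS]; /* table attribute numbers */
} IndexStub;

/*
 * Would the UPDATE leave this index's key columns unchanged?  If so,
 * the executor passes the hint down with the index tuple insert: the
 * new tuple is just a new version of some existing tuple.
 */
static bool
index_unchanged_by_update(const IndexStub *idx,
                          const bool *changed_attrs, /* per table column */
                          bool heap_page_was_full)
{
    /*
     * If the update only became non-HOT because the new heap tuple
     * version couldn't fit on the same heap page, then no indexed
     * column changed at all -- every index on the table gets the hint.
     */
    if (heap_page_was_full)
        return true;

    for (int i = 0; i < idx->natts; i++)
    {
        if (changed_attrs[idx->keycols[i]])
            return false;       /* this index's key columns were modified */
    }

    return true;                /* unchanged -- hint btinsert() */
}

int
main(void)
{
    /* table with 3 columns: 0=id, 1=balance, 2=filler */
    bool        changed[3] = {false, true, false};  /* UPDATE ... SET balance = ... */
    IndexStub   pkey = {"accounts_pkey", 1, {0}};
    IndexStub   bal_idx = {"accounts_balance_idx", 1, {1}};

    printf("%s gets hint: %d\n", pkey.name,
           index_unchanged_by_update(&pkey, changed, false));     /* 1 */
    printf("%s gets hint: %d\n", bal_idx.name,
           index_unchanged_by_update(&bal_idx, changed, false));  /* 0 */
    return 0;
}

The real thing would obviously work in terms of the existing executor structures; the point is just that the hint can be derived from the same "which columns changed" information that already gets computed when deciding whether a HOT update is possible.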
In general that is usually quite valuable. Even with a busy database, that might buy us minutes or hours before the question of splitting the same leaf page arises again. By the time that happens, longer running transactions may have committed, VACUUM may have run, etc. Like unique index deduplication, this isn't about freeing space -- it's about buying time.

To be blunt: It may be controversial that we're accessing multiple heap pages while holding an exclusive lock on a leaf page, in the hopes that we can avoid a page split, but without any certainty that it'll work out. Sometimes (maybe even most of the time), this assumption turns out to be mostly correct, and we benefit in the obvious way -- no "unnecessary" page splits for affected non-unique indexes. Other times it won't work out, usually because the duplicates are in fact logical duplicates from distinct logical rows.

When the new deletion thing doesn't work out, the situation works itself out in the obvious way: we get a deduplication pass. If that doesn't work out, we get a page split. So we have three legitimate strategies for resolving the "pressure" against a leaf page: last minute emergency duplicate checks + deletion (the new thing), a deduplication pass, or a page split. The strategies are in competition with each other (though only when we have non-HOT updates).

We're only willing to access a fixed number of heap pages (heap pages pointed to by duplicate index tuples) to try to delete some index tuples and avoid a split, and only in the specific context of the hint being received at the point a leaf page fills and it looks like we might have to split it. I think that we should probably avoid doing anything with posting list tuples left behind by deduplication, except maybe if there are only 2 or 3 TIDs -- just skip over them. That way, cases with duplicates across logical rows (not version churn) tend to get made into a posting list early (e.g. during an index build), so we don't even bother trying to delete from them later. Non-posting-list duplicates suggest recency, which suggests version churn -- those dups must at least have come after the most recent deduplication pass.

Plus we have heuristics that maximize the chances of finding versions to kill. And we tend to look at the same blocks again and again -- like in the patch I posted, we look at the value with the most dups for things to kill first, and so on. So repeated version deletion passes won't end up accessing totally different heap blocks each time, unless they're successful at killing old versions. (I think that this new feature should be framed as extending the deduplication infrastructure to do deletes -- so it can only be used on indexes that use deduplication.)

Even if this new mechanism ends up slowing down non-HOT updates noticeably -- which is *not* something that I can see with my benchmarks now -- that could still be okay. I think that there is something sensible about imposing a penalty on non-HOT update queries that can cause problems for everybody today. Why shouldn't they have to clean up their own mess?

I think that it buys us a lot to condition cleanup on avoiding page splits, because any possible penalty is only paid in cases where there isn't something else that keeps the bloat under control. If something like the kill_prior_tuple mechanism mostly keeps bloat under control already, then we'll resolve the situation that way instead.
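To restate that ordering in code form, here is another toy C sketch. Again, every name in it is made up, and MAX_HEAP_PAGE_ACCESSES is only a placeholder for whatever fixed cap on heap page visits we settle on -- it just shows the control flow I have in mind when an incoming tuple arrives at a leaf page with no free space:

/*
 * Toy sketch of the three-way "pressure release" order described
 * above.  Not the actual nbtree code; names are invented.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_HEAP_PAGE_ACCESSES 10   /* placeholder cap, not a real number */

typedef enum LeafPageAction
{
    ACTION_NONE,            /* tuple already fits, nothing to do */
    ACTION_DELETE_VERSIONS, /* new thing: killed enough old versions */
    ACTION_DEDUPLICATE,     /* merge duplicates into posting lists */
    ACTION_SPLIT            /* last resort */
} LeafPageAction;

/* Stand-ins for the real checks, which need the leaf page and heap. */
static bool page_has_room(void) { return false; }
static bool delete_old_versions(int max_heap_pages) { (void) max_heap_pages; return true; }
static bool deduplicate_page(void) { return false; }

/*
 * Called when an incoming index tuple arrives at a full leaf page.
 * The version-deletion pass only runs when the executor hint said
 * this insert stems from a non-HOT update that left the key unchanged.
 */
static LeafPageAction
relieve_leaf_page_pressure(bool executor_hint)
{
    if (page_has_room())
        return ACTION_NONE;

    /* 1. Last minute duplicate checks + deletion (bounded heap accesses) */
    if (executor_hint && delete_old_versions(MAX_HEAP_PAGE_ACCESSES))
        return ACTION_DELETE_VERSIONS;

    /* 2. Fall back to a deduplication pass */
    if (deduplicate_page())
        return ACTION_DEDUPLICATE;

    /* 3. Only split the page when nothing else frees enough space */
    return ACTION_SPLIT;
}

int
main(void)
{
    printf("action = %d\n", relieve_leaf_page_pressure(true));
    return 0;
}

The important property is that the expensive step (visiting heap pages) is gated on both the executor hint and the imminent split, so workloads that never get into trouble never pay for it.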
An important point about this technique is that it's just a backstop, so it can run very infrequently while still preventing an index from growing -- an index that can double in size today. If existing mechanisms against "logically unnecessary" page splits are 99% effective today, they may still be almost useless to users -- your index still doubles in size. It just takes a little longer in Postgres 13 (with deduplication) compared to Postgres 12. So there is a really huge asymmetry that we still aren't doing enough about, even with deduplication.

Deduplication cannot prevent the first "wave" of page splits with primary key style indexes due to the dimensions of the index tuples on the page. The first wave may be all that matters (deduplication does help more when things get really bad, but I'd like to avoid "merely bad" performance characteristics, too).

Consider the simplest possible real-world example. If we start out with 366 items on a leaf page initially (the actual number with default fillfactor + 64-bit alignment for the pgbench indexes), we can store another 40 non-HOT dups on the same leaf page before the page splits. We only save 4 bytes by merging a dup into one of the 366 original tuples. It's unlikely that many of the 40 dups that go on the page will be duplicates of *each other*, and deduplication only really starts to save space when posting list tuples have 4 or 5 TIDs each. So eventually all of the original leaf pages split when they really shouldn't, despite the availability of deduplication.

--
Peter Geoghegan