On Tue, Dec 17, 2019 at 1:58 PM Bruce Momjian <br...@momjian.us> wrote:
> > Attached is v26, which adds this new criteria/heuristic for unique
> > indexes.  We now seem to consistently get good results with unique
> > indexes.
>
> In the past we tried to increase the number of cases where HOT updates
> can happen but were unable to.
Right -- the WARM project. The Z-heap project won't change the fundamentals here, either: it isn't going to solve the problem of requiring that the index AM create a new set of physical index tuples in at least *some* cases. A heap tuple cannot be updated in-place when even one indexed column changes, and when that happens you're not much better off than you were with the classic heapam, because indexes get bloated in a way that wouldn't happen with Oracle. (Even so, Z-heap isn't sensitive to when and how opportunistic heap pruning takes place, and doesn't have the same issue with having to fit the heap tuple on the same page or create a new HOT chain. That will make things much better with some workloads.)

> Would this help with non-HOT updates?

Definitely, yes. The strategy used with unique indexes is specifically designed to avoid "unnecessary" page splits altogether -- it only makes sense because of the possibility of non-HOT UPDATEs with mostly-unchanged index tuples.

Thinking about what's going on here from first principles is what drove the unique index deduplication design: with many real-world unique indexes, the true cause of most or all B-Tree page splits is "version churn". I view these page splits as a permanent solution to a temporary problem -- we *permanently* degrade the index structure in order to absorb a *temporary* burst of versions that need to be stored. That's really bad.

Consider a classic pgbench workload, for example. The smaller indexes on the smaller tables (pgbench_tellers_pkey and pgbench_branches_pkey) have leaf pages that will almost certainly be split a few minutes in, even though the UPDATEs on the underlying tables never modify indexed columns (i.e. even though HOT is as effective as it possibly could be with this unthrottled workload). In fact, the resulting split pages will themselves usually be split again, and maybe even once more after that.
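To make the "version churn" point concrete, here is a toy Python model of a unique-index leaf page (entirely my own illustration -- the constants, the 50/50 split rule, and the cleanup step are simplifications, not the patch's actual algorithm). Every UPDATE adds a new index tuple for an existing key; without any removal of superseded versions the page overflows and splits even though the set of distinct keys never changes, while cleaning up old versions avoids splits entirely:

```python
# Toy model of "version churn" page splits in a unique index.
# PAGE_CAPACITY and the split/cleanup logic are illustrative only.

PAGE_CAPACITY = 370              # index tuples that fit on one leaf page

def run_updates(n_keys, n_updates, cleanup_old_versions):
    """Return how many leaf page splits a stream of UPDATEs causes."""
    pages = [[(key, 0) for key in range(n_keys)]]     # one leaf page
    splits = 0
    for i in range(n_updates):
        key = i % n_keys                  # UPDATE an existing row only
        page = next(p for p in pages if any(k == key for k, _ in p))
        if cleanup_old_versions:
            # Dedup-style cleanup: drop superseded versions of this key
            page[:] = [(k, v) for k, v in page if k != key]
        page.append((key, i + 1))         # new tuple for the new version
        if len(page) > PAGE_CAPACITY:     # page overflows: split it 50/50
            page.sort()
            half = len(page) // 2
            pages.append(page[half:])
            del page[half:]
            splits += 1
    return splits

# 100 distinct keys, 5,000 UPDATEs, no indexed column ever changes:
print(run_updates(100, 5_000, cleanup_old_versions=False))  # many splits
print(run_updates(100, 5_000, cleanup_old_versions=True))   # 0 splits
```

The "logical contents" (100 distinct keys) are identical in both runs; only the handling of old versions differs.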
We start out with leaf pages that each store just under 370 items (with fillfactor 90 + 8KiB BLCKSZ), and end up with leaf pages that often have fewer than 50 items (sometimes as few as 10), even though the "logical contents" of the index are *totally* unchanged. Users could almost consider this pathological.

Of course, it's easy to imagine cases where it matters a lot more than classic pgbench (pgbench_tellers_pkey and pgbench_branches_pkey are always small, so the effect is easy to see, which is why I went with that example). For example, a long running transaction would probably significantly bloat even the large pgbench index (pgbench_accounts_pkey) -- typically you won't see that with classic pgbench until you do something to frustrate VACUUM (and opportunistic cleanup). (I have mostly been using non-HOT UPDATEs to test the patch, though.)

In theory we could go even further by having some kind of version store for indexes, and using it to stash old versions rather than performing a page split. Then you wouldn't have any page splits at all in the pgbench indexes; VACUUM would eventually be able to return the index to its "pristine" state. The trade-off with that design would be that index scans would have to access two index pages for a while (a leaf page, plus its subsidiary old-version page). Maybe we can actually go that far in the future -- various database research papers describe designs like this (they do things like determine whether a "version split" or a "value split" should be performed).

What we have now is an incremental improvement with no apparent downside for unique indexes -- the way that deduplication is triggered for unique indexes makes it almost certain to be a win. When deduplication isn't triggered, everything works the same way as before; it's "zero overhead" for unique indexes that don't benefit.
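The "just under 370 items" figure can be reproduced with back-of-the-envelope arithmetic from the standard 8KiB page layout. The constants below approximate PostgreSQL's on-disk layout for a single int4-column index (page header 24 bytes, B-Tree special area 16 bytes, 4-byte line pointer, and an index tuple MAXALIGNed to 16 bytes); treat them as a rough sketch rather than authoritative sizes:

```python
# Rough leaf-tuple density for a single-int-column B-Tree index,
# 8KiB BLCKSZ, fillfactor 90.  Constants approximate the on-disk layout.
BLCKSZ = 8192
PAGE_HEADER = 24        # fixed page header
BTREE_SPECIAL = 16      # B-Tree special space at the end of the page
LINE_POINTER = 4        # per-tuple item identifier
INDEX_TUPLE = 16        # 8-byte tuple header + 4-byte key, MAXALIGNed
FILLFACTOR = 0.90

usable = (BLCKSZ - PAGE_HEADER - BTREE_SPECIAL) * FILLFACTOR
per_page = int(usable // (LINE_POINTER + INDEX_TUPLE))
print(per_page)         # -> 366, i.e. "just under 370"
```

By the same arithmetic, a post-churn page with 50 items is using well under a fifth of its usable space.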
The design augments existing garbage collection mechanisms, particularly the way in which we set LP_DEAD bits within _bt_check_unique().

> Do we have any benchmarks where non-HOT updates cause slowdowns that we
> can test on this?

AFAICT, any workload with lots of non-HOT updates will benefit at least a little bit -- indexes finish up smaller, throughput is higher, and query latency is lower. With the right distribution of values, the patch can mostly control bloat in an index that would otherwise double in size, which is much more significant. I have already reported on this [1]. More recently, I've also observed 15%-20% increases in TPS with similar workloads (with commensurate reductions in query latency). This was with a simple gaussian distribution for pgbench_accounts.aid, and a non-unique index with deduplication enabled on pgbench_accounts.abalance. (The patch helps control the size of both indexes, especially the extra non-unique one.)

[1] https://postgr.es/m/CAH2-WzkXHhjhmUYfVvu6afbojU97MST8RUT1U=hld2w-gc5...@mail.gmail.com

--
Peter Geoghegan