On Mon, Sep 23, 2019 at 5:13 PM Peter Geoghegan <p...@bowt.ie> wrote:
> I attach version 17.
I attach a patch that applies on top of v17. It adds support for deduplication within unique indexes. Actually, this is a snippet of code that appeared in my prototype from August 5 (we need very little extra code for this now). Unique index support looked like a bad idea at the time, but things have changed a lot since then.

I benchmarked this overnight using a custom pgbench-based test that used a Zipfian distribution, with a single-row SELECT and an UPDATE of pgbench_accounts per xact. I used regular/logged tables this time around, since WAL-logging is now fairly efficient. I added a separate low cardinality index on pgbench_accounts(abalance). A low cardinality index is the most interesting case for this patch, obviously, but it also serves to prevent all HOT updates, increasing bloat for both indexes. We want a realistic case that creates a lot of index bloat.

This wasn't a rigorous enough benchmark to present here in full, but the results were very encouraging. With reasonable client counts for the underlying hardware, we seem to get a small increase in TPS and a small decrease in latency. There is a regression with 128 clients, when contention is ridiculously high (this is my home server, which only has 4 cores).

More importantly:

* The low cardinality index is almost 3x smaller with the patch -- no surprises there.

* Read latency is where latency goes up, if it goes up at all. Whatever it is that might increase latency, it doesn't look like it's deduplication itself. Yeah, deduplication is expensive, but it's not nearly as expensive as a page split. (I'm talking about the immediate cost, not the bigger picture, though the bigger picture matters even more.)

* The growth in primary key size over time is the thing I find really interesting. The patch seems to really control the number of page splits over many hours with lots of non-HOT updates. I think that a timeline of days or weeks could be really interesting.
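To make the "almost 3x smaller" observation concrete, here is a toy Python model of what posting-list deduplication does to a low cardinality index. This is not nbtree code, and the per-tuple sizes are made-up illustrative constants (real sizes depend on alignment, key width, and posting list compression):

```python
from collections import Counter

# Assumed, illustrative sizes only: 16 bytes for an ordinary index
# tuple (key + one heap TID), 6 bytes per extra TID appended to a
# posting list tuple for the same key.
TUPLE_BYTES = 16
EXTRA_TID_BYTES = 6

def index_bytes(keys, dedup):
    """Approximate total index tuple space for a list of key values."""
    if not dedup:
        return len(keys) * TUPLE_BYTES
    total = 0
    for key, n in Counter(keys).items():
        # One posting list tuple per distinct key: one full tuple,
        # plus a small TID entry for each additional duplicate.
        total += TUPLE_BYTES + (n - 1) * EXTRA_TID_BYTES
    return total

# 1,000,000 rows, only 100 distinct abalance values (low cardinality):
keys = [i % 100 for i in range(1_000_000)]
before = index_bytes(keys, dedup=False)
after = index_bytes(keys, dedup=True)
print(f"{before / after:.1f}x smaller")  # prints: 2.7x smaller
```

Under these assumed constants the model lands close to the ~3x observed in the benchmark; the point is just that with few distinct keys, almost every tuple collapses into the cheap per-TID representation.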
I am now about 75% convinced that adding deduplication to unique indexes is a good idea, at least as an option that is disabled by default. We're already doing well here, even though there has been no work on optimizing deduplication in unique indexes.

Further optimizations seem quite possible, though. I'm mostly thinking about optimizing non-HOT updates by teaching nbtree some basic things about versioning with unique indexes. For example, we could remember "recently dead" duplicates of the value we are about to insert (as part of an UPDATE statement) from within _bt_check_unique(). Then, when it looks like a page split may be necessary, we can hint to _bt_dedup_one_page(): "please just deduplicate the group of duplicates starting from this offset, which are duplicates of this new item I am inserting -- do not create a posting list that I will have to split, though". We already cache the binary search bounds established within _bt_check_unique() in insertstate, so perhaps we could reuse that to make this work.

The goal here is that the old/recently dead versions end up together in their own posting list (or maybe two posting lists), whereas our new/most current tuple is on its own. There is a very good chance that our transaction will commit, leaving somebody else to set the LP_DEAD bit on the posting list that contains those old versions. In short, we'd be making deduplication and opportunistic garbage collection cooperate closely.

This can work both ways -- maybe we should also teach _bt_vacuum_one_page() to cooperate with _bt_dedup_one_page(). That is, if we add the mechanism I just described in the last paragraph, maybe _bt_dedup_one_page() marks the posting list that is likely to get its LP_DEAD bit set soon with a new hint bit -- the LP_REDIRECT bit. Here, LP_REDIRECT means "somebody is probably going to set the LP_DEAD bit on this posting list tuple very soon".
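The versioning-aware dedup hint described above can be sketched as a toy Python model. All names here are invented for illustration -- this is not actual nbtree code, and the real mechanism would operate on page items via the insertstate's cached binary search bounds:

```python
from dataclasses import dataclass

@dataclass
class IndexTuple:
    key: int
    tids: list              # heap TIDs; more than one means a posting list
    recently_dead: bool = False

def dedup_dead_versions(tuples, key):
    """Hypothetical targeted dedup pass: merge only the recently dead
    duplicates of `key` into one posting list, so the newly inserted
    live tuple stays a plain tuple that never needs a posting list
    split, while the old versions sit together waiting for LP_DEAD."""
    dead = [t for t in tuples if t.key == key and t.recently_dead]
    rest = [t for t in tuples if not (t.key == key and t.recently_dead)]
    if len(dead) < 2:
        return tuples            # nothing worth merging; leave page alone
    merged = IndexTuple(key,
                        sorted(tid for t in dead for tid in t.tids),
                        recently_dead=True)
    rest.append(merged)          # one posting list likely to get LP_DEAD set
    return rest

page = [
    IndexTuple(7, [(1, 1)], recently_dead=True),   # old version
    IndexTuple(7, [(1, 2)], recently_dead=True),   # old version
    IndexTuple(7, [(2, 5)]),                       # current version, left alone
]
page = dedup_dead_versions(page, key=7)
```

The design point the model captures: after the targeted pass, a single LP_DEAD bit on the merged posting list suffices to reclaim every old version at once.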
That way, if nobody actually does set the LP_DEAD bit, _bt_vacuum_one_page() still has options. If it goes to the heap and finds the latest version, and that version is visible to any possible MVCC snapshot, then it's safe to kill all the other versions, even without the LP_DEAD bit set -- this is a unique index. So it often gets to kill lots of extra garbage that it otherwise wouldn't get to kill, preventing page splits. The cost is pretty low: the risk that the single heap page check will be a wasted effort. (Of course, we still have to visit the heap pages of anything that we go on to kill, to get the XIDs needed to generate recovery conflicts -- the important point is that we only need to visit one heap page in _bt_vacuum_one_page() to *decide* whether it's possible to do all this -- cases that don't benefit at all also don't pay very much.)

I don't think that we need to do either of these two other things to justify committing the patch with unique index support. But teaching nbtree a little bit about versioning like this could work rather well in practice, without it really mattering that it will get the wrong idea at times (e.g. when transactions abort a lot).

This all seems promising as a family of techniques for unique indexes. It's worth doing extra work if it might delay a page split, since delaying can actually fully prevent page splits that are mostly caused by non-HOT updates. Most primary keys are serial PKs, or some kind of counter. Postgres should mostly split pages in these kinds of primary key indexes in the places that make sense based on the dataset, and not because of "write amplification".

-- 
Peter Geoghegan
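The single-heap-page check can also be sketched as a toy model (again, invented names, not real PostgreSQL code). The unique-index invariant it relies on: if the latest version of a key is visible to every possible snapshot, then all other versions of that key must be dead:

```python
def killable_versions(versions, latest_visible_to_all):
    """Hypothetical _bt_vacuum_one_page() fallback for a unique index.

    versions: list of (xmin, is_latest) entries for one unique key.
    One heap page visit answers `latest_visible_to_all`; if True, every
    non-latest version is dead and may be killed even without its
    LP_DEAD bit set. If False, the single check was wasted effort and
    we kill nothing extra."""
    if not latest_visible_to_all:
        return []
    return [v for v in versions if not v[1]]

# Three versions of one key; only the newest (xmin 110) is current.
versions = [(100, False), (105, False), (110, True)]
dead = killable_versions(versions, latest_visible_to_all=True)   # kills 100, 105
none = killable_versions(versions, latest_visible_to_all=False)  # wasted check
```

This shows the asymmetry argued for above: the downside is bounded (one wasted heap page visit), while the upside is reclaiming every stale version of the key in one pass.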
v17-0005-Reintroduce-unique-index-support.patch
Description: Binary data