On Tue, Nov 12, 2019 at 3:21 PM Peter Geoghegan <p...@bowt.ie> wrote:
> * Decided to go back to turning deduplication on by default with
> non-unique indexes, and off by default using unique indexes.
>
> The unique index stuff was regressed enough with INSERT-heavy
> workloads that I was put off, despite my initial enthusiasm for
> enabling deduplication everywhere.
I have changed my mind about this again. I now think that it would
make sense to treat deduplication within unique indexes as a separate
feature that cannot be disabled by the GUC at all (though we'd
probably still respect the storage parameter for debugging purposes).
I have found that fixing the WAL record size issue has helped remove
what looked like a performance penalty for deduplication (but was
actually just a general regression).

Also, I have found a way of selectively applying deduplication within
unique indexes that seems to have no downside, and considerable
upside. The new criterion/heuristic for unique indexes is very simple:
if a unique index has an existing item that is a duplicate of the
incoming item at the point that we might have to split the page, then
apply deduplication. Otherwise (when the incoming item has no
duplicates), don't apply deduplication at all -- just accept that
we'll have to split the page. We already cache the bounds of our
initial binary search in the insert state, so we can reuse that
information within _bt_findinsertloc() when considering deduplication
in unique indexes (see the sketch below).

This heuristic makes sense because deduplication within unique
indexes should only target leaf pages that cannot possibly receive
new values. In many cases, the only reason why almost all primary key
leaf pages can ever split is because of non-HOT updates whose new HOT
chain needs a new, equal entry in the primary key. This is the case
with your standard identity column/serial primary key, for example
(only the rightmost page will have a page split due to the insertion
of new logical rows -- every other variety of page split must be due
to new physical tuples/versions).

I imagine that it is possible for a leaf page to be a "mixture" of
these two basic/general tendencies, but not for long. It really
doesn't matter if we occasionally fail to delay a page split where
that was possible, nor does it matter if we occasionally apply
deduplication when that won't delay a split for very long -- pretty
soon the page will split anyway. A split ought to separate the parts
of the keyspace that exhibit each tendency.

In general, we're only interested in delaying page splits in unique
indexes *indefinitely*, since in effect that will prevent them
*entirely*. (So the goal is *significantly* different to our general
goal for deduplication -- it's about buying time for VACUUM to run or
whatever, rather than buying space.)
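To make the shape of the check concrete, here is a minimal standalone
C sketch of the decision (illustrative only -- the low/stricthigh
bounds mirror what the insert state already caches, but the struct
and function names here are made up for the example, and are not the
patch's actual interface):

#include <stdbool.h>
#include <stdint.h>

/*
 * Placeholder for the cached binary search bounds that the insert
 * state already carries by the time _bt_findinsertloc() runs.
 */
typedef struct BTInsertStateSketch
{
    bool     bounds_valid;  /* are the cached bounds usable? */
    uint16_t low;           /* first offset the new item could go at */
    uint16_t stricthigh;    /* first offset strictly > the new item */
} BTInsertStateSketch;

/*
 * Deduplicate before splitting only when at least one existing tuple
 * on the leaf page is equal to the incoming tuple.  With cached
 * bounds, that is exactly the case where low < stricthigh: the
 * incoming item's "equal range" on the page is non-empty.  When the
 * range is empty, just accept the page split.
 */
bool
should_dedup_unique(const BTInsertStateSketch *insertstate)
{
    return insertstate->bounds_valid &&
           insertstate->low < insertstate->stricthigh;
}

The nice thing about reusing the cached bounds is that the check is
practically free -- no comparisons are needed beyond the binary
search that already happened.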
This heuristic keeps the TPC-C "old order" table's PK from bloating
quite noticeably, since that was the only unique index that is really
affected by non-HOT UPDATEs (i.e. the UPDATE queries that touch that
table happen to not be HOT-safe in general, which is not the case for
any other table). It doesn't regress anything else from TPC-C, since
there really isn't a benefit for other tables.

More importantly, the working/draft version of the patch will often
avoid a huge amount of bloat in a pgbench-style workload that has an
extra index on the pgbench_accounts table, to prevent HOT updates.
The accounts primary key (pgbench_accounts_pkey) hardly grows at all
with the patch, but grows 2x on master. This 2x space saving seems to
occur reliably, unless there is a lot of contention on individual
*pages*, in which case the bloat can be delayed but not prevented. We
get that 2x space saving with either uniformly distributed random
updates on pgbench_accounts (i.e. the pgbench default), or with a
skewed distribution that hashes the PRNG's value. Hashing like this
simulates a workload where the skew isn't concentrated in one part of
the keyspace (i.e. there is skew, but very popular values are
scattered throughout the index evenly, rather than being concentrated
together in just a few leaf pages).

Can anyone think of an adversarial case that we may not do so well on
with the new "only deduplicate within unique indexes when the new
item already has a duplicate" strategy? I'm having difficulty
identifying some kind of worst case.

--
Peter Geoghegan