On Mon, Sep 23, 2019 at 5:13 PM Peter Geoghegan <p...@bowt.ie> wrote:
> I attach version 17.
I attach a patch that applies on top of v17. It adds support for deduplication within unique indexes. Actually, this is a snippet of code that appeared in my prototype from August 5 (we need very little extra code for this now). Unique index support looked like a bad idea at the time, but things have changed a lot since then.

I benchmarked this overnight using a custom pgbench-based test that used a Zipfian distribution, with a single-row SELECT and an UPDATE of pgbench_accounts per xact. I used regular/logged tables this time around, since WAL-logging is now fairly efficient. I added a separate low cardinality index on pgbench_accounts(abalance). A low cardinality index is the most interesting case for this patch, obviously, but it also serves to prevent all HOT updates, increasing bloat for both indexes. We want a realistic case that creates a lot of index bloat.

This wasn't a rigorous enough benchmark to present here in full, but the results were very encouraging. With reasonable client counts for the underlying hardware, we seem to get a small increase in TPS and a small decrease in latency. There is a regression with 128 clients, when contention is ridiculously high (this is my home server, which only has 4 cores).

More importantly:

* The low cardinality index is almost 3x smaller with the patch -- no surprises there.

* Read latency is where latency goes up, if it goes up at all. Whatever it is that might increase latency, it doesn't look like it's deduplication itself. Yeah, deduplication is expensive, but it's not nearly as expensive as a page split. (I'm talking about the immediate cost, not the bigger picture, though the bigger picture matters even more.)

* The growth in primary key size over time is the thing I find really interesting. The patch seems to really control the number of page splits over many hours with lots of non-HOT updates. I think that a timeline of days or weeks could be really interesting.
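To make the "almost 3x smaller" observation concrete, here is a toy Python model of what posting-list deduplication does to a low cardinality index. This is not nbtree code, and the per-tuple sizes are made-up illustrative constants (real sizes depend on alignment, key width, and posting list compression):

```python
from collections import Counter

# Assumed, illustrative sizes only: 16 bytes for an ordinary index
# tuple (key + one heap TID), 6 bytes per extra TID appended to a
# posting list tuple for the same key.
TUPLE_BYTES = 16
EXTRA_TID_BYTES = 6

def index_bytes(keys, dedup):
    """Approximate total index tuple space for a list of key values."""
    if not dedup:
        return len(keys) * TUPLE_BYTES
    total = 0
    for key, n in Counter(keys).items():
        # One posting list tuple per distinct key: one full tuple,
        # plus a small TID entry for each additional duplicate.
        total += TUPLE_BYTES + (n - 1) * EXTRA_TID_BYTES
    return total

# 1,000,000 rows, only 100 distinct abalance values (low cardinality):
keys = [i % 100 for i in range(1_000_000)]
before = index_bytes(keys, dedup=False)
after = index_bytes(keys, dedup=True)
print(f"{before / after:.1f}x smaller")  # prints: 2.7x smaller
```

Under these assumed constants the model lands close to the ~3x observed in the benchmark; the point is just that with few distinct keys, almost every tuple collapses into the cheap per-TID representation.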
I am now about 75% convinced that adding deduplication to unique indexes is a good idea, at least as an option that is disabled by default. We're already doing well here, even though there has been no work on optimizing deduplication in unique indexes.

Further optimizations seem quite possible, though. I'm mostly thinking about optimizing non-HOT updates by teaching nbtree some basic things about versioning with unique indexes. For example, we could remember "recently dead" duplicates of the value we are about to insert (as part of an UPDATE statement) from within _bt_check_unique(). Then, when it looks like a page split may be necessary, we can hint to _bt_dedup_one_page(): "please just deduplicate the group of duplicates starting from this offset, which are duplicates of this new item I am inserting -- do not create a posting list that I will have to split, though". We already cache the binary search bounds established within _bt_check_unique() in insertstate, so perhaps we could reuse that to make this work.

The goal here is that the old/recently dead versions end up together in their own posting list (or maybe two posting lists), whereas our new/most current tuple is on its own. There is a very good chance that our transaction will commit, leaving somebody else to set the LP_DEAD bit on the posting list that contains those old versions. In short, we'd be making deduplication and opportunistic garbage collection cooperate closely.

This can work both ways -- maybe we should also teach _bt_vacuum_one_page() to cooperate with _bt_dedup_one_page(). That is, if we add the mechanism I just described in the last paragraph, maybe _bt_dedup_one_page() marks the posting list that is likely to get its LP_DEAD bit set soon with a new hint bit -- the LP_REDIRECT bit. Here, LP_REDIRECT means "somebody is probably going to set the LP_DEAD bit on this posting list tuple very soon".
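The versioning-aware dedup hint described above can be sketched as a toy Python model. All names here are invented for illustration -- this is not actual nbtree code, and the real mechanism would operate on page items via the insertstate's cached binary search bounds:

```python
from dataclasses import dataclass

@dataclass
class IndexTuple:
    key: int
    tids: list              # heap TIDs; more than one means a posting list
    recently_dead: bool = False

def dedup_dead_versions(tuples, key):
    """Hypothetical targeted dedup pass: merge only the recently dead
    duplicates of `key` into one posting list, so the newly inserted
    live tuple stays a plain tuple that never needs a posting list
    split, while the old versions sit together waiting for LP_DEAD."""
    dead = [t for t in tuples if t.key == key and t.recently_dead]
    rest = [t for t in tuples if not (t.key == key and t.recently_dead)]
    if len(dead) < 2:
        return tuples            # nothing worth merging; leave page alone
    merged = IndexTuple(key,
                        sorted(tid for t in dead for tid in t.tids),
                        recently_dead=True)
    rest.append(merged)          # one posting list likely to get LP_DEAD set
    return rest

page = [
    IndexTuple(7, [(1, 1)], recently_dead=True),   # old version
    IndexTuple(7, [(1, 2)], recently_dead=True),   # old version
    IndexTuple(7, [(2, 5)]),                       # current version, left alone
]
page = dedup_dead_versions(page, key=7)
```

The design point the model captures: after the targeted pass, a single LP_DEAD bit on the merged posting list suffices to reclaim every old version at once.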
That way, if nobody actually does set the LP_DEAD bit, _bt_vacuum_one_page() still has options. If it goes to the heap and finds the latest version, and that version is visible to any possible MVCC snapshot, then it's safe to kill all the other versions, even without the LP_DEAD bit set -- this is a unique index. So it often gets to kill lots of extra garbage that it otherwise wouldn't get to kill, preventing page splits. The cost is pretty low: the risk that the single heap page check will be a wasted effort. (Of course, we still have to visit the heap pages of anything that we go on to kill, to get the XIDs needed to generate recovery conflicts -- the important point is that we only need to visit one heap page in _bt_vacuum_one_page() to *decide* whether it's possible to do all this -- cases that don't benefit at all also don't pay very much.)

I don't think that we need to do either of these two other things to justify committing the patch with unique index support. But teaching nbtree a little bit about versioning like this could work rather well in practice, without it really mattering that it will get the wrong idea at times (e.g. when transactions abort a lot).

This all seems promising as a family of techniques for unique indexes. It's worth doing extra work if it might delay a page split, since delaying can actually fully prevent page splits that are mostly caused by non-HOT updates. Most primary keys are serial PKs, or some kind of counter. Postgres should mostly split pages in these kinds of primary key indexes in the places that make sense based on the dataset, and not because of "write amplification".

-- 
Peter Geoghegan
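The single-heap-page check can also be sketched as a toy model (again, invented names, not real PostgreSQL code). The unique-index invariant it relies on: if the latest version of a key is visible to every possible snapshot, then all other versions of that key must be dead:

```python
def killable_versions(versions, latest_visible_to_all):
    """Hypothetical _bt_vacuum_one_page() fallback for a unique index.

    versions: list of (xmin, is_latest) entries for one unique key.
    One heap page visit answers `latest_visible_to_all`; if True, every
    non-latest version is dead and may be killed even without its
    LP_DEAD bit set. If False, the single check was wasted effort and
    we kill nothing extra."""
    if not latest_visible_to_all:
        return []
    return [v for v in versions if not v[1]]

# Three versions of one key; only the newest (xmin 110) is current.
versions = [(100, False), (105, False), (110, True)]
dead = killable_versions(versions, latest_visible_to_all=True)   # kills 100, 105
none = killable_versions(versions, latest_visible_to_all=False)  # wasted check
```

This shows the asymmetry argued for above: the downside is bounded (one wasted heap page visit), while the upside is reclaiming every stale version of the key in one pass.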
v17-0005-Reintroduce-unique-index-support.patch
Description: Binary data