On Fri, Sep 27, 2019 at 9:43 AM Anastasia Lubennikova <a.lubennik...@postgrespro.ru> wrote:
> Attached is v19.
Cool.

> * By default deduplication is on for non-unique indexes and off for
> unique ones.

I think that it makes sense to enable deduplication by default -- even with unique indexes. It looks like deduplication can be very helpful with non-HOT updates.

I have been benchmarking this using more or less standard pgbench at scale 200, with one big difference -- I also create an index on "pgbench_accounts (abalance)". This is a low cardinality index, which ends up about 3x smaller with the patch, as expected. It also makes all updates non-HOT updates, greatly increasing index bloat in the primary key of the accounts table -- this is what I found really interesting about this workload.

The theory behind deduplication within unique indexes seems quite different to the cases we've focussed on so far -- that's why my working copy of the patch makes a few small changes to how _bt_dedup_one_page() works with unique indexes specifically (more on that later). With unique indexes, deduplication doesn't help by creating space -- it helps by creating *time* for garbage collection to run before the real "damage" is done -- it delays page splits. This is only truly valuable when page splits caused by non-HOT updates are delayed by so much that they're actually prevented entirely, typically because the _bt_vacuum_one_page() stuff can now happen before pages split, not after.

In general, these page splits are bad because they degrade the B-Tree structure, more or less permanently (it's certainly permanent with this workload). Having a huge number of page splits *purely* because of non-HOT updates is particularly bad -- it's just awful. I believe that this is the single biggest problem with the Postgres approach to versioned storage (we know that other DB systems have no primary key page splits with this kind of workload).

Anyway, if you run this pgbench workload without rate-limiting, so that a patched Postgres does as much work as physically possible, the accounts table primary key (pgbench_accounts_pkey) at least grows at a slower rate -- the patch clearly beats master at the start of the benchmark/test (as measured by index size). As the clients are ramped up by my testing script, and as time goes on, eventually the size of the pgbench_accounts_pkey index "catches up" with master. The patch delays page splits, but ultimately the system as a whole cannot prevent the page splits altogether, since the server is being absolutely hammered by pgbench. Actually, the index is *exactly* the same size for both the master case and the patch case when we reach this "bloat saturation point". We can delay the problem, but we cannot prevent it.

But what about a more realistic workload, with rate-limiting? When I add some rate limiting, so that the TPS/throughput is at about 50% of what I got the first time around (i.e. 50% of what is possible), or 15k TPS, it's very different. _bt_dedup_one_page() can now effectively cooperate with _bt_vacuum_one_page(). Now deduplication is able to "soak up all the extra garbage tuples" for long enough to delay and ultimately *prevent* almost all page splits. pgbench_accounts_pkey starts off at 428 MB for both master and patch (CREATE INDEX makes it that size). After about an hour, the index is 447 MB with the patch. The master case ends up with a pgbench_accounts_pkey size of 854 MB, though (this is very close to 857 MB, the "saturation point" index size from before).
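For reference, the setup for both runs was roughly as follows (the exact client count, thread count, run duration, and database name here are approximate/from memory, since my testing script actually ramps the clients up over time):

pgbench -i -s 200 test
psql test -c "CREATE INDEX ON pgbench_accounts (abalance);"

# Unthrottled run -- do as much work as physically possible:
pgbench -M prepared -c 16 -j 16 -T 3600 test

# Rate-limited run -- throttle to ~15k TPS (~50% of what's possible here):
pgbench -M prepared -c 16 -j 16 -T 3600 -R 15000 test

The -R/--rate option is just pgbench's built-in throttling; nothing beyond that is needed to get the "realistic workload" behavior described above.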
This is a very significant improvement, obviously -- the patch has an index that is ~52% of the size seen for the same index with the master branch!

Here is how I changed _bt_dedup_one_page() for unique indexes to get this result:

* We limit the size of posting lists to 5 heap TIDs in the checkingunique case. Right now, we will actually accept a checkingunique page split before we'll merge together items that result in a posting list with more heap TIDs than that (not sure about these details at all, though).

* Avoid creating a new posting list that the caller will have to split immediately anyway (this is based on details of _bt_dedup_one_page() caller's newitem tuple).

(Not sure how much this customization contributes to this favorable test result -- maybe it doesn't make that much difference.)

The goal here is for duplicates that are close together in both time and space to get "clumped together" into their own distinct, small-ish posting list tuples with no more than 5 TIDs. This is intended to help _bt_vacuum_one_page(), which is known to be a very important mechanism for indexes like our pgbench_accounts_pkey index (LP_DEAD bits are set very frequently within _bt_check_unique()). The general idea is to balance deduplication against LP_DEAD killing, and to increase spatial/temporal locality within these smaller posting lists. If we have one huge posting list for each value, then we can't set the LP_DEAD bit on anything at all, which is very bad. If we have a few posting lists that are not so big for each distinct value, we can often kill most of them within _bt_vacuum_one_page(), which is very good, and has minimal downside (i.e. we still get most of the benefits of aggressive deduplication).

Interestingly, these non-HOT page splits all seem to "come in waves". I noticed this because I carefully monitored the benchmark/test case over time. The patch doesn't prevent the "waves of page splits" pattern, but it does make it much, much less noticeable.

> * New function _bt_dedup_is_possible() is intended to be a single place
> to perform all the checks. Now it's just a stub to ensure that it works.
>
> Is there a way to extract this from existing opclass information,
> or we need to add new opclass field? Have you already started this work?
> I recall there was another thread, but didn't manage to find it.

The thread is here:

https://www.postgresql.org/message-id/flat/cah2-wzn3ee49gmxb7v1vj3-ac8fwn-fr8pfwqebhe8ryrxt...@mail.gmail.com

--
Peter Geoghegan