On Wed, Sep 19, 2018 at 11:23 AM Peter Geoghegan <p...@bowt.ie> wrote:
> 3 modes
> -------
>
> My new approach is to teach _bt_findsplitloc() 3 distinct modes of
> operation: Regular/default mode, many duplicates mode, and single
> value mode.
I think that I'll have to add a fourth mode, since I came up with another strategy that is really effective, though totally complementary to the other 3 -- "multiple insertion point" mode. Credit goes to Kevin Grittner for pointing out that this technique exists about 2 years ago [1].

The general idea is to pick a split point just after the insertion point of the new item (the incoming tuple that prompted a page split) when it looks like there are localized monotonically increasing ranges. This is like a rightmost 90:10 page split, except the insertion point is not at the rightmost page on the level -- it's rightmost within some local grouping of values.

This makes the two largest TPC-C indexes *much* smaller. Previously, they were shrunk by a little over 5% by the new generic strategy, a win that now seems like small potatoes. With this new mode, TPC-C's order_line primary key, which is the largest index of all, is ~45% smaller following a standard initial bulk load at scale factor 50. It shrinks from 99,085 blocks (774.10 MiB) to 55,020 blocks (429.84 MiB). It's actually slightly smaller than it would be after a fresh REINDEX with the new strategy. We see almost as big a win with the second largest TPC-C index, the stock table's primary key -- it's ~40% smaller.

Here is the definition of the biggest index, the order_line primary key index:

pg@tpcc[3666]=# \d order_line_pkey
    Index "public.order_line_pkey"
  Column   │  Type   │ Key? │ Definition
───────────┼─────────┼──────┼────────────
 ol_w_id   │ integer │ yes  │ ol_w_id
 ol_d_id   │ integer │ yes  │ ol_d_id
 ol_o_id   │ integer │ yes  │ ol_o_id
 ol_number │ integer │ yes  │ ol_number
primary key, btree, for table "public.order_line"

The new strategy/mode works very well because we see monotonically increasing inserts on ol_number (an order's item number), but those are grouped by order.
It's kind of an adversarial case for our existing implementation, and yet it seems like it's probably a fairly common scenario in the real world.

Obviously these are very significant improvements. They really exceed my initial expectations for the patch. TPC-C is generally considered to be by far the most influential database benchmark of all time, and this is something that we need to pay more attention to. My sense is that the TPC-C benchmark is deliberately designed to almost require that the system under test have this "multiple insertion point" B-Tree optimization, suffix truncation, etc. This is exactly the same index that we've seen reports of out-of-control bloat on when people run TPC-C over hours or days [2].

My next task is to find heuristics that make the new page split mode/strategy kick in when it's likely to help, but not kick in when it isn't (when we want something close to a generic 50:50 page split). These heuristics should look similar to what I've already done to get cases with lots of duplicates to behave sensibly. Anyone have any ideas on how to do this?

I might end up inferring a "multiple insertion point" case from the fact that there are multiple pass-by-value attributes for the index, with the new/incoming tuple having attribute values that are distinct from the tuple to its immediate left in the last column, but not in the first few columns. It also occurs to me to consider the fragmentation of the page as a guide, though I'm less sure about that. I'll probably need to experiment with a variety of datasets before I settle on something that looks good.

Forcing the new strategy without considering any of this actually works surprisingly well, even on cases where you'd think it wouldn't, since a 50:50 page split is already something of a guess about where future insertions will end up.

[1] https://postgr.es/m/CACjxUsN5fV0kV=yirxwa0s7lqoojuy7soptipdhucemhgwo...@mail.gmail.com
[2] https://www.commandprompt.com/blog/postgres_autovacuum_bloat_tpc-c/
--
Peter Geoghegan