On 26/02/2019 12:31, Peter Geoghegan wrote:
> On Mon, Jan 28, 2019 at 7:32 AM Heikki Linnakangas <hlinn...@iki.fi> wrote:
>> I spent some time first trying to understand the current algorithm, and
>> then rewriting it in a way that I find easier to understand. I came up
>> with the attached. I think it optimizes for the same goals as your
>> patch, but the approach is quite different.
>
> Attached is v13 of the patch series, which significantly refactors
> nbtsplitloc.c to implement the algorithm using the approach from your
> prototype posted on January 28 -- I now take a "top down" approach
> that materializes all legal split points up-front, as opposed to the
> initial "bottom up" approach that used recursion, and weighed
> everything (balance of free space, suffix truncation, etc) all at
> once.
Great, looks much simpler now, indeed! Now I finally understand the
algorithm.
> I'm using qsort() to sort the candidate split points array. I'm not
> trying to do something clever to avoid the up-front effort of sorting
> everything, even though we could probably get away with that much of
> the time (e.g. by doing a top-N sort in default mode). Testing has
> shown that using an inlined qsort() routine in the style of
> tuplesort.c would make my serial test cases/microbenchmarks faster,
> without adding much complexity. We're already competitive with the
> master branch even without this microoptimization, so I've put that
> off for now. What do you think of the idea of specializing an
> inlineable qsort() for nbtsplitloc.c?
If the performance is acceptable without it, let's not bother. We can
optimize later.
What would be the worst case scenario for this? Splitting a page that
has as many tuples as possible, I guess, so maybe inserting into a table
with a single-column index, with 32k BLCKSZ. Have you done performance
testing on something like that?
> Unlike in your prototype, v13 makes the array for holding candidate
> split points into a single big allocation that is always exactly
> BLCKSZ. The idea is that palloc() can thereby recycle the big
> _bt_findsplitloc() allocation within _bt_split(). palloc() considers
> 8KiB to be the upper limit on the size of individual blocks it
> manages, and we'll always go on to palloc(BLCKSZ) through the
> _bt_split() call to PageGetTempPage(). In a sense, we're not even
> allocating memory that we weren't allocating already. (Not sure that
> this really matters, but it is easy to do it that way.)
Rounding up the allocation to BLCKSZ seems like a premature
optimization. Even if it saved some cycles, I don't think it's worth the
trouble of having to explain all that in the comment.
I think you could change the curdelta, leftfree, and rightfree fields in
SplitPoint to int16, to make the array smaller.
> Other changes from your prototype:
>
> * I found a more efficient representation than a pair of raw
> IndexTuple pointers for each candidate split. Actually, I use the same
> old representation (firstoldonright + newitemonleft) in each split,
> and provide routines to work backwards from that to get the lastleft
> and firstright tuples. This approach is far more space efficient, and
> space efficiency matters when you're allocating space for hundreds of
> items in a critical path like this.
Ok.
> * You seemed to refactor _bt_checksplitloc() in passing, making it not
> do the newitemisfirstonright thing. I changed that back. Did I miss
> something that you intended here?
My patch treated the new item the same as all the old items, in
_bt_checksplitloc(), so it didn't need newitemisonright. You still need
it with your approach.
> Changes to my own code since v12:
>
> * Simplified "Add "split after new tuple" optimization" commit, and
> made it more consistent with associated code. This is something that
> was made a lot easier by the new approach. It would be great to hear
> what you think of this.
I looked at it very briefly. Yeah, it's pretty simple now. Nice!
About this comment on _bt_findsplitloc():
/*
* _bt_findsplitloc() -- find an appropriate place to split a page.
*
* The main goal here is to equalize the free space that will be on each
* split page, *after accounting for the inserted tuple*. (If we fail to
* account for it, we might find ourselves with too little room on the page
* that it needs to go into!)
*
* If the page is the rightmost page on its level, we instead try to arrange
* to leave the left split page fillfactor% full. In this way, when we are
* inserting successively increasing keys (consider sequences, timestamps,
* etc) we will end up with a tree whose pages are about fillfactor% full,
* instead of the 50% full result that we'd get without this special case.
* This is the same as nbtsort.c produces for a newly-created tree. Note
* that leaf and nonleaf pages use different fillfactors.
*
* We are passed the intended insert position of the new tuple, expressed as
* the offsetnumber of the tuple it must go in front of (this could be
* maxoff+1 if the tuple is to go at the end). The new tuple itself is also
* passed, since it's needed to give some weight to how effective suffix
* truncation will be. The implementation picks the split point that
* maximizes the effectiveness of suffix truncation from a small list of
* alternative candidate split points that leave each side of the split with
* about the same share of free space. Suffix truncation is secondary to
* equalizing free space, except in cases with large numbers of duplicates.
* Note that it is always assumed that caller goes on to perform truncation,
* even with pg_upgrade'd indexes where that isn't actually the case
* (!heapkeyspace indexes). See nbtree/README for more information about
* suffix truncation.
*
* We return the index of the first existing tuple that should go on the
* righthand page, plus a boolean indicating whether the new tuple goes on
* the left or right page. The bool is necessary to disambiguate the case
* where firstright == newitemoff.
*
* The high key for the left page is formed using the first item on the
* right page, which may seem to be contrary to Lehman & Yao's approach of
* using the left page's last item as its new high key on the leaf level.
* It isn't, though: suffix truncation will leave the left page's high key
* fully equal to the last item on the left page when two tuples with equal
* key values (excluding heap TID) enclose the split point. It isn't
* necessary for a new leaf high key to be equal to the last item on the
* left for the L&Y "subtree" invariant to hold. It's sufficient to make
* sure that the new leaf high key is strictly less than the first item on
* the right leaf page, and greater than the last item on the left page.
* When suffix truncation isn't possible, L&Y's exact approach to leaf
* splits is taken (actually, a tuple with all the keys from firstright but
* the heap TID from lastleft is formed, so as to not introduce a special
* case).
*
* Starting with the first right item minimizes the divergence between leaf
* and internal splits when checking if a candidate split point is legal.
* It is also inherently necessary for suffix truncation, since truncation
* is a subtractive process that specifically requires lastleft and
* firstright inputs.
*/
This is pretty good, but I think some copy-editing can make this even
better. If you look at the old comment, it had this structure:
1. Explain what the function does.
2. Explain the arguments
3. Explain the return value.
The additions to this comment broke that structure. The explanations of
the arguments and the return value are now in the middle, in the 3rd and
4th paragraphs, and the 3rd paragraph, which explains the arguments, now
also goes into detail on what the function does with them. I'd suggest
moving things around to restore the old structure, which was clearer.
The explanation of how the high key for the left page is formed (5th
paragraph) seems out of place here, because the high key is not formed
in this function.
Somewhere in the 1st or 2nd paragraph, perhaps we should mention that
the function effectively uses a different fillfactor in some other
scenarios too, not only when it's the rightmost page.
In the function itself:
 * maxsplits should never exceed maxoff because there will be at most as
 * many candidate split points as there are points _between_ tuples, once
 * you imagine that the new item is already on the original page (the
 * final number of splits may be slightly lower because not all splits
 * will be legal). Even still, add space for an extra two splits out of
 * sheer paranoia.
 */
state.maxsplits = maxoff + 2;
state.splits = palloc(Max(BLCKSZ, sizeof(SplitPoint) * state.maxsplits));
state.nsplits = 0;
I wouldn't be that paranoid. The code that populates the array is pretty
straightforward.
/*
 * Scan through the data items and calculate space usage for a split at
 * each possible position. We start at the first data offset rather than
 * the second data offset to handle the "newitemoff == first data offset"
 * case (otherwise, a split whose firstoldonright is the first data offset
 * can't be legal, and won't actually end up being recorded by
 * _bt_recordsplit).
 *
 * Still, it's typical for almost all calls to _bt_recordsplit to
 * determine that the split is legal, and therefore enter it into the
 * candidate split point array for later consideration.
 */
Suggestion: Remove the "Still" word. The observation that typically all
split points are legal is valid, but it seems unrelated to the first
paragraph. (Do we need to mention it at all?)
/*
 * If the new item goes as the last item, record the split point that
 * leaves all the old items on the left page, and the new item on the
 * right page. This is required because a split that leaves the new item
 * as the firstoldonright won't have been reached within the loop. We
 * always record every possible split point.
 */
Suggestion: Remove the last sentence. An earlier comment already said
that we calculate space usage for a split at each possible position,
that seems sufficient. Like it was before this patch.
/*
 * Find lowest possible penalty among split points currently regarded as
 * acceptable -- the "perfect" penalty. The perfect penalty often saves
 * _bt_bestsplitloc() additional work around calculating penalties. This
 * is also a convenient point to determine if default mode worked out, or
 * if we should instead reassess which split points should be considered
 * acceptable (split interval, and possibly fillfactormult).
 */
perfectpenalty = _bt_perfect_penalty(rel, page, &state, newitemoff,
                                     newitem, &secondmode);
ISTM that figuring out which "mode" we want to operate in is actually
the *primary* purpose of _bt_perfect_penalty(). We only really use the
penalty as an optimization that we pass on to _bt_bestsplitloc(). So I'd
suggest changing the function name to something like _bt_choose_mode(),
and have secondmode be the primary return value from it, with
perfectpenalty being the secondary result through a pointer.
It doesn't really choose the mode, either, though. At least after the
next "Add split after new tuple optimization" patch. The caller has a
big part in choosing what to do. So maybe split _bt_perfect_penalty into
two functions: _bt_perfect_penalty(), which just computes the perfect
penalty, and _bt_analyze_split_interval(), which determines how many
duplicates there are in the top-N split points.
BTW, I like the word "strategy", like you called it in the comment on
SplitPoint struct, better than "mode".
+ if (usemult)
+ delta = fillfactormult * split->leftfree -
+ (1.0 - fillfactormult) * split->rightfree;
+ else
+ delta = split->leftfree - split->rightfree;
How about removing the "usemult" variable, and just check if
fillfactormult == 0.5?
/*
* There are a much smaller number of candidate split points when
* splitting an internal page, so we can afford to be exhaustive. Only
* give up when pivot that will be inserted into parent is as small as
* possible.
*/
if (!state->is_leaf)
return MAXALIGN(sizeof(IndexTupleData) + 1);
Why are there fewer candidate split points on an internal page?
- Heikki