On 21.01.2013 15:19, Heikki Linnakangas wrote:
> On 21.01.2013 15:06, Tom Lane wrote:
>> Jeff Davis <pg...@j-davis.com> writes:
>>> On Mon, 2013-01-21 at 00:48 -0500, Tom Lane wrote:
>>>> I looked at this patch. ISTM we should not have the option at all
>>>> but just do it always. I cannot believe that always-go-left is ever
>>>> a preferable strategy in the long run; the resulting imbalance in
>>>> the index will surely kill any possible benefit. Even if there are
>>>> some cases where it miraculously fails to lose, how many users are
>>>> going to realize that applies to their case and make use of the
>>>> option?
>>>
>>> Sounds good to me.
>>>
>>> If I remember correctly, there was also an argument that it may be
>>> useful for repeatable test results. That's a little questionable for
>>> performance (except in those cases where few penalties are identical
>>> anyway), but could plausibly be useful for a crash report or
>>> something.
>>
>> Meh. There's already a random decision, in the equivalent place and
>> for a comparable reason, in btree (cf _bt_findinsertloc). Nobody's
>> ever complained about that being indeterminate, so I'm unconvinced
>> that there's a market for it with gist.
>
> I wonder if it would work for gist to do something similar to
> _bt_findinsertloc, and have a bias towards the left page, but
> sometimes descend to one of the pages to the right. You would get the
> cache locality of usually descending down the same subtree, but also
> fill the pages to the right. Jeff / Alexander, want to give that a
> shot?
I did some experimenting with that. I used the same test case Alexander did, with geonames data, and compared the unpatched version, the original patch, and the attached patch, which favors the first "best" tuple found but still sometimes chooses one of the other equally good ones.
    testname    | initsize | finalsize | idx_blks_read | idx_blks_hit
----------------+----------+-----------+---------------+--------------
 patched-10-4mb | 75497472 |  90202112 |       5853604 |     10178331
 unpatched-4mb  | 75145216 |  94863360 |       5880676 |     10185647
 unpatched-4mb  | 75587584 |  97165312 |       5903107 |     10183759
 patched-2-4mb  | 74768384 |  81403904 |       5768124 |     10193738
 origpatch-4mb  | 74883072 |  82182144 |       5783412 |     10185373

All these tests were performed with shared_buffers=4MB, so that the index won't fit completely in shared buffers, and I could use the idx_blks_read/hit ratio as a measure of cache-friendliness. The "origpatch" test was with a simplified version of Alexander's patch, see attached. It's the same as the original, but with the randomization option and gist-build related code removed. The patched-10 and patched-2 tests are two variants of my patch, with a 1/10 and 1/2 chance of moving to the next equally good tuple, respectively. The differences in cache hit ratio might be just a result of the different index sizes. I included two unpatched runs above to show that there's a fair amount of noise in these tests; that's because of the random nature of the test case, which picks rows to delete and insert at random.
I think the conclusion is that all of these patches are effective. The 1/10 variant is less effective, as expected, since it's closer to the unpatched behavior than the others. The 1/2 variant seems as good as the original patch.
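To put rough numbers on that: every time the scan hits an equally good tuple, the patch flips a coin with some probability p of moving on, so the final choice follows a truncated geometric distribution over the equally good candidates. Here's a small standalone C program, just to illustrate the math (it's not backend code, and it simplifies away the case of a strictly better tuple appearing later):

#include <stdio.h>

/*
 * Probability that the tie-break settles on candidate k out of n equally
 * good tuples, when each coin flip moves on with probability p and sticks
 * forever otherwise: P(k) = p^k * (1 - p), with the leftover probability
 * mass going to the last candidate.
 */
static void
print_distribution(double p, int n)
{
    double      reach = 1.0;    /* probability of reaching candidate k */
    int         k;

    printf("p = %.2f:", p);
    for (k = 0; k < n; k++)
    {
        double      prob = (k == n - 1) ? reach : reach * (1.0 - p);

        printf(" %.3f", prob);
        reach *= p;
    }
    printf("\n");
}

int
main(void)
{
    print_distribution(1.0 / 10, 8);    /* like patched-10: 0.900 0.090 ... */
    print_distribution(1.0 / 2, 8);     /* like patched-2:  0.500 0.250 ... */
    return 0;
}

With p = 1/10, about 90% of inserts descend to the first equally good tuple, which is why it behaves almost like unpatched; with p = 1/2 the tail is fat enough to keep filling the other subtrees.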
I performed another test, inserting 1000000 duplicate values into an empty table.
  testname  | finalsize | idx_blks_read | idx_blks_hit
------------+-----------+---------------+--------------
 unpatched  |  89030656 |         21350 |      2972033
 patched-10 |  88973312 |         21450 |      2971920
 patched-2  |  88481792 |         22947 |      2970260
 origpatch  |  61186048 |        761817 |      2221500

The original patch produces a much smaller index in this test, but since the inserts are distributed randomly, the pages are accessed randomly, which is bad for caching.
A table full of duplicates isn't very realistic, but overall, I'm leaning towards my version of this patch (gistchoose-2.patch). It has less potential for causing a regression in existing applications, but is just as effective in the original scenario of repeated delete+insert.
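Just to illustrate why I'm less worried about regressions with this version: with, say, eight equally good downlinks, the original patch should pick each of them (close to) uniformly at random, while gistchoose-2 still sends about half of the inserts down the first one. A throwaway simulation (plain C; the names and constants are made up for the example):

#include <stdio.h>
#include <stdlib.h>

#define NCAND   8
#define TRIALS  1000000

int
main(void)
{
    long        uniform_hits[NCAND] = {0};
    long        biased_hits[NCAND] = {0};
    long        t;
    int         k;

    srandom(12345);

    for (t = 0; t < TRIALS; t++)
    {
        /* original patch: every equally good tuple is equally likely */
        uniform_hits[random() % NCAND]++;

        /* gistchoose-2: move past the current best with probability 1/2 */
        k = 0;
        while (k < NCAND - 1 && (random() & 1))
            k++;
        biased_hits[k]++;
    }

    for (k = 0; k < NCAND; k++)
        printf("candidate %d: uniform %.3f  biased %.3f\n", k,
               (double) uniform_hits[k] / TRIALS,
               (double) biased_hits[k] / TRIALS);
    return 0;
}

The uniform strategy maximizes space usage but touches all subtrees equally often, while the biased one keeps most of the traffic on one subtree, which matches the cache behavior seen in the tests above.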
- Heikki
duplicatetest.sh
Description: Bourne shell script
gistbloat.sh
Description: Bourne shell script
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 8b60774..75778f6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -379,6 +379,7 @@ gistchoose(Relation r, Page p, IndexTuple it,   /* it has compressed entry */
     GISTENTRY   entry,
                 identry[INDEX_MAX_KEYS];
     bool        isnull[INDEX_MAX_KEYS];
+    int         look_further_on_equal = -1;
 
     Assert(!GistPageIsLeaf(p));
 
@@ -446,6 +447,8 @@ gistchoose(Relation r, Page p, IndexTuple it,   /* it has compressed entry */
 
             if (j < r->rd_att->natts - 1)
                 best_penalty[j + 1] = -1;
+
+            look_further_on_equal = -1;
         }
         else if (best_penalty[j] == usize)
         {
@@ -468,12 +471,46 @@ gistchoose(Relation r, Page p, IndexTuple it,   /* it has compressed entry */
         }
 
         /*
-         * If we find a tuple with zero penalty for all columns, there's no
-         * need to examine remaining tuples; just break out of the loop and
-         * return it.
+         * If we find a tuple that's exactly as good as the currently best
+         * one, we could use either one. When inserting a lot of tuples with
+         * the same or similar keys, it's preferable to descend down the same
+         * path when possible, as that's more cache-friendly. On the other
+         * hand, if all inserts land on the same leaf page, after a split,
+         * we're never going to insert anything to the other half of the
+         * split, and end up using only 50% of the available space.
+         * Distributing the inserts evenly would lead to better usage, 75%,
+         * but that destroys cache-locality. To get the best of both worlds,
+         * when we find a tuple that's equally good as the previous best, choose
+         * whether to stick to the old best, or use the new one. Once we
+         * decide to stick to the old best, we stick to that for any
+         * subsequent equally good tuples we might find. This heavily favors
+         * tuples with low offsets, but still allows some inserts to go to
+         * other subtrees.
+         */
+        if (j == r->rd_att->natts && result != i)
+        {
+            if (look_further_on_equal == -1)
+                look_further_on_equal = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+            if (look_further_on_equal == 1)
+            {
+                result = i;
+                look_further_on_equal = -1;
+            }
+        }
+
+        /*
+         * If we find a tuple with zero penalty for all columns, and we've
+         * decided we don't want to search for another tuple with equal
+         * penalty, there's no need to examine remaining tuples; just break
+         * out of the loop and return it.
          */
         if (zero_penalty)
-            break;
+        {
+            if (look_further_on_equal == -1)
+                look_further_on_equal = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+            if (look_further_on_equal == 0)
+                break;
+        }
     }
 
     return result;
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 8b60774..af4d4ea 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -375,6 +375,7 @@ gistchoose(Relation r, Page p, IndexTuple it,   /* it has compressed entry */
     OffsetNumber result;
     OffsetNumber maxoff;
     OffsetNumber i;
+    OffsetNumber offsets[MaxOffsetNumber];
     float       best_penalty[INDEX_MAX_KEYS];
     GISTENTRY   entry,
                 identry[INDEX_MAX_KEYS];
@@ -407,11 +408,43 @@ gistchoose(Relation r, Page p, IndexTuple it,   /* it has compressed entry */
     maxoff = PageGetMaxOffsetNumber(p);
     Assert(maxoff >= FirstOffsetNumber);
 
-    for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+    if (true)
     {
-        IndexTuple  itup = (IndexTuple) PageGetItem(p, PageGetItemId(p, i));
-        bool        zero_penalty;
-        int         j;
+        /*
+         * If randomization is required then prepare array of offset numbers
+         * in order to remember which offset numbers were already examined.
+         */
+        for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+            offsets[i - FirstOffsetNumber] = i;
+    }
+
+    for (i = 0; i < maxoff; i++)
+    {
+        IndexTuple  itup;
+        bool        zero_penalty;
+        int         j;
+        OffsetNumber offset;
+
+        if (true)
+        {
+            /*
+             * Randomization: select random offset number from those which
+             * aren't previously selected.
+             */
+            OffsetNumber rnd_i, tmp;
+
+            rnd_i = random() % (maxoff - i) + i;
+            tmp = offsets[rnd_i];
+            offsets[rnd_i] = offsets[i];
+            offsets[i] = tmp;
+            offset = offsets[i];
+        }
+        else
+        {
+            /* No randomization: select next offset number */
+            offset = i + FirstOffsetNumber;
+        }
+
+        itup = (IndexTuple) PageGetItem(p, PageGetItemId(p, offset));
 
         zero_penalty = true;
 
@@ -424,7 +457,7 @@ gistchoose(Relation r, Page p, IndexTuple it,   /* it has compressed entry */
 
             /* Compute penalty for this column. */
             datum = index_getattr(itup, j + 1, giststate->tupdesc, &IsNull);
-            gistdentryinit(giststate, j, &entry, datum, r, p, i,
+            gistdentryinit(giststate, j, &entry, datum, r, p, offset,
                            FALSE, IsNull);
             usize = gistpenalty(giststate, j, &entry, IsNull,
                                 &identry[j], isnull[j]);
@@ -441,7 +474,7 @@ gistchoose(Relation r, Page p, IndexTuple it,   /* it has compressed entry */
              * adopt this tuple's penalty values as the best for all the
              * remaining columns during subsequent loop iterations.
              */
-            result = i;
+            result = offset;
             best_penalty[j] = usize;
 
             if (j < r->rd_att->natts - 1)
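For reference, the randomization in the patch above is an incremental Fisher-Yates shuffle: each iteration swaps a uniformly chosen not-yet-visited offset into position i, so the offsets get visited in uniformly random order without a separate shuffle pass. A self-contained sketch of just that technique (hypothetical standalone C, not the patch itself):

#include <stdio.h>
#include <stdlib.h>

#define NITEMS  10

int
main(void)
{
    int         offsets[NITEMS];
    int         i;

    srandom(42);

    /* conceptually FirstOffsetNumber .. maxoff */
    for (i = 0; i < NITEMS; i++)
        offsets[i] = i + 1;

    for (i = 0; i < NITEMS; i++)
    {
        /* pick a uniformly random entry from the unvisited tail... */
        int         rnd_i = random() % (NITEMS - i) + i;
        int         tmp;

        /* ...and swap it into position i, exactly as the patch does */
        tmp = offsets[rnd_i];
        offsets[rnd_i] = offsets[i];
        offsets[i] = tmp;

        printf("visit offset %d\n", offsets[i]);
    }
    return 0;
}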