On 21.01.2013 15:19, Heikki Linnakangas wrote:
On 21.01.2013 15:06, Tom Lane wrote:
Jeff Davis <pg...@j-davis.com> writes:
On Mon, 2013-01-21 at 00:48 -0500, Tom Lane wrote:
I looked at this patch. ISTM we should not have the option at all but
just do it always. I cannot believe that always-go-left is ever a
preferable strategy in the long run; the resulting imbalance in the
index will surely kill any possible benefit. Even if there are some
cases where it miraculously fails to lose, how many users are going to
realize that applies to their case and make use of the option?

Sounds good to me.

If I remember correctly, there was also an argument that it may be
useful for repeatable test results. That's a little questionable for
performance (except in those cases where few penalties are identical
anyway), but could plausibly be useful for a crash report or something.

Meh. There's already a random decision, in the equivalent place and for
a comparable reason, in btree (cf _bt_findinsertloc). Nobody's ever
complained about that being indeterminate, so I'm unconvinced that
there's a market for it with gist.

I wonder if it would work for gist to do something similar to
_bt_findinsertloc, and have a bias towards the left page, but sometimes
descend to one of the pages to the right. You would get the cache
locality of usually descending down the same subtree, but also fill the
pages to the right. Jeff / Alexander, want to give that a shot?

I did some experimenting with that. I used the same test case Alexander did, with the geonames data, and compared the unpatched version, the original patch, and the attached patch, which biases towards the first "best" tuple found but still sometimes chooses one of the other equally good ones.
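(To give a rough idea of the workload: the test repeatedly picks random rows and deletes and re-inserts them, roughly like the sketch below. This is only an illustration, not the attached gistbloat.sh; the table, column and index names here are made up.)

  -- hypothetical geonames table with a GiST index on the coordinate column
  CREATE INDEX geonames_coord_idx ON geonames USING gist (coord);
  -- repeated in a loop, with the id and coordinate picked at random each time:
  DELETE FROM geonames WHERE id = :victim_id;
  INSERT INTO geonames (id, coord) VALUES (:victim_id, :victim_coord);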

    testname    | initsize | finalsize | idx_blks_read | idx_blks_hit
----------------+----------+-----------+---------------+--------------
 patched-10-4mb | 75497472 |  90202112 |       5853604 |     10178331
 unpatched-4mb  | 75145216 |  94863360 |       5880676 |     10185647
 unpatched-4mb  | 75587584 |  97165312 |       5903107 |     10183759
 patched-2-4mb  | 74768384 |  81403904 |       5768124 |     10193738
 origpatch-4mb  | 74883072 |  82182144 |       5783412 |     10185373

All these tests were performed with shared_buffers=4MB, so that the index won't fit completely in shared buffers, and I could use the idx_blks_read/hit ratio as a measure of cache-friendliness. The "origpatch" test was run with a simplified version of Alexander's patch, see attached; it's the same as the original, but with the randomization option and the GiST-build-related code removed. The patched-10 and patched-2 tests are two variants of my patch, with a 1/10 and 1/2 chance of moving to the next equally good tuple, respectively.

The differences in cache hit ratio might just be a result of the different index sizes. I included two unpatched runs above to show that there's a fair amount of noise in these tests; that's because of the random nature of the test case, which picks rows to delete and insert at random.
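For reference, the idx_blks_read/hit numbers are available from the standard statistics views, and the index size from pg_relation_size(); a query along these lines (with a hypothetical index name) collects them:

  SELECT pg_relation_size('geonames_coord_idx') AS index_size,
         idx_blks_read, idx_blks_hit
  FROM pg_statio_user_indexes
  WHERE indexrelname = 'geonames_coord_idx';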

I think the conclusion is that all of these patches are effective. The 1/10 variant is less effective, as expected, since it's closer to the unpatched behavior than the others are. The 1/2 variant seems as good as the original patch.


I performed another test by inserting 1000000 duplicate values into an empty table.

  testname  | finalsize | idx_blks_read | idx_blks_hit
------------+-----------+---------------+--------------
 unpatched  |  89030656 |         21350 |      2972033
 patched-10 |  88973312 |         21450 |      2971920
 patched-2  |  88481792 |         22947 |      2970260
 origpatch  |  61186048 |        761817 |      2221500

The original patch produces a much smaller index in this test, but since the inserts are spread randomly across the index, the pages are also accessed randomly, which is bad for caching.
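(For reference, the duplicate-value test is roughly of the following shape; again just a sketch with made-up names, not the attached duplicatetest.sh.)

  CREATE TABLE duptest (p point);
  CREATE INDEX duptest_gist_idx ON duptest USING gist (p);
  -- a million identical keys, so every penalty comparison in gistchoose ties
  INSERT INTO duptest SELECT point(0, 0) FROM generate_series(1, 1000000);
  SELECT pg_relation_size('duptest_gist_idx') AS finalsize;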

A table full of duplicates isn't very realistic, but overall, I'm leaning towards my version of this patch (gistchoose-2.patch). It has less potential for causing a regression in existing applications, but is just as effective in the original scenario of repeated delete+insert.

- Heikki

Attachment: duplicatetest.sh
Description: Bourne shell script

Attachment: gistbloat.sh
Description: Bourne shell script

diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 8b60774..75778f6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -379,6 +379,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 	GISTENTRY	entry,
 				identry[INDEX_MAX_KEYS];
 	bool		isnull[INDEX_MAX_KEYS];
+	int			look_further_on_equal = -1;
 
 	Assert(!GistPageIsLeaf(p));
 
@@ -446,6 +447,8 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 
 				if (j < r->rd_att->natts - 1)
 					best_penalty[j + 1] = -1;
+
+				look_further_on_equal = -1;
 			}
 			else if (best_penalty[j] == usize)
 			{
@@ -468,12 +471,46 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 		}
 
 		/*
-		 * If we find a tuple with zero penalty for all columns, there's no
-		 * need to examine remaining tuples; just break out of the loop and
-		 * return it.
+		 * If we find a tuple that's exactly as good as the currently best
+		 * one, we could use either one. When inserting a lot of tuples with
+		 * the same or similar keys, it's preferable to descend down the same
+		 * path when possible, as that's more cache-friendly. On the other
+		 * hand, if all inserts land on the same leaf page, after a split,
+		 * we're never going to insert anything to the other half of the
+		 * split, and end up using only 50% of the available space.
+		 * Distributing the inserts evenly would lead to better usage, 75%,
+		 * but that destroys cache-locality. To get the best of both worlds,
+		 * when we find a tuple that's just as good as the previous best, choose
+		 * whether to stick to the old best, or use the new one. Once we
+		 * decide to stick to the old best, we stick to that for any
+		 * subsequent equally good tuples we might find. This heavily favors
+		 * tuples with low offsets, but still allows some inserts to go to
+		 * other subtrees.
+		 */
+		if (j == r->rd_att->natts && result != i)
+		{
+			if (look_further_on_equal == -1)
+				look_further_on_equal = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+			if (look_further_on_equal == 1)
+			{
+				result = i;
+				look_further_on_equal = -1;
+			}
+		}
+
+		/*
+		 * If we find a tuple with zero penalty for all columns, and we've
+		 * decided we don't want to search for another tuple with equal
+		 * penalty, there's no need to examine remaining tuples; just break
+		 * out of the loop and return it.
 		 */
 		if (zero_penalty)
-			break;
+		{
+			if (look_further_on_equal == -1)
+				look_further_on_equal = (random() <= (MAX_RANDOM_VALUE / 2)) ? 1 : 0;
+			if (look_further_on_equal == 0)
+				break;
+		}
 	}
 
 	return result;
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 8b60774..af4d4ea 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -375,6 +375,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 	OffsetNumber result;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber offsets[MaxOffsetNumber];
 	float		best_penalty[INDEX_MAX_KEYS];
 	GISTENTRY	entry,
 				identry[INDEX_MAX_KEYS];
@@ -407,11 +408,43 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 	maxoff = PageGetMaxOffsetNumber(p);
 	Assert(maxoff >= FirstOffsetNumber);
 
-	for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+	if (true)
 	{
-		IndexTuple	itup = (IndexTuple) PageGetItem(p, PageGetItemId(p, i));
-		bool		zero_penalty;
-		int			j;
+		/*
+		 * If randomization is required then prepare array of offset numbers
+		 * in order to remember which offset numbers were already examined.
+		 */
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			offsets[i - FirstOffsetNumber] = i;
+	}
+
+	for (i = 0; i < maxoff; i++)
+	{
+		IndexTuple		itup;
+		bool			zero_penalty;
+		int				j;
+		OffsetNumber	offset;
+
+		if (true)
+		{
+			/*
+			 * Randomization: select random offset number from those which
+			 * aren't previously selected.
+			 */
+			OffsetNumber rnd_i, tmp;
+			rnd_i = random() % (maxoff - i) + i;
+			tmp = offsets[rnd_i];
+			offsets[rnd_i] = offsets[i];
+			offsets[i] = tmp;
+			offset = offsets[i];
+		}
+		else
+		{
+			/* No randomization: select next offset number */
+			offset = i + FirstOffsetNumber;
+		}
+
+		itup = (IndexTuple) PageGetItem(p, PageGetItemId(p, offset));
 
 		zero_penalty = true;
 
@@ -424,7 +457,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 
 			/* Compute penalty for this column. */
 			datum = index_getattr(itup, j + 1, giststate->tupdesc, &IsNull);
-			gistdentryinit(giststate, j, &entry, datum, r, p, i,
+			gistdentryinit(giststate, j, &entry, datum, r, p, offset,
 						   FALSE, IsNull);
 			usize = gistpenalty(giststate, j, &entry, IsNull,
 								&identry[j], isnull[j]);
@@ -441,7 +474,7 @@ gistchoose(Relation r, Page p, IndexTuple it,	/* it has compressed entry */
 				 * adopt this tuple's penalty values as the best for all the
 				 * remaining columns during subsequent loop iterations.
 				 */
-				result = i;
+				result = offset;
 				best_penalty[j] = usize;
 
 				if (j < r->rd_att->natts - 1)