Hi,

back when we were discussing the hashjoin patches (now committed), Robert proposed that maybe it'd be a good idea to sometimes increase the number of tuples per bucket instead of batching.
That is, while initially sizing the hash table: if a hash table with enough buckets to satisfy the NTUP_PER_BUCKET load factor does not fit into work_mem, try a somewhat higher load factor before starting to batch.

The attached patch is an initial attempt to implement this - it's a bit rough around the edges, but hopefully good enough to judge the benefit of the approach.

The default load factor is 1. The patch tries to degrade this to 2, 4 or 8 in an attempt to fit the hash table into work_mem. If that doesn't work, it starts batching with the default load factor. If batching becomes necessary while the hash join is running, the load factor is switched back to the default (once we're batching, there's no point in keeping the slower hash table). The patch also modifies the EXPLAIN output to show the load factor.

The results of the testing I've done so far are rather disappointing, though.

create table a as select i from generate_series(1,1000000) s(i);
create table b as select i from generate_series(1,1000000) s(i);

analyze a;
analyze b;

select a.i, b.i from a join b on (a.i = b.i);

work_mem    batches    tuples per bucket    duration
-----------------------------------------------------
64 MB       1          1                     585 ms
46 MB       1          2                     639 ms
43 MB       1          4                     794 ms
40 MB       1          8                    1075 ms
39 MB       2          1                     623 ms

So even increasing the load factor to 2 is slower than batching. Of course, with other examples the results may differ. For example, with a much larger outer table:

create table a as select mod(i,1000000) i from generate_series(1,10000000) s(i);
analyze a;

work_mem    batches    tuples per bucket    duration
-----------------------------------------------------
64 MB       1          1                    3904 ms
46 MB       1          2                    4692 ms
43 MB       1          4                    6499 ms
40 MB       1          8                    9264 ms
39 MB       2          1                    4054 ms

Same results. Of course, for "huge" outer tables it will make a difference. For example, on a ~8GB outer table (on a machine with 8GB of RAM), batching causes enough I/O to make it slower than ntup=2 (50s vs. 80s, for example).

I'm not sure what the best way forward is - clearly, for cases that fit into RAM (temp files fit into the page cache), batching is faster. For tables large enough to cause a lot of I/O, the higher load factor may make a difference - but I'm not sure how to identify these cases.

So ISTM implementing this requires a reliable way to identify which case we're dealing with - whether the outer table is large enough to prefer increasing the load factor, or not (prefer batching). Until then, keeping the current simple/predictable approach is probably better. Of course, this also depends on additional variables (e.g. is the system memory-stressed?).

regards
Tomas
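PS: The easiest way to see the degraded load factor in action is something along these lines, using the work_mem values from the tables above (the exact thresholds depend on tuple widths, so they may need tweaking):

set work_mem = '46MB';
explain analyze select a.i, b.i from a join b on (a.i = b.i);

With the patch applied, the Hash node then reports the load factor in the new "Tuples: ..." field (see the explain.c changes below).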
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 332f04a..06e612a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1936,26 +1936,33 @@ show_hash_info(HashState *hashstate, ExplainState *es)
 			ExplainPropertyLong("Original Hash Batches",
 								hashtable->nbatch_original, es);
 			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
+			ExplainPropertyLong("Hash Tuples Per Bucket", hashtable->ntup_per_bucket, es);
+			ExplainPropertyLong("Original Tuples Per Bucket",
+								hashtable->ntup_per_bucket_original, es);
+			ExplainPropertyLong("Peak Memory Usage", spacePeakKb, es);
 		}
 		else if (hashtable->nbatch_original != hashtable->nbatch ||
-				 hashtable->nbuckets_original != hashtable->nbuckets)
+				 hashtable->nbuckets_original != hashtable->nbuckets ||
+				 hashtable->ntup_per_bucket_original != hashtable->ntup_per_bucket)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
+							 "Buckets: %d (originally %d)  Batches: %d (originally %d)  Tuples: %d (originally %d)  Memory Usage: %ldkB\n",
 							 hashtable->nbuckets,
 							 hashtable->nbuckets_original,
 							 hashtable->nbatch,
 							 hashtable->nbatch_original,
+							 hashtable->ntup_per_bucket,
+							 hashtable->ntup_per_bucket_original,
 							 spacePeakKb);
 		}
 		else
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
 			appendStringInfo(es->str,
-							 "Buckets: %d  Batches: %d  Memory Usage: %ldkB\n",
+							 "Buckets: %d  Batches: %d  Tuples: %d  Memory Usage: %ldkB\n",
 							 hashtable->nbuckets, hashtable->nbatch,
-							 spacePeakKb);
+							 hashtable->ntup_per_bucket, spacePeakKb);
 		}
 	}
 }
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 7c5bb77..fd4f56d 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -259,6 +259,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	Plan	   *outerNode;
 	int			nbuckets;
 	int			nbatch;
+	int			ntup;
 	int			num_skew_mcvs;
 	int			log2_nbuckets;
 	int			nkeys;
@@ -275,7 +276,7 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	ExecChooseHashTableSize(outerNode->plan_rows, outerNode->plan_width,
 							OidIsValid(node->skewTable),
-							&nbuckets, &nbatch, &num_skew_mcvs);
+							&nbuckets, &nbatch, &ntup, &num_skew_mcvs);
 
 #ifdef HJDEBUG
 	printf("nbatch = %d, nbuckets = %d\n", nbatch, nbuckets);
@@ -320,6 +321,8 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
 	hashtable->spaceAllowedSkew =
 		hashtable->spaceAllowed * SKEW_WORK_MEM_PERCENT / 100;
 	hashtable->chunks = NULL;
+	hashtable->ntup_per_bucket = ntup;
+	hashtable->ntup_per_bucket_original = ntup;
 
 	/*
 	 * Get info about the hash functions to be used for each hash key. Also
@@ -410,13 +413,15 @@ ExecHashTableCreate(Hash *node, List *hashOperators, bool keepNulls)
  * This is exported so that the planner's costsize.c can use it.
  */
 
-/* Target bucket loading (tuples per bucket) */
+/* Target bucket loading (tuples per bucket) - keep the values 2^N */
 #define NTUP_PER_BUCKET			1
+#define MAX_NTUP_PER_BUCKET		8
 
 void
 ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
 						int *numbatches,
+						int *ntup,
 						int *num_skew_mcvs)
 {
 	int			tupsize;
@@ -428,6 +433,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	int			nbatch = 1;
 	int			nbuckets;
 	double		dbuckets;
+	int			ntup_per_bucket;
 
 	/* Force a plausible relation size if no info */
 	if (ntuples <= 0.0)
@@ -498,6 +504,51 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 	nbuckets = Max((int) dbuckets, 1024);
 	nbuckets = 1 << my_log2(nbuckets);
 	bucket_bytes = sizeof(HashJoinTuple) * nbuckets;
+	ntup_per_bucket = NTUP_PER_BUCKET;
+
+	/*
+	 * If there's not enough space to store the projected number of tuples,
+	 * we can try to increase the number of tuples per bucket first, up to
+	 * MAX_NTUP_PER_BUCKET, in the hope that the smaller hash table will
+	 * fit into memory. This is likely still faster than batching,
+	 * especially with large outer tables that cause a lot of I/O.
+	 */
+	if (inner_rel_bytes + bucket_bytes > hash_table_bytes)
+	{
+		int			shrink_factor = 1;
+		int			tmp_per_bucket = ntup_per_bucket;
+		bool		fits_into_memory = false;
+
+		/*
+		 * Cut the number of buckets in half, until we reach the maximum
+		 * allowed number of tuples per bucket (MAX_NTUP_PER_BUCKET). This
+		 * makes evaluating bucket_bytes much easier (just cut the size in
+		 * half).
+		 */
+		while (tmp_per_bucket * 2 <= MAX_NTUP_PER_BUCKET)
+		{
+			shrink_factor *= 2;
+			tmp_per_bucket *= 2;
+
+			if (inner_rel_bytes + bucket_bytes / shrink_factor <= hash_table_bytes)
+			{
+				fits_into_memory = true;
+				break;
+			}
+		}
+
+		/*
+		 * If resizing the hash table helps, we need to keep track of the
+		 * number of buckets, limit, etc.
+		 */
+		if (fits_into_memory)
+		{
+			elog(WARNING, "hash table fits into memory with shrink factor = %d",
+				 shrink_factor);
+
+			nbuckets /= shrink_factor;
+			bucket_bytes /= shrink_factor;
+			ntup_per_bucket = tmp_per_bucket;
+		}
+	}
 
 	/*
 	 * If there's not enough space to store the projected number of tuples
@@ -544,6 +595,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 
 	*numbuckets = nbuckets;
 	*numbatches = nbatch;
+	*ntup = ntup_per_bucket;
 }
 
@@ -610,6 +662,78 @@ ExecHashIncreaseNumBatches(HashJoinTable hashtable)
 		   nbatch, (unsigned long) hashtable->spaceUsed);
 #endif
 
+	/*
+	 * In batching mode we always shoot for NTUP_PER_BUCKET, without any
+	 * graceful degradation (allowing more tuples per bucket instead of
+	 * batching). If the current load factor is different (which can only
+	 * happen once, on the first split), we need to compute the number of
+	 * buckets for the batching mode, using the same logic as in
+	 * ExecChooseHashTableSize().
+	 *
+	 * At this point we don't know the estimated tuple width etc., but we
+	 * do know how much space we're using and how many tuples there are,
+	 * so we can compute the tuple size (maybe even more accurately).
+	 *
+	 * The question is whether this increase of bucket space (we're
+	 * lowering NTUP_PER_BUCKET, thus requiring more space for the
+	 * buckets) can result in more than doubling the number of batches
+	 * in a single step. And the answer is no, because even if we
+	 * increase the number of buckets to the optimal value (with a load
+	 * factor of 1), it still can't occupy more than 50% of work_mem, just
+	 * like the tuples after increasing the number of batches. So we're
+	 * still just doubling the number of batches.
+	 *
+	 * XXX Could this reasoning be broken by computing a different
+	 * tuple width estimate?
+	 *
+	 * XXX We have already increased nbatch.
+	 */
+	if (hashtable->ntup_per_bucket != NTUP_PER_BUCKET)
+	{
+		long		lbuckets;
+		long		bucket_size;
+		long		tupsize;
+
+		long		hash_table_bytes = work_mem * 1024L;
+		long		max_pointers = (work_mem * 1024L) / sizeof(void *);
+
+		long		bucket_bytes;
+		long		nbuckets;
+
+		Assert(oldnbatch == 1);
+
+		/* use the optimal number of tuples per bucket */
+		hashtable->ntup_per_bucket = NTUP_PER_BUCKET;
+
+		/* compute size of a single tuple (including overhead) */
+		tupsize = ceil((double) hashtable->spaceUsed / hashtable->totalTuples);
+
+		/*
+		 * Estimate the number of buckets we'll want to have when work_mem
+		 * is entirely full. Each bucket will contain a bucket pointer plus
+		 * NTUP_PER_BUCKET tuples, whose projected size already includes
+		 * overhead for the hash code, pointer to the next tuple, etc.
+		 */
+		bucket_size = (tupsize * NTUP_PER_BUCKET + sizeof(HashJoinTuple));
+		lbuckets = 1 << my_log2(hash_table_bytes / bucket_size);
+		lbuckets = Min(lbuckets, max_pointers);
+		nbuckets = (int) lbuckets;
+
+		bucket_bytes = nbuckets * sizeof(HashJoinTuple);
+
+		/*
+		 * Buckets are simple pointers to hashjoin tuples, while tupsize
+		 * includes the pointer, hash code, and MinimalTupleData. So buckets
+		 * should never really exceed 25% of work_mem (even for
+		 * NTUP_PER_BUCKET=1); except maybe for work_mem values that are
+		 * not 2^N bytes, where we might get more because of doubling.
+		 * So let's look for 50% here.
+		 */
+		Assert(bucket_bytes <= hash_table_bytes / 2);
+
+		hashtable->nbuckets_optimal = nbuckets;
+		hashtable->log2_nbuckets_optimal = my_log2(nbuckets);
+	}
+
 	oldcxt = MemoryContextSwitchTo(hashtable->hashCxt);
 
 	if (hashtable->innerBatchFile == NULL)
@@ -873,7 +997,7 @@ ExecHashTableInsert(HashJoinTable hashtable,
 	 */
 	if ((hashtable->nbatch == 1) &&
 		(hashtable->nbuckets_optimal <= INT_MAX / 2) &&		/* overflow protection */
-		(ntuples >= (hashtable->nbuckets_optimal * NTUP_PER_BUCKET)))
+		(ntuples >= (hashtable->nbuckets_optimal * hashtable->ntup_per_bucket)))
 	{
 		hashtable->nbuckets_optimal *= 2;
 		hashtable->log2_nbuckets_optimal += 1;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 659daa2..ef6e0f7 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -2420,6 +2420,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 	int			num_hashclauses = list_length(hashclauses);
 	int			numbuckets;
 	int			numbatches;
+	int			ntup;
 	int			num_skew_mcvs;
 
 	/* cost of source data */
@@ -2456,6 +2457,7 @@ initial_cost_hashjoin(PlannerInfo *root, JoinCostWorkspace *workspace,
 							true,		/* useskew */
 							&numbuckets,
 							&numbatches,
+							&ntup,
 							&num_skew_mcvs);
 
 	/*
diff --git a/src/include/executor/hashjoin.h b/src/include/executor/hashjoin.h
index 0e1e0cd..b7eefac 100644
--- a/src/include/executor/hashjoin.h
+++ b/src/include/executor/hashjoin.h
@@ -131,6 +131,9 @@ typedef struct HashJoinTableData
 	int			nbuckets_optimal;		/* optimal # buckets (per batch) */
 	int			log2_nbuckets_optimal;	/* same as log2_nbuckets optimal */
 
+	int			ntup_per_bucket;			/* number of tuples per bucket */
+	int			ntup_per_bucket_original;	/* original number of tuples per bucket */
+
 	/* buckets[i] is head of list of tuples in i'th in-memory bucket */
 	struct HashJoinTupleData **buckets;
 	/* buckets array is per-batch storage, as are all the tuples */
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 75be5bd..a96ad90 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -47,6 +47,7 @@ extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
 						int *numbuckets,
 						int *numbatches,
+						int *ntup,
 						int *num_skew_mcvs);
 extern int	ExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue);