On Thu, Dec 8, 2016 at 5:23 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Nov 23, 2016 at 3:33 AM, Andres Freund <and...@anarazel.de> wrote:
>> On 2016-11-18 08:00:40 -0500, Robert Haas wrote:
>>> On Tue, Nov 15, 2016 at 2:28 PM, Andres Freund <and...@anarazel.de> wrote:
>>> > I've a working fix for this, and for a similar issue Robert found. I'm
>>> > still playing around with it, but basically the fix is to make the
>>> > growth policy a bit more adaptive.
>>>
>>> Any chance you can post a patch soon?
>>
>> Here's my WIP series addressing this and related problems. With this
>> we're again noticeably faster than the dynahash implementation, in both
>> the case here, and the query you brought up over IM.
>>
>> This definitely needs some more TLC, but the general approach seems
>> good. I particularly like that it apparently allows us to increase the
>> default fillfactor without much downside according to my measurements.
>
> Are you going to commit something here?  At least enough to make
> Finalize HashAgg -> Gather -> Partial HashAgg terminate in finite
> time?  Because the fact that it doesn't really sucks.

I took a look at Andres's patches today and saw that they can't really
be applied as-is, because they "temporarily" change the master's
ParallelWorkerNumber but have no provision for restoring it after an
ERROR.  I think that approach isn't right anyway; it seems to me that
what we want to do is set hash_iv based on we're in a Partial HashAgg,
not whether we're in a parallel worker.  For instance, if we're
somehow in a nodeSubplan.c there's no need for this just because we
happen to be in a worker -- I think.  That led me to develop the
attached patch.

This may not be perfect, but it causes TPC-H Q18 to complete instead
of hanging forever, so I'm going to commit it RSN unless there are
loud objections combined with promising steps toward some alternative
fix.  It's been over a month since these problems were reported, and
it is not good that the tree has been in a broken state for that
entire time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 94cc59d..3149fbe 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -18,6 +18,8 @@
  */
 #include "postgres.h"
 
+#include "access/hash.h"
+#include "access/parallel.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
 #include "utils/lsyscache.h"
@@ -289,7 +291,8 @@ BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
 					FmgrInfo *eqfunctions,
 					FmgrInfo *hashfunctions,
 					long nbuckets, Size additionalsize,
-					MemoryContext tablecxt, MemoryContext tempcxt)
+					MemoryContext tablecxt, MemoryContext tempcxt,
+					bool use_variable_hash_iv)
 {
 	TupleHashTable hashtable;
 	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
@@ -314,6 +317,19 @@ BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
 	hashtable->in_hash_funcs = NULL;
 	hashtable->cur_eq_funcs = NULL;
 
+	/*
+	 * If parallelism is in use, even if the master backend is performing the
+	 * scan itself, we don't want to create the hashtable exactly the same way
+	 * in all workers. As hashtables are iterated over in keyspace-order,
+	 * doing so in all processes in the same way is likely to lead to
+	 * "unbalanced" hashtables when the table size initially is
+	 * underestimated.
+	 */
+	if (use_variable_hash_iv)
+		hashtable->hash_iv = hash_uint32(ParallelWorkerNumber);
+	else
+		hashtable->hash_iv = 0;
+
 	hashtable->hashtab = tuplehash_create(tablecxt, nbuckets);
 	hashtable->hashtab->private_data = hashtable;
 
@@ -450,7 +466,7 @@ TupleHashTableHash(struct tuplehash_hash *tb, const MinimalTuple tuple)
 	TupleHashTable hashtable = (TupleHashTable) tb->private_data;
 	int			numCols = hashtable->numCols;
 	AttrNumber *keyColIdx = hashtable->keyColIdx;
-	uint32		hashkey = 0;
+	uint32		hashkey = hashtable->hash_iv;
 	TupleTableSlot *slot;
 	FmgrInfo   *hashfunctions;
 	int			i;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index eefb3d6..a093862 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1723,7 +1723,8 @@ build_hash_table(AggState *aggstate)
 											  node->numGroups,
 											  additionalsize,
 							 aggstate->aggcontexts[0]->ecxt_per_tuple_memory,
-											  tmpmem);
+											  tmpmem,
+								  !DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
 }
 
 /*
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index acded07..5b734c0 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -43,7 +43,8 @@ build_hash_table(RecursiveUnionState *rustate)
 											 node->numGroups,
 											 0,
 											 rustate->tableContext,
-											 rustate->tempContext);
+											 rustate->tempContext,
+											 false);
 }
 
 
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index e94555e..760b935 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -130,7 +130,8 @@ build_hash_table(SetOpState *setopstate)
 												node->numGroups,
 												0,
 												setopstate->tableContext,
-												setopstate->tempContext);
+												setopstate->tempContext,
+												false);
 }
 
 /*
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 8ca8fc4..d343600 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -510,7 +510,8 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 										  nbuckets,
 										  0,
 										  node->hashtablecxt,
-										  node->hashtempcxt);
+										  node->hashtempcxt,
+										  false);
 
 	if (!subplan->unknownEqFalse)
 	{
@@ -529,7 +530,8 @@ buildSubPlanHash(SubPlanState *node, ExprContext *econtext)
 											  nbuckets,
 											  0,
 											  node->hashtablecxt,
-											  node->hashtempcxt);
+											  node->hashtempcxt,
+											  false);
 	}
 
 	/*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index b4d09f9..3f649fa 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -143,7 +143,7 @@ extern TupleHashTable BuildTupleHashTable(int numCols, AttrNumber *keyColIdx,
 					FmgrInfo *hashfunctions,
 					long nbuckets, Size additionalsize,
 					MemoryContext tablecxt,
-					MemoryContext tempcxt);
+					MemoryContext tempcxt, bool use_variable_hash_iv);
 extern TupleHashEntry LookupTupleHashEntry(TupleHashTable hashtable,
 					 TupleTableSlot *slot,
 					 bool *isnew);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1de5c81..703604a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -533,6 +533,7 @@ typedef struct TupleHashTableData
 	TupleTableSlot *inputslot;	/* current input tuple's slot */
 	FmgrInfo   *in_hash_funcs;	/* hash functions for input datatype(s) */
 	FmgrInfo   *cur_eq_funcs;	/* equality functions for input vs. table */
+	uint32		hash_iv;		/* hash-function IV */
 }	TupleHashTableData;
 
 typedef tuplehash_iterator TupleHashIterator;
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to