On Thu, 19 Feb 2026 at 21:01, Andres Freund <[email protected]> wrote:
> On 2026-02-19 20:24:17 +0200, Ants Aasma wrote:
> > On Thu, 19 Feb 2026 at 20:07, Ants Aasma <[email protected]> wrote:
> > > I was thinking more along the lines of hashing together the pointer
> > > value and worker number. But something more deterministic would indeed
> > > be better. How about this?
> > >
> > > --- a/src/backend/executor/execGrouping.c
> > > +++ b/src/backend/executor/execGrouping.c
> > > @@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent,
> > >         MemoryContext oldcontext;
> > > -       uint32          hash_iv = 0;
> > > +       uint32          hash_iv = parent->plan->plan_node_id;
> >
> > I can confirm that this fixes the issue. A standalone reproducer is here:
>
> I think this needs should use something that smears the bits from the plan_id
> more widely. The hash functions of some types are ... peculiar (partially due
> to cross-type hashjoin support).
>
> I'd also combine it with use_variable_hash_iv, rather than just have
> use_variable_hash_iv overwrite it.

This problem should only affect cases where multiple hash tables worth
of tuples are trying to fit into one. I think right now that limits it
to hash aggregates with parallel query. ExecBuildHash32FromAttrs
comment says that we should be using non-zero init_value only when
needed for a marginal performance gain. So I limited the plan node id
inclusion to use_variable_hash_iv, but made it unconditional for
aggregates.

I used hash_combine to smear together the bits of the two. I think
that should be good enough. Alternatively a xor of murmurhash32 of the
two should be better.

There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan
that specifies a hash_iv for a hashed sub plan execution. This must
match the value BuildTupleHashTable decides on. Worth adding a
comment?

Regards,
Ants Aasma
From 624b4827e0d480fc16e016c1ad7c5b26f358b6f3 Mon Sep 17 00:00:00 2001
From: Ants Aasma <[email protected]>
Date: Sat, 28 Feb 2026 15:08:50 +0200
Subject: [PATCH v1] Decorrelate nested hash tables

Because hash tables are iterated in hash order, using the same hash
function in nested hash tables can lead to excessive collisions. If the
parent hash table can be the same size as the child hash tables it is
not a problem as the parent will quickly grow to the same size as the
child, eliminating further collisions. But if there are more than one
child the collisions can cause the table to quickly grow until it is
below 10% fillfactor.

The problem is made worse by nodeAgg devolving into building single
entry batches once hash table size exceeds working memory.

To hit the problem an aggregate node without a final function above
multiple other aggregate nodes is needed. Because hash iv is already
initialized using parallel worker number when there is no final fn
the typical cases do not hit this problem. A hash aggregate implementing
DISTINCT above multiple parallel workers with more groups than fit into
memory is needed.

Initializing the hash function based on plan node id decorrelates hash
tables within a plan while still keeping behavior deterministic.

Author: Ants Aasma <[email protected]>
Discussion: https://postgr.es/m/CANwKhkPOZupu3PYQVdkMmYjquYVqG2v8XmCAuuVM9Eu13-Zw3g%40mail.gmail.com
---
 src/backend/executor/execGrouping.c | 2 +-
 src/backend/executor/nodeAgg.c      | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index c107514a85d..21f5e4cabcc 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -246,7 +246,7 @@ BuildTupleHashTable(PlanState *parent,
 	 * underestimated.
 	 */
 	if (use_variable_hash_iv)
-		hash_iv = murmurhash32(ParallelWorkerNumber);
+		hash_iv = hash_combine(ParallelWorkerNumber, parent->plan->plan_node_id);
 
 	hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7d487a165fa..5c947fd151f 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets)
 											 metacxt,
 											 tuplescxt,
 											 tmpcxt,
-											 DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
+											 true);
 }
 
 /*
-- 
2.51.0

Reply via email to