On Thu, 19 Feb 2026 at 21:01, Andres Freund <[email protected]> wrote: > On 2026-02-19 20:24:17 +0200, Ants Aasma wrote: > > On Thu, 19 Feb 2026 at 20:07, Ants Aasma <[email protected]> wrote: > > > I was thinking more along the lines of hashing together the pointer > > > value and worker number. But something more deterministic would indeed > > > be better. How about this? > > > > > > --- a/src/backend/executor/execGrouping.c > > > +++ b/src/backend/executor/execGrouping.c > > > @@ -201,3 +201,3 @@ BuildTupleHashTable(PlanState *parent, > > > MemoryContext oldcontext; > > > - uint32 hash_iv = 0; > > > + uint32 hash_iv = parent->plan->plan_node_id; > > > > I can confirm that this fixes the issue. A standalone reproducer is here: > > I think this needs should use something that smears the bits from the plan_id > more widely. The hash functions of some types are ... peculiar (partially due > to cross-type hashjoin support). > > I'd also combine it with use_variable_hash_iv, rather than just have > use_variable_hash_iv overwrite it.
This problem should only affect cases where multiple hash tables worth of tuples are trying to fit into one. I think right now that limits it to hash aggregates with parallel query. ExecBuildHash32FromAttrs comment says that we should be using non-zero init_value only when needed for a marginal performance gain. So I limited the plan node id inclusion to use_variable_hash_iv, but made it unconditional for aggregates. I used hash_combine to smear together the bits of the two. I think that should be good enough. Alternatively a xor of murmurhash32 of the two should be better. There is also a call to ExecBuildHash32FromAttrs from ExecInitSubPlan that specifies a hash_iv for a hashed sub plan execution. This must match the value BuildTupleHashTable decides on. Worth adding a comment? Regards, Ants Aasma
From 624b4827e0d480fc16e016c1ad7c5b26f358b6f3 Mon Sep 17 00:00:00 2001 From: Ants Aasma <[email protected]> Date: Sat, 28 Feb 2026 15:08:50 +0200 Subject: [PATCH v1] Decorrelate nested hash tables Because hash tables are iterated in hash order, using the same hash function in nested hash tables can lead to excessive collisions. If the parent hash table can be the same size as the child hash tables it is not a problem as the parent will quickly grow to the same size as the child, eliminating further collisions. But if there are more than one child the collisions can cause the table to quickly grow until it is below 10% fillfactor. The problem is made worse by nodeAgg devolving into building single entry batches once hash table size exceeds working memory. To hit the problem an aggregate node without a final function above multiple other aggregate nodes is needed. Because hash iv is already initialized using parallel worker number when there is no final fn the typical cases do not hit this problem. A hash aggregate implementing DISTINCT above multiple parallel workers with more groups than fit into memory is needed. Initializing the hash function based on plan node id decorrelates hash tables within a plan while still keeping behavior deterministic. Author: Ants Aasma <[email protected]> Discussion: https://postgr.es/m/CANwKhkPOZupu3PYQVdkMmYjquYVqG2v8XmCAuuVM9Eu13-Zw3g%40mail.gmail.com --- src/backend/executor/execGrouping.c | 2 +- src/backend/executor/nodeAgg.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c index c107514a85d..21f5e4cabcc 100644 --- a/src/backend/executor/execGrouping.c +++ b/src/backend/executor/execGrouping.c @@ -246,7 +246,7 @@ BuildTupleHashTable(PlanState *parent, * underestimated. */ if (use_variable_hash_iv) - hash_iv = murmurhash32(ParallelWorkerNumber); + hash_iv = hash_combine(ParallelWorkerNumber, parent->plan->plan_node_id); hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable); diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c index 7d487a165fa..5c947fd151f 100644 --- a/src/backend/executor/nodeAgg.c +++ b/src/backend/executor/nodeAgg.c @@ -1535,7 +1535,7 @@ build_hash_table(AggState *aggstate, int setno, double nbuckets) metacxt, tuplescxt, tmpcxt, - DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit)); + true); } /* -- 2.51.0
