Re: DSA overflow in hash join

Konstantin Knizhnik Sat, 16 Aug 2025 04:54:59 -0700

Thank you for review!

On 14/08/2025 10:34 PM, Heikki Linnakangas wrote:

On 31/07/2025 18:13, Konstantin Knizhnik wrote:
On 27/07/2025 8:24 PM, Konstantin Knizhnik wrote:
I still trying to understand the reason of DSA overflow in hash join.
In addition to two suspicious places where number of buckets isdoubled without chek for overflow (nodeHash.c:1668 andnodeHash.c:3290),there is one more place where number of batches is multiplied by`EstimateParallelHashJoinBatch(hashtable)` which is
sizeof(ParallelHashJoinBatch) + (sizeof(SharedTuplestore) +sizeof(SharedTuplestoreParticipant) * participants) * 2
which is 480 bytes!
But when we calculate maximal number of batches, we limit it bymacximal number of pointers (8 bytes):
    max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
max_pointers = Min(max_pointers, MaxAllocSize /sizeof(HashJoinTuple));
    /* If max_pointers isn't a power of 2, must round it down to one */
    max_pointers = pg_prevpower2_size_t(max_pointers);

    /* Also ensure we avoid integer overflow in nbatch and nbuckets */
/* (this step is redundant given the current value ofMaxAllocSize) */
    max_pointers = Min(max_pointers, INT_MAX / 2 + 1);

    dbuckets = ceil(ntuples / NTUP_PER_BUCKET);
    dbuckets = Min(dbuckets, max_pointers);
    nbuckets = (int) dbuckets;


But as we see, here multiplier is 480 bytes, not 8 bytes.
Below is script to reproduce the problem:
CREATE TABLE IF NOT EXISTS t0(c0 FLOAT, PRIMARY KEY(c0)) WITH(parallel_workers=966);
CREATE TABLE t2(c0 DECIMAL, c1 int4range ) WITH (parallel_workers=393);
CREATE TABLE t4(LIKE t2);
CREATE TABLE t5(LIKE t0);
INSERT INTO t4(c0) VALUES(0.5934077416223362);

set work_mem='10MB';
set max_parallel_workers_per_gather=5;
explain SELECT SUM(count) FROM (SELECT ALL CAST(FALSE AS INT) ascount FROM ONLY t5, ONLY t2 CROSS JOIN ONLY t0 LEFT OUTER JOIN t4* ON(upper(((t2.c1)+(t2.c1))))::BOOLEAN CROSS JOIN (SELECT t4.c0 FROMONLY t0, t2*, t5*, t4* WHERE (((t2.c1)*(t2.c1))) IN (t4.c1)) AS sub0)as res;
SELECT SUM(count) FROM (SELECT ALL CAST(FALSE AS INT) as count FROMONLY t5, ONLY t2 CROSS JOIN ONLY t0 LEFT OUTER JOIN t4* ON(upper(((t2.c1)+(t2.c1))))::BOOLEAN CROSS JOIN (SELECT t4.c0 FROMONLY t0, t2*, t5*, t4* WHERE (((t2.c1)*(t2.c1))) IN (t4.c1)) AS sub0)as res;
Great repro, thanks!
And attached please find patch fixing the issue.
There are a lot of allocations in hash join. For example this inExecParallelHashEnsureBatchAccessors():
    /* Allocate this backend's accessor array. */
    hashtable->nbatch = pstate->nbatch;
    hashtable->batches =
        palloc0_array(ParallelHashJoinBatchAccessor, hashtable->nbatch);
That could also exceed MaxAllocSize, right? I think we need to takethat into account in calculating the max number of batches, too.

Yes, I concentrated on DSA and didn't notice it. Also added it to limitnumber of batches.

Is this only a problem for parallel hash joins?
In general, it's really hard to follow the logic of where we enforcewhat limits in ExecChooseHashTableSize(). How can we be sure that weaccounted for all allocations? Could we have more (static) assertionsor something, to ensure we've covered everything?

Definitely I can not guarantee that there are no much such places inPostgres code.But at least in executor files, only nodeHash takes in accountMaxAllocSize when choosing hash table parameters.Certainly if some module doesn't take in account `MaxAllocSize`, itdoesn't means that there can not be OOM problems (either in Postgresmemory context or in shared memory). May be OOM problem was just notconsidered...

A different approach might be to make ExecChooseHashTableSize() returnjust estimates, and enforce the hard limits later. But that could leadto worse decisions: it's good for ExecChooseHashTableSize() to takethe maximum limits into account.

Yes, I think that number of batches should be correlated with number ofbuckets. So it is better to restrict their values from the verybeginning (in ExecChooseHashTableSize) rather then do it right beforeallocation.

On the proposed patch:
@@ -775,6 +777,16 @@ ExecChooseHashTableSize(double ntuples, inttupwidth, bool useskew,
     /* If max_pointers isn't a power of 2, must round it down to one */
     max_pointers = pg_prevpower2_size_t(max_pointers);

+    /*
+ * Prevent DSA overflow. This is expanded definition ofEstimateParallelHashJoinBatch function
+     * used in ExecParallelHashJoinSetUpBatches:
+     *     dsa_allocate0(hashtable->area,
+     * EstimateParallelHashJoinBatch(hashtable) * nbatch)
+     */
+ max_batches = MaxAllocSize /(MAXALIGN(sizeof(ParallelHashJoinBatch)) +
+ MAXALIGN(sts_estimate(max_parallel_workers_per_gather + 1) * 2));
+    max_batches = pg_prevpower2_size_t(max_batches);
+
     /* Also ensure we avoid integer overflow in nbatch and nbuckets */
/* (this step is redundant given the current value ofMaxAllocSize) */
     max_pointers = Min(max_pointers, INT_MAX / 2 + 1);
Would be nice to not copy the macro here..


Yeh... extr4acting macro looks really ugly.

But I don't know how to avoid it, because macro is using pointer to hashtable which is not allocated yes. But what is worser,result depends on number of participants which is not known yet at thisstage.

So I have to use upper limit (`max_parallel_workers_per_gather`) instead.

BTW I don't like the name EstimateParallelHashJoinBatch(). It's notjust an estimate, it's used to allocate memory for an array, and itbetter be accurate. See also NthParallelHashJoinBatch.
@@ -910,7 +923,8 @@ ExecChooseHashTableSize(double ntuples, inttupwidth, bool useskew, * this during the initial sizing - once we start building thehash,
      * nbucket is fixed.
      */
-    while (nbatch > 0)
+    while (nbatch > 0 &&
+ nbuckets * 2 <= max_pointers) /* prevent allocation limitoverflow */
Hmm, that doesn't seem quite right. 'max_pointers' is derived fromwork_mem, but the point of this loop is to reduce memory usage, whenyou're already over the work_mem limit. I think we should use a hardlimit derived only from MaxAllocSize here.

Fixed

     {
         /* how much memory are we using with current nbatch value */
size_t current_space = hash_table_bytes + (2 * nbatch* BLCKSZ);
Hmm, should we add the memory usage calculated byEstimateParallelHashJoinBatch() to this per-batch overhead, too?

Should also take in account`palloc0_array(ParallelHashJoinBatchAccessor, hashtable->nbatch)` - itis ~80*nbatches.But may be this multiplier (80) as well multiplier calculated byEstimateParallelHashJoinBatch (480) can be ignore comparing with BLCKSZ=8kB?May be it will have not so much influence on choice of optimal rationbetween number of buckets and batches?

Attached please find new version of the patch (without change ofcurrent_space calculation).

diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8d2201ab67f..5d312c07159 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -35,6 +35,7 @@
 #include "executor/nodeHash.h"
 #include "executor/nodeHashjoin.h"
 #include "miscadmin.h"
+#include "optimizer/cost.h"
 #include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/lsyscache.h"
@@ -667,7 +668,9 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool 
useskew,
        double          inner_rel_bytes;
        size_t          hash_table_bytes;
        size_t          bucket_bytes;
+       size_t          max_alloc_pointers;
        size_t          max_pointers;
+       size_t          max_batches;
        int                     nbatch = 1;
        int                     nbuckets;
        double          dbuckets;
@@ -771,10 +774,22 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, 
bool useskew,
         * ExecHashGetBucketAndBatch fast.
         */
        max_pointers = hash_table_bytes / sizeof(HashJoinTuple);
-       max_pointers = Min(max_pointers, MaxAllocSize / sizeof(HashJoinTuple));
+       max_alloc_pointers = MaxAllocSize / sizeof(HashJoinTuple);
+       max_pointers = Min(max_pointers, macx_alloca_pointers);
        /* If max_pointers isn't a power of 2, must round it down to one */
        max_pointers = pg_prevpower2_size_t(max_pointers);
 
+       /*
+        * Prevent DSA overflow. This is expanded definition of 
EstimateParallelHashJoinBatch function
+        * used in ExecParallelHashJoinSetUpBatches:
+        *     dsa_allocate0(hashtable->area,
+        *                   EstimateParallelHashJoinBatch(hashtable) * nbatch)
+        */
+       max_batches = MaxAllocSize / (MAXALIGN(sizeof(ParallelHashJoinBatch)) +
+                                                                 
MAXALIGN(sts_estimate(max_parallel_workers_per_gather + 1) * 2));
+       max_batches = Min(max_batches, MaxAllocSize / 
sizeof(ParallelHashJoinBatchAccessor));
+       max_batches = pg_prevpower2_size_t(max_batches);
+
        /* Also ensure we avoid integer overflow in nbatch and nbuckets */
        /* (this step is redundant given the current value of MaxAllocSize) */
        max_pointers = Min(max_pointers, INT_MAX / 2 + 1);
@@ -844,6 +859,7 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool 
useskew,
                /* Calculate required number of batches. */
                dbatch = ceil(inner_rel_bytes / (hash_table_bytes - 
bucket_bytes));
                dbatch = Min(dbatch, max_pointers);
+               dbatch = Min(dbatch, max_batches);
                minbatch = (int) dbatch;
                nbatch = pg_nextpower2_32(Max(2, minbatch));
        }
@@ -910,7 +926,8 @@ ExecChooseHashTableSize(double ntuples, int tupwidth, bool 
useskew,
         * this during the initial sizing - once we start building the hash,
         * nbucket is fixed.
         */
-       while (nbatch > 0)
+       while (nbatch > 0 &&
+                  nbuckets * 2 <= max_alloc_pointers) /* prevent allocation 
limit overflow */
        {
                /* how much memory are we using with current nbatch value */
                size_t          current_space = hash_table_bytes + (2 * nbatch 
* BLCKSZ);

Re: DSA overflow in hash join

Reply via email to