On 21.04.2023 1:51 AM, Melanie Plageman wrote:
On Thu, Apr 20, 2023 at 12:42 PM Konstantin Knizhnik <knizh...@garret.ru> wrote:
On 11.04.2023 8:14 PM, Jehan-Guillaume de Rorthais wrote:
On Sat, 8 Apr 2023 02:01:19 +0200
Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:

On Fri, 31 Mar 2023 14:06:11 +0200
Jehan-Guillaume de Rorthais <j...@dalibo.com> wrote:

   [...]

After rebasing Tomas' memory balancing patch, I did some memory measures
to answer some of my questions. Please, find in attachment the resulting
charts "HJ-HEAD.png" and "balancing-v3.png" to compare memory consumption
between HEAD and Tomas' patch. They show the alternation of numbers
before/after calling ExecHashIncreaseNumBatches (see the debug patch). I
didn't try to find the exact last total peak of memory consumption during the
join phase and before all the BufFiles are destroyed. So the last number
might be underestimated.
I did some more analysis of the total memory consumption in the filecxt of HEAD
and the v3 and v4 patches. My previous debug numbers only print memory metrics
during batch increments or hash table destruction. That means:

* for HEAD: we miss the batches consumed during the outer scan
* for v3: it adds twice nbatch to spaceUsed, which is a rough estimation
* for v4: batches are tracked in spaceUsed, so they are reflected in spacePeak

Using a breakpoint in ExecHashJoinSaveTuple to print "filecxt->mem_allocated"
from there, here is the maximum allocated memory for the bufFile context on
each branch:

          batches   max bufFiles   total  spaceAllowed rise
    HEAD    16384      199966960  ~194MB
    v3       4096       65419456   ~78MB
    v4(*3)   2048       34273280    48MB  nbatch*sizeof(PGAlignedBlock)*3
    v4(*4)   1024       17170160  60.6MB  nbatch*sizeof(PGAlignedBlock)*4
    v4(*5)   2048       34273280  42.5MB  nbatch*sizeof(PGAlignedBlock)*5

It seems accounting for bufFile in spaceUsed allows better memory balancing and
management. The precise factor by which to raise spaceAllowed is yet to be
defined. *3 or *4 looks good, but this is based on a single artificial test case.
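
For reference, sizeof(PGAlignedBlock) is BLCKSZ, i.e. 8192 bytes with the
default block size, so the spaceAllowed rise formula above evaluates to, e.g.,
2048 * 8192 * 3 = 48MB for the v4(*3) row.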

Also, note that HEAD is currently reporting ~4MB of memory usage, which is far
from reality. So even if we don't commit the memory balancing patch in v16,
maybe we could account for filecxt in spaceUsed as a bugfix?
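
To illustrate the kind of bugfix I have in mind, here is a minimal sketch (not
the actual v4 patch, just a hypothetical helper): charge the buffer of each
newly created batch BufFile against the hash table accounting, so that
spacePeak - and therefore EXPLAIN's "Memory Usage" - reflects it. It assumes
the caller has the HashJoinTable at hand and uses BLCKSZ as the per-BufFile
buffer size.

#include "postgres.h"
#include "executor/hashjoin.h"	/* HashJoinTableData: spaceUsed, spacePeak */

/* Hypothetical helper: call it wherever a new batch temp file gets created. */
static void
ExecHashAccountForBatchFile(HashJoinTable hashtable)
{
	/* one buffered temp file == one BLCKSZ-sized buffer */
	hashtable->spaceUsed += BLCKSZ;
	if (hashtable->spaceUsed > hashtable->spacePeak)
		hashtable->spacePeak = hashtable->spaceUsed;
}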

Regards,
Thank you for the patch.
I faced the same problem (OOM caused by a hash join).
I tried to create the simplest test reproducing the problem:

create table t(pk int, val int);
insert into t values (generate_series(1,100000000),0);
set work_mem='64kB';
explain (analyze,buffers) select count(*) from t t1 join t t2 on (t1.pk=t2.pk);


There are three workers and the size of each exceeds 1.3GB.

The plan is the following:

   Finalize Aggregate  (cost=355905977972.87..355905977972.88 rows=1 width=8) (actual time=212961.033..226097.513 rows=1 loops=1)
     Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
     ->  Gather  (cost=355905977972.65..355905977972.86 rows=2 width=8) (actual time=212943.505..226097.497 rows=3 loops=1)
           Workers Planned: 2
           Workers Launched: 2
           Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
           ->  Partial Aggregate  (cost=355905976972.65..355905976972.66 rows=1 width=8) (actual time=212938.410..212940.035 rows=1 loops=3)
                 Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
                 ->  Parallel Hash Join  (cost=1542739.26..303822614472.65 rows=20833345000002 width=0) (actual time=163268.274..207829.524 rows=33333333 loops=3)
                       Hash Cond: (t1.pk = t2.pk)
                       Buffers: shared hit=32644 read=852474 dirtied=437947 written=426374, temp read=944407 written=1130380
                       ->  Parallel Seq Scan on t t1  (cost=0.00..859144.78 rows=41666678 width=4) (actual time=0.045..30828.051 rows=33333333 loops=3)
                             Buffers: shared hit=16389 read=426089 written=87
                       ->  Parallel Hash  (cost=859144.78..859144.78 rows=41666678 width=4) (actual time=82202.445..82202.447 rows=33333333 loops=3)
                             Buckets: 4096 (originally 4096)  Batches: 32768 (originally 8192)  Memory Usage: 192kB
                             Buffers: shared hit=16095 read=426383 dirtied=437947 written=426287, temp read=267898 written=737164
                             ->  Parallel Seq Scan on t t2  (cost=0.00..859144.78 rows=41666678 width=4) (actual time=0.054..12647.534 rows=33333333 loops=3)
                                   Buffers: shared hit=16095 read=426383 dirtied=437947 written=426287
   Planning:
     Buffers: shared hit=69 read=38
   Planning Time: 2.819 ms
   Execution Time: 226113.292 ms
(22 rows)



-----------------------------

So the number of batches was increased to 32k.
I applied your patches 0001-0004, but unfortunately they have not reduced
memory consumption - the size of each backend is still more than 1.3GB.
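As a rough back-of-envelope estimate (assuming one ~8kB BufFile buffer per
batch file, and an inner plus an outer file per batch in each process), 32768
batches can pin around 32768 * 2 * 8kB = 512MB of stream buffers per process
before even counting the hash table chunks, so a per-backend footprint well
above 1GB is not very surprising.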
Is this EXPLAIN ANALYZE run on an instance with Jehan-Guillaume's
patchset applied or without?

I'm asking because the fourth patch in the series updates spaceUsed with
the size of the BufFile->buffer, but I notice in your EXPLAIN ANALYZE,
Memory Usage for the hashtable is reported as 192 kB, which, while
larger than the 64kB work_mem you set, isn't as large as I might expect.

- Melanie
Yes, this is the explain analyze output for the Postgres version with the 4 patches applied:

0001-v4-Describe-hybrid-hash-join-implementation.patch
0002-v4-Allocate-hash-batches-related-BufFile-in-a-dedicated.patch
0003-v4-Add-some-debug-and-metrics.patch
0004-v4-Limit-BufFile-memory-explosion-with-bad-HashJoin.patch

Just as a workaround, I tried the attached patch - it prevents backend memory
footprint growth by limiting the number of created batches (the core check is
condensed in a sketch after the diff below). I am not sure that it is the
right solution, because in any case we allocate more memory than specified by
work_mem. The alternative is to prohibit the hash join plan in this case, but
that is also not a good solution, because a merge join tends to be much slower.
49c49
< static void ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable);
---
> static bool ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable);
896a897,901
> 		/* If the memory overhead of maintaining batches is expected to be larger than the size of the hash table itself,
> 		 * then try to split the available memory equally between the hash table and batches
> 		 */
> 		if (dbatch*HASH_CHUNK_SIZE > work_mem*1024L)
> 			dbatch = (work_mem*1024LL + dbatch*HASH_CHUNK_SIZE) / HASH_CHUNK_SIZE;
1021a1027,1031
> 
> 	/* Prevent increasing number of batches if their memory overhead is too large */
> 	if (nbatch*HASH_CHUNK_SIZE > work_mem*1024L)
> 		return;
> 
1166c1176
< static void
---
> static bool
1172a1183,1186
> 	/* Prevent increasing number of batches if their memory overhead is too large */
> 	if (hashtable->nbatch*2*HASH_CHUNK_SIZE > work_mem*1024L)
> 		return false;
> 
1374a1389
> 	return true;
1831d1845
< 		HeapTupleHeaderClearMatch(HJTUPLE_MINTUPLE(hashTuple));
2234a2249
> 	Assert(BarrierParticipants(&batch->batch_barrier) == 1);
3105,3106c3120,3127
< 			ExecParallelHashIncreaseNumBatches(hashtable);
< 		else if (growth == PHJ_GROWTH_NEED_MORE_BUCKETS)
---
> 		{
> 			if (ExecParallelHashIncreaseNumBatches(hashtable))
> 				return NULL;
> 			LWLockAcquire(&pstate->lock, LW_EXCLUSIVE);
> 			pstate->growth = PHJ_GROWTH_DISABLED;
> 		}
> 		else
> 		{
3109,3110c3130,3132
< 		/* The caller must retry. */
< 		return NULL;
---
> 			/* The caller must retry. */
> 			return NULL;
> 		}
3654c3676
< 			ExecParallelHashIncreaseNumBatches(hashtable);
---
> 			return !ExecParallelHashIncreaseNumBatches(hashtable);
3655a3678
> 		{
3657,3658c3680,3681
< 
< 		return false;
---
> 			return false;
> 		}
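
For readability, here is the core of the workaround condensed from the diff
above (the check sits in ExecParallelHashIncreaseNumBatches, presumably in
nodeHash.c; the repartitioning logic itself is elided):

/*
 * Refuse to double the number of batches once their memory overhead alone
 * would exceed work_mem; the caller then sets PHJ_GROWTH_DISABLED instead
 * of retrying.
 */
static bool
ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
{
	if (hashtable->nbatch * 2 * HASH_CHUNK_SIZE > work_mem * 1024L)
		return false;

	/* ... existing repartitioning logic ... */

	return true;
}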
