On Tue, Aug 10, 2021 at 05:53, Yura Sokolov <y.soko...@postgrespro.ru> wrote:
> Good day, hackers.
>
> Our client caught a process stuck in tuplehash_grow. There was a query like
> `select ts, count(*) from really_huge_partitioned_table group by ts`,
> and the planner decided to use hash aggregation.
>
> The backtrace shows that oldsize was 2147483648 at that moment. While newsize
> was optimized out, it looks like it was SH_MAX_SIZE.
>
> #0 0x0000000000603d0c in tuplehash_grow (tb=0x7f18c3c284c8,
>     newsize=<optimized out>) at ../../../src/include/lib/simplehash.h:457
>         hash = 2147483654
>         startelem = 1
>         curelem = 1
>         oldentry = 0x7f00c299e0d8
>         oldsize = 2147483648
>         olddata = 0x7f00c299e048
>         newdata = 0x32e0448
>         i = 6
>         copyelem = 6
>
> EXPLAIN shows that there are 2604186278 rows in all partitions, but the
> planner thinks there will be only 200 unique rows after the GROUP BY.
> It looks like the planner was mistaken.
>
> Finalize GroupAggregate (cost=154211885.42..154211936.09 rows=200 width=16)
>   Group Key: really_huge_partitioned_table.ts
>   -> Gather Merge (cost=154211885.42..154211932.09 rows=400 width=16)
>        Workers Planned: 2
>        -> Sort (cost=154210885.39..154210885.89 rows=200 width=16)
>             Sort Key: really_huge_partitioned_table.ts
>             -> Partial HashAggregate (cost=154210875.75..154210877.75 rows=200 width=16)
>                  Group Key: really_huge_partitioned_table.ts
>                  -> Append (cost=0.43..141189944.36 rows=2604186278 width=8)
>                       -> Parallel Index Only Scan using really_huge_partitioned_table_001_idx2
>                          on really_huge_partitioned_table_001 (cost=0.43..108117.92 rows=2236977 width=8)
>                       -> Parallel Index Only Scan using really_huge_partitioned_table_002_idx2
>                          on really_huge_partitioned_table_002 (cost=0.43..114928.19 rows=2377989 width=8)
>                       .... and more than 400 further partitions
>
> After some investigation I found a bug that has been present in simplehash
> from its beginning:
>
> - sizemask is set only in SH_COMPUTE_PARAMETERS.
> And it is set in this way:
>
>     /* now set size */
>     tb->size = size;
>
>     if (tb->size == SH_MAX_SIZE)
>         tb->sizemask = 0;
>     else
>         tb->sizemask = tb->size - 1;
>
> That means that when we resize to SH_MAX_SIZE, sizemask becomes zero.
>
> - sizemask is then used in SH_INITIAL_BUCKET and SH_NEXT to compute the
>   initial and next positions:
>
>     SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
>         return hash & tb->sizemask;
>
>     SH_NEXT(SH_TYPE * tb, uint32 curelem, uint32 startelem)
>         curelem = (curelem + 1) & tb->sizemask;
>
> - and then SH_GROW gets stuck in the element-placing loop:
>
>     startelem = SH_INITIAL_BUCKET(tb, hash);
>     curelem = startelem;
>     while (true)
>         curelem = SH_NEXT(tb, curelem, startelem);
>
> There is Assert(curelem != startelem) in SH_NEXT, but since no one ever
> tested it with 2 billion elements, it was never triggered. And Assert is
> not compiled into production code.
>
> The attached patch fixes it by removing the condition and casting the type:
>
>     /* now set size */
>     tb->size = size;
>     tb->sizemask = (uint32) (size - 1);
>
> OOOOOOPS
>
> While writing this letter, I looked at newdata in the tuplehash_grow frame:
>
>     newdata = 0x32e0448
>
> It is below the 4GB border. The allocator does not place many-gigabyte
> chunks (and we certainly need 96GB in this case) in the sub-4GB address
> space, because mmap doesn't do that.
>
> I went to check SH_GROW and.... it is `SH_GROW(SH_TYPE *tb, uint32 newsize)`
> :-(((
> Therefore, when `tb->size == SH_MAX_SIZE/2` and we call
> `SH_GROW(tb, tb->size * 2)`, SH_GROW(tb, 0) is called due to truncation.
> And SH_COMPUTE_PARAMETERS also accepts a `uint32 newsize`.
>
> Ahh... ok, the patch is updated to fix this as well.

It seems that we need to fix the function prototype too:

 /* void <prefix>_grow(<prefix>_hash *tb) */
-SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
+SH_SCOPE void SH_GROW(SH_TYPE * tb, uint64 newsize);

regards,
Ranier Vilela