Em ter., 10 de ago. de 2021 às 05:53, Yura Sokolov <y.soko...@postgrespro.ru>
escreveu:

> Good day, hackers.
>
> Our client caught process stuck in tuplehash_grow. There was a query
> like
> `select ts, count(*) from really_huge_partitioned_table group by ts`,
> and
> planner decided to use hash aggregation.
>
> Backtrace shows that oldsize were 2147483648 at the moment. While
> newsize
> were optimized, looks like it were SH_MAX_SIZE.
>
> #0  0x0000000000603d0c in tuplehash_grow (tb=0x7f18c3c284c8,
> newsize=<optimized out>) at ../../../src/include/lib/simplehash.h:457
>          hash = 2147483654
>          startelem = 1
>          curelem = 1
>          oldentry = 0x7f00c299e0d8
>          oldsize = 2147483648
>          olddata = 0x7f00c299e048
>          newdata = 0x32e0448
>          i = 6
>          copyelem = 6
>
> EXPLAIN shows that there are 2604186278 rows in all partitions, but
> planner
> thinks there will be only 200 unique rows after group by. Looks like we
> was
> mistaken.
>
>   Finalize GroupAggregate  (cost=154211885.42..154211936.09 rows=200
> width=16)
>      Group Key: really_huge_partitioned_table.ts
>      ->  Gather Merge  (cost=154211885.42..154211932.09 rows=400
> width=16)
>            Workers Planned: 2
>            ->  Sort  (cost=154210885.39..154210885.89 rows=200 width=16)
>                  Sort Key: really_huge_partitioned_table.ts
>                  ->  Partial HashAggregate
> (cost=154210875.75..154210877.75 rows=200 width=16)
>                        Group Key: really_huge_partitioned_table.ts
>                        ->  Append  (cost=0.43..141189944.36
> rows=2604186278 width=8)
>                              ->  Parallel Index Only Scan using
> really_huge_partitioned_table_001_idx2 on
> really_huge_partitioned_table_001  (cost=0.43..108117.92 rows=2236977
> width=8)
>                              ->  Parallel Index Only Scan using
> really_huge_partitioned_table_002_idx2 on
> really_huge_partitioned_table_002  (cost=0.43..114928.19 rows=2377989
> width=8)
>                              .... and more than 400 partitions more
>
> After some investigation I found bug that is present in simplehash from
> its
> beginning:
>
> - sizemask is set only in SH_COMPUTE_PARAMETERS . And it is set in this
> way:
>
>          /* now set size */
>          tb->size = size;
>
>          if (tb->size == SH_MAX_SIZE)
>                  tb->sizemask = 0;
>          else
>                  tb->sizemask = tb->size - 1;
>
>    that means, when we are resizing to SH_MAX_SIZE, sizemask becomes
> zero.
>
> - then sizemask is used to SH_INITIAL_BUCKET and SH_NEXT to compute
> initial and
>    next position:
>
>    SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
>      return hash & tb->sizemask;
>    SH_NEXT(SH_TYPE * tb, uint32 curelem, uint32 startelem)
>      curelem = (curelem + 1) & tb->sizemask;
>
> - and then SH_GROW stuck in element placing loop:
>
>    startelem = SH_INITIAL_BUCKET(tb, hash);
>    curelem = startelem;
>    while (true)
>      curelem = SH_NEXT(tb, curelem, startelem);
>
> There is Assert(curelem != startelem) in SH_NEXT, but since no one test
> it
> with 2 billion elements, it were not triggered. And Assert is not
> compiled
> in production code.
>
> Attached patch fixes it with removing condition and type casting:
>
>          /* now set size */
>          tb->size = size;
>          tb->sizemask = (uint32)(size - 1);
>
>
> OOOOOOPS
>
> While writting this letter, I looke at newdata in the frame of
> tuplehash_grow:
>
>      newdata = 0x32e0448
>
> It is bellow 4GB border. Allocator does not allocate many-gigabytes
> chunks
> (and we certainly need 96GB in this case) in sub 4GB address space.
> Because
> mmap doesn't do this.
>
> I went to check SH_GROW and.... It is `SH_GROW(SH_TYPE *tb, uint32
> newsize)`
> :-(((
> Therefore when `tb->size == SH_MAX_SIZE/2` and we call `SH_GROW(tb,
> tb->size * 2)`,
> then SH_GROW(tb, 0) is called due to truncation.
> And SH_COMPUTE_PARAMETERS is also accepts `uint32 newsize`.
>
> Ahh... ok, patch is updated to fix this as well.
>
It seems that we need to fix the function prototype too.

/* void <prefix>_grow(<prefix>_hash *tb) */
-SH_SCOPE void SH_GROW(SH_TYPE * tb, uint32 newsize);
+SH_SCOPE void SH_GROW(SH_TYPE * tb, uint64 newsize);

regards,
Ranier Vilela

Reply via email to