Good day, hackers.

Our client caught process stuck in tuplehash_grow. There was a query like `select ts, count(*) from really_huge_partitioned_table group by ts`, and
planner decided to use hash aggregation.

Backtrace shows that oldsize were 2147483648 at the moment. While newsize
were optimized, looks like it were SH_MAX_SIZE.

#0 0x0000000000603d0c in tuplehash_grow (tb=0x7f18c3c284c8, newsize=<optimized out>) at ../../../src/include/lib/simplehash.h:457
        hash = 2147483654
        startelem = 1
        curelem = 1
        oldentry = 0x7f00c299e0d8
        oldsize = 2147483648
        olddata = 0x7f00c299e048
        newdata = 0x32e0448
        i = 6
        copyelem = 6

EXPLAIN shows that there are 2604186278 rows in all partitions, but planner thinks there will be only 200 unique rows after group by. Looks like we was
mistaken.

Finalize GroupAggregate (cost=154211885.42..154211936.09 rows=200 width=16)
    Group Key: really_huge_partitioned_table.ts
-> Gather Merge (cost=154211885.42..154211932.09 rows=400 width=16)
          Workers Planned: 2
          ->  Sort  (cost=154210885.39..154210885.89 rows=200 width=16)
                Sort Key: really_huge_partitioned_table.ts
-> Partial HashAggregate (cost=154210875.75..154210877.75 rows=200 width=16)
                      Group Key: really_huge_partitioned_table.ts
-> Append (cost=0.43..141189944.36 rows=2604186278 width=8) -> Parallel Index Only Scan using really_huge_partitioned_table_001_idx2 on really_huge_partitioned_table_001 (cost=0.43..108117.92 rows=2236977 width=8) -> Parallel Index Only Scan using really_huge_partitioned_table_002_idx2 on really_huge_partitioned_table_002 (cost=0.43..114928.19 rows=2377989 width=8)
                            .... and more than 400 partitions more

After some investigation I found bug that is present in simplehash from its
beginning:

- sizemask is set only in SH_COMPUTE_PARAMETERS . And it is set in this way:

        /* now set size */
        tb->size = size;

        if (tb->size == SH_MAX_SIZE)
                tb->sizemask = 0;
        else
                tb->sizemask = tb->size - 1;

that means, when we are resizing to SH_MAX_SIZE, sizemask becomes zero.

- then sizemask is used to SH_INITIAL_BUCKET and SH_NEXT to compute initial and
  next position:

  SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
    return hash & tb->sizemask;
  SH_NEXT(SH_TYPE * tb, uint32 curelem, uint32 startelem)
    curelem = (curelem + 1) & tb->sizemask;

- and then SH_GROW stuck in element placing loop:

  startelem = SH_INITIAL_BUCKET(tb, hash);
  curelem = startelem;
  while (true)
    curelem = SH_NEXT(tb, curelem, startelem);

There is Assert(curelem != startelem) in SH_NEXT, but since no one test it with 2 billion elements, it were not triggered. And Assert is not compiled
in production code.

Attached patch fixes it with removing condition and type casting:

        /* now set size */
        tb->size = size;
        tb->sizemask = (uint32)(size - 1);


OOOOOOPS

While writting this letter, I looke at newdata in the frame of tuplehash_grow:

    newdata = 0x32e0448

It is bellow 4GB border. Allocator does not allocate many-gigabytes chunks (and we certainly need 96GB in this case) in sub 4GB address space. Because
mmap doesn't do this.

I went to check SH_GROW and.... It is `SH_GROW(SH_TYPE *tb, uint32 newsize)`
:-(((
Therefore when `tb->size == SH_MAX_SIZE/2` and we call `SH_GROW(tb, tb->size * 2)`,
then SH_GROW(tb, 0) is called due to truncation.
And SH_COMPUTE_PARAMETERS is also accepts `uint32 newsize`.

Ahh... ok, patch is updated to fix this as well.


regards.

-----

Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From a8283d3a17c630a65e1b42f8617e07873a30fbc7 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Tue, 10 Aug 2021 11:51:16 +0300
Subject: [PATCH] Fix new size and sizemask computaton in simplehash.h

Fix couple of 32/64bit related errors in simplehash.h:
- size of SH_GROW and SH_COMPUTE_PARAMETERS arguments
- computation of tb->sizemask.
---
 src/include/lib/simplehash.h | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index da51781e98e..2287601cfa1 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -302,7 +302,7 @@ SH_SCOPE void SH_STAT(SH_TYPE * tb);
  * the hashtable.
  */
 static inline void
-SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
+SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint64 newsize)
 {
 	uint64		size;
 
@@ -322,11 +322,7 @@ SH_COMPUTE_PARAMETERS(SH_TYPE * tb, uint32 newsize)
 
 	/* now set size */
 	tb->size = size;
-
-	if (tb->size == SH_MAX_SIZE)
-		tb->sizemask = 0;
-	else
-		tb->sizemask = tb->size - 1;
+	tb->sizemask = (uint32)(size - 1);
 
 	/*
 	 * Compute the next threshold at which we need to grow the hash table
@@ -476,7 +472,7 @@ SH_RESET(SH_TYPE * tb)
  * performance-wise, when known at some point.
  */
 SH_SCOPE void
-SH_GROW(SH_TYPE * tb, uint32 newsize)
+SH_GROW(SH_TYPE * tb, uint64 newsize)
 {
 	uint64		oldsize = tb->size;
 	SH_ELEMENT_TYPE *olddata = tb->data;
-- 
2.32.0

Reply via email to