Re: Do we want a hashset type?

Joel Jacobson Wed, 28 Jun 2023 01:50:56 -0700

On Wed, Jun 28, 2023, at 08:26, jian he wrote:

> Hi there.
> I changed the function hashset_contains to strict.


Changing hashset_contains to STRICT would cause it to return NULL
if any of the operands are NULL, which I don't believe is correct, since:

SELECT NULL = ANY('{}'::int4[]);
 ?column?
----------
 f
(1 row)

Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE,
to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023.

Did you try running `make installcheck` after your change?
You would then have seen one of the tests failing:

test array-and-multiset-semantics ... FAILED       21 ms

Check the content of `regression.diffs` to see why:

% cat regression.diffs
diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 
/Users/joel/src/hashset/results/array-and-multiset-semantics.out
--- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out      
2023-06-27 10:07:38
+++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out    
2023-06-28 10:13:27
@@ -158,7 +158,7 @@
         |      | {NULL}      | {NULL}       |                  |
         |    1 | {1}         | {1}          |                  |
         |    4 | {4}         | {4}          |                  |
- {}     |      | {NULL}      | {NULL}       | f                | f
+ {}     |      | {NULL}      | {NULL}       |                  | f
  {}     |    1 | {1}         | {1}          | f                | f
  {}     |    4 | {4}         | {4}          | f                | f
  {NULL} |      | {NULL}      | {NULL,NULL}  |                  |
@@ -284,7 +284,8 @@
     "= ANY(...)";
  arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...)
 ------+------+-------------+--------------+------------------+------------
-(0 rows)
+ {}   |      | {NULL}      | {NULL}       |                  | f
+(1 row)


> also change the way to return an empty array.

Nice.
I agree the `Datum d` variable was unnecessary.
I also removed the unused includes.

> in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that
> int4hashset can speed distinct aggregate and distinct counts?
> like the following:
>
> explain(analyze, costs off, timing off, buffers)
> SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3
>
> explain(analyze, costs off, timing off, buffers)
> SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3

The 100k tables seems to be too small to give any meaningful results,
when trying to measure individual queries:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_100k;
 Execution Time: 26.790 ms
 Execution Time: 30.616 ms
 Execution Time: 33.253 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_100k;
 Execution Time: 32.797 ms
 Execution Time: 27.605 ms
 Execution Time: 26.228 ms

If we instead try the 10M tables, it looks like array_agg(DISTINCT ...)
is actually faster for the `i` column where all input integers are unique:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT i) FROM benchmark_input_10M;
 Execution Time: 799.017 ms
 Execution Time: 796.008 ms
 Execution Time: 799.121 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(i) FROM benchmark_input_10M;
 Execution Time: 1204.873 ms
 Execution Time: 1221.822 ms
 Execution Time: 1216.340 ms

For random integers, hashset is a win though:

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M;
 Execution Time: 1874.722 ms
 Execution Time: 1878.760 ms
 Execution Time: 1861.640 ms

EXPLAIN(analyze, costs off, timing off, buffers)
SELECT hashset_agg(rnd) FROM benchmark_input_10M;
 Execution Time: 1253.709 ms
 Execution Time: 1222.651 ms
 Execution Time: 1237.849 ms

> explain(costs off,timing off, analyze,buffers)
> select count(distinct rnd) from benchmark_input_100k \watch c=3
>
> explain(costs off,timing off, analyze,buffers)
> SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM
> benchmark_input_100k) sub(x) \watch c=3

I tried these with 10M:

EXPLAIN(costs off,timing off, analyze,buffers)
SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M;
 Execution Time: 1733.320 ms
 Execution Time: 1725.214 ms
 Execution Time: 1716.636 ms

EXPLAIN(costs off,timing off, analyze,buffers)
SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM 
benchmark_input_10M) sub(x);
 Execution Time: 1249.612 ms
 Execution Time: 1240.558 ms
 Execution Time: 1252.103 ms

Not sure what I think of the current benchmark suite.

I think it would be better to only include some realistic examples from
real-life, such as the graph query which was the reason I personally started
working on this. Otherwise there is a risk we optimise for some hypothetical
scenario that is not relevant in practise.

Would be good with more examples of typical work loads for when the hashset
type would be useful.

/Joel

Re: Do we want a hashset type?

Reply via email to