On Wed, Jun 28, 2023, at 08:26, jian he wrote: > Hi there. > I changed the function hashset_contains to strict.
Changing hashset_contains to STRICT would cause it to return NULL if any of the operands are NULL, which I don't believe is correct, since: SELECT NULL = ANY('{}'::int4[]); ?column? ---------- f (1 row) Hence, `hashset_contains('{}'::int4hashset, NULL)` should also return FALSE, to mimic the semantics of arrays and MULTISET's MEMBER OF predicate in SQL:2023. Did you try running `make installcheck` after your change? You would then have seen one of the tests failing: test array-and-multiset-semantics ... FAILED 21 ms Check the content of `regression.diffs` to see why: % cat regression.diffs diff -U3 /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out /Users/joel/src/hashset/results/array-and-multiset-semantics.out --- /Users/joel/src/hashset/test/expected/array-and-multiset-semantics.out 2023-06-27 10:07:38 +++ /Users/joel/src/hashset/results/array-and-multiset-semantics.out 2023-06-28 10:13:27 @@ -158,7 +158,7 @@ | | {NULL} | {NULL} | | | 1 | {1} | {1} | | | 4 | {4} | {4} | | - {} | | {NULL} | {NULL} | f | f + {} | | {NULL} | {NULL} | | f {} | 1 | {1} | {1} | f | f {} | 4 | {4} | {4} | f | f {NULL} | | {NULL} | {NULL,NULL} | | @@ -284,7 +284,8 @@ "= ANY(...)"; arg1 | arg2 | hashset_add | array_append | hashset_contains | = ANY(...) ------+------+-------------+--------------+------------------+------------ -(0 rows) + {} | | {NULL} | {NULL} | | f +(1 row) > also change the way to return an empty array. Nice. I agree the `Datum d` variable was unnecessary. I also removed the unused includes. > in benchmark.sql, would it be ok to use EXPLAIN to demonstrate that > int4hashset can speed distinct aggregate and distinct counts? > like the following: > > explain(analyze, costs off, timing off, buffers) > SELECT array_agg(DISTINCT i) FROM benchmark_input_100k \watch c=3 > > explain(analyze, costs off, timing off, buffers) > SELECT hashset_agg(i) FROM benchmark_input_100k \watch c=3 The 100k tables seems to be too small to give any meaningful results, when trying to measure individual queries: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT i) FROM benchmark_input_100k; Execution Time: 26.790 ms Execution Time: 30.616 ms Execution Time: 33.253 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(i) FROM benchmark_input_100k; Execution Time: 32.797 ms Execution Time: 27.605 ms Execution Time: 26.228 ms If we instead try the 10M tables, it looks like array_agg(DISTINCT ...) is actually faster for the `i` column where all input integers are unique: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT i) FROM benchmark_input_10M; Execution Time: 799.017 ms Execution Time: 796.008 ms Execution Time: 799.121 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(i) FROM benchmark_input_10M; Execution Time: 1204.873 ms Execution Time: 1221.822 ms Execution Time: 1216.340 ms For random integers, hashset is a win though: EXPLAIN(analyze, costs off, timing off, buffers) SELECT array_agg(DISTINCT rnd) FROM benchmark_input_10M; Execution Time: 1874.722 ms Execution Time: 1878.760 ms Execution Time: 1861.640 ms EXPLAIN(analyze, costs off, timing off, buffers) SELECT hashset_agg(rnd) FROM benchmark_input_10M; Execution Time: 1253.709 ms Execution Time: 1222.651 ms Execution Time: 1237.849 ms > explain(costs off,timing off, analyze,buffers) > select count(distinct rnd) from benchmark_input_100k \watch c=3 > > explain(costs off,timing off, analyze,buffers) > SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM > benchmark_input_100k) sub(x) \watch c=3 I tried these with 10M: EXPLAIN(costs off,timing off, analyze,buffers) SELECT COUNT(DISTINCT rnd) FROM benchmark_input_10M; Execution Time: 1733.320 ms Execution Time: 1725.214 ms Execution Time: 1716.636 ms EXPLAIN(costs off,timing off, analyze,buffers) SELECT hashset_cardinality(x) FROM (SELECT hashset_agg(rnd) FROM benchmark_input_10M) sub(x); Execution Time: 1249.612 ms Execution Time: 1240.558 ms Execution Time: 1252.103 ms Not sure what I think of the current benchmark suite. I think it would be better to only include some realistic examples from real-life, such as the graph query which was the reason I personally started working on this. Otherwise there is a risk we optimise for some hypothetical scenario that is not relevant in practise. Would be good with more examples of typical work loads for when the hashset type would be useful. /Joel