On Sat, Jun 17, 2023 at 8:38 AM Joel Jacobson <j...@compiler.org> wrote: > > On Fri, Jun 16, 2023, at 17:42, Joel Jacobson wrote: > > I realise int4hashset_hash() is broken, > > since two int4hashset's that are considered equal, > > can by coincidence get different hashes: > ... > > Do we have any ideas on how to fix this without sacrificing performance? > > The problem was due to hashset_hash() function accumulating the hashes > of individual elements in a non-commutative manner. As a consequence, the > final hash value was sensitive to the order in which elements were inserted > into the hashset. This behavior led to inconsistencies, as logically > equivalent sets (i.e., sets with the same elements but in different orders) > produced different hash values. > > Solved by modifying the hashset_hash() function to use a commutative operation > when combining the hashes of individual elements. This change ensures that the > final hash value is independent of the element insertion order, and logically > equivalent sets produce the same hash. > > An somewhat unfortunate side-effect of this fix, is that we can no longer > visually sort the hashset output format, since it's not lexicographically > sorted. > I think this is an acceptable trade-off for a hashset type, > since the only alternative I see would be to sort the elements, > but then it wouldn't be a hashset, but a treeset, which different > Big-O complexity. > > New patch is attached, which will henceforth always be a complete patch, > to avoid the hassle of having to assemble incremental patches. > > /Joel
select hashset_contains('{1,2}'::int4hashset,NULL::int); should return null? --------------------------------------------------------------------------------- SELECT attname ,pc.relname ,CASE attstorage WHEN 'p' THEN 'plain' WHEN 'e' THEN 'external' WHEN 'm' THEN 'main' WHEN 'x' THEN 'extended' END AS storage FROM pg_attribute pa join pg_class pc on pc.oid = pa.attrelid where attnum > 0 and pa.attstorage = 'e'; In my system catalog, it seems only the hashset type storage = 'external'. most is extended..... I am not sure the consequence of switch from external to extended. ------------------------------------------------------------------------------------------------------------ select hashset_hash('{-1,1}') as a1 ,hashset_hash('{1,-2}') as a2 ,hashset_hash('{-3,1}') as a3 ,hashset_hash('{4,1}') as a4; returns: a1 | a2 | a3 | a4 -------------+-----------+------------+------------ -1735582196 | 998516167 | 1337000903 | 1305426029 (1 row) values {a1,a2,a3,a4} should be monotone increasing, based on the function int4hashset_cmp, but now it's not. so the following queries failed. --should return only one row. select hashset_cmp('{2,1}','{3,1}') union select hashset_cmp('{3,1}','{4,1}') union select hashset_cmp('{1,3}','{4,1}'); select hashset_cmp('{9,10,11}','{10,9,-11}') = hashset_cmp('{9,10,11}','{10,9,-1}'); --should be true select '{2,1}'::int4hashset > '{7}'::int4hashset; --should be false. based on int array comparison,. ----------------------------------------------------------------------------------------- I comment out following lines in hashset-api.c somewhere between {810,829} // if (a->hash < b->hash) // PG_RETURN_INT32(-1); // else if (a->hash > b->hash) // PG_RETURN_INT32(1); // if (a->nelements < b->nelements) // PG_RETURN_INT32(-1); // else if (a->nelements > b->nelements) // PG_RETURN_INT32(1); // Assert(a->nelements == b->nelements); So hashset_cmp will directly compare int array. the above queries works. {int4hashset_equals,int4hashset_neq} two special cases of hashset_cmp. maybe we can just wrap it just like int4hashset_le? now store 10 element int4hashset need 99 bytes, similar one dimension bigint array with length 10, occupy 101 byte.... in int4hashset_send, newly add struct attributes/member {load_factor growth_factor ncollisions hash} also need send to buf?