On 6/5/23 21:52, Joel Jacobson wrote:
> On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
>> Anyway, I hacked together a trivial set backed by an open addressing
>> hash table:
>>
>> https://github.com/tvondra/hashset
>>
>> It's super-basic, providing just some bare minimum of functions, but
>> hopefully good enough for experiments.
>>
>> - hashset data type - hash table in varlena
>> - hashset_init - create new hashset
>> - hashset_add / hashset_contains - add value / check membership
>> - hashset_merge - merge two hashsets
>> - hashset_count - count elements
>> - hashset_to_array - return array of elements
>> - hashset(int) aggregate
>>
>> This allows rewriting the recursive query like this:
>>
>> WITH RECURSIVE friends_of_friends AS (
> ...
>
> Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).
>
> I think if you just add one more hashset function, it will be a win
> against roaringbitmap, which is 400 ms.
>
> The missing function is an agg that takes hashset as input and returns
> hashset, similar to roaringbitmap's rb_or_agg().
>
> With such a function, we could add an adjacent-nodes hashset column to
> the `nodes` table, which would eliminate the need to scan the edges
> table for graph traversal:
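For context, here is a minimal sketch (in Python, with made-up names) of the open-addressing scheme such a set uses internally - linear probing into a flat slot array, growing before the table fills up. This is only an illustration of the data structure, not the extension's actual C implementation:

```python
# Minimal open-addressing integer set with linear probing.
# Names and load factor are illustrative, not taken from the extension.

class IntHashSet:
    EMPTY = None

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.count = 0
        self.slots = [self.EMPTY] * capacity

    def _probe(self, value):
        # Linear probing: scan from the hash slot until we find the
        # value or hit an empty slot.
        i = hash(value) % self.capacity
        while self.slots[i] is not self.EMPTY and self.slots[i] != value:
            i = (i + 1) % self.capacity
        return i

    def add(self, value):
        # Grow at 75% load factor to keep probe chains short.
        if self.count + 1 > 0.75 * self.capacity:
            self._grow()
        i = self._probe(value)
        if self.slots[i] is self.EMPTY:
            self.slots[i] = value
            self.count += 1

    def contains(self, value):
        return self.slots[self._probe(value)] == value

    def merge(self, other):
        # Equivalent of hashset_merge: add every occupied slot.
        for v in other.slots:
            if v is not other.EMPTY:
                self.add(v)

    def _grow(self):
        # Rehash everything into a table twice the size.
        old = self.slots
        self.capacity *= 2
        self.slots = [self.EMPTY] * self.capacity
        self.count = 0
        for v in old:
            if v is not self.EMPTY:
                self.add(v)
```

The real thing stores the table in a varlena so it can live in a table column, but the probing logic is the same idea.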
I added a trivial version of such an aggregate, hashset(hashset), and if
I rewrite the CTE like this:

WITH RECURSIVE friends_of_friends AS (
    SELECT
        (SELECT hashset(v) FROM (VALUES (5867)) AS s(v)) AS current,
        0 AS depth
    UNION ALL
    SELECT
        new_current,
        friends_of_friends.depth + 1
    FROM
        friends_of_friends
    CROSS JOIN LATERAL (
        SELECT
            hashset(edges.to_node) AS new_current
        FROM
            edges
        WHERE
            from_node = ANY(hashset_to_array(friends_of_friends.current))
    ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    depth,
    hashset_count(current)
FROM
    friends_of_friends
WHERE
    depth = 3;

it cuts the timing to about 50% on my laptop, so maybe it'll be ~300 ms
on your system.

There's a bunch of opportunities for more improvements - the hash table
implementation is pretty naive/silly, the on-disk format is wasteful,
and so on. But before spending more time on that, it'd be interesting to
know what a competitive timing would be. I mean, what would be "good
enough"? What timings are achievable with graph databases?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
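PS: for anyone following along, the traversal the CTE performs is a plain level-by-level frontier expansion - each iteration replaces the current set with the set of nodes reachable from it. A sketch in Python over a toy edge list (the graph here is made up; the real query runs against the edges table):

```python
# Level-by-level friends-of-friends expansion, mirroring the recursive
# CTE: on each iteration the frontier becomes the set of destination
# nodes reachable from the current frontier.
def friends_at_depth(edges, start, max_depth):
    current = {start}
    for _ in range(max_depth):
        # hashset(edges.to_node) WHERE from_node = ANY(current)
        current = {dst for (src, dst) in edges if src in current}
    return current

# Toy graph for illustration only.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
```

Note this keeps only the final frontier, just like the outer query's `WHERE depth = 3`.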