On 6/5/23 21:52, Joel Jacobson wrote:
> On Mon, Jun 5, 2023, at 01:44, Tomas Vondra wrote:
>> Anyway, I hacked together a trivial set backed by an open addressing
>> hash table:
>>
>> https://github.com/tvondra/hashset
>>
>> It's super-basic, providing just some bare minimum of functions, but
>> hopefully good enough for experiments.
>>
>> - hashset data type - hash table in varlena
>> - hashset_init - create new hashset
>> - hashset_add / hashset_contains - add value / check membership
>> - hashset_merge - merge two hashsets
>> - hashset_count - count elements
>> - hashset_to_array - return array of elements
>> - hashset(int) aggregate
>>
>> This allows rewriting the recursive query like this:
>>
>> WITH RECURSIVE friends_of_friends AS (
> ...
>
> Nice! I get similar results, 737 ms (hashset) vs 1809 ms (array_agg).
>
> I think if you just add one more hashset function, it will be a win
> against roaringbitmap, which is 400 ms.
>
> The missing function is an agg that takes hashset as input and returns
> hashset, similar to roaringbitmap's rb_or_agg().
>
> With such a function, we could add an adjacent-nodes hashset column to
> the `nodes` table, which would eliminate the need to scan the edges
> table for graph traversal:
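For context, here is a minimal sketch (in Python, with made-up names) of the open-addressing scheme such a set uses internally - linear probing into a flat slot array, growing before the table fills up. This is only an illustration of the data structure, not the extension's actual C implementation:

```python
# Minimal open-addressing integer set with linear probing.
# Names and load factor are illustrative, not taken from the extension.

class IntHashSet:
    EMPTY = None

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.count = 0
        self.slots = [self.EMPTY] * capacity

    def _probe(self, value):
        # Linear probing: scan from the hash slot until we find the
        # value or hit an empty slot.
        i = hash(value) % self.capacity
        while self.slots[i] is not self.EMPTY and self.slots[i] != value:
            i = (i + 1) % self.capacity
        return i

    def add(self, value):
        # Grow at 75% load factor to keep probe chains short.
        if self.count + 1 > 0.75 * self.capacity:
            self._grow()
        i = self._probe(value)
        if self.slots[i] is self.EMPTY:
            self.slots[i] = value
            self.count += 1

    def contains(self, value):
        return self.slots[self._probe(value)] == value

    def merge(self, other):
        # Equivalent of hashset_merge: add every occupied slot.
        for v in other.slots:
            if v is not other.EMPTY:
                self.add(v)

    def _grow(self):
        # Rehash everything into a table twice the size.
        old = self.slots
        self.capacity *= 2
        self.slots = [self.EMPTY] * self.capacity
        self.count = 0
        for v in old:
            if v is not self.EMPTY:
                self.add(v)
```

The real thing stores the table in a varlena so it can live in a table column, but the probing logic is the same idea.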
I added a trivial version of such an aggregate, hashset(hashset), and if
I rewrite the CTE like this:

WITH RECURSIVE friends_of_friends AS (
    SELECT
        (SELECT hashset(v) FROM (VALUES (5867)) AS s(v)) AS current,
        0 AS depth
    UNION ALL
    SELECT
        new_current,
        friends_of_friends.depth + 1
    FROM
        friends_of_friends
    CROSS JOIN LATERAL (
        SELECT
            hashset(edges.to_node) AS new_current
        FROM
            edges
        WHERE
            from_node = ANY(hashset_to_array(friends_of_friends.current))
    ) q
    WHERE
        friends_of_friends.depth < 3
)
SELECT
    depth,
    hashset_count(current)
FROM
    friends_of_friends
WHERE
    depth = 3;

it cuts the timing to about 50% on my laptop, so maybe it'll be ~300 ms
on your system.

There's a bunch of opportunities for more improvements - the hash table
implementation is pretty naive/silly, the on-disk format is wasteful,
and so on. But before spending more time on that, it'd be interesting to
know what a competitive timing would be. I mean, what would be "good
enough"? What timings are achievable with graph databases?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
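PS: for anyone following along, the traversal the CTE performs is a plain level-by-level frontier expansion - each iteration replaces the current set with the set of nodes reachable from it. A sketch in Python over a toy edge list (the graph here is made up; the real query runs against the edges table):

```python
# Level-by-level friends-of-friends expansion, mirroring the recursive
# CTE: on each iteration the frontier becomes the set of destination
# nodes reachable from the current frontier.
def friends_at_depth(edges, start, max_depth):
    current = {start}
    for _ in range(max_depth):
        # hashset(edges.to_node) WHERE from_node = ANY(current)
        current = {dst for (src, dst) in edges if src in current}
    return current

# Toy graph for illustration only.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
```

Note this keeps only the final frontier, just like the outer query's `WHERE depth = 3`.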