On 6/8/23 11:41, Joel Jacobson wrote: > On Wed, Jun 7, 2023, at 19:37, Tomas Vondra wrote: >> Interesting, considering how dumb the the hash table implementation is. > > That's promising. >
Yeah, not bad for sleep-deprived on-plane hacking ... There's a bunch of stuff that needs to be improved to make this properly usable, like: 1) better hash table implementation 2) input/output functions 3) support for other types (now it only works with int32) 4) I wonder if this might be done as an array-like polymorphic type. 5) more efficient storage format, with versioning etc. 6) regression tests Would you be interested in helping with / working on some of that? I don't have immediate need for this stuff, so it's not very high on my TODO list. >>> I tested Neo4j and the results are surprising; it appears to be >>> significantly *slower*. >>> However, I've probably misunderstood something, maybe I need to add some >>> index or something. >>> Even so, it's interesting it's apparently not fast "by default". >>> >> >> No idea how to fix that, but it's rather suspicious. > > I've had a graph-db expert review my benchmark, and he suggested adding an > index: > > CREATE INDEX FOR (n:User) ON (n.id) > > This did improve the execution time for Neo4j a bit, down from 819 ms to 528 > ms, but PostgreSQL 299 ms is still a win. > > Benchmark here: https://github.com/joelonsql/graph-query-benchmarks > Note, in this benchmark, I only test the naive RECURSIVE CTE approach using > array_agg(DISTINCT ...). > And I couldn't even test the most connected user with Neo4j, the query never > finish for some reason, > so I had to test with a less connected user. > Interesting. I'd have expected the graph db to be much faster. > The graph expert also said that other more realistic graph use-cases might be > "multi-relational", > and pointed me to a link: > https://github.com/totogo/awesome-knowledge-graph#knowledge-graph-dataset > No idea how such multi-relational datasets would affect the benchmarks. > Not sure either, but I don't have ambition to improve everything at once. If the hashset improves one practical use case, fine with me. > I think we have already strong indicators that PostgreSQL with a hashset type > will from a relative > performance perspective, do just fine processing basic graph queries, even > with large datasets. > > Then there will always be the case when users primarily write very different > graph queries all day long, > who might prefer a graph query *language*, like SQL/PGQ in SQL:2023, Cypher > or Gremlin. > Right. IMHO the query language is a separate thing, you still need to evaluate the query somehow - which is where hashset applies. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company