karlovnv commented on issue #12454: URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2356209473
> > We had a discussion about join of huge table with small table here: [#7000 (comment)](https://github.com/apache/datafusion/issues/7000#issuecomment-2094813305)
> >
> > There are several approaches discussed:
>
> I don't think either of these would actually solve the issue with outer joins.

I understand your point. I mentioned ClickHouse and its dictionaries just to show that sometimes an actual join is not necessary even when the query is logically a join. I agree that this works only for **inner** joins (there we can replace the join with a lookup in the dictionary). ClickHouse keeps a copy of each dictionary (as a file, as a connection to an external DB, etc.) on every node, which avoids having to synchronize join / dictionary replacements across the cluster.

> in a distributed environment there is no way to share that state without some external mechanism

Yes, this is true. There are some examples: Greenplum, for instance, sends portions of data (and state) across the shards to parallelize execution. That is a very difficult thing to develop. ClickHouse shards don't send their data in that manner, so CH allows joining only two distributed tables, no more. That is why some people choose Greenplum for its ability to perform any query even though it can be slow, while others choose CH for its speed despite its limitations.

So I think we could treat a small Facts table as a kind of dictionary and build the HashJoin index on each node from a local copy of that table. This would let us do the join without having to develop an external mechanism for communication.
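To make the last idea concrete, here is a minimal sketch in plain Python, not DataFusion's API. It assumes the small table fits in memory on every node and that only an inner join is needed. The function names (`build_hash_index`, `probe_shard`) and the sample data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def build_hash_index(small_rows, key):
    """Build phase: map each join-key value to the small-table rows that carry it."""
    index = defaultdict(list)
    for row in small_rows:
        index[row[key]].append(row)
    return index


def probe_shard(shard_rows, index, key):
    """Probe phase (inner join): emit one merged row per match, skip unmatched rows."""
    for row in shard_rows:
        for match in index.get(row[key], ()):
            yield {**match, **row}


# Hypothetical data: a small table that is replicated to every node ...
small_table = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
]
# ... and a large table split into per-node shards.
shards = [
    [{"id": 1, "value": 10}, {"id": 3, "value": 30}],
    [{"id": 2, "value": 20}, {"id": 1, "value": 40}],
]

# Each "node" builds its own index from its local copy and probes only its own shard,
# so no join state has to be exchanged between nodes.
results = []
for shard in shards:
    local_index = build_hash_index(small_table, "id")
    results.extend(probe_shard(shard, local_index, "id"))
```

Because every node sees the whole small table, each shard can decide locally whether one of its rows has a match, which is all an inner join needs. The sketch does not attempt the outer-join case raised in the quoted reply.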
