thinkharderdev commented on issue #12454: URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2355848857
> We had a discussion about joining a huge table with a small table here: [#7000 (comment)](https://github.com/apache/datafusion/issues/7000#issuecomment-2094813305)
>
> There are several approaches discussed:
>
> 1. Use dictionaries (like in ClickHouse, [details](https://clickhouse.com/docs/en/sql-reference/dictionaries)) with a dictionary API (a UDF like `get_dict('users', user_id)`):
>
>    > ClickHouse supports special functions for working with dictionaries that can be used in queries. It is easier and more efficient to use dictionaries with functions than a JOIN with reference tables.
>
> 2. Use a row-based layout for the small table with a btree/hash index for looking up data by key while joining (in the poll method). This may lead us to create a special type of table like `RowBasedTable` with a special `RowBasedTableProvider`, which can give access to data via hash-based indices, or via an iterator for btree ones.

I don't think either of these would actually solve the issue with outer joins. The problem is that the join stage needs shared state across its partitions tracking which build-side rows have been matched anywhere in the entire stage. In single-node execution you can handle this as shared in-memory state (which is what `HashJoinStream` does), but in a distributed environment there is no way to share that state without some external mechanism for the execution nodes to communicate. The only way to deal with it (in the current execution model) is to coalesce back to a single node, which is impractical for datasets of sufficient size.
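To make the shared-state problem concrete, here is a minimal single-node sketch (not DataFusion's actual implementation; `run_outer_join_stage` and its types are hypothetical). Each probe partition marks matched build-side rows in one bitmap shared across all partitions, and only after every partition finishes can the unmatched rows be null-extended. It is exactly this stage-wide `visited` bitmap that has no home once the partitions run on separate nodes:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical sketch of a left-outer hash join stage: every probe partition
// records which build-side rows it matched in a bitmap shared by the whole
// stage. Returns the build-side keys that matched no probe row at all.
fn run_outer_join_stage(build_keys: &[i64], probe_partitions: Vec<Vec<i64>>) -> Vec<i64> {
    // Shared in-memory state: one "visited" flag per build-side row.
    // In a distributed setting there is no equivalent of this Arc<Mutex<..>>.
    let visited = Arc::new(Mutex::new(vec![false; build_keys.len()]));
    let build: Arc<Vec<i64>> = Arc::new(build_keys.to_vec());

    let handles: Vec<_> = probe_partitions
        .into_iter()
        .map(|probe| {
            let visited = Arc::clone(&visited);
            let build = Arc::clone(&build);
            // One thread stands in for one partition of the join stage.
            thread::spawn(move || {
                for key in probe {
                    if let Some(idx) = build.iter().position(|&b| b == key) {
                        visited.lock().unwrap()[idx] = true;
                    }
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    // Only after ALL partitions complete do we know which build rows were
    // never matched; these produce the null-extended outer-join output.
    let visited = visited.lock().unwrap();
    build
        .iter()
        .zip(visited.iter())
        .filter(|(_, &v)| !v)
        .map(|(&k, _)| k)
        .collect()
}

fn main() {
    // Build side has keys 1..4; two probe partitions match keys 2 and 4.
    let unmatched = run_outer_join_stage(&[1, 2, 3, 4], vec![vec![2], vec![4, 4]]);
    println!("{:?}", unmatched); // keys 1 and 3 were never matched
}
```

Note that no single partition can emit the unmatched rows on its own: a row unmatched in one partition may still be matched in another, which is why the final pass must see the merged bitmap from every partition.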
