thinkharderdev commented on issue #12454:
URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2355848857

   > We had a discussion about join of huge table with small table here: [#7000 
(comment)](https://github.com/apache/datafusion/issues/7000#issuecomment-2094813305)
   > 
   > There are several approaches discussed:
   > 
   > 1. Use Dictionaries (like in Clickhouse, 
[details](https://clickhouse.com/docs/en/sql-reference/dictionaries)) with 
dictionary api (UDF like `get_dict('users', user_id)`)
   > 
   > > ClickHouse supports special functions for working with dictionaries that 
can be used in queries. It is easier and more efficient to use dictionaries 
with functions than a JOIN with reference tables.
   > 
   > 2. Use row-based layout for small table with btree/hash index for getting 
data by a key while joining (in poll method)
   >    This may lead us to create a special type of tables like RowBasedTable 
with special RowBasedTableProvider, which can give an access to data using 
hash-based indices or iterator for btree ones.
   
   I don't think either of these would actually solve the issue with outer 
joins. The problem is that an outer join requires state shared across all 
partitions in the join stage: every partition must know which build-side rows 
have been matched anywhere in the stage before the unmatched rows can be 
emitted. In single-node execution you can hold this as shared in-memory state 
(which is what `HashJoinStream` does), but in a distributed environment there 
is no way to share that state without some external mechanism for the 
execution nodes to communicate. The only way to deal with it (in the current 
execution model) is to coalesce back to a single node, which is impractical 
for datasets of sufficient size.
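   To illustrate the coordination problem, here is a minimal sketch (not DataFusion's actual API; the partition layout and row indices are invented for the example) of why a full outer join needs a stage-wide "visited" bitmap over the build side. On a single node the bitmap can simply be shared memory, as below; across machines there is no equivalent without extra communication:

   ```rust
   use std::sync::{Arc, Mutex};
   use std::thread;

   fn main() {
       // Build side has 6 rows; 3 probe partitions run in parallel.
       let build_side_len = 6;

       // Shared bitmap of matched build-side rows, analogous to the
       // in-memory state HashJoinStream keeps on a single node.
       let visited: Arc<Mutex<Vec<bool>>> =
           Arc::new(Mutex::new(vec![false; build_side_len]));

       // Each partition happens to match a different subset of build rows
       // (hypothetical data for the sketch).
       let matches_per_partition: Vec<Vec<usize>> =
           vec![vec![0, 2], vec![2, 4], vec![1]];

       let handles: Vec<_> = matches_per_partition
           .into_iter()
           .map(|matched| {
               let visited = Arc::clone(&visited);
               thread::spawn(move || {
                   // A partition marks every build-side row it joined.
                   for idx in matched {
                       visited.lock().unwrap()[idx] = true;
                   }
               })
           })
           .collect();
       for h in handles {
           h.join().unwrap();
       }

       // Only after *every* partition has finished can the unmatched
       // build-side rows be emitted with NULL probe columns.
       let unmatched: Vec<usize> = visited
           .lock()
           .unwrap()
           .iter()
           .enumerate()
           .filter(|(_, m)| !**m)
           .map(|(i, _)| i)
           .collect();
       println!("{:?}", unmatched);
   }
   ```

   In a distributed plan the three "threads" are separate executors, so the `Arc<Mutex<...>>` has no counterpart: the unmatched-row pass can only run somewhere that sees the union of all partitions' bitmaps.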


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

