alamb commented on issue #12454:
URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2349892215
One challenge I predict with the above scenario is that it seems to assume
that the order of rows from the build side will be the same on all nodes across
all partitions (so you can match up the BooleanBuffer across ndoes)
> This sort of goes against the idea that DataFusion itself is not a library
for distributed query execution, b
I think adding hooks for making distributed engines is a very reasonable
discussion
> The correct way to do this in a distributed execution is to use a
partitioned join and repartition data but this is a problem because data is
huge and the repartition would require shuffling a potentially massive amount
of data.
I am not sure this is the "correct" way though it is certainly one way to do
it.
It seems like the core challenge you are describing is finding all rows in a
small `facts` table that did not match any rows when joined with all rows of an
arbitrarily distributed `data` table
You could manage this via a distributed state as you suggest. Another way
might be to rewrite the query (automatically) to do the check of which rows
didn't match on a single node
The single node might seem like a bad idea, but if the `facts` is really
small the cost of rehashing it is probably low
So do something like
```
WITH
(SELECT facts.fact_value fact_value, data.id did, data.fact_id fact_id
FROM facts JOIN data -- Note this is now INNER join, done in distributed
fashion
ON data.fact_id = fact.id)
as join_result
SELECT
f1.fv fv,
did,
f1.fid fid
FROM facts f1 OUTER JOIN join_result --- this outer query fills in all
missing fact rows
ON f1.fact_id = join_result.fact_id
```
The idea is that you run the outer query either on a single node or after
redistributing both sides on fid
This does mean you have to hash `facts` again, but you don't have to move
the `data` tables around
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]