thinkharderdev commented on issue #12454: URL: https://github.com/apache/datafusion/issues/12454#issuecomment-2350024323
> I am not sure this is the "correct" way though it is certainly one way to do it. Yeah, fair. I meant basically that it seems like the way to do it that does not require any distributed state shared across distributed execution nodes. > So do something like This was an approach we tried (or something very similar in spirit at least) but it breaks down when you have a join + aggregation. E.g. with a query like ``` SELECT data.key, SUM(CASE WHEN facts.fact_value IS NULL THEN 0 ELSE 1) as matched, SUM(CASE WHEN facts.fact_value IS NULL THEN 1 ELSE 0) as unmatched FROM facts OUTER JOIN data ON data.fact_id = fact.id GROUP BY data.key ``` that tries to calculate the matched vs unmatched rows from `data`. By adding a coalesce to do the outer join you can no longer fuse the partial aggregation with the hash join and you end up back in the position of having to shuffle huge amounts of data FWIW, I implemented my proposal from above on our internal fork of DataFusion and it's not terribly intrusive in the DataFusion code (IMO): https://github.com/coralogix/arrow-datafusion/pull/269 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
