adriangb commented on code in PR #16641: URL: https://github.com/apache/datafusion/pull/16641#discussion_r2179100056
########## datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs: ########## @@ -668,6 +668,15 @@ fn handle_hash_join( plan: &HashJoinExec, parent_required: OrderingRequirements, ) -> Result<Option<Vec<Option<OrderingRequirements>>>> { + // Anti-joins (LeftAnti or RightAnti) do not preserve meaningful input order, + // so sorting beforehand cannot be relied on. Bail out early for both flavors: + match plan.join_type() { + JoinType::LeftAnti | JoinType::RightAnti => { + return Ok(None); + } + _ => {} + } Review Comment: It makes sense that in that example no global order is guaranteed in the output, there was no sorting being done before the anti-join! I think if you pushed a SortExec you'd get the right order. Having thought about it a bit I do think this is a real bug that needs fixing, but I'm not sure it's about sorting, it's about limits: if you push the TopK below the anti join is the limit is applied before the anti join filters out rows, which means you'll end up getting less or the wrong rows out. For example, if the batch into the TopK(2) is | a | b | c | |---|---|---| | 2 | 3 | 5 | | 1 | 1 | 2 | | 2 | 4 | 1 | It will output | a | b | c | |---|---|---| | 1 | 1 | 2 | | 2 | 3 | 5 | But then if the anti join filters out `c = 5` we get: | a | b | c | |---|---|---| | 1 | 1 | 2 | Which is a single row despite the `LIMIT 2`. If instead we apply the anti-join first we get: | a | b | c | |---|---|---| | 1 | 1 | 2 | | 2 | 4 | 1 | Which is correct. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org