adriangb commented on code in PR #16641: URL: https://github.com/apache/datafusion/pull/16641#discussion_r2178887207
##########
datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs:
##########
@@ -668,6 +668,15 @@ fn handle_hash_join(
     plan: &HashJoinExec,
     parent_required: OrderingRequirements,
 ) -> Result<Option<Vec<Option<OrderingRequirements>>>> {
+    // Anti-joins (LeftAnti or RightAnti) do not preserve meaningful input order,
+    // so sorting beforehand cannot be relied on. Bail out early for both flavors:
+    match plan.join_type() {
+        JoinType::LeftAnti | JoinType::RightAnti => {
+            return Ok(None);
+        }
+        _ => {}
+    }

Review Comment:
   Okay interesting. Thank you for the in-depth explanation. Could you show with a simple table representing a batch how the anti join filtering out rows causes ordering to be lost?

   I imagine something like this flowing into the anti join:

   | a | b | c |
   |---|---|---|
   | 1 | 1 | 2 |
   | 2 | 3 | 5 |
   | 2 | 4 | 1 |

   For the query `SELECT a, b, c FROM t1 WHERE c NOT IN (SELECT n FROM t2) ORDER BY t1.a, t1.b` I expect the anti join to be created and to remove some rows. Let's say `SELECT n FROM t2` returns just `5`; then the output would be:

   | a | b | c |
   |---|---|---|
   | 1 | 1 | 2 |
   | 2 | 4 | 1 |

   Which is still ordered. Are you saying it might be

   | a | b | c |
   |---|---|---|
   | 2 | 4 | 1 |
   | 1 | 1 | 2 |

   instead?
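
   To make the expectation above concrete, here is a minimal sketch of the scenario in plain Rust. It is not DataFusion's `HashJoinExec`: the `Row` alias and the functions `anti_join_filter`, `is_sorted_by_a_b`, and `example` are hypothetical names introduced only for illustration. It assumes a single partition, no NULLs, and that the anti join behaves as a streaming filter over the `t1` side. Under those assumptions, dropping rows cannot reorder the survivors; whether the real operator satisfies them is exactly the open question.

```rust
use std::collections::HashSet;

/// One row of the hypothetical table `t1`, as (a, b, c).
type Row = (i32, i32, i32);

/// Keeps the rows whose `c` value is absent from `build_keys`,
/// visiting `probe` in input order. This models the anti join as a
/// plain streaming filter (an assumption, not DataFusion's operator).
fn anti_join_filter(probe: &[Row], build_keys: &HashSet<i32>) -> Vec<Row> {
    probe
        .iter()
        .copied()
        .filter(|&(_, _, c)| !build_keys.contains(&c))
        .collect()
}

/// Returns true when `rows` is sorted ascending by (a, b).
fn is_sorted_by_a_b(rows: &[Row]) -> bool {
    rows.windows(2)
        .all(|w| (w[0].0, w[0].1) <= (w[1].0, w[1].1))
}

/// Rebuilds the example batch from the comment: `t2` yields only `5`.
fn example() -> Vec<Row> {
    let t1 = [(1, 1, 2), (2, 3, 5), (2, 4, 1)];
    let t2: HashSet<i32> = vec![5].into_iter().collect();
    anti_join_filter(&t1, &t2)
}
```

   Calling `example()` yields the two surviving rows in their original relative order, and `is_sorted_by_a_b` accepts that result while rejecting the swapped variant from the last table. If the real plan emits the swapped order, the cause would have to lie outside this simple filter model.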