adriangb commented on code in PR #16641:
URL: https://github.com/apache/datafusion/pull/16641#discussion_r2179100056


##########
datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs:
##########
@@ -668,6 +668,15 @@ fn handle_hash_join(
     plan: &HashJoinExec,
     parent_required: OrderingRequirements,
 ) -> Result<Option<Vec<Option<OrderingRequirements>>>> {
+    // Anti-joins (LeftAnti or RightAnti) do not preserve meaningful input 
order,
+    // so sorting beforehand cannot be relied on. Bail out early for both 
flavors:
+    match plan.join_type() {
+        JoinType::LeftAnti | JoinType::RightAnti => {
+            return Ok(None);
+        }
+        _ => {}
+    }

Review Comment:
   It makes sense that in that example no global order is guaranteed in the 
output, there was no sorting being done before the anti-join! I think if you 
pushed a SortExec you'd get the right order.
   
   Having thought about it a bit I do think this is a real bug that needs 
fixing, but I'm not sure it's about sorting, it's about limits: if you push the 
TopK below the anti join is the limit is applied before the anti join filters 
out rows, which means you'll end up getting less or the wrong rows out. For 
example, if the batch into the TopK(2) is
   
   | a | b | c |
   |---|---|---|
   | 2 | 3 | 5 |
   | 1 | 1 | 2 |
   | 2 | 4 | 1 |
   
   It will output
   
   | a | b | c |
   |---|---|---|
   | 1 | 1 | 2 |
   | 2 | 3 | 5 |
   
   But then if the anti join filters out `c = 5` we get:
   
   | a | b | c |
   |---|---|---|
   | 1 | 1 | 2 |
   
   Which is a single row despite the `LIMIT 2`.
   
   If instead we apply the anti-join first we get:
   
   | a | b | c |
   |---|---|---|
   | 1 | 1 | 2 |
   | 2 | 4 | 1 |
   
   Which is correct.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to