zhuqi-lucas commented on code in PR #16641:
URL: https://github.com/apache/datafusion/pull/16641#discussion_r2178965043


##########
datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs:
##########
@@ -668,6 +668,15 @@ fn handle_hash_join(
     plan: &HashJoinExec,
     parent_required: OrderingRequirements,
 ) -> Result<Option<Vec<Option<OrderingRequirements>>>> {
+    // Anti-joins (LeftAnti or RightAnti) do not preserve meaningful input 
order,
+    // so sorting beforehand cannot be relied on. Bail out early for both 
flavors:
+    match plan.join_type() {
+        JoinType::LeftAnti | JoinType::RightAnti => {
+            return Ok(None);
+        }
+        _ => {}
+    }

Review Comment:
   Thank you @adriangb, exactly you are right. Correct me if i am wrong.
   
   Why Anti‑Joins Are Different
   An anti‑join (LEFT ANTI or RIGHT ANTI) filters out all rows that do have 
matches, emitting only the non‑matching ones. Although maintains_input_order() 
still returns true—because if you emitted every probe row in match order, 
they’d be ordered—the act of dropping rows breaks that guarantee:
   
   Per‑batch filtering
   In a batched execution, each batch is sorted, then some rows are 
independently discarded by the anti‑join filter.
   
   No final merge
   When you push the sort below the anti‑join, you end up just concatenating 
each sorted batch’s output. A small key in a later batch can easily slip behind 
a larger key from an earlier batch, violating the global ORDER BY.
   
   That is why we must short‑circuit and refuse to push the sort beneath any 
anti‑join: without a final merge or a full sort after filtering, the 
concatenation of filtered, sorted batches cannot preserve the overall ordering 
guarantee.
   
   
   
   **Why Anti‑Joins Break Global Ordering: A Batch Example**
   
   Suppose we split the probe-side input into two batches, each sorted by `a`, 
and then simply concatenate their outputs after filtering. Here’s how the 
global order can be lost:
   
   ```sql
   -- Batch 1 input (sorted by a)
   a | b | c
   --+---+---
   1 | x | …
   3 | x | …
   
   -- Batch 2 input (sorted by a)
   a | b | c
   --+---+---
   2 | y | …
   4 | y | …
   
   -- Assume the anti‑join filter keeps all rows (no actual filtering for 
simplicity).
   -- So each batch output remains:
   --   Batch 1 → [ (1, x, …), (3, x, …) ]
   --   Batch 2 → [ (2, y, …), (4, y, …) ]
   
   -- Naively concatenating Batch 1 then Batch 2:
   1 | x | …  
   3 | x | …   ← (3 > 2) but appears before the row with a=2  
   2 | y | …  
   4 | y | …
   
   -- Final sequence: [1, 3, 2, 4], which is *not* globally sorted by `a`.
   
   -- In contrast, a true global merge of two sorted streams:
   --   merge([1,3], [2,4]) → [1,2,3,4]
   -- would preserve order.
   
   1. Sorting each batch in isolation only guarantees local order.  
   2. An anti‑join may filter out rows, but even without filtering, 
concatenating sorted batches breaks global ordering.  
   3. Therefore, we must prevent pushing a sort below an anti‑join unless we 
perform a final merge or full sort after filtering.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to