zhuqi-lucas commented on code in PR #16641: URL: https://github.com/apache/datafusion/pull/16641#discussion_r2178965043
########## datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs: ########## @@ -668,6 +668,15 @@ fn handle_hash_join( plan: &HashJoinExec, parent_required: OrderingRequirements, ) -> Result<Option<Vec<Option<OrderingRequirements>>>> { + // Anti-joins (LeftAnti or RightAnti) do not preserve meaningful input order, + // so sorting beforehand cannot be relied on. Bail out early for both flavors: + match plan.join_type() { + JoinType::LeftAnti | JoinType::RightAnti => { + return Ok(None); + } + _ => {} + } Review Comment: Thank you @adriangb, exactly you are right. Correct me if i am wrong. Why Anti‑Joins Are Different An anti‑join (LEFT ANTI or RIGHT ANTI) filters out all rows that do have matches, emitting only the non‑matching ones. Although maintains_input_order() still returns true—because if you emitted every probe row in match order, they’d be ordered—the act of dropping rows breaks that guarantee: Per‑batch filtering In a batched execution, each batch is sorted, then some rows are independently discarded by the anti‑join filter. No final merge When you push the sort below the anti‑join, you end up just concatenating each sorted batch’s output. A small key in a later batch can easily slip behind a larger key from an earlier batch, violating the global ORDER BY. That is why we must short‑circuit and refuse to push the sort beneath any anti‑join: without a final merge or a full sort after filtering, the concatenation of filtered, sorted batches cannot preserve the overall ordering guarantee. **Why Anti‑Joins Break Global Ordering: A Batch Example** Suppose we split the probe-side input into two batches, each sorted by `a`, and then simply concatenate their outputs after filtering. Here’s how the global order can be lost: ```sql -- Batch 1 input (sorted by a) a | b | c --+---+--- 1 | x | … 3 | x | … -- Batch 2 input (sorted by a) a | b | c --+---+--- 2 | y | … 4 | y | … -- Assume the anti‑join filter keeps all rows (no actual filtering for simplicity). -- So each batch output remains: -- Batch 1 → [ (1, x, …), (3, x, …) ] -- Batch 2 → [ (2, y, …), (4, y, …) ] -- Naively concatenating Batch 1 then Batch 2: 1 | x | … 3 | x | … ← (3 > 2) but appears before the row with a=2 2 | y | … 4 | y | … -- Final sequence: [1, 3, 2, 4], which is *not* globally sorted by `a`. -- In contrast, a true global merge of two sorted streams: -- merge([1,3], [2,4]) → [1,2,3,4] -- would preserve order. 1. Sorting each batch in isolation only guarantees local order. 2. An anti‑join may filter out rows, but even without filtering, concatenating sorted batches breaks global ordering. 3. Therefore, we must prevent pushing a sort below an anti‑join unless we perform a final merge or full sort after filtering. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org