kosiew opened a new pull request, #21068:
URL: https://github.com/apache/datafusion/pull/21068

   
   ## Which issue does this PR close?
   
   * Closes #20492.
   
   ## Rationale for this change
   
   `HashJoinExec` currently continues polling and consuming the probe side even 
after the build side has completed with zero rows.
   
   For join types whose output is guaranteed to be empty when the build side is 
empty, this work is unnecessary. In practice, it can trigger large avoidable 
scans and extra compute despite producing no output. This is especially costly 
for cases such as INNER, LEFT, LEFT SEMI, LEFT ANTI, LEFT MARK, and RIGHT SEMI 
joins.
   
   This change makes the stream state machine aware of that condition so 
execution can terminate as soon as the build side is known to be empty and no 
probe rows are needed to determine the final result.
   
   The change also preserves the existing behavior for join types that still 
require probe-side rows even when the build side is empty, such as RIGHT, FULL, 
RIGHT ANTI, and RIGHT MARK joins.
   
   ## What changes are included in this PR?
   
   This PR introduces a small refactor and an execution-path optimization for 
hash join empty-build handling:
   
   * adds `empty_build_side_produces_empty_result(join_type)` in 
`joins/utils.rs` as a shared source of truth for join types that can 
short-circuit when the build side is empty
   * updates `HashJoinStream` to use a new `next_state_after_build_ready(...)` 
helper so that, once build-side collection and any required coordination 
finish, the stream can transition directly to `Completed` instead of always 
entering `FetchProbeBatch`
   * applies the same shared helper inside `build_batch_empty_build_side(...)` 
to keep output semantics aligned with the stream short-circuit logic
   * factors dynamic-filter join setup in tests into a reusable helper, 
reducing duplicated test code
   * adds regression tests covering:
   
     * join types that should not consume the probe side when the build side is 
empty
     * join types that must still consume the probe side because probe rows are 
needed for correct output
     * dynamic-filter completion behavior, including the path that waits for 
partition-bounds reporting before deciding whether probe polling is necessary
   
   ## Are these changes tested?
   
   Yes.
   
   This PR adds targeted async tests covering both the optimized and 
non-optimized cases:
   
   * `join_does_not_consume_probe_when_empty_build_fixes_output`
   * `join_still_consumes_probe_when_empty_build_needs_probe_rows`
   * `test_hash_join_skips_probe_on_empty_build_after_partition_bounds_report`
   
   The tests also verify that errors from the probe side are not observed when 
the join can correctly short-circuit, and that they are still surfaced for join 
types that must continue consuming probe input.
   
   Existing dynamic-filter tests were also cleaned up to use a shared helper 
while preserving the completion checks.
   
   ## Are there any user-facing changes?
   
   Yes, in execution behavior and performance.
   
   For affected hash join types, queries can now stop earlier when the build 
side is empty, avoiding unnecessary probe-side scans and reducing wasted I/O 
and compute. There are no intended API changes.
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to