andygrove opened a new pull request, #20478: URL: https://github.com/apache/datafusion/pull/20478
## Which issue does this PR close?

N/A - performance optimization

## Rationale for this change

In the SMJ tight loop (`join_partial`), `num_unfrozen_pairs()` was called **twice per iteration**: once in the loop guard and once inside `append_output_pair`. This method iterates all chunks in `output_indices` and sums their lengths, which is O(num_chunks). Over a full batch of `batch_size` iterations, this makes the inner loop O(batch_size * num_chunks) instead of O(batch_size).

## What changes are included in this PR?

Add a `num_output_rows` field to `StreamedBatch` that is incremented on each append and reset on freeze, replacing the O(n) summation with an O(1) field read.

- Added `num_output_rows: usize` field to `StreamedBatch`, initialized to `0`
- Increment `num_output_rows` in `append_output_pair()` after each append
- `num_output_rows()` now returns the cached field directly
- Reset to `0` in `freeze_streamed()` when `output_indices` is cleared
- Removed the `num_unfrozen_pairs` parameter from `append_output_pair()` since it can now read `self.num_output_rows` directly

## Are these changes tested?

Yes. All 48 existing `sort_merge_join` tests pass. This is a pure refactor of an internal counter with no behavioral change.

## Are there any user-facing changes?

No.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
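
For readers unfamiliar with the pattern, the cached-counter idea described in this PR can be sketched roughly as follows. This is a simplified illustration, not the actual DataFusion code: the `Chunk` type, the `new_chunk` flag, the `freeze` method, and `num_output_rows_by_summing` are hypothetical stand-ins, and the real `StreamedBatch` carries more state than is shown here.

```rust
/// Hypothetical, simplified stand-in for one chunk of output indices.
struct Chunk {
    streamed_indices: Vec<u64>,
}

/// Simplified stand-in for `StreamedBatch`; only the counter logic is modeled.
struct StreamedBatch {
    output_indices: Vec<Chunk>,
    /// Cached total of `streamed_indices.len()` across all chunks.
    num_output_rows: usize,
}

impl StreamedBatch {
    fn new() -> Self {
        Self {
            output_indices: Vec::new(),
            num_output_rows: 0,
        }
    }

    /// Old approach (hypothetical name): walk every chunk on each call,
    /// which costs O(num_chunks).
    fn num_output_rows_by_summing(&self) -> usize {
        self.output_indices
            .iter()
            .map(|chunk| chunk.streamed_indices.len())
            .sum()
    }

    /// New approach: an O(1) read of the cached field.
    fn num_output_rows(&self) -> usize {
        self.num_output_rows
    }

    /// Append one streamed index. A new chunk is started when `new_chunk`
    /// is true or when no chunk exists yet (an assumption of this sketch).
    fn append_output_pair(&mut self, streamed_idx: u64, new_chunk: bool) {
        if new_chunk || self.output_indices.is_empty() {
            self.output_indices.push(Chunk {
                streamed_indices: Vec::new(),
            });
        }
        // Safe to unwrap: the vector is non-empty after the check above.
        self.output_indices
            .last_mut()
            .unwrap()
            .streamed_indices
            .push(streamed_idx);
        // Keep the cached counter in sync with the appended row.
        self.num_output_rows += 1;
    }

    /// Stand-in for the reset performed in `freeze_streamed()`:
    /// clearing the chunks must also reset the counter.
    fn freeze(&mut self) {
        self.output_indices.clear();
        self.num_output_rows = 0;
    }
}
```

The invariant to preserve is that the cached field always equals the sum the old method would have computed, so every code path that mutates `output_indices` has to update the counter as well.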
