wirybeaver commented on issue #1359: URL: https://github.com/apache/datafusion-ballista/issues/1359#issuecomment-4474497591
Adding **`SplitPartitionsRule`** as the inverse of `CoalescePartitionsRule` (#1684): #1718.

When upstream stats show that one shuffle partition is far larger than the median, the rule fans that partition out across multiple reader tasks by assigning its file list round-robin, instead of folding small partitions together. It uses the same per-stage invocation, the same alignment-group leaf walk, and the same carrier-slot-on-`ExchangeExec` pattern as #1684, so it is a strict architectural mirror.

**Scope limitation, called out for v1 honesty.** File-list sharding produces `UnknownPartitioning(K')` output, so the rule bails on any stage whose subtree contains a node requiring `HashPartitioned` or `SinglePartition` input (joins, `FinalPartitioned` aggregates). v1 helps only stages where the consumer is distribution-agnostic (`Filter`/`Projection`/`LocalLimit` over a hash exchange). The TPC-H Q2 SF1000 skew that originally motivated this work (#1643) sits behind a `FinalPartitioned` aggregate and is not addressed by v1; v2 (row-range reads plus aggregate-aware plan rewriting) is the path that addresses #1643.

Task doc cross-linked from #1718. Stacked on #1684; once that lands, #1718's diff reduces to the single feat commit on rebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
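For readers who have not opened #1718, here is a rough sketch of the two ideas in the comment: flagging a partition that is far larger than the median, and sharding its file list round-robin across reader tasks. This is not the code in the PR. The function names `skewed_partitions` and `round_robin_split`, the `skew_factor` threshold, and the use of the upper median are all assumptions made for illustration.

```rust
/// Hypothetical helper (not from #1718): returns the indices of partitions
/// whose size exceeds `skew_factor` times the median partition size.
/// The upper median is used for even-length inputs; that choice is an assumption.
fn skewed_partitions(sizes: &[u64], skew_factor: f64) -> Vec<usize> {
    if sizes.is_empty() {
        return Vec::new();
    }
    let mut sorted = sizes.to_vec();
    sorted.sort_unstable();
    let median = sorted[sorted.len() / 2] as f64;
    sizes
        .iter()
        .enumerate()
        .filter(|&(_, &size)| (size as f64) > median * skew_factor)
        .map(|(idx, _)| idx)
        .collect()
}

/// Hypothetical helper (not from #1718): assigns the files of one oversized
/// partition to `readers` tasks in round-robin order. The reader count is
/// clamped to at least 1 and at most the number of files.
fn round_robin_split<T: Clone>(files: &[T], readers: usize) -> Vec<Vec<T>> {
    let readers = readers.max(1).min(files.len().max(1));
    let mut shards = vec![Vec::new(); readers];
    for (i, file) in files.iter().enumerate() {
        shards[i % readers].push(file.clone());
    }
    shards
}
```

Because each shard holds an arbitrary subset of the partition's files, rows with the same key can land in different reader tasks. That is why the comment describes the output as `UnknownPartitioning(K')` and why the rule has to bail when a consumer needs hash-partitioned or single-partition input.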
