berkaysynnada commented on issue #14287: URL: https://github.com/apache/datafusion/issues/14287#issuecomment-2614018742
We have designed a poll-based repartition mechanism that polls its input whenever any of its output partitions is polled. This deviates from the round-robin pattern and instead distributes the workload according to consumer demand: a batch is sent to whichever partition has finished its computation and is ready to process more data. The mechanism also exhibits prefetching behavior, similar to SortPreservingMerge, although the prefetching is limited to a single batch (or potentially up to the number of partitions; this will be finalized based on benchmark results). The implementation is underway, and the initial benchmark results are very promising. In theory, this approach should perform better especially when the producer outpaces the consumers, which I believe is the case @westonpace describes in the issue description. @Weijun-H is working on the implementation, and I hope we can open the PR in the coming weeks once it is in a robust and optimized state.
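To make the demand-driven idea concrete, here is a minimal sketch using only the Rust standard library. It is not the DataFusion implementation: `PullRepartition`, `Batch`, and `run` are hypothetical names, OS threads stand in for async output streams, and prefetching is left out entirely. It only shows that when consumers pull from a shared input as they become ready, a slow partition does not hold up batches that a faster one could take.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Stand-in for a record batch; here just an id (hypothetical type).
type Batch = u32;

/// Shared input that output partitions pull from on demand (hypothetical type).
struct PullRepartition {
    input: Mutex<VecDeque<Batch>>,
}

impl PullRepartition {
    fn new(batches: impl IntoIterator<Item = Batch>) -> Arc<Self> {
        Arc::new(Self {
            input: Mutex::new(batches.into_iter().collect()),
        })
    }

    /// Called by an output partition when it is ready for more work.
    /// Returns `None` once the input is exhausted. The lock is released
    /// before this function returns, so consumers never hold it while working.
    fn poll_next(&self) -> Option<Batch> {
        self.input.lock().unwrap().pop_front()
    }
}

/// Spawns one consumer thread per entry in `cost_ms`; consumer `i` sleeps
/// `cost_ms[i]` milliseconds per batch to simulate work. Returns the batches
/// each consumer processed, in consumer order.
fn run(batches: u32, cost_ms: &[u64]) -> Vec<Vec<Batch>> {
    let repartition = PullRepartition::new(0..batches);
    let handles: Vec<_> = cost_ms
        .iter()
        .map(|&cost| {
            let repartition = Arc::clone(&repartition);
            thread::spawn(move || {
                let mut seen = Vec::new();
                // Pull the next batch only after finishing the previous one.
                while let Some(batch) = repartition.poll_next() {
                    thread::sleep(Duration::from_millis(cost));
                    seen.push(batch);
                }
                seen
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `run(20, &[1, 5])` hands out each of the 20 batches exactly once; the consumer with the lower per-batch cost will typically end up with more of them, whereas a round-robin split would give each consumer 10 regardless of speed.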