zhuqi-lucas commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2804362823
> I think `concat` followed by `sort` is slower in some cases because > > * Concat involves copying the entire batch (rather than only the keys to be sorted) > * `sort_batch_stream` Can be slower as `lexsort_to_indices` is in cases with many columns slower than the Row Format > > I think for `ExternalSorter` we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help). > > The core improvements that I think are important: > > * Minimizing copying of the input batches to one (only once for the output) > * Sorting once on the input batches rather than sort followed by merge > * A good heuristic on when to switch from `lexsort_to_indices`-like sorting to RowConverter + sorting. Good explain. > I think for ExternalSorter we don't want any additional parallelism as the sort is already executed per partition (so additional parallelism is likely to hurt rather than help). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org