Dandandan commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2993451425
Thank you @zhuqi-lucas for experimenting on this. It might be a good idea to do some profiling to see the hot spots. For example, this is the profile I get from the sort-tpch benchmark:

<img width="1728" alt="image" src="https://github.com/user-attachments/assets/88a72c7b-472e-438f-964b-ee43101df958" />

* You can see here that most of the work is concentrated in SortPreservingMerge rather than in the sorts, so in this case making `SortExec` faster probably won't do much to improve total performance. Maybe we can use `target_partitions=1` to concentrate more work in `SortExec` so we can have a look.
* I made a change in https://github.com/apache/arrow-rs/pull/7695 that will probably help quite a bit with the performance of `SortPreservingMergeExec` and `SortExec`. Maybe we can look at where the next hotspots are after this change; I think probably a lot of the time goes to converting to `Row`, comparing byte slices, and doing allocations. Some parts also seem related to us not handling views as efficiently as possible.
  * One example I see is that we call `.gc()`, which currently has a slow implementation.

    <img width="1179" alt="image" src="https://github.com/user-attachments/assets/07e3de93-9b3d-4f63-8d08-c328b8e39f73" />
  * Another one is `compare_unchecked`:

    <img width="1070" alt="image" src="https://github.com/user-attachments/assets/fdddbf69-c176-4adc-9a05-c8e44c23ad3d" />
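
To illustrate why converting to `Row` and comparing byte slices shows up in a sort profile, here is a minimal sketch of the general idea, using only the Rust standard library. It is not the arrow-rs row format or its API: `encode_row` and `sort_by_rows` are hypothetical helpers, and the key is assumed to be a fixed-width `(u32, u16)` pair. Each key is encoded once into big-endian bytes, so that plain byte comparison gives the same order as comparing the tuples.

```rust
// Hypothetical helper (not the arrow-rs API): encode a (u32, u16) sort key
// as big-endian bytes so that byte order matches tuple order.
fn encode_row(a: u32, b: u16) -> [u8; 6] {
    let mut row = [0u8; 6];
    row[..4].copy_from_slice(&a.to_be_bytes());
    row[4..].copy_from_slice(&b.to_be_bytes());
    row
}

// Hypothetical helper: return the indices of `keys` in sorted order.
// The encoding cost (and its allocation) is paid once up front; every
// comparison during the sort is then a lexicographic byte comparison.
fn sort_by_rows(keys: &[(u32, u16)]) -> Vec<usize> {
    let rows: Vec<[u8; 6]> = keys.iter().map(|&(a, b)| encode_row(a, b)).collect();
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    indices.sort_by(|&i, &j| rows[i].cmp(&rows[j]));
    indices
}
```

In this sketch the work splits into the same three pieces named above: the encoding pass, the allocation of the `rows` buffer, and the byte comparisons inside the sort. Real row encodings also have to handle nulls, signed and variable-length values, and sort direction, which this example leaves out.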