rluvaton commented on code in PR #22945: URL: https://github.com/apache/datafusion/pull/22945#discussion_r3412692644
##########
datafusion/physical-plan/src/sorts/multi_level_merge.rs:
##########

Review Comment:
Another optimization idea (not for this PR):

Scenario: say you have 8 files:
```
M,L,M,M,M,M,M,M
```
(`M` is a file with a medium max batch size, `L` is a file with a large max batch size.)

In your implementation, you see that you can't merge the first 2 files because `L` is very large, so you split the `L` batch size in half and try again. After the split you have:
```
M,L_Split,M,M,M,M,M,M
```
and now you can merge the first 4 streams, `M,L_Split,M,M`, and continue as usual.

But what you did was change the batch size for everything (which is required so you don't go back to the same large batch in the worst-case scenario), and this also harms the performance of the entire multi-level merge, since you now spill and merge in smaller batches.

My idea is that you can delay the split of `L` until last, so that all the sort and spill files before it use the old batch size and only the last one uses the smaller batch size. This will improve performance.
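To make the suggestion concrete, here is a rough sketch of the reordering step in isolation. It is not the code in `multi_level_merge.rs`: `SpillFile`, `mergeable_prefix`, `defer_large_files`, the byte budget, and the "large" threshold are all hypothetical names and values chosen for illustration, and it assumes the merge picks a leading run of files whose largest batches fit in a memory budget together.

```rust
/// Hypothetical stand-in for a spill file; only its largest batch matters here.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SpillFile {
    id: usize,
    max_batch_bytes: usize,
}

/// Counts how many leading files can be merged together, assuming each open
/// stream must reserve memory for its largest batch and the total must stay
/// within `budget_bytes`.
fn mergeable_prefix(files: &[SpillFile], budget_bytes: usize) -> usize {
    let mut used = 0usize;
    let mut count = 0usize;
    for file in files {
        match used.checked_add(file.max_batch_bytes) {
            Some(total) if total <= budget_bytes => {
                used = total;
                count += 1;
            }
            // Either the budget is exceeded or the sum overflowed: stop here.
            _ => break,
        }
    }
    count
}

/// Moves every file whose largest batch exceeds `large_threshold_bytes` to the
/// end of the list, keeping the relative order within each group. The intent
/// is that earlier merge passes keep the original batch size and only the
/// final pass, which includes the large files, needs the reduced one.
fn defer_large_files(files: &[SpillFile], large_threshold_bytes: usize) -> Vec<SpillFile> {
    let (regular, large): (Vec<SpillFile>, Vec<SpillFile>) = files
        .iter()
        .copied()
        .partition(|file| file.max_batch_bytes <= large_threshold_bytes);
    regular.into_iter().chain(large).collect()
}
```

With the `M,L,M,M,M,M,M,M` layout above, taking `M` as 10 bytes, `L` as 100 bytes, and a budget of 40 bytes (made-up numbers), only one file fits in the original order, which is what forces the split. After deferring `L`, four `M` files fit without touching the batch size. Whether reordering the spill files is acceptable in the real implementation is something I have not checked.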
