Re: [PR] feat: Re-spill sort stream if unable to reserve for 2 streams [datafusion]

via GitHub Mon, 15 Jun 2026 03:18:42 -0700


rluvaton commented on code in PR #22945:
URL: https://github.com/apache/datafusion/pull/22945#discussion_r3412692644



##########
datafusion/physical-plan/src/sorts/multi_level_merge.rs:
##########


Review Comment:
   Another optimization idea (not for this PR):
   
   Scenario:
   
   Let's say you have 8 files:
   ```
   M,L,M,M,M,M,M,M
   ```
   (`M` is a file with a medium max batch size, `L` is a file with a large max batch size)
   
   In your implementation, you see that you can't merge the first 2 files because `L` is very large, so you split the `L` batch size in half and try again.
   After the split you have:
   ```
   M,L_Split,M,M,M,M,M,M
   ```
   
   Now you can merge the first 4 streams, `M,L_Split,M,M`,
   and continue as usual.
   
   But what you did was change the batch size for everything (which is required so you don't go back to the same large batch in the worst-case scenario).
   
   This also harms the performance of the entire multi-level merge, since you now spill and merge in smaller batches.
   
   My idea is that you can delay the split of `L` until last, so that all the sort and spill files before it use the old batch size and only the last one uses the smaller batch size. This should improve performance.
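   
   A rough sketch of the reordering, to make the idea concrete. This is not DataFusion code: `SpillFile`, `medium`, `defer_large_files`, `streams_that_fit`, the `large_threshold` parameter, and the byte sizes are all hypothetical, and the greedy "how many streams fit" check is only a stand-in for the real memory reservation logic.
   
   ```rust
   /// Hypothetical stand-in for a sorted spill file (not a DataFusion type).
   #[derive(Debug, Clone, PartialEq)]
   struct SpillFile {
       name: &'static str,
       /// Memory needed to hold the largest batch in this file.
       max_batch_bytes: usize,
   }

   /// Helper for building an `M` file with an arbitrary, made-up size.
   fn medium(name: &'static str) -> SpillFile {
       SpillFile { name, max_batch_bytes: 10 }
   }

   /// Move files whose largest batch exceeds `large_threshold` to the back,
   /// keeping the relative order inside each group, so they are merged last.
   fn defer_large_files(files: Vec<SpillFile>, large_threshold: usize) -> Vec<SpillFile> {
       let (mut normal, large): (Vec<SpillFile>, Vec<SpillFile>) = files
           .into_iter()
           .partition(|f| f.max_batch_bytes <= large_threshold);
       normal.extend(large);
       normal
   }

   /// Greedily count how many leading files fit into `budget` bytes,
   /// assuming one max-size batch is reserved per stream.
   fn streams_that_fit(files: &[SpillFile], budget: usize) -> usize {
       let mut used = 0usize;
       files
           .iter()
           .take_while(|f| {
               used += f.max_batch_bytes;
               used <= budget
           })
           .count()
   }
   ```
   
   With the order `M,L,M,M,M,M,M,M`, sizes of 10 for `M` and 40 for `L`, and a budget of 45, only one stream fits at the front, so the merge cannot start without shrinking the batch size. After `defer_large_files` with a threshold of 20, four `M` streams fit at the original batch size, and `L` only needs to be split when it is finally reached.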



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

