ding-young commented on PR #15700: URL: https://github.com/apache/datafusion/pull/15700#issuecomment-3031592289
I've rebased this branch on the latest main and tested whether the estimated size changes after we load a `RecordBatch` that was compressed with `lz4_frame` back into memory. The result of `get_actually_used_size()` was identical before and after (arrow-ipc's `StreamReader` returns the decoded arrays). Of course, since buffer allocations and copies happen internally during decoding, actual system memory usage (which DataFusion doesn't track) may temporarily be higher. So far I've only tested primitive-type arrays with compression, so I'll run a few more tests and try to reproduce the problematic cases discussed above (a rough sketch of the round-trip check is appended at the end of this comment).

> Hi @adriangb, thanks for raising this point. I'm currently reviewing both this PR and the other cascading merge sort PR (#15610). I'm not taking sides between the two approaches, but I agree that accurately estimating memory consumption is tricky, given the issues discussed above and the fact that compression is now supported in spill files. We may need to think more about whether we can special-case scenarios where the memory size changes after spilling and reloading, or perhaps add some kind of fallback logic to handle such situations more gracefully.
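For reference, here is a minimal sketch (not the PR's actual test code) of the round-trip check described above: write a primitive-typed `RecordBatch` to an in-memory IPC stream with `LZ4_FRAME` compression, read it back with `StreamReader`, and compare Arrow's own memory-size estimate. It uses arrow-rs's `get_array_memory_size()`; the PR's `get_actually_used_size()` helper is assumed to perform a similar walk over the array buffers. Requires the `ipc_compression` feature of the arrow crate.

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::{IpcWriteOptions, StreamWriter};
use arrow::ipc::CompressionType;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build a simple primitive-typed batch and record its in-memory size.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from_iter_values(0i64..8192))],
    )?;
    let before = batch.get_array_memory_size();

    // "Spill": encode the batch into an in-memory IPC stream with LZ4_FRAME compression.
    let options = IpcWriteOptions::default()
        .try_with_compression(Some(CompressionType::LZ4_FRAME))?;
    let mut buf = Vec::new();
    {
        let mut writer = StreamWriter::try_new_with_options(&mut buf, &schema, options)?;
        writer.write(&batch)?;
        writer.finish()?;
    }

    // "Reload": StreamReader hands back already-decompressed arrays.
    let mut reader = StreamReader::try_new(buf.as_slice(), None)?;
    let reloaded = reader.next().expect("expected one batch")?;
    let after = reloaded.get_array_memory_size();

    println!("before spill: {before} bytes, after reload: {after} bytes");
    Ok(())
}
```

In this simple case the two sizes match, which is consistent with the observation above; the open question is whether more complex types (e.g. string/dictionary arrays) behave the same way after a spill round trip.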