Kontinuation commented on PR #14644:
URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2660794608

   > Edge case: let's say the input is a deduplicated `StringViewArray` (like a 
10k-row batch with only 100 distinct values, where the payload content is 
stored without duplication and the array elements just reference ranges of the 
payload). After converting to the `Row` format, every row will be materialized, 
so the `Row` format will have a 100X expansion. I think we need some mechanism 
to deal with this kind of edge case; perhaps this also applies to the 
dictionary representation.
   
   I agree that the current implementation uses a very rough estimation, and it 
could be way off from the actual memory consumption.
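
   To make the quoted edge case concrete, here is a rough back-of-the-envelope 
sketch in plain Rust. It does not use the arrow crate: `DedupBatch` and its 
methods are hypothetical names for illustration, and the only Arrow fact it 
relies on is that each view in a `StringViewArray` occupies 16 bytes. The 
materialized size is a lower bound, since a real row encoding adds per-value 
overhead.

```rust
/// Size in bytes of one view entry in an Arrow `StringViewArray` (128 bits).
const VIEW_BYTES: usize = 16;

/// Hypothetical model of a deduplicated string-view batch: `rows` views that
/// all point into a shared payload of `distinct` values, `value_len` bytes each.
struct DedupBatch {
    rows: usize,
    distinct: usize,
    value_len: usize,
}

impl DedupBatch {
    /// Bytes held by the columnar form: one view per row plus the shared payload.
    fn columnar_bytes(&self) -> usize {
        self.rows * VIEW_BYTES + self.distinct * self.value_len
    }

    /// Lower bound on the row-encoded size: every row carries its own copy of
    /// the value, so the deduplication is lost.
    fn materialized_bytes(&self) -> usize {
        self.rows * self.value_len
    }
}
```

   With `DedupBatch { rows: 10_000, distinct: 100, value_len: 1024 }`, the 
columnar form holds 262,400 bytes while the materialized rows need at least 
10,240,000 bytes, which is 100X the size of the shared payload.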
   
   A better approach is to sort the batch and generate its row representation 
right after ingesting it; then we would know the exact size of the sorted 
batches and their row representations held in memory. The merge phase for 
handling spilling could simply take over this data and perform merging without 
reserving more memory. However, this conflicts with some optimizations we did 
in the past:
   
   * https://github.com/apache/datafusion/pull/6308: sorts are performed 
concurrently right before merging.
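
   The sort-on-ingest idea could be sketched roughly as follows. This is a 
stdlib-only illustration, not DataFusion code: `SortedRun`, `Tracker`, and 
their methods are hypothetical, a single `u64` sort key stands in for a real 
batch, and big-endian bytes stand in for the row encoding.

```rust
/// Hypothetical sorted run: sorted keys plus their byte-comparable row encoding.
struct SortedRun {
    keys: Vec<u64>,
    rows: Vec<[u8; 8]>,
}

impl SortedRun {
    /// Sort a batch as soon as it is ingested and build its row encoding.
    fn ingest(mut keys: Vec<u64>) -> Self {
        keys.sort_unstable();
        // Big-endian bytes of unsigned integers compare in the same order
        // as the integers themselves.
        let rows: Vec<[u8; 8]> = keys.iter().map(|k| k.to_be_bytes()).collect();
        SortedRun { keys, rows }
    }

    /// Bytes occupied by the elements of this run, known right after
    /// ingestion (spare `Vec` capacity is ignored in this sketch).
    fn mem_bytes(&self) -> usize {
        self.keys.len() * std::mem::size_of::<u64>() + self.rows.len() * 8
    }
}

/// Hypothetical memory tracker: reserve the measured size of each run,
/// or report that the caller has to spill.
struct Tracker {
    limit: usize,
    used: usize,
}

impl Tracker {
    /// Returns `true` and records the reservation if `bytes` still fits.
    fn try_reserve(&mut self, bytes: usize) -> bool {
        if self.used + bytes > self.limit {
            return false;
        }
        self.used += bytes;
        true
    }
}
```

   Because each run's size is measured rather than estimated, the merge phase 
in this sketch would not need to reserve anything beyond what `try_reserve` has 
already accounted for.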
   
   > For point 4, does the memory budget for holding merged batches come from 
`sort_spill_reservation_bytes`? Small sorted runs and converted rows should 
have taken up all of the memory at this stage.
   
   Yes. It may come from `sort_spill_reservation_bytes`, or from the memory 
saved by the fetch option.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
