Kontinuation commented on PR #14644: URL: https://github.com/apache/datafusion/pull/14644#issuecomment-2660794608
> Edge case: let's say the input is a deduplicated `StringViewArray` (like a 10k-row batch with only 100 distinct values, where the payload content is stored without duplication and the array elements merely reference ranges of the payload). After converting to the `Row` format, every row will be materialized, so the `Row` format will have a ~100x expansion. I think we need some mechanism to deal with this kind of edge case; perhaps this also applies to the dictionary representation.

I agree that the current implementation uses a very rough estimation, and it could be way off from the actual memory consumption. A better approach is to sort the batch and generate its row representation right after ingesting it; then we would know the exact size of the sorted batches and their row representations held in memory. The merge phase for handling spilling could simply take over these data and perform merging without reserving more memory. However, this conflicts with an optimization we did in the past:

* https://github.com/apache/datafusion/pull/6308: sorts are performed concurrently right before merging.

> For point 4, does the memory budget to hold merged batches come from `sort_spill_reservation_bytes`? Small sorted runs and converted rows should have taken up all memory space at this stage.

Yes. It may come from `sort_spill_reservation_bytes`, or from the memory freed because of the fetch option.
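To make the expansion concrete, here is a back-of-the-envelope sketch (plain Python, not DataFusion or Arrow code; the row count, distinct-value count, and payload length are assumptions chosen to mirror the example above). A view array pays for each distinct payload once plus a fixed-size view struct per element, while the `Row` format materializes the value for every row:

```python
# Hypothetical sizes illustrating the deduplicated-StringViewArray edge case.
ROWS = 10_000       # rows in the batch (from the example above)
DISTINCT = 100      # distinct string values (from the example above)
VALUE_LEN = 200     # assumed bytes per distinct string payload
VIEW_SIZE = 16      # bytes per element view struct in Arrow's StringView layout

# Deduplicated view array: payload stored once, plus one view per row.
view_array_bytes = DISTINCT * VALUE_LEN + ROWS * VIEW_SIZE

# Row format: every row carries its own materialized copy of the value.
row_format_bytes = ROWS * VALUE_LEN

expansion = row_format_bytes / view_array_bytes
print(f"view array: {view_array_bytes} B, row format: {row_format_bytes} B, "
      f"~{expansion:.1f}x expansion")
```

With these numbers the expansion is roughly 11x; as the payload length grows relative to the 16-byte view struct, it approaches the ROWS/DISTINCT ratio of 100x, which is why a size estimate taken from the view array alone can be badly off.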