ding-young commented on PR #16192: URL: https://github.com/apache/datafusion/pull/16192#issuecomment-2936483760
@2010YOUY01 Hi, I've been struggling a bit with tracking peak memory in the SPM step, and I was wondering if I could ask for some help.

### 1. Can we add the memory for converted (row) batches to the previous `peak_mem_used`?

Since `ExternalSorter` creates a `SortPreservingMergeStream` for the second step (SPM), I tried updating the peak memory metric inside `maybe_poll_stream` in `SortPreservingMergeStream` (which internally calls `poll_next`, where `convert_batch` is done, and pushes batches into a `BatchBuilder`).

But here's my concern: if we keep adding the new reservation from this second step to the previous peak memory value, we might be overestimating. By the time the second step runs, some batches from the first step might have already been dropped, so summing the two could inflate the reported peak memory.

I tried printing the total reserved size from the global memory pool manually (with tons of `println`) during execution. There did seem to be a difference between the first and second steps, but it did not seem as large as the total size of all converted batches combined.

### 2. Parent operator's memory reservation

Also, when the parent operator (e.g., `SortPreservingMergeExec`) executes, the reservation created by the earlier `SortExec` is not yet released. In this case, should `SortPreservingMergeExec` only track the peak memory of its own reservation?

And please let me know if I've misunderstood when the reservation is supposed to be dropped. Maybe that's where my confusion is coming from.
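To make the concern in question 1 concrete, here is a minimal sketch with made-up sizes. `PeakTracker` and `compare_estimates` are hypothetical names for illustration only; they are not DataFusion APIs and do not reflect how `MemoryReservation` is implemented. The sketch assumes two batches from the first step are dropped before the merge step reserves memory for row-converted batches.

```rust
/// Hypothetical tracker for illustration only; NOT DataFusion's memory pool API.
#[derive(Default)]
struct PeakTracker {
    current: usize,
    peak: usize,
}

impl PeakTracker {
    /// Record a new reservation and update the high-water mark.
    fn grow(&mut self, bytes: usize) {
        self.current += bytes;
        self.peak = self.peak.max(self.current);
    }

    /// Record released memory; the peak is left unchanged.
    fn shrink(&mut self, bytes: usize) {
        self.current = self.current.saturating_sub(bytes);
    }
}

/// Replays a made-up allocation trace and returns
/// (sum of per-step peaks, high-water mark of the combined reservation).
fn compare_estimates() -> (usize, usize) {
    let mut overall = PeakTracker::default(); // both steps together
    let mut step1 = PeakTracker::default(); // in-memory sort buffers
    let mut step2 = PeakTracker::default(); // merge (row-converted batches)

    // Step 1: buffer three 100-byte batches.
    for _ in 0..3 {
        overall.grow(100);
        step1.grow(100);
    }

    // Assumption: two of those batches are dropped before the merge starts.
    overall.shrink(200);
    step1.shrink(200);

    // Step 2: the merge reserves 150 bytes for converted batches.
    overall.grow(150);
    step2.grow(150);

    (step1.peak + step2.peak, overall.peak)
}
```

With these numbers, `compare_estimates()` returns `(450, 300)`: adding the second step's peak to the first step's peak reports 450 bytes, while the combined reservation never exceeded 300 bytes at any one time.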
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org