ding-young commented on PR #15610:
URL: https://github.com/apache/datafusion/pull/15610#issuecomment-3031719770

   This PR has a significant strength in that it works reliably even under a 
fairly conservative memory limit, which is impressive. I also learned a lot 
while reviewing it :). However, I do have the following concern about the 
approach based on user-configured `MAX_SPILL_MERGE_DEGREE`:
   
   I don’t think adding a user-configurable merge degree is inherently a bad 
idea. The problem is that the appropriate value for `MAX_SPILL_MERGE_DEGREE` 
can vary significantly depending on the query being executed and the 
characteristics of the spilled `RecordBatches`. For example, merging many thin 
batches is very different from merging a few wide batches in terms of memory 
consumption, and the optimal degree will differ accordingly.
   
   Unless the degree is defined in a way that consistently reflects some notion 
of aggressiveness or is based on actual memory consumption (e.g., in bytes), it 
would be hard to offer meaningful tuning guidance to users. In the end, as 
@rluvaton suggested, I think we do need to estimate something like 
`max_memory_bytes` per batch—even if the estimate isn’t very reliable.
   
   Though relying on memory estimation may not be very reliable, it seems 
unavoidable in the context of multi-pass merge to estimate the memory size of 
the RecordBatches to be read from each spill file. This is because we need at 
least a rough estimate to decide how many spill files (or streams) can be read 
and merged simultaneously, so that we could automatically perform multi-pass 
merge without user's manual debugging.
   
   What do you think about this? :thinking: 
   
   Of course, I’m also currently trying to reproduce the case you pointed out 
in #15700, so if I discover a significant issue there, my opinion may very well 
change. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to