ding-young commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-3031719770
This PR has a significant strength in that it works reliably even under a fairly conservative memory limit, which is impressive. I also learned a lot while reviewing it :). However, I do have the following concern about the approach based on user-configured `MAX_SPILL_MERGE_DEGREE`: I don’t think adding a user-configurable merge degree is inherently a bad idea. The problem is that the appropriate value for `MAX_SPILL_MERGE_DEGREE` can vary significantly depending on the query being executed and the characteristics of the spilled `RecordBatches`. For example, merging many thin batches is very different from merging a few wide batches in terms of memory consumption, and the optimal degree will differ accordingly. Unless the degree is defined in a way that consistently reflects some notion of aggressiveness or is based on actual memory consumption (e.g., in bytes), it would be hard to offer meaningful tuning guidance to users. In the end, as @rluvaton suggested, I think we do need to estimate something like `max_memory_bytes` per batch—even if the estimate isn’t very reliable. Though relying on memory estimation may not be very reliable, it seems unavoidable in the context of multi-pass merge to estimate the memory size of the RecordBatches to be read from each spill file. This is because we need at least a rough estimate to decide how many spill files (or streams) can be read and merged simultaneously, so that we could automatically perform multi-pass merge without user's manual debugging. What do you think about this? :thinking: Of course, I’m also currently trying to reproduce the case you pointed out in #15700, so if I discover a significant issue there, my opinion may very well change. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org