korowa commented on issue #14238: URL: https://github.com/apache/datafusion/issues/14238#issuecomment-2611636777
I'd suggest renaming the "splitting" part of the problem to "restricting": if a join is able to produce a batch that needs to be split (even if this batch exists only internally), then that may already be an issue, and it can hurt in some specific cases. I also think that `BatchSplitter` in its current implementation (where it already has a batch to split) does not solve the problem but only masks it. In addition, if the batches to be split are large enough to start causing memory issues, `BatchSplitter` doesn't seem to be able to help.

In this case (for splitting / restricting), I think what @berkaysynnada suggests:

> to make all join operators capable of performing both coalescing and splitting in a built-in manner

is a better fit: each join operator should be able to limit / restrict its internally created record batches to prevent excessive accumulation of data in memory (or at least, if that is required, to track them via memory reservations).

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org