rluvaton commented on code in PR #15700: URL: https://github.com/apache/datafusion/pull/15700#discussion_r2213289055
########## datafusion/physical-plan/src/aggregates/row_hash.rs: ########## @@ -1067,14 +1074,13 @@ impl GroupedHashAggregateStream { sort_batch(&batch, &expr, None) })), ))); - for spill in self.spill_state.spills.drain(..) { - let stream = self.spill_state.spill_manager.read_spill_as_stream(spill)?; - streams.push(stream); - } + self.spill_state.is_stream_merging = true; self.input = StreamingMergeBuilder::new() .with_streams(streams) .with_schema(schema) + .with_spill_manager(self.spill_state.spill_manager.clone()) + .with_sorted_spill_files(std::mem::take(&mut self.spill_state.spills)) Review Comment: > The issue is, without special handling, it's possible that in-mem batches will take most of the available memory budget, and leave only a very small memory part for multi-pass spilling to continue. This can cause slow downs or even prevent some cases to finish. Actually this case is already handled, so if I have 4 in memory streams and 5 spill files. and I don't have enough memory to merge with any spill file we will merge the in memory streams and spill to disk and next iteration will be the same as if we got 6 spill files. this is the same behavior as spilling before but optimized See: https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L178-L203 we first `take` the `in_memory_streams` (and don't add them back anymore) https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L180 and we try to get spill files to merge but the required number of spill files needed is 2 - `in_memory_stream.len()` https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L185-L186 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org