Re: [PR] feat: add multi level merge sort that will always fit in memory [datafusion]

via GitHub Thu, 17 Jul 2025 06:02:22 -0700


rluvaton commented on code in PR #15700:
URL: https://github.com/apache/datafusion/pull/15700#discussion_r2213289055



##########
datafusion/physical-plan/src/aggregates/row_hash.rs:
##########
@@ -1067,14 +1074,13 @@ impl GroupedHashAggregateStream {
                 sort_batch(&batch, &expr, None)
             })),
         )));
-        for spill in self.spill_state.spills.drain(..) {
-            let stream = 
self.spill_state.spill_manager.read_spill_as_stream(spill)?;
-            streams.push(stream);
-        }
+
         self.spill_state.is_stream_merging = true;
         self.input = StreamingMergeBuilder::new()
             .with_streams(streams)
             .with_schema(schema)
+            .with_spill_manager(self.spill_state.spill_manager.clone())
+            .with_sorted_spill_files(std::mem::take(&mut 
self.spill_state.spills))

Review Comment:
   > The issue is, without special handling, it's possible that in-mem batches 
will take most of the available memory budget, and leave only a very small 
memory part for multi-pass spilling to continue. This can cause slow downs or 
even prevent some cases to finish.
   
   Actually this case is already handled, so if I have 4 in memory streams and 
5 spill files.
   
   and I don't have enough memory to merge with any spill file we will merge 
the in memory streams and spill to disk and next iteration will be the same as 
if we got 6 spill files.
   
   this is the same behavior as spilling before but optimized
   
   See:
   
https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L178-L203
   
   
   we first `take` the `in_memory_streams` (and don't add them back anymore) 
   
https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L180
   
   and we try to get spill files to merge but the required number of spill 
files needed is 2 - `in_memory_stream.len()`
   
https://github.com/rluvaton/datafusion/blob/af6b5c52afe316127cd007cd8464d07c3038a0f9/datafusion/physical-plan/src/sorts/multi_level_merge.rs#L185-L186
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: add multi level merge sort that will always fit in memory [datafusion]

Reply via email to