EmilyMatt commented on code in PR #22945:
URL: https://github.com/apache/datafusion/pull/22945#discussion_r3412991349


##########
datafusion/physical-plan/src/sorts/multi_level_merge.rs:
##########
@@ -427,8 +479,111 @@ impl MultiLevelMergeBuilder {
             .drain(..number_of_spills_to_read_for_current_phase)
             .collect::<Vec<_>>();
 
-        Ok((spills, buffer_len))
+        Ok(SpillFilesToMerge::Ready(spills, buffer_len))
+    }
+
+    /// Re-spill the spill file at `index` with half its batch size, putting 
it back
+    /// at the same position. We read the file back and re-spill it through 
the normal
+    /// spill API (which owns batch layout).
+    /// Slicing each batch in two halves the largest written batch,
+    /// which lowers the per-stream merge reservation so the
+    /// next attempt can seat both streams. One stream's worth of memory is 
reserved
+    /// for the duration and freed afterwards. Makes the merge resilient to 
skew.
+    async fn split_spill_file_in_half(&mut self, index: usize) -> Result<()> {
+        let target = self.sorted_spill_files.remove(index);

Review Comment:
   Applied against my better judgment, as this is both negligible compared to 
the work done here and makes the code more complex.
   (and besides the point may even be optimized out by the compiler, but that 
is untested)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to