EmilyMatt commented on code in PR #22945:
URL: https://github.com/apache/datafusion/pull/22945#discussion_r3412991349
##########
datafusion/physical-plan/src/sorts/multi_level_merge.rs:
##########
@@ -427,8 +479,111 @@ impl MultiLevelMergeBuilder {
.drain(..number_of_spills_to_read_for_current_phase)
.collect::<Vec<_>>();
- Ok((spills, buffer_len))
+ Ok(SpillFilesToMerge::Ready(spills, buffer_len))
+ }
+
+ /// Re-spill the spill file at `index` with half its batch size, putting
it back
+ /// at the same position. We read the file back and re-spill it through
the normal
+ /// spill API (which owns batch layout).
+ /// Slicing each batch in two halves the largest written batch,
+ /// which lowers the per-stream merge reservation so the
+ /// next attempt can seat both streams. One stream's worth of memory is
reserved
+ /// for the duration and freed afterwards. Makes the merge resilient to
skew.
+ async fn split_spill_file_in_half(&mut self, index: usize) -> Result<()> {
+ let target = self.sorted_spill_files.remove(index);
Review Comment:
Applied against my better judgment, as this is both negligible compared to
the work done here and makes the code more complex.
(and besides the point may even be optimized out by the compiler, but that
is untested)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]