alamb commented on code in PR #15469: URL: https://github.com/apache/datafusion/pull/15469#discussion_r2020125627
########## datafusion/physical-plan/src/sorts/sort.rs: ########## @@ -397,16 +393,13 @@ impl ExternalSorter { self.metrics.spill_metrics.spill_file_count.value() } - /// When calling, all `in_mem_batches` must be sorted (*), and then all of them will - /// be appended to the in-progress spill file. - /// - /// (*) 'Sorted' here means globally sorted for all buffered batches when the - /// memory limit is reached, instead of partially sorted within the batch. - async fn spill_append(&mut self) -> Result<()> { - assert!(self.in_mem_batches_sorted); - - // we could always get a chance to free some memory as long as we are holding some - if self.in_mem_batches.is_empty() { + /// Appending globally sorted batches to the in-progress spill file, and clears + /// the `globally_sorted_batches` (also its memory reservation) afterwards. + async fn consume_and_spill_append( + &mut self, + globally_sorted_batches: &mut Vec<RecordBatch>, Review Comment: thank you for the name here -- much better ########## datafusion/physical-plan/src/sorts/sort.rs: ########## @@ -216,10 +216,8 @@ struct ExternalSorter { // STATE BUFFERS: // Fields that hold intermediate data during sorting // ======================================================================== - /// Potentially unsorted in memory buffer + /// Unsorted input batches stored in the memory buffer Review Comment: Removing this field is an improvement 👍 thanks @2010YOUY01 and @comphead Given there are still three fields whose relationship is tied, I wonder if it would improve the code if we encoded that relationship in an actual rust `enum` -- for example ```rust enum SortState { /// All data is in memory Memory { batches: Vec<RecordBatch>, }, /// intermediate data is spilling to disk Spilling { batches: Vec<RecordBatch>, in_progress_spill_file: InProgressSpillFile, } ... } ``` 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org